talon
// Operate Talon, the Rust infrastructure watchdog daemon that supervises the system-bus worker and monitors k8s. ADR-0159.
| name | talon |
| description | Operate Talon, the Rust infrastructure watchdog daemon that supervises the system-bus worker and monitors k8s. ADR-0159. |
Compiled Rust binary that supervises the system-bus worker AND monitors the full k8s infrastructure stack. ADR-0159.
talon validate # Parse/validate config + services files, print summary JSON
talon --check # Single probe cycle, print results, exit
talon --status # Current state machine position
talon --dry-run # Print loaded config, exit
talon --worker-only # Supervisor only, no infra probes
talon # Full daemon mode (worker + probes + escalation)
| What | Where |
|---|---|
| Binary | ~/.local/bin/talon |
| Source | ~/Code/joelhooks/joelclaw/infra/talon/src/ |
| Config | ~/.config/talon/config.toml |
| Service monitors | ~/.joelclaw/talon/services.toml |
| Default config | ~/Code/joelhooks/joelclaw/infra/talon/config.default.toml |
| Default services template | ~/Code/joelhooks/joelclaw/infra/talon/services.default.toml |
| Voice stale cleanup | ~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh |
| State | ~/.local/state/talon/state.json |
| Probe results | ~/.local/state/talon/last-probe.json |
| Log | ~/.local/state/talon/talon.log (JSON lines, 10MB rotation) |
| Launchd plist | ~/Code/joelhooks/joelclaw/infra/launchd/com.joel.talon.plist |
| RBAC guard manifest | ~/Code/joelhooks/joelclaw/k8s/apiserver-kubelet-client-rbac.yaml |
| Worker stdout | ~/.local/log/system-bus-worker.log |
| Worker stderr | ~/.local/log/system-bus-worker.err |
| Talon launchd log | ~/.local/log/talon.err |
export PATH="$HOME/.cargo/bin:$PATH"
cd ~/Code/joelhooks/joelclaw/infra/talon
cargo build --release
cp target/release/talon ~/.local/bin/talon
talon (single binary)
├── Worker Supervisor Thread (only when external launchd supervisor is not loaded)
│ ├── Kill orphan on port 3111
│ ├── Spawn bun (child process)
│ ├── Signal forwarding (SIGTERM → bun)
│ ├── Health poll every 30s
│ ├── PUT sync after healthy startup
│ └── Crash recovery: exponential backoff 1s→30s
│
├── Infrastructure Probe Loop (main thread, 60s)
│ ├── Colima VM alive?
│ ├── Docker socket responding?
│ ├── Talos container running?
│ ├── k8s API reachable?
│ ├── Node Ready + schedulable?
│ ├── Flannel daemonset ready?
│ ├── Redis PONG?
│ ├── Inngest /health 200?
│ ├── Typesense /health ok?
│ └── Worker /api/inngest 200?
│
└── Escalation (on failure)
├── Tier 1a: bridge-heal (force-cycle Colima on localhost↔VM split-brain)
├── Tier 1b: k8s-reboot-heal.sh (300s timeout, RBAC drift guard, VM `br_netfilter` repair, warmup-aware post-Colima invariants including deployment readiness + ImagePullBackOff pod reset, then voice-agent stale cleanup + launchd kickstart via `infra/voice-agent/cleanup-stale.sh`)
├── Tier 2: pi agent (cloud model, 10min cooldown, bounded by `agent.timeout_secs`; subprocess output uses temp files and timeout kills the whole process group so a stuck pi child cannot freeze Talon's health loop)
├── Tier 3: pi agent (Ollama local, network-down fallback, same process-group timeout guard)
└── Tier 4: Telegram + iMessage SOS fan-out (15min critical threshold)
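The crash-recovery line in the tree above ("exponential backoff 1s→30s") can be sketched as a small delay function. This is only an illustration, not Talon's actual code: the doubling factor, the function name `backoff_delay`, and the two constants are assumptions; the source only states the 1s starting point and the 30s cap.

```rust
use std::time::Duration;

// Assumed bounds, taken from the "1s→30s" note in the architecture tree.
const BASE_DELAY_SECS: u64 = 1;
const MAX_DELAY_SECS: u64 = 30;

/// Delay before restart attempt `attempt` (0-based).
/// Hypothetical policy: double on every crash, clamp at the cap.
fn backoff_delay(attempt: u32) -> Duration {
    let secs = 1u64
        // checked_shl returns None once the shift exceeds the integer width,
        // which we treat as "already at the cap".
        .checked_shl(attempt)
        .map(|doubled| doubled.saturating_mul(BASE_DELAY_SECS))
        .unwrap_or(MAX_DELAY_SECS);
    Duration::from_secs(secs.min(MAX_DELAY_SECS))
}
```

Under these assumptions the sequence is 1s, 2s, 4s, 8s, 16s, then 30s for every later attempt, so a worker that keeps crashing is retried at most twice a minute.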
healthy → degraded (1 critical probe failure)
degraded → failed (3 consecutive failures)
failed → investigating (agent spawned)
investigating → healthy (probes pass again)
investigating → critical (agent failed to fix)
critical → sos (SOS sent via Telegram + iMessage)
any → healthy (all probes pass)
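One way to read the transitions above is as a pure function from (state, event) to the next state. The sketch below is an assumption-laden model, not Talon's source: the `State` and `Event` type names, the event variants, and the rule that unlisted combinations leave the state unchanged are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Healthy,
    Degraded,
    Failed,
    Investigating,
    Critical,
    Sos,
}

// Hypothetical event names; the source only lists the transitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    AllProbesPass,
    CriticalProbeFailed { consecutive: u32 },
    AgentSpawned,
    AgentFailed,
    SosSent,
}

fn next_state(state: State, event: Event) -> State {
    match (state, event) {
        // "any → healthy" takes priority over every other rule.
        (_, Event::AllProbesPass) => State::Healthy,
        (State::Healthy, Event::CriticalProbeFailed { .. }) => State::Degraded,
        (State::Degraded, Event::CriticalProbeFailed { consecutive }) if consecutive >= 3 => {
            State::Failed
        }
        (State::Failed, Event::AgentSpawned) => State::Investigating,
        (State::Investigating, Event::AgentFailed) => State::Critical,
        (State::Critical, Event::SosSent) => State::Sos,
        // Assumption: anything not listed above keeps the current state.
        (current, _) => current,
    }
}
```

Keeping the function pure makes the table easy to check against `talon --status` output by hand: feed in the observed events and compare the resulting state.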
| Probe | Command | Critical? |
|---|---|---|
| colima | colima status | Yes |
| docker | docker ps (Colima socket) | Yes |
| talos_container | docker inspect joelclaw-controlplane-1 | Yes |
| k8s_api | kubectl get nodes | Yes |
| node_ready | kubectl jsonpath for Ready condition | Yes |
| node_schedulable | kubectl jsonpath for spec (taints/cordon) | Yes |
| flannel | kubectl -n kube-system get daemonset kube-flannel -o jsonpath=... | No |
| redis | kubectl exec redis-0 -- redis-cli ping | Yes |
| kubelet_proxy_rbac | kubectl auth can-i --as=<apiserver-kubelet-client*> {get,create} nodes --subresource=proxy | Yes |
| vm:docker | ssh -F ~/.colima/_lima/colima/ssh.config lima-colima docker ps | No |
| vm:k8s_api | ssh ... python socket probe :6443 | No |
| vm:redis | ssh ... python socket probe :6379 | No |
| vm:inngest | ssh ... python socket probe :8288 | No |
| vm:typesense | ssh ... python socket probe :8108 | No |
| inngest | curl localhost:8288/health | No |
| typesense | curl localhost:8108/health | No |
| worker | curl localhost:3111/api/inngest | No |
Critical probes trigger escalation immediately. Non-critical probes escalate only after 3 consecutive failures.
VM probes are witness probes only. They let Talon classify "service alive in VM but dead on localhost" as a Colima bridge split-brain and run bridge-heal instead of full recovery first.
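The witness-probe rule can be illustrated by pairing each host-side probe (for example `inngest`) with its VM counterpart (`vm:inngest`). The following is a rough sketch under assumptions: the `Diagnosis` type, the function names, and the idea that classification is done per probe pair are hypothetical; only the split-brain rule and the two Tier 1 heal steps come from the text above.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Diagnosis {
    Healthy,
    /// Service answers inside the VM but not on localhost.
    BridgeSplitBrain,
    /// Service is dead on both sides of the bridge.
    ServiceDown,
}

/// `localhost_ok` is the host-side probe result, `vm_ok` the witness probe result.
fn classify(localhost_ok: bool, vm_ok: bool) -> Diagnosis {
    match (localhost_ok, vm_ok) {
        (true, _) => Diagnosis::Healthy,
        (false, true) => Diagnosis::BridgeSplitBrain,
        (false, false) => Diagnosis::ServiceDown,
    }
}

/// First heal step to try, following the Tier 1a / Tier 1b split.
fn first_heal_step(diagnosis: &Diagnosis) -> Option<&'static str> {
    match diagnosis {
        Diagnosis::Healthy => None,
        Diagnosis::BridgeSplitBrain => Some("bridge-heal"),
        Diagnosis::ServiceDown => Some("k8s-reboot-heal.sh"),
    }
}
```

The point of the split is cost: a bridge split-brain only needs the Colima force-cycle, so the heavier reboot-heal script is reserved for the case where the VM-side witness also fails.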
Add probes in ~/.joelclaw/talon/services.toml without rebuilding talon:
[launchd.gateway]
label = "com.joel.gateway"
critical = true
timeout_secs = 5
[http.gateway_slack]
url = "http://127.0.0.1:3018/health/slack"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5
[launchd.voice_agent]
label = "com.joel.voice-agent"
critical = false
timeout_secs = 5
[script.gateway_telegram_409]
command = "test $(tail -20 /tmp/joelclaw/gateway.err 2>/dev/null | grep -c '409: Conflict') -lt 5"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5
[script.colima_orphan_usernet]
command = "test $(pgrep -f 'limactl usernet' | wc -l) -le 2"
critical = true
critical_after_consecutive_failures = 2
timeout_secs = 5
[script.k8s_disk_pressure]
command = "! kubectl get nodes -o jsonpath='{.items[0].spec.taints}' 2>/dev/null | grep -q disk-pressure"
critical = true
critical_after_consecutive_failures = 1
timeout_secs = 10
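The `critical_after_consecutive_failures` key used in several entries above behaves like a per-probe debounce counter. Here is a minimal sketch of that idea, assuming a counter that resets on the first passing result; the `Debouncer` type and its `record` method are hypothetical names, not Talon's API.

```rust
use std::collections::HashMap;

/// Tracks consecutive failures per probe name (hypothetical helper).
struct Debouncer {
    failures: HashMap<String, u32>,
}

impl Debouncer {
    fn new() -> Self {
        Self { failures: HashMap::new() }
    }

    /// Record one probe result. Returns true when the probe has failed
    /// `threshold` times in a row and should be treated as critical.
    fn record(&mut self, probe: &str, passed: bool, threshold: u32) -> bool {
        if passed {
            // Assumption: a single pass clears the streak.
            self.failures.remove(probe);
            return false;
        }
        let count = self.failures.entry(probe.to_string()).or_insert(0);
        *count += 1;
        // A threshold of 0 is treated like the default of 1 (immediate).
        *count >= threshold.max(1)
    }
}
```

With a threshold of 3, as recommended for `http.gateway_slack`, two flaky cycles are absorbed silently and only the third consecutive failure escalates; with the default of 1 the first failure escalates.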
- `launchd.<name>` passes when `launchctl list <label>` reports a non-zero PID, or when `launchctl print system/<label>` / `launchctl print gui/$(id -u)/<label>` reports `state = running`. This matters because `com.joel.gateway` is a system LaunchDaemon while Talon itself is a user LaunchAgent.
- `http.<name>` passes on HTTP 200.
- `script.<name>` passes on exit code 0 and fails on non-zero (runs via `sh -c`).
- `critical = true` escalates as soon as the probe fails (or after debounce, if configured).
- `critical_after_consecutive_failures = N` debounces critical alerts for dynamic probes (default 1 = immediate).
- `http.gateway_slack` uses the gateway endpoint `GET /health/slack`, which fails (503) when the Slack channel is not started; it should be debounced (recommended: 3 cycles).
- Do not probe `http://127.0.0.1:8081/` for `voice_agent` by default: the root path returns 503 when idle and causes false SOS noise.
- `voice_agent` now clears stale `uv`/`main.py` listeners on `:8081` before `launchctl kickstart` to avoid bind conflicts after force-cycles.
- Talon reloads the file when the `services.toml` mtime changes (no restart required).
- `kill -HUP $(launchctl print gui/$(id -u)/com.joel.talon | awk '/pid =/{print $3; exit}')` forces an immediate reload.

Recent dynamic probes added for the 2026-03-17 Colima/Restate incident:
- `script.redis_aof_health`: critical after 3 failures; checks `aof_last_bgrewrite_status:ok` to catch Redis AOF rewrite/persistence corruption.
- `script.colima_vm_uptime`: critical after 2 failures; requires VM uptime >120s to catch Colima crash loops after force-cycles.
- `script.restate_worker_ready`: critical after 3 failures; verifies the `restate-worker` pod reports `Ready=true` before workloads are trusted.
- `script.kvm_device_present`: non-critical witness probe; records whether `/dev/kvm` is present inside Colima for nested-virt / Firecracker diagnosis.

- `GET http://127.0.0.1:9999/health` returns Talon state JSON.
- Configured under `[health]` in `~/.config/talon/config.toml`.

SOS settings under `[escalation]`:
- `sos_telegram_chat_id`
- `sos_telegram_secret_name` (defaults to `telegram_bot_token`)
- The secret is read with `secrets lease <name> --ttl ...` (no `--raw`). If you still see `curl: (3) URL rejected: Malformed input to a URL function`, redeploy the latest Talon binary.
- `sos_recipient`

Talon is active as `com.joel.talon`:
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid =|program =|last exit code ="
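The same fields the `rg` filter above picks out can be checked programmatically. The sketch below assumes `launchctl print` emits lines of the form `state = running` and `pid = <n>` (the fields named in the filter); the function names are hypothetical and this is not how Talon itself is confirmed to parse the output.

```rust
/// True when the captured `launchctl print` output has a `state = running` line.
fn launchd_running(print_output: &str) -> bool {
    print_output
        .lines()
        .any(|line| line.trim() == "state = running")
}

/// First `pid = <n>` value in the output, if any.
fn launchd_pid(print_output: &str) -> Option<u32> {
    print_output.lines().find_map(|line| {
        line.trim()
            .strip_prefix("pid = ")?
            .parse::<u32>()
            .ok()
    })
}
```

A caller would capture the command's stdout (for example with `std::process::Command`) and pass it in as a string; a missing `pid` line together with a non-running state suggests the job is loaded but not started.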
Reload binary/config after deploy:
launchctl kickstart -k gui/$(id -u)/com.joel.talon
Single owner for worker supervision is mandatory:
- When `com.joel.system-bus-worker` is loaded, Talon auto-disables its internal worker supervisor to prevent port-3111 thrash.
- `com.joel.system-bus-worker` is a system LaunchDaemon, so verify it with `launchctl print system/...`; `launchctl list <label>` only checks the current user bootstrap domain and can lie by omission.

launchctl print system/com.joel.system-bus-worker | rg "state =|pid =|program =|last exit code ="
Legacy services should stay disabled when fully cut over:
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.joel.k8s-reboot-heal.plist
# Validate config + service monitor files
talon validate | python3 -m json.tool
# Check what talon sees right now
talon --check | python3 -m json.tool
# Check state machine
talon --status | python3 -m json.tool
# Broken-pipe robustness smoke test (should exit 0)
talon --check | head -n 1 >/dev/null
# Check health endpoint payload
curl -sS http://127.0.0.1:9999/health | python3 -m json.tool
# Check talon's own logs
tail -20 ~/.local/state/talon/talon.log | python3 -m json.tool
# Check launchd
launchctl list | grep talon
tail -50 ~/.local/log/talon.err
# Manual probe test
DOCKER_HOST=unix:///Users/joel/.colima/default/docker.sock docker inspect --format '{{.State.Status}}' joelclaw-controlplane-1
kubectl exec -n joelclaw redis-0 -- redis-cli ping
kubectl auth can-i --as=apiserver-kubelet-client get nodes --subresource=proxy --all-namespaces
kubectl auth can-i --as=apiserver-kubelet-client create nodes --subresource=proxy --all-namespaces
ssh -F ~/.colima/_lima/colima/ssh.config lima-colima 'curl -sS http://127.0.0.1:8288/health'
# Force bridge repair (same behavior Talon uses for split-brain)
colima stop --force && colima start
# Manual voice-agent stale cleanup (same post-gate step k8s-reboot-heal runs)
~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh
Talon now monitors failure modes discovered during the Firecracker development incident:
| Probe | What it detects | Critical? |
|---|---|---|
| script:redis_aof_health | Corrupted Redis AOF from VM crash mid-write | Yes (after 3) |
| script:colima_vm_uptime | VM crash-loop (uptime < 120s = just restarted) | Yes (after 2) |
| script:restate_worker_ready | Restate worker pod not 1/1 Ready | Yes (after 3) |
| script:kvm_available | Whether /dev/kvm exists (nested virt status) | No (informational) |
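The uptime rule in the table (under 120s means the VM just restarted) can be sketched as a small check. The source does not say how the probe obtains the uptime; reading Linux's `/proc/uptime` format (`<uptime_secs> <idle_secs>`) inside the VM is an assumption here, and both function names are hypothetical.

```rust
/// Parse the first field of `/proc/uptime`-style text.
fn parse_uptime_secs(proc_uptime: &str) -> Option<f64> {
    proc_uptime
        .split_whitespace()
        .next()?
        .parse::<f64>()
        .ok()
}

/// True when the VM has been up for strictly more than `min_secs`.
/// Unparseable input counts as a failure, so a broken probe never looks healthy.
fn vm_uptime_ok(proc_uptime: &str, min_secs: f64) -> bool {
    matches!(parse_uptime_secs(proc_uptime), Some(secs) if secs > min_secs)
}
```

Because the probe is debounced to 2 consecutive failures, a single legitimate restart is tolerated for one cycle, while a crash loop that keeps resetting uptime below the threshold becomes critical.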
nestedVirtualization ON + heavy Docker build
→ Colima VZ VM crash (silent, no crash report)
→ Docker daemon restart → Talos container killed
→ Redis mid-write → AOF corruption → crash-loop
→ Restate mid-journal → stale invocations → infinite retries
→ Lima socket forwarding broken → docker CLI dead on macOS
Talon detects each stage:
- `colima_vm_uptime` < 120s → VM just crashed
- `redis` probe fails → Redis down
- `redis_aof_health` fails → AOF corrupted (needs manual fix)
- `restate_worker_ready` fails → worker can't start (may be `/dev/kvm` mount or image pull)

Talon cannot auto-fix Redis AOF corruption (requires `redis-check-aof --fix`). It WILL escalate to the pi agent (Tier 2), which should load the k8s skill's Redis AOF Recovery procedure.
- `infra/k8s-reboot-heal.sh`: Tier 1 heal script
- `infra/worker-supervisor/`: Original standalone worker supervisor (superseded)