| name | cortex-troubleshoot |
| description | Troubleshoot cortex connection failures, missing logs, unhealthy containers, restart loops, or vague "logs aren't working" reports. |
Cortex Troubleshooting Skill
Diagnose cortex problems systematically. Use the binary's observability counters and existing diagnostic tooling rather than guessing — the codebase exposes most state needed to localize a failure.
Decision tree — pick the right diagnostic
Match the user's report against one of these branches and follow only that branch. Don't run every check: that is what cortex-dr is for, and its full output is overwhelming when the failure is narrow.
Branch A — "MCP can't connect" / "Failed to reconnect" / "401 / 404 from /mcp"
Most common causes: an empty or wrong $CLAUDE_PLUGIN_OPTION_SERVER_URL, a mismatched $CLAUDE_PLUGIN_OPTION_API_TOKEN, or the service not running.
- Is anything listening on the MCP port?
ss -tlnp | grep -E ":$CLAUDE_PLUGIN_OPTION_MCP_PORT" — if empty, the service is down → branch C
- Is the URL Claude Code is using sane?
Read ~/.claude/settings.json, find the pluginConfigs key that starts with cortex@, and inspect options.server_url. An empty string is a known footgun (the .mcp.json substitution produces a literal /mcp). Check that the value is non-empty, has a scheme, and has no trailing /mcp.
- Does observed auth match configured auth?
Run curl -sS -o /dev/null -w '%{http_code}' "$CLAUDE_PLUGIN_OPTION_SERVER_URL/mcp" and read the status code:
- If $CLAUDE_PLUGIN_OPTION_NO_AUTH is true or no bearer/OAuth auth is configured, 200 or MCP protocol-level 400/405 can be normal route evidence.
- If bearer or OAuth auth is enabled, expect 401 for an unauthenticated request.
- If 404, the route is wrong or a different server owns that port. If connection refused, branch C.
- If 200 while auth is intended to be enabled, flag it as an auth configuration mismatch.
- Token roundtrip in bearer mode:
curl -sS -X POST -H "Authorization: Bearer $CLAUDE_PLUGIN_OPTION_API_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json, text/event-stream" -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"curl","version":"0"}}}' "$CLAUDE_PLUGIN_OPTION_SERVER_URL/mcp". 401 = wrong token. 200 with valid response = server fine, problem is in Claude Code's MCP client config. For OAuth mode, use the OAuth client flow instead of bearer-token curl. Note: verify the MCP protocol version string (2025-06-18) matches the current spec if this test fails unexpectedly.
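The URL sanity check and the status-code rules above can be sketched as two small shell helpers. This is a minimal sketch, not part of the plugin: the function names check_server_url and interpret_mcp_status are hypothetical, and the second argument ("yes" or "no" for whether auth is meant to be enabled) is an assumption about how you would pass that in. curl prints 000 for %{http_code} when it gets no HTTP response at all, which is how a refused connection shows up here.

```shell
# Hypothetical helpers; names and output strings are illustrative only.

# check_server_url URL
# Prints "ok" or the first problem found; returns non-zero on a problem.
check_server_url() {
  url=$1
  case $url in
    "")                 echo "empty"; return 1 ;;
    */mcp|*/mcp/)       echo "trailing /mcp"; return 1 ;;
    http://*|https://*) echo "ok"; return 0 ;;
    *)                  echo "missing scheme"; return 1 ;;
  esac
}

# interpret_mcp_status CODE AUTH_EXPECTED
# CODE is the output of curl -w '%{http_code}' for an unauthenticated request.
# AUTH_EXPECTED is "yes" when bearer or OAuth auth is meant to be on, else "no".
interpret_mcp_status() {
  code=$1
  auth=$2
  case "$code:$auth" in
    000:*)                echo "no HTTP response, go to branch C" ;;
    404:*)                echo "wrong route or another server owns the port" ;;
    401:yes)              echo "expected, auth is enforced" ;;
    200:yes)              echo "auth configuration mismatch" ;;
    200:no|400:no|405:no) echo "normal route evidence" ;;
    *)                    echo "unclassified, inspect manually" ;;
  esac
}
```

A possible use: run check_server_url "$CLAUDE_PLUGIN_OPTION_SERVER_URL" first, then feed the curl status code into interpret_mcp_status. Combinations the rules above do not cover (for example 401 with auth disabled) fall through to the "unclassified" case instead of being guessed.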
Branch B — "No logs from <host>" / "host X stopped sending" / "missing entries"
- Does the host appear in the hosts list at all?
Call MCP tool: cortex action=hosts. If the host is absent, no logs ever arrived → check forwarding config on <host>. If it is present with an old last_seen, forwarding stopped → check rsyslog/forwarder on the host.
- Is the listener actually accepting connections?
ss -tlnp | grep -E ":(${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT})\\b" should show our process or container port publish. From <host>: nc -zv <our_host> "${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT}" should connect.
- Recent forwarding errors on the host?
ssh <host> "sudo journalctl -t rsyslogd -n 30 --no-pager" — look for omfwd errors (DNS resolution, peer closed, EOF on TCP). Common patterns we've seen: stale forwarder pointing at a dead host, idle TCP timeout flapping, missing rsyslog drop-in.
- Drop-in present and correct?
ssh <host> "cat /etc/rsyslog.d/99-cortex.conf 2>/dev/null" should contain *.* @@<our_host>:<externally reachable syslog port> (TCP), usually ${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT} — if missing or wrong, use cortex-deploy-dropins.
- For Docker container logs: if the user expected logs from a container in $CLAUDE_PLUGIN_OPTION_FLEET_HOSTS but doesn't see them, check $CLAUDE_PLUGIN_OPTION_DOCKER_INGEST_ENABLED. If false, ingest is off entirely. If true, verify the docker-socket-proxy on that host is reachable: curl -sS http://<host>:2375/_ping should return OK.
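The drop-in check above can be automated with a small filter. This is a sketch under assumptions: the helper name dropin_forwards_to is hypothetical, and it only recognises the legacy one-line form *.* @@host:port, not an action(type="omfwd" ...) block. It reads the drop-in on stdin and succeeds only if an uncommented TCP forwarding rule for the given host and port is present.

```shell
# Hypothetical helper; not part of the plugin.
# dropin_forwards_to HOST PORT
# Reads an rsyslog drop-in on stdin. Exit status 0 if it contains an
# uncommented "*.* @@HOST:PORT" rule (@@ means TCP), 1 otherwise.
dropin_forwards_to() {
  awk -v target="@@$1:$2" '
    # Field 1 must be exactly the selector, field 2 exactly the target.
    # Commented lines fail because their first field starts with "#".
    $1 == "*.*" && $2 == target { found = 1 }
    END { exit !found }
  '
}
```

A possible use, with the same placeholders as the checks above: ssh <host> "cat /etc/rsyslog.d/99-cortex.conf 2>/dev/null" | dropin_forwards_to <our_host> "${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT}". A non-zero exit covers a missing file (empty input), a commented-out rule, a UDP rule (single @), and a wrong host or port.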
Branch C — Service down / crashing / unhealthy
- Get current state:
docker ps --filter name=cortex --format '{{.Status}}'
- If recently restarted / crashing, get the actual error: use cortex-logs for the last 100 lines, or run docker compose logs manually. Look for: panic messages, port-bind errors (address already in use), DB lock errors, OOM kills.
- Common service-failure causes (ranked by frequency in this plugin's history):
- Port $CLAUDE_PLUGIN_OPTION_SYSLOG_PORT or $CLAUDE_PLUGIN_OPTION_MCP_PORT held by another process. First identify the owner with ss -tulpn/lsof/fuser; only kill or restart anything after the user approves the specific process and impact.
- Database lock (another cortex stdio process holds it). pgrep -af "cortex" to list candidates; only kill stragglers after approval.
- Docker image missing/stale: docker compose pull to refresh.
- If the healthcheck is failing but /health works manually: the container is marked unhealthy because the healthcheck command inside the image is wrong or can't run. Compare the image version to what you expect: docker inspect cortex | jq '.[0].Config.Image'.
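To make the "get current state" step repeatable, the Status string printed by docker ps can be mapped to a coarse verdict. The helper below is a minimal sketch: classify_status is a hypothetical name, and the verdict labels are illustrative. It relies only on the usual Docker status formats such as "Up 2 hours (healthy)" or "Restarting (1) 10 seconds ago". Exit code 137 means the process received SIGKILL (128 + 9), which is often but not always an OOM kill.

```shell
# Hypothetical helper; not part of the plugin.
# classify_status STATUS
# STATUS is one line from: docker ps --filter name=cortex --format '{{.Status}}'
classify_status() {
  case $1 in
    "")                        echo "not running" ;;
    Restarting*)               echo "restart loop" ;;
    "Exited (137)"*)           echo "killed, possibly OOM" ;;
    Exited*)                   echo "exited" ;;
    Up*"(unhealthy)")          echo "unhealthy" ;;
    Up*"(health: starting)")   echo "starting" ;;
    Up*"(healthy)")            echo "healthy" ;;
    Up*)                       echo "running, no healthcheck" ;;
    *)                         echo "unknown" ;;
  esac
}
```

Note that plain docker ps lists only running and restarting containers, so an empty result is treated here as "not running". The Exited cases only appear if you add -a to the command. Whichever verdict comes back, the logs remain the source of the actual error.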
Branch D — "Something's off" / vague / user doesn't know
Use cortex-dr for the comprehensive preflight and health check. Its PASS / WARN / FAIL output narrows the problem to a specific check. Then re-enter this skill on the failing check's category.
Use observability counters
The binary exposes runtime counters via cortex action=stats and /health. Useful signals:
- total_logs not increasing → ingest pipeline is broken, not just MCP
- write_blocked: true → storage budget tripped; oldest logs are being purged but the purge can't keep up. Check $CLAUDE_PLUGIN_OPTION_MAX_DB_SIZE_MB vs disk free
- phantom_fts_rows growing → retention purges aren't merging FTS5 cleanly; usually self-recovers
- last_ingest_at minutes-stale → forwarders aren't reaching us
- Newer counters in RuntimeObservability (since v0.13.0): UDP/TCP packets, ingest queue depth, writer flush failures. Pull these via the /health endpoint or the stats action and use them to localize "ingest path" vs "writer path" failures
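Counters such as total_logs only say something when sampled twice. The sketch below compares two samples taken some time apart and applies the first two signals from the list above. It is an illustration under assumptions: ingest_verdict is a hypothetical name, and how you extract total_logs and write_blocked from the stats or /health output is left open because the response shape is not specified here.

```shell
# Hypothetical helper; not part of the plugin.
# ingest_verdict BEFORE AFTER WRITE_BLOCKED
# BEFORE and AFTER are two total_logs samples (integers) taken a while apart.
# WRITE_BLOCKED is the literal string "true" or "false".
ingest_verdict() {
  before=$1
  after=$2
  blocked=$3
  if [ "$blocked" = true ]; then
    # Storage budget tripped: the size limit matters more than the delta.
    echo "storage budget tripped, compare MAX_DB_SIZE_MB with free disk"
  elif [ "$after" -le "$before" ]; then
    # Flat or shrinking count: nothing new was stored between the samples.
    echo "total_logs not increasing, ingest pipeline is broken"
  else
    echo "ingesting, $((after - before)) new logs"
  fi
}
```

The sampling interval matters: on a quiet fleet, pick a gap long enough that at least some logs should have arrived, otherwise a flat count is a false alarm.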
Don't over-fix
- For a single-host symptom, don't restart the whole stack — just fix that host's forwarder.
- For an MCP-only failure with healthy ingest, don't touch the listener config.
- If the immediate problem is a missing config, prefer cortex-redeploy over manual Docker commands.
When to escalate to the user
- After a confident diagnosis, propose the fix and ask before applying it for anything destructive: changing settings.json, killing processes, deleting files, switching deploy modes.
- If checks return inconsistent state (e.g. the listener port is held by our process, but the binary reports that it isn't writing), surface the inconsistency rather than guessing.
- If the failure looks like an upstream bug (panic, deadlock, repeated crash on the same input), gather the journalctl/docker-logs output and stop — don't try multiple fix attempts on suspected source-code bugs.