
cortex-troubleshoot

Stars: 1

Forks: 2

Last updated: June 4, 2026 at 06:25

Troubleshoot cortex connection failures, missing logs, unhealthy containers, restart loops, or vague "logs aren't working" reports.


Source

jmagar

jmagar/cortex


Related occupations (SOC)

Based on the SOC occupational classification

Network and Computer Systems Administrators · Computer and Mathematical Occupations · SOC 15-1244


name	cortex-troubleshoot
description	Troubleshoot cortex connection failures, missing logs, unhealthy containers, restart loops, or vague "logs aren't working" reports.

Cortex Troubleshooting Skill

Diagnose cortex problems systematically. Use the binary's observability counters and existing diagnostic tooling rather than guessing — the codebase exposes most state needed to localize a failure.

Decision tree — pick the right diagnostic

Match the user's report against one of these branches and follow only that branch. Don't run every check: that is the job of cortex-dr, and a full sweep buries the signal when the failure is narrow.

Branch A — "MCP can't connect" / "Failed to reconnect" / "401 / 404 from /mcp"

Most common cause: empty / wrong $CLAUDE_PLUGIN_OPTION_SERVER_URL, mismatched $CLAUDE_PLUGIN_OPTION_API_TOKEN, or service not running.

1. Is anything listening on the MCP port? Run `ss -tlnp | grep -E ":${CLAUDE_PLUGIN_OPTION_MCP_PORT}\\b"`. If the output is empty, the service is down: go to Branch C.
2. Is the URL Claude Code is using sane? Read `~/.claude/settings.json`, find the `pluginConfigs` key that starts with `cortex@`, and inspect `options.server_url`. An empty string is a known footgun (the `.mcp.json` substitution produces a literal `/mcp`). Check that the value is non-empty, has a scheme, and has no trailing `/mcp`.
3. Does observed auth match configured auth? Run `curl -sS -o /dev/null -w '%{http_code}' "$CLAUDE_PLUGIN_OPTION_SERVER_URL/mcp"`.
   - If `$CLAUDE_PLUGIN_OPTION_NO_AUTH` is true or no bearer/OAuth auth is configured, 200 or an MCP protocol-level 400/405 can be normal evidence that the route exists.
   - If bearer or OAuth auth is enabled, expect 401 for an unauthenticated request.
   - If 404, the route is wrong or a different server owns that port. If the connection is refused (curl reports `000`), go to Branch C.
   - If 200 while auth is intended to be enabled, flag it as an auth configuration mismatch.
4. Token roundtrip in bearer mode: `curl -sS -X POST -H "Authorization: Bearer $CLAUDE_PLUGIN_OPTION_API_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json, text/event-stream" -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"curl","version":"0"}}}' "$CLAUDE_PLUGIN_OPTION_SERVER_URL/mcp"`.
   - 401 means the token is wrong.
   - 200 with a valid response means the server is fine and the problem is in Claude Code's MCP client config.
   - For OAuth mode, use the OAuth client flow instead of bearer-token curl.
   - If this test fails unexpectedly, verify that the MCP protocol version string (`2025-06-18`) matches the current spec.
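
The URL sanity check and the reading of the unauthenticated status code described above can be sketched as two small shell functions. This is a minimal sketch, not something shipped with cortex: the function names `check_server_url` and `classify_mcp_status`, the `auth`/`noauth` argument, and the output labels are all hypothetical. Any combination the text above does not cover is reported as `unclassified` rather than guessed.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for Branch A; names and labels are illustrative only.

# check_server_url URL
# Prints "ok", or the first problem found: empty, missing scheme, trailing /mcp.
check_server_url() {
  local url="$1"
  if [ -z "$url" ]; then
    echo "empty"
    return 1
  fi
  case "$url" in
    http://*|https://*) ;;
    *) echo "missing scheme"; return 1 ;;
  esac
  case "$url" in
    */mcp|*/mcp/) echo "trailing /mcp"; return 1 ;;
  esac
  echo "ok"
}

# classify_mcp_status CODE MODE
# CODE is what curl -w '%{http_code}' printed (000 means no HTTP response).
# MODE is "auth" when bearer or OAuth is enabled, "noauth" otherwise.
classify_mcp_status() {
  case "$2:$1" in
    *:000) echo "service-down" ;;      # connection refused: go to Branch C
    *:404) echo "wrong-route" ;;       # bad route or another server on the port
    auth:401) echo "auth-enforced" ;;  # expected for an unauthenticated request
    auth:200) echo "auth-mismatch" ;;  # auth intended but not enforced
    noauth:200|noauth:400|noauth:405) echo "route-ok" ;;
    *) echo "unclassified" ;;          # not covered above: inspect manually
  esac
}
```

For example, `classify_mcp_status "$(curl -sS -o /dev/null -w '%{http_code}' "$CLAUDE_PLUGIN_OPTION_SERVER_URL/mcp")" auth` turns the raw status code into one of the labels above.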

Branch B — "No logs from <host>" / "host X stopped sending" / "missing entries"

1. Does the host appear in the hosts list at all? Call the MCP tool `cortex action=hosts`. If the host is absent, no logs ever arrived: check the forwarding config on `<host>`. If it is present with an old `last_seen`, forwarding stopped: check rsyslog or the forwarder on the host.
2. Is the listener actually accepting connections? `ss -tlnp | grep -E ":(${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT})\\b"` should show our process or the container's published port. From `<host>`, `nc -zv <our_host> "${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT}"` should connect.
3. Recent forwarding errors on the host? Run `ssh <host> "sudo journalctl -t rsyslogd -n 30 --no-pager"` and look for omfwd errors (DNS resolution, peer closed, EOF on TCP). Common patterns we've seen: a stale forwarder pointing at a dead host, idle TCP timeout flapping, a missing rsyslog drop-in.
4. Drop-in present and correct? `ssh <host> "cat /etc/rsyslog.d/99-cortex.conf 2>/dev/null"` should contain `*.* @@<our_host>:<externally reachable syslog port>` (the `@@` prefix means TCP), where the port is usually `${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT}`. If it is missing or wrong, use cortex-deploy-dropins.
5. For Docker container logs: if the user expected logs from a container in `$CLAUDE_PLUGIN_OPTION_FLEET_HOSTS` but doesn't see them, check `$CLAUDE_PLUGIN_OPTION_DOCKER_INGEST_ENABLED`. If false, ingest is off entirely. If true, verify that the docker-socket-proxy on that host is reachable: `curl -sS http://<host>:2375/_ping` should return `OK`.
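
The drop-in check above comes down to comparing one expected line against the file on the host. A minimal sketch of that comparison follows. The function names are hypothetical, and it assumes the drop-in uses the legacy `*.* @@host:port` form on a line of its own with exactly that spacing; a drop-in written in another rsyslog syntax would need a different match.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for Branch B; not part of cortex-deploy-dropins.

# Resolve the externally reachable syslog port the same way the checks above do:
# prefer the host-published port, fall back to the listener port.
cortex_syslog_port() {
  echo "${CLAUDE_PLUGIN_OPTION_SYSLOG_HOST_PORT:-$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT}"
}

# expected_dropin_line OUR_HOST
# Prints the TCP forwarding rule the drop-in is expected to contain.
expected_dropin_line() {
  echo "*.* @@$1:$(cortex_syslog_port)"
}

# dropin_matches OUR_HOST
# Reads drop-in content on stdin; succeeds only if one whole line equals
# the expected rule (fixed-string match, so "*" and "." are literal).
dropin_matches() {
  grep -qxF -- "$(expected_dropin_line "$1")"
}
```

A possible use is `ssh <host> "cat /etc/rsyslog.d/99-cortex.conf 2>/dev/null" | dropin_matches <our_host>`, with a non-zero exit status meaning the drop-in is missing or points somewhere else.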

Branch C — Service down / crashing / unhealthy

1. Get current state: `docker ps --filter name=cortex --format '{{.Status}}'`.
2. If the container recently restarted or is crashing, get the actual error: use cortex-logs for the last 100 lines, or run `docker compose logs` manually. Look for panic messages, port-bind errors (`address already in use`), DB lock errors, and OOM kills.
3. Common service-failure causes (ranked by frequency in this plugin's history):
   1. Port `$CLAUDE_PLUGIN_OPTION_SYSLOG_PORT` or `$CLAUDE_PLUGIN_OPTION_MCP_PORT` held by another process. First identify the owner with `ss -tulpn`, `lsof`, or `fuser`; only kill or restart anything after the user approves the specific process and impact.
   2. Database lock (another cortex stdio process holds it). Run `pgrep -af "cortex"` to list candidates; only kill stragglers after approval.
   3. Docker image missing or stale: run `docker compose pull` to refresh.
4. If the healthcheck is failing but `/health` works manually, the container is unhealthy because the healthcheck command inside the image is wrong or can't run. Compare the image version to what you expect: `docker inspect cortex | jq '.[0].Config.Image'`.
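
Scanning the last lines of container output for the failure signatures listed above can be sketched as a small filter. The function name and labels are hypothetical. Only `address already in use` comes from the text above; the other match strings (`database is locked`, `out of memory`, `OOMKilled`) are assumptions about how these failures are worded and may need adjusting to what cortex actually logs.

```bash
#!/usr/bin/env bash
# Hypothetical log triage helper for Branch C; match strings are assumptions.

# classify_cortex_logs
# Reads log text on stdin and prints one label per failure signature found.
classify_cortex_logs() {
  local logs
  logs="$(cat)"
  printf '%s\n' "$logs" | grep -qi 'panic' && echo "panic"
  printf '%s\n' "$logs" | grep -qi 'address already in use' && echo "port-bind"
  printf '%s\n' "$logs" | grep -qiE 'database is locked|db lock' && echo "db-lock"
  printf '%s\n' "$logs" | grep -qiE 'oom[-_ ]?kill|out of memory' && echo "oom"
  return 0  # finding nothing is not an error; it just prints no labels
}
```

One way to feed it is `docker compose logs --tail 100 | classify_cortex_logs`. Empty output means none of the known signatures appeared, which is itself worth surfacing rather than guessing at a cause.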

Branch D — "Something's off" / vague / user doesn't know

Use cortex-dr for the comprehensive preflight and health check. Its PASS / WARN / FAIL output narrows the problem to a specific check. Then re-enter this skill on the failing check's category.

Use observability counters

The binary exposes runtime counters via cortex action=stats and /health. Useful signals:

- `total_logs` not increasing → the ingest pipeline is broken, not just MCP.
- `write_blocked: true` → the storage budget has tripped: the oldest logs are being purged, but the purge can't keep up. Compare `$CLAUDE_PLUGIN_OPTION_MAX_DB_SIZE_MB` with free disk space.
- `phantom_fts_rows` growing → retention purges aren't merging FTS5 cleanly; this usually self-recovers.
- `last_ingest_at` minutes stale → forwarders aren't reaching us.
- Newer counters in `RuntimeObservability` (since v0.13.0): UDP/TCP packets, ingest queue depth, writer flush failures. Pull these via the `/health` endpoint or the `stats` action and use them to tell "ingest path" failures from "writer path" failures.
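
Two of the signals above, a `total_logs` value that does not move and a stale `last_ingest_at`, are comparisons between samples, which the sketch below makes explicit. The function names are hypothetical. It assumes `total_logs` only ever grows between two closely spaced samples and that the caller has already converted `last_ingest_at` to epoch seconds. How the values are extracted from `/health` or the `stats` action is left out, because the response layout is not described here.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for reading the counters; inputs are plain integers.

# ingest_state BEFORE AFTER
# Compares two total_logs samples taken some seconds apart.
ingest_state() {
  local before="$1" after="$2"
  if [ "$after" -gt "$before" ]; then
    echo "advancing"
  else
    echo "stalled"
  fi
}

# ingest_is_stale LAST_INGEST_EPOCH NOW_EPOCH MAX_AGE_SECONDS
# Succeeds when the last ingest is older than the allowed age.
ingest_is_stale() {
  [ $(( $2 - $1 )) -gt "$3" ]
}
```

If `ingest_state` reports `stalled` while MCP calls still succeed, the problem is on the ingest side, which points back to Branch B rather than Branch A.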

Don't over-fix

For a single-host symptom, don't restart the whole stack — just fix that host's forwarder.
For an MCP-only failure with healthy ingest, don't touch the listener config.
If the immediate problem is a missing config, prefer cortex-redeploy over manual Docker commands.

When to escalate to the user

After a confident diagnosis, propose the fix and ask before applying it for anything destructive: changing settings.json, killing processes, deleting files, switching deploy modes.
If checks return inconsistent state (e.g. listener says ours, but the binary says it isn't writing), surface the inconsistency rather than guessing.
If the failure looks like an upstream bug (panic, deadlock, repeated crash on the same input), gather the journalctl/docker-logs output and stop — don't try multiple fix attempts on suspected source-code bugs.

More from this repository

cortex-report

jmagar/cortex

This skill should be used when the user asks for a homelab health report, syslog summary, fleet status report, log analysis summary, 'what happened in the last 24 hours', 'show me this week's errors', 'summarize recent activity', or any time-bounded log analysis that should produce a written markdown report.

2026-06-161

cortex

jmagar/cortex

This skill should be used when the user asks to "search logs", "check errors", "tail logs", "show recent logs", "find log entries", "correlate events", "list hosts", "log stats", "syslog", "check homelab logs", or mentions system logs, syslog, log analysis, or log intelligence across homelab hosts.

2026-06-161

cortex-redeploy

jmagar/cortex

Re-run the cortex plugin setup hook with the current userConfig and verify the Docker Compose deployment. Use when the user asks to redeploy cortex, apply plugin config changes immediately, rerun the setup hook, refresh the Docker deployment, or recover after an automated SessionStart/ConfigChange hook did not run.

2026-06-101

cortex-deploy-dropins

jmagar/cortex

Deploy rsyslog forwarding drop-ins to configured fleet hosts over SSH. Use when configuring fleet forwarding, repairing missing rsyslog forwarding, or updating forwarding after server_url or syslog port changes.

2026-06-041

cortex-dr

jmagar/cortex

Run a comprehensive cortex health check covering environment, config quality, storage, ports, service status, HTTP health, MCP actions, listener reachability, Docker ingest, and fleet rsyslog forwarding. Use when the user asks for syslog doctor, deployment diagnostics, first-run preflight, health check, sanity check, or broad deployment verification.

2026-06-041

cortex-frustration-assessment

jmagar/cortex

This skill should be used after running cortex action=abuse_investigate to analyze the resulting evidence bundle. Use when the user asks to assess frustration incidents, evaluate abuse signals, analyze agent or user friction, produce a frustration report, or follow up on abuse_investigate results.

2026-06-041
