一键导入
gaia-debugging
Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix
菜单
Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix
基于 SOC 职业分类
One-command drift detection. Composes audit-list + oia-audit + audit-trend into a single primitive — finds the most recent audit in `metaharness-audit` namespace, runs a fresh audit against the current repo, diffs them via ADR-152 §3.1 similarity, and alerts when structural distance crosses `--threshold`. Iter 53 of ADR-150 deep integration.
ADR-152 — weighted similarity between two harness fingerprints (genome + score JSON). Returns overall score in [0,1] plus per-component breakdown (cosine over 9 numerics, categorical agreement over 4 enums, jaccard over agent_topology). Unblocks ADR-151 §3.2 Recommender, §3.3 Drift Detection, §3.5 Plugin Compat. Pure-TS, no `@metaharness/*` dep — preserves ADR-150's four architectural constraints.
Composite Phase-2 audit worker (ADR-150). Bundles harness oia-manifest + threat-model + mcp-scan into one timestamped audit record stored in the `metaharness-audit` memory namespace. Designed for cron-scheduled drift detection.
7-section repo readiness report from `metaharness genome <path>`. Returns repo_type / agent_topology / risk_score / mcp_surface / test_confidence / publish_readiness. Pure-read; degrades gracefully (ADR-150).
Static security scan of a harness's declared MCP surface via `harness mcp-scan <path>`. Reads `.mcp/servers.json` + `.harness/claims.json`. Pure-read, no dispatch. Exits 1 on findings at or above `--fail-on` severity.
Scaffold a custom AI agent harness via `metaharness new <name> --template <id> --host <id>`. Defaults to DRY-RUN (no writes) unless --confirm is passed. Refuses to write to the calling repo root or anywhere inside it. Honors ADR-150 architectural constraint + ruflo's "destructive-action confirmation" pattern.
| name | gaia-debugging |
| description | Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix |
| argument-hint | <task_id> [--results=<path>] |
| allowed-tools | Bash Read mcp__claude-flow__memory_search mcp__claude-flow__memory_store mcp__claude-flow__agentdb_pattern_search mcp__claude-flow__agentdb_pattern_store |
When a GAIA question fails, systematically diagnose the root cause and propose a targeted fix.
task_id returns the wrong answer or times out| Code | Mode | Symptom | Fix direction |
|---|---|---|---|
| TG | Tool Gap | Agent lacks a required tool (no image OCR, no PDF reader) | Add tool to catalogue |
| RM | Reasoning Miss | Agent has the right data but draws wrong conclusion | Improve system prompt, add CoT instruction |
| EB | Extraction Bug | Answer is in the trace but FINAL_ANSWER: regex fails | Fix answer extraction pattern |
| LI | Loop Issue | Agent loops (re-asks same tool call) and hits turn limit | Increase max-turns or add loop-detection |
| DS | Dataset Shift | Ground truth differs from what web currently shows | Flag for HAL dataset audit |
| AT | API Timeout | Tool call times out; agent never gets the result | Increase per-turn timeout |
# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
const q = r.results.find(x => x.task_id === '$TASK_ID');
console.log(JSON.stringify(q, null, 2));
"
Look at the trace output:
node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
--level 1 --limit 1 \
--task-id $TASK_ID \
--models claude-sonnet-4-6 \
--max-turns 20 \
--output json
| Failure | Action |
|---|---|
| TG — missing web_browse | Verify gaia-tools/index.ts exports web_browse; check tool registration |
| TG — missing image OCR | Add image_describe tool call; verify GOOGLE_AI_API_KEY |
| RM — reasoning | Add a system prompt instruction: "Before answering, list all facts you have gathered" |
| EB — extraction | Test the FINAL_ANSWER_RE regex against the trace manually |
| LI — loop | Add a tool-call deduplication guard in gaia-agent.ts |
| AT — timeout | Set DEFAULT_PER_TURN_TIMEOUT_MS higher or use --max-turns flag |
# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json
# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
--namespace gaia-debug-patterns \
--key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
--value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"
node -e "
const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
const cat = createDefaultToolCatalogue({});
console.log('Tools registered:', cat.definitions.map(t => t.name));
"
Expected: web_search, file_read, web_browse, image_describe, python_exec
After resolving a debugging session, store the finding:
npx @claude-flow/cli@latest memory store \
--namespace gaia-debug-patterns \
--key "session-$(date +%Y%m%d-%H%M)" \
--value '{"task_id":"$TASK_ID","failure_mode":"$CODE","fix":"$FIX","verified":true}'
Search for similar past failures:
npx @claude-flow/cli@latest memory search \
--namespace gaia-debug-patterns \
--query "extraction bug final answer regex"