gaia-debugging

Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix

Stars: 60,814

Forks: 7,069

Updated: 2026-05-28 03:06

Source: ruvnet/ruflo

GAIA Debugging Skill

When a GAIA question fails, systematically diagnose the root cause and propose a targeted fix.

When to use

A specific task_id returns the wrong answer or times out
Pass-rate dropped between two runs and you need to find the regression
You want to understand why a particular question class is consistently failing
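
For the pass-rate regression case, one way to narrow the search is to diff per-task outcomes between the two runs. The following is a minimal sketch, assuming each results file parses to an object with a `results` array keyed by `task_id` (as in Step 1 of the workflow) and that each entry carries a boolean `correct` field. The `correct` field name and the `findRegressions` helper are hypothetical and should be adapted to the real schema.

```javascript
// Sketch: list task_ids that passed in the baseline run but fail in the current run.
// ASSUMPTION: each result has a boolean `correct` field; adjust to the real schema.
function findRegressions(baseline, current) {
  const passedBefore = new Set(
    baseline.results.filter((r) => r.correct).map((r) => r.task_id)
  );
  return current.results
    .filter((r) => !r.correct && passedBefore.has(r.task_id))
    .map((r) => r.task_id);
}

// Inline sample data standing in for two parsed results files.
const baselineRun = { results: [
  { task_id: 'a1', correct: true },
  { task_id: 'b2', correct: true },
  { task_id: 'c3', correct: false },
] };
const currentRun = { results: [
  { task_id: 'a1', correct: true },
  { task_id: 'b2', correct: false },
  { task_id: 'c3', correct: false },
] };

console.log(findRegressions(baselineRun, currentRun)); // [ 'b2' ]
```

Each returned `task_id` can then be taken through the diagnostic workflow individually.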

Failure mode taxonomy

| Code | Mode | Symptom | Fix direction |
| --- | --- | --- | --- |
| TG | Tool Gap | Agent lacks a required tool (no image OCR, no PDF reader) | Add tool to catalogue |
| RM | Reasoning Miss | Agent has the right data but draws the wrong conclusion | Improve system prompt, add CoT instruction |
| EB | Extraction Bug | Answer is in the trace but `FINAL_ANSWER:` regex fails | Fix answer extraction pattern |
| LI | Loop Issue | Agent loops (repeats the same tool call) and hits the turn limit | Increase max-turns or add loop-detection |
| DS | Dataset Shift | Ground truth differs from what the web currently shows | Flag for HAL dataset audit |
| AT | API Timeout | Tool call times out; agent never gets the result | Increase per-turn timeout |
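
To make the EB row concrete, the sketch below shows how a strict extraction pattern can miss an answer that is plainly present in the trace. The pattern `FINAL_ANSWER_RE_SKETCH` and the helper `extractFinalAnswer` are stand-ins invented for illustration; the repository's actual regex is not shown in this document and may behave differently.

```javascript
// ASSUMPTION: a stand-in pattern; the real extraction regex in the repo may differ.
const FINAL_ANSWER_RE_SKETCH = /^FINAL_ANSWER:\s*(.+)$/m;

// Returns the captured answer, or null when the pattern does not match.
function extractFinalAnswer(text) {
  const m = FINAL_ANSWER_RE_SKETCH.exec(text);
  return m ? m[1].trim() : null;
}

const samples = [
  'Reasoning...\nFINAL_ANSWER: 42',      // matches
  'Reasoning...\n**FINAL_ANSWER:** 42',  // markdown bold defeats the line-start anchor
  'Reasoning...\nFinal answer: 42',      // different casing and spacing
];
for (const s of samples) {
  console.log(JSON.stringify(extractFinalAnswer(s)));
}
// prints "42", then null, then null
```

When a trace falls into the second or third shape, the failure is an extraction bug rather than a reasoning miss, and the fix belongs in the pattern, not the prompt.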

Diagnostic workflow

Step 1 — Load the question trace

# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
  const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
  const q = r.results.find(x => x.task_id === '$TASK_ID');
  console.log(JSON.stringify(q, null, 2));
"

Step 2 — Classify the failure

Look at the trace output:

No tools called at all → RM or configuration issue
Tool called but returned error → TG or AT
Tool returned data, wrong answer → RM or EB
Correct answer in trace but marked wrong → EB
max-turns hit → LI or question too hard for current model
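
The rules above can be sketched as a small function that returns candidate failure codes. This is an illustration only: the trace fields (`toolCalls`, `hitMaxTurns`, `text`, `expected`) are hypothetical names, not the actual results schema, and `CONFIG` is a placeholder for "configuration issue", which has no code in the taxonomy.

```javascript
// Sketch of the Step 2 rules as a function returning candidate failure codes.
// ASSUMPTION: toolCalls, hitMaxTurns, text and expected are hypothetical field names.
function classifyFailure(trace) {
  const calls = trace.toolCalls ?? [];
  const text = trace.text ?? '';
  if (trace.hitMaxTurns) return ['LI'];                 // or the question is too hard for the model
  if (calls.length === 0) return ['RM', 'CONFIG'];      // no tools called at all
  if (calls.some((c) => c.error)) return ['TG', 'AT'];  // a tool call returned an error
  if (trace.expected && text.includes(trace.expected)) {
    return ['EB'];                                      // correct answer in trace, marked wrong
  }
  return ['RM', 'EB'];                                  // tools returned data, answer still wrong
}

console.log(classifyFailure({ toolCalls: [], text: '', expected: '42' }));
// [ 'RM', 'CONFIG' ]
console.log(classifyFailure({
  toolCalls: [{ name: 'web_search' }],
  text: 'FINAL_ANSWER: 42',
  expected: '42',
}));
// [ 'EB' ]
```

The substring check is deliberately crude; it only narrows the candidates, and the trace still needs to be read before settling on a code.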

Step 3 — Re-run with extended logging

node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
  --level 1 --limit 1 \
  --task-id $TASK_ID \
  --models claude-sonnet-4-6 \
  --max-turns 20 \
  --output json

Step 4 — Apply targeted fix

| Failure | Action |
| --- | --- |
| TG: missing web_browse | Verify `gaia-tools/index.ts` exports `web_browse`; check tool registration |
| TG: missing image OCR | Add `image_describe` tool call; verify `GOOGLE_AI_API_KEY` is set |
| RM: reasoning | Add a system prompt instruction: "Before answering, list all facts you have gathered" |
| EB: extraction | Test the `FINAL_ANSWER_RE` regex against the trace manually |
| LI: loop | Add a tool-call deduplication guard in `gaia-agent.ts` |
| AT: timeout | Set `DEFAULT_PER_TURN_TIMEOUT_MS` higher; raising `--max-turns` only leaves the agent room to retry the call and does not lengthen the timeout |
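
For the loop case, a deduplication guard might look like the sketch below. It assumes a tool call can be described by a name plus a JSON-serializable arguments object; the real call shape in `gaia-agent.ts` is not shown here, and `createLoopGuard` is a hypothetical helper, not an existing function in the repository.

```javascript
// Sketch of a tool-call deduplication guard.
// ASSUMPTION: calls are identified by tool name plus JSON-serializable args.
function createLoopGuard(maxRepeats = 2) {
  const seen = new Map(); // call signature -> number of times seen so far
  return function shouldBlock(name, args) {
    const key = `${name}:${JSON.stringify(args)}`;
    const count = (seen.get(key) ?? 0) + 1;
    seen.set(key, count);
    return count > maxRepeats; // true once the same call exceeds the allowed repeats
  };
}

const shouldBlock = createLoopGuard(2);
console.log(shouldBlock('web_search', { q: 'gaia' }));  // false
console.log(shouldBlock('web_search', { q: 'gaia' }));  // false
console.log(shouldBlock('web_search', { q: 'gaia' }));  // true
console.log(shouldBlock('web_search', { q: 'other' })); // false
```

Because the signature relies on `JSON.stringify`, two argument objects with the same keys in a different order count as different calls; sorting keys before serializing would tighten the guard.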

Step 5 — Verify fix and store pattern

# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json

# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
  --value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"

Quick reference: tool catalogue check

node -e "
  const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
  const cat = createDefaultToolCatalogue({});
  console.log('Tools registered:', cat.definitions.map(t => t.name));
"

Expected: web_search, file_read, web_browse, image_describe, python_exec
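
Comparing the printed list against the expected names by eye is error-prone, so a small set difference can help. The sketch below takes a plain array of names (standing in for `cat.definitions.map(t => t.name)`); `missingTools` is a hypothetical helper.

```javascript
// Sketch: report which expected tools are missing from a list of registered names.
const EXPECTED_TOOLS = [
  'web_search', 'file_read', 'web_browse', 'image_describe', 'python_exec',
];

function missingTools(registeredNames) {
  const registered = new Set(registeredNames);
  return EXPECTED_TOOLS.filter((name) => !registered.has(name));
}

// Sample input standing in for the names printed by the catalogue check.
console.log(missingTools(['web_search', 'file_read', 'python_exec']));
// [ 'web_browse', 'image_describe' ]
```

A non-empty result points toward a TG (Tool Gap) failure for questions that need the missing tools.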

Pattern storage

After resolving a debugging session, store the finding:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "session-$(date +%Y%m%d-%H%M)" \
  --value "{\"task_id\":\"$TASK_ID\",\"failure_mode\":\"$CODE\",\"fix\":\"$FIX\",\"verified\":true}"
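
Hand-assembling JSON inside a shell string is fragile when the fix description itself contains quotes. One option is to build the payload with `JSON.stringify` and pass the result to `--value`. This is a sketch: `buildPatternValue` is a hypothetical helper, and the fallback values are placeholders for illustration only.

```javascript
// Sketch: build the --value payload with JSON.stringify so that quotes in the
// fix description are escaped correctly.
function buildPatternValue({ taskId, failureMode, fix, verified = true }) {
  return JSON.stringify({ task_id: taskId, failure_mode: failureMode, fix, verified });
}

// Reads the same variable names used in the shell command; the defaults after ??
// are placeholders, not real task data.
const value = buildPatternValue({
  taskId: process.env.TASK_ID ?? 'example-task',
  failureMode: process.env.CODE ?? 'EB',
  fix: process.env.FIX ?? 'relaxed "FINAL_ANSWER" pattern',
});
console.log(value);
```

The printed string is valid JSON regardless of what the description contains, which keeps stored patterns parseable later.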

Search for similar past failures:

npx @claude-flow/cli@latest memory search \
  --namespace gaia-debug-patterns \
  --query "extraction bug final answer regex"

More Skills from the same repository

harness-drift-from-history

ruvnet/ruflo

One-command drift detection. Composes audit-list + oia-audit + audit-trend into a single primitive — finds the most recent audit in `metaharness-audit` namespace, runs a fresh audit against the current repo, diffs them via ADR-152 §3.1 similarity, and alerts when structural distance crosses `--threshold`. Iter 53 of ADR-150 deep integration.

2026-06-16, 60.8k stars

harness-similarity

ruvnet/ruflo

ADR-152 — weighted similarity between two harness fingerprints (genome + score JSON). Returns overall score in [0,1] plus per-component breakdown (cosine over 9 numerics, categorical agreement over 4 enums, jaccard over agent_topology). Unblocks ADR-151 §3.2 Recommender, §3.3 Drift Detection, §3.5 Plugin Compat. Pure-TS, no `@metaharness/*` dep — preserves ADR-150's four architectural constraints.

2026-06-16, 60.8k stars

harness-oia-audit

ruvnet/ruflo

Composite Phase-2 audit worker (ADR-150). Bundles harness oia-manifest + threat-model + mcp-scan into one timestamped audit record stored in the `metaharness-audit` memory namespace. Designed for cron-scheduled drift detection.

2026-06-16, 60.8k stars

harness-genome

ruvnet/ruflo

7-section repo readiness report from `metaharness genome <path>`. Returns repo_type / agent_topology / risk_score / mcp_surface / test_confidence / publish_readiness. Pure-read; degrades gracefully (ADR-150).

2026-06-16, 60.8k stars

harness-mcp-scan

ruvnet/ruflo

Static security scan of a harness's declared MCP surface via `harness mcp-scan <path>`. Reads `.mcp/servers.json` + `.harness/claims.json`. Pure-read, no dispatch. Exits 1 on findings at or above `--fail-on` severity.

2026-06-16, 60.8k stars

harness-mint

ruvnet/ruflo

Scaffold a custom AI agent harness via `metaharness new <name> --template <id> --host <id>`. Defaults to DRY-RUN (no writes) unless --confirm is passed. Refuses to write to the calling repo root or anywhere inside it. Honors ADR-150 architectural constraint + ruflo's "destructive-action confirmation" pattern.

2026-06-16, 60.8k stars

name	gaia-debugging
description	Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix
argument-hint	<task_id> [--results=<path>]
allowed-tools	Bash Read mcp__claude-flow__memory_search mcp__claude-flow__memory_store mcp__claude-flow__agentdb_pattern_search mcp__claude-flow__agentdb_pattern_store
