بنقرة واحدة
gaia-submission
Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation
القائمة
Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation
استنادا إلى تصنيف SOC المهني
One-command drift detection. Composes audit-list + oia-audit + audit-trend into a single primitive — finds the most recent audit in `metaharness-audit` namespace, runs a fresh audit against the current repo, diffs them via ADR-152 §3.1 similarity, and alerts when structural distance crosses `--threshold`. Iter 53 of ADR-150 deep integration.
ADR-152 — weighted similarity between two harness fingerprints (genome + score JSON). Returns overall score in [0,1] plus per-component breakdown (cosine over 9 numerics, categorical agreement over 4 enums, jaccard over agent_topology). Unblocks ADR-151 §3.2 Recommender, §3.3 Drift Detection, §3.5 Plugin Compat. Pure-TS, no `@metaharness/*` dep — preserves ADR-150's four architectural constraints.
Composite Phase-2 audit worker (ADR-150). Bundles harness oia-manifest + threat-model + mcp-scan into one timestamped audit record stored in the `metaharness-audit` memory namespace. Designed for cron-scheduled drift detection.
7-section repo readiness report from `metaharness genome <path>`. Returns repo_type / agent_topology / risk_score / mcp_surface / test_confidence / publish_readiness. Pure-read; degrades gracefully (ADR-150).
Static security scan of a harness's declared MCP surface via `harness mcp-scan <path>`. Reads `.mcp/servers.json` + `.harness/claims.json`. Pure-read, no dispatch. Exits 1 on findings at or above `--fail-on` severity.
Scaffold a custom AI agent harness via `metaharness new <name> --template <id> --host <id>`. Defaults to DRY-RUN (no writes) unless --confirm is passed. Refuses to write to the calling repo root or anywhere inside it. Honors ADR-150 architectural constraint + ruflo's "destructive-action confirmation" pattern.
| name | gaia-submission |
| description | Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation |
| argument-hint | [level] [limit] [models] |
| allowed-tools | Bash mcp__claude-flow__memory_store mcp__claude-flow__memory_search mcp__claude-flow__memory_list mcp__claude-flow__hooks_post_task mcp__claude-flow__hooks_pre_task |
Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard.
When the user wants to:
Before starting, confirm these are available:
| Requirement | Check |
|---|---|
ANTHROPIC_API_KEY | echo ${ANTHROPIC_API_KEY:0:8}… (should show sk-ant-…) |
HF_TOKEN | echo ${HF_TOKEN:0:5}… (should show hf_…) |
| Node.js 20+ | node --version |
| CLI built | node v3/@claude-flow/cli/bin/cli.js --version |
# Run all pre-flight checks
/gaia validate
If any check fails, resolve it before continuing.
Ask the user for their configuration:
claude-sonnet-4-6)/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING
If projected cost > $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"
/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING
While running, progress is reported every 5 questions:
[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18
Store the run summary in memory for history tracking:
npx @claude-flow/cli@latest memory store \
--namespace gaia-runs \
--key "run-$(date +%Y%m%d-%H%M)" \
--value '{"level":$LEVEL,"model":"$MODEL","total":$TOTAL,"passed":$PASSED,"pass_rate":$RATE,"est_cost_usd":$COST}'
/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json
This produces:
submission-<date>-<sha>/
├── results.jsonl ← HAL-compatible, one JSON per line
├── trajectories.jsonl ← full agent traces
├── metadata.json ← harness info, model, tool catalogue
├── manifest.md.json ← Ed25519-signed witness
└── README.md ← human summary + leaderboard comparison
/gaia leaderboard --level=$LEVEL
/gaia history
Interpret the gap between ruflo's score and the leaderboard top-10.
Identify the primary failure mode (tool gap, reasoning miss, extraction bug)
using the /gaia-debugging skill if needed.
npx @claude-flow/cli@latest hooks post-task \
--task-id "gaia-submission-$(date +%Y%m%d)" \
--success true \
--train-neural true
Store any discovered patterns:
npx @claude-flow/cli@latest memory store \
--namespace gaia-patterns \
--key "submission-notes-$(date +%Y%m%d)" \
--value "Level $LEVEL, $MODEL: $NOTES"
This skill is intentionally structured to be benchmark-agnostic. The phase headers (validate → estimate → run → package → compare → learn) apply to SWE-bench, WebArena, and HumanEval with only phase 3-4 details changing.