تشغيل أي مهارة في Manus بنقرة واحدة

ابدأ الآن

gaia-submission

Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation

تشغيل في Manus

النجوم٥٩٬٨٨٠

التفرعات٦٬٩٣٩

آخر تحديث٢٨ مايو ٢٠٢٦ في ٠٣:٠٦

المصدر

ruvnet

ruvnet/ruflo

فتح مستودع GitHub عرض مستودعات المنشئ

أمر التثبيت

تنزيل

تشغيل في Manus

المهن ذات الصلةSOC

استنادا إلى تصنيف SOC المهني

مطوّرو البرمجياتمهن الحاسوب والرياضيات·SOC 15-1252

SKILL.md

readonly

name	gaia-submission
description	Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation
argument-hint	[level] [limit] [models]
allowed-tools	Bash mcp__claude-flow__memory_store mcp__claude-flow__memory_search mcp__claude-flow__memory_list mcp__claude-flow__hooks_post_task mcp__claude-flow__hooks_pre_task

GAIA Submission Skill

Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard.

When to use

When the user wants to:

Run a benchmark and submit results to the HAL leaderboard
Package an existing results file into a submission archive
Confirm their environment is ready for a benchmark run

Prerequisites

Before starting, confirm these are available:

Requirement	Check
`ANTHROPIC_API_KEY`	`echo ${ANTHROPIC_API_KEY:0:8}…` (should show `sk-ant-…`)
`HF_TOKEN`	`echo ${HF_TOKEN:0:5}…` (should show `hf_…`)
Node.js 20+	`node --version`
CLI built	`node v3/@claude-flow/cli/bin/cli.js --version`

Phase 1 — Validate environment

# Run all pre-flight checks
/gaia validate

If any check fails, resolve it before continuing.

Phase 2 — Estimate cost and confirm

Ask the user for their configuration:

Level (default: 1)
Question limit (default: 53 for a quick run, 165 for the full L1 set)
Models (default: claude-sonnet-4-6)
Self-consistency voting (default: 1; use 3 for L2/L3)

/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

If projected cost > $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"

Phase 3 — Run the benchmark

/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

While running, progress is reported every 5 questions:

[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18

Store the run summary in memory for history tracking:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-runs \
  --key "run-$(date +%Y%m%d-%H%M)" \
  --value '{"level":$LEVEL,"model":"$MODEL","total":$TOTAL,"passed":$PASSED,"pass_rate":$RATE,"est_cost_usd":$COST}'

Phase 4 — Package for submission

/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json

This produces:

submission-<date>-<sha>/
├── results.jsonl        ← HAL-compatible, one JSON per line
├── trajectories.jsonl   ← full agent traces
├── metadata.json        ← harness info, model, tool catalogue
├── manifest.md.json     ← Ed25519-signed witness
└── README.md            ← human summary + leaderboard comparison

Phase 5 — Compare and report

/gaia leaderboard --level=$LEVEL
/gaia history

Interpret the gap between ruflo's score and the leaderboard top-10. Identify the primary failure mode (tool gap, reasoning miss, extraction bug) using the /gaia-debugging skill if needed.

Phase 6 — Persist learnings

npx @claude-flow/cli@latest hooks post-task \
  --task-id "gaia-submission-$(date +%Y%m%d)" \
  --success true \
  --train-neural true

Store any discovered patterns:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "submission-notes-$(date +%Y%m%d)" \
  --value "Level $LEVEL, $MODEL: $NOTES"

Extensibility note

This skill is intentionally structured to be benchmark-agnostic. The phase headers (validate → estimate → run → package → compare → learn) apply to SWE-bench, WebArena, and HumanEval with only phase 3-4 details changing.

المزيد من هذا المستودع

نفس المستودع

harness-drift-from-history

ruvnet/ruflo

One-command drift detection. Composes audit-list + oia-audit + audit-trend into a single primitive — finds the most recent audit in `metaharness-audit` namespace, runs a fresh audit against the current repo, diffs them via ADR-152 §3.1 similarity, and alerts when structural distance crosses `--threshold`. Iter 53 of ADR-150 deep integration.

2026-06-1660.8k

harness-similarity

ruvnet/ruflo

ADR-152 — weighted similarity between two harness fingerprints (genome + score JSON). Returns overall score in [0,1] plus per-component breakdown (cosine over 9 numerics, categorical agreement over 4 enums, jaccard over agent_topology). Unblocks ADR-151 §3.2 Recommender, §3.3 Drift Detection, §3.5 Plugin Compat. Pure-TS, no `@metaharness/*` dep — preserves ADR-150's four architectural constraints.

2026-06-1660.8k

harness-oia-audit

ruvnet/ruflo

Composite Phase-2 audit worker (ADR-150). Bundles harness oia-manifest + threat-model + mcp-scan into one timestamped audit record stored in the `metaharness-audit` memory namespace. Designed for cron-scheduled drift detection.

2026-06-1660.8k

harness-genome

ruvnet/ruflo

7-section repo readiness report from `metaharness genome <path>`. Returns repo_type / agent_topology / risk_score / mcp_surface / test_confidence / publish_readiness. Pure-read; degrades gracefully (ADR-150).

2026-06-1660.8k

harness-mcp-scan

ruvnet/ruflo

Static security scan of a harness's declared MCP surface via `harness mcp-scan <path>`. Reads `.mcp/servers.json` + `.harness/claims.json`. Pure-read, no dispatch. Exits 1 on findings at or above `--fail-on` severity.

2026-06-1660.8k

harness-mint

ruvnet/ruflo

Scaffold a custom AI agent harness via `metaharness new <name> --template <id> --host <id>`. Defaults to DRY-RUN (no writes) unless --confirm is passed. Refuses to write to the calling repo root or anywhere inside it. Honors ADR-150 architectural constraint + ruflo's "destructive-action confirmation" pattern.

2026-06-1660.8k

name	gaia-submission
description	Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation
argument-hint	[level] [limit] [models]
allowed-tools	Bash mcp__claude-flow__memory_store mcp__claude-flow__memory_search mcp__claude-flow__memory_list mcp__claude-flow__hooks_post_task mcp__claude-flow__hooks_pre_task

GAIA Submission Skill

Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard.

When to use

When the user wants to:

Run a benchmark and submit results to the HAL leaderboard
Package an existing results file into a submission archive
Confirm their environment is ready for a benchmark run

Prerequisites

Before starting, confirm these are available:

Requirement	Check
`ANTHROPIC_API_KEY`	`echo ${ANTHROPIC_API_KEY:0:8}…` (should show `sk-ant-…`)
`HF_TOKEN`	`echo ${HF_TOKEN:0:5}…` (should show `hf_…`)
Node.js 20+	`node --version`
CLI built	`node v3/@claude-flow/cli/bin/cli.js --version`

Phase 1 — Validate environment

# Run all pre-flight checks
/gaia validate

If any check fails, resolve it before continuing.

Phase 2 — Estimate cost and confirm

Ask the user for their configuration:

Level (default: 1)
Question limit (default: 53 for a quick run, 165 for the full L1 set)
Models (default: claude-sonnet-4-6)
Self-consistency voting (default: 1; use 3 for L2/L3)

/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

If projected cost > $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"

Phase 3 — Run the benchmark

/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

While running, progress is reported every 5 questions:

[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18

Store the run summary in memory for history tracking:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-runs \
  --key "run-$(date +%Y%m%d-%H%M)" \
  --value '{"level":$LEVEL,"model":"$MODEL","total":$TOTAL,"passed":$PASSED,"pass_rate":$RATE,"est_cost_usd":$COST}'

Phase 4 — Package for submission

/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json

This produces:

submission-<date>-<sha>/
├── results.jsonl        ← HAL-compatible, one JSON per line
├── trajectories.jsonl   ← full agent traces
├── metadata.json        ← harness info, model, tool catalogue
├── manifest.md.json     ← Ed25519-signed witness
└── README.md            ← human summary + leaderboard comparison

Phase 5 — Compare and report

/gaia leaderboard --level=$LEVEL
/gaia history

Interpret the gap between ruflo's score and the leaderboard top-10. Identify the primary failure mode (tool gap, reasoning miss, extraction bug) using the /gaia-debugging skill if needed.

Phase 6 — Persist learnings

npx @claude-flow/cli@latest hooks post-task \
  --task-id "gaia-submission-$(date +%Y%m%d)" \
  --success true \
  --train-neural true

Store any discovered patterns:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "submission-notes-$(date +%Y%m%d)" \
  --value "Level $LEVEL, $MODEL: $NOTES"