Run any Skill in Manus with one click

Get Started

$pwd:

gaia-submission

Name: Gaia Submission
Author: ruvnet

// Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation

Run Skill in Manus

$ git log --oneline --stat

stars:56,718

forks:6,459

updated:May 28, 2026 at 03:06

SKILL.md

readonly

name	gaia-submission
description	Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation
argument-hint	[level] [limit] [models]
allowed-tools	Bash mcp__claude-flow__memory_store mcp__claude-flow__memory_search mcp__claude-flow__memory_list mcp__claude-flow__hooks_post_task mcp__claude-flow__hooks_pre_task

GAIA Submission Skill

Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard.

When to use

When the user wants to:

Run a benchmark and submit results to the HAL leaderboard
Package an existing results file into a submission archive
Confirm their environment is ready for a benchmark run

Prerequisites

Before starting, confirm these are available:

Requirement	Check
`ANTHROPIC_API_KEY`	`echo ${ANTHROPIC_API_KEY:0:8}…` (should show `sk-ant-…`)
`HF_TOKEN`	`echo ${HF_TOKEN:0:5}…` (should show `hf_…`)
Node.js 20+	`node --version`
CLI built	`node v3/@claude-flow/cli/bin/cli.js --version`

Phase 1 — Validate environment

# Run all pre-flight checks
/gaia validate

If any check fails, resolve it before continuing.

Phase 2 — Estimate cost and confirm

Ask the user for their configuration:

Level (default: 1)
Question limit (default: 53 for a quick run, 165 for the full L1 set)
Models (default: claude-sonnet-4-6)
Self-consistency voting (default: 1; use 3 for L2/L3)

/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

If projected cost > $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"

Phase 3 — Run the benchmark

/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

While running, progress is reported every 5 questions:

[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18

Store the run summary in memory for history tracking:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-runs \
  --key "run-$(date +%Y%m%d-%H%M)" \
  --value '{"level":$LEVEL,"model":"$MODEL","total":$TOTAL,"passed":$PASSED,"pass_rate":$RATE,"est_cost_usd":$COST}'

Phase 4 — Package for submission

/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json

This produces:

submission-<date>-<sha>/
├── results.jsonl        ← HAL-compatible, one JSON per line
├── trajectories.jsonl   ← full agent traces
├── metadata.json        ← harness info, model, tool catalogue
├── manifest.md.json     ← Ed25519-signed witness
└── README.md            ← human summary + leaderboard comparison

Phase 5 — Compare and report

/gaia leaderboard --level=$LEVEL
/gaia history

Interpret the gap between ruflo's score and the leaderboard top-10. Identify the primary failure mode (tool gap, reasoning miss, extraction bug) using the /gaia-debugging skill if needed.

Phase 6 — Persist learnings

npx @claude-flow/cli@latest hooks post-task \
  --task-id "gaia-submission-$(date +%Y%m%d)" \
  --success true \
  --train-neural true

Store any discovered patterns:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "submission-notes-$(date +%Y%m%d)" \
  --value "Level $LEVEL, $MODEL: $NOTES"

Extensibility note

This skill is intentionally structured to be benchmark-agnostic. The phase headers (validate → estimate → run → package → compare → learn) apply to SWE-bench, WebArena, and HumanEval with only phase 3-4 details changing.

related-skills.json

same repository

workflow-create.md

from "ruvnet/ruflo"

Author a workflow — either an MCP workflow template (persisted, lifecycle) or a native .claude/workflows/*.js orchestration script (agent/parallel/pipeline fan-out)

2026-05-2956.7k

workflow-run.md

from "ruvnet/ruflo"

Run a workflow — drive an MCP workflow lifecycle (execute/pause/resume/cancel) or invoke + resume a native .claude/workflows/*.js orchestration via the Workflow tool

2026-05-2956.7k

gaia-architecture-comparison.md

from "ruvnet/ruflo"

Side-by-side comparison of ruflo vs HAL vs other GAIA harnesses — capability gaps, design decisions, and improvement roadmap

2026-05-2856.7k

gaia-debugging.md

from "ruvnet/ruflo"

Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix

2026-05-2856.7k

create-plugin.md

from "ruvnet/ruflo"

Scaffold a new Claude Code plugin with proper directory structure, plugin.json, skills, commands, and agents

2026-05-2556.7k

github-project-management.md

from "ruvnet/ruflo"

Comprehensive GitHub project management with swarm-coordinated issue tracking, project board automation, and sprint planning

2026-05-2156.7k

package.json

"author": "ruvnet"

"repository": "ruvnet/ruflo"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

name	gaia-submission
description	Walk through a complete GAIA benchmark→submit flow — from key resolution through HAL-compatible package generation
argument-hint	[level] [limit] [models]
allowed-tools	Bash mcp__claude-flow__memory_store mcp__claude-flow__memory_search mcp__claude-flow__memory_list mcp__claude-flow__hooks_post_task mcp__claude-flow__hooks_pre_task

GAIA Submission Skill

Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard.

When to use

When the user wants to:

Run a benchmark and submit results to the HAL leaderboard
Package an existing results file into a submission archive
Confirm their environment is ready for a benchmark run

Prerequisites

Before starting, confirm these are available:

Requirement	Check
`ANTHROPIC_API_KEY`	`echo ${ANTHROPIC_API_KEY:0:8}…` (should show `sk-ant-…`)
`HF_TOKEN`	`echo ${HF_TOKEN:0:5}…` (should show `hf_…`)
Node.js 20+	`node --version`
CLI built	`node v3/@claude-flow/cli/bin/cli.js --version`

Phase 1 — Validate environment

# Run all pre-flight checks
/gaia validate

If any check fails, resolve it before continuing.

Phase 2 — Estimate cost and confirm

Ask the user for their configuration:

Level (default: 1)
Question limit (default: 53 for a quick run, 165 for the full L1 set)
Models (default: claude-sonnet-4-6)
Self-consistency voting (default: 1; use 3 for L2/L3)

/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

If projected cost > $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"

Phase 3 — Run the benchmark

/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

While running, progress is reported every 5 questions:

[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18

Store the run summary in memory for history tracking:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-runs \
  --key "run-$(date +%Y%m%d-%H%M)" \
  --value '{"level":$LEVEL,"model":"$MODEL","total":$TOTAL,"passed":$PASSED,"pass_rate":$RATE,"est_cost_usd":$COST}'

Phase 4 — Package for submission

/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json

This produces:

submission-<date>-<sha>/
├── results.jsonl        ← HAL-compatible, one JSON per line
├── trajectories.jsonl   ← full agent traces
├── metadata.json        ← harness info, model, tool catalogue
├── manifest.md.json     ← Ed25519-signed witness
└── README.md            ← human summary + leaderboard comparison

Phase 5 — Compare and report

/gaia leaderboard --level=$LEVEL
/gaia history

Interpret the gap between ruflo's score and the leaderboard top-10. Identify the primary failure mode (tool gap, reasoning miss, extraction bug) using the /gaia-debugging skill if needed.

Phase 6 — Persist learnings

npx @claude-flow/cli@latest hooks post-task \
  --task-id "gaia-submission-$(date +%Y%m%d)" \
  --success true \
  --train-neural true

Store any discovered patterns:

npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "submission-notes-$(date +%Y%m%d)" \
  --value "Level $LEVEL, $MODEL: $NOTES"

gaia-submission

GAIA Submission Skill

When to use

Prerequisites

Phase 1 — Validate environment

Phase 2 — Estimate cost and confirm

Phase 3 — Run the benchmark

Phase 4 — Package for submission

Phase 5 — Compare and report

Phase 6 — Persist learnings

Extensibility note

More from this repository

More from this repository

GAIA Submission Skill

When to use

Prerequisites

Phase 1 — Validate environment

Phase 2 — Estimate cost and confirm

Phase 3 — Run the benchmark

Phase 4 — Package for submission

Phase 5 — Compare and report

Phase 6 — Persist learnings

Extensibility note