Run any Skill in Manus with one click

$pwd:

manage-evals

Name: Manage Evals
Author: OpenHands

// This skill should be used when the user asks to "trigger an eval", "run evaluation", "run swebench", "run gaia", "run benchmark", "compare eval runs", "compare evaluation results", "check eval regression", "compare benchmark results", "what changed in the eval", "diff eval runs", or mentions triggering, comparing, or reporting on SWE-bench, GAIA, or other benchmark evaluation results. Provides workflow for triggering evaluations on different benchmarks, finding and comparing runs, and reporting performance differences.

Run Skill in Manus

$ git log --oneline --stat

stars:744

forks:261

updated:May 11, 2026 at 19:08

File Explorer

3 files

SKILL.md

readonly

related-skills.json

same repository

sdk-release.md

from "OpenHands/software-agent-sdk"

This skill should be used when the user asks to "release the SDK", "prepare a release", "publish a new version", "cut a release", "do a release", or mentions the SDK release checklist or release process. Guides through the full software-agent-sdk release workflow from version bump to PyPI publication, emphasizing human checkpoints.

2026-05-04744

cross-repo-testing.md

from "OpenHands/software-agent-sdk"

This skill should be used when the user asks to "test a saas cross-repo feature", "deploy a feature branch to staging", "test SDK against OH Cloud branch", "e2e test a cloud workspace feature", "test secrets saas inheritance", or when changes span the SDK and OpenHands enterprise and need end-to-end validation against a staging deployment.

2026-04-05744

feature-release-rollout.md

from "OpenHands/software-agent-sdk"

This skill should be used when the user asks to "rollout a feature", "complete feature release", "propagate SDK feature", "track feature support", "what's missing for feature X", or mentions checking CLI/GUI/docs/blog support for SDK features. Guides agents through the multi-repository feature release workflow from SDK to docs to marketing.

2026-02-14744

debug-test-examples-workflow.md

from "OpenHands/software-agent-sdk"

Guide for debugging failing example tests in the `test-examples` labeled workflow. Use this skill when investigating CI failures in the run-examples.yml workflow, when example scripts fail to run correctly, when needing to isolate specific test failures, or when analyzing workflow logs and failure patterns.

2026-02-11744

code-style-guide.md

from "OpenHands/software-agent-sdk"

Project coding standards and style guidelines. Always follow these conventions when writing or reviewing code.

2026-01-08744

rot13-encryption.md

from "OpenHands/software-agent-sdk"

This skill helps encrypt and decrypt messages using ROT13 cipher. Use when the user asks to "encrypt" or "decrypt" a message.

2026-01-07744

package.json

"author": "OpenHands"

"repository": "OpenHands/software-agent-sdk"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations15-1253L4

name

manage-evals

description

This skill should be used when the user asks to "trigger an eval", "run evaluation", "run swebench", "run gaia", "run benchmark", "compare eval runs", "compare evaluation results", "check eval regression", "compare benchmark results", "what changed in the eval", "diff eval runs", or mentions triggering, comparing, or reporting on SWE-bench, GAIA, or other benchmark evaluation results. Provides workflow for triggering evaluations on different benchmarks, finding and comparing runs, and reporting performance differences.

Managing Evaluations

Overview

OpenHands evaluations produce results stored on a CDN at https://results.eval.all-hands.dev/. Each run is identified by a path: {benchmark}/{model_slug}/{github_run_id}/. This skill enables triggering evaluation runs, comparing results between runs, and posting performance reports as GitHub PR comments.

Quick Start

Trigger an Evaluation

python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
    --sdk-ref <BRANCH_OR_TAG> --benchmark swebench --eval-limit 50

Compare Runs

python .agents/skills/manage-evals/scripts/manage_evals.py compare \
    "<benchmark>/<model_slug>/<run_id>/" \
    --auto-baseline

Compare and Post to PR

python .agents/skills/manage-evals/scripts/manage_evals.py compare \
    "<benchmark>/<model_slug>/<run_id>/" \
    --auto-baseline \
    --post-comment --pr <PR_NUMBER> --repo OpenHands/software-agent-sdk

Triggering Evaluations

Using the Script

# SWE-bench (default) on a PR branch
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
    --sdk-ref my-feature-branch --eval-limit 50

# GAIA benchmark
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
    --sdk-ref main --benchmark gaia --eval-limit 50

# With a specific model
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
    --sdk-ref v1.16.0 --benchmark swebench --model-ids gemini-3-flash --eval-limit 50

# Multiple benchmarks (run the command multiple times)
for bench in swebench gaia; do
    python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
        --sdk-ref main --benchmark "$bench" --eval-limit 50 --reason "Multi-benchmark eval"
done

Available Benchmarks

Benchmark	Description
`swebench`	SWE-bench (default) — software engineering tasks
`swebenchpro`	SWE-Bench Pro — harder software engineering tasks
`gaia`	GAIA — general AI assistant tasks
`swtbench`	SWT-bench — software testing tasks
`commit0`	Commit0 — commit generation tasks
`swebenchmultimodal`	SWE-bench Multimodal — tasks with images
`terminalbench`	TerminalBench — terminal interaction tasks

Trigger Options

Option	Default	Description
`--sdk-ref`	(required)	Branch, tag, or commit SHA to evaluate
`--benchmark`	`swebench`	Benchmark to run
`--eval-limit`	`50`	Number of instances to evaluate
`--model-ids`	(first in config)	Comma-separated model IDs from `resolve_model_config.py`
`--tool-preset`	`default`	Tool preset: `default`, `gemini`, `gpt5`, `planning`
`--agent-type`	`default`	Agent type: `default`, `acp-claude`, `acp-codex`
`--instance-ids`		Specific instance IDs to evaluate (overrides eval-limit)
`--reason`		Human-readable reason (shown in notifications)
`--benchmarks-branch`	`main`	Branch of the benchmarks repo
`--eval-branch`	`main`	Branch of the evaluation repo

Via PR Labels (Alternative)

Adding a label to a PR also triggers evaluations:

run-eval-1 — 1 instance (quick sanity check)
run-eval-50 — 50 instances (standard comparison)
run-eval-200 — 200 instances
run-eval-500 — 500 instances (full benchmark)

Comparing Evaluation Runs

Step 1: Find the Current PR's Eval Run

Eval runs are triggered by adding labels like run-eval-50 to a PR. The all-hands-bot posts a comment with results when complete.

Option A — From bot comments on the PR:

gh api repos/OpenHands/software-agent-sdk/issues/<PR_NUMBER>/comments \
    --jq '.[] | select(.user.login == "all-hands-bot") | .body' \
    | grep -o 'Evaluation:.*' | head -1

The evaluation name follows the format {github_run_id}-{model_slug_short} (e.g., 23775164157-claude-son). Extract the github_run_id from this.

Option B — From the "Evaluation Triggered" bot comment:

gh api repos/OpenHands/software-agent-sdk/issues/<PR_NUMBER>/comments \
    --jq '.[] | select(.body | test("Evaluation Triggered")) | .body'

This contains the SDK commit SHA. Cross-reference with daily metadata to find the run ID.

Option C — From daily metadata:

curl -s "https://results.eval.all-hands.dev/metadata/$(date -u +%Y-%m-%d).txt"

Each line is a run path. Match by benchmark and model to find the run.

Step 2: Identify the Run Path Components

A run path has three components:

benchmark: swebench, swebenchpro, gaia, swtbench, commit0, swebenchmultimodal, terminalbench
model_slug: Derived from model name with /:@. replaced by - (e.g., litellm_proxy-claude-sonnet-4-5-20250929)
run_id: The GitHub Actions workflow run ID from the OpenHands/evaluation repo

Step 3: Verify Results Exist

curl -sI "https://results.eval.all-hands.dev/<benchmark>/<model_slug>/<run_id>/output.report.json" | head -1

A 200 status confirms the run completed and results are available.

Step 4: Find a Baseline for Comparison

Automatic: The comparison script's --auto-baseline flag scans metadata files backward up to 14 days to find the most recent completed run with the same benchmark and model.

Manual: Inspect metadata files or other PR bot comments to identify a specific run:

# Check today's runs
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u +%Y-%m-%d).txt" | grep "swebench/litellm_proxy-claude"

# Check yesterday's runs
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u -d yesterday +%Y-%m-%d).txt" | grep "swebench/litellm_proxy-claude"

Step 5: Run the Comparison

python .agents/skills/manage-evals/scripts/manage_evals.py compare \
    "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/" \
    --baseline "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23773892085/"

Or with auto-baseline and PR comment posting:

python .agents/skills/manage-evals/scripts/manage_evals.py compare \
    "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/" \
    --auto-baseline \
    --post-comment --pr 2334 --repo OpenHands/software-agent-sdk

Available Data Per Run

Each run stores files at https://results.eval.all-hands.dev/{run_path}/:

File	Description
`metadata/params.json`	Run parameters: SDK commit, PR number, model, eval_limit, triggered_by
`output.report.json`	Aggregated results: resolved/submitted/total counts and instance IDs
`cost_report.jsonl`	Per-instance cost data
`results.tar.gz`	Full archive with all outputs

Dashboard

The eval monitor dashboard provides a visual view of runs:

https://openhands-eval-monitor.vercel.app/?run={benchmark}/{model_slug}/{run_id}/

Interpreting Results

Success rate = resolved / min(eval_limit, total_instances)
A 50-instance sample has natural variance of ±2-4 resolved instances between runs
Focus on instance-level changes (gained/lost) to understand regressions vs. noise
If the same set of instances is resolved, the difference is likely noise

Additional Resources

Reference Files

references/eval-infrastructure.md — Detailed documentation on the evaluation infrastructure, GCS paths, metadata format, and workflow triggers

Scripts

scripts/manage_evals.py — Standalone comparison script with auto-baseline detection and GitHub comment posting

manage-evals

More from this repository

Managing Evaluations

Overview

Quick Start

Trigger an Evaluation

Compare Runs

Compare and Post to PR

Triggering Evaluations

Using the Script

Available Benchmarks

Trigger Options

Via PR Labels (Alternative)

Comparing Evaluation Runs

Step 1: Find the Current PR's Eval Run

Step 2: Identify the Run Path Components

Step 3: Verify Results Exist

Step 4: Find a Baseline for Comparison

Step 5: Run the Comparison

Available Data Per Run

Dashboard

Interpreting Results

Additional Resources

Reference Files

Scripts

Managing Evaluations

Overview

Quick Start

Trigger an Evaluation

Compare Runs

Compare and Post to PR

Triggering Evaluations

Using the Script

Available Benchmarks

Trigger Options

Via PR Labels (Alternative)

Comparing Evaluation Runs

Step 1: Find the Current PR's Eval Run

Step 2: Identify the Run Path Components

Step 3: Verify Results Exist

Step 4: Find a Baseline for Comparison

Step 5: Run the Comparison

Available Data Per Run

Dashboard

Interpreting Results

Additional Resources

Reference Files

Scripts

More from this repository