manage-evals

// This skill should be used when the user asks to "trigger an eval", "run evaluation", "run swebench", "run gaia", "run benchmark", "compare eval runs", "compare evaluation results", "check eval regression", "compare benchmark results", "what changed in the eval", "diff eval runs", or mentions triggering, comparing, or reporting on SWE-bench, GAIA, or other benchmark evaluation results. It provides a workflow for triggering evaluations on different benchmarks, finding and comparing runs, and reporting performance differences.

stars: 744
forks: 261
updated: May 11, 2026 at 19:08
SKILL.md