
manage-evals

This skill should be used when the user asks to "trigger an eval", "run evaluation", "run swebench", "run gaia", "run benchmark", "compare eval runs", "compare evaluation results", "check eval regression", "compare benchmark results", "what changed in the eval", "diff eval runs", or mentions triggering, comparing, or reporting on SWE-bench, GAIA, or other benchmark evaluation results. Provides a workflow for triggering evaluations on different benchmarks, finding and comparing runs, and reporting performance differences.
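The description covers comparing eval runs and checking for regressions, but this page does not show how the skill actually does it. The following is a minimal sketch under stated assumptions: each run has already been reduced to a hypothetical mapping from instance ID to a pass/fail flag (similar in spirit to SWE-bench's per-instance "resolved" status), and `diff_runs` and `RunDiff` are illustrative names, not part of the skill. It reports regressions (passed before, failed now), fixes, and pass rates computed only over instances present in both runs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunDiff:
    """Hypothetical summary of how a candidate run differs from a baseline run."""
    baseline_rate: float          # pass rate of the baseline over shared instances
    candidate_rate: float         # pass rate of the candidate over shared instances
    regressions: list[str]        # passed in baseline, failed in candidate
    fixes: list[str]              # failed in baseline, passed in candidate
    only_in_baseline: list[str]   # instances missing from the candidate run
    only_in_candidate: list[str]  # instances missing from the baseline run


def diff_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> RunDiff:
    """Compare two runs given as {instance_id: passed} mappings (assumed format)."""
    shared = baseline.keys() & candidate.keys()

    def rate(run: dict[str, bool]) -> float:
        # Score only shared instances so both rates cover the same task set.
        return sum(run[i] for i in shared) / len(shared) if shared else 0.0

    return RunDiff(
        baseline_rate=rate(baseline),
        candidate_rate=rate(candidate),
        regressions=sorted(i for i in shared if baseline[i] and not candidate[i]),
        fixes=sorted(i for i in shared if not baseline[i] and candidate[i]),
        only_in_baseline=sorted(baseline.keys() - candidate.keys()),
        only_in_candidate=sorted(candidate.keys() - baseline.keys()),
    )


if __name__ == "__main__":
    # Illustrative data only; real instance IDs and results would come from the runs.
    base = {"t1": True, "t2": True, "t3": False, "t4": False}
    cand = {"t1": True, "t2": False, "t3": True, "t4": True, "t5": True}
    d = diff_runs(base, cand)
    print(f"pass rate {d.baseline_rate:.2%} -> {d.candidate_rate:.2%}")
    print("regressions:", d.regressions, "fixes:", d.fixes)
```

Restricting the rates to shared instances keeps a partially finished or differently scoped run from looking better or worse just because it covers a different task set; the unmatched instances are listed separately so they are not silently dropped.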

stars: 744
forks: 261
updated: May 11, 2026, 19:08
3 files
SKILL.md