manage-evals

// This skill should be used when the user asks to "trigger an eval", "run evaluation", "run swebench", "run gaia", "run benchmark", "compare eval runs", "compare evaluation results", "check eval regression", "compare benchmark results", "what changed in the eval", "diff eval runs", or mentions triggering, comparing, or reporting on SWE-bench, GAIA, or other benchmark evaluation results. Provides workflow for triggering evaluations on different benchmarks, finding and comparing runs, and reporting performance differences.

stars: 744
forks: 261
updated: May 11, 2026 at 19:08
File explorer
3 files
SKILL.md
readonly