$pwd:

evaluating-llms-harness

// Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and API backends.

$ git log --oneline --stat
stars:1403
forks:100
updated: May 7, 2026, 02:44
File explorer
5 files
SKILL.md
readonly