Skip to main content
Execute qualquer Skill no Manus
com um clique

evalscope

// Translates natural language requests into evalscope CLI commands. Core capabilities: (1) Model accuracy evaluation (eval) — runs 156+ benchmarks (Math, Coding, Chinese, Multimodal, Agent, etc.) against local checkpoints or OpenAI-compatible / Anthropic API endpoints; (2) Performance stress testing (perf) — measures TTFT, TPOT, throughput, and latency under configurable concurrency gradients or SLA auto-tuning; (3) Benchmark discovery — lists and filters benchmarks by capability tag, retrieves full metadata and sample examples; (4) Result visualization — launches a Web dashboard to compare and explore evaluation outputs. Trigger this skill whenever the user mentions: evaluate / benchmark / score a model, throughput / latency / QPS / stress test, find benchmarks by tag or capability, or view / compare evaluation results.

$ git log --oneline --stat
stars:2.866
forks:342
updated:7 de maio de 2026 às 09:12
Explorador de arquivos
4 arquivos
SKILL.md
readonly