agent-evaluation

Stars: 453
Forks: 139
Updated: February 11, 2026 at 08:53

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Installation

Install with Codex or Claude: copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

File Explorer
7 files
SKILL.md