auto-arena

// Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
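The final step described above, turning pairwise judge verdicts into win-rate rankings, can be illustrated with a minimal sketch. This is not the skill's actual implementation: the function name `win_rates`, the `(model_a, model_b, winner)` tuple layout, and the choice to count a tie as half a win for each side are all assumptions made for illustration.

```python
from collections import defaultdict


def win_rates(judgments):
    """Rank models by win rate from pairwise verdicts.

    `judgments` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` is model_a, model_b, or None for a tie. This layout is a
    hypothetical one chosen for this sketch.
    """
    wins = defaultdict(float)   # accumulated wins (ties add 0.5 to each side)
    games = defaultdict(int)    # number of comparisons each model took part in

    for model_a, model_b, winner in judgments:
        if winner not in (model_a, model_b, None):
            raise ValueError(f"winner {winner!r} is not one of the compared models")
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:
            # Assumption: a tie is split evenly between the two models.
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0

    # Sort by win rate, highest first; ties in rate keep first-seen order.
    return sorted(
        ((model, wins[model] / games[model]) for model in games),
        key=lambda item: item[1],
        reverse=True,
    )


if __name__ == "__main__":
    verdicts = [
        ("model-a", "model-b", "model-a"),
        ("model-a", "model-c", "model-a"),
        ("model-b", "model-c", None),
    ]
    for model, rate in win_rates(verdicts):
        print(f"{model}: {rate:.2f}")
```

A plain win rate is the simplest aggregate; when models do not all face each other equally often, a rating model such as Bradley-Terry or Elo may give a fairer ranking.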

stars:619
forks:52
updated: March 5, 2026, 02:46
SKILL.md
readonly