Skip to main content
Ejecuta cualquier Skill en Manus
con un clic
$pwd:

ag2-evaluation

// Evaluate, test, and track an AG2 beta Agent offline. Build a Suite of tasks, run the agent with run_agent, and grade answers with prebuilt scorers (final_answer_matches, tool_called, no_tool_errors, token_budget) or a custom @scorer — including the agent_judge LLM judge. Read the RunResult scorecard (pass_rate, score_stats, value_counts), gate it in CI with deterministic TestConfig cassettes, persist to store_dir and diff runs to catch regressions, and grade existing traces with evaluate_traces. Use when the user wants to evaluate, test, grade, or benchmark an agent, build a CI or regression gate, or score correctness, tool use, cost, or quality. To compare builds head-to-head or on a leaderboard, see ag2-eval-comparison.

$ git log --oneline --stat
stars:3
forks:1
updated:28 de mayo de 2026, 01:00
SKILL.md
readonly