| Field | Value |
|---|---|
| name | run-eval |
| description | How to run the Agent Eval Benchmark dataset with `ag2 test eval` |
| license | Apache-2.0 |
The `agent-eval-bench` dataset contains evaluation cases for testing AG2 agent capabilities. It is designed for use with the `ag2 test eval` command.
Run the sample split (included inline, no download required):

    ag2 test eval --dataset agent-eval-bench

This runs the sample split by default, which contains 10 cases across four categories: tool-use, reasoning, coordination, and safety.
The full benchmark (12MB) is hosted remotely and will be downloaded on first use:

    ag2 test eval --dataset agent-eval-bench --split full-bench

The file is cached at `~/.ag2/cache/datasets/agent-eval-bench/` after the first download.
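
To see whether the full split has already been downloaded, the cache location above can be inspected directly. This is a minimal sketch: the helper names `dataset_cache_dir` and `is_cached` are hypothetical and not part of AG2, and the code assumes only the directory path stated above, not any particular file name inside it.

```python
from pathlib import Path
from typing import Optional


def dataset_cache_dir(dataset: str, home: Optional[Path] = None) -> Path:
    """Build the cache directory path described above (hypothetical helper)."""
    base = home if home is not None else Path.home()
    return base / ".ag2" / "cache" / "datasets" / dataset


def is_cached(dataset: str, home: Optional[Path] = None) -> bool:
    """Return True if the cache directory exists and holds at least one entry."""
    cache_dir = dataset_cache_dir(dataset, home)
    return cache_dir.is_dir() and any(cache_dir.iterdir())


if __name__ == "__main__":
    print(dataset_cache_dir("agent-eval-bench"))
    print("cached:", is_cached("agent-eval-bench"))
```

The optional `home` argument exists only so the check can be pointed at a temporary directory when testing.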
Run only the cases that match a given category or difficulty:

    ag2 test eval --dataset agent-eval-bench --filter category=tool-use
    ag2 test eval --dataset agent-eval-bench --filter category=safety
    ag2 test eval --dataset agent-eval-bench --filter difficulty=hard

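
The exact filter semantics are not spelled out here. Assuming `--filter key=value` keeps the cases whose field `key` equals `value` exactly, the selection could be sketched as below. `parse_filter` and `select_cases` are hypothetical names used for illustration, not AG2 functions.

```python
from __future__ import annotations


def parse_filter(expr: str) -> tuple[str, str]:
    """Split a 'key=value' expression into its two parts."""
    key, sep, value = expr.partition("=")
    if not sep or not key or not value:
        raise ValueError(f"expected key=value, got {expr!r}")
    return key, value


def select_cases(cases: list[dict], expr: str) -> list[dict]:
    """Keep cases whose field equals the filter value (assumed exact match)."""
    key, value = parse_filter(expr)
    return [case for case in cases if case.get(key) == value]


if __name__ == "__main__":
    cases = [
        {"name": "a", "category": "tool-use", "difficulty": "easy"},
        {"name": "b", "category": "safety", "difficulty": "hard"},
    ]
    print([c["name"] for c in select_cases(cases, "difficulty=hard")])
```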
Each test case defines assertions that the eval runner checks automatically:
| Type | Description |
|---|---|
| `contains` | Response must contain the specified text |
| `not_contains` | Response must not contain the specified text |
| `tool_called` | The agent must have called the specified tool |
| `matches` | Response must match a regex pattern |
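
To make the four assertion types concrete, here is a rough sketch of how each one might be evaluated. It is not the eval runner's actual implementation: `check_assertion` is a hypothetical function, text comparison is assumed to be case-sensitive, and `matches` is assumed to succeed when the pattern is found anywhere in the response (`re.search`).

```python
import re


def check_assertion(assertion: dict, response: str, tools_called: list) -> bool:
    """Evaluate one assertion against a response and the list of tools used."""
    kind = assertion["type"]
    value = assertion["value"]
    if kind == "contains":
        return value in response
    if kind == "not_contains":
        return value not in response
    if kind == "tool_called":
        return value in tools_called
    if kind == "matches":
        # Assumption: a match anywhere in the response counts.
        return re.search(value, response) is not None
    raise ValueError(f"unknown assertion type: {kind!r}")


if __name__ == "__main__":
    response = "File has 42 lines"
    print(check_assertion({"type": "matches", "value": r"\d+ lines"}, response, ["Read"]))
```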
Add new cases to `data/sample.yaml` following this format:

    - name: "my-custom-test"
      input: "The prompt sent to the agent"
      category: "tool-use"
      difficulty: "medium"
      assertions:
        - type: contains
          value: "expected output"
        - type: tool_called
          value: "Read"

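
Before adding a case, it can help to check its shape. The sketch below validates a case that has already been loaded into a Python dict mirroring the format above. `validate_case` is a hypothetical helper, and treating all five fields as required is an assumption drawn from the example, not a documented rule.

```python
from __future__ import annotations

# Assumption: every field shown in the example case is required.
REQUIRED_FIELDS = ("name", "input", "category", "difficulty", "assertions")
# The four assertion types listed in this document.
KNOWN_ASSERTION_TYPES = {"contains", "not_contains", "tool_called", "matches"}


def validate_case(case: dict) -> list[str]:
    """Return a list of problems found in a case; an empty list means it looks valid."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in case]
    for i, assertion in enumerate(case.get("assertions", [])):
        if assertion.get("type") not in KNOWN_ASSERTION_TYPES:
            problems.append(f"assertion {i}: unknown type {assertion.get('type')!r}")
        if "value" not in assertion:
            problems.append(f"assertion {i}: missing value")
    return problems


if __name__ == "__main__":
    case = {
        "name": "my-custom-test",
        "input": "The prompt sent to the agent",
        "category": "tool-use",
        "difficulty": "medium",
        "assertions": [
            {"type": "contains", "value": "expected output"},
            {"type": "tool_called", "value": "Read"},
        ],
    }
    print(validate_case(case))
```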
The eval runner outputs a summary table after each run.
A healthy agent should pass all easy cases and most medium cases. The hard cases test advanced capabilities, so a lower pass rate is expected there.
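
To compare a run against these expectations, pass rates can be grouped by difficulty. The sketch below assumes each result is available as a dict with `difficulty` and `passed` keys. That record shape is hypothetical and is not the runner's documented output format.

```python
from __future__ import annotations

from collections import defaultdict


def pass_rate_by_difficulty(results: list[dict]) -> dict[str, float]:
    """Compute the fraction of passing cases for each difficulty level."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result["difficulty"]] += 1
        if result["passed"]:
            passed[result["difficulty"]] += 1
    return {level: passed[level] / totals[level] for level in totals}


if __name__ == "__main__":
    results = [
        {"difficulty": "easy", "passed": True},
        {"difficulty": "medium", "passed": True},
        {"difficulty": "medium", "passed": False},
        {"difficulty": "hard", "passed": False},
    ]
    for level, rate in pass_rate_by_difficulty(results).items():
        print(f"{level}: {rate:.0%}")
```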