ag2-evaluation

Name: Ag2 Evaluation
Author: ag2ai

// Evaluate, test, and track an AG2 beta Agent offline. Build a Suite of tasks, run the agent with run_agent, and grade answers with prebuilt scorers (final_answer_matches, tool_called, no_tool_errors, token_budget) or a custom @scorer — including the agent_judge LLM judge. Read the RunResult scorecard (pass_rate, score_stats, value_counts), gate it in CI with deterministic TestConfig cassettes, persist to store_dir and diff runs to catch regressions, and grade existing traces with evaluate_traces. Use when the user wants to evaluate, test, grade, or benchmark an agent, build a CI or regression gate, or score correctness, tool use, cost, or quality. To compare builds head-to-head or on a leaderboard, see ag2-eval-comparison.

stars:3

forks:1

updated: May 28, 2026, 01:00

SKILL.md

license	Apache-2.0

Evaluation — run, grade, and track an agent

When to use

Evaluate / test / benchmark an AG2 beta Agent, or build a regression / CI gate
Grade answers for correctness, tool use, cost, or subjective quality
Track a metric across versions (did this change help or regress?)

To compare two or more builds head-to-head or on a leaderboard, use ag2-eval-comparison.

Install

pip install "ag2[openai,tracing]"

run_agent reconstructs each task's trace from OpenTelemetry spans, so the tracing extra is required.

The loop — dataset, agent, scorers, run_agent

import asyncio
from autogen.beta import Agent
from autogen.beta.config import OpenAIConfig
from autogen.beta.eval import Suite, run_agent
from autogen.beta.eval.scorers import final_answer_matches

suite = Suite.from_list([
    {"task_id": "france", "inputs": {"input": "Capital of France?"}, "reference_outputs": {"answer": "Paris"}},
    {"task_id": "japan",  "inputs": {"input": "Capital of Japan?"},  "reference_outputs": {"answer": "Tokyo"}},
])
agent = Agent("geographer", prompt="Answer with the capital city.", config=OpenAIConfig(model="gpt-4o-mini"))

async def main():
    result = await run_agent(
        suite, agent=agent,
        scorers=[final_answer_matches(field="answer", matcher="contains")],
        store_dir="./runs",
    )
    print(result.summary())                            # the scorecard
    print(result.pass_rate("final_answer_matches"))    # 1.0 when both answers contain the reference

asyncio.run(main())

inputs["input"] is the prompt; reference_outputs is the gold answer (a dict — omit it for trace-only checks). Each scorer is a column, looked up by its key.

Scorers

A scorer asks ONE question. Its RETURN TYPE picks the aggregation:

return	aggregation	accessor
`bool`	pass rate	`result.pass_rate(key)`
`int` / `float`	mean / p50 / p95	`result.score_stats(key)`
`str`	value counts	`result.value_counts(key)`
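
The return-type rule in the table can be pictured with a small standalone sketch. It does not use AG2 and is not the library's implementation: `aggregate` and `percentile` are hypothetical helpers, and the nearest-rank percentile is an assumption chosen for simplicity.

```python
import math
from collections import Counter
from statistics import mean


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate(values):
    """Pick an aggregation from the type of the first value (hypothetical helper)."""
    first = values[0]
    if isinstance(first, bool):  # check bool first: bool is a subclass of int
        return {"pass_rate": sum(values) / len(values)}
    if isinstance(first, (int, float)):
        ordered = sorted(values)
        return {
            "mean": mean(values),
            "p50": percentile(ordered, 50),
            "p95": percentile(ordered, 95),
        }
    if isinstance(first, str):
        return {"value_counts": dict(Counter(values))}
    raise TypeError(f"unsupported scorer return type: {type(first).__name__}")


print(aggregate([True, True, False, True]))       # bool scores
print(aggregate([120, 80, 200, 100]))             # numeric scores
print(aggregate(["refusal", "answer", "answer"])) # categorical scores
```

The order of the `isinstance` checks matters: a scorer returning `True`/`False` would otherwise be averaged as a number instead of being reported as a pass rate.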

Prebuilt (autogen.beta.eval.scorers): final_answer_matches(field=, matcher="contains"|"casefold"|"exact"), tool_called(name), no_tool_errors(), token_budget(n), failure_attribution(...), agent_judge(...).

Custom — decorate a function that declares what it needs by name (outputs, trace, reference_outputs, inputs, task):

from autogen.beta.eval import scorer

@scorer
def answered_briefly(outputs) -> bool:
    return len(outputs["body"]) < 100      # outputs["body"] = final answer text
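
One way to picture "declares what it needs by name" is signature inspection: the runner looks at the function's parameter names and passes only those. The sketch below is a standalone illustration under that assumption; `call_scorer` is a hypothetical helper, not part of AG2, and the context values are made up for the example.

```python
import inspect


def call_scorer(fn, **available):
    """Call fn with only the keyword arguments its signature names (hypothetical helper)."""
    wanted = inspect.signature(fn).parameters
    missing = [name for name in wanted if name not in available]
    if missing:
        raise TypeError(f"scorer asks for unknown inputs: {missing}")
    return fn(**{name: available[name] for name in wanted})


def answered_briefly(outputs) -> bool:
    return len(outputs["body"]) < 100


def matches_reference(outputs, reference_outputs) -> bool:
    return reference_outputs["answer"] in outputs["body"]


# Everything a runner could offer for one task (illustrative values only).
context = {
    "inputs": {"input": "Capital of France?"},
    "outputs": {"body": "Paris"},
    "reference_outputs": {"answer": "Paris"},
    "trace": [],
    "task": {"task_id": "france"},
}

print(call_scorer(answered_briefly, **context))   # receives outputs only
print(call_scorer(matches_reference, **context))  # receives outputs and reference_outputs
```

Under this design a scorer never has to accept arguments it does not use, which keeps each one focused on a single question.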

agent_judge grades quality you can't check with == (use a different model than the agent under test):

from autogen.beta.config import OpenAIConfig
from autogen.beta.eval.scorers import agent_judge

judge = agent_judge(OpenAIConfig(model="gpt-4o"), criterion="Helpful and accurate.", key="quality")   # results are looked up under "quality"

CI — deterministic, no API key

Swap the model for a TestConfig cassette (a canned reply per task) so CI is free and repeatable:

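import asyncio
from autogen.beta.eval.scorers import final_answer_matches
from autogen.beta.testing import TestConfig

def build(*, config=None):                     # agent must be a FACTORY for per-task configs
    return Agent("geographer", prompt="Answer with the capital city.", config=config)

canned = {"france": TestConfig("Paris"), "japan": TestConfig("Tokyo")}   # keys match each task_id
scorers = [final_answer_matches(field="answer", matcher="contains")]

async def main():                              # reuses Agent, run_agent, and suite from the loop example
    result = await run_agent(suite, agent=build, scorers=scorers, model_config=canned, store_dir="./runs")
    assert result.pass_rate("final_answer_matches") == 1.0      # the gate

asyncio.run(main())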
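
The cassette idea (a fixed reply per task, so the gate never depends on a live model) can be shown without AG2 at all. This is a minimal sketch: `canned_model` and `pass_rate` are hypothetical stand-ins, not `TestConfig` or `RunResult`, and "contains" is assumed to mean a plain substring check.

```python
def canned_model(replies):
    """Return a fake model that answers each task_id with a fixed reply."""
    def answer(task_id, prompt):
        return replies[task_id]  # the prompt is ignored: replies are deterministic
    return answer


tasks = [
    {"task_id": "france", "input": "Capital of France?", "answer": "Paris"},
    {"task_id": "japan", "input": "Capital of Japan?", "answer": "Tokyo"},
]


def pass_rate(model, tasks):
    """Fraction of tasks whose reply contains the reference answer."""
    passed = [t["answer"] in model(t["task_id"], t["input"]) for t in tasks]
    return sum(passed) / len(passed)


good = canned_model({"france": "Paris", "japan": "Tokyo"})
bad = canned_model({"france": "Paris", "japan": "Kyoto"})

assert pass_rate(good, tasks) == 1.0  # the gate passes
print(pass_rate(bad, tasks))          # 0.5, so the same gate would fail
```

Because the replies are fixed, a failing gate points at the scorers or the suite rather than at model variance.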

Persist, track, grade existing traces

store_dir= writes one JSON per run. Reload a past run and diff for regressions; or grade traces you already have (e.g. production telemetry) without re-running the agent:

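from autogen.beta.eval import load_run, evaluate_traces, DirectoryTraceSource

async def check(result, scorers):                              # result: the RunResult returned by run_agent
    baseline = load_run("./runs/<run_id>.json")                # <run_id>: a previously stored run
    assert not result.diff(baseline).regressions               # scorers that flipped pass -> fail
    graded = await evaluate_traces(DirectoryTraceSource("./traces"), scorers=scorers, store_dir="./runs")
    return graded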
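
The "flipped pass -> fail" check can be sketched on plain dictionaries. The layout here (a mapping from `(task_id, scorer_key)` to a bool) is an assumption made for the example and says nothing about the JSON that `store_dir` actually writes; `regressions` is a hypothetical function, not the library's `diff`.

```python
def regressions(baseline, current):
    """Return sorted (task_id, scorer_key) pairs that passed in baseline and fail in current."""
    return sorted(
        key
        for key, passed in baseline.items()
        # Keys missing from current are skipped rather than counted as failures.
        if passed and current.get(key) is False
    )


baseline = {
    ("france", "final_answer_matches"): True,
    ("japan", "final_answer_matches"): True,
    ("japan", "no_tool_errors"): False,
}
current = {
    ("france", "final_answer_matches"): True,
    ("japan", "final_answer_matches"): False,  # regression
    ("japan", "no_tool_errors"): True,         # improvement, not a regression
}

print(regressions(baseline, current))
```

Only pass-to-fail flips are reported; improvements and results that were already failing are ignored, which is what makes an empty result usable as a CI gate.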

Common pitfalls

Missing tracing extra — run_agent can't reconstruct traces. Install ag2[<provider>,tracing].
Return type vs aggregation — bool for pass/fail, a number for stats, a str for categories; look results up by the scorer's key.
Same model answers and judges — biases agent_judge; use a different judge model.

Going deeper

website/docs/beta/evaluation/ — getting-started, scorers (catalog + custom + return-type rules), runs, persistence
ag2-eval-comparison — leaderboard (run_variants) + head-to-head (run_pairwise)

related-skills.json

same repository

ag2-overview.md

from "ag2ai/ag2-skills"

Map of AG2 beta capabilities and which sibling skill to reach for. Load first when the user mentions building with AG2 beta (autogen.beta) but the specific feature isn't yet clear — agents, tools, model config, delegation, memory, observers, structured output, HITL, AG-UI, telemetry, testing, or evaluation.

2026-05-28

ag2-eval-comparison.md

from "ag2ai/ag2-skills"

Compare AG2 beta agents, models, or prompts to decide which is better. run_variants scores several named configurations on one suite and ranks them on a leaderboard (Variants.from_configs, from_prompts, from_tools, from_middleware, from_targets). run_pairwise with pairwise_judge does head-to-head LLM comparison using a dual-order position swap (a win counts only if it survives the swap, else a tie), reporting win-rate with a Wilson 95% CI, wins, losses, ties, flips, and agreement (Cohen's kappa). human_pairwise collects a person's blinded vote inline, or via an exported manifest with export_pairwise_cases and human_labels. Use when the user wants to A/B test prompts or models, run a leaderboard, pick a winner, judge head-to-head, measure win-rate, or collect human preference labels. For running and grading a single agent, see ag2-evaluation.

2026-05-28

ag2-middleware.md

from "ag2ai/ag2-skills"

Intercept the AG2 beta agent loop with `BaseMiddleware` — wrap full turns (`on_turn`), each LLM call (`on_llm_call`), each tool execution (`on_tool_execution`), or each human-input request (`on_human_input`). Use for retry, logging, history trimming, request mutation, tool auditing, guardrails, or rate limiting. Built-ins: `LoggingMiddleware`, `RetryMiddleware`, `HistoryLimiter`, `TokenLimiter`, `TelemetryMiddleware` (see `ag2-telemetry`). For per-tool hooks see also `ag2-add-custom-tool` tool-middleware section.

2026-05-28

ag2-network-discussion.md

from "ag2ai/ag2-skills"

Open an AG2 network `discussion` channel — N-party (2+) round-robin where each participant speaks in fixed order, cycling until explicit close or TTL. Use when the user wants a brainstorm with a fixed cast, a panel discussion, or round-robin reviewers. Covers `agent_client.open(type="discussion", target=[...], knobs={"ordering": ORDERING_ROUND_ROBIN})`, the `expected_next_speaker` rotation, the `hc.can_send(...)` probe pattern (default handlers skip LLM calls when it isn't their turn), `DiscussionState`, the `turn_within` expectation defaults (`warn` at 120s / `hide` at 600s), view-window sizing for N participants, and the four close patterns that work with this adapter. Load this after `ag2-network-quickstart`. For conditional handoffs or declarative orchestration, see `ag2-network-workflow` instead.

2026-05-28

ag2-network-governance.md

from "ag2ai/ag2-skills"

Govern an AG2 multi-agent network — identity (`Passport`, `Resume`), per-agent `Rule` with `AccessBlock` / `LimitsBlock` / `RateBlock` / `InboxBlock`, the swappable `HubArbiter` / `RuleBasedArbiter` access-&-routing seam, `AuthAdapter` / `AuthRegistry` registration, channel-level `Expectation`s with `audit` / `warn` / `auto_close` violation handlers, the hub's append-only audit log and `AUDIT_KIND_*` constants, live `HubListener` / `BaseHubListener` observability plus `Hub` `on_*` hooks and `register_sweeper`, and task observation via `agent.task(...)` + `TaskMirror` (updates `Resume.observed` for peer ranking). Use when the user needs rate limits, access policy, SLAs, compliance trails, live metrics/alerting, capability-driven peer ranking, or to inspect what actually happened on the network. Load this after `ag2-network-quickstart`. For the agent-side surface (custom handlers, views, LLM tools, `HumanClient`) see `ag2-network-tools-and-views`.

2026-05-28

ag2-network-tools-and-views.md

from "ag2ai/ag2-skills"

Shape what an AG2 network agent perceives and which actions its LLM can take. Covers the six auto-injected LLM-facing tools that ship via `NetworkPlugin` (`say`, `delegate`, `peers`, `channels`, `tasks`, `context`); replacing the default handler with `agent_client.on_envelope(callback)` (gateways, headless workers); the `ViewPolicy` Protocol with the built-in `FullTranscript` and `WindowedSummary(recent_n=N)` views plus how to write a custom view; peer discovery via skill markdown (`skill_md=`, `parse_skill_frontmatter`, `hub.set_skill`); the `Envelope` wire format with the `EV_*` event taxonomy (`EV_TEXT`, `EV_PACKET`, `EV_CHANNEL_*`, `EV_EXPECTATION_VIOLATED`, `ag2.task.*`), `audience` and `visible_to` semantics, `Priority`, `causation_id`, and how to send raw envelopes with custom event types via `agent_client.send_envelope(...)`. Use when the user wants to customise the LLM's network surface, write a custom envelope handler, build a gateway / headless worker, or wire peer discovery.

2026-05-28

package.json

"author": "ag2ai"

"repository": "ag2ai/ag2-skills"
