Run any Skill in Manus with one click

$pwd:

ag2-eval-comparison

Name: Ag2 Eval Comparison
Author: ag2ai

// Compare AG2 beta agents, models, or prompts to decide which is better. run_variants scores several named configurations on one suite and ranks them on a leaderboard (Variants.from_configs, from_prompts, from_tools, from_middleware, from_targets). run_pairwise with pairwise_judge does head-to-head LLM comparison using a dual-order position swap (a win counts only if it survives the swap, else a tie), reporting win-rate with a Wilson 95% CI, wins, losses, ties, flips, and agreement (Cohen's kappa). human_pairwise collects a person's blinded vote inline, or via an exported manifest with export_pairwise_cases and human_labels. Use when the user wants to A/B test prompts or models, run a leaderboard, pick a winner, judge head-to-head, measure win-rate, or collect human preference labels. For running and grading a single agent, see ag2-evaluation.

Run Skill in Manus

$ git log --oneline --stat

stars:3

forks:1

updated:May 28, 2026 at 01:00

SKILL.md

readonly

name	ag2-eval-comparison
description	Compare AG2 beta agents, models, or prompts to decide which is better. run_variants scores several named configurations on one suite and ranks them on a leaderboard (Variants.from_configs, from_prompts, from_tools, from_middleware, from_targets). run_pairwise with pairwise_judge does head-to-head LLM comparison using a dual-order position swap (a win counts only if it survives the swap, else a tie), reporting win-rate with a Wilson 95% CI, wins, losses, ties, flips, and agreement (Cohen's kappa). human_pairwise collects a person's blinded vote inline, or via an exported manifest with export_pairwise_cases and human_labels. Use when the user wants to A/B test prompts or models, run a leaderboard, pick a winner, judge head-to-head, measure win-rate, or collect human preference labels. For running and grading a single agent, see ag2-evaluation.
license	Apache-2.0

Evaluation — comparing builds (variants & pairwise)

When to use

Rank N models / prompts / configs on a leaderboard → run_variants
Decide which of two is better, head-to-head → run_pairwise with pairwise_judge (LLM) or human_pairwise (people)

For running and grading a single agent (scorers, CI, persistence), use ag2-evaluation.

Install

pip install "ag2[openai,tracing]"

Leaderboard — run_variants

Vary ONE axis (a Variants.from_* constructor fixes it), score each, rank:

from autogen.beta import Agent
from autogen.beta.config import OpenAIConfig, GeminiConfig
from autogen.beta.eval import Variants, run_variants
from autogen.beta.eval.scorers import agent_judge

def build(*, config=None):
    return Agent("a", prompt="Answer helpfully.", config=config)

board = await run_variants(
    suite,
    variants=Variants.from_configs(build, {
        "gpt-4o": OpenAIConfig("gpt-4o"),
        "flash":  GeminiConfig("gemini-3-flash-preview"),
    }),
    scorers=[agent_judge(OpenAIConfig("gpt-4o"), criterion="Helpful and accurate.", key="quality")],
    store_dir="runs",
    repeats=5,                          # optional: N runs per variant for stability
)
print(board.summary("quality"))         # ranked leaderboard
board.best("quality")                   # winning variant name (None if tied)
board.results["gpt-4o"]                 # each variant's full RunResult

Axes: from_configs (model), from_prompts (prompt), from_tools, from_middleware, from_targets (whole build). Tied scores share a rank; a 3-way tie usually means the eval isn't discriminating — make it harder, or score quality with a judge.

Head-to-head (LLM) — run_pairwise + pairwise_judge

A comparator picks a winner PER task. pairwise_judge shows the pair in BOTH orders and counts a win only if it's consistent — else a tie (cancels position bias):

from autogen.beta.eval import run_pairwise
from autogen.beta.eval.scorers import pairwise_judge

result = await run_pairwise(
    suite, variant_a=build_v1, variant_b=build_v2,
    comparators=[pairwise_judge(OpenAIConfig("gpt-4o"), criterion="more helpful answer", key="quality")],
    store_dir="runs",
)
wr = result.win_rate("quality")         # B's win-rate
print(wr.rate, wr.ci, wr.wins, wr.losses, wr.ties)   # ties count 0.5; ci is a Wilson 95% interval
print(result.flips("quality"))          # pairs where the two orders disagreed

variant_a / variant_b are agents or build factories. result.agreement("quality", "human") gives Cohen's κ between two comparators. Use a judge model different from the variants.

Head-to-head (human) — human_pairwise

Same unit, decided by a person. The pair is blinded and order-randomized; the default prints it and reads 1 / 2 / tie. Pass your own async ask(task, response_1, response_2) to collect a vote from a UI (returns "1", "2", or "tie"):

from autogen.beta.eval.scorers import human_pairwise

async def ask(task, response_1, response_2) -> str:
    return await my_ui.compare(task.inputs["input"], response_1, response_2)   # "1" / "2" / "tie"

result = await run_pairwise(suite, variant_a=build_v1, variant_b=build_v2,
                            comparators=[human_pairwise(key="quality", ask=ask)], store_dir="runs")

At scale, export a blinded manifest, label it in any tool, import it. evaluate_pairwise is the grade-only twin of run_pairwise (pairs two existing trace sources by task_id):

from autogen.beta.eval import evaluate_pairwise, DirectoryTraceSource
from autogen.beta.eval.scorers import export_pairwise_cases, human_labels

a, b = DirectoryTraceSource("runs/champion"), DirectoryTraceSource("runs/challenger")
await export_pairwise_cases(a, b, criteria=["more helpful"], out="labels.jsonl", suite=suite)   # blinded JSONL
# a person adds  "preferred": "1" | "2" | "tie"  per line, then:
result = await evaluate_pairwise(a, b, suite=suite, store_dir="runs",
                                 comparators=[human_labels("labels.jsonl", criterion="more helpful", key="helpful")])

The manifest hides which model is which; its first_variant field de-blinds it for human_labels.

Common pitfalls

Judge == a variant's model — self-preference bias; use a different judge model.
Bare win-rate on few pairs — report wr.ci (Wilson); a small n straddles 50%.
Variants not factories — run_variants / run_pairwise need build callables so each task and order gets a fresh agent. Keep pairwise_judge's default swap (don't set swap=False) for unbiased verdicts.

Going deeper

website/docs/beta/evaluation/ — variants (the from_* axes), pairwise (comparators, win-rate, blinded labeling)
ag2-evaluation — single-agent run/grade, scorers, CI, persistence

related-skills.json

same repository

ag2-overview.md

from "ag2ai/ag2-skills"

Map of AG2 beta capabilities and which sibling skill to reach for. Load first when the user mentions building with AG2 beta (autogen.beta) but the specific feature isn't yet clear — agents, tools, model config, delegation, memory, observers, structured output, HITL, AG-UI, telemetry, testing, or evaluation.

2026-05-283

ag2-evaluation.md

from "ag2ai/ag2-skills"

Evaluate, test, and track an AG2 beta Agent offline. Build a Suite of tasks, run the agent with run_agent, and grade answers with prebuilt scorers (final_answer_matches, tool_called, no_tool_errors, token_budget) or a custom @scorer — including the agent_judge LLM judge. Read the RunResult scorecard (pass_rate, score_stats, value_counts), gate it in CI with deterministic TestConfig cassettes, persist to store_dir and diff runs to catch regressions, and grade existing traces with evaluate_traces. Use when the user wants to evaluate, test, grade, or benchmark an agent, build a CI or regression gate, or score correctness, tool use, cost, or quality. To compare builds head-to-head or on a leaderboard, see ag2-eval-comparison.

2026-05-283

ag2-middleware.md

from "ag2ai/ag2-skills"

Intercept the AG2 beta agent loop with `BaseMiddleware` — wrap full turns (`on_turn`), each LLM call (`on_llm_call`), each tool execution (`on_tool_execution`), or each human-input request (`on_human_input`). Use for retry, logging, history trimming, request mutation, tool auditing, guardrails, or rate limiting. Built-ins: `LoggingMiddleware`, `RetryMiddleware`, `HistoryLimiter`, `TokenLimiter`, `TelemetryMiddleware` (see `ag2-telemetry`). For per-tool hooks see also `ag2-add-custom-tool` tool-middleware section.

2026-05-283

ag2-network-discussion.md

from "ag2ai/ag2-skills"

Open an AG2 network `discussion` channel — N-party (2+) round-robin where each participant speaks in fixed order, cycling until explicit close or TTL. Use when the user wants a brainstorm with a fixed cast, a panel discussion, or round-robin reviewers. Covers `agent_client.open(type="discussion", target=[...], knobs={"ordering": ORDERING_ROUND_ROBIN})`, the `expected_next_speaker` rotation, the `hc.can_send(...)` probe pattern (default handlers skip LLM calls when it isn't their turn), `DiscussionState`, the `turn_within` expectation defaults (`warn` at 120s / `hide` at 600s), view-window sizing for N participants, and the four close patterns that work with this adapter. Load this after `ag2-network-quickstart`. For conditional handoffs or declarative orchestration, see `ag2-network-workflow` instead.

2026-05-283

ag2-network-governance.md

from "ag2ai/ag2-skills"

Govern an AG2 multi-agent network — identity (`Passport`, `Resume`), per-agent `Rule` with `AccessBlock` / `LimitsBlock` / `RateBlock` / `InboxBlock`, the swappable `HubArbiter` / `RuleBasedArbiter` access-&-routing seam, `AuthAdapter` / `AuthRegistry` registration, channel-level `Expectation`s with `audit` / `warn` / `auto_close` violation handlers, the hub's append-only audit log and `AUDIT_KIND_*` constants, live `HubListener` / `BaseHubListener` observability plus `Hub` `on_*` hooks and `register_sweeper`, and task observation via `agent.task(...)` + `TaskMirror` (updates `Resume.observed` for peer ranking). Use when the user needs rate limits, access policy, SLAs, compliance trails, live metrics/alerting, capability-driven peer ranking, or to inspect what actually happened on the network. Load this after `ag2-network-quickstart`. For the agent-side surface (custom handlers, views, LLM tools, `HumanClient`) see `ag2-network-tools-and-views`.

2026-05-283

ag2-network-tools-and-views.md

from "ag2ai/ag2-skills"

Shape what an AG2 network agent perceives and which actions its LLM can take. Covers the six auto-injected LLM-facing tools that ship via `NetworkPlugin` (`say`, `delegate`, `peers`, `channels`, `tasks`, `context`); replacing the default handler with `agent_client.on_envelope(callback)` (gateways, headless workers); the `ViewPolicy` Protocol with the built-in `FullTranscript` and `WindowedSummary(recent_n=N)` views plus how to write a custom view; peer discovery via skill markdown (`skill_md=`, `parse_skill_frontmatter`, `hub.set_skill`); the `Envelope` wire format with the `EV_*` event taxonomy (`EV_TEXT`, `EV_PACKET`, `EV_CHANNEL_*`, `EV_EXPECTATION_VIOLATED`, `ag2.task.*`), `audience` and `visible_to` semantics, `Priority`, `causation_id`, and how to send raw envelopes with custom event types via `agent_client.send_envelope(...)`. Use when the user wants to customise the LLM's network surface, write a custom envelope handler, build a gateway / headless worker, or wire peer discovery.

2026-05-283

package.json

"author": "ag2ai"

"repository": "ag2ai/ag2-skills"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

name	ag2-eval-comparison
description	Compare AG2 beta agents, models, or prompts to decide which is better. run_variants scores several named configurations on one suite and ranks them on a leaderboard (Variants.from_configs, from_prompts, from_tools, from_middleware, from_targets). run_pairwise with pairwise_judge does head-to-head LLM comparison using a dual-order position swap (a win counts only if it survives the swap, else a tie), reporting win-rate with a Wilson 95% CI, wins, losses, ties, flips, and agreement (Cohen's kappa). human_pairwise collects a person's blinded vote inline, or via an exported manifest with export_pairwise_cases and human_labels. Use when the user wants to A/B test prompts or models, run a leaderboard, pick a winner, judge head-to-head, measure win-rate, or collect human preference labels. For running and grading a single agent, see ag2-evaluation.
license	Apache-2.0

Evaluation — comparing builds (variants & pairwise)

When to use

Rank N models / prompts / configs on a leaderboard → run_variants
Decide which of two is better, head-to-head → run_pairwise with pairwise_judge (LLM) or human_pairwise (people)

For running and grading a single agent (scorers, CI, persistence), use ag2-evaluation.

Install

pip install "ag2[openai,tracing]"

Leaderboard — run_variants

Vary ONE axis (a Variants.from_* constructor fixes it), score each, rank:

from autogen.beta import Agent
from autogen.beta.config import OpenAIConfig, GeminiConfig
from autogen.beta.eval import Variants, run_variants
from autogen.beta.eval.scorers import agent_judge

def build(*, config=None):
    return Agent("a", prompt="Answer helpfully.", config=config)

board = await run_variants(
    suite,
    variants=Variants.from_configs(build, {
        "gpt-4o": OpenAIConfig("gpt-4o"),
        "flash":  GeminiConfig("gemini-3-flash-preview"),
    }),
    scorers=[agent_judge(OpenAIConfig("gpt-4o"), criterion="Helpful and accurate.", key="quality")],
    store_dir="runs",
    repeats=5,                          # optional: N runs per variant for stability
)
print(board.summary("quality"))         # ranked leaderboard
board.best("quality")                   # winning variant name (None if tied)
board.results["gpt-4o"]                 # each variant's full RunResult

Head-to-head (LLM) — run_pairwise + pairwise_judge

A comparator picks a winner PER task. pairwise_judge shows the pair in BOTH orders and counts a win only if it's consistent — else a tie (cancels position bias):

from autogen.beta.eval import run_pairwise
from autogen.beta.eval.scorers import pairwise_judge

result = await run_pairwise(
    suite, variant_a=build_v1, variant_b=build_v2,
    comparators=[pairwise_judge(OpenAIConfig("gpt-4o"), criterion="more helpful answer", key="quality")],
    store_dir="runs",
)
wr = result.win_rate("quality")         # B's win-rate
print(wr.rate, wr.ci, wr.wins, wr.losses, wr.ties)   # ties count 0.5; ci is a Wilson 95% interval
print(result.flips("quality"))          # pairs where the two orders disagreed

variant_a / variant_b are agents or build factories. result.agreement("quality", "human") gives Cohen's κ between two comparators. Use a judge model different from the variants.

Head-to-head (human) — human_pairwise

from autogen.beta.eval.scorers import human_pairwise

async def ask(task, response_1, response_2) -> str:
    return await my_ui.compare(task.inputs["input"], response_1, response_2)   # "1" / "2" / "tie"

result = await run_pairwise(suite, variant_a=build_v1, variant_b=build_v2,
                            comparators=[human_pairwise(key="quality", ask=ask)], store_dir="runs")

At scale, export a blinded manifest, label it in any tool, import it. evaluate_pairwise is the grade-only twin of run_pairwise (pairs two existing trace sources by task_id):

from autogen.beta.eval import evaluate_pairwise, DirectoryTraceSource
from autogen.beta.eval.scorers import export_pairwise_cases, human_labels

a, b = DirectoryTraceSource("runs/champion"), DirectoryTraceSource("runs/challenger")
await export_pairwise_cases(a, b, criteria=["more helpful"], out="labels.jsonl", suite=suite)   # blinded JSONL
# a person adds  "preferred": "1" | "2" | "tie"  per line, then:
result = await evaluate_pairwise(a, b, suite=suite, store_dir="runs",
                                 comparators=[human_labels("labels.jsonl", criterion="more helpful", key="helpful")])

The manifest hides which model is which; its first_variant field de-blinds it for human_labels.

Common pitfalls

Judge == a variant's model — self-preference bias; use a different judge model.
Bare win-rate on few pairs — report wr.ci (Wilson); a small n straddles 50%.
Variants not factories — run_variants / run_pairwise need build callables so each task and order gets a fresh agent. Keep pairwise_judge's default swap (don't set swap=False) for unbiased verdicts.

Going deeper

website/docs/beta/evaluation/ — variants (the from_* axes), pairwise (comparators, win-rate, blinded labeling)
ag2-evaluation — single-agent run/grade, scorers, CI, persistence

ag2-eval-comparison

Evaluation — comparing builds (variants & pairwise)

When to use

Install

Leaderboard — run_variants

Head-to-head (LLM) — run_pairwise + pairwise_judge

Head-to-head (human) — human_pairwise

Common pitfalls

Going deeper

More from this repository

More from this repository

Evaluation — comparing builds (variants & pairwise)

When to use

Install

Leaderboard — run_variants

Head-to-head (LLM) — run_pairwise + pairwise_judge

Head-to-head (human) — human_pairwise

Common pitfalls

Going deeper