$pwd:

ag2-eval-comparison

// Compare AG2 beta agents, models, or prompts to decide which performs better. run_variants scores several named configurations on a single suite and ranks them on a leaderboard; variants are built with Variants.from_configs, from_prompts, from_tools, from_middleware, or from_targets. run_pairwise with pairwise_judge runs a head-to-head LLM comparison using a dual-order position swap: a win counts only if it survives the swap, and otherwise the pair is scored as a tie. It reports win-rate with a Wilson 95% CI, plus wins, losses, ties, flips, and agreement (Cohen's kappa). human_pairwise collects a person's blinded vote inline, or through an exported manifest using export_pairwise_cases and human_labels. Use this skill when the user wants to A/B test prompts or models, run a leaderboard, pick a winner, judge head-to-head, measure win-rate, or collect human preference labels. For running and grading a single agent, see ag2-evaluation.
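The swap rule and the Wilson interval mentioned above can be illustrated with a small standalone sketch. It does not use AG2 and is not AG2's implementation. The names `resolve_swap` and `wilson_interval` are hypothetical, and so are the verdict strings "first", "second", and "tie". Two things are assumptions here: that a "flip" means the judge preferred a different candidate once the order was swapped, and that the caller decides whether ties count in the denominator of the win-rate.

```python
import math

# Hypothetical helpers for illustration only; not part of AG2.

def resolve_swap(verdict_ab: str, verdict_ba: str) -> tuple[str, bool]:
    """Combine two judge verdicts for the same pair.

    verdict_ab: judge's pick when A is shown first ("first", "second", "tie").
    verdict_ba: judge's pick when B is shown first.
    Returns (outcome, flipped), where outcome is "A", "B", or "tie".
    """
    pick_ab = {"first": "A", "second": "B", "tie": "tie"}[verdict_ab]
    pick_ba = {"first": "B", "second": "A", "tie": "tie"}[verdict_ba]
    # A win only counts if the same candidate is preferred in both orders.
    if pick_ab == pick_ba and pick_ab != "tie":
        return pick_ab, False
    # Assumed definition: a flip is two decisive verdicts that disagree.
    flipped = "tie" not in (pick_ab, pick_ba) and pick_ab != pick_ba
    return "tie", flipped


def wilson_interval(wins: int, n: int, z: float = 1.959963984540054) -> tuple[float, float]:
    """Wilson score interval for a proportion wins/n (z defaults to 95%)."""
    if n == 0:
        return 0.0, 1.0  # no data: the interval is uninformative
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    print(resolve_swap("first", "second"))  # A preferred in both orders
    print(resolve_swap("first", "first"))   # judge followed position, not content
    print(wilson_interval(8, 10))
```

A judge that always picks whichever answer is shown first yields a tie with a flip under this rule, which is why the swap guards against position bias. The Wilson interval is used instead of the plain normal approximation because it stays inside [0, 1] and behaves sensibly for small suites.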

$ git log --oneline --stat
stars: 3
forks: 1
updated: 28 May 2026 at 01:00
SKILL.md
readonly