comparative-evaluation

Overview

A/B testing, side-by-side comparison, and preference ranking for AI outputs.

Install command

npx skills add https://github.com/wanghaisheng/openaiworkhorse-design-team --skill comparative-evaluation

Copy this command and paste it into Claude Code to install the skill.

Source

wanghaisheng/openaiworkhorse-design-team

Stars: 1

Forks: 0

Updated: April 24, 2026 at 07:17

SKILL.md

name	comparative-evaluation
description	A/B testing, side-by-side comparison, and preference ranking for AI outputs.

Comparative Evaluation

Absolute quality scores are useful but limited. Comparative evaluation — putting outputs side by side and asking which is better — often reveals quality differences that rubrics miss.

Comparison Methods

A/B testing: Show different users different versions and compare outcomes
Side-by-side evaluation: Show evaluators two outputs for the same input and ask which is better
Preference ranking: Show evaluators multiple outputs and ask them to rank the outputs from best to worst
Paired comparison: Compare every pair of options to build a complete ranking
Elo rating: Use tournament-style comparisons to develop continuous quality scores
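
The Elo method above can be sketched as follows. This is a minimal illustration, assuming the standard logistic Elo update with a K-factor of 32 and a starting rating of 1000; the variant names and the judgments are hypothetical, not taken from the skill.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise judgment to the ratings in place."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)  # a surprising win moves ratings more
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical side-by-side judgments, recorded as (winner, loser).
judgments = [
    ("prompt_v2", "prompt_v1"),
    ("prompt_v2", "prompt_v3"),
    ("prompt_v3", "prompt_v1"),
    ("prompt_v2", "prompt_v1"),
]

ratings = {name: 1000.0 for name in ("prompt_v1", "prompt_v2", "prompt_v3")}
for winner, loser in judgments:
    update_elo(ratings, winner, loser)

# Sort variants from highest to lowest rating.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Elo ratings depend on the order in which judgments are processed, so with few comparisons it may be worth shuffling the judgments and averaging over several passes.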

Designing A/B Tests for AI

A/B testing AI is different from A/B testing UI:

Variance is high: The same prompt can produce different outputs, so you need more samples
Context matters: The same change might help for one task and hurt for another
Metrics lag: AI quality changes may take time to show up in user behaviour
Interaction effects: A change to one part of the conversation affects all subsequent parts

Design A/B tests with:

Sufficient sample sizes to account for output variance
Segmentation by task type and user experience level
Multiple metrics (don't optimise for one at the expense of others)
Guardrails to catch severe quality regressions quickly
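
As a rough illustration of the sample-size point, the sketch below estimates how many samples each arm needs to detect a change in a binary success metric. It assumes a two-sided two-proportion z-test, a hypothetical thumbs-up rate as the metric, and conventional defaults (alpha 0.05, power 0.8); none of these choices come from the skill itself.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control: float, p_variant: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples per arm for a two-sided two-proportion z-test."""
    if p_control == p_variant:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    p_bar = (p_control + p_variant) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p_control * (1 - p_control)
                              + p_variant * (1 - p_variant))
    ) ** 2
    return math.ceil(numerator / (p_control - p_variant) ** 2)


# Hypothetical example: detect a lift in thumbs-up rate from 70% to 75%.
n_small_lift = samples_per_arm(0.70, 0.75)
# A larger lift (70% to 80%) needs fewer samples per arm.
n_large_lift = samples_per_arm(0.70, 0.80)
print(n_small_lift, n_large_lift)
```

When results are segmented by task type, this calculation applies to each segment separately, so the total sample requirement grows with the number of segments.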

Side-by-Side Evaluation Design

For human evaluation of AI outputs:

Blind evaluation: Evaluators shouldn't know which version is which
Consistent inputs: Compare outputs generated from the same input
Structured criteria: Give evaluators specific dimensions to compare on, not just "which is better"
Multiple evaluators: Use at least 3 evaluators per comparison for reliability
Diverse inputs: Test across a representative sample of real user inputs

When to Use Comparative vs. Absolute Evaluation

Comparative: Best for choosing between alternatives, detecting subtle quality differences, and model selection
Absolute: Best for measuring against a standard, tracking progress over time, and certification

Design Artefacts

A/B test design templates
Side-by-side evaluation protocols
Evaluator instructions and rubrics
Sample size calculators for AI experiments
Comparison result analysis frameworks
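
One possible starting point for a comparison result analysis is an exact sign test on pairwise wins. The sketch below assumes ties are dropped before counting and that each comparison is independent; it is an illustration, not the analysis framework the skill itself provides.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test of H0: A and B are equally likely to win."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive comparisons, so no evidence either way
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical result: version A preferred in 8 of 10 decisive comparisons.
p_value = sign_test_p_value(8, 2)
print(round(p_value, 4))  # 0.1094
```

An 8 to 2 split looks decisive but is not significant at the 0.05 level, which is one reason small evaluation sets can mislead.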

More from this repository

designer-role

wanghaisheng/openaiworkhorse-design-team

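Plays the designer role, carrying out end-to-end design tasks from user research and UX strategy through UI design and interaction design. Covers multiple areas of responsibility, including design research, design systems, prototype testing, and design operations.

2026-04-24, 1 star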

agent-role-design

wanghaisheng/openaiworkhorse-design-team

Defining what each agent does, knows, and owns in a multi-agent system.

2026-04-24, 1 star

behavioral-consistency

wanghaisheng/openaiworkhorse-design-team

Ensuring the AI behaves predictably across sessions, edge cases, and modalities.

2026-04-24, 1 star

bias-detection-design

wanghaisheng/openaiworkhorse-design-team

Designing review workflows to surface and mitigate bias in AI outputs.

2026-04-24, 1 star

chain-of-thought-design

wanghaisheng/openaiworkhorse-design-team

Designing reasoning chains that produce better outputs.

2026-04-24, 1 star

consent-and-agency

wanghaisheng/openaiworkhorse-design-team

Designing for informed user consent, opt-out, and human override.

2026-04-24, 1 star

Useful for (SOC)

Market Research Analysts and Marketing Specialists, Business and Financial Operations Occupations, 13-1161, L4
