ab-plan

Design an LLM A/B test — pick platform (Statsig or GrowthBook), primary metric, guardrails, sample size with LLM-noise buffer, CUPED, sequential stopping, and multiple-comparison correction. Use when you need help with ab plan.

Stars: 1

Forks: 2

Updated: May 24, 2026 at 17:44

Source

anubhavg-icpl

anubhavg-icpl/vibe

SKILL.md

name	ab-plan
description	Design an LLM A/B test — pick platform (Statsig or GrowthBook), primary metric, guardrails, sample size with LLM-noise buffer, CUPED, sequential stopping, and multiple-comparison correction. Use when you need help with ab plan.
license	CC-BY-NC-SA-4.0
phase	17
lesson	21
metadata	{"version":"1.0.0","tags":["ab-testing","statsig","growthbook","cuped","sequential","benjamini-hochberg","srm"]}

Given the feature change (prompt / model / generation parameter), baseline metrics, expected lift, and team posture (warehouse-native OSS vs bundled SaaS), produce an A/B plan.

Produce:

Platform. Statsig (bundled SaaS, OpenAI-owned) or GrowthBook (MIT OSS, warehouse-native). Justify.
Primary metric + guardrails. Primary is the metric you are trying to move; guardrails are things that must not regress (cost/request, latency P99, refusal rate).
Sample size. Classical power calculation × 1.4 (LLM non-determinism buffer).
Design. Fixed-horizon or sequential. Sequential if you expect strong signals; fixed if the change is subtle.
CUPED. Enable if pre-period data exists for the primary metric; specify the regressor.
Correction. Bonferroni for a small number of tests; Benjamini-Hochberg for many related tests.
SRM. Require SRM check on every experiment; halt and debug if flagged.
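
The sample-size and CUPED items above can be sketched as follows. This is a minimal illustration, assuming a binary primary metric (a conversion-style rate) and a two-sided two-proportion test; the function names, the default alpha and power, and the choice of a single pre-period covariate are assumptions for the example, not something the skill prescribes. Only the 1.4 multiplier comes from the text.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.8,
                        llm_buffer=1.4):
    """Classical per-arm sample size for two proportions, then the LLM buffer.

    Returns (classical_n, buffered_n). The 1.4 buffer is the multiplier
    named in the plan; alpha and power defaults are assumptions.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    classical = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(classical), math.ceil(classical * llm_buffer)


def cuped_adjust(y, x):
    """CUPED: subtract the part of y explained by a pre-period regressor x.

    theta = cov(x, y) / var(x); the adjusted metric keeps the same mean
    but has lower variance when x and y are correlated. In a real
    experiment theta is usually estimated on pooled data from all arms.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return list(y)  # regressor carries no information
    theta = cov / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Example: 10% baseline rate, 10% relative lift hypothesis.
classical_n, buffered_n = sample_size_per_arm(0.10, 0.10)
```

The regressor `x` would typically be the same metric measured for the same user before the experiment started, which is what "specify the regressor" asks the plan to state.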

Hard rejects:

Shipping on vibes. Refuse — require A/B or documented no-A/B exception.
Running >5 experiments on the same primary metric without BH/Bonferroni. Refuse: uncorrected, false discoveries become likely.
Skipping SRM check. Refuse — assignment bugs are common.
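
Two of the hard rejects map to small checks that can be sketched in plain Python. The Benjamini-Hochberg step-up procedure below is the standard one; the SRM check is a chi-square goodness-of-fit test with one degree of freedom. The function names and the 0.001 SRM alert threshold are assumptions for illustration (the skill does not name a threshold).

```python
import math


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= k/m * q and
    reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square (df = 1) p-value for a sample ratio mismatch.

    expected_ratio is the planned share of traffic in control.
    """
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level, not taken from the skill

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bh_flags = benjamini_hochberg(p_vals)
srm_flagged = srm_p_value(5000, 5400) < SRM_THRESHOLD
```

With the example p-values, BH rejects the two smallest while Bonferroni rejects only the smallest, which is why BH is the usual pick when many related tests run together.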

Refusal rules:

If traffic < 1000 users/week for the feature, refuse fixed A/B — require shadow + canary (Phase 17 · 20) instead.
If the primary metric is subjective (e.g., "quality") without an objective proxy, require human eval in parallel.
If the lift hypothesis is smaller than the LLM noise floor, refuse — the experiment cannot detect it with realistic sample size.
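
The three refusal rules can be read as a pre-flight check that runs before any plan is written. The sketch below is one hypothetical encoding: the `ExperimentRequest` fields and the message strings are invented for illustration, and how the LLM noise floor is measured is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class ExperimentRequest:
    # All field names are hypothetical; only the rules come from the skill.
    weekly_users: int
    metric_is_subjective: bool
    has_objective_proxy: bool
    expected_lift: float      # same units as llm_noise_floor
    llm_noise_floor: float    # run-to-run variation of the metric


def preflight(req):
    """Return the list of refusal or requirement messages that apply."""
    issues = []
    if req.weekly_users < 1000:
        issues.append("refuse fixed A/B: use shadow + canary instead")
    if req.metric_is_subjective and not req.has_objective_proxy:
        issues.append("require human eval in parallel")
    if req.expected_lift < req.llm_noise_floor:
        issues.append("refuse: lift hypothesis is below the LLM noise floor")
    return issues


request = ExperimentRequest(
    weekly_users=800,
    metric_is_subjective=True,
    has_objective_proxy=False,
    expected_lift=0.01,
    llm_noise_floor=0.02,
)
problems = preflight(request)
```

An empty list means none of the refusal rules fired and the plan can proceed.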

Output: a one-page plan with platform, primary + guardrails, sample size, design, CUPED, correction, SRM policy. End with the decision rule: primary significant + all guardrails not significant-negative → ship; any guardrail breach → do not ship regardless of primary.
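
The closing decision rule is simple enough to write down as a function. This is a sketch under stated assumptions: effects are signed so that a positive value means improvement, "primary significant" is taken to mean significant in the positive direction, and the "hold" outcome for a non-significant primary is an addition, since the skill only specifies the ship and do-not-ship cases.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    effect: float   # signed so that positive means improvement (assumption)
    p_value: float  # already corrected for multiple comparisons if needed


def decide(primary, guardrails, alpha=0.05):
    """Apply the plan's decision rule to one experiment readout."""
    breached = [g.name for g in guardrails
                if g.p_value < alpha and g.effect < 0]
    if breached:
        # Any guardrail breach blocks the launch regardless of the primary.
        return "do not ship", breached
    if primary.p_value < alpha and primary.effect > 0:
        return "ship", []
    # Not covered by the skill's rule; treated here as "keep collecting or stop".
    return "hold", []


primary = MetricResult("task_success", effect=0.03, p_value=0.01)
guardrails = [
    MetricResult("cost_per_request", effect=-0.08, p_value=0.002),
    MetricResult("latency_p99", effect=0.00, p_value=0.70),
    MetricResult("refusal_rate", effect=0.01, p_value=0.40),
]
decision, reasons = decide(primary, guardrails)
```

In this example the primary metric wins but cost per request regresses significantly, so the rule returns a no-ship decision and names the breached guardrail.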
