| name | experiment-design-checklist |
| description | Generates a rigorous experiment design given a hypothesis. Use when asked to design experiments, plan experiments, create an experimental setup, or figure out how to test a research hypothesis. Covers controls, baselines, ablations, metrics, statistical tests, and compute estimates. |
Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.
Before running ANY experiment, you should be able to answer every item in this checklist.
Convert your research question into falsifiable predictions:
Template:
If [intervention/method], then [measurable outcome], because [mechanism].
Example: If we add retrieval augmentation to the QA model, then exact-match accuracy improves on questions about rare entities, because the model gains access to facts outside its parameters.
Null hypothesis: What does "no effect" look like? This is what you're trying to reject.
Independent Variables (what you manipulate):
| Variable | Levels | Rationale |
|---|---|---|
| [Var 1] | [Level A, B, C] | [Why these levels] |
Dependent Variables (what you measure):
| Metric | How Measured | Why This Metric |
|---|---|---|
| [Metric 1] | [Procedure] | [Justification] |
Control Variables (what you hold constant):
| Variable | Fixed Value | Why Fixed |
|---|---|---|
| [Var 1] | [Value] | [Prevents confound X] |
Every experiment needs comparisons. No result is meaningful in isolation.
Baseline Hierarchy (weakest to strongest):
1. Random/Trivial Baseline — random guessing, majority class, or a constant output
2. Simple Baseline — the simplest method a skeptic would propose first
3. Standard Baseline — the established method practitioners actually use
4. State-of-the-Art Baseline — the best published result under comparable conditions
5. Ablated Self — your method with the novel component removed
For each baseline, document:
- Source (official implementation, reimplementation, or reported numbers)
- Tuning budget, matched to your own method's
- Any deviations from the published setup
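The trivial baseline is often a few lines of code, so there is no excuse to skip it. A minimal sketch for a hypothetical classification task (function name and labels are illustrative, not from any library):

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

# Any proposed method must clear this floor to be interesting.
print(majority_class_accuracy(["a", "a", "b"], ["a", "b", "a", "a"]))  # 0.75
```

If your method barely beats this number, that result belongs in the paper too.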
Ablations answer: "Is each component necessary?"
Ablation Template:
| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---|---|---|---|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |
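The variant rows above can be enumerated mechanically, which helps ensure no component is forgotten. A sketch, assuming components are identified by simple string names:

```python
def ablation_variants(components):
    """Enumerate standard ablation variants for a set of named components."""
    full = set(components)
    variants = [("Full Model", full)]
    for c in components:  # leave-one-out: is each component necessary?
        variants.append((f"w/o {c}", full - {c}))
    for c in components:  # isolation: what does each component contribute alone?
        variants.append((f"{c} only", {c}))
    return variants

for name, enabled in ablation_variants(["A", "B"]):
    print(name, sorted(enabled))
```

For k components this yields 2k + 1 runs, which feeds directly into the compute budget below.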
Good ablations are:
Things that could explain your results OTHER than your hypothesis:
Common Confounds:
| Confound | How to Check | How to Control |
|---|---|---|
| Hyperparameter tuning advantage | Compare tuning budgets across methods | Same budget for all; report procedure |
| Compute advantage | Compare FLOPs/params across methods | Match FLOPs/params; report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
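The data-leakage row is the one most cheaply automated. A minimal sketch for text examples (the normalization choice is an assumption; real checks should also catch near-duplicates, which exact matching misses):

```python
def check_leakage(train_examples, test_examples, normalize=str.strip):
    """Return test examples that also appear (after normalization) in training data."""
    train_set = {normalize(x) for x in train_examples}
    return [x for x in test_examples if normalize(x) in train_set]

leaks = check_leakage(["the cat sat", "dogs bark"], ["dogs bark ", "birds fly"])
print(leaks)  # ['dogs bark '] — whitespace variants still count as leakage
```

Run this before training, not after: leakage discovered post hoc invalidates every number already produced.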
Sample Size:
- Run multiple random seeds (3 at minimum; 5+ for noisy metrics) and enough test instances for your chosen test to have power.
What to Report:
- Mean ± standard deviation (or a confidence interval) across seeds, the number of seeds and instances, and the exact test and α used.
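Summarizing across seeds needs nothing beyond the standard library. A sketch (the normal-approximation CI is an assumption for brevity; with fewer than ~10 seeds a t-distribution or bootstrap interval is more honest, but the shape of the report is identical):

```python
from statistics import mean, stdev, NormalDist

def summarize_seeds(scores, confidence=0.95):
    """Mean, std, and a normal-approximation confidence interval across seed scores."""
    m, s, n = mean(scores), stdev(scores), len(scores)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. ~1.96 for 95%
    half = z * s / n ** 0.5
    return {"mean": m, "std": s, "n": n, "ci": (m - half, m + half)}

print(summarize_seeds([0.81, 0.79, 0.83, 0.80, 0.82]))
```

Reporting the interval rather than a single number is what makes "method A beats method B" checkable.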
Appropriate Tests:
| Comparison | Test | Assumptions |
|---|---|---|
| Two methods, normal data | t-test | Normality, equal variance |
| Two methods, unknown dist | Mann-Whitney U | Independent samples, at least ordinal data |
| Multiple methods | ANOVA + post-hoc | Normality, equal variance |
| Multiple methods, unknown | Kruskal-Wallis | Independent samples, at least ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Paired observations on the same test instances |
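For paired comparisons, a sign-flip permutation test is a distribution-free alternative to the table above that is easy to implement from scratch. A stdlib-only sketch (per-instance scores are hypothetical):

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-instance scores.

    Under the null, each paired difference is equally likely to have either
    sign; the p-value is the fraction of sign assignments at least as extreme
    as the observed difference.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)  # add-one smoothing: p is never exactly 0

p = paired_permutation_test([0.9, 0.8, 0.85, 0.95], [0.7, 0.75, 0.8, 0.85])
print(p)
```

With only 4 pairs there are 16 sign assignments, so no p below 2/16 is attainable; this is the concrete reason small sample sizes cannot yield strong significance claims.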
Avoid:
- Running more seeds until the result becomes significant (p-hacking)
- Many comparisons without correction (e.g., Bonferroni)
- Reporting only the best seed
- Treating p > 0.05 as proof of no effect
Before running, estimate:
| Component | Estimate | Notes |
|---|---|---|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| Total | T GPU-hours | Buffer: 1.5-2x |
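The table rows reduce to simple arithmetic, so the go/no-go decision can be scripted. A rough sketch with hypothetical numbers (it applies the seed multiplier to all runs, matching the table; in practice you might exclude the hyperparameter search from seed replication):

```python
def compute_budget(single_run_hours, n_hparam_runs, n_baseline_runs,
                   n_ablations, n_seeds, buffer=1.75):
    """Rough GPU-hour estimate mirroring the budget table (illustrative only)."""
    core = single_run_hours * (n_hparam_runs + n_baseline_runs + n_ablations)
    total = core * n_seeds
    return {"core": core, "with_seeds": total, "budget": total * buffer}

print(compute_budget(single_run_hours=8, n_hparam_runs=20,
                     n_baseline_runs=4, n_ablations=5, n_seeds=3))
```

Seeing the buffered total before starting is what turns "let's just try it" into an actual go/no-go decision.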
Go/No-Go Decision: Is this feasible with available resources?
Write down BEFORE running:
- Primary metric and the threshold that counts as success
- Which comparisons you will run and which statistical tests you will use
- When you will stop (a fixed budget, not "until significant")
This prevents unconscious goal-post moving.
# Experiment Design: [Title]
## Hypothesis
[Precise statement]
## Variables
### Independent
[Table]
### Dependent
[Table]
### Controls
[Table]
## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]
## Ablations
[Table]
## Confound Mitigation
[Table]
## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]
## Compute Budget
[Table with total estimate]
## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]
## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]
## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]
🚩 "We'll figure out the metrics later"
🚩 "One run should be enough"
🚩 "We don't need baselines, it's obviously better"
🚩 "Let's just see what happens"
🚩 "We can always run more if it's not significant"
🚩 No compute estimate before starting
🚩 Vague success criteria