| name | experiment-design-checklist |
| description | Generates a rigorous experiment design given a hypothesis. Use when asked to design experiments, plan experiments, create an experimental setup, or figure out how to test a research hypothesis. Covers controls, baselines, ablations, metrics, statistical tests, and compute estimates. |
Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.
Before running ANY experiment, you should be able to answer every item in the checklist below.
Convert your research question into falsifiable predictions:
Template:
If [intervention/method], then [measurable outcome], because [mechanism].
Example:
If we add retrieval augmentation, then exact-match QA accuracy improves, because the model can ground its answers in retrieved evidence.
Null hypothesis: What does "no effect" look like? This is what you're trying to reject.
Independent Variables (what you manipulate):
| Variable | Levels | Rationale |
|---|---|---|
| [Var 1] | [Level A, B, C] | [Why these levels] |
Dependent Variables (what you measure):
| Metric | How Measured | Why This Metric |
|---|---|---|
| [Metric 1] | [Procedure] | [Justification] |
Control Variables (what you hold constant):
| Variable | Fixed Value | Why Fixed |
|---|---|---|
| [Var 1] | [Value] | [Prevents confound X] |
Every experiment needs comparisons. No result is meaningful in isolation.
Baseline Hierarchy:
1. Random/Trivial Baseline (random guessing, majority class): sets the floor.
2. Simple Baseline (logistic regression, n-grams, nearest neighbor): is the added complexity earning its keep?
3. Standard Baseline (the method most papers in the area compare against).
4. State-of-the-Art Baseline (the best published result on the same benchmark).
5. Ablated Self (your method with its novel components removed).
For each baseline, document: its source (paper or official implementation), the tuning budget it received, and the published numbers you verified against.
Ablations answer: "Is each component necessary?"
Ablation Template:
| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---|---|---|---|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |
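Ablation variants are easiest to keep honest when they are generated from a single config rather than hand-edited scripts. A minimal sketch, assuming two hypothetical components (`use_attention`, `use_aux_loss`) and a stand-in `run_experiment` function in place of real training:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    use_attention: bool = True  # hypothetical component A
    use_aux_loss: bool = True   # hypothetical component B

FULL = Config()
# Each variant toggles exactly one thing relative to the full model.
# With more components, add isolated-component variants (e.g. "A only") too.
VARIANTS = {
    "full model": FULL,
    "w/o attention": replace(FULL, use_attention=False),
    "w/o aux loss": replace(FULL, use_aux_loss=False),
}

def run_experiment(cfg: Config, seed: int) -> float:
    """Hypothetical stand-in: a real version would train and evaluate a model."""
    return 0.80 + 0.05 * cfg.use_attention + 0.03 * cfg.use_aux_loss + 0.001 * seed

# Run every variant under the same seeds so comparisons are fair.
results = {
    name: [run_experiment(cfg, seed) for seed in range(3)]
    for name, cfg in VARIANTS.items()
}
```

The frozen dataclass plus `replace` guarantees every variant differs from the full model in exactly the fields you name, which is the point of an ablation.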
Good ablations are: independent (change one thing at a time), complete (cover every component you claim matters), and fair (each variant gets the same tuning and compute budget as the full model).
Things that could explain your results OTHER than your hypothesis:
Common Confounds:
| Confound | How to Check | How to Control |
|---|---|---|
| Hyperparameter tuning advantage | Compare tuning budgets across methods | Same tuning budget for all; report the procedure |
| Compute advantage | Compare FLOPs/params across methods | Matched FLOPs/params; report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
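The data-leakage row above can be checked mechanically before any training run. A minimal sketch that fingerprints normalized examples and reports the train/test overlap (the normalization here is an assumption; adjust it to your data format):

```python
import hashlib

def fingerprint(example: str) -> str:
    """Hash a normalized example so whitespace/case differences don't hide overlap."""
    normalized = " ".join(example.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def train_test_overlap(train: list[str], test: list[str]) -> set[str]:
    """Return the test examples whose fingerprint also appears in train."""
    train_hashes = {fingerprint(x) for x in train}
    return {x for x in test if fingerprint(x) in train_hashes}

# Any non-empty overlap is a red flag before experiments start.
train = ["The cat sat.", "Dogs bark loudly."]
test = ["the  cat sat.", "Birds sing."]
leaked = train_test_overlap(train, test)
```

Exact-hash matching only catches verbatim duplicates; for near-duplicates you would need fuzzier fingerprints (e.g. n-gram overlap), but this cheap check catches the most embarrassing failures.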
Sample Size: decide up front how many seeds and test instances you need to detect the effect size you care about; even a rough power analysis beats guessing.
What to Report: mean and standard deviation (or a 95% confidence interval) across seeds, the number of runs, the exact test used, p-values, and effect sizes.
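A minimal, stdlib-only sketch of computing per-seed summary statistics worth reporting (the normal-approximation interval is an assumption; use a t-quantile for very few seeds):

```python
import math
import statistics

def summarize(scores: list[float]) -> dict:
    """Mean, sample std, and a normal-approximation 95% CI across seeds."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)          # sample standard deviation (n - 1)
    half_width = 1.96 * std / math.sqrt(n)  # rough; use a t-quantile for small n
    return {"n": n, "mean": mean, "std": std,
            "ci95": (mean - half_width, mean + half_width)}

# Five hypothetical seed results for one method:
stats = summarize([0.81, 0.83, 0.79, 0.82, 0.80])
```

Reporting the interval rather than a bare mean makes "method A beats method B" claims checkable by the reader.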
Appropriate Tests:
| Comparison | Test | Assumptions |
|---|---|---|
| Two methods, normal data | t-test | Normality, equal variance (Welch's if variances differ) |
| Two methods, unknown dist | Mann-Whitney U | Independent samples, at least ordinal data |
| Multiple methods | ANOVA + post-hoc | Normality, equal variance |
| Multiple methods, unknown | Kruskal-Wallis | Independent samples, at least ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Same test instances, symmetric differences |
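When you're unsure whether the assumptions in the table hold, a permutation test is a safe fallback: it assumes only exchangeability under the null. A stdlib-only sketch on two hypothetical sets of per-seed scores:

```python
import random
import statistics

def permutation_test(a: list[float], b: list[float],
                     n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means.

    No normality assumption; only exchangeability of a and b under the null.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of which scores belong to "a"
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

p = permutation_test([0.81, 0.83, 0.82], [0.78, 0.77, 0.79])
```

Note that with only 3 seeds per method there are just 20 distinct relabelings, so the smallest achievable p-value is bounded well above typical α levels; this is exactly why seed counts belong in the sample-size plan.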
Avoid: p-hacking (running tests until one is significant), multiple comparisons without correction (Bonferroni or similar), claiming significance from a single seed, and confusing statistical significance with practical significance.
Before running, estimate:
| Component | Estimate | Notes |
|---|---|---|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| Total | T GPU-hours | Buffer: 1.5-2x |
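The budget table above is simple arithmetic, but writing it as a function makes the "× seeds, × buffer" multipliers explicit (all the example numbers below are illustrative):

```python
def total_gpu_hours(single_run: float, hp_search_runs: int, baseline_runs: int,
                    ablation_variants: int, seeds: int, buffer: float = 1.75) -> float:
    """Coarse upper bound: every run repeated per seed, then inflated by a
    1.5-2x buffer for crashes, bugs, and reruns. In practice the HP search
    is often not repeated per seed, so treat this as pessimistic."""
    runs = 1 + hp_search_runs + baseline_runs + ablation_variants
    return runs * single_run * seeds * buffer

# e.g. 8 GPU-hour runs, 20-run search, 3 baselines, 4 ablations, 3 seeds:
budget = total_gpu_hours(8, 20, 3, 4, 3)
```

If the resulting number exceeds your allocation, that is the Go/No-Go signal to shrink the design now rather than three months in.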
Go/No-Go Decision: Is this feasible with available resources?
Write down BEFORE running: your hypotheses, primary metric, the minimum effect size that counts as success, the statistical test you will use, and your stopping rule.
This prevents unconscious goal-post moving.
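A pre-registration can be as lightweight as a committed JSON file. A sketch with purely illustrative values (the field names and the `preregistration.json` filename are assumptions, not a standard):

```python
import json

# All values below are illustrative; fill in your own before the first run.
prereg = {
    "hypothesis": "If [intervention], then [measurable outcome], because [mechanism]",
    "primary_metric": "accuracy",
    "minimum_effect": 0.02,  # smallest improvement that counts as success
    "alpha": 0.05,
    "seeds": 5,
    "statistical_test": "Wilcoxon signed-rank (paired, per test instance)",
    "stopping_rule": "run exactly 5 seeds; no peeking-and-extending",
}

with open("preregistration.json", "w") as f:
    json.dump(prereg, f, indent=2)
```

Committing this file before the first run gives you a timestamped record that the success criteria were not adjusted after seeing results.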
# Experiment Design: [Title]
## Hypothesis
[Precise statement]
## Variables
### Independent
[Table]
### Dependent
[Table]
### Controls
[Table]
## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]
## Ablations
[Table]
## Confound Mitigation
[Table]
## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]
## Compute Budget
[Table with total estimate]
## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]
## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]
## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]
🚩 "We'll figure out the metrics later"
🚩 "One run should be enough"
🚩 "We don't need baselines, it's obviously better"
🚩 "Let's just see what happens"
🚩 "We can always run more if it's not significant"
🚩 No compute estimate before starting
🚩 Vague success criteria