| name | gtd |
| description | Warrant-first research GTD system. Manages the capture-clarify-organize-reflect-engage cycle for causal inference research. Scaffolds hypotheses/, insights/, decisions/ directories. Interrogates conjectures, files results, tracks binding decisions, checks pipeline freshness, drives the courtroom checklist. |
| allowed-tools | Read, Write, Edit, Bash, AskUserQuestion |
| argument-hint | [init | conjecture | insight | decide | pipeline | status | courtroom] |
/gtd — Warrant-First Research GTD
Philosophy
Research with an AI thinking partner is iterated dialogue between human judgment and agent throughput, where every cycle either strengthens a warrant or kills a claim. The harness is whatever machinery makes that dialogue fast, honest, and recoverable.
Four elements:
| Element | Definition | Role of dialogue |
|---|---|---|
| Frame | A question worth asking | Interrogated by dialogue |
| Work | A way to interrogate it | Supervised by dialogue (agents make this cheap) |
| Warrant | A way to know you've earned the answer | Built by dialogue — this is the product |
| Dialogue | The substrate across all three | Human and agent argue their way to claims that hold |
The binding constraint has shifted. Pre-agents, Work was binding (coding, cleaning, drafting). Agents make Work cheap. The binding constraint is now Frame and Warrant — what to ask, and whether you've earned the answer. Design the harness around that reallocation.
Commands
/gtd init
Creates the directory structure in the current project:
hypotheses/INDEX.md — DAG of testable claims
insights/INDEX.md — Atomic findings with provenance
decisions/INDEX.md — Binding commitments that constrain the pipeline
dashboard.html — Visual status (serves from localhost)
scripts/build_dashboard_data.py — Regenerates dashboard_data.json
Then asks: "What's the first claim you want to test?"
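The scaffold step can be sketched as below. This is a minimal illustration, not the skill's actual implementation: the `init_scaffold` name and the starter contents are assumptions, and dashboard.html and scripts/build_dashboard_data.py are left out because their contents are not specified here.

```python
from pathlib import Path

# Index files that /gtd init is described as creating.
# The starter contents are hypothetical placeholders.
SCAFFOLD = {
    "hypotheses/INDEX.md": "# Hypothesis DAG\n",
    "insights/INDEX.md": (
        "# Insights Log\n"
        "| Date | Finding | Hypothesis | Status |\n"
        "|---|---|---|---|\n"
    ),
    "decisions/INDEX.md": (
        "| ID | Decision | Date | Rationale |\n"
        "|---|---|---|---|\n"
    ),
}


def init_scaffold(root: Path) -> list[Path]:
    """Create missing scaffold files under root and return the ones created."""
    created = []
    for rel, content in SCAFFOLD.items():
        path = root / rel
        if path.exists():
            continue  # never overwrite existing research records
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
        created.append(path)
    return created
```

Skipping files that already exist makes the step safe to rerun in a project that has already been initialized.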
/gtd conjecture
The clarify step. Adversarial interrogation:
- You state something you believe.
- I run the courtroom checklist:
  - Estimand: What parameter are you trying to learn?
  - Population: On whom?
  - Variation: What source of variation identifies it?
  - Mechanism: What's the treatment assignment process?
  - Falsification: What specific result would kill this?
  - Sub-claims: Can this decompose into independently testable pieces?
- We agree on the precise statement.
- I write hypotheses/HXX_slug.md and update hypotheses/INDEX.md.
/gtd insight
File a result:
- What did we find? (One sentence, exact numbers.)
- Which hypothesis does it speak to?
- What pipeline script produced it? (Must be pipeline, not ad hoc.)
- Is the figure fresh? (Script timestamp vs. output timestamp.)
Writes insights/YYYY-MM-DD_slug.md, updates the linked hypothesis, regenerates dashboard_data.json.
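One way to derive the insights/YYYY-MM-DD_slug.md path from a finding is sketched below. The slug rule (lowercase, runs of non-alphanumeric characters collapsed to underscores) and the `insight_path` helper are assumptions for illustration; the source fixes only the date prefix and the slug suffix.

```python
import re
from datetime import date


def insight_path(title: str, when: date) -> str:
    """Build an insight file path of the form insights/YYYY-MM-DD_slug.md."""
    # Hypothetical slug rule: lowercase, non-alphanumeric runs become "_".
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"insights/{when.isoformat()}_{slug}.md"
```

For example, `insight_path("Urban ATT = 2.3pp", date(2026, 4, 10))` gives `insights/2026-04-10_urban_att_2_3pp.md`.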
/gtd decide
Commit a binding design choice:
- What's the decision?
- Why? (One sentence.)
- What does it constrain downstream?
Writes to decisions/INDEX.md. Updates CLAUDE.md if the decision persists across sessions.
/gtd pipeline
Check freshness:
- For each output, is the source script newer? → stale.
- Does every figure trace to a pipeline script? → orphans flagged.
- When was the pipeline last verified?
Runs python3 scripts/build_dashboard_data.py and reports.
/gtd status
Quick orientation: hypothesis DAG, pipeline freshness, next actions.
/gtd courtroom
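Walk through the difference-in-differences (DiD) checklist stage by stage:
- Stage 1: Show Bite (the event was real)
- Stage 2: Event Studies (dynamic effects; pre-trend coefficients indistinguishable from zero)
- Stage 3: Falsification (the placebo finds nothing)
- Stage 4: Main Results (the headline ATT, the average treatment effect on the treated)
- Stage 5: Mechanisms (why the effect occurs; heterogeneity)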
For each: present the exhibit, interrogate it, confirm or flag. Populates the manuscript view as we go. After completion, draft the narrative from confirmed material in the chosen voice.
The Courtroom (DiD Checklist)
Every quasi-experimental study presents its case. The courtroom is the general form — not just DiD but any design that requires:
- A first-order effect to exist (show bite)
- A credible counterfactual (event study / pre-trends)
- Falsification of confounders (placebo period)
- The estimate itself (main results)
- Understanding of why (mechanisms)
Two cross-cutting standards apply to ALL stages:
- Beautiful — figures and tables communicate clearly
- Verified — pipeline reproducibility, referee2 audits, number consistency
File Formats
Hypothesis (hypotheses/HXX_slug.md)
---
id: H01a
status: conjecture | testing | confirmed | rejected | complicated
parent: H01
date_proposed: 2026-05-19
---
## Claim
[One sentence, testable.]
## Courtroom
- Estimand: [what parameter]
- Population: [on whom]
- Variation: [what identifies it]
- Falsification: [what kills it]
## Evidence
- [links to insights, added as they accumulate]
Insight (insights/YYYY-MM-DD_slug.md)
---
date: 2026-04-10
updates: H01a
result: confirmed | rejected | complicated
stage: [2, 4] # optional — courtroom stage(s) this speaks to. Overrides keyword matching.
script: scripts/r/05_estimate_did.R
output: output/figures/event_study.pdf
---
## Finding
[The fact. Numbers. Script path. What it means for the hypothesis.]
## Key Numbers
[Table with point estimate, SE, CI, p-value, N]
## Context
[Specification details, baseline, relative magnitude]
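A dependency-free reader for the frontmatter above might look like the following sketch. It assumes flat `key: value` pairs, an optional trailing `#` comment, and bracketed integer lists such as the `stage` field; the `parse_frontmatter` name is hypothetical, and a full YAML parser would be the safer choice for anything more complex.

```python
def parse_frontmatter(text: str) -> dict:
    """Parse the flat frontmatter block at the top of an insight or hypothesis file."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter present
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key: value pairs
        value = value.split(" #", 1)[0].strip()  # drop a trailing comment
        if value.startswith("[") and value.endswith("]"):
            # e.g. "[2, 4]" becomes [2, 4]
            value = [int(v) for v in value[1:-1].split(",") if v.strip()]
        meta[key.strip()] = value
    return meta
```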
Decision (decisions/INDEX.md)
Table format. One row per binding decision:
| ID | Decision | Date | Rationale |
|---|---|---|---|
| D01 | Primary estimator is TWFE with district and week FE | 2026-04-01 | Sufficient pre-periods; no staggered-timing bias |
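Appending a row to this table can be sketched as follows. The sequential two-digit `DNN` numbering is inferred from the single `D01` example, and `append_decision` is a hypothetical helper; cell text containing `|` would need escaping, which this sketch does not do.

```python
import re


def append_decision(index_text: str, decision: str, date: str, rationale: str):
    """Append one decision row and return (new_text, new_id)."""
    # Collect existing numeric IDs from rows that start with "| Dnn |".
    ids = [int(n) for n in re.findall(r"^\|\s*D(\d+)\s*\|", index_text, flags=re.M)]
    new_id = f"D{max(ids, default=0) + 1:02d}"
    row = f"| {new_id} | {decision} | {date} | {rationale} |"
    return index_text.rstrip("\n") + "\n" + row + "\n", new_id
```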
Status Transitions
conjecture → testing: First pipeline script assigned to test this hypothesis
testing → confirmed: Positive evidence + falsification passes (Stages 2-4 confirmed)
testing → rejected: Evidence contradicts + falsification confirms the negative
testing → complicated: Evidence mixed OR falsification fails
complicated → confirmed: Complication resolved (new evidence or new design)
complicated → rejected: Further investigation confirms failure
Rules:
- A hypothesis CANNOT move to confirmed without passing falsification (Stage 3).
- A hypothesis CAN move directly from conjecture to rejected (if the "kills it" condition is met immediately).
- complicated is NOT terminal; it requires resolution.
- Parent hypothesis status = worst child status (if any child is complicated, parent is at most testing).
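The transitions and the falsification rule can be encoded as a small guard. This is a sketch under stated assumptions: `ALLOWED` and `can_transition` are hypothetical names, the `falsification_passed` flag stands in for whatever record shows Stage 3 passed, and the parent roll-up rule is left out because the full ordering of statuses is not spelled out.

```python
# Every transition named in the list and rules above, as (current, new) pairs.
ALLOWED = {
    ("conjecture", "testing"),
    ("conjecture", "rejected"),  # "kills it" condition met immediately
    ("testing", "confirmed"),
    ("testing", "rejected"),
    ("testing", "complicated"),
    ("complicated", "confirmed"),
    ("complicated", "rejected"),
}


def can_transition(current: str, new: str, falsification_passed: bool = False) -> bool:
    """Return True if a hypothesis may move from current to new."""
    if (current, new) not in ALLOWED:
        return False
    # A hypothesis cannot be confirmed without passing falsification (Stage 3).
    if new == "confirmed" and not falsification_passed:
        return False
    return True
```

Under this encoding, confirmed and rejected are terminal because no pair starts from them.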
Pipeline Levels
| Level | Name | Contains | Example |
|---|---|---|---|
| 1 | Cleaning | Raw → clean; format standardization | 00_clean_survey.py |
| 2 | Derived | Clean → derived variables; joins, constructs | 02_build_panel.py |
| 3 | Classification | Derived → treatment/control assignment | 03_classify_treated.py |
| 4 | Figures | Descriptive outputs, maps, timelines | 04_descriptive_figures.R |
| 5 | Estimation | Causal inference; the main results | 05_estimate_did.R |
Rules:
- A level-N script may only read outputs from levels < N
- Numbering within level is sequential (00, 01, 02...)
- Language suffix indicates the tool (.py, .R, .do)
- Every output in output/figures/ must map to exactly one pipeline script
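Two of these rules lend themselves to mechanical checks. The sketch below assumes the caller supplies the metadata, since the source does not say how a script's level, inputs, or outputs are recorded: `levels` maps script to level, `reads` maps a script to the scripts whose outputs it consumes, and `manifest` maps a script to the figures it writes. All function and parameter names are hypothetical.

```python
def level_violations(levels: dict, reads: dict) -> list:
    """Return (reader, producer) pairs where a script reads from its own or a higher level."""
    return [
        (reader, producer)
        for reader, producers in reads.items()
        for producer in producers
        if levels[producer] >= levels[reader]
    ]


def audit_figures(figures: list, manifest: dict) -> dict:
    """Flag figures with no producing script (orphans) or more than one (ambiguous)."""
    producers = {}
    for script, outputs in manifest.items():
        for out in outputs:
            producers.setdefault(out, []).append(script)
    return {
        "orphans": [f for f in figures if f not in producers],
        "ambiguous": [f for f in figures if len(producers.get(f, [])) > 1],
    }
```

An empty result from both functions means the level ordering and the one-script-per-figure rule hold for the supplied metadata.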
Freshness
Freshness is computed dynamically by comparing file modification times:
- output.mtime >= script.mtime → FRESH (output generated after script was last modified)
- output.mtime < script.mtime → STALE (script changed since output was generated)
- Output does not exist → MISSING
Freshness is NEVER stored as an authoritative field; build_dashboard_data.py always computes it at runtime. If an insight's frontmatter carries a fresh field, treat it as a snapshot taken at filing time that the dashboard recomputes.
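The comparison itself is a few lines of Python. This is a sketch of the rule as stated, not the contents of build_dashboard_data.py; the `freshness` function name is an assumption, and it expects the script file to exist.

```python
from pathlib import Path


def freshness(script: Path, output: Path) -> str:
    """Classify an output as FRESH, STALE, or MISSING from file modification times."""
    if not output.exists():
        return "MISSING"
    # An output written at or after the script's last edit counts as fresh.
    if output.stat().st_mtime >= script.stat().st_mtime:
        return "FRESH"
    return "STALE"
```

Because the result depends only on the filesystem, rerunning the check after editing a script flips its outputs to STALE without any stored state to update.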
INDEX.md Formats
hypotheses/INDEX.md — Hierarchical DAG
# Hypothesis DAG
## H01 — Main Claim
Status: **testing**
One sentence description.
### H01a — Sub-claim
Status: **confirmed** (date)
One sentence description.
Two levels: parent hypotheses (##) and children (###). Each entry has bold status inline.
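Reading this format back into a structure could look like the sketch below. It assumes each heading is `##` or `###`, followed by an ID starting with `H`, a single separator token, and a title, with the status on a later `Status: **...**` line; `parse_dag` and the dictionary keys are hypothetical.

```python
import re

# "## H01 <separator> Title" for parents, "### H01a <separator> Title" for children.
HEADING = re.compile(r"^(#{2,3})\s+(H\w+)\s+\S+\s+(.+)$")
STATUS = re.compile(r"^Status:\s+\*\*(\w+)\*\*")


def parse_dag(text: str) -> list:
    """Return one dict per hypothesis with id, title, parent, and status."""
    nodes, parent = [], None
    for line in text.splitlines():
        heading = HEADING.match(line)
        if heading:
            hashes, hid, title = heading.groups()
            if hashes == "##":
                parent = hid  # later ### entries attach to this parent
                nodes.append({"id": hid, "title": title, "parent": None, "status": None})
            else:
                nodes.append({"id": hid, "title": title, "parent": parent, "status": None})
            continue
        status = STATUS.match(line)
        if status and nodes and nodes[-1]["status"] is None:
            nodes[-1]["status"] = status.group(1)
    return nodes
```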
insights/INDEX.md — Table
# Insights Log
| Date | Finding | Hypothesis | Status |
|---|---|---|---|
| 2026-04-15 | [Placebo is null](file.md) | H01a | confirmed |
| 2026-04-10 | [Urban ATT = 2.3pp](file.md) | H01a | confirmed |
Most recent first. Links to individual insight files.
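Keeping the table newest-first when a row is added can be sketched like this. `add_insight_row` is a hypothetical helper; it assumes ISO dates in the first column (so string order equals date order) and a single table in the file.

```python
def add_insight_row(index_text, date, finding, link, hypothesis, status):
    """Insert one row into the insights table and keep rows newest-first."""
    row = f"| {date} | [{finding}]({link}) | {hypothesis} | {status} |"
    lines = index_text.splitlines()
    for i, line in enumerate(lines):
        if line.replace(" ", "").startswith("|---"):
            # Rows are the contiguous "|" lines that follow the separator.
            end = i + 1
            while end < len(lines) and lines[end].startswith("|"):
                end += 1
            rows = lines[i + 1 : end] + [row]
            # ISO dates sort correctly as strings; the sort is stable for ties.
            rows.sort(key=lambda r: r.split("|")[1].strip(), reverse=True)
            return "\n".join(lines[: i + 1] + rows + lines[end:]) + "\n"
    raise ValueError("no table separator row found")
```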
Courtroom → Dashboard Flow
When /gtd courtroom confirms a stage:
- The relevant insight(s) are filed (if not already)
- The linked hypothesis status may update
- build_dashboard_data.py regenerates the JSON
- Dashboard Courtroom tab shows the stage as confirmed (green)
- Dashboard Manuscript tab allows the confirmed material to appear
When /gtd courtroom flags a stage as complicated:
- An insight is filed with result: complicated
- The linked hypothesis moves to complicated
- Dashboard Courtroom tab shows the stage with a yellow indicator
- Manuscript tab moves that material to "Unearned"
Hooks
Only add hooks for failures that are silently wrong (produce plausible but incorrect output).
Do hook: Classification file changes but county file not rebuilt → wrong treatment set → wrong ATT → presented wrong numbers. Silent failure. Hook it.
Don't hook: Missing figure → LaTeX won't compile. Visible failure. Don't hook it.
Starter hook (adapt paths to your project; the command reads the hook's JSON payload from stdin):
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write",
      "hooks": [{
        "type": "command",
        "command": "if grep -q 'LINCHPIN_FILE_NAME'; then echo '⚠️ PIPELINE DEPENDENCY: Rebuild downstream'; fi"
      }]
    }]
  }
}
Dashboard
The dashboard (dashboard.html) reads from dashboard_data.json generated by scripts/build_dashboard_data.py. It shows:
- Status — pipeline freshness, hypothesis summary, latest finding, next actions
- Courtroom — 5-stage checklist with expandable evidence panels
- Pipeline — scripts grouped by level with freshness indicators
- Hypotheses — claim DAG with color-coded status
- Decisions — binding commitments table
- Figures — all outputs: pipeline vs. orphaned, fresh vs. stale
- Manuscript — only confirmed claims with fresh evidence appear here; unearned claims are listed separately
Serve with: cd project_root && python3 -m http.server 8080
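The Manuscript rule (confirmed claims with fresh evidence; everything else unearned) reduces to a simple filter. The sketch below assumes each claim is a dict with `id`, `status`, and an `evidence` list of freshness labels; these field names and `split_manuscript` are hypothetical, and treating a confirmed claim with no evidence as unearned is an assumption.

```python
def split_manuscript(claims: list) -> tuple:
    """Split claim IDs into (earned, unearned) for the Manuscript view."""
    earned, unearned = [], []
    for claim in claims:
        evidence = claim["evidence"]
        ok = (
            claim["status"] == "confirmed"
            and len(evidence) > 0  # assumption: no evidence means unearned
            and all(label == "FRESH" for label in evidence)
        )
        (earned if ok else unearned).append(claim["id"])
    return earned, unearned
```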
GTD Mapping
| GTD Stage | Research Equivalent | Mechanism |
|---|---|---|
| Capture | Ideas emerge through dialogue | The chat itself |
| Clarify | Courtroom checklist + interrogation | /gtd conjecture or /gtd courtroom |
| Organize | Commit to directory | hypotheses/, decisions/, CLAUDE.md |
| Reflect | Dashboard review | dashboard.html |
| Engage | Run the pipeline | scripts/ → output/ |
Principles
- The pipeline is the source of truth. A figure only counts if it traces to a numbered pipeline script.
- Freshness is visible. You should never wonder whether an output is current.
- Decisions bind. Once committed, they constrain downstream work across sessions.
- Hypotheses are falsifiable. Every one has a "kills it" condition written before the test.
- The conversation is the inbox. It generates ideas. The directory captures them.
- Warrant is the product. Not the coefficient — the structure that earns the right to assert it.
- Verification is cheap and constant. Not a quality gate at the end.