| name | developing-test-first |
| description | Enforces strict Red/Green/Refactor TDD discipline before writing production code. Use when implementing any feature or bugfix, or when the user asks to write tests, implement test-first, or practice TDD. Trigger phrases include "write a failing test", "red/green", "test-driven", "TDD", "test first". Also applies when implementation begins without a preceding test. |
Developing Test-First
Overview
Write the test first. Watch it fail. Write minimal code to pass.
Core principle: If you didn't watch the test fail, you don't know if it tests the right thing.
This is a discipline skill — absolute rules, no exceptions. Violating the letter of the rules is violating the spirit of the rules.
When to Use
Always:
- New features
- Bug fixes
- Behavior changes
- Refactoring
Exceptions (ask your human partner):
- Throwaway prototypes
- Generated code
- Configuration files
Do NOT use for:
- Test strategy and layer design → driving-with-tests
- Classifying CI failure output → autofixing-and-escalating
- Test framework API docs → understanding-code-context
Thinking "skip TDD just this once"? Stop. That's rationalization.
The Iron Law
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
Write code before the test? Delete it. Start over.
No exceptions:
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete
Implement fresh from tests. Period.
Anti-Pattern: Horizontal Slicing
The Iron Law forbids production code without a failing test, but it does not by itself forbid writing five tests up front and then five implementations. That sequence — horizontal slicing — is the most common way TDD silently fails.
WRONG (horizontal):
RED RED RED RED RED → GREEN GREEN GREEN GREEN GREEN
RIGHT (vertical / tracer bullet):
RED → GREEN, RED → GREEN, RED → GREEN, ...
Why horizontal slicing produces bad tests:
- Tests written in bulk verify imagined behavior, not actual behavior. The implementation hasn't been built yet, so the tests can only assert against guesses.
- They test the shape of things — data structures, function signatures, return types — instead of user-facing behavior.
- They become insensitive to real changes. They pass when the system breaks in ways the bulk-author didn't anticipate; they fail when behavior is fine but a signature shifted.
- You outrun your headlights. Committing to test structure before any implementation feedback locks you into assumptions you haven't validated.
Why vertical slicing works:
- Each test responds to what the previous cycle taught you
- You write the test you actually need, because you just felt the implementation
- Tests stay honest because they were written against working code, not imagination
Rule: one test, one implementation, one cycle. Then the next.
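A minimal sketch of the rhythm, assuming pytest and a hypothetical `slugify` helper (neither is prescribed by this skill); the point is the ordering, not the function:

```python
# Horizontal (wrong): write test_lowercases, test_strips_punctuation,
# test_collapses_spaces, test_handles_unicode, test_truncates all at once,
# each asserting against an imagined slugify() that doesn't exist yet.

# Vertical (right) -- cycle 1, RED: the test file holds exactly one test.
# slugify() is intentionally absent at this moment; that is the RED state.
def test_slugify_lowercases_and_joins_words_with_hyphens():
    assert slugify("Hello World") == "hello-world"

# Cycle 1, GREEN: implement just enough slugify() to pass, run the suite.
# Cycle 2, RED: only now write the next test (say, punctuation stripping),
# shaped by what cycle 1 taught you. Repeat.
```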
Red-Green-Refactor
```dot
digraph tdd_cycle {
  rankdir=LR;
  red [label="RED\nWrite failing test", shape=box, style=filled, fillcolor="#ffcccc"];
  verify_red [label="Verify fails\ncorrectly", shape=diamond];
  green [label="GREEN\nMinimal code", shape=box, style=filled, fillcolor="#ccffcc"];
  verify_green [label="All tests\npass?", shape=diamond];
  refactor [label="REFACTOR\nClean up", shape=box, style=filled, fillcolor="#ccccff"];
  next [label="Next\nrequirement", shape=ellipse];
  red -> verify_red;
  verify_red -> green [label="yes"];
  verify_red -> red [label="wrong\nfailure"];
  green -> verify_green;
  verify_green -> refactor [label="yes"];
  verify_green -> green [label="no"];
  refactor -> verify_green [label="stay\ngreen"];
  verify_green -> next;
  next -> red;
}
```
RED — Write Failing Test
Write one minimal test showing what should happen.
Requirements:
- One behavior per test
- Clear name that describes expected behavior
- Real code, not mocks (unless external dependency is unavoidable)
Verify RED — MANDATORY. Never skip.
Run the test. Confirm:
- Test fails (not errors — syntax errors and import failures don't count)
- Failure message matches expectations
- Fails because the feature is missing, not because of typos or setup
Test passes? You're testing existing behavior. Fix the test.
Test errors? Fix the error, re-run until it fails on the assertion.
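A sketch of a RED step under hypothetical assumptions (pytest, a `slugger` module and `slugify` function invented for illustration). The test asserts one behavior and uses no mocks:

```python
# test_slugger.py -- cycle 1, RED
from slugger import slugify


def test_slugify_lowercases_and_joins_words_with_hyphens():
    # One behavior, named after that behavior, no mocks.
    assert slugify("Hello World") == "hello-world"
```

The stub below exists only so the failure lands on the assertion rather than on an import:

```python
# slugger.py -- a stub only, so the test fails on its assertion
# ("" != "hello-world") instead of erroring on a missing import.
def slugify(title: str) -> str:
    return ""
```

Run it, read the failure message, and confirm it fails because the behavior is missing.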
GREEN — Minimal Code
Write the simplest code to pass the test. Nothing more.
Don't add features, refactor other code, or "improve" beyond what the test demands.
Verify GREEN — MANDATORY.
Run the test. Confirm:
- New test passes
- All other tests still pass
- Output is clean (no errors, no warnings)
New test fails? Fix code, not test.
Other tests fail? Fix the regression now.
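Continuing the hypothetical slugify example, a GREEN step writes only what the one failing test demands:

```python
# slugger.py -- cycle 1, GREEN
def slugify(title: str) -> str:
    # Lowercase and join whitespace-separated words with hyphens.
    # No punctuation handling, no edge cases: no test has asked for them yet.
    return "-".join(title.lower().split())
```

Re-run the whole suite, not just the new test, and check that the output is clean before moving on.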
REFACTOR — Clean Up
After green only:
- Remove duplication
- Improve names
- Extract helpers
Keep all tests green. Don't add behavior during refactoring.
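A sketch of a refactor on the same hypothetical module, assuming a few later cycles have added punctuation stripping: the helper extraction changes structure only, and every existing test should stay green.

```python
# slugger.py -- refactor: name the normalization step, behavior unchanged
import re


def _normalize(title: str) -> str:
    # Lowercase and drop anything that is not a letter, digit, or whitespace.
    return re.sub(r"[^a-z0-9\s]", "", title.lower())


def slugify(title: str) -> str:
    return "-".join(_normalize(title).split())
```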
Then: next failing test for the next requirement.
Why Order Matters
Tests written after code pass immediately — proving nothing. Tests-first force you to see the failure, proving the test actually catches the bug. Tests-after verify what you remembered. Tests-first discover what you didn't.
If you wrote code before the test: delete it. The sunk cost fallacy says "keep it." Reality says keeping unverified code is technical debt. See Common Rationalizations below for common excuses and their rebuttals.
Common Rationalizations
See reference/anti-patterns.md for the full rationalization table with rebuttals.
Red Flags — STOP and Start Over
- Wrote code before writing a test
- Test written after implementation
- Test passes immediately on first run
- Can't explain why the test failed
- Tests planned for "later" or "follow-up"
- Rationalizing "just this once"
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "Keep as reference" or "adapt existing code"
- "Already spent X hours, deleting is wasteful"
- "TDD is dogmatic, I'm being pragmatic"
- "This is different because..."
All of these mean: Delete code. Start over with TDD.
When Stuck
| Problem | Solution |
|---|---|
| Don't know how to test | Write the wished-for API. Write assertion first. Ask your human partner. |
| Test too complicated | Design too complicated. Simplify the interface. |
| Must mock everything | Code too coupled. Use dependency injection (see the sketch after this table and reference/anti-patterns.md). |
| Test setup is huge | Extract test helpers. Still complex? Simplify the design. |
| Flaky test | Find root cause — timing, shared state, ordering. Never skip or retry. |
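For the "must mock everything" row, a sketch of the dependency-injection fix with hypothetical names: the collaborator is passed in, so a test can hand over a trivial fake instead of patching globals.

```python
from dataclasses import dataclass
from typing import Protocol


class RateSource(Protocol):
    # The only capability the code under test actually needs.
    def rate(self, currency: str) -> float: ...


def convert(amount: float, currency: str, rates: RateSource) -> float:
    # The dependency arrives as an argument; no patching required in tests.
    return amount * rates.rate(currency)


@dataclass
class FixedRates:
    value: float

    def rate(self, currency: str) -> float:
        return self.value


def test_convert_multiplies_amount_by_rate():
    assert convert(10.0, "EUR", FixedRates(1.5)) == 15.0
```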
Verification Checklist
See reference/anti-patterns.md for the full checklist. Summary: every function has a test, each test was watched failing, minimal code was written to pass, all tests pass cleanly.
Troubleshooting
| Problem | Fix |
|---|---|
| Test fails with error, not assertion failure | Fix syntax/import errors first. RED means assertion failure, not crash. |
| Test passes immediately on first run | You're testing existing behavior. Rewrite the test to target the missing feature. |
| Flaky test (passes sometimes, fails sometimes) | Find root cause: timing, shared state, test ordering. Never skip or retry. |
| Must mock everything to test a function | Code is too coupled. Refactor with dependency injection. See reference/anti-patterns.md. |
| Test setup is longer than test logic | Extract test helpers. If still complex, simplify the design under test. |
| Unsure what to test first | Write the wished-for API. Start with the assertion, work backward (see the sketch after this table). |
| Existing codebase has no tests | Add tests for the code you are changing. Do not boil the ocean. |
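For the "unsure what to test first" row, a sketch of working backward from the wished-for API (the `Cart` class and its methods are hypothetical):

```python
# Step 1: write the assertion against the API you wish existed.
#     assert cart.total() == 30
# Step 2: work backward to the setup that makes the assertion meaningful.
def test_cart_total_sums_item_prices():
    cart = Cart()               # Cart doesn't exist yet; that's the point
    cart.add("book", price=10)
    cart.add("pen", price=20)
    assert cart.total() == 30
# Step 3: stub Cart just enough that the run fails on the assertion,
# not on a NameError, then proceed with the normal RED/GREEN cycle.
```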
Language Detection
Detect your human partner's language from conversation context, project docs, and git history. Default to English. Adapt all user-facing messages; keep code, test names, and commands in English.
Composability
- Standalone: Use for any coding task — no other skill required
- With driving-with-tests: Pair for full test strategy (Orient → TDD → Probe → Guard)
- Within ralph iterations: Apply TDD discipline inside each iteration
Reference
reference/anti-patterns.md — Mocking pitfalls, test-only methods, Gate Functions