| name | ff-debug |
| description | Bug fixing and debugging for ANY error, crash, loss divergence, gradient explosion, distributed hang, NaN, or unexpected behavior. Covers quick fixes and full protocol with 5-phase investigation. Trigger: 'fix bug', 'fix error', 'broken', 'crash', 'doesn't work', 'fails with', 'loss NaN', 'training hangs', 'OOM'. |
Debug Workflow
Related Topics (read for numerical / consistency issues)
- NaN, loss divergence, wrong gradients ->
topics/train_inference_consistency.md
- Dtype mismatch, overflow, precision ->
topics/dtype_precision.md
Two Pathways
Quick Path (obvious root cause)
Use when: Error message clearly points to the issue (typo, missing import, wrong type).
- Reproduce the error
- Check
.agents/knowledge/constraints.md for relevant constraints
- Write targeted fix
- Verify with test
- Run
/ff-review, commit
If not resolved in 15 min -> switch to Full Protocol.
Full Protocol (complex issues)
Use when:
- Distributed training bugs (deadlocks, rank mismatches)
- Numerical issues (NaN, loss divergence, wrong gradients)
- Silent failures (training runs but produces garbage)
- Multiple failed fix attempts
Full Protocol — Five Phases
Phase 1: Root Cause Investigation
- Read complete error messages — Full stack traces matter, don't skim
- Consult constraints — Check
.agents/knowledge/constraints.md
- Reproduce consistently — Isolate the exact trigger condition
- Trace execution path — Follow through the 6-stage pipeline
- Check recent changes —
git log --oneline -10 — what changed recently?
Distributed-Specific Checklist
- Does the error appear on all ranks or just one?
- Is
accelerator.wait_for_everyone() missing before the failure point?
- Are frozen components synchronized across ranks? (Constraint #19)
- Is ZeRO-3 being used? (Constraint #10 — unsupported)
Phase 2: Pattern Analysis
- Find working examples — Compare with a similar model/algorithm that works
- Diff analysis — What's different between working and broken paths? Compare completely — diff line by line, not skim. Include config YAML and environment vars.
- Isolate variables — Change one thing at a time
- Check dependencies — Different diffusers version? Different PyTorch version?
Phase 3: Hypothesis Testing
- One hypothesis per iteration — Formulate a single falsifiable hypothesis
- Minimal test case — Reproduce with smallest possible config
- Low confidence (<80%)? — Add debug logging before applying fix
Red flags — STOP and restart from Phase 1:
- "Let me just try changing X and see what happens"
- "Quick fix for now, clean up later"
- "It probably works, let me move on"
Verification gate — before acting on a conclusion, check:
- Does the evidence actually support this cause, or just correlate?
- Could a different root cause produce the same symptoms?
- What observation would disprove this hypothesis? Have you looked for it?
Phase 4: Fix Implementation
- Write failing test first (if possible)
- Implement targeted fix — Only fix the bug, don't refactor
- Check cross-algorithm impact — Does this fix break GRPO? NFT? AWM?
- Check cross-model impact — Test with at least two model adapters
- Before committing: run
/ff-review skill.
Phase 5: Knowledge Capture
After fix is verified:
- Update
constraints.md if a new constraint was discovered
- Add regression test if applicable
- Document the root cause in the commit message
- Follow fix archival process in
topics/fix_patterns.md
Three-Strike Rule
If the same approach fails three times:
- HALT all fix attempts
- Question whether the underlying approach/architecture is wrong
- Step back and re-examine: are you solving the right problem?
- Report to user with analysis before continuing
Common Issue Categories
Training Loop Issues
Model Adapter Issues
Reward Issues
Configuration Issues
Distributed Issues