| name | systematic-debugging |
| description | Applies 4-phase root cause analysis: gather data, form hypothesis, test hypothesis, solve and verify. Includes root-cause-tracing, defense-in-depth, and condition-based-waiting techniques. Activates for any bug. |
Systematic Debugging
The systematic-debugging skill applies a disciplined approach to finding and fixing bugs through structured root cause analysis, preventing guesswork and ensuring fixes actually solve the problem.
When to Use
Activate systematic-debugging when:
- Bug is reported
- Test is failing
- Feature not working as expected
- Error occurs in production
- Something breaks unexpectedly
The 4-Phase Debugging Process
Phase 1: Gather Data
Goal: Understand the problem fully before attempting fixes.
Questions to answer:
- What is the expected behavior?
- What is the actual behavior?
- When does it happen? (time, frequency)
- Where does it happen? (code location, environment)
- Who does it affect? (users, systems)
- How can it be reproduced?
How to gather data:
1. Read error messages carefully:
❌ Database connection failed: "Connection refused"
✅ What's the full error?
✅ What database? Where is it hosted?
✅ What credentials are being used?
✅ Has it ever worked?
✅ What changed recently?
2. Check logs:
journalctl -u myservice --since "1 hour ago"
tail -f logs/app.log
grep -i error logs/app.log | tail -50
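When the logs are already loaded inside a test or a small diagnostic tool, the `grep -i error ... | tail -50` filter above can be approximated in-process. This is a minimal sketch; `last_error_lines` is a hypothetical helper name, not part of any existing tool.

```rust
/// Returns the last `limit` lines of `log` that mention "error", ignoring case.
/// Roughly equivalent to `grep -i error | tail -<limit>`.
fn last_error_lines(log: &str, limit: usize) -> Vec<&str> {
    let matches: Vec<&str> = log
        .lines()
        .filter(|line| line.to_lowercase().contains("error"))
        .collect();
    // Keep only the tail of the matches, like `tail -N`.
    let start = matches.len().saturating_sub(limit);
    matches[start..].to_vec()
}
```

The input could come from `std::fs::read_to_string("logs/app.log")`, assuming the log file fits comfortably in memory.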
3. Reproduce the bug:
./myapp --input problematic_data
RUST_BACKTRACE=1 cargo run
4. Gather context:
- User report details
- System state at time of failure
- Recent changes (commits, deployments)
- Environment differences
- Related issues
Document what you know:
## Bug: Login fails for some users
**Symptoms:**
- Users with email containing "+" cannot login
- Example: "user+tag@example.com" fails
- Example: "user@example.com" works
**Reproduction:**
1. Create user with email "user+tag@example.com"
2. Attempt login
3. Result: 401 Unauthorized
**Error:**
[Error message here]
**When it started:**
- Last 24 hours
- After deployment of commit abc123
**Affected users:**
- ~5% of users (those with "+" in email)
Phase 2: Form Hypothesis
Goal: Develop testable theories about the root cause.
Don't assume - hypothesize:
❌ "It's probably a database issue"
❌ "The server must be overloaded"
❌ "This always happens"
✅ "If the bug is in the email parsing, then I should see it fail at that exact line"
✅ "If it's a connection timeout, then increasing the timeout should help"
How to form good hypotheses:
1. Consider the "NUTS" framework:
- Normal - What should happen?
- Under what conditions - When does it break?
- Theory - What could cause this?
- Specific - Where exactly does it fail?
2. List possible causes:
## Possible Causes
1. Email parsing fails on "+" character
2. Database doesn't accept "+" in email field
3. Auth service has regex that excludes "+"
4. Frontend form validation blocks "+"
5. Email verification logic rejects "+"
3. Rank by likelihood:
## Most Likely (Based on Error Message)
1. **Auth regex excludes "+"** - Error mentions "invalid format" - likelihood: HIGH
2. Email parsing fails - likelihood: MEDIUM
3. Database rejects - likelihood: LOW (other emails work)
## Tests to Verify:
Test 1: Check auth regex
Test 2: Test email parsing function
Test 3: Check database schema
4. Form specific hypothesis:
## Hypothesis
"If the bug is in the email validation regex (cause #1),
then:
- The regex should be rejecting emails with '+'
- Fixing the regex should allow '+' emails
- The fix should not break valid email formats"
Phase 3: Test Hypothesis
Goal: Verify or disprove your hypothesis through controlled experiments.
Design experiments:
1. Isolate the variable:
# Instead of exercising everything at once:
./app --all-features --with-database --with-cache
# Exercise only the suspect component:
./app --only-email-validation --test-email "user+tag@example.com"
2. Test the hypothesis directly:
#[test]
fn test_email_with_plus_sign() {
let email = "user+tag@example.com";
let result = validate_email(email);
assert!(result.is_ok(), "Email with + should be valid");
}
3. Binary search:
## Binary Search Strategy
If bug appeared in commit range 100-200:
- Test commit 150
- Works? → Bug is in 150-200
- Broken? → Bug is in 100-150
- Continue halving until you isolate the exact commit
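The halving strategy above can be sketched as a small function. In a Git repository, `git bisect` automates the same search. The name `first_bad_commit` and the `is_broken` callback are hypothetical. The sketch assumes the first commit in the range is known to work, the last is known to be broken, and the bug stays present once introduced.

```rust
/// Finds the first broken commit between `good` and `bad` by repeated halving.
/// Assumes `good` works, `bad` is broken, and every commit after the first
/// broken one is also broken.
/// `is_broken` is a hypothetical stand-in for "check out the commit and run
/// the reproduction".
fn first_bad_commit(good: u32, bad: u32, mut is_broken: impl FnMut(u32) -> bool) -> u32 {
    let (mut lo, mut hi) = (good, bad);
    // Invariant: `lo` is known to work, `hi` is known to be broken.
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if is_broken(mid) {
            hi = mid; // the bug was introduced at or before `mid`
        } else {
            lo = mid; // the bug was introduced after `mid`
        }
    }
    hi
}
```

For a range of 100 commits this needs at most 7 checks, compared with up to 100 for a linear scan.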
4. Compare working vs. broken:
# Known-good input
echo "user@example.com" | ./validate
# Failing input
echo "user+tag@example.com" | ./validate
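The same working-versus-broken comparison can be run in-process against a model of the suspect code. The sketch below is a hand-written, standard-library-only approximation of the pattern `^[\w]+@[\w]+\.[\w]+` that Phase 4 identifies as the culprit. It is an assumption-laden stand-in: `matches_suspected_pattern` is a hypothetical name, and `\w` is approximated as "alphanumeric or underscore", which is slightly narrower than the `regex` crate's Unicode definition.

```rust
/// Approximates `\w` as "alphanumeric or underscore" (an assumption).
fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Hand-written stand-in for `^[\w]+@[\w]+\.[\w]+` (note: no end anchor).
fn matches_suspected_pattern(email: &str) -> bool {
    // `^[\w]+@`: everything before the first '@' must be word characters.
    let (local, rest) = match email.split_once('@') {
        Some(parts) => parts,
        None => return false,
    };
    if local.is_empty() || !local.chars().all(is_word) {
        return false;
    }
    // `[\w]+\.`: a non-empty run of word characters followed by a literal '.'.
    let host_len: usize = rest
        .chars()
        .take_while(|&c| is_word(c))
        .map(char::len_utf8)
        .sum();
    if host_len == 0 || !rest[host_len..].starts_with('.') {
        return false;
    }
    // `[\w]+`: at least one word character after the '.'.
    rest[host_len + 1..].chars().next().map_or(false, is_word)
}
```

Running both inputs through the model shows that `"user@example.com"` matches while `"user+tag@example.com"` does not. It also shows that a local part containing `.` or `-` fails the same way, which is worth knowing before writing the fix.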
Test systematically:
Step 1: Reproduce the bug
cargo test test_email_with_plus_sign
Step 2: Narrow down location
println!("Email before parsing: {}", email);
let parsed = parse_email(&email);
println!("Email after parsing: {:?}", parsed);
Step 3: Test the specific theory
grep -n "email.*regex" src/auth/validation.rs
Step 4: Verify with test
cargo test test_email_with_plus_sign
Phase 4: Solve and Verify
Goal: Implement the fix and confirm it solves the problem without breaking anything.
1. Implement the fix:
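// Before: "+" is not in the character class, so "user+tag@example.com" is rejected
let email_regex = Regex::new(r"^[\w]+@[\w]+\.[\w]+").unwrap();
// After: the local part also accepts ".", "+", and "-", and the pattern is anchored at the end
let email_regex = Regex::new(r"^[\w.+-]+@[\w.-]+\.[\w]+$").unwrap();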
2. Verify the fix works:
cargo test test_email_with_plus_sign
cargo test test_email_validation
3. Run regression tests:
cargo test
cargo test --test '*_integration'
4. Test in environment:
./myapp --test-login user+tag@example.com
5. Add regression test:
#[test]
fn test_email_validation_allows_plus_sign() {
assert!(validate_email("user+tag@example.com").is_ok());
assert!(validate_email("user+test+tag@example.com").is_ok());
}
#[test]
fn test_email_validation_rejects_invalid() {
assert!(validate_email("invalid-email").is_err());
}
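The regression tests above call `validate_email`, whose body is not shown. To try them without the `regex` crate, a standard-library-only approximation of the updated pattern `^[\w.+-]+@[\w.-]+\.[\w]+$` might look like the sketch below. The error messages and the `is_word` helper are assumptions made for illustration, and `\w` is approximated as "alphanumeric or underscore".

```rust
/// Approximates `\w` as "alphanumeric or underscore" (an assumption).
fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Stand-in for a validator built on `^[\w.+-]+@[\w.-]+\.[\w]+$`.
fn validate_email(email: &str) -> Result<(), String> {
    // `^[\w.+-]+@`: the local part may contain word characters, '.', '+', '-'.
    let (local, domain) = email
        .split_once('@')
        .ok_or_else(|| format!("missing '@' in {email:?}"))?;
    if local.is_empty() || !local.chars().all(|c| is_word(c) || ".+-".contains(c)) {
        return Err(format!("invalid local part in {email:?}"));
    }
    // `[\w.-]+\.[\w]+$`: the final label has no dots, so split at the last '.'.
    let (host, tld) = domain
        .rsplit_once('.')
        .ok_or_else(|| format!("missing '.' in domain of {email:?}"))?;
    if host.is_empty() || !host.chars().all(|c| is_word(c) || ".-".contains(c)) {
        return Err(format!("invalid host in {email:?}"));
    }
    if tld.is_empty() || !tld.chars().all(is_word) {
        return Err(format!("invalid final label in {email:?}"));
    }
    Ok(())
}
```

Like the pattern it mirrors, this is a pragmatic check rather than a full implementation of the email address grammar.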
6. Document the fix:
## Fix: Email validation now allows "+" character
**Problem:** Emails containing "+" (e.g., "user+tag@example.com") were rejected.
**Root Cause:** The email regex `^[\w]+@[\w]+\.[\w]+` did not allow the "+" character in the local part.
**Solution:** Updated regex to `^[\w.+-]+@[\w.-]+\.[\w]+$`
**Files changed:**
- src/auth/validation.rs (line 45)
**Tests added:**
- test_email_validation_allows_plus_sign
- test_email_validation_rejects_invalid
**Verified:**
- Unit tests pass
- Integration tests pass
- Manually tested in staging
Debugging Techniques
Root Cause Tracing
See companion skill: root-cause-tracing for detailed techniques.
Key techniques:
- 5 Whys (ask "why" until you reach root cause)
- Fishbone diagrams (categorize causes)
- Fault tree analysis (boolean logic of failures)
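To make the "boolean logic of failures" idea concrete, a fault tree can be modeled as nested AND/OR gates over basic events. The sketch below is illustrative only: the `Fault` type and the events in `login_failure` are hypothetical and are not taken from the root-cause-tracing skill.

```rust
/// A node in a fault tree: a basic event, or a gate combining child faults.
enum Fault {
    /// A basic event and whether it was observed.
    Event(&'static str, bool),
    /// Occurs only if every child occurs.
    And(Vec<Fault>),
    /// Occurs if any child occurs.
    Or(Vec<Fault>),
}

impl Fault {
    /// Evaluates the gate logic for this subtree.
    fn occurs(&self) -> bool {
        match self {
            Fault::Event(_, observed) => *observed,
            Fault::And(children) => children.iter().all(|c| c.occurs()),
            Fault::Or(children) => children.iter().any(|c| c.occurs()),
        }
    }

    /// Names of the basic events that were observed, in tree order.
    fn observed(&self) -> Vec<&'static str> {
        match self {
            Fault::Event(name, true) => vec![*name],
            Fault::Event(_, false) => vec![],
            Fault::And(children) | Fault::Or(children) => {
                children.iter().flat_map(|c| c.observed()).collect()
            }
        }
    }
}

/// Hypothetical tree: login fails if the regex rejects '+', or if the
/// database is unreachable while the session cache is also empty.
fn login_failure(regex_rejects_plus: bool, db_down: bool, cache_empty: bool) -> Fault {
    Fault::Or(vec![
        Fault::Event("validation regex rejects '+'", regex_rejects_plus),
        Fault::And(vec![
            Fault::Event("database unreachable", db_down),
            Fault::Event("session cache empty", cache_empty),
        ]),
    ])
}
```

Evaluating the tree with the facts gathered in Phase 1 shows which combinations of events are sufficient to explain the failure and which are not.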
Defense in Depth
See companion skill: defense-in-depth for detailed techniques.
Key principles:
- Multiple layers of protection
- Fail-safe defaults
- Validate at boundaries
- Log for forensics
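One common way to apply "validate at boundaries" and "fail-safe defaults" in Rust is a newtype whose only constructor performs the check, so inner layers cannot receive raw input. The sketch below is a minimal illustration under stated assumptions: `Email`, `parse`, and `domain_of` are hypothetical names, and the check is deliberately loose (a non-empty local part and a dot in the domain).

```rust
/// An email address that has passed the boundary check.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Email(String);

impl Email {
    /// Boundary check: the only way to construct an `Email`.
    pub fn parse(raw: &str) -> Result<Self, String> {
        let candidate = raw.trim();
        let looks_valid = match candidate.split_once('@') {
            Some((local, domain)) => !local.is_empty() && domain.contains('.'),
            None => false,
        };
        if looks_valid {
            Ok(Email(candidate.to_owned()))
        } else {
            // Log for forensics, then fail safe by rejecting the input.
            eprintln!("rejected email input: {raw:?}");
            Err(format!("invalid email: {raw:?}"))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Inner layers accept only the validated type, never a raw string.
fn domain_of(email: &Email) -> &str {
    email.as_str().split_once('@').map(|(_, d)| d).unwrap_or("")
}
```

Because `domain_of` takes `&Email` rather than `&str`, the compiler enforces that validation has already happened by the time it runs.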
Condition-Based Waiting
See companion skill: condition-based-waiting for detailed techniques.
For timing/race condition issues:
- Wait for specific conditions, not arbitrary timeouts
- Use polling with exponential backoff
- Implement timeouts with clear failure modes
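The three points above can be combined in one helper. This is a minimal sketch using only the standard library. The name `wait_until`, the 1 ms starting delay, and the 100 ms delay cap are assumptions chosen for illustration, not values from the condition-based-waiting skill.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Polls `condition` until it returns true or `timeout` elapses.
/// The delay between polls doubles each time, capped at 100 ms.
/// Returns how long the wait took, or an error describing the timeout.
fn wait_until(
    mut condition: impl FnMut() -> bool,
    timeout: Duration,
) -> Result<Duration, String> {
    let start = Instant::now();
    let mut delay = Duration::from_millis(1);
    let max_delay = Duration::from_millis(100);
    loop {
        if condition() {
            return Ok(start.elapsed());
        }
        let elapsed = start.elapsed();
        if elapsed >= timeout {
            // Clear failure mode: report the timeout instead of hanging.
            return Err(format!("condition not met after {elapsed:?}"));
        }
        // Never sleep past the deadline.
        thread::sleep(delay.min(timeout - elapsed));
        delay = (delay * 2).min(max_delay);
    }
}
```

A test that previously slept for a fixed interval could instead call something like `wait_until(|| server_is_ready(), Duration::from_secs(5))`, where `server_is_ready` is a hypothetical readiness check. The wait then ends as soon as the condition holds and fails loudly if it never does.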
Common Debugging Mistakes
❌ Fixing symptoms, not causes
❌ "User gets error" → "Hide error message"
✅ "User gets error" → "Fix the actual problem"
❌ Making random changes
❌ Change A, test, change B, test, change C, test
✅ Form hypothesis, test hypothesis, verify
❌ Not testing the fix
❌ "Seems to work now"
✅ "All tests pass, manually verified, regression test added"
❌ Skipping documentation
❌ Fixed it, moving on
✅ Document root cause, fix, and verification
❌ Not adding regression tests
❌ Fixed bug, never to return
✅ Fixed bug, added test to prevent return
Debugging Tools
Command Line
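# Linux: trace read/write system calls
strace -e trace=read,write ./myapp
# Linux: trace library calls
ltrace ./myapp
# Systems with DTrace (e.g., macOS, FreeBSD): count read() calls per process
sudo dtrace -n 'syscall::read:entry { @[execname] = count(); }'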
Language-Specific
Rust:
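# Print a backtrace on panic
RUST_BACKTRACE=1 cargo run
# Build with debug assertions enabled
RUSTFLAGS="-C debug-assertions=y" cargo build
# Run the tests under Valgrind (Valgrind itself must be installed)
cargo install cargo-valgrind
cargo valgrind test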
Logging
tracing::info!("Processing email: {}", email);
tracing::debug!("Parsed result: {:?}", result);
tracing::error!("Failed: {:?}", error);
Debuggers
Rust (lldb/gdb):
rust-lldb target/debug/myapp
Systematic Debugging Checklist
Success Criteria
Debugging is successful when:
- ✅ Root cause identified (not just symptoms)
- ✅ Fix actually solves the problem
- ✅ No new bugs introduced
- ✅ Regression test added
- ✅ Documentation complete
- ✅ Problem won't recur
Remember
Phase 1: Understand the problem completely
Phase 2: Form a testable hypothesis
Phase 3: Prove or disprove it
Phase 4: Fix it right and verify
Don't:
- Guess
- Fix randomly
- Skip testing
- Forget to document
- Leave tests broken
Do:
- Be systematic
- Test assumptions
- Verify thoroughly
- Learn and prevent