| name | systematic-debugging |
| description | Use when debugging failures, errors, or unexpected behavior. Covers root cause investigation, data flow tracing, hypothesis-driven debugging, and fix verification to prevent trial-and-error approaches. |
| keywords | ["debugging","root-cause","trace-data-flow","divergence","hypothesis","fix-verification","stack-trace","error-analysis","profiling","logging","TypeError","null-undefined","race-condition","promise-not-awaited","validation-failed","wrong-condition","env-var","stale-cache","format-changed","early-return","service-not-running"] |
| created | "2026-01-20T00:00:00.000Z" |
| updated | "2026-01-20T00:00:00.000Z" |
| plugin | dev |
| type | discipline |
| difficulty | beginner |
Systematic Debugging
Iron Law: "NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST"
When to Use
Use this skill when:
- A test fails and you need to understand why
- An error is thrown and you need to find the cause
- A feature behaves unexpectedly
- Performance degrades and you need to identify bottlenecks
- Data corruption occurs and you need to trace the source
- A bug reappears after "fixing" it
Red Flags (Violation Indicators)
Detect these patterns that indicate skipping root cause investigation:
Key Concepts
1. Root Cause vs. Symptom
Symptom: What you observe (test fails, error thrown, wrong output)
Root Cause: Why it happens (null value, wrong condition, missing await)
Example:
Symptom: "TypeError: Cannot read property 'name' of undefined"
Root Cause: API returns undefined when the user is not found, but the code expects an object
Bad approach: Add user?.name (fixes symptom, not cause)
Good approach: Add validation if (!user) throw new NotFoundError() (fixes cause)
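The difference between the two approaches can be sketched in Python. This is a minimal illustration, not the original code: `find_user`, `USERS`, and this `NotFoundError` class are hypothetical stand-ins for the real lookup.

```python
class NotFoundError(Exception):
    """Hypothetical error type for a missing user."""


# Hypothetical in-memory stand-in for the API or database.
USERS = {1: {"name": "Alice"}}


def find_user(user_id):
    """Return the user dict, or None when the user does not exist."""
    return USERS.get(user_id)


def get_name_symptom_fix(user_id):
    # Symptom fix: the crash disappears, but the missing user is hidden
    # and callers silently receive None.
    user = find_user(user_id)
    return user["name"] if user else None


def get_name_root_cause_fix(user_id):
    # Root cause fix: the missing user is reported where it is detected.
    user = find_user(user_id)
    if user is None:
        raise NotFoundError(f"user {user_id} not found")
    return user["name"]
```

With the symptom fix, the bad value travels further and fails somewhere less obvious. With the root cause fix, the failure names the actual problem.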
2. Data Flow Tracing
Principle: Follow data from source to error point
Steps:
- Identify error location (stack trace line number)
- Identify data involved (variable name, object property)
- Trace backwards: Where does this data come from?
- Find divergence: Where does actual differ from expected?
Example:
Error: "Expected 'active' but got 'inactive'"
Location: user.test.ts:42 - expect(user.status).toBe('active')
Data: user.status = 'inactive'
Trace: user.status ← updateUser() ← API response ← database
Divergence: Database has status='inactive' (expected 'active')
Root Cause: Test setup didn't create user with active status
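The trace above can be made mechanical by capturing the value at each hop and scanning from the source forward. A minimal sketch, assuming the values have already been collected by logging; `first_divergence` is a hypothetical helper name.

```python
def first_divergence(stages, expected):
    """Return the name of the earliest stage whose value differs from expected.

    `stages` is an ordered list of (name, value) pairs, source first.
    Returns None when every stage matches.
    """
    for name, value in stages:
        if value != expected:
            return name
    return None


# Values captured at each hop, ordered from source to error point.
stages = [
    ("database", "inactive"),
    ("API response", "inactive"),
    ("updateUser()", "inactive"),
    ("user.status", "inactive"),
]

print(first_divergence(stages, "active"))  # prints "database"
```

The earliest diverging stage is where to look for the root cause. Every later stage only passes the wrong value along.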
3. Hypothesis-Driven Debugging
Principle: Form hypothesis, test with evidence, refine
Process:
- Observe: What is the symptom? (error message, wrong output)
- Hypothesize: What could cause this? (list 2-3 possibilities)
- Predict: If hypothesis is true, what else should I see?
- Test: Add logging, check state, run minimal reproduction
- Conclude: Does evidence support hypothesis? If no, try next hypothesis
Example:
Symptom: API request times out after 30s
Hypothesis 1: Database query is slow
Prediction: Should see long query time in logs
Test: Add query timing logs
Result: Queries complete in <100ms ✗ Hypothesis rejected
Hypothesis 2: Network connection is hanging
Prediction: Should see connection delay, not query delay
Test: Add request timing logs (connect time vs. query time)
Result: Connection takes 31s, query never runs ✓ Hypothesis confirmed
Root Cause: Firewall blocks connection, causing timeout
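Timing each phase separately is what lets the evidence separate the two hypotheses. A minimal sketch of that instrumentation, where `connect` and `run_query` are hypothetical stand-ins (a short sleep simulates the slow connection):

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def connect():
    # Hypothetical stand-in for opening the connection (simulated delay).
    time.sleep(0.05)


def run_query():
    # Hypothetical stand-in for a fast query.
    return []


with timed("connect"):
    connect()
with timed("query"):
    run_query()

slowest = max(timings, key=timings.get)
print(f"slowest phase: {slowest}")
```

If the query phase were the slow one, Hypothesis 1 would be supported instead. The measurement decides, not the guess.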
4. Fix Verification
Principle: Verify fix addresses root cause, not just symptom
Checklist:
4-Phase Debugging Process
Phase 1: TRACE DATA FLOW
Objective: Identify where actual diverges from expected
Steps:
- Read error message (what failed?)
- Read stack trace (where failed?)
- Identify data involved (what value is wrong?)
- Trace backwards from error to source
- Log intermediate values to find divergence point
Example (TypeScript):
console.log('1. API response:', response);
console.log('2. Extracted user:', response.data);
console.log('3. User object:', response.data.user);
console.log('4. Email field:', response.data.user.email);
Phase 2: IDENTIFY DIVERGENCE
Objective: Determine why actual differs from expected
Questions:
- What is the expected value? (from spec, test, documentation)
- What is the actual value? (from logs, debugger, state inspection)
- Where does the divergence occur? (which function, which line)
- What changed recently? (git diff, recent commits)
Example (Python):
with open('users.csv') as f:
    print(f.readline())
Phase 3: HYPOTHESIZE ROOT CAUSE
Objective: Form testable hypothesis about why divergence occurred
Hypothesis Template:
"I believe [divergence] occurs because [root cause].
If this is true, I should see [evidence].
I can test this by [action]."
Example (Go):
Phase 4: VERIFY FIX
Objective: Confirm fix addresses root cause
Verification Steps:
- Write test that reproduces the bug (fails before fix)
- Apply fix
- Run test (should pass)
- Explain why fix works (addresses root cause)
- Run regression tests (no side effects)
Example (TypeScript):
const user = { id: 1, email: undefined };
JSON.stringify(user); // '{"id":1}': properties whose value is undefined are omitted
const filtered = Object.fromEntries(
  Object.entries(user).filter(([_, v]) => v !== undefined)
);
expect(filtered).toEqual({ id: 1 });
expect(JSON.stringify(filtered)).toBe('{"id":1}');
Debugging Strategies by Problem Type
Test Fails
Strategy: Identify assertion, trace data, find divergence
test('calculates total', () => {
  const result = calculateTotal([1, 2, 2]);
  console.log('Input:', [1, 2, 2]);
  console.log('Expected:', 5);
  console.log('Actual:', result);
  console.log('Divergence:', result - 5);
  expect(result).toBe(5);
});
Performance Slow
Strategy: Profile execution, identify bottleneck
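import time
def slow_function():
    # Time each step on its own so the slowest one stands out.
    start = time.perf_counter()
    data = fetch_data()
    print(f"Fetch: {time.perf_counter() - start:.2f}s")
    start = time.perf_counter()
    processed = process_data(data)
    print(f"Process: {time.perf_counter() - start:.2f}s")
    start = time.perf_counter()
    save_data(processed)
    print(f"Save: {time.perf_counter() - start:.2f}s")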
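When manual timers get unwieldy, a profiler can rank functions by time spent. A minimal sketch using the standard library's `cProfile` and `pstats`. The `fetch_data`, `process_data`, and `pipeline` functions here are hypothetical stand-ins, with a sleep simulating slow I/O.

```python
import cProfile
import io
import pstats
import time


def fetch_data():
    time.sleep(0.05)  # simulated slow I/O
    return list(range(1000))


def process_data(data):
    return [x * 2 for x in data]


def pipeline():
    return process_data(fetch_data())


profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Sort by cumulative time so the dominant call path appears first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the report, the functions with the largest cumulative time are the bottleneck candidates to investigate first.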
Data Corruption
Strategy: Trace data mutations, find unexpected write
func UpdateUser(user *User) {
    log.Printf("Before: %+v", user)
    user.Email = normalizeEmail(user.Email)
    log.Printf("After normalize: %+v", user)
    db.Save(user)
    log.Printf("After save: %+v", user)
}
Error Thrown
Strategy: Read stack trace, identify throw location, trace backwards
function validateInput(input: string) {
  if (input.length < 3) {
    throw new Error('Too short');
  }
}
function handleSubmit() {
  const input = getInputValue();
  validateInput(input);
}
Feature Broken
Strategy: Identify last working state, compare changes
git bisect start
git bisect bad HEAD
git bisect good v1.2.0
git show abc123
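Bisecting is a binary search over history. The idea can be sketched in Python, assuming the first commit is known good, the last is known bad, and every commit after the first bad one is also bad. The commit names and the `is_bad` check are hypothetical.

```python
def first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    once a commit is bad every later commit is bad too.
    """
    low, high = 0, len(commits) - 1  # low is good, high is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high]


# Hypothetical history, oldest first.
commits = ["v1.2.0", "a1", "b2", "c3", "d4", "HEAD"]
broken_since = {"c3", "d4", "HEAD"}

print(first_bad(commits, lambda c: c in broken_since))  # prints "c3"
```

Each check halves the remaining range, so even a long history needs only a handful of test runs.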
Common Root Causes
Type Issues
- Null/undefined: Value is null when code expects object
- String vs. number: "42" treated as string, not number
- Array vs. object: Iterating object as array
- Promise vs. value: Forgot to await async function
Async Issues
- Race condition: Two async operations modify same state
- Promise not awaited: Code continues before async completes
- Callback hell: Nested callbacks lose error context
- Event ordering: Events fire in unexpected order
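A race condition can be reproduced deterministically with `asyncio`. In this minimal sketch, two tasks read a counter, yield, then write back. The `increment` and `run_two` helpers are hypothetical names.

```python
import asyncio


async def increment(state, lock=None):
    """Read-modify-write with a yield between the read and the write."""
    if lock is None:
        current = state["count"]
        await asyncio.sleep(0)  # other tasks run here and read the stale value
        state["count"] = current + 1
    else:
        async with lock:  # the lock makes the read and the write one unit
            current = state["count"]
            await asyncio.sleep(0)
            state["count"] = current + 1


async def run_two(use_lock):
    state = {"count": 0}
    lock = asyncio.Lock() if use_lock else None
    await asyncio.gather(increment(state, lock), increment(state, lock))
    return state["count"]


unsafe_total = asyncio.run(run_two(use_lock=False))
safe_total = asyncio.run(run_two(use_lock=True))
print(unsafe_total, safe_total)  # prints "1 2"
```

Without the lock, both tasks read 0 and one update is lost. A small reproduction like this confirms the hypothesis before any fix touches production code.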
Data Issues
- Validation failed: Input doesn't match expected format
- Format changed: API response structure changed
- Stale cache: Cached data is outdated
- Encoding mismatch: UTF-8 vs. ASCII, JSON vs. string
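An encoding mismatch often shows up as garbled characters instead of an error. A minimal sketch of how the same bytes read differently under two encodings (the sample text is an arbitrary choice):

```python
text = "café"

# The writer encodes as UTF-8: "é" becomes the two bytes 0xC3 0xA9.
raw = text.encode("utf-8")

# A reader that assumes Latin-1 decodes each byte as its own character.
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")

print(len(raw))      # prints 5
print(wrong == text) # prints False
print(right == text) # prints True
```

Comparing byte lengths and decoded values at each boundary (file, network, database) shows where the wrong decoder was applied.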
Logic Issues
- Wrong condition: if (x > 5) should be if (x >= 5)
- Early return: Function returns before reaching correct code
- Off-by-one: Loop iterates n-1 or n+1 times
- Short-circuit: && or || causes early exit
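Boundary bugs such as a wrong comparison are easiest to find by probing values on either side of the boundary. A minimal sketch, with hypothetical function names and an assumed threshold of 5:

```python
def is_eligible_buggy(x):
    return x > 5  # excludes the boundary value itself


def is_eligible_fixed(x):
    return x >= 5  # includes the boundary value


# Probe just below, at, and just above the boundary.
for x in (4, 5, 6):
    print(x, is_eligible_buggy(x), is_eligible_fixed(x))

# Inputs where the two versions disagree pinpoint the faulty condition.
disagreements = [x for x in range(10) if is_eligible_buggy(x) != is_eligible_fixed(x)]
print(disagreements)  # prints "[5]"
```

The two versions disagree only at the boundary, which is why tests that skip boundary values tend to miss this class of bug.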
Environment Issues
- Env var not set: Missing API_KEY environment variable
- Service not running: Database or API server is down
- Version mismatch: Dependency version incompatibility
- Permission denied: File or directory not accessible
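Environment problems are cheap to rule out early with a preflight check. A minimal sketch that reports unset or empty variables. `missing_env_vars` is a hypothetical helper, and the variable names and values are examples only.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# An explicit mapping is used here so the example does not depend on the
# real environment; omit the second argument to check os.environ.
fake_env = {"DATABASE_URL": "postgres://localhost/dev", "API_KEY": ""}

print(missing_env_vars(["API_KEY", "DATABASE_URL"], fake_env))  # prints "['API_KEY']"
```

Running a check like this at startup turns a confusing downstream failure into a direct message about the missing configuration.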
Examples
Example 1: React Test Failure (TypeScript)
Symptom:
test('disables submit when invalid', () => {
  render(<Form />);
  const button = screen.getByRole('button');
  expect(button).toBeDisabled();
});
Phase 1: Trace Data Flow
function Form() {
  const [isValid, setIsValid] = useState(false);
  console.log('isValid:', isValid);
  return (
    <button disabled={!isValid}>Submit</button>
  );
}
Phase 2: Identify Divergence
const button = screen.getByRole('button');
console.log('Disabled attribute:', button.disabled);
console.log('Button HTML:', button.outerHTML);
Phase 3: Hypothesize Root Cause
return <button disabled={true}>Submit</button>;
Phase 4: Verify Fix
const [isValid, setIsValid] = useState(false);
console.log('Rendering with isValid:', isValid);
test('disables submit when invalid', () => {
  render(<Form initialValid={false} />);
});
Example 2: Python API Timeout
Symptom:
response = requests.get('https://api.example.com/users')
Phase 1: Trace Data Flow
import time
import requests
start = time.time()
try:
    response = requests.get('https://api.example.com/users', timeout=30)
    print(f"Request took: {time.time() - start:.2f}s")
except requests.exceptions.Timeout:
    print(f"Timeout after: {time.time() - start:.2f}s")
Phase 2: Identify Divergence
response = requests.get(
    'https://api.example.com/users',
    timeout=(5, 30)  # (connect timeout, read timeout) in seconds
)
Phase 3: Hypothesize Root Cause
response = requests.get('http://192.168.1.100/users')
import socket
socket.gethostbyname('api.example.com')
Phase 4: Verify Fix
import os
API_HOST = os.getenv('API_HOST', '192.168.1.100')
response = requests.get(f'http://{API_HOST}/users')
assert response.status_code == 200
assert response.elapsed.total_seconds() < 1
Example 3: Go Data Corruption
Symptom:
user := User{ID: 1, Email: "Alice@Example.com"}
UpdateUser(&user)
fmt.Println(user.Email)
Phase 1: Trace Data Flow
func UpdateUser(user *User) {
    log.Printf("Before: %+v", user)
    user.Email = normalizeEmail(user.Email)
    log.Printf("After normalize: %+v", user)
    db.Save(user)
    log.Printf("After save: %+v", user)
}
Phase 2: Identify Divergence
func normalizeEmail(email string) string {
    return strings.ToLower(email)
}
Phase 3: Hypothesize Root Cause
Phase 4: Verify Fix
func normalizeEmail(email string) string {
    parts := strings.Split(email, "@")
    if len(parts) != 2 {
        return email
    }
    return parts[0] + "@" + strings.ToLower(parts[1])
}
assert.Equal(t, "Alice@example.com", normalizeEmail("Alice@Example.com"))
assert.Equal(t, "alice@example.com", normalizeEmail("alice@EXAMPLE.COM"))
Integration with Other Skills
With verification-before-completion
After debugging and fixing, use verification-before-completion to confirm:
- Test that was failing now passes (evidence: test output)
- Root cause is addressed in fix (evidence: code review)
- No regressions introduced (evidence: full test suite passes)
With test-driven-development
When debugging reveals a bug:
- Write test that reproduces the bug (RED phase)
- Debug to find root cause (this skill)
- Implement fix (GREEN phase)
- Refactor if needed (REFACTOR phase)
With agent-coordination-discipline
For complex debugging requiring multiple investigations:
- Use agent delegation when debugging spans multiple services
- Use claudish CLI for external debugging expertise
- Define clear success criteria: "Find root cause of timeout"
Enforcement Checklist
Before marking debugging task complete:
If any checkbox is unchecked, continue debugging. Do not apply a fix until the root cause is understood.