self-heal-locate-fault
// WHEN: Semantic machine-eval failed. Parse qa/semantic-eval-run.log (+ semantic-eval-manifest.json), trace the failure chain backwards to the root service, and collect logs and state as evidence.
| Field | Value |
|---|---|
| name | self-heal-locate-fault |
| description | WHEN: Semantic machine-eval failed. Parse qa/semantic-eval-run.log (+ semantic-eval-manifest.json), trace the failure chain backwards to the root service, and collect logs and state as evidence. |
| type | rigid |
| requires | ["brain-read"] |
| version | 1.0.3 |
| preamble-tier | 3 |
| triggers | ["locate the fault","find where eval failed","which service failed"] |
| allowed-tools | ["Bash"] |
| Rationalization | Why It Fails |
|---|---|
| "The error message says which service failed" | Error messages report symptoms, not causes. A 500 from the API may be caused by a cache miss, a DB timeout, or a Kafka lag. |
| "It's probably the service I just changed" | Confirmation bias. The change you made may have exposed a pre-existing bug in a different service. Follow the evidence. |
| "I'll just re-run the eval and see if it passes" | Flaky passes hide real bugs. Diagnose first, fix, then verify. A green re-run without diagnosis is a time bomb. |
| "The logs are too noisy, I'll guess" | Guessing wastes self-heal loop iterations. Filter logs by timestamp and request ID to find the actual failure chain. |
| "Multiple services failed, so it's an environment issue" | Multi-service failures often have a single root cause (e.g., one service returning bad data that cascades). Find the first failure in the chain. |
If you are thinking any of the above, you are about to violate this skill.
FAULT LOCATION TRACES THE FAILURE CHAIN BACKWARDS TO THE FIRST FAILURE IN THE TIMELINE. THE LAST SERVICE TO LOG AN ERROR IS NOT THE ROOT. FOLLOW REQUEST IDs AND TIMESTAMPS — NOT INTUITION.
Why It Fails: Last output often reflects a downstream cascade effect, not the root cause. The root cause is almost always upstream—the service that failed first, which caused the downstream failures.
Enforcement (5 MUST bullets):
Example of failure: Eval output shows cache service error last. Tracing backwards finds API service returned stale data 500ms earlier. Cache was responding normally but to poisoned data from API. API is the root cause.
Why It Fails: Top-level error messages are user-facing abstractions designed for readability, not diagnosis. Root cause is buried in nested exception chains, caused_by fields, or stack trace context layers.
Enforcement (5 MUST bullets):
Example of failure: Top-level says "Database error". Unwrapping finds "Connection pool exhausted" → "Max connections (10) reached" → actual root is connection leak in another service. Top-level hid the real cause.
Why It Fails: Empty logs for the failure window typically indicate the service never received the request (routing failure), not a code bug. Code always produces logs when it executes; silence means the request never arrived.
Enforcement (5 MUST bullets):
Example of failure: Web-to-API call produces no API logs. "Code must be broken." Actually: API binding to localhost only; web (in different container) cannot reach it. Networking issue, not code.
Why It Fails: Identical error messages can occur from completely different code paths and faults. Error string alone is not unique; full context (stack trace, code location, surrounding logs) is required to distinguish root causes.
Enforcement (5 MUST bullets):
Example of failure: "Connection refused" is seen in both outbound database connection and external service API call. Same message; completely different roots. Stack trace reveals one is in db.connect() and other is in webhooks.post(). Two different faults.
Why It Fails: Infrastructure failures always produce application-level logs. When infrastructure fails (network down, disk full, OOM), the application always logs the effect: connection timeout, ECONNREFUSED, disk write failure, memory allocation error. Infra faults are observable through app logs.
Enforcement (5 MUST bullets):
Example of failure: "Service X is buggy" based on error logs. Actually: node running low on disk, kernel killing processes, service never even got to run user code. Infra fault, observable through "too many open files" in logs, but misattributed to service code.
If you notice any of these, STOP and do not proceed:
When an eval scenario fails, this skill diagnoses which service caused the failure and collects evidence for remediation.
The skill performs three sequential operations:
qa/semantic-eval-run.log + qa/semantic-eval-manifest.json (semantic CSV eval only — see below). Forge machine-eval is semantic CSV + manifest + run log only. Do not look for legacy prds/<task-id>/eval/ YAML scenario dumps or driver transcript trees — that path is removed.
| Situation | Primary failure artifact | Do not use |
|---|---|---|
| Pre-semantic (no run log yet) | N/A — do not self-heal until qa-semantic-csv-orchestrate has produced qa/semantic-eval-run.log (and usually qa/semantic-eval-manifest.json). | Guessing from semantic-automation.csv alone without a failed run record |
| Semantic (qa/semantic-automation.csv + manifest) | ~/forge/brain/prds/<task-id>/qa/semantic-eval-run.log (plus qa/semantic-eval-manifest.json outcome) | Nonexistent eval/ YAML / driver artifacts |
Semantic RED: If qa/semantic-eval-manifest.json has outcome: fail (or Phase 4.4 semantic branch returned RED), open qa/semantic-eval-run.log first. Format: comment header lines (# …, task_id=…, driver=…), then one JSON object per line per semantic step. Parse each line with jq or a small script; locate objects where status is FAILED, ERROR, or non-success; read id, surface, intent, and any error / message fields.
Trace Surface → service: Map surface values (Web, API, Android, …) to repos/services via ~/forge/brain/products/<slug>/product.md Projects / roles. Evidence starts in the log JSON lines and the failing step’s id in qa/semantic-automation.csv.
Parse the run log (semantic-eval-run.log):
1. Open qa/semantic-eval-run.log under the task’s qa/ folder.
2. Skip comment header lines (#, task_id=, driver=, blanks).
3. For each JSON line, read id, status, surface, reason, error, message, intent.
4. Treat SKIPPED with dependency_not_passed as cascade — the first non-PASS step in topological order is often the root; still verify upstream DependsOn in semantic-automation.csv.
5. Cross-reference the failing step's id with qa/semantic-automation.csv for Intent text and dependencies.
When: Semantic Phase 4.4 path failed; eval/ may have no matching driver transcript — rely on the steps above (semantic log only).
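A minimal sketch of steps 1–3 above, assuming the layout described (comment headers followed by one JSON object per line) and that passing steps carry a status beginning with PASS:

```bash
# Sketch only. <task-id> is a placeholder; substitute the real task folder.
LOG="$HOME/forge/brain/prds/<task-id>/qa/semantic-eval-run.log"

# Drop comment/header lines, keep JSON lines, print non-passing steps as TSV.
grep -v '^#' "$LOG" | grep '^{' | jq -r '
  select((.status // "") | test("^PASS") | not)
  | [.id, .status, (.surface // "-"), (.error // .message // .reason // "-")]
  | @tsv'
```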
Map the failure to a service or component using causal reasoning:
Status Code → Service
400, 401, 403 → Client error or auth service
404 → API routing or endpoint not found
422 → Validation service or input processor
500, 502, 503 → Backend API service
504 → Gateway timeout, likely upstream service
Scenario → Fault
API returned 200 but DB didn't update row → Database service
Web displayed data but DB shows different value → Database service or cache consistency
Request accepted but notification not delivered → Event bus or notification service
Cache hit but data was stale → Cache invalidation or TTL service
Web → API → DB
├── Web error (render, CDN, routing) → Web service
├── API error (500, exception) → Backend API service
├── DB error (constraint, timeout) → Database service
├── No response from API → API service or network
└── Slow response (timeout) → Slowest service in chain
Error → Fault
Third-party API timeout → External service
Rate limit exceeded → External service quota
Authentication token invalid → Auth/token service
Webhook delivery failed → Event bus or notification service
Cache connection refused → Cache service
Search index not responding → Search service
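As a rough first-pass helper mirroring the Status Code → Service mapping above (a sketch only; real triage must still follow the log evidence):

```bash
# Sketch: map an HTTP status code to the first suspect, per the mapping above.
suspect_from_status() {
  case "$1" in
    400|401|403) echo "client error or auth service" ;;
    404)         echo "API routing / endpoint not found" ;;
    422)         echo "validation service or input processor" ;;
    500|502|503) echo "backend API service" ;;
    504)         echo "gateway timeout, likely the slowest upstream service" ;;
    *)           echo "no status-code fingerprint; inspect logs directly" ;;
  esac
}

suspect_from_status 502   # -> backend API service
```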
For the identified fault service, gather the evidence shown in the template below (logs, stack traces, request/response data, DB state, cache state), then format the diagnosis clearly with these sections:
fault_diagnosis:
service: "<service-name>:<port>"
status: "failed"
error:
type: "<exception-type or http-status>"
message: "<error-message>"
step: "<scenario-step-that-failed>"
evidence:
logs:
- timestamp: "2026-04-10T14:32:15Z"
level: "ERROR"
message: "<log-line>"
file: "<source-file>:<line>"
stack_trace:
- function: "<function-name>"
file: "<filename>"
line: <line-number>
context: "<code-context>"
request:
method: "POST"
url: "/endpoint"
headers: { ... }
body: { ... }
response:
status: 500
headers: { ... }
body: { ... }
db_state:
query: "<failed-query>"
error: "<constraint-or-syntax-error>"
affected_rows: <count>
transaction_state: "rolled_back"
cache_state:
key: "<cache-key>"
expected_value: "..."
actual_value: "..."
ttl_remaining: <seconds>
was_hit: false
actionable:
root_cause: "<what-actually-broke>"
immediate_fix: "<how-to-fix-now>"
prevention: "<how-to-prevent-next-time>"
affected_flows: ["<flow-1>", "<flow-2>"]
fault_diagnosis:
service: "backend-api:3000"
status: "failed"
error:
type: "InternalServerError"
message: "POST /auth/2fa/enable returned 500"
step: "Enable 2FA on user account"
evidence:
logs:
- timestamp: "2026-04-10T14:32:15Z"
level: "ERROR"
message: "Error: 2FA secret generation failed"
file: "auth.js:123"
- timestamp: "2026-04-10T14:32:15Z"
level: "ERROR"
message: "Cannot read property 'base32' of undefined"
file: "auth.js:125"
stack_trace:
- function: "generateSecret"
file: "auth.js"
line: 123
context: "const encoded = speakeasy.totp.base32Encode(secret)"
- function: "enableTwoFactor"
file: "auth.js"
line: 156
request:
method: "POST"
url: "/auth/2fa/enable"
body: { phone: "+1234567890", method: "sms" }
response:
status: 500
body: { error: "Internal Server Error" }
actionable:
root_cause: "speakeasy library not imported or undefined"
immediate_fix: "Add: const speakeasy = require('speakeasy')"
prevention: "Add unit tests for auth.js, check imports in CI"
affected_flows: ["2FA setup", "login with 2FA"]
fault_diagnosis:
service: "mysql:3306"
status: "failed"
error:
type: "ConstraintViolation"
message: "Duplicate entry for user_id in profile table"
step: "Update user profile after registration"
evidence:
logs:
- timestamp: "2026-04-10T14:32:16Z"
level: "ERROR"
message: "Duplicate entry '12345' for key 'uk_user_id'"
db_state:
query: "INSERT INTO user_profile (user_id, name) VALUES (?, ?)"
error: "ER_DUP_ENTRY: Duplicate entry '12345' for key 'uk_user_id'"
affected_rows: 0
transaction_state: "rolled_back"
actionable:
root_cause: "Unique constraint violation—profile already exists for user"
immediate_fix: "Check if profile exists before INSERT, use UPSERT instead"
prevention: "Add integration tests for duplicate profile scenarios"
affected_flows: ["User registration", "Profile updates"]
fault_diagnosis:
service: "redis:6379"
status: "failed"
error:
type: "StaleDataError"
message: "Cache verification failed—expected '{role:admin}' but got '{role:user}'"
step: "Verify admin cache after role upgrade"
evidence:
cache_state:
key: "user:12345:roles"
expected_value: { role: "admin" }
actual_value: { role: "user" }
ttl_remaining: 3599
was_hit: true
actionable:
root_cause: "Cache not invalidated when user role was updated"
immediate_fix: "Add cache.delete('user:12345:roles') to role update handler"
prevention: "Implement cache invalidation triggers on role changes"
affected_flows: ["Permission checks", "Authorization"]
Symptom: Eval timestamps present (failure occurred at 14:32:15Z), but no log entries exist within ±30 seconds of failure time.
Why This Happens:
Do NOT:
Action Steps:
ls -la /var/log/SERVICE to see log rotation schedule
When to Escalate: If logs remain empty after expansion:
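A minimal sketch of the expansion check itself, assuming a conventional /var/log/SERVICE layout (paths and timestamps are illustrative):

```bash
# Sketch: widen the search window around the failure time, then check rotation.
LOG_DIR=/var/log/SERVICE
WINDOW="2026-04-10T14:32"        # the failure minute; shorten it to widen the window

grep "$WINDOW" "$LOG_DIR"/*.log 2>/dev/null    # anything logged in that window?
ls -la "$LOG_DIR"                              # rotation schedule and file mtimes
zgrep "$WINDOW" "$LOG_DIR"/*.gz 2>/dev/null    # entries already rotated away?
```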
Symptom: api-service, backend-service, and db-service all log "ECONNREFUSED" at exactly 14:32:15.123Z
Why This Happens:
Do NOT:
Action Steps:
When to Escalate: If timeline is ambiguous:
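To make the timeline less ambiguous, one way to establish ordering is to merge the per-service logs for that second and sort on the timestamp (a sketch, assuming ISO-8601 timestamps at the start of each line):

```bash
# Sketch: tag each line with its service name, then sort chronologically.
# The first error in the merged timeline is the root-cause candidate.
for f in /var/log/*/service.log; do
  svc=$(basename "$(dirname "$f")")
  grep "2026-04-10T14:32:15" "$f" | sed "s/^/$svc /"
done | sort -k2
```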
Symptom: Error message encountered: "Error: handle_json_decode failed" — does not match any known pattern in fault library
Why This Happens:
Do NOT:
Action Steps:
"pattern_status": "NEW_PATTERN"When to Escalate: If diagnosis is unclear from stack trace alone:
Symptom: Service A logs failure at 09:00:01.000Z, Service B logs ECONNREFUSED at 09:00:05.000Z for the same logical event. Should be simultaneous but 4-second gap.
Why This Happens:
Do NOT:
Action Steps:
timedatectl status, ntpq -p
date from each service's logs or host
When to Escalate: If clock skew is large (>5 seconds):
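A quick way to measure the skew (a sketch; the container names are assumptions for a docker-compose style setup, and %N requires GNU date):

```bash
# Sketch: compare wall clocks across the host and containers, confirm NTP sync.
date -u +"%Y-%m-%dT%H:%M:%S.%3NZ"                    # host reference clock
for c in web api db; do
  printf '%s: ' "$c"; docker exec "$c" date -u +"%Y-%m-%dT%H:%M:%S.%3NZ"
done
timedatectl status | grep -i synchronized            # NTP sync state on the host
```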
Symptom: Eval assertion fails (returns wrong data) but all services log "200 OK" or "Request processed successfully". No errors in any log.
Why This Happens:
Do NOT:
Action Steps:
SELECT * FROM TABLE WHERE id=X before/after eval to see if data changed incorrectly
Inspect the cached value directly (redis-cli GET key)
When to Escalate: If data flow and logic are correct but output is still wrong:
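To capture the before/after state that comparison needs, a minimal sketch using the table, row, and cache key from the earlier examples as illustrative names:

```bash
# Sketch: snapshot the row and the cache entry the eval asserts on, before and
# after the run, so the two states can be diffed. Names are illustrative.
DB=appdb   # illustrative database name; credentials omitted
mysql "$DB" -e "SELECT * FROM user_profile WHERE user_id = 12345\G"
redis-cli GET "user:12345:roles"
redis-cli TTL "user:12345:roles"
```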
Use this tree to determine which log/state sources to query based on fault fingerprint type.
START: Fault Fingerprint Type Identified?
│
├─ HTTP Status Code Error (4xx, 5xx, timeout)
│ │
│ ├─ 400, 401, 403, 422
│ │ └─ Query: API request logs, auth service logs, validation service logs
│ │ Collect: Request body, auth headers, validation rules
│ │ Evidence Type: Request/Response Data + Logs
│ │
│ ├─ 404, 405, 406
│ │ └─ Query: API routing logs, endpoint definitions, CDN logs
│ │ Collect: Request URL, available endpoints, routing rules
│ │ Evidence Type: Request/Response Data + Service State
│ │
│ ├─ 500, 502, 503
│ │ └─ Query: Backend service logs, exception traces, upstream service logs
│ │ Collect: Full stack trace, request payload, downstream responses
│ │ Evidence Type: Logs + Stack Traces + Request/Response Data
│ │
│ └─ 504, timeout
│ └─ Query: Upstream service logs, network logs, resource usage at timeout moment
│ Collect: Service response times, resource exhaustion signs, connection states
│ Evidence Type: Logs + Service State
│
├─ Exception / Stack Trace Error
│ │
│ ├─ NullPointerException / TypeError / ReferenceError
│ │ └─ Query: Service logs for the file/line, code context around error location
│ │ Collect: Stack trace, variable states, recent code changes
│ │ Evidence Type: Stack Traces + Code Context
│ │
│ ├─ Network Error (ECONNREFUSED, ENOTFOUND, ETIMEDOUT)
│ │ └─ Query: Downstream service logs, network configuration, connectivity checks
│ │ Collect: Service up/down status, routing rules, firewall rules
│ │ Evidence Type: Service State + Network Logs
│ │
│ ├─ Constraint Violation / Database Error
│ │ └─ Query: Database logs, transaction logs, schema definitions
│ │ Collect: Failed query, constraint rules, data state
│ │ Evidence Type: DB State + Logs
│ │
│ └─ Out of Memory / Too Many Open Files / Disk Full
│ └─ Query: System resource logs, container logs, process resource limits
│ Collect: Memory usage, file descriptor count, disk space
│ Evidence Type: Service State + System Metrics
│
├─ Data Inconsistency (200 OK but wrong data)
│ │
│ ├─ Cache stale / Cache poisoned
│ │ └─ Query: Cache hit/miss logs, cache invalidation logs, upstream data source
│ │ Collect: Cache key, expected vs actual value, TTL, data source state
│ │ Evidence Type: Cache State + Request/Response Data
│ │
│ ├─ Database inconsistency
│ │ └─ Query: Database transaction logs, concurrent update logs, replication logs
│ │ Collect: Row state before/after, concurrent modifications, transaction boundaries
│ │ Evidence Type: DB State + Logs
│ │
│ └─ Business logic bug
│ └─ Query: Application logic logs, transformation logs, data flow logs
│ Collect: Input data, transformation steps, output data
│ Evidence Type: Logs + Request/Response Data
│
├─ External Service Failure
│ │
│ ├─ Third-party API timeout
│ │ └─ Query: Outbound request logs, external service status page, network logs
│ │ Collect: Request to external service, response time, service status
│ │ Evidence Type: Logs + Request/Response Data
│ │
│ ├─ Rate limit exceeded
│ │ └─ Query: API call frequency logs, rate limit configuration
│ │ Collect: Call count in time window, rate limit threshold
│ │ Evidence Type: Logs + Service State
│ │
│ └─ Authentication token invalid
│ └─ Query: Auth service logs, token validation logs, token expiry logs
│ Collect: Token payload, expiry time, auth validation rules
│ Evidence Type: Logs + Request/Response Data
│
└─ Unknown / No Clear Fingerprint
└─ Query: All service logs in eval window, cross-service timing correlation
Collect: Aggregate errors, timeline reconstruction, log patterns
Evidence Type: Logs + Timeline Analysis
| Evidence Type | Where to Find | Fault Category Indicator | Query Command |
|---|---|---|---|
| HTTP Status Logs | API gateway, reverse proxy, service HTTP handler | Client errors (4xx) indicate API layer; server errors (5xx) indicate backend | `grep -E "POST\|GET" /var/log/api.log \| grep -E "200\|4[0-9]{2}\|5[0-9]{2}"` |
| Exception Stack Traces | Application error logs, exception handlers, APM tools | File:line shows exact code location; exception type shows error category | `grep -E -A 10 "Exception\|Error\|Traceback" /var/log/app.log` |
| Network Errors | System logs, service logs for connection attempts | ECONNREFUSED = port not listening; ENOTFOUND = DNS failure; ETIMEDOUT = unreachable | `grep -E "ECONNREFUSED\|ENOTFOUND\|ETIMEDOUT\|EAGAIN" /var/log/*.log` |
| Database Errors | Database logs, application logs from DB driver | Constraint violations indicate schema/data issues; timeouts indicate resource exhaustion | `grep -E "Duplicate\|Constraint\|deadlock\|timeout" /var/log/mysql.log` |
| Request ID Trace | All service logs filtered by correlation ID | Follow a single request through call chain; first error is root cause | `grep -r "request_id=ABC123" /var/log/ \| sort` |
| Cache State | Redis/Memcache logs, cache client logs, cache monitoring | Cache miss = not cached; stale = TTL expired but not refreshed; poisoned = cached wrong value | `redis-cli GET key; redis-cli TTL key` |
| Timestamp Alignment | Log timestamps from all services | Clock skew >100ms indicates NTP issue; aligned timestamps enable causal ordering | `grep "2026-04-10T14:32:15" /var/log/*/service.log` |
| Resource Exhaustion | System metrics, container metrics, process resource limits | OOM, ENOBUFS, "too many open files" = resource limits hit | `free -h; ulimit -a; df -h` |
| Dependency Status | Health check endpoints, service discovery logs, circuit breaker logs | Circuit breaker OPEN = downstream service failing; health check FAIL = service not ready | `curl -s http://service:port/health; grep -E "circuit\|health" /var/log/app.log` |
| Concurrent Access Patterns | Transaction logs, lock logs, concurrent request logs | Same resource accessed simultaneously → race condition or deadlock | `grep -E "UPDATE.*WHERE\|SELECT.*FOR UPDATE" /var/log/mysql.log` |
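Putting the request-ID row to work, a sketch for surfacing the first error in a correlated request chain (ABC123 is an illustrative ID; assumes log lines begin with an ISO-8601 timestamp and that log paths contain no colons):

```bash
# Sketch: gather every line for one request ID, order by timestamp, keep errors.
grep -r "request_id=ABC123" /var/log/ \
  | sort -t: -k2 \
  | grep -E "ERROR|FATAL" \
  | head -3        # the top entry is the first failure (root-cause candidate)
```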
/brain-read to access service definitions and dependencies
Before handing fault diagnosis to self-heal-triage:
semantic-eval-run.log (+ manifest outcome); RED must cite failed step **id**s from JSON lines
Evidence packaged for self-heal-triage consumption — not removed eval-scenario YAML
qa-semantic-csv-orchestrate, docs/semantic-eval-csv.md — semantic CSV schema and runner.
eval-judge § Semantic path — verdict from manifest + log.
self-heal-triage
self-heal-loop-cap
self-heal-systematic-debug
Eval Fails
│
└──> self-heal-locate-fault (THIS SKILL)
Output: Fault diagnosis with evidence
│
└──> self-heal-triage
Output: Failure classification
│
├──> IF FLAKY: re-run with instrumentation
│
├──> IF TEST_BUG: skip test, file issue
│
└──> IF REPRODUCIBLE:
│
└──> self-heal-systematic-debug (if needed for complex faults)
Output: Confirmed root cause
│
└──> self-heal-remediate
Output: Fix applied, eval re-run