rca-report

Use when investigating and documenting a production incident, outage, data corruption event, or post-mortem — guides evidence collection during the investigation AND produces a rich, reproducible Root Cause Analysis report. Trigger on phrases like "write an RCA", "post-mortem for X", "document this incident", "what went wrong with...", "the pipeline broke yesterday, help me investigate", or any time the user is debugging a recently-resolved incident and wants a writeup. Also use proactively when the user finishes resolving an incident in-session and the resolution context is fresh — offer to capture it as an RCA before details fade.

Overview

Install command

npx skills add https://github.com/bmad-labs/skills --skill rca-report

Copy and paste this command into Claude Code to install the skill

Source

bmad-labs/skills

Stars: 9

Forks: 2

Updated: May 5, 2026 at 12:21

SKILL.md

RCA Report

Produce post-mortems that are reproducible, layered, and operationally useful — not just narrative. A good RCA lets a future engineer (or future you) understand the incident, verify the fix held, and avoid repeating it. This skill covers both the investigation flow (what to gather while the incident is fresh) and the report itself.

When this skill applies

A production incident, outage, or near-miss has occurred and needs documenting
A pipeline, service, or system silently failed and the user just resolved it
The user wants a post-mortem, RCA, or incident write-up
Mid-investigation: the user is debugging something and wants help structuring evidence as it comes in

If the incident is still actively burning and the user just wants help fixing it, skip this skill — fix first, document after.

Output location

Save the report to <topic>-rca-<YYYY-MM-DD>.md in the current working directory, where:

<topic> is a short kebab-case identifier of the failing system (e.g. debezium, auth-service, kafka-consumer-lag)
<YYYY-MM-DD> is the incident date (when it occurred), not necessarily today

Examples: debezium-rca-2026-05-05.md, auth-500s-rca-2026-04-12.md
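The naming rule above can be sketched as a small helper. This is an illustration only: `rca_filename` is a hypothetical name, not part of the skill, and the slug logic (lowercase, non-alphanumerics collapsed to hyphens) is one assumed reading of "short kebab-case identifier".

```python
import re
from datetime import date


def rca_filename(topic: str, incident_date: date) -> str:
    """Build '<topic>-rca-<YYYY-MM-DD>.md' from a free-form topic and the incident date."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    # Use the date the incident occurred, not the date the report is written.
    return f"{slug}-rca-{incident_date.isoformat()}.md"


print(rca_filename("Auth 500s", date(2026, 4, 12)))
# prints: auth-500s-rca-2026-04-12.md
```

Passing the incident date explicitly, rather than defaulting to today, keeps the filename tied to when the incident occurred.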

The two-phase workflow

Phase 1 — Investigation (gather evidence)

Before writing anything, walk through references/investigation-checklist.md with the user. The goal is to lock in concrete, reproducible facts — timestamps, version numbers, exact LSNs/IDs/error strings, command outputs — while the system state is still observable. Memory degrades fast; logs rotate; replication slots advance. Capture now, write later.

Do not skip this phase even if the user says "I already fixed it" — fixed-state evidence (the healthy confirmed_flush_lsn advancing, the test row flowing through Kafka, the new container log showing "streaming from latest xlogpos") is what proves the resolution actually held. That proof is what separates a real RCA from a story.

If the user already has notes/transcripts/scrollback from the live incident, mine those first before asking questions. Don't make them re-type what's already in the conversation.

Phase 2 — Write the report

Use templates/rca-report.md as the structural skeleton. Fill it section by section using the evidence from Phase 1. Then validate against references/quality-rubric.md before declaring done.

What makes an RCA actually good

The Debezium RCA that this skill is modeled on worked because it had:

A timeline with UTC timestamps for every observable event — "the connector was wedged for ~18h" is narrative; "2026-05-04 09:54:16 Postgres terminated replication connection" is evidence. Always prefer the precise version.
An infrastructure table that fully identifies the system — versions, hostnames, zones, connector names, topic names, slot names. Someone reading this six months later should be able to find the exact resources without ambiguity.
Quantified impact across user, system, data, and SLA dimensions — vague impact ("some customers were affected") is worthless for severity calibration. State user-visible effect, internal system degradation, data integrity status, and SLA / financial cost as concrete numbers. If a number is unknown, say so explicitly rather than skipping the dimension.
Layered root cause analysis — not just what broke, but:
- The primary cause (the trigger event)
- Why it appeared healthy (what masked the failure — this is usually the most valuable part)
- Secondary mechanisms (gates, retries, internal state that contributed)
State snapshots with actual values — the contrast between expected state and observed state is what makes the diagnosis click. confirmed_flush_lsn = 1/AD5B16C0 (pre-restore stale value) next to pg_current_wal_lsn = 1/ADC4B740 (current) tells the whole story in two lines. Capture similar contrasts for whatever domain you're in (queue depth, error rate, version mismatch, schema drift).
Workaround / temporary mitigation captured separately from the resolution — the fast, low-risk action that stopped the bleeding before the root cause was fully understood. Workarounds and resolutions answer different questions: workaround = what does on-call do at 3am next time this fires; resolution = what permanently closes the case. Document the workaround's effect, its risks, and the trigger condition for applying it.
Resolution with ordering rationale — not just "I ran these commands", but why this order. If step 4 must come after step 3 because of in-memory state, say so. The next person hitting this will try the obvious order first and fail; document why obvious-order doesn't work.
A Five Whys chain that lands on a systemic gap — the chain is only useful if it stops at a missing guardrail (alert / review / test / knowledge), not at the technical trigger. Each "why" should narrow on a different mechanism — synonyms across adjacent steps mean you're padding. The final answer should map directly to a Recommendation below it.
A "What did NOT work" section — capture the dead ends. Future-you will be tempted to try the same thing. The Debezium RCA's "drop slot + recreate connector without offset reset" entry is gold — it's the most intuitive fix and it silently fails.
Diagnostic commands as a copy-paste block — the next incident in this domain will reuse 80% of these. Make them runnable, not pseudocode.
Verification evidence — proof the fix held. Test data flowing end-to-end. Slot lag stabilizing. Error rate returning to baseline. With actual values from the post-fix state.
Recommendations binned by urgency — Immediate (alerting/monitoring), Process (runbooks, comms), Configuration (settings changes). Bins force the user to think about timeline, not just "stuff to do".
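The LSN contrast in the state-snapshot item above can be reduced to a single number for the report. The sketch below assumes the standard PostgreSQL LSN text form (two hexadecimal numbers separated by a slash, high 32 bits first); `parse_lsn` and `lsn_gap_bytes` are hypothetical helper names. On a live server, `pg_wal_lsn_diff` gives the same difference directly.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '1/AD5B16C0' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(current: str, confirmed: str) -> int:
    """Bytes of WAL between the slot's confirmed position and the current WAL position."""
    return parse_lsn(current) - parse_lsn(confirmed)


# Values quoted in the state-snapshot example above.
gap = lsn_gap_bytes("1/ADC4B740", "1/AD5B16C0")
print(f"slot is {gap} bytes ({gap / 2**20:.1f} MiB) behind")
# prints: slot is 6922368 bytes (6.6 MiB) behind
```

Recording both raw LSNs and the computed gap lets a reader check the arithmetic without access to the original system.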

Anti-patterns to avoid

Vague timelines: "The next morning we noticed..." — when? What did "noticed" actually mean? Who saw what?
Single-layer root cause: stopping at the trigger event without explaining the masking mechanism. If the system appeared healthy, that masking is the root cause for the duration of the outage.
Resolution without rationale: a list of commands with no explanation of why this order or why this approach. That's a runbook, not an RCA.
Hand-wavy recommendations: "improve monitoring" is not actionable. "Alert when pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) > 100MB for 5+ minutes" is.
Skipping the failed approaches: every dead end you don't document is a trap for the next person.
No verification: closing the report without proof the fix actually worked. This is how RCAs become folklore.
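To show what makes the replication-lag recommendation above actionable, here is a minimal sketch of its "threshold held for a duration" logic. It is not a real alerting configuration: `should_alert` and the sample format are assumptions for illustration, and "100MB" is read as 100 MiB.

```python
from datetime import datetime, timedelta

LAG_THRESHOLD_BYTES = 100 * 1024 * 1024  # "100MB" from the example, read as MiB (assumption)
SUSTAINED_FOR = timedelta(minutes=5)     # "5+ minutes" from the example


def should_alert(samples: list[tuple[datetime, int]]) -> bool:
    """Return True if lag stayed above the threshold for at least SUSTAINED_FOR.

    `samples` is a time-ordered list of (timestamp, lag_bytes) pairs.
    """
    breach_started = None
    for ts, lag in samples:
        if lag > LAG_THRESHOLD_BYTES:
            if breach_started is None:
                breach_started = ts  # first sample of a new breach window
            if ts - breach_started >= SUSTAINED_FOR:
                return True
        else:
            breach_started = None  # lag recovered, so the window resets
    return False
```

A recommendation written this way names a metric, a threshold, and a duration, so it can be implemented and tested; "improve monitoring" names none of them.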

Workflow

Confirm the incident is resolved or contained. If still actively firing, defer this skill.
Mine existing context first. Conversation transcripts, scrollback, prior notes — extract everything you can before asking the user questions.
Walk the investigation checklist (references/investigation-checklist.md). Fill gaps by asking targeted questions or running diagnostic commands. Do this even for "small" incidents — the structure forces depth.
Capture state snapshots. If the system is reachable, grab the current healthy state values now (slot lag, queue depth, error rate, etc.) — these go in the verification section. Lose them and you can't prove the fix.
Draft the report using templates/rca-report.md. Fill every section; if a section truly doesn't apply, write "N/A — [reason]" rather than deleting it.
Validate against the quality rubric (references/quality-rubric.md). Fix any rubric failures before presenting.
Save to <topic>-rca-<YYYY-MM-DD>.md in CWD.
Show the user the file path and offer to walk through any section. Don't dump the full report into chat unless asked — they'll read the file.
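The "fill every section" step can be spot-checked mechanically before the rubric review. The sketch below assumes the draft uses Markdown `## ` headings for its top-level sections; the actual headings in templates/rca-report.md are not shown here, so the function reads whatever headings it finds. `empty_sections` is a hypothetical helper, not part of the skill, and it does not skip headings that appear inside fenced code blocks.

```python
import re


def empty_sections(markdown: str) -> list[str]:
    """Return the level-2 headings whose body is blank, i.e. sections left unfilled."""
    empty = []
    heading = None
    has_body = False
    for line in markdown.splitlines():
        match = re.match(r"##\s+(.*)", line)  # matches '## X' only, not '# X' or '### X'
        if match:
            if heading is not None and not has_body:
                empty.append(heading)
            heading = match.group(1).strip()
            has_body = False
        elif line.strip():
            has_body = True  # any non-blank line, including an N/A note, counts as filled
    if heading is not None and not has_body:
        empty.append(heading)
    return empty


draft = "# Example RCA\n\n## Timeline\n\n(entries)\n\n## Workaround\n\n## Verification\n\n(evidence)\n"
print(empty_sections(draft))
# prints: ['Workaround']
```

A section that holds only an N/A note passes this check, which matches the rule that inapplicable sections are marked rather than deleted.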

Tone

Match the operator's voice: technical, concise, evidence-led. Lead each section with the answer, then the reasoning. No corporate hedging ("there may have been some impact") — state what happened. No blame language — focus on system gaps, not individuals. The Debezium RCA is the reference; mirror its directness.
