rootcauseanalysis

Stars: 16,098

Forks: 2,216

Updated: April 30, 2026, 16:09


Source

danielmiessler

danielmiessler/LifeOS

name	RootCauseAnalysis
description	Structured incident investigation grounded in Toyota Production System, Kaoru Ishikawa, James Reason's Swiss Cheese model, Dean Gano's Apollo method, and Google SRE blameless-postmortem culture. Five workflows: FiveWhys (linear/branching causal chain, single-thread incidents), Fishbone/Ishikawa (6 M's or 4 P's category mapping, multiple suspected areas), Postmortem (blameless timeline + contributing factors + action items, wraps other methods), FaultTree (AND/OR gate logic, safety-critical multi-path failures), KepnerTregoe IS/IS-NOT (distinction analysis, subtle hard-to-reproduce defects). Context files: Foundation.md (Toyoda, Ishikawa, Reason, Gano, Google SRE; canonical methods), MethodSelection.md (decision flow for workflow selection). Core axiom: proximate cause is where analysis starts, not ends. Humans are never root causes — if a human could make the mistake, the system allowed it. A cause is "root enough" when it's actionable. Also supports FMEA-style pre-launch risk inversion (what could fail before it does). Integrates with Science (hypothesis generation during investigation) and RedTeam (stress-test remediations). NOT FOR structural/systemic loops and feedback archetypes (use SystemsThinking) or axiom decomposition (use FirstPrinciples). USE WHEN root cause, RCA, 5 whys, fishbone, postmortem, incident analysis, why did this happen, fault tree, what really caused this, why does this keep failing, blameless, defect investigation, recurring bug, pre-launch risk.
effort	high
context	fork

Customization

Before executing, check for user customizations at: ~/.claude/PAI/USER/SKILLCUSTOMIZATIONS/RootCauseAnalysis/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
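The customization check above can be sketched in Python. This is a minimal illustration, not the skill's actual loader: the helper name `load_customizations` is hypothetical, and reading only `*.md` files is an assumption (the text also mentions other configurations and resources).

```python
from pathlib import Path

# Directory taken from the text above; everything else here is illustrative.
CUSTOM_DIR = (
    Path.home() / ".claude" / "PAI" / "USER" / "SKILLCUSTOMIZATIONS" / "RootCauseAnalysis"
)


def load_customizations(custom_dir: Path = CUSTOM_DIR) -> dict:
    """Return {filename: text} for Markdown files in custom_dir.

    An empty dict means the directory is absent, so skill defaults apply.
    """
    if not custom_dir.is_dir():
        return {}
    return {
        p.name: p.read_text(encoding="utf-8")
        for p in sorted(custom_dir.glob("*.md"))
    }


overrides = load_customizations()
if "PREFERENCES.md" in overrides:
    print("Applying user preferences")
else:
    print("Using skill defaults")
```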

MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

Send voice notification:

curl -s -X POST http://localhost:31337/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the RootCauseAnalysis skill to ACTION"}' \
  > /dev/null 2>&1 &

Output text notification:

Running the **WorkflowName** workflow in the **RootCauseAnalysis** skill to ACTION...

This is not optional. Execute this curl command immediately upon skill invocation.

RootCauseAnalysis Skill

Structured investigation of why something failed: beyond the proximate cause, down to the contributing factors and latent conditions that actually made the failure possible. Grounded in Sakichi Toyoda's 5 Whys (as practiced in the Toyota Production System), Kaoru Ishikawa, James Reason's Swiss Cheese model, Dean Gano's Apollo method, and Google SRE / Etsy blameless-postmortem culture.

The goal is not to find "the" root cause — that framing is almost always wrong. The goal is to identify contributing factors that are actionable. A good RCA ends with changes that prevent a class of failure, not just the specific incident.

Core Concept

Five axioms this skill operates on:

Proximate cause ≠ root cause. "The deploy failed because X crashed" is usually where real analysis starts, not where it ends.
There is rarely one cause. Incidents typically have multiple contributing factors — active failures (what a human did) and latent conditions (what the system allowed). James Reason's Swiss Cheese model.
Humans are not root causes. "Operator error" is a label that halts analysis prematurely, not a conclusion. If a human could make the mistake, the system allowed it. Go deeper.
Actionability is the stop condition. A cause is "root enough" when it points to a change you can actually make. Go too shallow and you miss the fix; go too deep ("physics") and you can't act on it.
RCA is a bias-fight. Hindsight bias, confirmation bias, single-cause bias, and outcome bias all actively corrupt investigations. Structure exists to resist them.

Use / Win

When to use:

Any incident or outage — production failure, security event, deploy gone bad.
Recurring defects — bugs of the same shape keep appearing despite fixes.
Quality problems — metrics drifting, users reporting the same class of issue.
Postmortems — structured, blameless review of an incident's causal chain.
Pre-launch risk analysis — inverting RCA with FMEA to catch failure modes before they happen.
Security investigations — chain of events, contributing controls, latent conditions.
Process failures — a person or team consistently missing a mark. Structure is probably the cause.

What you win:

Actionable contributing factors (plural) rather than a single blame target.
Latent conditions surfaced — the Swiss cheese holes lining up that nobody knew were there.
Durable fixes — structural changes, not patches to the specific failure.
Blame-free analysis — the team can be honest about what happened without self-protective omissions.
Cross-incident pattern recognition — after a few RCAs, the repeated latent conditions become visible.
Discipline against bias — structured methods force you past the first plausible story.

Default mental model: If the same failure class could happen again tomorrow, you haven't done RCA — you've done triage.

Workflow Routing

Route to the appropriate workflow based on the request.

Workflow	Trigger	File
FiveWhys	"5 whys", "five whys", quick causal chain, ask why until root	`Workflows/FiveWhys.md`
Fishbone	"fishbone", "ishikawa", categorized cause map, 6 M's / 4 P's / 8 M's	`Workflows/Fishbone.md`
Postmortem	"postmortem", "incident review", "blameless postmortem", production incident	`Workflows/Postmortem.md`
FaultTree	"fault tree", "fta", top-down deductive, safety-critical, AND/OR logic	`Workflows/FaultTree.md`
KepnerTregoe	"kepner tregoe", "is/is-not", "what changed", distinction analysis, subtle defects	`Workflows/KepnerTregoe.md`
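The routing table above can be read as a keyword lookup. The sketch below is one possible reading, not the skill's actual router: the `route` function is hypothetical, whole-phrase matching is an assumption, and only the quoted trigger phrases from the table are included.

```python
import re
from typing import Optional

# Quoted trigger phrases copied from the routing table; unquoted triggers
# (e.g. "safety-critical") are left out of this sketch.
ROUTES = {
    "FiveWhys": ("5 whys", "five whys"),
    "Fishbone": ("fishbone", "ishikawa"),
    "Postmortem": ("postmortem", "incident review"),
    "FaultTree": ("fault tree", "fta"),
    "KepnerTregoe": ("kepner tregoe", "is/is-not", "what changed"),
}


def route(request: str) -> Optional[str]:
    """Return the workflow file for the first matching trigger, else None."""
    text = request.lower()
    for workflow, triggers in ROUTES.items():
        for trigger in triggers:
            # Word boundaries keep short triggers like "fta" from
            # matching inside unrelated words.
            if re.search(r"\b" + re.escape(trigger) + r"\b", text):
                return f"Workflows/{workflow}.md"
    return None


print(route("Run a blameless postmortem on last night's outage"))
# Workflows/Postmortem.md
```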

Quick Reference

5 workflows — FiveWhys, Fishbone, Postmortem, FaultTree, KepnerTregoe
5 Whys: Linear/branching causal chain. Best for simple, single-thread incidents.
Fishbone: 6 M's (Manpower, Machine, Method, Material, Measurement, Mother-Nature) for manufacturing; 4 P's (People, Process, Policies, Procedures) for service. Use when multiple category causes are suspected.
Postmortem: Timeline + contributing factors + action items. Blameless framing mandatory.
Fault Tree: AND/OR gate logic, deductive, top-down. Best for safety-critical and complex multi-path failures.
Kepner-Tregoe IS/IS-NOT: Identify distinctions between where the problem occurred and where it did not. Best for subtle, hard-to-reproduce defects.
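The AND/OR gate logic mentioned for Fault Tree can be illustrated with a small recursive evaluator. This is a sketch under assumptions: the `Event` class, the `occurs` function, and the example tree are hypothetical and are not taken from `Workflows/FaultTree.md`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" for a leaf event
    children: List["Event"] = field(default_factory=list)


def occurs(event: Event, basic_states: Dict[str, bool]) -> bool:
    """Evaluate top-down whether an event occurs given the leaf states."""
    if event.gate == "BASIC":
        return basic_states.get(event.name, False)
    results = [occurs(child, basic_states) for child in event.children]
    if event.gate == "AND":
        return all(results)  # every input must occur
    if event.gate == "OR":
        return any(results)  # one input is enough
    raise ValueError(f"unknown gate: {event.gate}")


# Hypothetical tree: the outage needs either a database failure, or a
# cold cache combined with a missing warm-up step.
top = Event("service outage", "OR", [
    Event("database down"),
    Event("cold start overload", "AND", [
        Event("cache is cold"),
        Event("no warm-up step"),
    ]),
])

print(occurs(top, {"cache is cold": True}))                           # False
print(occurs(top, {"cache is cold": True, "no warm-up step": True}))  # True
```

The AND gate is what makes a fault tree useful for remediation: removing any one input to an AND gate blocks that path to the top event.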

Context files (loaded on demand):

Foundation.md — Toyoda, Ishikawa, Reason, Gano, Google SRE; canonical methods
MethodSelection.md — decision flow for which workflow to use

Method Selection Guide

Situation	Preferred workflow
Single-thread incident, one clear failure point	FiveWhys
Multiple suspected categories (people, process, tools)	Fishbone
Production outage or security incident, needs formal review	Postmortem
Complex multi-path failure, safety-critical, need Boolean logic	FaultTree
Subtle defect, hard to reproduce, "why here and not there?"	KepnerTregoe

For non-trivial incidents: Postmortem wraps the others. Start with a Postmortem structure, use 5 Whys / Fishbone / FTA inside it as investigation tools.

Integration

Depends on: nothing — standalone analytical skill.

Works well with:

SystemsThinking — RCA stops at contributing factors; SystemsThinking continues down to structure and mental models. Pair them when patterns repeat across incidents.
FirstPrinciples — decompose a contributing factor to its fundamental truths before fixing.
RedTeam — "how would we cause this again?" is adversarial RCA. Use RedTeam to stress-test remediations.
Science — RCA is the scientific method applied to failures. Use Science for hypothesis generation during investigation.

Examples

Example 1: Production outage

User: "the payments service went down for 14 minutes last night"
→ Postmortem workflow
→ Timeline: deploy at 23:47 → health check passed → traffic shift 23:49 → p99 latency spike 23:51 → auto-rollback 00:01
→ 5 Whys inside: Why did p99 spike? Cold cache. Why cold? New pod group. Why no warm? No warm-up in deploy script. Why? Not in checklist. Why? Template predates the caching layer.
→ Contributing factors: deploy template stale (latent); no warm-up step (active); no cache-cold canary (latent)
→ Remediation: update deploy template, add warm-up step, add cold-cache canary gate
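The timeline arithmetic in Example 1 can be checked with a short sketch. The timestamps come from the example; the helper `minutes_since_start` is hypothetical, and treating a backwards clock jump as a midnight rollover is an assumption.

```python
# Timestamped events from Example 1 (the health check has no timestamp
# in the example, so it is omitted).
TIMELINE = [
    ("23:47", "deploy"),
    ("23:49", "traffic shift"),
    ("23:51", "p99 latency spike"),
    ("00:01", "auto-rollback"),
]


def minutes_since_start(timeline):
    """Return each event's offset in minutes from the first event."""
    absolute, day, prev = [], 0, None
    for hhmm, _label in timeline:
        hours, minutes = map(int, hhmm.split(":"))
        clock = hours * 60 + minutes
        if prev is not None and clock < prev:
            day += 1  # clock went backwards: assume we crossed midnight
        prev = clock
        absolute.append(day * 1440 + clock)
    return [t - absolute[0] for t in absolute]


print(minutes_since_start(TIMELINE))  # [0, 2, 4, 14]
```

Deploy to auto-rollback spans 14 minutes, which matches the duration the user reported.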

Example 2: Recurring defect

User: "users keep reporting the same kind of auth failure, we've fixed it 3 times"
→ Fishbone workflow
→ 6 M's expansion: Manpower/People (ops auth rotates keys without notifying infra), Method (no key-rotation runbook), Machine (secret cache TTL exceeds rotation window), Material (shared key instead of per-service keys), Measurement (no key-expiry dashboard), Mother-Nature (none)
→ Root causes (multiple): Method + Material + Measurement all contribute. Single-point fix won't hold.
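The category map in Example 2 can be held as a plain dictionary. This is a sketch of one possible representation, not the format used by `Workflows/Fishbone.md`; the helper names are hypothetical, and the cause texts are copied from the example.

```python
# Fishbone categories mapped to the suspected causes from Example 2.
FISHBONE = {
    "Manpower/People": ["ops auth rotates keys without notifying infra"],
    "Method": ["no key-rotation runbook"],
    "Machine": ["secret cache TTL exceeds rotation window"],
    "Material": ["shared key instead of per-service"],
    "Measurement": ["no key-expiry dashboard"],
    "Mother-Nature": [],
}


def populated_categories(fishbone):
    """Return the categories that have at least one suspected cause."""
    return [category for category, causes in fishbone.items() if causes]


def needs_multi_point_fix(fishbone):
    """True when causes span more than one category."""
    return len(populated_categories(fishbone)) > 1


print(populated_categories(FISHBONE))
print(needs_multi_point_fix(FISHBONE))  # True
```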

Example 3: Subtle defect

User: "this flaky test only fails in CI, not locally"
→ KepnerTregoe workflow
→ IS/IS-NOT table: fails on CI / passes locally; fails Tuesdays / not other days; fails on shared runners / not dedicated; fails with parallel test workers / not serial
→ Distinctions point to: time-zone + concurrency + shared file system
→ Hypothesis: test relies on local timezone assumption + race condition on shared /tmp — both only triggered in CI's environment.
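The IS/IS-NOT comparison in Example 3 amounts to finding attribute values that appear only where the problem occurs. The sketch below illustrates that idea; the `distinctions` function and the run records are hypothetical, loosely modeled on the example.

```python
# Hypothetical run records. "os" is identical everywhere, so it should
# not show up as a distinction.
RUNS = [
    {"env": "ci", "runner": "shared", "workers": "parallel", "os": "linux", "failed": True},
    {"env": "ci", "runner": "shared", "workers": "parallel", "os": "linux", "failed": True},
    {"env": "local", "runner": "dedicated", "workers": "serial", "os": "linux", "failed": False},
]


def distinctions(runs):
    """Return {attribute: values seen in failing runs but never in passing runs}."""
    failing = [r for r in runs if r["failed"]]
    passing = [r for r in runs if not r["failed"]]
    result = {}
    for attr in runs[0]:
        if attr == "failed":
            continue
        only_in_failures = {r[attr] for r in failing} - {r[attr] for r in passing}
        if only_in_failures:
            result[attr] = only_in_failures
    return result


print(distinctions(RUNS))
```

Each distinction is a hypothesis to test rather than a conclusion: with so few passing runs, several attributes vary together and cannot be separated yet.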

Best Practices

Always blameless. The framing is "what system allowed this" not "who screwed up." This is non-negotiable: blame corrupts the analysis.
Multiple causes, always. Single-root-cause conclusions are almost always wrong. Name at least three contributing factors before stopping.
Actionability test every cause. Can you change it? If no — go shallower. If yes — go one level deeper to make sure you've found the lever.
Timelines before theories. Reconstruct what happened before hypothesizing why. Hindsight bias compresses the timeline.
Ask "who else could make this mistake?" If the answer is "anyone on the team," it's a systemic cause, not individual error.
Separate investigation from judgment. Never let the incident review drift into performance conversations. Separate meeting.

Gotchas

"Human error" is a starting point, not a root cause. It's where the investigation begins. Every human error sits on top of a system that made the error possible or probable.
The first plausible cause is almost never the only one. Confirmation bias loves RCA. Keep going after you find one.
Stopping at proximate cause is failure. "X crashed because Y returned null." Why did Y return null? Why wasn't null handled? Why wasn't that tested? Go down.
Going too deep ≠ good RCA. "The fundamental cause is the second law of thermodynamics" is not actionable. Stop at the deepest actionable level.
Asking "why" more than ~5 times often means you switched causal chains. Re-draw as a tree, not a line.
Don't confuse correlation with cause. Two things happening together is a hypothesis to test, not a conclusion.
Outcome bias is sneaky. Decisions that turn out badly get judged harshly even if they were right given the information at the time. Separate process quality from outcome.

Attribution: Frameworks drawn from Sakichi Toyoda (5 Whys, later central to the Toyota Production System), Kaoru Ishikawa (Guide to Quality Control, 1968; Fishbone diagram), James Reason (Human Error, 1990; Swiss Cheese model), Dean Gano (Apollo Root Cause Analysis, 2008), Charles Kepner & Benjamin Tregoe (The Rational Manager, 1965), Google SRE book, Etsy blameless postmortem culture (John Allspaw).

Execution Log

After completing any workflow, append a single JSONL entry:

echo '{"ts":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","skill":"RootCauseAnalysis","workflow":"WORKFLOW_USED","input":"8_WORD_SUMMARY","status":"ok|error","duration_s":SECONDS}' >> ~/.claude/PAI/MEMORY/SKILLS/execution.jsonl
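The same log entry can be built with a JSON serializer instead of hand-quoted shell, which avoids escaping problems when the summary contains quotes. This is a sketch, not part of the skill: `log_execution` is a hypothetical helper, and creating the parent directory is an assumption. The field names and log path come from the command above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path.home() / ".claude" / "PAI" / "MEMORY" / "SKILLS" / "execution.jsonl"


def log_execution(workflow, summary, status, duration_s, path=LOG_PATH):
    """Append one JSONL entry with the fields used by the shell command."""
    if status not in ("ok", "error"):
        raise ValueError("status must be 'ok' or 'error'")
    entry = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "skill": "RootCauseAnalysis",
        "workflow": workflow,
        "input": summary,
        "status": status,
        "duration_s": duration_s,
    }
    path.parent.mkdir(parents=True, exist_ok=True)  # assumption: create if missing
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

A call such as `log_execution("Postmortem", "payments outage cold cache after deploy", "ok", 42)` appends a single line to the log.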