incident-triage

Stars: 39 · Forks: 26 · Updated: May 26, 2026, 11:17

Structured incident investigation for OpenShift using the Five Whys methodology, investigation guardrails, Prometheus metric analysis, and adversarial due diligence. Orchestrates multi-resource diagnosis across Deployments, ReplicaSets, Pods, Services, and cluster resources to trace from observed symptoms to root cause. Use when: - "investigate this incident" - "triage this alert" - "root cause analysis" - "what caused this outage" - User mentions "five whys", "incident", "triage", "RCA" NOT for single-resource issues with clear patterns (use /debug-pod, /debug-scc, /debug-rbac, or /debug-network instead).


Source: RHEcosystemAppEng/agentic-collections


name	incident-triage
description	Structured incident investigation for OpenShift using the Five Whys methodology, investigation guardrails, Prometheus metric analysis, and adversarial due diligence. Orchestrates multi-resource diagnosis across Deployments, ReplicaSets, Pods, Services, and cluster resources to trace from observed symptoms to root cause. Use when: - "investigate this incident" - "triage this alert" - "root cause analysis" - "what caused this outage" - User mentions "five whys", "incident", "triage", "RCA" NOT for single-resource issues with clear patterns (use /debug-pod, /debug-scc, /debug-rbac, or /debug-network instead).
model	inherit
color	cyan
license	Apache-2.0
allowed-tools	resources_get resources_list events_list pods_list pods_list_in_namespace pods_log prometheus_query prometheus_query_range alertmanager_alerts
metadata	{"user_invocable":"true"}

/incident-triage Skill

Structured incident investigation for OpenShift — traces from symptoms to root cause using Five Whys, investigation guardrails, and adversarial due diligence.

Critical: Human-in-the-Loop Requirements

Before any remediation action (patch, scale, delete, restart):
- Display preview: what will change and its impact
- Ask: "Should I apply this fix?"
- Wait for confirmation (yes/no)

At each investigation phase transition:
- Present findings so far
- Ask: "Continue to [next phase]? (yes/no)"
- Wait for confirmation

Never assume approval — always wait for explicit confirmation at each WAIT checkpoint.
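
The checkpoint rule above can be sketched as a small gate. This is a minimal illustration, not part of the skill itself: the function names `is_explicit_approval` and `gate` are hypothetical, and treating only a literal "yes" as approval is an assumption based on the yes/no prompts in this document.

```python
def is_explicit_approval(reply: str) -> bool:
    """Return True only for an unambiguous 'yes' (case and whitespace ignored)."""
    return reply.strip().lower() == "yes"


def gate(action_preview: str, reply: str) -> str:
    """Decide what happens at a WAIT checkpoint given the user's reply."""
    if is_explicit_approval(reply):
        return f"proceed: {action_preview}"
    # Silence, hedged answers, and 'no' are all treated as missing approval.
    return "halt: no explicit approval"
```

Anything other than an explicit "yes" halts, which keeps ambiguous replies such as "sure, probably" from being read as approval.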

Prerequisites

Required MCP Servers:

openshift (setup) — Kubernetes/OpenShift resource access
observability — Prometheus metric discovery and PromQL query execution

Required MCP Tools:

resources_get (from openshift) — Retrieve Deployment, ReplicaSet, Pod, Service, and other resource details
resources_list (from openshift) — List resources by kind in a namespace
pods_list (from openshift) — List pods matching label selectors
pods_log (from openshift) — Retrieve container logs (current and previous)
events_list (from openshift) — Fetch events filtered by resource
prometheus_query (from openshift, observability toolset) — Execute instant PromQL queries for trend and saturation analysis
prometheus_query_range (from openshift, observability toolset) — Execute range PromQL queries over time windows
alertmanager_alerts (from openshift, observability toolset) — Retrieve active Alertmanager alerts

Verification Steps:

Check openshift server is configured in mcps.json with observability in its --toolsets
Verify user is logged into an OpenShift cluster (oc whoami succeeds)
Verify user has access to the target namespace(s)
If missing → Human Notification Protocol
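
The first verification step could be checked along these lines. The layout of `mcps.json` is not shown in this document, so the `mcpServers` and `args` keys below are assumptions made for illustration, as is the helper name `observability_enabled`; adjust them to the real file.

```python
def observability_enabled(config: dict, server: str = "openshift") -> bool:
    """Check that `server` lists `observability` in its --toolsets argument."""
    # ASSUMPTION: servers live under "mcpServers" and CLI flags under "args".
    args = config.get("mcpServers", {}).get(server, {}).get("args", [])
    for i, arg in enumerate(args):
        if arg == "--toolsets" and i + 1 < len(args):
            value = args[i + 1]  # form: --toolsets core,observability
        elif arg.startswith("--toolsets="):
            value = arg.split("=", 1)[1]  # form: --toolsets=core,observability
        else:
            continue
        return "observability" in [t.strip() for t in value.split(",")]
    return False  # server missing or no --toolsets flag found
```

A `False` result corresponds to the "If missing" branch: stop and follow the Human Notification Protocol.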

Human Notification Protocol:

When prerequisites fail:

1. Stop immediately: make no further tool calls.

2. Report the error:

   ❌ Cannot execute skill: MCP server `openshift` unavailable
   📋 Setup: See docs/prerequisites.md for cluster access configuration

3. Request a decision: "How to proceed? (setup/skip/abort)"
4. Wait for user input.

Security: Never display credential values.

When to Use This Skill

Use /incident-triage when:

The incident spans multiple resources or namespaces
The root cause is unclear after initial investigation
You need a structured RCA with confidence scoring and Five Whys methodology
An alert fired and you need to trace from symptom to root cause
A predicted issue (e.g., from predict_linear) needs proactive assessment

Do not use this skill when:

The issue is a single pod crashing → use /debug-pod
SCC admission is blocking pod creation → use /debug-scc
RBAC 403 errors in pod logs → use /debug-rbac
Service/Route connectivity failure → use /debug-network
Build failure → use /debug-build

Workflow

[Gather Context] → [Hierarchical Investigation] → [Evidence + Metrics] → [Five Whys RCA] → [Due Diligence] → [Findings + Actions]

Step 1: Gather Incident Context

MCP Tool: resources_get (from openshift)

Parameters:

kind: "<resource-kind>" (inferred from user description)
name: "<resource-name>" (from user input)
namespace: "<namespace>"

Input Validation: Verify that resource names and namespaces conform to Kubernetes naming rules (RFC 1123: lowercase alphanumerics and hyphens, plus dots in DNS subdomain names; at most 253 characters for subdomain names and 63 for label names such as namespaces). Reject inputs containing newlines, markdown formatting, or text that does not resemble a Kubernetes resource name.
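
A minimal sketch of this validation, assuming namespaces are checked as RFC 1123 labels and other resource names as RFC 1123 subdomains (some resource kinds use stricter rules). The function names are hypothetical.

```python
import re

# RFC 1123 label: lowercase alphanumerics and '-', starting and ending alphanumeric.
_LABEL = r"[a-z0-9](?:[-a-z0-9]*[a-z0-9])?"
LABEL_RE = re.compile(_LABEL)
# RFC 1123 subdomain: one or more labels separated by dots.
SUBDOMAIN_RE = re.compile(rf"{_LABEL}(?:\.{_LABEL})*")


def is_valid_namespace(value: str) -> bool:
    """Namespaces are DNS labels: at most 63 characters, no dots."""
    return len(value) <= 63 and LABEL_RE.fullmatch(value) is not None


def is_valid_resource_name(value: str) -> bool:
    """Most resource names are DNS subdomains: at most 253 characters."""
    return len(value) <= 253 and SUBDOMAIN_RE.fullmatch(value) is not None
```

Using `fullmatch` rather than `match` means inputs with a trailing newline or embedded markdown characters are rejected outright.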

Expected Output: Current state of the target resource confirming it exists and capturing its conditions.

Error Handling:

If MCP server unavailable: follow Human Notification Protocol
If resource not found: ask user to verify name, kind, and namespace
If namespace not found: ask user to confirm namespace

Present to user:

## Incident Triage

**Current OpenShift Context:**
- Cluster: [cluster]
- Namespace: [namespace]

Describe the incident you'd like me to investigate:

1. **Alert-based** — An alert fired (paste the alert name, message, or annotation)
2. **Symptom-based** — Something is broken (describe what you observe)
3. **Proactive** — A predicted issue needs assessment (e.g., capacity forecast, trend alert)
4. **Specify resource** — Investigate a specific resource directly

Select an option or describe the incident:

WAIT for user confirmation before proceeding.

If the incident maps clearly to a single-resource pattern:

## Quick Route Assessment

Based on your description, this appears to be a [category] issue:

| Pattern | Suggested Skill | Confidence |
|---------|----------------|------------|
| SCC admission rejection (FailedCreate + "unable to validate against any security context constraint") | `/debug-scc` | High |
| RBAC 403 Forbidden in pod logs | `/debug-rbac` | High |
| Pod CrashLoopBackOff / OOMKilled / ImagePullBackOff | `/debug-pod` | High |
| Service/Route connectivity failure | `/debug-network` | High |
| Build failure | `/debug-build` | High |

Would you like to:
1. **Route to [skill]** — Use the specialized skill for faster resolution
2. **Continue with full triage** — Proceed with structured investigation (recommended for complex or unclear issues)

Select an option:

WAIT for user confirmation before proceeding.
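
The quick-route table can be approximated with simple substring matching. This is only a sketch: the matching strategy and the `suggest_skill` helper are assumptions, and only the three rows whose trigger strings appear in the table are covered. Network and build failures are omitted because the table does not spell out their signals.

```python
from typing import Optional

# (lowercase substrings indicating the pattern, suggested skill)
ROUTES = [
    (("unable to validate against any security context constraint",), "/debug-scc"),
    (("forbidden",), "/debug-rbac"),
    (("crashloopbackoff", "oomkilled", "imagepullbackoff"), "/debug-pod"),
]


def suggest_skill(description: str) -> Optional[str]:
    """Return a specialized skill for a clear single-resource pattern, else None."""
    text = description.lower()
    for needles, skill in ROUTES:
        if any(needle in text for needle in needles):
            return skill
    return None  # no clear pattern: continue with full triage
```

A `None` result maps to option 2 in the prompt above. The user still chooses, since routing is a suggestion and not an automatic hand-off.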

If proactive mode selected: Note this is a PROACTIVE signal — the incident has NOT yet occurred. Focus on utilization trends, recent changes, and whether the prediction is likely to materialize. "No action needed" is a valid outcome.

Step 2: Hierarchical Investigation

MCP Tool: resources_get (from openshift)

Parameters:

kind: "Deployment" / "ReplicaSet" / "StatefulSet" (trace ownership chain)
name: "<resource-name>"
namespace: "<namespace>"

MCP Tool: pods_list (from openshift)

Parameters:

namespace: "<namespace>"
labelSelector: "<key>=<value>" (from workload .spec.selector.matchLabels)

MCP Tool: pods_log (from openshift)

Parameters:

name: "<pod-name>" (from pods_list, check up to 3 representative pods)
namespace: "<namespace>"
tailLines: 50 (integer, last N lines)

MCP Tool: events_list (from openshift)

Parameters:

namespace: "<namespace>"
Filter by involved object matching the target resource

Expected Output: Full ownership chain state (Deployment -> ReplicaSet -> Pod -> Container), events, and log analysis.

Error Handling:

If permission denied on a resource: report as investigation limitation, do not conflate with incident root cause
If pods not found: workload may be scaled to zero or resource type differs
If logs empty: container may not have started; check container state

Investigation rules:

Trace the ownership chain: For Deployments, inspect Deployment -> ReplicaSet -> Pod -> Container. For StatefulSets, inspect StatefulSet -> Pod -> Container.
Always check describe AND logs: A resource reporting "Running" does not mean it is healthy.
Check both current and previous logs: A pod restart means current logs may not contain relevant pre-restart data.
Pod sampling limit: If the issue affects many pods, check up to 3 representative pods.
Specific answers required: Do not say "the pod is pending" without explaining WHY.
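
The ownership-chain rule can be illustrated by following `metadata.ownerReferences`, the standard Kubernetes field linking a Pod to its ReplicaSet and a ReplicaSet to its Deployment. The sketch below works on plain dictionaries such as those a `resources_get` call might return; the `index` lookup table and the function name are hypothetical.

```python
def ownership_chain(obj: dict, index: dict) -> list:
    """Walk ownerReferences upward and return the chain top-down, e.g.
    ['Deployment/api', 'ReplicaSet/api-7f9', 'Pod/api-7f9-abc']."""
    chain, seen = [], set()
    while obj is not None:
        key = (obj["kind"], obj["metadata"]["name"])
        if key in seen:  # guard against malformed cyclic references
            break
        seen.add(key)
        chain.append(f"{key[0]}/{key[1]}")
        owners = obj["metadata"].get("ownerReferences", [])
        # Prefer the owner flagged as controller, else fall back to the first one.
        owner = next((o for o in owners if o.get("controller")),
                     owners[0] if owners else None)
        # Stop when there is no owner or the owner was not fetched.
        obj = index.get((owner["kind"], owner["name"])) if owner else None
    return list(reversed(chain))
```

Starting from a failing Pod and walking upward yields the same levels as the top-down table in the template, and a chain that stops early signals a resource that still needs to be fetched.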

Present to user:

## Hierarchical Investigation: [resource-name]

**Ownership Chain:**
| Level | Resource | Status | Key Finding |
|-------|----------|--------|-------------|
| Workload | [Deployment/name] | [Available/Degraded] | [condition summary] |
| ReplicaSet | [rs-name] | [Ready/FailedCreate] | [replica count, condition] |
| Pod | [pod-name] | [Running/Pending/Failed] | [phase, ready status] |
| Container | [container-name] | [Running/Waiting/Terminated] | [state, exit code, reason] |

**Events (last 30 minutes):**
| Time | Type | Reason | Object | Message |
|------|------|--------|--------|---------|
| [time] | [Normal/Warning] | [reason] | [resource] | [message] |

**Log Analysis (container: [name]):**
[Key errors or patterns found in logs]

**Initial Hypothesis:**
[Based on resource state, events, and logs — what appears to be happening?]

Continue with evidence collection and metric analysis? (yes/no)

WAIT for user confirmation before proceeding.

Step 3: Evidence Collection and Guardrails

Apply these investigation guardrails before reaching any conclusion:

Exhaustive Verification: Inspect ALL resources mentioned in the signal, error messages, and annotations. Check upstream and downstream dependencies.
Contradicting Evidence Search: After forming a hypothesis, explicitly search for evidence that CONTRADICTS it.
Causal Depth: If the identified cause can itself be explained by a deeper cause, keep investigating.
Evidence-Based Claims Only: Every claim must trace to specific tool output. State unverified claims explicitly.
Investigation Error Separation: Distinguish between "error X caused this problem" and "I encountered errors during investigation." Permission errors are obstacles to YOUR investigation, not necessarily the incident's root cause.

MCP Tool: prometheus_query (from openshift, observability toolset)

Parameters:

query: "{__name__=~".*<pattern>.*"}" (discover available metrics by pattern, e.g., memory, disk, connections)

MCP Tool: prometheus_query (from openshift, observability toolset)

Parameters:

query: "<metric-name>" (confirm metric exists and inspect its current value)

MCP Tool: prometheus_query (from openshift, observability toolset)

Parameters:

query: "<promql-expression>" (use topk(10, ...) to limit cardinality, rate() for counters, scope with {namespace="<target>"})
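
The three query shapes above can be assembled with small helpers. This sketch only builds PromQL strings to pass as the `query` parameter. The helper names and the 5m rate window are assumptions, while `__name__`, `topk`, and `rate` are standard PromQL.

```python
def discovery_query(pattern: str) -> str:
    """Match every series whose metric name contains `pattern`."""
    return f'{{__name__=~".*{pattern}.*"}}'


def scoped_topk(metric: str, namespace: str, k: int = 10,
                counter: bool = False, window: str = "5m") -> str:
    """Limit cardinality with topk and scope the selector to one namespace."""
    selector = f'{metric}{{namespace="{namespace}"}}'
    # Counters only make sense as a per-second rate over a window.
    inner = f"rate({selector}[{window}])" if counter else selector
    return f"topk({k}, {inner})"


print(discovery_query("memory"))
# {__name__=~".*memory.*"}
print(scoped_topk("container_cpu_usage_seconds_total", "production", counter=True))
# topk(10, rate(container_cpu_usage_seconds_total{namespace="production"}[5m]))
```

Scoping by namespace and capping with `topk` keeps responses small, which also addresses the truncation case listed under error handling.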

Expected Output: Guardrail compliance table, metric analysis, and cross-resource findings.

Error Handling:

If observability MCP unavailable: skip metric analysis, note limitation
If Prometheus response truncated: narrow with more specific label selectors or topk()
If permission denied on cluster resources: report gap, do not conflate with root cause

Present to user:

## Evidence Summary

**Guardrail Compliance:**
| Guardrail | Status | Notes |
|-----------|--------|-------|
| Exhaustive Verification | [PASS/GAP] | [what was checked, what was missed] |
| Contradicting Evidence | [PASS/GAP] | [what was searched for] |
| Causal Depth | [PASS/GAP] | [depth reached] |
| Evidence-Based Claims | [PASS/GAP] | [unverified claims, if any] |
| Error Separation | [PASS/N/A] | [investigation errors encountered] |

**Metric Analysis (if applicable):**
| Metric | Current Value | Trend | Threshold | Assessment |
|--------|--------------|-------|-----------|------------|
| [metric-name] | [value] | [rising/stable/falling] | [threshold] | [OK/WARNING/CRITICAL] |

Continue to root cause analysis? (yes/no)

WAIT for user confirmation before proceeding.

Step 4: Root Cause Analysis (Five Whys)

Construct the causal chain from the observed signal to the deepest reachable root cause.

Expected Output: Five Whys chain, remediation target, and signal classification.

Error Handling:

If causal chain is shallow (fewer than 3 levels): note that deeper investigation may be needed
If multiple competing root causes: present each with its relative confidence

Present to user:

## Root Cause Analysis

### Causal Chain (Five Whys)

1. **Signal**: [What was observed — the alert, symptom, or prediction]
2. **Why?** [First-level cause — what directly caused the signal]
3. **Why?** [Second-level cause — what caused the first-level cause]
4. **Why?** [Third-level cause — deeper configuration or state issue]
5. **Root Cause**: [Deepest identifiable cause]

### Remediation Target

| Field | Value |
|-------|-------|
| Kind | [Deployment/StatefulSet/ConfigMap/etc.] |
| Name | <resource-name> |
| Namespace | <namespace> |
| Why this target? | [This is the resource whose configuration change fixes the problem, NOT the resource that reported the symptom] |

### Signal Classification

| Field | Value |
|-------|-------|
| Root cause matches input signal? | [Yes/No — if No, the signal was a symptom] |
| Severity | [critical/high/medium/low] |
| Investigation type | [Reactive RCA / Proactive Prevention] |

Continue to due diligence review? (yes/no)

WAIT for user confirmation before proceeding.
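
One way to hold the causal chain while it is being built is a small record type that pairs each "why" with its evidence and flags shallow chains. This is a sketch under assumptions: the `CausalChain` class is hypothetical, and counting "levels" as the number of whys below the signal is one reading of the "fewer than 3 levels" rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CausalChain:
    signal: str
    # Ordered (cause, evidence) pairs, shallowest cause first.
    whys: List[Tuple[str, str]] = field(default_factory=list)

    def add_why(self, cause: str, evidence: str) -> None:
        """Record a cause together with the tool output that supports it."""
        self.whys.append((cause, evidence))

    @property
    def root_cause(self) -> Optional[str]:
        """The deepest cause recorded so far, or None if none yet."""
        return self.whys[-1][0] if self.whys else None

    def is_shallow(self, min_levels: int = 3) -> bool:
        """True when the chain has fewer than `min_levels` whys."""
        return len(self.whys) < min_levels


# Illustrative values only, loosely following the Example Usage scenario.
chain = CausalChain(signal="Alert DatabaseConnectionPoolExhausted fired")
chain.add_why("Active connections near max_connections", "prometheus_query output")
chain.add_why("A sidecar container leaks connections", "pods_log output")
```

Requiring an evidence string for every level mirrors the Evidence-Based Claims guardrail, and `is_shallow()` gives a cheap prompt to keep asking why.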

Step 5: Adversarial Due Diligence

Before finalizing findings, perform a self-review across 8 dimensions to prevent shallow analysis, targeting errors, and overconfident conclusions.

Expected Output: Due diligence assessment table with confidence score.

Error Handling:

If confidence < 0.7: recommend gathering additional evidence or escalating

Present to user:

## Adversarial Due Diligence Review

| Dimension | Assessment |
|-----------|------------|
| **1. Causal Completeness** | [Full chain traced? Could root cause have a deeper cause?] |
| **2. Target Accuracy** | [Is remediation target the misconfigured resource, not the symptom reporter?] |
| **3. Evidence Sufficiency** | [Every claim backed by tool output? Which claims are assumptions?] |
| **4. Alternative Hypotheses** | [What alternatives were considered and ruled out with evidence?] |
| **5. Scope Completeness** | [All resources investigated? What was NOT examined?] |
| **6. Proportionality** | [Is the fix targeted and specific, or overly broad?] |
| **7. Regression Awareness** | [Has this occurred before? Recent events suggesting recurrence?] |
| **8. Confidence Calibration** | [Start at 1.0, list each reduction factor. Final score: X.XX] |

**Overall Confidence: [0.XX]**

[If confidence < 0.7:]
**WARNING**: Confidence is below 0.7. Consider gathering additional evidence, escalating, or running targeted debug skills.

Proceed to findings summary? (yes/no)

WAIT for user confirmation before proceeding.
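
Dimension 8 describes starting at 1.0 and listing each reduction factor. A minimal sketch of that arithmetic follows. The `calibrate` helper and the example reduction sizes are hypothetical, since the document does not prescribe how large each reduction should be.

```python
from typing import Dict, Tuple


def calibrate(reductions: Dict[str, float], threshold: float = 0.7) -> Tuple[float, bool]:
    """Start at 1.0, subtract each named reduction, and flag low confidence.

    Returns (score rounded to two decimals, True if below the threshold).
    """
    score = 1.0 - sum(reductions.values())
    score = round(max(0.0, min(1.0, score)), 2)  # clamp to [0, 1]
    return score, score < threshold


# Hypothetical reduction factors for illustration.
score, needs_more_evidence = calibrate({
    "one claim not backed by tool output": 0.05,
    "previous container logs unavailable": 0.03,
})
```

Naming every reduction keeps the final score auditable: a reader can see which gaps lowered the confidence and whether the 0.7 warning applies.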

Step 6: Present Findings and Recommend Actions

Synthesize all findings into a structured report with actionable remediation.

Expected Output: Root cause summary, contributing factors, remediation commands, and verification steps.

Error Handling:

If remediation requires destructive actions: ensure HITL confirmation before execution
If multiple fix options exist: present least-privilege option first

Present to user:

## Incident Triage Findings

### Summary

**Root Cause:** [One-sentence root cause description]

**Severity:** [critical/high/medium/low] | **Confidence:** [0.XX]

### Causal Chain

1. [Signal -> first cause]
2. [First cause -> second cause]
3. [Second cause -> root cause]

### Remediation Target

**[Kind]/[name]** in namespace **[namespace]**

### Contributing Factors

- [Factor 1 — specific evidence]
- [Factor 2 — specific evidence]

### Recommended Actions

1. **[Primary fix]**: [description]
   ```bash
   [oc command to apply the fix]
   ```

2. **[Secondary fix or preventive measure]**: [description]
   ```bash
   [oc command]
   ```

### Verification

After applying the fix:

```bash
oc get <resource-type> <name> -n <namespace>
oc get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
oc get pods -n <namespace> -l <app-label>
```

### Related Skills

| For this follow-up... | Use skill |
|-----------------------|-----------|
| Fix SCC violations | `/debug-scc` |
| Restore RBAC bindings | `/debug-rbac` |
| Debug crashing pods | `/debug-pod` |
| Fix network/route issues | `/debug-network` |
| Redeploy after fix | `/deploy` |

### Reference

- Kubernaut demo scenario golden transcripts: validated RCA examples with causal chains and due diligence assessments

Would you like me to:

1. Execute the primary recommended fix
2. Run a specialized debug skill for deeper analysis
3. Investigate a related resource
4. Export findings as a structured report
5. Exit triage

Select an option:

WAIT for user confirmation before proceeding.

## Dependencies

### Required MCP Servers
- `openshift` — Kubernetes/OpenShift resource access for Deployments, Pods, Events, Services, and cluster resources ([setup](docs/prerequisites.md))
- `observability` — Prometheus metric discovery, metadata, series, and PromQL query execution

### Required MCP Tools
- `resources_get` (from openshift) — Retrieve individual resource details
- `resources_list` (from openshift) — List resources by kind in a namespace
- `pods_list` (from openshift) — List pods matching label selectors
- `pods_log` (from openshift) — Retrieve container logs (current and previous)
- `events_list` (from openshift) — Fetch events filtered by involved object
- `prometheus_query` (from openshift, observability toolset) — Execute instant PromQL queries
- `prometheus_query_range` (from openshift, observability toolset) — Execute range PromQL queries over time windows
- `alertmanager_alerts` (from openshift, observability toolset) — Retrieve active Alertmanager alerts

### Related Skills
- `/debug-pod` — Single-pod failure diagnosis (CrashLoopBackOff, OOMKilled, ImagePullBackOff)
- `/debug-scc` — SCC admission violation diagnosis
- `/debug-rbac` — RBAC permission failure diagnosis
- `/debug-network` — Service/Route connectivity diagnosis
- `/debug-build` — Build failure diagnosis
- `/deploy` — Redeployment after fixes

### Reference Documentation
- **Internal:** [docs/debugging-patterns.md](docs/debugging-patterns.md) — Common error patterns and troubleshooting trees
- **Official:** [OpenShift Troubleshooting](https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-operator-issues.html)

## Example Usage

**User**: Alert `DatabaseConnectionPoolExhausted` fired in namespace `production`. Active connections are at 95% of max. What's going on?

**Skill response**: The skill gathers the alert context, traces the ownership chain from the PostgreSQL Deployment through its ReplicaSet and Pods, checks container logs for connection errors, queries Prometheus for `pg_stat_activity_count` trends and `max_connections` settings, applies investigation guardrails, and constructs a Five Whys chain identifying a connection-leaking sidecar as the root cause. It presents the findings with 0.92 confidence, recommending a targeted fix to the leaking container's connection pool configuration.