sre

Stars: 24

Forks: 3

Updated: March 23, 2026 at 02:56

SRE debugging methodology for Kubernetes incident investigation, root cause analysis, and failure diagnosis. Use when: (1) Pods not starting, stuck, or failing (CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending), (2) Debugging Kubernetes errors or investigating "why is my pod...", (3) Service degradation or unavailability, (4) Root cause analysis for any Kubernetes incident, (5) Network policy blocking traffic, (6) Stalled HelmReleases or Flux failures that need troubleshooting. Triggers: "pod not starting", "pod stuck", "CrashLoopBackOff", "ImagePullBackOff", "OOMKilled", "Pending pod", "why is my pod", "kubernetes error", "k8s error", "service not available", "can't reach service", "debug kubernetes", "troubleshoot k8s", "what's wrong with my pod", "deployment not working", "helm install failed", "flux not reconciling", "root cause", "5 whys", "incident", "network policy blocking", "hubble dropped", "stalled helmrelease", "live not updating", "promotion pipeline stuck", "artifact not promoted"


Source: ionfury/homelab (GitHub repository)


SKILL.md


name	sre
description	SRE debugging methodology for Kubernetes incident investigation, root cause analysis, and failure diagnosis. Use when: (1) Pods not starting, stuck, or failing (CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending), (2) Debugging Kubernetes errors or investigating "why is my pod...", (3) Service degradation or unavailability, (4) Root cause analysis for any Kubernetes incident, (5) Network policy blocking traffic, (6) Stalled HelmReleases or Flux failures that need troubleshooting. Triggers: "pod not starting", "pod stuck", "CrashLoopBackOff", "ImagePullBackOff", "OOMKilled", "Pending pod", "why is my pod", "kubernetes error", "k8s error", "service not available", "can't reach service", "debug kubernetes", "troubleshoot k8s", "what's wrong with my pod", "deployment not working", "helm install failed", "flux not reconciling", "root cause", "5 whys", "incident", "network policy blocking", "hubble dropped", "stalled helmrelease", "live not updating", "promotion pipeline stuck", "artifact not promoted"
user-invocable	false

Cluster access (--context patterns) and internal service URLs are in the k8s skill.

Debugging Kubernetes Incidents

Core Principles

5 Whys Analysis — NEVER stop at symptoms. Ask "why" until you reach the root cause.
Multi-Source Correlation — Combine logs, events, metrics for a complete picture.
Zero Alert Tolerance — Every firing alert must be addressed: fix the root cause, or as a last resort, create a declarative Silence CR with justification. Never ignore or defer.

The 5 Whys Analysis (CRITICAL)

Apply 5 Whys before concluding any investigation. Stopping at symptoms leads to ineffective fixes.

Example:

Symptom: Helm install failed with "context deadline exceeded"

Why #1: Pods never became Ready
Why #2: Pods stuck in Pending state
Why #3: PVCs couldn't bind (StorageClass "fast" not found)
Why #4: longhorn-storage Kustomization failed to apply
Why #5: numberOfReplicas was integer instead of string

ROOT CAUSE: YAML type coercion issue
FIX: Use properly typed variable for StorageClass parameters
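The Why #5 failure in this example can be caught before it reaches the cluster. Kubernetes defines StorageClass parameters as a map of strings, so a bare integer value is rejected at apply time. The following is a minimal sketch and not part of the repository: find_unquoted_numbers is a hypothetical helper that reads a rendered manifest on stdin and prints every entry under a top-level parameters: key whose value is a bare integer. It assumes block-style YAML indented with spaces and does not handle flow mappings.

```shell
#!/bin/sh
# Hypothetical helper (not from the repository): flag StorageClass parameter
# values that YAML would parse as integers instead of strings.
find_unquoted_numbers() {
  awk '
    /^parameters:/ { in_params = 1; next }   # entering the parameters block
    /^[^ #]/       { in_params = 0 }         # any other top-level line ends it
    in_params && /^ +[^:#]+: *[0-9]+ *$/ {   # indented "key: 123" with no quotes
      sub(/^ +/, "")
      print
    }
  '
}
```

Feed it any rendered manifest on stdin, for example find_unquoted_numbers < storageclass.yaml (the file name is a placeholder). Each printed line is a value that needs quoting.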

See investigation-guide.md for red flags that you haven't reached root cause.

Investigation Phases

Phase 1 — Triage: Confirm cluster (ask user: dev/integration/live) → assess severity (P1 down / P2 degraded / P3 minor) → identify scope.

Phase 2 — Data Collection: Use scripts/cluster-health.sh [namespace] for a quick snapshot. For targeted collection:

# Pod status and events
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>

# Logs (current and previous)
kubectl logs <pod> -n <namespace> --tail=100
kubectl logs <pod> -n <namespace> --previous

# Events timeline
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top pods -n <namespace>
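When the same pod has to be inspected more than once, it can help to save the outputs above to files so later phases work from a fixed snapshot. This is a rough sketch under stated assumptions, not the repository's scripts/cluster-health.sh: collect_pod_evidence, its argument order, and the output file names are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: save the targeted-collection outputs for one pod.
# Arguments: <context> <namespace> <pod> [output-dir]
collect_pod_evidence() {
  local ctx="$1" ns="$2" pod="$3" out="${4:-./evidence}"
  mkdir -p "$out"
  kubectl --context "$ctx" get pods -n "$ns" -o wide > "$out/pods.txt"
  kubectl --context "$ctx" describe pod "$pod" -n "$ns" > "$out/describe.txt"
  kubectl --context "$ctx" logs "$pod" -n "$ns" --tail=100 > "$out/logs-current.txt"
  # --previous fails if the container never restarted; record the error and continue.
  kubectl --context "$ctx" logs "$pod" -n "$ns" --previous > "$out/logs-previous.txt" 2>&1 || true
  kubectl --context "$ctx" get events -n "$ns" --sort-by='.lastTimestamp' > "$out/events.txt"
  # kubectl top needs a metrics API; tolerate its absence.
  kubectl --context "$ctx" top pods -n "$ns" > "$out/top.txt" 2>&1 || true
}
```

All commands here are read-only, so the helper stays within the recommendations-only rule for integration and live.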

Metrics and alerts (Prometheus is behind OAuth2 Proxy — DNS URLs won't work for API queries):

# Check firing alerts
kubectl --context <cluster> exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/alerts' | jq '.data.alerts[] | select(.state == "firing")'

Phase 3 — Correlation: Extract timestamps from logs, events, metrics → identify what happened FIRST → trace cascade.
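One way to find what happened first is to merge the timestamped lines from each source into a single sorted timeline. The sketch below is illustrative only: merge_timeline is a hypothetical helper, and it assumes every input line starts with an RFC 3339 UTC timestamp of the same precision (mixing second and nanosecond precision can misorder lines within the same second).

```shell
#!/bin/sh
# Hypothetical helper: merge timestamped lines from several sources.
# Each argument has the form label=file.
merge_timeline() {
  local pair label file
  for pair in "$@"; do
    label="${pair%%=*}"   # text before the first "="
    file="${pair#*=}"     # text after the first "="
    # Re-emit each line as "<timestamp> [label] <rest>".
    awk -v src="$label" '{ ts = $1; $1 = ""; print ts " [" src "]" $0 }' "$file"
  done | LC_ALL=C sort -s -k1,1   # byte-wise sort on the timestamp column only
}
```

Inputs can come from kubectl logs <pod> -n <namespace> --timestamps and from kubectl get events -n <namespace> --no-headers -o custom-columns=TS:.lastTimestamp,REASON:.reason,MSG:.message. The first line of the merged output is the candidate starting point for the cascade.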

Phase 4 — Root Cause: Apply 5 Whys. Validate: temporal (before symptom?), causal (logically explains it?), evidence (supporting data?), complete (asked "why" enough times?).

Phase 5 — Remediation: Use AskUserQuestion when multiple valid approaches exist. Provide recommendations only (read-only on integration/live):

Immediate: rollback, scale, restart
Permanent: code/config fixes
Prevention: alerts, quotas, tests

For symptom → first check → common cause mapping, see investigation-guide.md.

Network Policy Debugging (Cilium + Hubble)

All traffic is implicitly denied. Missing labels are the most common cause of blocked traffic.

# Setup Hubble access (run once per session)
kubectl --context <cluster> port-forward -n kube-system svc/hubble-relay 4245:80 &

# See dropped traffic in a namespace
hubble observe --verdict DROPPED --namespace <namespace> --since 5m

# Check specific traffic flow
hubble observe --from-namespace <source> --to-namespace <dest> --since 5m
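The backgrounded port-forward above keeps running until the shell exits. As a possible alternative, the sketch below wraps a single command in a port-forward that is cleaned up afterwards. with_hubble_relay is a hypothetical name, and the fixed two-second wait is a crude assumption about how long the tunnel takes to come up.

```shell
#!/bin/sh
# Hypothetical wrapper: open the Hubble relay tunnel, run one command, close it.
# Arguments: <context> <command> [args...]
with_hubble_relay() {
  local ctx="$1" pf_pid rc
  shift
  kubectl --context "$ctx" port-forward -n kube-system svc/hubble-relay 4245:80 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2                      # assumption: the tunnel is ready within 2 seconds
  "$@"
  rc=$?
  kill "$pf_pid" 2>/dev/null || true   # the forward may already have exited
  wait "$pf_pid" 2>/dev/null || true
  return "$rc"
}
```

Example: with_hubble_relay <cluster> hubble observe --verdict DROPPED --namespace <namespace> --since 5m.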

Check namespace labels:

kubectl --context <cluster> get ns <namespace> -o jsonpath='{.metadata.labels}' | jq
# Required: network-policy.homelab/profile: standard|internal|internal-egress|isolated
# Optional: access.network-policy.homelab/postgres|garage-s3|kube-api: "true"
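Since missing labels are the most common cause of blocked traffic, the required label can be checked directly rather than read out of the full label map. This is a minimal sketch: check_profile_label is a hypothetical helper, and the accepted values are the four profiles listed in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: verify the required network-policy profile label.
# Arguments: <context> <namespace>
check_profile_label() {
  local ctx="$1" ns="$2" profile
  # Dots inside a label key must be escaped in kubectl jsonpath.
  profile=$(kubectl --context "$ctx" get ns "$ns" \
    -o jsonpath='{.metadata.labels.network-policy\.homelab/profile}')
  case "$profile" in
    standard|internal|internal-egress|isolated)
      echo "ok: $ns uses profile $profile" ;;
    "")
      echo "missing: $ns has no network-policy.homelab/profile label"
      return 1 ;;
    *)
      echo "invalid: $ns has unknown profile $profile"
      return 1 ;;
  esac
}
```

A non-zero exit status means the namespace label is the first thing to fix before looking at individual flows.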

Emergency escape hatch (triggers alert after 5m):

kubectl --context <cluster> label namespace <ns> network-policy.homelab/enforcement=disabled
# Re-enable after fixing:
kubectl --context <cluster> label namespace <ns> network-policy.homelab/enforcement-

See docs/runbooks/network-policy-escape-hatch.md for full procedure.

Kickstarting Stalled HelmReleases

HelmReleases can get stuck in Stalled/RetriesExceeded even after the underlying issue is resolved. Suspend and resume to reset the failure counter:

flux --context <cluster> suspend helmrelease <name> -n flux-system
flux --context <cluster> resume helmrelease <name> -n flux-system

Common self-healed causes: missing Secret/ConfigMap (ExternalSecret eventually created it), missing CRD, transient image pull failure, temporary resource quota exceeded. Ensure proper dependsOn ordering to prevent recurrence.
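The suspend and resume pair can be chained so that resume only runs if suspend succeeded, followed by a status check. This is a sketch only: kickstart_helmrelease is a hypothetical name, and the flux-system default namespace simply mirrors the commands above.

```shell
#!/bin/sh
# Hypothetical helper: reset a stalled HelmRelease and show its status.
# Arguments: <context> <name> [namespace]
kickstart_helmrelease() {
  local ctx="$1" name="$2" ns="${3:-flux-system}"
  flux --context "$ctx" suspend helmrelease "$name" -n "$ns" || return 1
  flux --context "$ctx" resume helmrelease "$name" -n "$ns" || return 1
  # Show the resulting Ready condition so a repeated failure is visible immediately.
  flux --context "$ctx" get helmreleases "$name" -n "$ns"
}
```

If the release stalls again right after the resume, the underlying issue has not actually healed and the investigation should continue.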

Promotion Pipeline Debugging

Symptom: "Live cluster not updating after merge"

Walk through each stage in order — see investigation-guide.md for the failure mode table. Quick diagnostic flow:

1. PR merged → did build-platform-artifact.yaml trigger?
   └─ If not: was kubernetes/ modified? (paths filter)

2. OCI artifact in GHCR?
   └─ flux list artifact oci://ghcr.io/<repo>/platform | grep integration

3. Integration OCIRepository seeing new version?
   └─ kubectl --context integration get ocirepository -n flux-system
   └─ Semver constraint must be ">= 0.0.0-0" to accept RCs

4. Integration Kustomization healthy?
   └─ flux --context integration get kustomizations -n flux-system

5. Flux Alert fired repository_dispatch?
   └─ kubectl --context integration describe alert validation-success -n flux-system

6. tag-validated-artifact.yaml ran?
   └─ GitHub Actions → "Tag Validated Artifact" workflow

7. Live OCIRepository seeing stable semver?
   └─ kubectl --context live get ocirepository -n flux-system
   └─ Semver constraint must be ">= 0.0.0" (stable only, no RCs)
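The two semver constraints in steps 3 and 7 differ only in how they treat prerelease tags: a range without a prerelease component skips prerelease versions, while ">= 0.0.0-0" admits them. The sketch below mirrors just that distinction and is not a full range evaluator. is_prerelease and eligible_clusters are hypothetical helpers, and the tags in the usage note are made-up examples.

```shell
#!/bin/sh
# Hypothetical helper: a version is a prerelease if it carries a hyphen
# suffix once any "+build" metadata has been stripped.
is_prerelease() {
  local v="${1%%+*}"
  case "$v" in
    *-*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Which clusters would consider this tag under the constraints listed above?
eligible_clusters() {
  if is_prerelease "$1"; then
    echo "integration"        # only ">= 0.0.0-0" accepts RCs
  else
    echo "integration live"   # stable tags satisfy both constraints
  fi
}
```

For example, eligible_clusters 1.4.0-rc.2 reports integration only, which is the expected state until the "Tag Validated Artifact" workflow has produced a stable tag.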

See .github/CLAUDE.md for full pipeline architecture and rollback procedures.

Keywords

kubernetes, debugging, crashloopbackoff, oomkilled, pending, root cause analysis, 5 whys, incident investigation, pod logs, events, troubleshooting, network policy, hubble, stalled helmrelease, promotion pipeline, live not updating

