kubernetes-debug
// Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures.
| name | kubernetes-debug |
| description | Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures. |
ALWAYS start by discovering clusters via the gateway. Do NOT use kubectl directly — this sandbox has no direct k8s API access. All k8s queries go through the k8s-gateway.
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py
python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py --cluster-id <CLUSTER_ID>
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id <CLUSTER_ID>
NEVER run kubectl directly. NEVER run scripts without --cluster-id. If list_clusters.py returns no clusters, tell the user they need to install the k8s-agent on their cluster first.
Gateway-capable scripts: list_pods, get_events, get_logs, describe_pod, describe_deployment, list_namespaces. Direct-only scripts (not available in SaaS): describe_node, get_resources.
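The gateway rules above can be encoded as a small guard. This is a minimal sketch and not part of the skill: `build_gateway_command` is a hypothetical helper, and the two script sets are copied from the lists above. It only builds the argument list and does not run anything.

```python
from typing import List, Optional

SCRIPTS_DIR = ".claude/skills/infrastructure-kubernetes/scripts"

# Script names taken from the gateway-capable and direct-only lists above.
GATEWAY_SCRIPTS = {
    "list_pods", "get_events", "get_logs",
    "describe_pod", "describe_deployment", "list_namespaces",
}
DIRECT_ONLY_SCRIPTS = {"describe_node", "get_resources"}


def build_gateway_command(script: str, args: List[str], cluster_id: Optional[str]) -> List[str]:
    """Return the argv for a gateway-capable script, or raise if a rule is broken.

    Hypothetical helper for illustration; the skill itself does not ship it.
    """
    if script in DIRECT_ONLY_SCRIPTS:
        raise ValueError(f"{script} is direct-only and not available through the gateway")
    if script not in GATEWAY_SCRIPTS:
        raise ValueError(f"unknown script: {script}")
    if not cluster_id:
        raise ValueError("--cluster-id is required; run list_clusters.py first")
    return ["python", f"{SCRIPTS_DIR}/{script}.py", *args, "--cluster-id", cluster_id]


if __name__ == "__main__":
    print(" ".join(build_gateway_command("list_pods", ["-n", "production"], "abc123")))
```

Rejecting the call before anything is executed keeps a missing `--cluster-id` from turning into a confusing gateway error later.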
ALWAYS check pod events BEFORE logs. Events explain roughly 80% of issues faster than logs do.
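As a rough illustration of the events-first order, the sketch below builds the commands for a single pod so that `get_events.py` always comes before `get_logs.py`. The `triage_plan` function is hypothetical and the pod name, namespace, and cluster ID are placeholder values; only the script names and flags come from this document.

```python
from typing import List

SCRIPTS_DIR = ".claude/skills/infrastructure-kubernetes/scripts"


def triage_plan(pod: str, namespace: str, cluster_id: str, tail: int = 100) -> List[List[str]]:
    """Commands for one pod, ordered so events are read before logs.

    Hypothetical helper: it only builds argument lists, it does not run them.
    """
    common = ["-n", namespace, "--cluster-id", cluster_id]

    def cmd(script: str, *extra: str) -> List[str]:
        return ["python", f"{SCRIPTS_DIR}/{script}", pod, *common, *extra]

    return [
        cmd("get_events.py"),                     # first: scheduling, pull, and crash events
        cmd("describe_pod.py"),                   # then: conditions and container states
        cmd("get_logs.py", "--tail", str(tail)),  # last: only if events do not explain it
    ]


if __name__ == "__main__":
    for step in triage_plan("payment-7f8b9c6d5-x2k4m", "otel-demo", "abc123"):
        print(" ".join(step))
```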
All scripts are in .claude/skills/infrastructure-kubernetes/scripts/
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py --json
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>] [--cluster-id <id>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id abc123
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace> [--cluster-id <id>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n production --cluster-id abc123
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME] [--cluster-id <id>]
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment
python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace> [--cluster-id <id>]
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace> [--cluster-id <id>]
# Example:
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py payment -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py [--cluster-id <id>]
python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py <node-name>
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all
# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py ip-10-0-1-42.ec2.internal
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all --json
1. list_pods.py - Check pod status
2. get_events.py - Look for scheduling/pull/crash events
3. describe_pod.py - Check conditions and container states
4. get_logs.py - Only if events don't explain

1. get_events.py - Check for OOMKilled or error events
2. get_resources.py - Compare usage vs limits
3. get_logs.py - Check for errors before crash
4. describe_pod.py - Check restart count and state

1. describe_deployment.py - Check replica counts and rollout history
2. list_pods.py - Find stuck pods
3. get_events.py - Check events on stuck pods

1. describe_node.py --all - Check all nodes for conditions and resource usage
2. describe_node.py <node> - Deep dive into specific node
3. list_pods.py - Check if pods are Pending/FailedScheduling
4. get_events.py - Look for FailedScheduling with resource reasons

| Event Reason | Meaning | Action |
|---|---|---|
| OOMKilled | Container exceeded memory limit | Increase limits or fix memory leak |
| ImagePullBackOff | Can't pull image | Check image name, registry auth |
| CrashLoopBackOff | Container keeps crashing | Check logs for startup errors |
| FailedScheduling | No node can run pod | Check node resources, taints |
| Unhealthy | Liveness probe failed | Check probe config, app health |
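The table above can also be kept as a lookup so that event output is matched against known reasons consistently. This is a sketch under assumptions: `match_reasons` is a hypothetical helper, the sample event line is made up, and the real output format of `get_events.py` may differ. Matching is a plain, case-sensitive substring check.

```python
from typing import Dict, List, Tuple

# Reason -> (meaning, action), copied from the table above.
EVENT_REASONS: Dict[str, Tuple[str, str]] = {
    "OOMKilled": ("Container exceeded memory limit", "Increase limits or fix memory leak"),
    "ImagePullBackOff": ("Can't pull image", "Check image name, registry auth"),
    "CrashLoopBackOff": ("Container keeps crashing", "Check logs for startup errors"),
    "FailedScheduling": ("No node can run pod", "Check node resources, taints"),
    "Unhealthy": ("Liveness probe failed", "Check probe config, app health"),
}


def match_reasons(event_text: str) -> List[str]:
    """Return the known reasons that appear in the event text, in table order."""
    return [reason for reason in EVENT_REASONS if reason in event_text]


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    sample = "[10:42:07] OOMKilled: container payment exceeded its memory limit"
    for reason in match_reasons(sample):
        meaning, action = EVENT_REASONS[reason]
        print(f"{reason}: {meaning} -> {action}")
```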
When reporting findings, use this structure:
## Kubernetes Analysis
**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)
### Events
- [timestamp] <reason>: <message>
### Issues Found
1. [Issue description with evidence]
### Root Cause Hypothesis
[Based on events and logs]
### Recommended Action
[Specific remediation step]
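To keep reports uniform, the structure above can be filled in programmatically. The following is a minimal sketch, assuming findings have already been collected; `render_report` and all sample values are hypothetical and not part of the skill.

```python
from typing import List, Tuple


def render_report(
    pod: str,
    namespace: str,
    phase: str,
    restarts: int,
    events: List[Tuple[str, str, str]],  # (timestamp, reason, message)
    issues: List[str],
    hypothesis: str,
    action: str,
) -> str:
    """Fill in the report structure shown above and return it as one string."""
    lines = [
        "## Kubernetes Analysis",
        f"**Pod**: {pod}",
        f"**Namespace**: {namespace}",
        f"**Status**: {phase} (Restarts: {restarts})",
        "### Events",
    ]
    lines += [f"- [{ts}] {reason}: {message}" for ts, reason, message in events]
    lines.append("### Issues Found")
    lines += [f"{i}. {issue}" for i, issue in enumerate(issues, start=1)]
    lines += ["### Root Cause Hypothesis", hypothesis, "### Recommended Action", action]
    return "\n".join(lines)


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    print(render_report(
        pod="payment-7f8b9c6d5-x2k4m",
        namespace="otel-demo",
        phase="Running",
        restarts=7,
        events=[("10:42:07", "OOMKilled", "container exceeded its memory limit")],
        issues=["Container restarted after OOMKilled events"],
        hypothesis="Memory limit is too low for the workload, or the app leaks memory.",
        action="Compare usage against limits, then raise the limit or fix the leak.",
    ))
```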