| name | pod-troubleshooting |
| type | skill |
| description | Systematic diagnosis of Kubernetes pod failures — CrashLoopBackOff, OOMKilled, Pending, ImagePullBackOff, and service connectivity issues. Use when the user encounters pods not starting, container restart loops, scheduling failures, or service unreachability in a K8s cluster. |
| related-rules | ["resource-governance.md","workload-security.md"] |
| allowed-tools | Read, Bash |
Skill: Pod Troubleshooting
Expertise: Systematic K8s failure diagnosis — from symptom to root cause in under 10 commands.
When to load
When a pod is not Running, a service is unreachable, or a deployment is stuck.
Diagnostic Decision Tree
Pod not Running?
├── Status: Pending
│ ├── No nodes match → check node selectors, taints, resource requests
│ └── PVC not bound → check StorageClass, PV availability
├── Status: CrashLoopBackOff
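│ ├── Exit code 0 → main process finished and exited cleanly, so the kubelet restarts it → check command/args (the container needs a long-running foreground process)
│ ├── Exit code 1 → application error → check logs (use --previous for the crashed instance)
│ ├── Exit code 137 → SIGKILL (128 + 9), usually OOMKilled → confirm "Reason: OOMKilled" under Last State, then raise the memory limit or fix the leak
│ └── Exit code 143 → SIGTERM (128 + 15), the container was asked to stop → check for liveness probe failures and fix graceful shutdown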
├── Status: ImagePullBackOff
│ ├── Image doesn't exist → check tag/digest
│ └── Registry auth fails → check imagePullSecret
└── Status: Error / Init:Error
  └── Init container failed → check init container logs
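The exit-code branches of the tree can be sketched as a small shell helper. This is an illustrative sketch only: the function name `explain_exit_code` and its hint strings are hypothetical, and it relies on the usual convention that an exit code above 128 means the process was killed by signal (code - 128).

```shell
# Hypothetical helper: map a container exit code to a first diagnostic hint.
# Assumes the convention exit code = 128 + signal number for signal deaths.
explain_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ] 2>/dev/null; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "SIGKILL (9): check Last State for OOMKilled" ;;
      15) echo "SIGTERM (15): check liveness probes and shutdown handling" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ] 2>/dev/null; then
    echo "clean exit: check command/args"
  else
    echo "application error: check logs"
  fi
}
```

For example, `explain_exit_code 137` prints the SIGKILL hint, which should still be confirmed against the pod's Last State before changing any limits.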
Command Cheatsheet
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -c <container>
kubectl exec -it <pod> -n <ns> -- /bin/sh
kubectl top nodes
kubectl top pods -n <ns>
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
kubectl debug -it <pod> -n <ns> --image=busybox:latest --target=<container>
kubectl debug node/<node-name> -it --image=ubuntu
CrashLoopBackOff Runbook
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State:"
kubectl logs <pod> -n <ns> --previous --tail=100
kubectl describe pod <pod> -n <ns> | grep -i "OOMKilled\|Reason:"
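As an alternative to grepping `describe` output, the same fields can be read directly from the pod status with a JSONPath query. The sketch below assumes a hypothetical wrapper name, `last_termination`; the fields it reads (`restartCount`, `lastState.terminated.exitCode`, `lastState.terminated.reason`) are part of the standard container status.

```shell
# Hypothetical helper: print one line per container with
# name, restart count, and the exit code / reason of the previous instance.
last_termination() {
  ns="$1"; pod="$2"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}
```

Usage: `last_termination <ns> <pod>`. The exit code and reason columns are empty for containers that have never been restarted.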
Pending Pod Runbook
kubectl describe pod <pod> -n <ns> | grep -A20 "Events:"
kubectl describe nodes | grep -A5 "Allocated resources:"
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
kubectl get nodes --show-labels
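The scheduler's FailedScheduling message usually names the cause directly, so it can be mapped back to the Pending branches of the decision tree. The following is a rough sketch: the function name `classify_scheduling_message` and the hint strings are hypothetical, and the matched substrings are based on typical scheduler wording, which may differ between Kubernetes versions.

```shell
# Hypothetical helper: read a FailedScheduling message on stdin and print
# one hint per cause it mentions. Substring matches are an assumption about
# the scheduler's wording, not a stable API.
classify_scheduling_message() {
  msg="$(cat)"
  case "$msg" in *Insufficient*) echo "resources: compare pod requests with node allocatable" ;; esac
  case "$msg" in *taint*) echo "taints: add a toleration or pick another node pool" ;; esac
  case "$msg" in *affinity*|*selector*) echo "placement: fix nodeSelector or affinity labels" ;; esac
  case "$msg" in *PersistentVolumeClaim*) echo "storage: check PVC, StorageClass, and PV availability" ;; esac
  return 0
}
```

One way to feed it is `kubectl get events -n <ns> --field-selector involvedObject.name=<pod>,reason=FailedScheduling -o jsonpath='{.items[*].message}' | classify_scheduling_message`.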
Service Connectivity Runbook
kubectl get endpoints <svc> -n <ns>
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup <svc>.<ns>.svc.cluster.local
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl -v http://<svc>.<ns>.svc.cluster.local:<port>/health
kubectl exec -n kube-system <cilium-pod> -- hubble observe --from-pod <ns>/<src-pod>
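When the endpoints list comes back empty, a common cause is a Service selector that matches no Ready pods. The sketch below prints both sides for a manual comparison; the function name `service_selector_vs_pods` is hypothetical, and it does no matching itself.

```shell
# Hypothetical helper: show the Service selector next to the labels of the
# pods in the same namespace so a mismatch can be spotted by eye.
service_selector_vs_pods() {
  ns="$1"; svc="$2"
  echo "selector:"
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}{"\n"}'
  echo "pod labels:"
  kubectl get pods -n "$ns" --show-labels
}
```

Usage: `service_selector_vs_pods <ns> <svc>`. Pods whose labels match but whose READY column is not fully ready are also left out of the endpoints, so check readiness probes in that case.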
OOMKilled Prevention
kubectl top pods -n <ns> --sort-by=memory
kubectl describe vpa <name> -n <ns> | grep -A10 "Recommendation:"
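To judge whether a limit is too tight, it helps to see the configured memory requests and limits next to live usage. A minimal sketch, assuming a hypothetical wrapper name `memory_settings`; empty columns mean no request or limit is set for that container.

```shell
# Hypothetical helper: print container name, memory request, and memory limit
# for every container in a pod, one per line.
memory_settings() {
  ns="$1"; pod="$2"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}
```

Compare the output of `memory_settings <ns> <pod>` with `kubectl top pod <pod> -n <ns> --containers`; usage that sits close to the limit suggests the next OOM kill is a matter of time.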