| name | kubernetes-admin |
| description | Kubernetes administration and troubleshooting — covers pod debugging (CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending), node issues, CNI/networking, CoreDNS, PVC/storage, HPA/VPA autoscaling, and EKS-specific patterns. Includes decision trees for common failure modes. |
Kubernetes Admin Skill
Quick Decision Trees
Pod Not Running
kubectl get pod POD -o wide -- check STATUS column
- Pending:
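kubectl describe pod POD -> check Events section
- "Insufficient cpu/memory" -> no node has enough unreserved capacity. The scheduler compares pod requests against node allocatable, not live usage, so check what is already requested on each node (kubectl top nodes shows live usage only):
kubectl describe node NODE | grep -A 8 "Allocated resources"
- "no nodes available" -> check node taints/tolerations, affinity rules
- "Unschedulable" -> check for cordoned nodes (STATUS shows SchedulingDisabled):
kubectl get nodes
- PVC pending -> check the claim's STATUS and storage class:
kubectl get pvc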
- CrashLoopBackOff:
kubectl logs POD -c CONTAINER --previous -- check last crash logs
- Common: OOMKilled, config errors, missing dependencies, health check failures
kubectl describe pod POD -> check Exit Code (137=SIGKILL, usually OOMKilled; 1=app error; 127=binary not found)
- ImagePullBackOff:
- Check image name/tag:
kubectl get pod POD -o jsonpath='{.spec.containers[*].image}'
- Check pull secret:
kubectl get pod POD -o jsonpath='{.spec.imagePullSecrets}'
- ECR token expiry: authorization tokens are valid for 12 hours, so a pull secret built from one must be refreshed before it expires
- Running but not Ready:
- Check readiness probe:
kubectl get pod POD -o jsonpath='{.spec.containers[*].readinessProbe}'
- Exec into pod and call the health endpoint (requires curl in the container image):
kubectl exec -it POD -- curl localhost:PORT/health
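The exit codes in the CrashLoopBackOff branch follow the usual shell convention (128 + signal number for signal deaths). A minimal sketch of that mapping, assuming a hypothetical helper name `explain_exit_code`; the descriptions are heuristics, not a definitive diagnosis:

```shell
#!/bin/sh
# Hypothetical helper: map a container exit code to a likely cause.
# Heuristic only; always confirm with .lastState.terminated.reason.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "completed successfully" ;;
    1)   echo "application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command or binary not found" ;;
    137) echo "SIGKILL (128+9): often OOMKilled, confirm via terminated.reason" ;;
    143) echo "SIGTERM (128+15): termination requested" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

# Usage (assumes the first container in the pod):
#   explain_exit_code "$(kubectl get pod POD -n NAMESPACE \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```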
OOMKilled Remediation (Idempotent)
When a pod is OOMKilled (Exit Code 137), follow this sequence to safely adjust memory limits without duplicate patching:
- Confirm OOMKilled:
kubectl get pod POD -n NAMESPACE -o jsonpath='{range .status.containerStatuses[*]}{.name}{": reason="}{.lastState.terminated.reason}{", exitCode="}{.lastState.terminated.exitCode}{"\n"}{end}'
- Read current resource requests and limits BEFORE making changes:
kubectl get pod POD -n NAMESPACE -o jsonpath='{range .spec.containers[*]}{.name}{": requests.memory="}{.resources.requests.memory}{", limits.memory="}{.resources.limits.memory}{"\n"}{end}'
kubectl get deployment DEPLOY -n NAMESPACE -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": requests.memory="}{.resources.requests.memory}{", limits.memory="}{.resources.limits.memory}{"\n"}{end}'
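- Idempotency check: compare the deployment values read in the previous step with NEW_LIMIT and NEW_REQUEST. If both already match, skip the patch and report "no change needed"; otherwise continue.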
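- Apply the memory adjustment (only if the idempotency check shows a change is needed). Note that the replace op fails if the container has no existing memory limit or request (kubectl set resources handles that case), and that the path targets the first container (index 0):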
kubectl patch deployment DEPLOY -n NAMESPACE --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "NEW_LIMIT"},
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "NEW_REQUEST"}
]'
- Verify the patch was applied correctly:
kubectl get deployment DEPLOY -n NAMESPACE -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": requests.memory="}{.resources.requests.memory}{", limits.memory="}{.resources.limits.memory}{"\n"}{end}'
- Monitor rollout:
kubectl rollout status deployment DEPLOY -n NAMESPACE --timeout=120s
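The idempotency check in the sequence above could be scripted roughly as follows. This is a sketch under assumptions: the function names (`current_memory`, `needs_patch`, `patch_memory_once`) are hypothetical, only the first container (index 0) is handled, and values are compared as plain strings, so `512Mi` and `0.5Gi` would count as different.

```shell
#!/bin/sh
# Hypothetical helpers: patch memory only when the deployment is not already at target.

# Read one memory field ("requests" or "limits") of container 0 from the deployment.
current_memory() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath="{.spec.template.spec.containers[0].resources.$3.memory}"
}

# Args: current_limit current_request new_limit new_request.
# Exit 0 when a patch is needed, exit 1 when both values already match.
needs_patch() {
  [ "$1" != "$3" ] || [ "$2" != "$4" ]
}

# Args: deployment namespace new_limit new_request.
patch_memory_once() {
  deploy="$1"; ns="$2"; new_limit="$3"; new_request="$4"
  cur_limit="$(current_memory "$deploy" "$ns" limits)"
  cur_request="$(current_memory "$deploy" "$ns" requests)"
  if ! needs_patch "$cur_limit" "$cur_request" "$new_limit" "$new_request"; then
    echo "no change needed"
    return 0
  fi
  base="/spec/template/spec/containers/0/resources"
  patch="$(printf '[{"op":"replace","path":"%s/limits/memory","value":"%s"},{"op":"replace","path":"%s/requests/memory","value":"%s"}]' \
    "$base" "$new_limit" "$base" "$new_request")"
  kubectl patch deployment "$deploy" -n "$ns" --type=json -p "$patch"
}

# Usage: patch_memory_once DEPLOY NAMESPACE 768Mi 512Mi
```

Running the function twice with the same targets should patch once and then print "no change needed", which avoids a second rollout.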
Sizing guidance for OOMKilled:
- Set the new memory limit 25–50% above peak observed usage
- Check current memory usage (kubectl top reports a point-in-time sample, not history, and requires metrics-server):
kubectl top pod POD -n NAMESPACE --containers
- Compare usage to limits to right-size rather than blindly doubling; for peaks over time, consult your monitoring system
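As a worked example of the headroom rule, the arithmetic can be done in the shell. A minimal sketch, assuming usage is expressed in whole MiB and using a hypothetical helper name `suggest_limit_mi`; the result is rounded up:

```shell
#!/bin/sh
# Hypothetical helper: peak usage (MiB) plus a headroom percentage, rounded up.
# Args: peak_mi [headroom_pct], headroom defaults to 25.
suggest_limit_mi() {
  peak_mi="$1"
  headroom_pct="${2:-25}"
  # Integer ceiling of peak * (100 + headroom) / 100.
  echo $(( (peak_mi * (100 + headroom_pct) + 99) / 100 ))
}

# Example: a container peaking at 400Mi with 25% headroom.
#   echo "$(suggest_limit_mi 400 25)Mi"   # prints 500Mi
```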
Node Issues
kubectl get nodes -- check STATUS (Ready/NotReady)
kubectl describe node NODE -- check Conditions section
- MemoryPressure, DiskPressure, PIDPressure -> resource exhaustion
- NetworkUnavailable -> CNI issue
kubectl top node NODE -- current CPU/memory usage
- Node allocatable resources (capacity minus system reservations):
kubectl get node NODE -o json | jq '.status.allocatable'
- Pods on node:
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE
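The STATUS check above can be filtered so that only problem nodes are printed. A small sketch, assuming the default `kubectl get nodes` column order (NAME, STATUS, ...); the function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers that filter the default `kubectl get nodes` table.

# Nodes whose STATUS does not start with "Ready" (e.g. NotReady, Unknown).
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# Cordoned nodes report STATUS such as "Ready,SchedulingDisabled".
cordoned_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}
```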
Service/Networking Issues
kubectl get svc SERVICE -- check TYPE, CLUSTER-IP, EXTERNAL-IP, PORTS
kubectl get endpoints SERVICE -- verify backends exist
- No endpoints -> check selector matches pod labels:
kubectl get pods -l key=value
- DNS:
kubectl exec -it debug-pod -- nslookup SERVICE.NAMESPACE.svc.cluster.local
- CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
- Network Policy:
kubectl get networkpolicy -n NAMESPACE
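For the "no endpoints" case, the selector comparison can be automated by reading the selector from the Service and feeding it back to `kubectl get pods -l`. A sketch, assuming a selector-based Service; `svc_selector` and `pods_for_svc` are hypothetical names:

```shell
#!/bin/sh
# Hypothetical helpers: check which pods a Service selector actually matches.

# Print the Service selector as a comma-separated label selector (k=v,k2=v2).
svc_selector() {
  kubectl get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'
}

# List the pods the selector matches; an empty list explains missing endpoints.
pods_for_svc() {
  sel="$(svc_selector "$1" "$2")"
  if [ -z "$sel" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$2" -l "$sel" -o wide
}

# Usage: pods_for_svc SERVICE NAMESPACE
```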
Storage Issues
kubectl get pvc -n NAMESPACE -- check STATUS (Bound/Pending)
- PVC Pending -> check Events with targeted output:
kubectl get events -n NAMESPACE --field-selector involvedObject.name=PVC --sort-by='.lastTimestamp'
- StorageClass not found ->
kubectl get sc
- Provisioner failed -> check CSI driver pods:
kubectl get pods -n kube-system | grep csi
- Access mode conflicts: check ReadWriteOnce vs ReadWriteMany
- EBS: check AZ affinity (EBS volumes are AZ-bound)
- Mount failures: check pod events:
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD --sort-by='.lastTimestamp' | grep -i mount
HPA/VPA Issues
kubectl get hpa -- check TARGETS column (current/target)
- "unknown" targets -> metrics-server issue:
kubectl get pods -n kube-system | grep metrics-server
- Not scaling: check min/max replicas, check metrics:
kubectl get hpa HPA -o jsonpath='{.status.conditions[*].message}'
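- Scaling too slow: check the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period (default 15s; not adjustable on managed control planes such as EKS) and the stabilization window (behavior.scaleUp/scaleDown.stabilizationWindowSeconds in the HPA spec; scale-down defaults to 300s)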
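When an HPA appears stuck, it can help to compute what the documented formula would yield: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to min/max replicas. A sketch with a hypothetical helper name, integer inputs only; it deliberately ignores the controller's tolerance band (10% by default) and stabilization windows:

```shell
#!/bin/sh
# Hypothetical helper: approximate the HPA replica calculation.
# Args: current_replicas current_metric target_metric min_replicas max_replicas
hpa_desired_replicas() {
  current="$1"; metric="$2"; target="$3"; min="$4"; max="$5"
  # Integer ceiling of current * metric / target.
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# Example: 4 replicas at 90% CPU against a 60% target, bounds 2..10.
#   hpa_desired_replicas 4 90 60 2 10   # prints 6
```

If the result already equals maxReplicas or minReplicas, the HPA is working as configured and the bounds are the limiting factor.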
Output Size Management
CRITICAL: Always prefer targeted queries over kubectl describe to avoid tool-result-too-large errors. Use --field-selector, -o jsonpath, and label selectors to narrow output.
Preferred Patterns (Small Output)
kubectl get pod POD -n NAMESPACE -o jsonpath='{.status.phase}{"\n"}{range .status.conditions[*]}{.type}{"="}{.status}{" "}{.message}{"\n"}{end}'
kubectl get pod POD -n NAMESPACE -o jsonpath='{range .status.containerStatuses[*]}{.name}{": ready="}{.ready}{", restarts="}{.restartCount}{", state="}{.state}{"\n"}{end}'
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD --sort-by='.lastTimestamp'
kubectl get node NODE -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{" "}{.message}{"\n"}{end}'
kubectl get node NODE -o jsonpath='{"cpu: "}{.status.allocatable.cpu}{"\nmemory: "}{.status.allocatable.memory}{"\npods: "}{.status.allocatable.pods}{"\n"}'
kubectl get pods --all-namespaces --field-selector 'status.phase!=Running,status.phase!=Succeeded'
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE
kubectl get pods -n NAMESPACE -l app=APPNAME -o wide
Handling Large Outputs
kubectl get events -n NAMESPACE --sort-by='.lastTimestamp' | tail -20
kubectl get pods -n NAMESPACE -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount,NODE:.spec.nodeName'
kubectl logs POD -n NAMESPACE --tail=100
kubectl logs POD -n NAMESPACE --since=10m --tail=50
kubectl get pods -n NAMESPACE -o custom-columns='POD:.metadata.name,CONTAINER:.spec.containers[0].name,CPU_REQ:.spec.containers[0].resources.requests.cpu,CPU_LIM:.spec.containers[0].resources.limits.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory'
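The tail-based truncation above can be wrapped in a small helper so every command gets a line cap by default. A sketch; `capped` and the `MAX_LINES` variable are hypothetical names, and keeping the last lines suits time-sorted output such as events:

```shell
#!/bin/sh
# Hypothetical wrapper: run any command and keep only the last MAX_LINES lines
# (default 50). stderr is merged so error messages are capped as well.
capped() {
  "$@" 2>&1 | tail -n "${MAX_LINES:-50}"
}

# Usage:
#   MAX_LINES=20
#   capped kubectl get events -n NAMESPACE --sort-by='.lastTimestamp'
```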
When describe Is Necessary
If you must use kubectl describe, narrow the scope first:
- Identify the specific field you need
- Try jsonpath first
- If describe is the only option, pipe through grep:
kubectl describe pod POD -n NAMESPACE | grep -A 20 "^Events:"
kubectl describe node NODE | grep -A 20 "^Conditions:"
Common Patterns
Debug Containers
kubectl debug -it POD --image=busybox --target=CONTAINER
kubectl run debug --image=nicolaka/netshoot --rm -it -- /bin/bash
kubectl debug POD --copy-to=debug-pod --image=ubuntu --share-processes
Resource Investigation
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount' | tail -20
kubectl get events --sort-by='.lastTimestamp' -n NAMESPACE | tail -30
kubectl top pods -n NAMESPACE --sort-by=cpu
kubectl top pods -n NAMESPACE --sort-by=memory
kubectl get pods --all-namespaces --field-selector 'status.phase!=Running,status.phase!=Succeeded'
kubectl get pods -n NAMESPACE -o custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,CPU_LIM:.spec.containers[0].resources.limits.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory,MEM_LIM:.spec.containers[0].resources.limits.memory'
kubectl get pods -n NAMESPACE -o jsonpath='{range .items[*]}{range .status.containerStatuses[*]}{.name}{" lastState.reason="}{.lastState.terminated.reason}{" restarts="}{.restartCount}{"\n"}{end}{end}' | grep -i oom
Idempotent Remediation Checklist
Before applying any resource change (memory, CPU, replica count), always:
- Read current state — query the exact field you intend to change using jsonpath
- Compare to desired state — if already at target, skip and report "no change needed"
- Apply the change: use kubectl patch with a precise JSON path
- Verify post-change — re-read the field to confirm it matches the target
- Monitor rollout — ensure new pods start successfully after the change
This prevents:
- Duplicate patches that trigger unnecessary rollouts
- Drift from unexpected double-application (e.g., doubling memory twice)
- Unnecessary pod restarts in production environments
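The checklist applied to a replica count might look like the sketch below. Note the swap: it uses `kubectl scale` with `--current-replicas` rather than `kubectl patch`, because that flag makes the change conditional on the value read in the first step. The function name `ensure_replicas` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read, compare, apply, verify, monitor for replica counts.
# Args: deployment namespace desired_replicas
ensure_replicas() {
  deploy="$1"; ns="$2"; want="$3"

  # 1. Read current state.
  have="$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}')"

  # 2. Compare to desired state.
  if [ "$have" = "$want" ]; then
    echo "no change needed"
    return 0
  fi

  # 3. Apply; --current-replicas rejects the change if the value moved since the read.
  kubectl scale deployment "$deploy" -n "$ns" \
    --current-replicas="$have" --replicas="$want" || return 1

  # 4. Verify post-change.
  now="$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}')"
  if [ "$now" != "$want" ]; then
    echo "verification failed: replicas=$now, expected $want" >&2
    return 1
  fi

  # 5. Monitor rollout.
  kubectl rollout status deployment "$deploy" -n "$ns" --timeout=120s
}

# Usage: ensure_replicas DEPLOY NAMESPACE 3
```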