kubernetes-troubleshooting

Debug Kubernetes pods, services, networking, and scaling issues. Use this skill when troubleshooting K8s deployments, investigating pod failures, or diagnosing cluster problems.

Stars: 45

Forks: 14

Updated: February 28, 2026 at 04:11

Source: diegosouzapw/awesome-omni-skill (creator: diegosouzapw)

Kubernetes Troubleshooting

You are a Kubernetes expert. Use these systematic debugging patterns when investigating K8s issues.

Diagnostic Decision Tree

Pod not running?
├── Pending → Resource constraints or scheduling issues
│   ├── kubectl describe pod <name> → check Events
│   ├── Insufficient CPU/memory → scale cluster or reduce requests
│   ├── Node selector/affinity not matching → check node labels
│   └── PVC not bound → check storage class and PV availability
├── CrashLoopBackOff → Application crashing on startup
│   ├── kubectl logs <pod> → check application logs
│   ├── kubectl logs <pod> --previous → check last crash logs
│   ├── OOMKilled → increase memory limits
│   ├── Exit code 1 → application error (bad config, missing env)
│   └── Exit code 137 → SIGKILL (128 + 9): OOM kill, or force-kill after a failed liveness probe and grace period
├── ImagePullBackOff → Can't pull container image
│   ├── Image name typo → verify image:tag exists
│   ├── Private registry → check imagePullSecrets
│   └── Rate limited → Docker Hub pull limit, use mirror
├── Running but not Ready → Readiness probe failing
│   ├── Check readiness probe config
│   ├── Application not listening on expected port
│   └── Dependency not available (database, cache)
└── Evicted → Node pressure
    ├── Disk pressure → clean up images, expand disk
    └── Memory pressure → reduce workload or add nodes

Essential Debug Commands

Pod Investigation

# Overview
kubectl get pods -A                          # All pods, all namespaces
kubectl get pods -o wide                     # With node and IP info
kubectl get pods --sort-by='.status.startTime' # Sorted by age

# Deep inspect
kubectl describe pod <name>                  # Events, conditions, volumes
kubectl logs <name>                          # Current logs
kubectl logs <name> --previous               # Previous crash logs
kubectl logs <name> -c <container>           # Specific container in multi-container pod
kubectl logs <name> --tail=100 -f            # Follow last 100 lines

# Interactive debug
kubectl exec -it <name> -- /bin/sh           # Shell into pod
kubectl exec -it <name> -- env               # Check environment
kubectl exec -it <name> -- cat /etc/resolv.conf  # Check DNS config

# Resource usage
kubectl top pods                             # CPU/memory per pod
kubectl top nodes                            # CPU/memory per node
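
On a busy cluster, `kubectl get pods -A` can return hundreds of rows, so it helps to filter down to the pods that need attention. The following is a minimal sketch rather than part of the original skill: `unhealthy_pods` is a hypothetical helper that reads `kubectl get pods -A --no-headers` output on stdin and assumes the default column order (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin and
# prints pods that are not Running, or Running with containers not Ready.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    {
      split($3, ready, "/")            # READY looks like "1/2"
      if ($4 == "Completed") next      # finished Jobs are not a problem
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2 " " $4 " " $3
    }'
}
```

Typical use would be `kubectl get pods -A --no-headers | unhealthy_pods`, followed by `kubectl describe pod` on whatever it prints.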

Service & Networking

# Check service endpoints
kubectl get endpoints <service>              # Are pods registered?
kubectl get svc <service> -o yaml            # Service config

# DNS resolution (from inside a pod)
kubectl exec -it <pod> -- nslookup <service>
kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health

# Test connectivity
kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then: curl, dig, nslookup, tcpdump, ping

# Ingress
kubectl get ingress -A
kubectl describe ingress <name>
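
A short Service name only resolves from pods in the same namespace, so when testing DNS across namespaces it is safer to use the fully qualified name. The sketch below shows how that name is assembled. `svc_fqdn` is a hypothetical helper, and `cluster.local` is assumed as the cluster domain (the usual default, but clusters can be configured differently).

```shell
# Hypothetical helper: build the in-cluster DNS name of a Service.
# Arguments: service name, namespace (default: "default"),
# cluster domain (assumed default: "cluster.local").
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "${2:-default}" "${3:-cluster.local}"
}

svc_fqdn my-api prod    # prints my-api.prod.svc.cluster.local
```

The result can be passed to the lookup shown above, for example `kubectl exec -it <pod> -- nslookup "$(svc_fqdn <service> <namespace>)"`.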

Cluster Health

kubectl get nodes                            # Node status
kubectl describe node <name>                 # Node conditions, allocatable resources
kubectl get events -A --sort-by='.lastTimestamp' # Recent events, all namespaces
kubectl cluster-info                         # Control plane and core service endpoints
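
For clusters with many nodes, a small filter over `kubectl get nodes` output can surface only the nodes worth describing. This is a sketch under stated assumptions: `nodes_needing_attention` is a hypothetical helper, and it relies on the default column order (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` on stdin and
# prints every node whose STATUS is not exactly "Ready", for example
# NotReady, or Ready,SchedulingDisabled for a cordoned node.
nodes_needing_attention() {
  awk '$2 != "Ready" { print $1 " " $2 }'
}
```

Usage would look like `kubectl get nodes --no-headers | nodes_needing_attention`, then `kubectl describe node <name>` for each hit.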

Common Issues and Fixes

CrashLoopBackOff

# 1. Check logs
kubectl logs <pod> --previous

# 2. Common causes:
# - Missing environment variable → check deployment env/configmap/secret
# - Database not reachable → check network policy, service DNS
# - Port conflict → check containerPort in deployment
# - Permissions → check SecurityContext, ServiceAccount

# 3. Debug with overridden command
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- /bin/sh
# Manually run the entrypoint to see errors

OOMKilled (Exit Code 137)

# Check current limits
kubectl describe pod <name> | grep -A 5 "Limits"

# Fix: increase memory limit
# In deployment spec:
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Increase this

# Monitor actual usage first
kubectl top pod <name>
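
Exit code 137 is not specific to OOM: by shell convention, a code above 128 means the process was terminated by signal number (code - 128), and 137 - 128 = 9 is SIGKILL. The sketch below decodes that arithmetic. `explain_exit_code` is a hypothetical helper, and its signal table covers only a few common signals.

```shell
# Hypothetical helper: explain a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() (
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed successfully"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case $sig in
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name=unknown ;;
    esac
    echo "exit $code: killed by signal $sig ($name)"
  else
    echo "exit $code: application exited on its own with an error"
  fi
)

explain_exit_code 137   # prints: exit 137: killed by signal 9 (SIGKILL)
```

To confirm that a SIGKILL really was an OOM kill, check the container's last state in `kubectl describe pod <name>`: an OOM kill is reported with `Reason: OOMKilled`.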

Service Not Reachable

# Checklist:
# 1. Pod is Running and Ready?
kubectl get pods -l app=<name>

# 2. Service has endpoints?
kubectl get endpoints <service>
# If empty → Service selector matches no pod labels, or no matching pod is Ready

# 3. Port correct?
kubectl get svc <service> -o jsonpath='{.spec.ports[*]}'
# targetPort must match the port the container actually listens on (normally its containerPort)

# 4. NetworkPolicy blocking?
kubectl get networkpolicy -A
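
The most common cause of empty endpoints in step 2 is a selector that does not match the pod labels. A Service selects a pod only when every key=value pair in its selector appears among the pod's labels. The sketch below makes that check explicit. `selector_matches` is a hypothetical helper that takes both sides as comma-separated `key=value` lists, the format shown by `kubectl describe svc` (Selector line) and `kubectl get pods --show-labels`.

```shell
# Hypothetical helper: does a Service selector match a pod's labels?
# $1 = selector, $2 = pod labels, both as comma-separated key=value lists.
# Runs in a subshell so the IFS change does not leak to the caller.
selector_matches() (
  selector=$1
  labels=",$2,"
  IFS=,
  for pair in $selector; do
    case $labels in
      *",$pair,"*) ;;                                # pair present on the pod
      *) echo "missing label: $pair"; exit 1 ;;      # first mismatch wins
    esac
  done
  echo "selector matches"
)

selector_matches 'app=web,tier=frontend' 'app=web,pod-template-hash=abc,tier=frontend'
# prints: selector matches
```

Extra labels on the pod are fine; only a selector pair that is missing or has a different value breaks the match.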

Persistent Volume Issues

# PVC stuck in Pending
kubectl describe pvc <name>
# Common: no matching PV, storage class missing, capacity insufficient

# Check storage classes
kubectl get storageclass

# Check PVs
kubectl get pv

Resource Right-Sizing

Requests vs Limits

resources:
  requests:          # Guaranteed minimum — scheduler uses this
    cpu: "100m"      # 0.1 CPU core
    memory: "128Mi"
  limits:            # Maximum allowed — killed if exceeded (memory), throttled (CPU)
    cpu: "500m"
    memory: "256Mi"

Rules of thumb:

- requests = average usage + 20% buffer
- limits = peak usage + 30% buffer
- Set requests whenever you set limits; if requests are omitted, Kubernetes defaults them to the limit values
- CPU limits cause throttling, so some teams set only requests for CPU
- Memory limits are hard: a container that exceeds its limit is OOMKilled
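
The first two rules of thumb are simple arithmetic, sketched here as a worked example. `suggest_resources` is a hypothetical helper; it expects average and peak usage as integers in one unit (for example CPU millicores or memory MiB, as observed with `kubectl top pod`) and rounds results up.

```shell
# Hypothetical helper: apply the 20% / 30% buffer rules to measured usage.
# $1 = average usage, $2 = peak usage (integers, same unit).
suggest_resources() (
  avg=$1
  peak=$2
  request=$(( (avg * 120 + 99) / 100 ))   # average + 20% buffer, rounded up
  limit=$(( (peak * 130 + 99) / 100 ))    # peak + 30% buffer, rounded up
  echo "request=$request limit=$limit"
)

suggest_resources 100 400   # prints request=120 limit=520
```

These percentages are starting points from the list above, not fixed values; spiky workloads may need a larger gap between request and limit.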

HPA (Horizontal Pod Autoscaler)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
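
To reason about what this HPA will do, it helps to know the core scaling formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to `minReplicas` and `maxReplicas`. The sketch below implements only that formula. `hpa_desired_replicas` is a hypothetical helper, and it deliberately ignores the controller's tolerance band, stabilization windows, and pod-readiness handling.

```shell
# Sketch of the core HPA formula (not the full controller behavior).
# $1 = current replicas, $2 = current average utilization (%),
# $3 = target utilization (%), $4 = minReplicas, $5 = maxReplicas.
hpa_desired_replicas() (
  current=$1; utilization=$2; target=$3; min=$4; max=$5
  # Integer ceiling of current * utilization / target
  desired=$(( (current * utilization + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

hpa_desired_replicas 4 90 70 2 10   # prints 6
```

With the manifest above, 4 replicas averaging 90% CPU against a 70% target give ceil(360 / 70) = 6 replicas. Utilization is measured relative to each container's CPU request, so the HPA needs CPU requests to be set on the target pods.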

Quick Reference

| Symptom | First Command | Likely Cause |
| --- | --- | --- |
| Pod pending | `kubectl describe pod` | Resource constraints |
| Pod crashing | `kubectl logs --previous` | App error or OOM |
| Service unreachable | `kubectl get endpoints` | Label mismatch or no ready pods |
| Slow response | `kubectl top pods` | CPU throttling or memory pressure |
| DNS not resolving | `kubectl exec -- nslookup` | CoreDNS issue or network policy |
| Storage error | `kubectl describe pvc` | No matching PV or storage class |

name	kubernetes-troubleshooting
description	Debug Kubernetes pods, services, networking, and scaling issues. Use this skill when troubleshooting K8s deployments, investigating pod failures, or diagnosing cluster problems.
alwaysApply	false
