kubernetes-debug

Stars: 631

Forks: 76

Updated: March 3, 2026 at 06:09

Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures.

Source: incidentfox/incidentfox

name	kubernetes-debug
description	Kubernetes debugging methodology and scripts. Use for pod crashes, CrashLoopBackOff, OOMKilled, deployment issues, resource problems, or container failures.

Kubernetes Debugging

Core Principle: Gateway First, Events Before Logs

ALWAYS start by discovering clusters via the gateway. Do NOT use kubectl directly — this sandbox has no direct k8s API access. All k8s queries go through the k8s-gateway.

Step 1: Discover clusters (MANDATORY first step)

python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py

Step 2: Use --cluster-id on all scripts

python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py --cluster-id <CLUSTER_ID>
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id <CLUSTER_ID>

NEVER run kubectl directly. NEVER run a gateway-capable script without --cluster-id; list_clusters.py, which discovers the IDs, is the only exception. If list_clusters.py returns no clusters, tell the user to install the k8s-agent on their cluster first.

Gateway-capable scripts: list_pods, get_events, get_logs, describe_pod, describe_deployment, list_namespaces. Direct-only scripts (not available in SaaS): describe_node, get_resources.
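
The gateway rules above can be captured in a small command builder. This is a minimal sketch, not part of the skill: `build_command`, `SCRIPTS_DIR`, and the two set names are hypothetical helpers introduced for illustration; only the script names, the scripts directory, and the `--cluster-id` flag come from this document.

```python
from typing import List, Optional

# Scripts directory as given in this document.
SCRIPTS_DIR = ".claude/skills/infrastructure-kubernetes/scripts"

# Script sets as listed in this document.
GATEWAY_SCRIPTS = {"list_pods", "get_events", "get_logs",
                   "describe_pod", "describe_deployment", "list_namespaces"}
DIRECT_ONLY_SCRIPTS = {"describe_node", "get_resources"}


def build_command(script: str, args: List[str],
                  cluster_id: Optional[str]) -> List[str]:
    """Return an argv list for a skill script, enforcing the gateway rules."""
    if script == "list_clusters":
        # Discovery is the one step that runs before a cluster ID exists.
        return ["python", f"{SCRIPTS_DIR}/list_clusters.py", *args]
    if script in DIRECT_ONLY_SCRIPTS:
        raise ValueError(f"{script} is direct-only and cannot use the gateway")
    if script not in GATEWAY_SCRIPTS:
        raise ValueError(f"unknown script: {script}")
    if not cluster_id:
        raise ValueError("run list_clusters.py first and pass its cluster ID")
    return ["python", f"{SCRIPTS_DIR}/{script}.py", *args,
            "--cluster-id", cluster_id]
```

For example, `build_command("list_pods", ["-n", "production"], "abc123")` yields an argv list that could be handed to `subprocess.run`, while a direct-only script such as `describe_node` is rejected before anything is executed.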

ALWAYS check pod events BEFORE logs. Events explain 80% of issues faster:

OOMKilled → Memory limit exceeded
ImagePullBackOff → Image not found or auth issue
FailedScheduling → No nodes with enough resources
CrashLoopBackOff → Container crashing repeatedly
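
The events-before-logs rule can be sketched as a small triage step. This is an illustrative sketch under stated assumptions: `triage` and `KNOWN_REASONS` are hypothetical names, and the input is assumed to be a list of reason strings already extracted from the output of get_events.py, whose actual output format is not specified here.

```python
from typing import Dict, Iterable, List, Tuple

# Reason -> meaning, copied from the list above.
KNOWN_REASONS: Dict[str, str] = {
    "OOMKilled": "Memory limit exceeded",
    "ImagePullBackOff": "Image not found or auth issue",
    "FailedScheduling": "No nodes with enough resources",
    "CrashLoopBackOff": "Container crashing repeatedly",
}


def triage(event_reasons: Iterable[str]) -> Tuple[List[str], bool]:
    """Return (findings, need_logs) for the event reasons seen on a pod."""
    seen: List[str] = []
    for reason in event_reasons:
        # Keep each known reason once, in the order it first appeared.
        if reason in KNOWN_REASONS and reason not in seen:
            seen.append(reason)
    findings = [f"{reason}: {KNOWN_REASONS[reason]}" for reason in seen]
    # Fetch logs only when events explain nothing, or when the container
    # is crash-looping and the startup error has to be read from the logs.
    need_logs = not seen or "CrashLoopBackOff" in seen
    return findings, need_logs
```

The design choice is that logs are a fallback: a pod whose events already show `OOMKilled` or `ImagePullBackOff` returns `need_logs=False`, so the more expensive log fetch is skipped.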

Available Scripts

All scripts are in .claude/skills/infrastructure-kubernetes/scripts/

list_clusters.py - Discover available remote clusters

python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py
python .claude/skills/infrastructure-kubernetes/scripts/list_clusters.py --json

list_pods.py - List pods with status

python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n <namespace> [--label <selector>] [--cluster-id <id>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n otel-demo --label app.kubernetes.io/name=payment
python .claude/skills/infrastructure-kubernetes/scripts/list_pods.py -n production --cluster-id abc123

get_events.py - Get pod events (USE FIRST!)

python .claude/skills/infrastructure-kubernetes/scripts/get_events.py <pod-name> -n <namespace> [--cluster-id <id>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n otel-demo
python .claude/skills/infrastructure-kubernetes/scripts/get_events.py payment-7f8b9c6d5-x2k4m -n production --cluster-id abc123

get_logs.py - Get pod logs

python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py <pod-name> -n <namespace> [--tail N] [--container NAME] [--cluster-id <id>]

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --tail 100
python .claude/skills/infrastructure-kubernetes/scripts/get_logs.py payment-7f8b9c6d5-x2k4m -n otel-demo --container payment

describe_pod.py - Detailed pod info

python .claude/skills/infrastructure-kubernetes/scripts/describe_pod.py <pod-name> -n <namespace> [--cluster-id <id>]

describe_deployment.py - Deployment status and rollout history

python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py <deployment-name> -n <namespace> [--cluster-id <id>]

# Example:
python .claude/skills/infrastructure-kubernetes/scripts/describe_deployment.py payment -n otel-demo

list_namespaces.py - List all namespaces

python .claude/skills/infrastructure-kubernetes/scripts/list_namespaces.py [--cluster-id <id>]

get_resources.py - Resource usage vs limits (direct-only)

python .claude/skills/infrastructure-kubernetes/scripts/get_resources.py <pod-name> -n <namespace>

describe_node.py - Node status, conditions, and resource usage (direct-only)

python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py <node-name>
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all

# Examples:
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py ip-10-0-1-42.ec2.internal
python .claude/skills/infrastructure-kubernetes/scripts/describe_node.py --all --json

Debugging Workflows

Pod Not Starting (Pending/CrashLoopBackOff)

1. list_pods.py - Check pod status
2. get_events.py - Look for scheduling/pull/crash events
3. describe_pod.py - Check conditions and container states
4. get_logs.py - Only if the events don't explain the failure

Pod Restarting (OOMKilled/Crashes)

1. get_events.py - Check for OOMKilled or error events
2. get_resources.py - Compare usage vs limits (direct-only; skip when working through the gateway)
3. get_logs.py - Check for errors before the crash
4. describe_pod.py - Check restart count and state

Deployment Not Progressing

1. describe_deployment.py - Check replica counts and rollout history
2. list_pods.py - Find stuck pods
3. get_events.py - Check events on stuck pods

Node Resource Issues (High CPU/Memory, FailedScheduling)

1. describe_node.py --all - Check all nodes for conditions and resource usage (direct-only)
2. describe_node.py <node> - Deep dive into a specific node (direct-only)
3. list_pods.py - Check whether pods are stuck in Pending
4. get_events.py - Look for FailedScheduling with resource reasons
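
The four workflows above can be written down as data so that the step order is explicit and direct-only steps are dropped when only the gateway is available. This is a sketch under assumptions: the workflow keys, `WORKFLOWS`, `DIRECT_ONLY`, and `plan` are hypothetical names; the step order and the direct-only classification come from this document.

```python
from typing import Dict, List

# Scripts this document marks as direct-only (not available in SaaS).
DIRECT_ONLY = {"get_resources.py", "describe_node.py"}

# Ordered steps per workflow, as listed above. Keys are illustrative names.
WORKFLOWS: Dict[str, List[str]] = {
    "pod_not_starting": [
        "list_pods.py", "get_events.py", "describe_pod.py", "get_logs.py"],
    "pod_restarting": [
        "get_events.py", "get_resources.py", "get_logs.py", "describe_pod.py"],
    "deployment_not_progressing": [
        "describe_deployment.py", "list_pods.py", "get_events.py"],
    "node_resource_issues": [
        "describe_node.py --all", "describe_node.py <node>",
        "list_pods.py", "get_events.py"],
}


def plan(workflow: str, gateway: bool = True) -> List[str]:
    """Return the ordered steps, omitting direct-only scripts in gateway mode."""
    steps = WORKFLOWS[workflow]
    if not gateway:
        return list(steps)
    # The script name is the first token of each step.
    return [step for step in steps if step.split()[0] not in DIRECT_ONLY]
```

In gateway mode the node workflow reduces to `list_pods.py` followed by `get_events.py`, which makes it clear that node-level detail is not reachable from a SaaS sandbox.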

Common Issues & Solutions

| Event Reason | Meaning | Action |
|---|---|---|
| OOMKilled | Container exceeded memory limit | Increase limits or fix memory leak |
| ImagePullBackOff | Can't pull image | Check image name, registry auth |
| CrashLoopBackOff | Container keeps crashing | Check logs for startup errors |
| FailedScheduling | No node can run pod | Check node resources, taints |
| Unhealthy | A liveness, readiness, or startup probe failed | Check probe config, app health |

Output Format

When reporting findings, use this structure:

## Kubernetes Analysis

**Pod**: <name>
**Namespace**: <namespace>
**Status**: <phase> (Restarts: N)

### Events
- [timestamp] <reason>: <message>

### Issues Found
1. [Issue description with evidence]

### Root Cause Hypothesis
[Based on events and logs]

### Recommended Action
[Specific remediation step]
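
One way to produce this structure consistently is to render it from a small record. The sketch below is illustrative only: `Event`, `Report`, and `render` are hypothetical names, and the field names are assumptions chosen to mirror the placeholders in the template above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    timestamp: str
    reason: str
    message: str


@dataclass
class Report:
    pod: str
    namespace: str
    phase: str
    restarts: int
    events: List[Event] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)
    root_cause: str = ""
    action: str = ""


def render(report: Report) -> str:
    """Fill the reporting template above from a Report record."""
    lines = [
        "## Kubernetes Analysis",
        "",
        f"**Pod**: {report.pod}",
        f"**Namespace**: {report.namespace}",
        f"**Status**: {report.phase} (Restarts: {report.restarts})",
        "",
        "### Events",
    ]
    lines += [f"- [{e.timestamp}] {e.reason}: {e.message}" for e in report.events]
    lines += ["", "### Issues Found"]
    # Issues are numbered from 1, matching the template.
    lines += [f"{n}. {issue}" for n, issue in enumerate(report.issues, start=1)]
    lines += ["", "### Root Cause Hypothesis", report.root_cause,
              "", "### Recommended Action", report.action]
    return "\n".join(lines)
```

Keeping the findings in a record and rendering at the end means every report has the same sections in the same order, even when a section such as the event list is empty.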
