k8s-debug-pending-pod

Name: K8s Debug Pending Pod
Author: Project-HAMi

Use when pods are stuck in Pending, CrashLoopBackOff, or ImagePullBackOff state. Performs event-driven triage to quickly identify root cause, then deep-dives into scheduling failures, resource exhaustion, image pull errors, and crash loops.


Stars: 3,503

Forks: 570

Updated: May 8, 2026 at 06:20


Repository: Project-HAMi/HAMi


| Input | Required | Default |
|---|---|---|
| Namespace | Yes | (none) |
| Kubeconfig file path | No | ~/.kube/config |
| Context name | No | current-context in the kubeconfig |
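The inputs above correspond to kubectl's `-n`, `--kubeconfig`, and `--context` flags. The following is a minimal sketch of how the defaults could be applied; the helper name `kubectl_prefix` and its argument order are hypothetical and not taken from the skill itself.

```shell
# Hypothetical helper (not part of the skill): prints the kubectl prefix
# for the given inputs. Arguments: namespace [kubeconfig] [context].
kubectl_prefix() {
  namespace="${1:-}"
  # Default kubeconfig path, as listed in the inputs table.
  kubeconfig="${2:-$HOME/.kube/config}"
  context="${3:-}"

  if [ -z "$namespace" ]; then
    echo "namespace is required" >&2
    return 1
  fi

  prefix="kubectl --kubeconfig $kubeconfig"
  # With no context given, omit the flag so kubectl falls back to the
  # current-context recorded in the kubeconfig.
  if [ -n "$context" ]; then
    prefix="$prefix --context $context"
  fi
  echo "$prefix -n $namespace"
}
```

For example, `kubectl_prefix prod` prints a prefix that uses the default kubeconfig and no explicit context. The printed string is meant for display; paths containing spaces would need extra quoting before being executed.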

| Event Reason | Points to |
|---|---|
| FailedScheduling | Step 3 (Pending) |
| CrashLoopBackOff / BackOff | Step 4 (Crash) |
| ImagePullBackOff / ErrImagePull | Step 5 (Image) |
| OOMKilled | Step 4 (Crash) |
| FailedAttachVolume / FailedMount | PVC issue (note in report) |
| Unhealthy | Probe misconfiguration (note in report) |
| Evicted | Node pressure (note in report) |
| NodeNotReady | Infrastructure issue (note in report) |
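The triage mapping above is essentially a lookup from event reason to the next step. As a rough sketch, it could be expressed as a `case` statement; the function name `route_event` and the `unclassified` fallback are assumptions added for illustration.

```shell
# Hypothetical helper: maps a Kubernetes event reason to the step
# (or report note) listed in the triage table.
route_event() {
  case "$1" in
    FailedScheduling) echo "Step 3 (Pending)" ;;
    CrashLoopBackOff|BackOff|OOMKilled) echo "Step 4 (Crash)" ;;
    ImagePullBackOff|ErrImagePull) echo "Step 5 (Image)" ;;
    FailedAttachVolume|FailedMount) echo "PVC issue" ;;
    Unhealthy) echo "Probe misconfiguration" ;;
    Evicted) echo "Node pressure" ;;
    NodeNotReady) echo "Infrastructure issue" ;;
    # Fallback is an assumption; the table does not define one.
    *) echo "unclassified" ;;
  esac
}
```

The last four results are findings to note in the report rather than steps to run.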

| Event Message Pattern | Root Cause | Remediation |
|---|---|---|
| Insufficient cpu / Insufficient memory | Resource exhaustion | Reduce requests, scale cluster, or check quota (→ Step 6) |
| 0/N nodes are available + taint | Taint/toleration mismatch | Add toleration or untaint node |
| 0/N nodes are available + node selector | No matching nodes | Fix nodeSelector or add matching nodes |
| 0/N nodes are available + affinity | Affinity rule unsatisfiable | Relax affinity rules |
| persistentvolumeclaim "X" not found | Missing PVC | Create PVC or fix claim name |
| pod has unbound immediate PersistentVolumeClaims | PVC not bound | Check StorageClass and PV availability |
| exceeded quota | Quota limit hit | → Step 6 |
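A scheduling message can be matched against the patterns above with plain substring checks. The sketch below assumes a hypothetical function `classify_scheduling_message` and simple case-sensitive matching; real scheduler messages vary by Kubernetes version, so the patterns may need adjusting.

```shell
# Hypothetical helper: returns the root cause for a FailedScheduling
# (or related) event message, following the table's patterns.
classify_scheduling_message() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "Resource exhaustion" ;;
    *persistentvolumeclaim*"not found"*) echo "Missing PVC" ;;
    *"unbound immediate PersistentVolumeClaims"*) echo "PVC not bound" ;;
    *"exceeded quota"*) echo "Quota limit hit" ;;
    # The remaining patterns require "nodes are available" plus a keyword.
    *"nodes are available"*taint*) echo "Taint/toleration mismatch" ;;
    *"nodes are available"*"node selector"*) echo "No matching nodes" ;;
    *"nodes are available"*affinity*) echo "Affinity rule unsatisfiable" ;;
    # Fallback is an assumption; the table does not define one.
    *) echo "unclassified" ;;
  esac
}
```

Resource exhaustion is checked first because such messages also contain "nodes are available".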

| Termination Reason | Root Cause | Remediation |
|---|---|---|
| OOMKilled | Memory limit exceeded | Increase memory limits, profile usage |
| exitCode 1 | Application failure | Check logs, configs, secrets, dependencies |
| exitCode 137 | SIGKILL / OOM | Check node pressure and memory |
| exitCode 139 | Segfault | Debug binary compatibility |
| exitCode 143 | SIGTERM not handled | Review graceful shutdown |
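The three larger exit codes follow the usual convention that a process killed by a signal exits with 128 plus the signal number: 137 = 128 + 9 (SIGKILL), 139 = 128 + 11 (SIGSEGV), 143 = 128 + 15 (SIGTERM). A small sketch of that decoding follows; the function name `explain_exit_code` and the fallback messages are assumptions.

```shell
# Hypothetical helper: explains a container exit code using the table
# above, falling back to the 128 + signal convention for other codes.
explain_exit_code() {
  code="$1"
  case "$code" in
    1) echo "Application failure" ;;
    137) echo "SIGKILL / OOM" ;;
    139) echo "Segfault" ;;
    143) echo "SIGTERM not handled" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Codes above 128 conventionally mean "killed by signal (code - 128)".
        echo "terminated by signal $((code - 128))"
      else
        echo "unclassified"
      fi
      ;;
  esac
}
```

The helper expects a numeric argument; non-numeric input is not handled.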

| Error Pattern | Root Cause | Remediation |
|---|---|---|
| repository does not exist | Wrong image name/tag | Verify image name |
| unauthorized | Missing credentials | Fix imagePullSecrets |
| manifest unknown | Tag doesn't exist | Verify tag in registry |
| connection refused | Registry unreachable | Check network/DNS |
| x509: certificate | TLS issue | Fix CA trust chain |
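For image pull failures, the error text from the pod's events can be checked against the patterns above to suggest a remediation. The sketch below assumes a hypothetical function `image_pull_remediation` and plain substring matching.

```shell
# Hypothetical helper: suggests the remediation from the table for an
# image pull error message.
image_pull_remediation() {
  case "$1" in
    *"repository does not exist"*) echo "Verify image name" ;;
    *unauthorized*) echo "Fix imagePullSecrets" ;;
    *"manifest unknown"*) echo "Verify tag in registry" ;;
    *"connection refused"*) echo "Check network/DNS" ;;
    *"x509: certificate"*) echo "Fix CA trust chain" ;;
    # Fallback is an assumption; the table does not define one.
    *) echo "unclassified" ;;
  esac
}
```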

kubectl get pods -n <namespace> -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
kubectl top pods -n <namespace>
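To compare requests against a quota, the CPU values printed by the first command need a common unit. Kubernetes writes CPU either in millicores (`500m`) or in cores (`0.5`, `2`). The sketch below normalizes both to millicores; the function names are hypothetical, and it assumes multi-container pods show comma-separated values and unset fields show `<none>`.

```shell
# Hypothetical helper: converts one Kubernetes CPU quantity to millicores.
cpu_to_millicores() {
  case "$1" in
    "<none>"|"") echo 0 ;;
    *m) echo "${1%m}" ;;   # already in millicores, strip the suffix
    *) awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;  # cores
  esac
}

# Hypothetical helper: sums a comma-separated list of CPU quantities,
# as assumed for the CPU_REQ column of a multi-container pod.
sum_cpu_millicores() {
  total=0
  for value in $(echo "$1" | tr ',' ' '); do
    total=$((total + $(cpu_to_millicores "$value")))
  done
  echo "$total"
}
```

For instance, `sum_cpu_millicores "100m,0.5,2"` adds 100, 500, and 2000 millicores.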

Kubernetes Pod Troubleshooting Report
=====================================
Namespace: <namespace>
Cluster: <context> (via <kubeconfig>)
Scope: all unhealthy pods in namespace

Primary Issue
- <confirmed root cause with evidence>

Additional Findings
- <secondary issues discovered during scan>

Pod Status Summary
- Total: N | Running: N | Pending: N | CrashLoop: N | ImagePull: N

Detailed Findings
- [Per-pod breakdown with cause and evidence]

Recommended Actions (priority order)
1. <highest impact fix>
2. <next>
3. <next>

Skipped Checks
- <any steps skipped due to permissions, unreachable API, or missing metrics-server>
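The Pod Status Summary line of the report can be derived from a pod listing. The sketch below assumes input in the default `kubectl get pods --no-headers` layout, where the third column is the pod status; the function name `summarize_pods` is hypothetical.

```shell
# Hypothetical helper: reads pod rows on stdin (NAME READY STATUS ...)
# and prints the summary line used in the report template.
summarize_pods() {
  awk '
    { total++ }
    $3 == "Running" { running++ }
    $3 == "Pending" { pending++ }
    $3 == "CrashLoopBackOff" { crash++ }
    $3 == "ImagePullBackOff" || $3 == "ErrImagePull" { image++ }
    END {
      # Unset counters print as 0 with %d.
      printf "Total: %d | Running: %d | Pending: %d | CrashLoop: %d | ImagePull: %d\n",
        total, running, pending, crash, image
    }'
}
```

A possible invocation is `kubectl get pods -n <namespace> --no-headers | summarize_pods`. Pods in other states count toward the total only.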



Interaction

Command Convention

Diagnostic Workflow

Step 1: Global Pod Overview

Step 2: Event Triage

Step 3: Pending Pod Root Cause Analysis

Step 4: CrashLoopBackOff / Error Diagnosis

Step 5: ImagePullBackOff Diagnosis

Step 6: Resource Quota Check (Conditional)

Step 7: Report
