diagnose
| Field | Value |
|---|---|
| name | diagnose |
| description | Use when collecting diagnostics from failing or stuck KubeBlocks Cluster instances, including Cluster, Component, Pod, logs, events, OpsRequest, and KubeBlocks controller evidence. |
Reference resolution: when this source-derived skill mentions docs/..., resolve it from the shared support package beside the installed user skills:
- Codex: ~/.codex/skills/kubeblocks-addon-source-docs/docs/...
- Claude Code: ~/.claude/skills/kubeblocks-addon-source-docs/docs/...
- Shared kubeblocks-addon-docs checkout: skills/kubeblocks-addon-source-docs/docs/...

When it mentions scripts/..., resolve it from the same support package under scripts/.... If you are working inside a checkout of the original apecloud/kubeblocks-addon-skills, repo-relative paths are also valid.
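A minimal resolution sketch (assumes exactly the install locations above; DOCS_BASE is a hypothetical variable name):

for ROOT in ~/.codex/skills ~/.claude/skills; do
  if [ -d "$ROOT/kubeblocks-addon-source-docs" ]; then
    DOCS_BASE="$ROOT/kubeblocks-addon-source-docs"
    break
  fi
done
echo "Support package: ${DOCS_BASE:-not found}"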
Collect comprehensive diagnostics from failing or stuck KubeBlocks cluster instances.
Target: $ARGUMENTS
(KB cluster name, e.g., kb-test-redis. If omitted, auto-discovers all non-healthy clusters.)
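Typical invocations (assuming the skill is wired up as a slash command, like /sync-image below):

/diagnose kb-test-redis   # diagnose one named cluster
/diagnose                 # scan for all non-healthy clusters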
# Resolve the repo root (fall back to cwd) and load optional .env overrides
SCRIPT_DIR="$(git rev-parse --show-toplevel 2>/dev/null || pwd)"
[ -f "$SCRIPT_DIR/.env" ] && source "$SCRIPT_DIR/.env"
[ -n "$KUBECONFIG" ] && export KUBECONFIG
# Fail fast if the API server is unreachable
kubectl cluster-info --request-timeout=5s \
  || { echo "ERROR: kubectl cannot reach the cluster. Check KUBECONFIG in .env"; exit 1; }
echo "KUBECONFIG=${KUBECONFIG:-~/.kube/config}"
If a cluster name was provided:
kubectl get cluster <cluster-name> -o jsonpath='{.status.phase}'
kubectl get cluster <cluster-name> -o jsonpath='{.status.components}' | python3 -m json.tool 2>/dev/null
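If the phase alone is inconclusive, the Cluster's status conditions usually carry a human-readable message. A sketch in the same style:

kubectl get cluster <cluster-name> -o jsonpath='{.status.conditions}' | python3 -m json.tool 2>/dev/null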
If no name was provided, auto-discover non-healthy clusters:
kubectl get cluster -o json | python3 -c "
import sys, json

data = json.load(sys.stdin)
healthy = {'Running', 'Stopped', 'Stopping', 'Deleting'}
for item in data.get('items', []):
    phase = item.get('status', {}).get('phase', 'Unknown')
    name = item['metadata']['name']
    if phase not in healthy:
        print(f'{name} phase={phase}')
"
Note: the KB v1 cluster phases are Creating, Running, Updating, Stopping, Stopped, and Deleting. There is no Failed or Error phase at the cluster level; pods and components carry the failure signals.
If no non-healthy clusters found: report "All clusters are healthy" and stop.
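Since components carry the failure signals, a quick component-level pass can narrow the search before full collection. A sketch using the same label selector as the steps below:

kubectl get component -l "app.kubernetes.io/instance=<cluster-name>" \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase'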
For each cluster, collect all of the following:
CLUSTER=<cluster-name>
echo "===== CLUSTER OVERVIEW ====="
kubectl get cluster "$CLUSTER" -o yaml
echo "===== COMPONENT STATUS ====="
kubectl get component -l "app.kubernetes.io/instance=$CLUSTER" -o wide
echo "===== POD STATUS ====="
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER" -o wide
for POD in $(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER" \
    -o jsonpath='{.items[*].metadata.name}'); do
  echo ""
  echo "===== POD: $POD ====="
  kubectl describe pod "$POD"

  # Init containers (often reveal startup sequence failures)
  for C in $(kubectl get pod "$POD" \
      -o jsonpath='{.spec.initContainers[*].name}' 2>/dev/null); do
    echo "--- Init container: $POD/$C ---"
    kubectl logs "$POD" -c "$C" --tail=100 --previous 2>&1 || true
    kubectl logs "$POD" -c "$C" --tail=100 2>&1
  done

  # Main containers
  for C in $(kubectl get pod "$POD" \
      -o jsonpath='{.spec.containers[*].name}'); do
    echo "--- Container: $POD/$C ---"
    kubectl logs "$POD" -c "$C" --tail=100 --previous 2>&1 || true
    kubectl logs "$POD" -c "$C" --tail=100 2>&1
  done

  # Pod-level events (sorted by time)
  echo "--- Events for $POD ---"
  kubectl get events \
    --field-selector "involvedObject.name=$POD" \
    --sort-by='.lastTimestamp' 2>/dev/null
done
echo "===== CLUSTER EVENTS ====="
kubectl get events \
  --field-selector "involvedObject.name=$CLUSTER" \
  --sort-by='.lastTimestamp' 2>/dev/null
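Cluster-scoped events can miss failures reported against intermediate objects (Components, workload resources). As a catch-all, a sketch scanning recent Warning events in the namespace:

kubectl get events --field-selector type=Warning \
  --sort-by='.lastTimestamp' 2>/dev/null | tail -n 20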
KB_POD=$(kubectl get pods -n kb-system \
-l app.kubernetes.io/name=kubeblocks \
-o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -n "$KB_POD" ]]; then
echo "===== KubeBlocks Operator: $KB_POD ====="
kubectl logs -n kb-system "$KB_POD" --tail=80
else
echo "KubeBlocks operator pod not found in kb-system"
kubectl get pods -n kb-system
fi
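On a busy operator, the last 80 lines may predate the failure. One way to cut noise (a sketch; assumes the pod was found and its logs mention the cluster by name):

kubectl logs -n kb-system "$KB_POD" --tail=500 2>/dev/null \
  | grep -i "$CLUSTER" | tail -n 40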
After collecting all output, identify the root cause and classify it.

Application or config errors (the addon YAML or config is at fault):

| Evidence | Interpretation | Fix |
|---|---|---|
| Init container exits with config parse error | Config file has wrong parameters for this version | Fix config ConfigMap for this version |
| Main container exits immediately with exec format error | Wrong image architecture (amd64 vs arm64) | Check ComponentVersion image for correct arch |
| roleProbe command returns wrong output repeatedly | roleProbe command wrong for this engine version | Fix lifecycleActions.roleProbe.exec.command |
| Operator log: unknown field "xxx" | Field name mismatch in ComponentDefinition YAML | Check docs/kb-api-reference.md for the correct field name |
| Operator log: configmap "xxx" not found | configs[].template references a nonexistent ConfigMap | Fix the ConfigMap name or ensure it's applied |
| Operator log: field is immutable | Missing apps.kubeblocks.io/skip-immutable-check: "true" annotation | Add the annotation to the resource |
| Container exits with DB-specific error (wrong auth, wrong data dir, wrong port) | Lifecycle script or env var wrong for this version | Fix the specific script or env var |
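For the exec format error row, the architectures an image actually publishes can be checked directly. A sketch (assumes the docker CLI is available; <registry>/<image>:<tag> are placeholders):

docker manifest inspect <registry>/<image>:<tag> | grep -i architecture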
Infrastructure issues (unrelated to the addon YAML):

| Evidence | Interpretation | Action |
|---|---|---|
| ErrImagePull / ImagePullBackOff | Image doesn't exist at that tag | Skip this version's test; run /sync-image if needed |
| FailedScheduling: Insufficient cpu/memory | Cluster lacks resources | Reduce resource requests or provision more capacity |
| FailedMount / VolumeBindingFailed | No matching StorageClass | Use a different storageClassName or provision a StorageClass |
| Container OOMKilled repeatedly | Memory limit too low for this engine | Increase test resource limits |
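Quick checks for the scheduling and storage rows (a sketch):

kubectl describe nodes | grep -A 6 'Allocated resources'
kubectl get storageclass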
Format the diagnosis output:
## Diagnosis Report
Cluster: <name>
Cluster Phase: <phase> (note: KB v1 has no Failed/Error cluster phase)
### Root Cause
<Clear description of what is failing and why>
### Key Evidence
- Pod <name>/<container> log: "<relevant log line>"
- Event: "<relevant event message>"
- Operator log: "<relevant operator log line>"
### Classification
[Application Error | Config Error | Infrastructure Issue | Unknown]
### Recommended Action
If Application/Config Error:
Return to code phase with: "Fix <what is wrong and why>. For example: the roleProbe command assumes the Redis 7.x ROLE command output format, but version 5.x returns a different format."
If Infrastructure Issue: Report clearly: "This is an infrastructure issue unrelated to the addon YAML. <what is missing and how to remediate it>."
If ErrImagePull for a valid version:
Report: "Image <image>:<tag> is not yet published in ${IMAGE_REGISTRY}/apecloud. The addon YAML is correct. Skip this version's tests and proceed to finish."