diagnose
| Field | Value |
|---|---|
| name | diagnose |
| description | Use when collecting diagnostics from failing or stuck KubeBlocks Cluster instances, including Cluster, Component, Pod, logs, events, OpsRequest, and KubeBlocks controller evidence. |
Reference resolution: when this source-derived skill mentions docs/..., resolve it from the shared support package beside the installed user skills:
- Codex: ~/.codex/skills/kubeblocks-addon-source-docs/docs/...
- Claude Code: ~/.claude/skills/kubeblocks-addon-source-docs/docs/...
- Shared kubeblocks-addon-docs checkout: skills/kubeblocks-addon-source-docs/docs/...

When it mentions scripts/..., resolve it from the same support package under scripts/.... If you are working inside a checkout of the original apecloud/kubeblocks-addon-skills, repo-relative paths are also valid.
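A minimal resolution sketch (assumes exactly the install locations above; DOCS_BASE is a hypothetical variable name):

for ROOT in ~/.codex/skills ~/.claude/skills; do
  if [ -d "$ROOT/kubeblocks-addon-source-docs" ]; then
    DOCS_BASE="$ROOT/kubeblocks-addon-source-docs"
    break
  fi
done
echo "Support package: ${DOCS_BASE:-not found}"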
Collect comprehensive diagnostics from failing or stuck KubeBlocks cluster instances.
Target: $ARGUMENTS
(KB cluster name, e.g., kb-test-redis. If omitted, auto-discovers all non-healthy clusters.)
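Typical invocations (assuming the skill is wired up as a slash command, like /sync-image below):

/diagnose kb-test-redis   # diagnose one named cluster
/diagnose                 # scan for all non-healthy clusters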
# Resolve the repo root (fall back to cwd) and load optional .env overrides
SCRIPT_DIR="$(git rev-parse --show-toplevel 2>/dev/null || pwd)"
[ -f "$SCRIPT_DIR/.env" ] && source "$SCRIPT_DIR/.env"
[ -n "$KUBECONFIG" ] && export KUBECONFIG
# Fail fast if the API server is unreachable
kubectl cluster-info --request-timeout=5s \
  || { echo "ERROR: kubectl cannot reach the cluster. Check KUBECONFIG in .env"; exit 1; }
echo "KUBECONFIG=${KUBECONFIG:-~/.kube/config}"
If a cluster name was provided:
kubectl get cluster <cluster-name> -o jsonpath='{.status.phase}'
kubectl get cluster <cluster-name> -o jsonpath='{.status.components}' | python3 -m json.tool 2>/dev/null
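If the phase alone is inconclusive, the Cluster's status conditions usually carry a human-readable message. A sketch in the same style:

kubectl get cluster <cluster-name> -o jsonpath='{.status.conditions}' | python3 -m json.tool 2>/dev/null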
If no name was provided, auto-discover non-healthy clusters:
kubectl get cluster -o json | python3 -c "
import sys, json

data = json.load(sys.stdin)
healthy = {'Running', 'Stopped', 'Stopping', 'Deleting'}
for item in data.get('items', []):
    phase = item.get('status', {}).get('phase', 'Unknown')
    name = item['metadata']['name']
    if phase not in healthy:
        print(f'{name} phase={phase}')
"
Note: the KB v1 cluster phases are Creating, Running, Updating, Stopping, Stopped, and Deleting. There is no Failed or Error phase at the cluster level; pods and components carry the failure signals.
If no non-healthy clusters found: report "All clusters are healthy" and stop.
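Since components carry the failure signals, a quick component-level pass can narrow the search before full collection. A sketch using the same label selector as the steps below:

kubectl get component -l "app.kubernetes.io/instance=<cluster-name>" \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase'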
For each cluster, collect all of the following:
CLUSTER=<cluster-name>
echo "===== CLUSTER OVERVIEW ====="
kubectl get cluster "$CLUSTER" -o yaml
echo "===== COMPONENT STATUS ====="
kubectl get component -l "app.kubernetes.io/instance=$CLUSTER" -o wide
echo "===== POD STATUS ====="
kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER" -o wide
for POD in $(kubectl get pods -l "app.kubernetes.io/instance=$CLUSTER" \
    -o jsonpath='{.items[*].metadata.name}'); do
  echo ""
  echo "===== POD: $POD ====="
  kubectl describe pod "$POD"

  # Init containers (often reveal startup sequence failures)
  for C in $(kubectl get pod "$POD" \
      -o jsonpath='{.spec.initContainers[*].name}' 2>/dev/null); do
    echo "--- Init container: $POD/$C ---"
    kubectl logs "$POD" -c "$C" --tail=100 --previous 2>&1 || true
    kubectl logs "$POD" -c "$C" --tail=100 2>&1
  done

  # Main containers
  for C in $(kubectl get pod "$POD" \
      -o jsonpath='{.spec.containers[*].name}'); do
    echo "--- Container: $POD/$C ---"
    kubectl logs "$POD" -c "$C" --tail=100 --previous 2>&1 || true
    kubectl logs "$POD" -c "$C" --tail=100 2>&1
  done

  # Pod-level events (sorted by time)
  echo "--- Events for $POD ---"
  kubectl get events \
    --field-selector "involvedObject.name=$POD" \
    --sort-by='.lastTimestamp' 2>/dev/null
done
echo "===== CLUSTER EVENTS ====="
kubectl get events \
  --field-selector "involvedObject.name=$CLUSTER" \
  --sort-by='.lastTimestamp' 2>/dev/null
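Cluster-scoped events can miss failures reported against intermediate objects (Components, workload resources). As a catch-all, a sketch scanning recent Warning events in the namespace:

kubectl get events --field-selector type=Warning \
  --sort-by='.lastTimestamp' 2>/dev/null | tail -n 20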
KB_POD=$(kubectl get pods -n kb-system \
-l app.kubernetes.io/name=kubeblocks \
-o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
if [[ -n "$KB_POD" ]]; then
echo "===== KubeBlocks Operator: $KB_POD ====="
kubectl logs -n kb-system "$KB_POD" --tail=80
else
echo "KubeBlocks operator pod not found in kb-system"
kubectl get pods -n kb-system
fi
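On a busy operator, the last 80 lines may predate the failure. One way to cut noise (a sketch; assumes the pod was found and its logs mention the cluster by name):

kubectl logs -n kb-system "$KB_POD" --tail=500 2>/dev/null \
  | grep -i "$CLUSTER" | tail -n 40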
After collecting all output, identify the root cause and classify it.

Application or config errors (the addon YAML or config is at fault):

| Evidence | Interpretation | Fix |
|---|---|---|
| Init container exits with config parse error | Config file has wrong parameters for this version | Fix config ConfigMap for this version |
| Main container exits immediately with exec format error | Wrong image architecture (amd64 vs arm64) | Check ComponentVersion image for correct arch |
| roleProbe command returns wrong output repeatedly | roleProbe command wrong for this engine version | Fix lifecycleActions.roleProbe.exec.command |
| Operator log: unknown field "xxx" | Field name mismatch in ComponentDefinition YAML | Check docs/kb-api-reference.md for the correct field name |
| Operator log: configmap "xxx" not found | configs[].template references a nonexistent ConfigMap | Fix the ConfigMap name or ensure it's applied |
| Operator log: field is immutable | Missing apps.kubeblocks.io/skip-immutable-check: "true" annotation | Add the annotation to the resource |
| Container exits with DB-specific error (wrong auth, wrong data dir, wrong port) | Lifecycle script or env var wrong for this version | Fix the specific script or env var |
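For the exec format error row, the architectures an image actually publishes can be checked directly. A sketch (assumes the docker CLI is available; <registry>/<image>:<tag> are placeholders):

docker manifest inspect <registry>/<image>:<tag> | grep -i architecture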
Infrastructure issues (unrelated to the addon YAML):

| Evidence | Interpretation | Action |
|---|---|---|
| ErrImagePull / ImagePullBackOff | Image doesn't exist at that tag | Skip this version's test; run /sync-image if needed |
| FailedScheduling: Insufficient cpu/memory | Cluster lacks resources | Reduce resource requests or provision more capacity |
| FailedMount / VolumeBindingFailed | No matching StorageClass | Use a different storageClassName or provision a StorageClass |
| Container OOMKilled repeatedly | Memory limit too low for this engine | Increase test resource limits |
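Quick checks for the scheduling and storage rows (a sketch):

kubectl describe nodes | grep -A 6 'Allocated resources'
kubectl get storageclass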
Format the diagnosis output:
## Diagnosis Report
Cluster: <name>
Cluster Phase: <phase> (note: KB v1 has no Failed/Error cluster phase)
### Root Cause
<Clear description of what is failing and why>
### Key Evidence
- Pod <name>/<container> log: "<relevant log line>"
- Event: "<relevant event message>"
- Operator log: "<relevant operator log line>"
### Classification
[Application Error | Config Error | Infrastructure Issue | Unknown]
### Recommended Action
If Application/Config Error:
Return to code phase with: "Fix <what is wrong and why>. For example: the roleProbe command assumes the Redis 7.x ROLE command output format, but version 5.x returns a different format."
If Infrastructure Issue: Report clearly: "This is an infrastructure issue unrelated to the addon YAML. <what is missing and how to remediate it>."
If ErrImagePull for a valid version:
Report: "Image <image>:<tag> is not yet published in ${IMAGE_REGISTRY}/apecloud. The addon YAML is correct. Skip this version's tests and proceed to finish."