| name | dynamo-troubleshoot |
| description | Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up. |
| license | Apache-2.0 |
| metadata | {"author":"Dan Gil <dagil@nvidia.com>","tags":["dynamo","kubernetes","troubleshooting","day-2"]} |
Dynamo Troubleshoot
Purpose
Turn a Dynamo failure into a clear problem class, strongest signal, and next
action. Start with read-only evidence, avoid secrets, and fix one layer at a
time.
Prerequisites
- Python 3.10+ on the operator machine.
kubectl configured with read access to the target namespace.
- Permission to read pods, events, jobs, PVCs, and
DynamoGraphDeployment resources (NOT secrets).
- Network reachability to the cluster API server.
Instructions
1. Collect A Read-Only Bundle
Run:
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}"
If the user names a deployment, include it:
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}" \
--deployment-name <deployment-name>
Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
2. Classify The Failure
Use references/failure-decision-tree.md and classify into one primary bucket:
- cluster/platform
- namespace/secret
- model cache/PVC/download
- image pull/runtime image
- GPU scheduling/resources
- operator/DynamoGraphDeployment reconciliation
- frontend/router
- worker/backend
- endpoint/API
- benchmark/perf job
3. Debug Top Down
Check in this order:
- namespace, storage class, GPU nodes, and HF secret existence
- PVC and model-download job
DynamoGraphDeployment status and events
- pod status,
describe pod, and container logs
- frontend service and port-forward
/v1/models
/v1/chat/completions
- benchmark job only after endpoint smoke test passes
4. Fix One Layer At A Time
Prefer the smallest reversible change:
- create missing namespace or HF secret
- patch
storageClassName
- patch image tag or image pull secret
- reduce GPU request only if the recipe can still be valid
- switch KV router to approximate mode only if workers do not publish events
- restart failed jobs after fixing the underlying config
After each fix, rerun the relevant readiness check before moving deeper.
Available Scripts
| Script | Purpose | Arguments |
|---|
scripts/collect_dynamo_debug_bundle.py | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | --namespace, --deployment-name, --output-dir |
Invoke via the agentskills.io run_script() protocol:
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])
Examples
Collect everything in a namespace for triage:
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo
Scope to a single failing deployment:
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace dynamo-demo \
--deployment-name qwen-vllm-disagg
Equivalent through the agent protocol:
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])
Output Contract
Return:
- problem class
- evidence checked
- strongest signal
- likely cause
- exact next command or patch
- what was ruled out
- whether it is safe to continue deployment or benchmarking
Limitations
- Read-only. Never mutates the cluster; remediation commands are returned, not executed.
- Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
- Bundle size grows with deployment size; on very large namespaces, scope with
--deployment-name.
- Does not validate disagg transport — use
dynamo-interconnect-check for that.
Troubleshooting
| Symptom | Likely cause | Next step |
|---|
kubectl returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
Bundle missing DynamoGraphDeployment status | Operator not installed or different namespace | Verify dynamo-platform operator is installed and watching the namespace |
Model-download job in Pending | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
Worker pods CrashLoopBackOff | Image/runtime mismatch or GPU not available | Inspect container logs; check nvidia.com/gpu allocatable on nodes |
Benchmark
See BENCHMARK.md for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run /nvskills-ci on an upstream PR touching this skill.
References
- Read
references/failure-decision-tree.md for bucket-specific checks.
- Use
scripts/collect_dynamo_debug_bundle.py for read-only bundle collection.