一键导入
dynamo-troubleshoot
// Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.
// Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.
| name | dynamo-troubleshoot |
| description | Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up. |
| license | Apache-2.0 |
| metadata | {"author":"Dan Gil <dagil@nvidia.com>","tags":["dynamo","kubernetes","troubleshooting","day-2"]} |
Turn a Dynamo failure into a clear problem class, strongest signal, and next action. Start with read-only evidence, avoid secrets, and fix one layer at a time.
kubectl configured with read access to the target namespace.DynamoGraphDeployment resources (NOT secrets).Run:
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}"
If the user names a deployment, include it:
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace "${NAMESPACE}" \
--deployment-name <deployment-name>
Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
Use references/failure-decision-tree.md and classify into one primary bucket:
Check in this order:
DynamoGraphDeployment status and eventsdescribe pod, and container logs/v1/models/v1/chat/completionsPrefer the smallest reversible change:
storageClassNameAfter each fix, rerun the relevant readiness check before moving deeper.
| Script | Purpose | Arguments |
|---|---|---|
scripts/collect_dynamo_debug_bundle.py | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | --namespace, --deployment-name, --output-dir |
Invoke via the agentskills.io run_script() protocol:
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])
Collect everything in a namespace for triage:
python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo
Scope to a single failing deployment:
python3 scripts/collect_dynamo_debug_bundle.py \
--namespace dynamo-demo \
--deployment-name qwen-vllm-disagg
Equivalent through the agent protocol:
run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])
Return:
--deployment-name.dynamo-interconnect-check for that.| Symptom | Likely cause | Next step |
|---|---|---|
kubectl returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
Bundle missing DynamoGraphDeployment status | Operator not installed or different namespace | Verify dynamo-platform operator is installed and watching the namespace |
Model-download job in Pending | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
Worker pods CrashLoopBackOff | Image/runtime mismatch or GPU not available | Inspect container logs; check nvidia.com/gpu allocatable on nodes |
See BENCHMARK.md for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run /nvskills-ci on an upstream PR touching this skill.
references/failure-decision-tree.md for bucket-specific checks.scripts/collect_dynamo_debug_bundle.py for read-only bundle collection.Audit Dynamo Rust hot-path `.clone()` calls, explain which clones are removable and why, and only apply clone-removal patches when explicitly requested.
Validate that a Dynamo deployment's NIXL/UCX/NCCL interconnect is ready for disaggregated serving over RDMA/NVLink. Use after recipe-runner brings a deployment up (especially disagg/multi-node) to confirm the KV transport is correct; use troubleshoot for diagnosing already-failed pods.
Select, validate, patch, and deploy existing NVIDIA Dynamo Kubernetes recipes. Use for model/backend/GPU/deployment-mode recipe bring-up; use router-starter for router-only mode work and troubleshoot for broken deployments.
Start or patch Dynamo router modes and run router endpoint smoke checks. Use for round-robin, KV-aware, least-loaded, or device-aware routing setup; use recipe-runner for recipe deployment and troubleshoot for failure diagnosis.
Start a debugging session with worklog file
Create or update Dynamo Enhancement Proposals as GitHub issues, including lightweight DEPs, implementation plans, and retroactive DEPs for ai-dynamo/dynamo.