dynamo-troubleshoot

Name: Dynamo Troubleshoot
Author: ai-dynamo

stars:7,132

forks:1,190

updated:May 28, 2026 at 23:29

SKILL.md


name	dynamo-troubleshoot
description	Diagnose failed or unhealthy Dynamo deployments. Use when pods, model-cache jobs, PVCs, workers, frontend/router health, endpoints, or benchmark jobs fail; use recipe-runner/router-starter before this for normal bring-up.
license	Apache-2.0
metadata	{"author":"Dan Gil <dagil@nvidia.com>","tags":["dynamo","kubernetes","troubleshooting","day-2"]}

Dynamo Troubleshoot

Purpose

Turn a Dynamo failure into a clear problem class, strongest signal, and next action. Start with read-only evidence, avoid secrets, and fix one layer at a time.

Prerequisites

Python 3.10+ on the operator machine.
kubectl configured with read access to the target namespace.
Permission to read pods, events, jobs, PVCs, and DynamoGraphDeployment resources (NOT secrets).
Network reachability to the cluster API server.

Instructions

1. Collect A Read-Only Bundle

Run:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}"

If the user names a deployment, include it:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace "${NAMESPACE}" \
  --deployment-name <deployment-name>

Do not collect Kubernetes secrets. Do not print Hugging Face tokens.
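As a guard for the rule above, collected text can be passed through a redaction step before it is printed or saved. The following is a minimal sketch, not the bundle script's actual behavior: it assumes Hugging Face tokens start with `hf_`, and the function name and the minimum length of 20 characters are illustrative choices.

```python
import re

# Assumption: Hugging Face user access tokens start with "hf_" followed by
# letters and digits. The minimum length of 20 is an illustrative choice.
HF_TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]{20,}")


def redact_hf_tokens(text: str) -> str:
    """Replace anything that looks like a Hugging Face token with a marker."""
    return HF_TOKEN_RE.sub("hf_***REDACTED***", text)
```

Running log lines and `describe` output through `redact_hf_tokens` before writing them to the bundle keeps an accidentally echoed token out of the report.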

2. Classify The Failure

Use references/failure-decision-tree.md and classify into one primary bucket:

cluster/platform
namespace/secret
model cache/PVC/download
image pull/runtime image
GPU scheduling/resources
operator/DynamoGraphDeployment reconciliation
frontend/router
worker/backend
endpoint/API
benchmark/perf job

3. Debug Top Down

Check in this order:

namespace, storage class, GPU nodes, and HF secret existence
PVC and model-download job
DynamoGraphDeployment status and events
pod status, describe pod, and container logs
frontend service and port-forward
/v1/models
/v1/chat/completions
benchmark job only after endpoint smoke test passes

4. Fix One Layer At A Time

Recommend the smallest reversible change, and return it as a command or patch rather than applying it:

create missing namespace or HF secret
patch storageClassName
patch image tag or image pull secret
reduce the GPU request only if the recipe remains valid with fewer GPUs
switch KV router to approximate mode only if workers do not publish events
restart failed jobs after fixing the underlying config

After each fix, rerun the relevant readiness check before moving deeper.

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| `scripts/collect_dynamo_debug_bundle.py` | Collect a read-only debug bundle (pods, events, jobs, PVCs, CR status) | `--namespace`, `--deployment-name`, `--output-dir` |

Invoke via the agentskills.io run_script() protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo"])

Examples

Collect everything in a namespace for triage:

python3 scripts/collect_dynamo_debug_bundle.py --namespace dynamo-demo

Scope to a single failing deployment:

python3 scripts/collect_dynamo_debug_bundle.py \
  --namespace dynamo-demo \
  --deployment-name qwen-vllm-disagg

Equivalent through the agent protocol:

run_script("scripts/collect_dynamo_debug_bundle.py", args=["--namespace", "dynamo-demo", "--deployment-name", "qwen-vllm-disagg"])

Output Contract

Return:

problem class
evidence checked
strongest signal
likely cause
exact next command or patch
what was ruled out
whether it is safe to continue deployment or benchmarking
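One way to keep every report aligned with this contract is a small record type with one field per item. The class and field names below are illustrative assumptions, not a schema defined by the skill.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TroubleshootReport:
    problem_class: str            # one primary bucket
    evidence_checked: list[str]   # what was actually looked at
    strongest_signal: str
    likely_cause: str
    next_action: str              # exact next command or patch
    ruled_out: list[str] = field(default_factory=list)
    safe_to_continue: bool = False  # deployment or benchmarking may proceed

    def to_json(self) -> str:
        """Serialize the report with stable key order for diffing."""
        return json.dumps(asdict(self), indent=2)
```

Defaulting `safe_to_continue` to `False` means a report has to state explicitly that it is safe to proceed.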

Limitations

Read-only. Never mutates the cluster; remediation commands are returned, not executed.
Will not collect secrets or print Hugging Face tokens; some failure modes (auth) may need user-side inspection.
Bundle size grows with deployment size; on very large namespaces, scope with --deployment-name.
Does not validate the disaggregated-serving (disagg) transport; use dynamo-interconnect-check for that.

Troubleshooting

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| `kubectl` returns Forbidden on events/pods | Service account lacks read RBAC | Ask operator for read-only role binding on the namespace |
| Bundle missing `DynamoGraphDeployment` status | Operator not installed or different namespace | Verify `dynamo-platform` operator is installed and watching the namespace |
| Model-download job in `Pending` | PVC unbound or HF secret missing | Fix PVC binding or create the named HF secret, then rerun the job |
| Worker pods `CrashLoopBackOff` | Image/runtime mismatch or GPU not available | Inspect container logs; check `nvidia.com/gpu` allocatable on nodes |
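For the last row, the allocatable `nvidia.com/gpu` count per node can be read from the JSON that `kubectl get nodes -o json` prints. This is a minimal sketch with a hypothetical function name. It treats nodes that do not advertise the resource as having zero GPUs.

```python
import json


def gpu_allocatable(nodes_json: str) -> dict[str, int]:
    """Map node name to allocatable nvidia.com/gpu from `kubectl get nodes -o json`."""
    nodes = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]: int(
            # Quantities arrive as strings; a missing key means no GPUs advertised.
            node["status"].get("allocatable", {}).get("nvidia.com/gpu", "0")
        )
        for node in nodes
    }
```

If every node reports zero, the problem most likely sits in the GPU scheduling/resources bucket rather than in the worker image.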

Benchmark

See BENCHMARK.md for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run /nvskills-ci on an upstream PR touching this skill.

References

Read references/failure-decision-tree.md for bucket-specific checks.
Use scripts/collect_dynamo_debug_bundle.py for read-only bundle collection.

related-skills.json

same repository

dynamo-clone-hotpath-audit.md

from "ai-dynamo/dynamo"

Audit Dynamo Rust hot-path `.clone()` calls, explain which clones are removable and why, and only apply clone-removal patches when explicitly requested.

2026-05-29 (7.1k stars)

dynamo-interconnect-check.md

from "ai-dynamo/dynamo"

Validate that a Dynamo deployment's NIXL/UCX/NCCL interconnect is ready for disaggregated serving over RDMA/NVLink. Use after recipe-runner brings a deployment up (especially disagg/multi-node) to confirm the KV transport is correct; use troubleshoot for diagnosing already-failed pods.

2026-05-28 (7.1k stars)

dynamo-recipe-runner.md

from "ai-dynamo/dynamo"

Select, validate, patch, and deploy existing NVIDIA Dynamo Kubernetes recipes. Use for model/backend/GPU/deployment-mode recipe bring-up; use router-starter for router-only mode work and troubleshoot for broken deployments.

2026-05-28 (7.1k stars)

dynamo-router-starter.md

from "ai-dynamo/dynamo"

Start or patch Dynamo router modes and run router endpoint smoke checks. Use for round-robin, KV-aware, least-loaded, or device-aware routing setup; use recipe-runner for recipe deployment and troubleshoot for failure diagnosis.

2026-05-28 (7.1k stars)

debug-session.md

from "ai-dynamo/dynamo"

Start a debugging session with worklog file

2026-05-27 (7.1k stars)

dep-create.md

from "ai-dynamo/dynamo"

Create or update Dynamo Enhancement Proposals as GitHub issues, including lightweight DEPs, implementation plans, and retroactive DEPs for ai-dynamo/dynamo.

2026-05-27 (7.1k stars)

package.json

"author": "ai-dynamo"

"repository": "ai-dynamo/dynamo"
