
dgx-diagnose

Name: Dgx Diagnose
Author: NVIDIA

Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure.


Skill metadata

Stars: 918

Forks: 218

Updated: May 30, 2026 at 11:49

SKILL.md


name	dgx-diagnose
description	Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure.
metadata	{"publisher":"nvidia","hardware":"DGX Station GB300"}

DGX Station Diagnostics

Diagnose common DGX Station issues. Run through the checks below to identify the problem.

Step 1. Gather system state

Run these commands and analyze the output:

# GPU status
nvidia-smi

# GPU device list with indices
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader

# Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1

# MIG state (queries device 1; change -i if the GB300 has a different index)
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"

# Fabric Manager
systemctl is-active nvidia-fabricmanager

# GPU processes (fuser -v prints its table on stderr, so merge it into stdout)
sudo fuser -v /dev/nvidia* 2>&1 || echo "No GPU processes found"

# Running Docker containers (any of them may be holding a GPU)
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null
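The checks above can also be captured in one pass so the output can be attached to a report. The following is a minimal sketch, assuming a POSIX shell; `run_section` and `collect_state` are hypothetical helper names (not part of any NVIDIA tool), `dgx-state.txt` is an arbitrary file name, and only a subset of the Step 1 commands is wrapped.

```shell
# run_section LABEL CMD [ARGS...]
# Prints a labelled section. A failing or missing command is recorded, not fatal.
run_section() {
    label=$1
    shift
    printf '=== %s ===\n' "$label"
    if ! "$@" 2>&1; then
        printf '[command failed or unavailable: %s]\n' "$*"
    fi
    printf '\n'
}

# collect_state writes a subset of the Step 1 checks to stdout in a fixed order.
collect_state() {
    run_section "GPU status" nvidia-smi
    run_section "GPU list" nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader
    run_section "Fabric Manager" systemctl is-active nvidia-fabricmanager
    run_section "Containers" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
}

# Usage (on the DGX Station):
#   collect_state > dgx-state.txt
```

Because each section records its own failure, a stopped Fabric Manager or a missing Docker install shows up in the snapshot instead of aborting it.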

Step 2. Match symptoms to known issues

Based on the gathered state and the user's reported problem, check for these known issues:

CUDA crashes with `--gpus all`

Cause: Mixed coherency. The GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context, and --gpus all exposes both GPUs to the container.

Fix: Use --gpus '"device=N"', where N is the index of the GB300, so that only the GB300 is visible.
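One way to avoid hardcoding N is to derive it from the GPU list. This is a sketch under one stated assumption: that the name reported by nvidia-smi for the GB300 contains the string "GB300". `gb300_index` and `gpus_flag` are hypothetical helper names.

```shell
# gb300_index: reads "index, name" CSV lines on stdin and prints the index of
# the first GPU whose name contains "GB300" (assumed naming, case-insensitive).
# Exits non-zero when no such GPU is listed.
gb300_index() {
    awk -F', *' 'tolower($2) ~ /gb300/ { print $1; found = 1; exit }
                 END { exit !found }'
}

# gpus_flag INDEX: prints the Docker flag with the nested quoting it expects.
gpus_flag() {
    printf '%s\n' "--gpus '\"device=$1\"'"
}

# Usage (on the DGX Station):
#   idx=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index) \
#       && gpus_flag "$idx"
```

If the name match fails, `gb300_index` returns a non-zero status, so the flag is never built from an empty index.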

Model running on wrong GPU (RTX PRO instead of GB300)

Check: Compare the device index in the docker command against the actual GPU indices.

Fix: Verify with nvidia-smi --query-gpu=index,name --format=csv,noheader and correct the --gpus flag.

vLLM crash / FlashInfer buffer overflow

Check: Container version: docker inspect vllm-server | grep Image

Fix: Use nvcr.io/nvidia/vllm:26.01-py3. Version 25.10 has a known FlashInfer bug on DGX Station.
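The version check can be scripted against the image reference rather than read by eye. A minimal sketch, assuming the affected images carry a tag beginning with 25.10; `vllm_image_status` is a hypothetical helper, and the container name vllm-server is taken from the check above.

```shell
# vllm_image_status IMAGE: classifies an image reference against the two
# versions named in this section. Anything else is reported as "unknown".
vllm_image_status() {
    case "$1" in
        nvcr.io/nvidia/vllm:25.10*)    echo "affected" ;;     # known FlashInfer bug
        nvcr.io/nvidia/vllm:26.01-py3) echo "recommended" ;;
        *)                             echo "unknown" ;;
    esac
}

# Usage (on the DGX Station):
#   vllm_image_status "$(docker inspect --format '{{.Config.Image}}' vllm-server)"
```

An "unknown" result means only that the image is neither of the two versions discussed here; it says nothing about whether that image is affected.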

SGLang CUDA errors

Check: The container tag must be cu130 for Blackwell SM103.

Fix: Use lmsysorg/sglang:latest-cu130.

CUDA OOM despite 279 GB HBM

Check: --max-model-len (vLLM) / --context-length (SGLang) and the memory utilization settings.

Fix: Reduce the context length, or lower --gpu-memory-utilization (vLLM) / --mem-fraction-static (SGLang).
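To see why context length matters so much, it helps to estimate the KV cache, which grows linearly with the number of cached tokens. The sketch below uses the common approximation of 2 (keys and values) x layers x KV heads x head dimension x bytes per value, per token. The model shape is hypothetical, and real engines allocate in blocks and add overhead, so treat the result as a rough lower bound.

```shell
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE TOKENS
# Prints the approximate KV-cache size in whole GiB (rounded down).
kv_cache_gib() {
    # 2x for keys and values, stored per layer, per KV head, per head dimension
    bytes_per_token=$((2 * $1 * $2 * $3 * $4))
    echo $((bytes_per_token * $5 / 1024 / 1024 / 1024))
}

# Hypothetical model: 80 layers, 8 KV heads, head_dim 128, 2-byte (FP16/BF16) values
kv_cache_gib 80 8 128 2 131072   # prints 40
kv_cache_gib 80 8 128 2 32768    # prints 10
```

TOKENS here is the total across all concurrently cached sequences, so a long context multiplied by several parallel requests can exhaust memory on top of the model weights.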

`nvidia-smi -mig 1` returns "In use by another client"

Check: sudo fuser -v /dev/nvidia* lists the processes holding the GPU. They must be stopped first.

Fix: Stop all GPU workloads, then retry.
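As a complement to fuser, the compute processes can be listed through nvidia-smi and reduced to a set of PIDs. A minimal sketch; `gpu_pids` is a hypothetical helper. Note that the compute-apps query reports only compute contexts, so graphics clients such as an X server may appear in the fuser output but not here.

```shell
# gpu_pids: reads "pid, process_name" CSV lines on stdin and prints each PID
# once, sorted numerically. Blank lines are ignored.
gpu_pids() {
    awk -F', *' 'NF { print $1 }' | sort -nu
}

# Usage (on the DGX Station):
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | gpu_pids
```

Once the list is empty, the MIG command can be retried.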

NVLink errors after disabling MIG

Check: systemctl is-active nvidia-fabricmanager

Fix: sudo systemctl start nvidia-fabricmanager

X server crash after nvidia-xconfig -a

Fix: sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf

Vulkan VK_ERROR_INITIALIZATION_FAILED

Cause: CUDA initialized before Vulkan, binding to the GB300.

Fix: Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: __GL_DeviceModalityPreference=2 ./your_app

HuggingFace 401 / token errors

Fix: Pass the token inline: -e HF_TOKEN="hf_...". Don't rely on a shell export for background Docker tasks, because the variable may not be set in the environment that launches them.
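A guard before the docker command catches an empty token early instead of surfacing later as a 401. This is a sketch, not a validation of the token itself; `require_hf_token` is a hypothetical helper, and the hf_ prefix check is an assumption based on the example above.

```shell
# require_hf_token: succeeds only if HF_TOKEN is set in the current shell and
# looks like the hf_... form shown above. It does not contact Hugging Face.
require_hf_token() {
    case "${HF_TOKEN:-}" in
        "")   echo "HF_TOKEN is empty or unset in this shell" >&2; return 1 ;;
        hf_*) return 0 ;;
        *)    echo "HF_TOKEN does not start with hf_" >&2; return 1 ;;
    esac
}

# Usage: run the guard first, then pass the value explicitly with
#   -e HF_TOKEN="$HF_TOKEN"
# on the docker run command line.
```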

Port already in use

Check: lsof -i :<PORT>

Fix: Stop the conflicting process or use a different host port: -p 8001:8000.
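Choosing the alternative host port can be automated by probing upward from the preferred one. A minimal sketch, assuming lsof is installed; `port_in_use` and `next_free_port` are hypothetical helpers. If lsof is missing, every port is reported as free, so check for it first.

```shell
# port_in_use PORT: succeeds if some process is listening on the TCP port.
port_in_use() {
    lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# next_free_port START: prints the first port >= START with no listener,
# or returns non-zero if none is found up to 65535.
next_free_port() {
    port=$1
    while [ "$port" -le 65535 ]; do
        if ! port_in_use "$port"; then
            echo "$port"
            return 0
        fi
        port=$((port + 1))
    done
    return 1
}

# Usage: host_port=$(next_free_port 8000), then map it with -p "$host_port":8000
```

Listeners owned by other users may be invisible to lsof without sudo, so a port reported as free can still fail to bind.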

Step 3. Report findings

Tell the user:

What the issue is
Why it happens (root cause)
The specific command to fix it
How to verify the fix worked
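The four points above can be kept in a fixed order with a small template. This is only an illustrative sketch; `report_finding` is a hypothetical helper and the example values are made up from the port conflict case.

```shell
# report_finding ISSUE CAUSE FIX VERIFY: prints the four parts in a fixed layout.
report_finding() {
    printf 'Issue:  %s\n' "$1"
    printf 'Cause:  %s\n' "$2"
    printf 'Fix:    %s\n' "$3"
    printf 'Verify: %s\n' "$4"
}

report_finding \
    "Port 8000 is already in use" \
    "Another process is bound to the host port" \
    "Map a different host port: -p 8001:8000" \
    "lsof -i :8001 lists the new listener"
```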


Source

NVIDIA

NVIDIA/dgx-spark-playbooks


