| name | dgx-diagnose |
| description | Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure. |
| metadata | {"publisher":"nvidia","hardware":"DGX Station GB300"} |
DGX Station Diagnostics
Work through the three steps below: gather system state, match the symptoms to a known issue, then report the findings.
Step 1. Gather system state
Run these commands and analyze the output:
nvidia-smi
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"
systemctl is-active nvidia-fabricmanager
sudo fuser -v /dev/nvidia* 2>/dev/null || echo "No GPU processes found"
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null
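The checks above can be wrapped so that each output is labeled and one failing or missing command does not stop the rest. This is a minimal sketch, not part of the original procedure: `run_check` and `gather_state` are hypothetical helper names, and only a subset of the Step 1 commands is shown.

```shell
# Hypothetical helper: run one diagnostic command, label its output,
# and keep going even if the command fails or is not installed.
run_check() {
  local label="$1"
  shift
  printf '=== %s ===\n' "$label"
  if ! "$@" 2>&1; then
    printf '[check failed or unavailable: %s]\n' "$label"
  fi
  return 0
}

# Collect part of the Step 1 state in one pass.
# Defined only; nothing runs until gather_state is called.
gather_state() {
  run_check "GPU overview"   nvidia-smi
  run_check "GPU memory"     nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader
  run_check "Fabric Manager" systemctl is-active nvidia-fabricmanager
  run_check "Containers"     docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
}
```

Calling `gather_state > dgx-state.txt` (the file name is arbitrary) saves a single snapshot that can be compared before and after a fix.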
Step 2. Match symptoms to known issues
Based on the gathered state and the user's reported problem, check for these known issues:
CUDA crashes with --gpus all
Cause: Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context.
Fix: Use --gpus '"device=N"' targeting only the GB300.
Model running on wrong GPU (RTX PRO instead of GB300)
Check: Compare the device index passed to the docker command against the actual GPU indices reported by nvidia-smi.
Fix: Verify with nvidia-smi --query-gpu=index,name --format=csv,noheader and correct the --gpus flag.
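Because the GB300's index can differ between systems, it may be safer to look the index up by name than to hard-code it. The sketch below assumes the `index, name` CSV layout produced by the query above. `gpu_index_by_name` is a hypothetical helper, and the exact name string reported for each GPU is an assumption to confirm against real output.

```shell
# Hypothetical helper: read "index, name" CSV rows on stdin and print the
# index of the first GPU whose name contains the pattern (default: GB300).
# Matching is case-insensitive; exits non-zero if nothing matches.
gpu_index_by_name() {
  local pattern="${1:-GB300}"
  awk -F', *' -v pat="$pattern" '
    index(toupper($2), toupper(pat)) { print $1; found = 1; exit }
    END { if (!found) exit 1 }
  '
}

# Usage (requires the NVIDIA driver; "your-image" is a placeholder):
#   idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_by_name GB300)"
#   docker run --gpus "\"device=${idx}\"" your-image
```

The escaped inner quotes pass the literal string `"device=N"` to Docker, matching the `--gpus '"device=N"'` form used above.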
vLLM crash / FlashInfer buffer overflow
Check: Container image and tag, using docker inspect --format '{{.Config.Image}}' vllm-server
Fix: Use nvcr.io/nvidia/vllm:26.01-py3. Version 25.10 has a known FlashInfer bug on DGX Station.
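The version check can be scripted by comparing the image tag against the affected release. This is a sketch under assumptions: `vllm_image_is_affected` is a hypothetical helper, and it assumes affected images carry a tag beginning with `25.10` (by analogy with the `26.01-py3` tag above).

```shell
# Hypothetical helper: exit 0 if the image reference carries a 25.10 tag
# (the version flagged above), exit 1 otherwise or if no tag is present.
vllm_image_is_affected() {
  local image="$1"
  local tag="${image##*:}"            # text after the last colon
  [ "$tag" != "$image" ] || return 1  # no colon means no tag
  case "$tag" in
    25.10*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage (container name taken from the check above):
#   image="$(docker inspect --format '{{.Config.Image}}' vllm-server)"
#   vllm_image_is_affected "$image" && echo "Switch to nvcr.io/nvidia/vllm:26.01-py3"
```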
SGLang CUDA errors
Check: Container tag — must be cu130 for Blackwell SM103.
Fix: Use lmsysorg/sglang:latest-cu130.
CUDA OOM despite 279 GB HBM
Check: --max-model-len / --context-length and memory utilization settings.
Fix: Reduce context length or lower --gpu-memory-utilization / --mem-fraction-static.
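To see what a lower utilization setting buys, it can help to work out the numbers. The sketch below treats the setting as a plain fraction of total GPU memory, which is an assumption about how each server interprets its flag. `mem_budget` is a hypothetical helper, and 279 is the HBM figure quoted above.

```shell
# Hypothetical helper: given total GPU memory (GB) and a utilization
# fraction, print the memory the server may claim and what is left over.
mem_budget() {
  awk -v total="$1" -v frac="$2" 'BEGIN {
    printf "budget=%.1f headroom=%.1f\n", total * frac, total * (1 - frac)
  }'
}

# Usage:
#   mem_budget 279 0.90
#   mem_budget 279 0.80
```

Under that assumption, dropping the fraction from 0.90 to 0.80 on 279 GB roughly doubles the memory left outside the server's pool, from about 27.9 GB to about 55.8 GB.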
nvidia-smi -mig 1 returns "In use by another client"
Check: sudo fuser -v /dev/nvidia* — GPU processes must be stopped first.
Fix: Stop every process that holds the GPU, including containers, then rerun the command.
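As a complement to fuser, the compute processes on the GPU can be listed through nvidia-smi's `--query-compute-apps` query and reduced to unique PIDs. This is a sketch: `gpu_pids` is a hypothetical helper, and the query only reports compute clients, so fuser may still show other holders.

```shell
# Hypothetical helper: read "pid, process_name" CSV rows on stdin and
# print each PID once, in ascending numeric order.
gpu_pids() {
  awk -F', *' '$1 ~ /^[0-9]+$/ { print $1 }' | sort -un
}

# Usage (requires the NVIDIA driver):
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | gpu_pids
```

Review the list before stopping anything. A PID that belongs to a container is usually better stopped with docker stop than killed directly.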
NVLink errors after disabling MIG
Check: systemctl is-active nvidia-fabricmanager
Fix: sudo systemctl start nvidia-fabricmanager
X server crash after nvidia-xconfig -a
Fix: sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf
Vulkan VK_ERROR_INITIALIZATION_FAILED
Cause: CUDA was initialized before Vulkan in the same process, which binds the process to the GB300.
Fix: Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: __GL_DeviceModalityPreference=2 ./your_app
HuggingFace 401 / token errors
Fix: Pass the token inline on the docker command: -e HF_TOKEN="hf_...". A variable exported in an interactive shell may not be visible to Docker tasks launched in the background, so don't rely on it.
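A quick guard can catch an empty or malformed token before the container starts, which is the usual symptom when a background task did not inherit the exported variable. This is a sketch: `require_hf_token` is a hypothetical helper, and it checks only for the `hf_` prefix shown above, not whether the token is valid.

```shell
# Hypothetical helper: exit 0 only if the token starts with "hf_" and has
# at least one character after the prefix. Never prints the token itself.
require_hf_token() {
  local token="${1:-}"
  case "$token" in
    hf_?*) return 0 ;;
    *)     echo "HF token missing or malformed" >&2; return 1 ;;
  esac
}

# Usage ("your-image" is a placeholder):
#   require_hf_token "$HF_TOKEN" && docker run -d -e HF_TOKEN="$HF_TOKEN" your-image
```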
Port already in use
Check: lsof -i :<PORT>
Fix: Stop the conflicting process or use a different host port: -p 8001:8000.
Step 3. Report findings
Tell the user:
- What the issue is
- Why it happens (root cause)
- The specific command to fix it
- How to verify the fix worked