vllm-setup

Name: vLLM Setup
Author: NVIDIA

Deploy a vLLM inference server on an NVIDIA DGX Station GB300 with validated container, GPU targeting, and tuning parameters. Use when the user asks to serve a model with vLLM, start a vLLM endpoint, or set up OpenAI-compatible inference on DGX Station.

Skill metadata

Stars: 918

Forks: 218

Updated: May 30, 2026 at 11:49

SKILL.md

name	vllm-setup
description	Deploy a vLLM inference server on an NVIDIA DGX Station GB300 with validated container, GPU targeting, and tuning parameters. Use when the user asks to serve a model with vLLM, start a vLLM endpoint, or set up OpenAI-compatible inference on DGX Station.
metadata	{"publisher":"nvidia","hardware":"DGX Station GB300"}

vLLM Setup on DGX Station

Deploy a vLLM inference server on DGX Station with validated configuration.

Steps

Find the GB300 GPU index. Run:
```
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all`; mixed coherency will cause CUDA failures.

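The index lookup can also be scripted. The sketch below is a hypothetical helper, not part of the validated configuration: it assumes the GB300's product name as reported by `nvidia-smi` contains the string "GB300", and it reads the same `index, name` CSV lines produced by the query above.

```shell
# Hypothetical helper: print the index of the first GPU whose name contains
# "GB300" (assumed naming), reading "index, name" CSV lines from stdin.
# Exits non-zero if no matching GPU is listed.
pick_gb300_index() {
  awk -F', *' '
    $2 ~ /GB300/ { print $1; found = 1; exit }
    END { if (!found) exit 1 }
  '
}

# Example use (needs nvidia-smi on the host):
#   GB300_INDEX=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | pick_gb300_index)
```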
Ask the user which model to serve. If they don't have a preference, suggest:
- nvidia/Qwen3-235B-A22B-NVFP4: large MoE model, fits in 279 GB HBM
- meta-llama/Llama-3.1-70B-Instruct: solid general-purpose model
- Qwen/Qwen3-8B: small model for testing

Check whether the user has an HF_TOKEN. Many models require Hugging Face authentication. Pass the token inline with `-e HF_TOKEN="..."`; do not rely on a shell `export` being visible to background Docker tasks.
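
One way to catch a missing token before the container starts is a small guard in the launching shell. This is a minimal sketch; `require_hf_token` is a hypothetical name, and it only checks that the variable is non-empty, not that Hugging Face accepts the token.

```shell
# Hypothetical guard: fail early when HF_TOKEN is unset or empty in the
# shell that is about to run `docker run`.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; gated models will fail to download" >&2
    return 1
  fi
}
```

Calling `require_hf_token` immediately before `docker run ... -e HF_TOKEN="$HF_TOKEN" ...` means an empty expansion is reported up front instead of surfacing later as a download error inside the container.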

Deploy the container. Use this validated configuration:

```
docker pull nvcr.io/nvidia/vllm:26.01-py3

docker run -d \
  --name vllm-server \
  --gpus '"device=<GB300_INDEX>"' \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="<TOKEN>" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "<MODEL>" \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9
```

Container version: Use nvcr.io/nvidia/vllm:26.01-py3. Do NOT use 25.10 — it has a FlashInfer buffer overflow on DGX Station.
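
If the image tag is ever passed in as a variable, the version rule above can be enforced with a small check. This is a sketch under assumptions: `check_vllm_image` is a hypothetical helper, and it assumes the affected images carry tags beginning with `25.10`.

```shell
# Hypothetical guard: refuse vLLM images from the 25.10 release
# (assumes their tags start with "25.10"); accept anything else.
check_vllm_image() {
  case "$1" in
    nvcr.io/nvidia/vllm:25.10*)
      echo "refusing $1: 25.10 has a FlashInfer buffer overflow on DGX Station" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

For example, `check_vllm_image "$IMAGE" && docker pull "$IMAGE"` pulls only when the tag passes the check.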

Wait for the server to be ready. Monitor logs:
```
docker logs -f vllm-server
```
Wait for the line indicating the server is listening on port 8000.
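
As an alternative to watching the logs by eye, readiness can be polled from a script. The helper below is a generic retry loop with a hypothetical name (`wait_until`); it retries any command until it succeeds or the attempts run out.

```shell
# Hypothetical helper: retry a command until it succeeds.
# Usage: wait_until <attempts> <delay_seconds> <command...>
# Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A possible call is `wait_until 120 5 curl -sf http://localhost:8000/health`, which assumes the server exposes a `/health` route; `http://localhost:8000/v1/models` could be polled instead. Large models can take several minutes to load, so the attempt count here is only a starting point.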

Test the server:

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```

Report the result to the user, including:
- Model loaded and serving on port 8000
- GPU memory utilization
- How to stop: docker stop vllm-server && docker rm vllm-server
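
For the GPU memory figure in the report, one option is to read used and total memory from `nvidia-smi` and convert them to a percentage. This is a sketch: `mem_percent` is a hypothetical helper, and it assumes the driver reports numeric `memory.used` and `memory.total` values for the GB300.

```shell
# Hypothetical helper: convert a "used, total" CSV line (MiB) from stdin
# into a percentage with one decimal place. Prints nothing if the total
# is missing or not numeric.
mem_percent() {
  awk -F', *' '$2 + 0 > 0 { printf "%.1f\n", 100 * $1 / $2 }'
}
```

It could be fed with `nvidia-smi -i <GB300_INDEX> --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | mem_percent`.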

Tuning parameters

Adjust these based on the user's workload:

| Parameter | Default | Agent workloads | Throughput workloads |
|---|---|---|---|
| `--max-model-len` | 32768 | 32768-65536 | 8192-16384 |
| `--gpu-memory-utilization` | 0.9 | 0.85-0.90 | 0.90-0.92 |
| `--enable-prefix-caching` | off | Enable (multi-turn reuse) | Enable |
| `--max-num-seqs` | default | 4-16 (lower latency) | 32+ (higher throughput) |
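
The two workload columns can be captured as named profiles. The sketch below uses a hypothetical `tuning_flags` helper; the specific values are illustrative picks from within the ranges in the table, not validated settings.

```shell
# Hypothetical helper: print `vllm serve` flags for a workload profile.
# Values are example picks from the ranges in the tuning table.
tuning_flags() {
  case "$1" in
    agent)
      echo "--max-model-len 65536 --gpu-memory-utilization 0.85 --enable-prefix-caching --max-num-seqs 8"
      ;;
    throughput)
      echo "--max-model-len 16384 --gpu-memory-utilization 0.92 --enable-prefix-caching --max-num-seqs 32"
      ;;
    *)
      echo "unknown profile: $1" >&2
      return 1
      ;;
  esac
}
```

The output can be appended to the serve command, for example `vllm serve "<MODEL>" $(tuning_flags agent)`, leaving the expansion unquoted so the flags split into separate arguments.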

Source

NVIDIA

NVIDIA/dgx-spark-playbooks

