| name | sglang-setup |
| description | Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station. |
| metadata | {"publisher":"nvidia","hardware":"DGX Station GB300"} |
SGLang Setup on DGX Station
Deploy an SGLang inference server on DGX Station with validated configuration.
Steps
-
Find the GB300 GPU index. Run:
nvidia-smi --query-gpu=index,name --format=csv,noheader
Identify the device index for the GB300 (typically device 1). Use this index for --gpus below. Do NOT use --gpus all — mixed coherency will cause CUDA failures.
-
Ask the user which model to serve. If they don't have a preference, suggest:
Qwen/Qwen3-8B — small, fast, good for testing
Qwen/Qwen3-32B — medium, good balance
meta-llama/Llama-3.1-70B-Instruct — large general-purpose
-
Check if the user has an HF_TOKEN. Pass inline with -e HF_TOKEN="...".
-
Deploy the container. Use this validated configuration:
docker pull lmsysorg/sglang:latest-cu130
docker run -d \
--name sglang-server \
--gpus '"device=<GB300_INDEX>"' \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 30000:30000 \
-e HF_TOKEN="<TOKEN>" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
lmsysorg/sglang:latest-cu130 \
sglang serve --model-path "<MODEL>" \
--host 0.0.0.0 \
--port 30000 \
--context-length 32768 \
--mem-fraction-static 0.85
Container version: Use lmsysorg/sglang:latest-cu130. The cu130 tag is required for Blackwell SM103 support.
First launch downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.
-
Wait for the server to be ready. Monitor logs:
docker logs -f sglang-server
-
Test the server:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 64
}'
-
Report the result to the user, including:
- Model loaded and serving on port 30000
- How to stop:
docker stop sglang-server && docker rm sglang-server
Key features
- RadixAttention — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with:
docker logs sglang-server 2>&1 | grep "cached-token" | tail -5
- Structured JSON output — use
response_format.json_schema in API requests for guaranteed valid JSON.
- Chunked prefill — add
--chunked-prefill-size 8192 to break long prefills into chunks, reducing time-to-first-token.
Tuning parameters
| Parameter | Default | Agent workloads | Throughput workloads |
|---|
--context-length | 32768 | 32768-65536 | 8192-16384 |
--mem-fraction-static | 0.85 | 0.80-0.85 | 0.85-0.88 |
--chunked-prefill-size | off | 4096-8192 | 8192 |
--enable-metrics | off | Optional | Recommended |
Structured output example
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "List three programming languages."}],
"max_tokens": 512,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "languages",
"schema": {
"type": "object",
"properties": {
"languages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"primary_use": {"type": "string"}
},
"required": ["name", "primary_use"]
}
}
},
"required": ["languages"]
}
}
}
}'