sglang-setup

Name: Sglang Setup
Author: NVIDIA

Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.

Updated: May 30, 2026 at 11:49

SKILL.md

name: sglang-setup
description: Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.
metadata: {"publisher":"nvidia","hardware":"DGX Station GB300"}

SGLang Setup on DGX Station

Deploy an SGLang inference server on DGX Station with validated configuration.

Steps

Find the GB300 GPU index. Run:
```
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all`: exposing GPUs with mixed memory coherency to the same container will cause CUDA failures.
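
The lookup in this step can be scripted. The sketch below is a minimal helper, assuming the GB300 reports a name containing the string `GB300` in `nvidia-smi` output; `find_gb300_index` is a hypothetical name, not part of any NVIDIA tool.

```shell
# Print the index of the first GPU whose name contains "GB300".
# Reads "index, name" CSV rows on stdin, as produced by:
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
# Assumption: the device name includes the literal text "GB300".
find_gb300_index() {
  awk -F', *' '
    toupper($2) ~ /GB300/ { print $1; found = 1; exit }
    END { if (!found) exit 1 }   # non-zero status when no GB300 row is seen
  '
}
```

For example, `GB300_INDEX=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gb300_index)` would capture the index for the `--gpus` flag. Confirm the value against the raw `nvidia-smi` output before relying on it.
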
Ask the user which model to serve. If they don't have a preference, suggest:
- Qwen/Qwen3-8B — small, fast, good for testing
- Qwen/Qwen3-32B — medium, good balance
- meta-llama/Llama-3.1-70B-Instruct — large general-purpose
Check whether the user has an HF_TOKEN. Gated models such as meta-llama/Llama-3.1-70B-Instruct cannot be downloaded without one. Pass it inline with -e HF_TOKEN="...".
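
As an alternative to typing the token inline, the check can be folded into a small helper. This is a sketch, not part of the validated configuration: `hf_token_flag` is a hypothetical name, and it relies on the standard Docker behavior that `-e NAME` without a value copies `NAME` from the calling environment.

```shell
# Print the docker flag that forwards HF_TOKEN, or nothing if it is unset.
hf_token_flag() {
  if [ -n "${HF_TOKEN:-}" ]; then
    # `-e HF_TOKEN` (no value) passes the host variable into the container,
    # so the token itself does not appear on the command line.
    echo "-e HF_TOKEN"
  fi
}
```

Used unquoted, as in `docker run -d $(hf_token_flag) ...`, the output expands to two words when the token is set and to nothing otherwise.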

Deploy the container. Use this validated configuration:

```
docker pull lmsysorg/sglang:latest-cu130

docker run -d \
  --name sglang-server \
  --gpus '"device=<GB300_INDEX>"' \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 30000:30000 \
  -e HF_TOKEN="<TOKEN>" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  lmsysorg/sglang:latest-cu130 \
  sglang serve --model-path "<MODEL>" \
    --host 0.0.0.0 \
    --port 30000 \
    --context-length 32768 \
    --mem-fraction-static 0.85
```

Container version: Use lmsysorg/sglang:latest-cu130. The cu130 tag is required for Blackwell SM103 support.

First launch downloads the model and compiles kernels, so it takes noticeably longer. Subsequent starts are faster because the weights are reused from the mounted ~/.cache/huggingface/hub directory.

Wait for the server to be ready. Monitor logs:
```
docker logs -f sglang-server
```
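
Instead of watching the logs by hand, readiness can also be polled over HTTP. The sketch below assumes the server exposes a `/health` endpoint on port 30000; `wait_for_sglang` and its default timeout of 600 seconds are illustrative choices, not values from the validated configuration.

```shell
# Poll until the server answers on /health or the timeout expires.
# Usage: wait_for_sglang [base_url] [timeout_seconds] [interval_seconds]
# Assumption: /health returns a success status once the model is loaded.
wait_for_sglang() {
  wfs_url="${1:-http://localhost:30000}"
  wfs_timeout="${2:-600}"
  wfs_interval="${3:-5}"
  wfs_waited=0
  while [ "$wfs_waited" -lt "$wfs_timeout" ]; do
    # -f: non-zero exit on HTTP errors; -s: no progress output.
    if curl -sf -o /dev/null "$wfs_url/health"; then
      echo "server ready after ${wfs_waited}s"
      return 0
    fi
    sleep "$wfs_interval"
    wfs_waited=$((wfs_waited + wfs_interval))
  done
  echo "server not ready after ${wfs_timeout}s" >&2
  return 1
}
```

Calling `wait_for_sglang` with no arguments before sending the test request avoids connection errors while the model is still loading. If it times out, `docker logs sglang-server` remains the place to look for the cause.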

Test the server:

```
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```

Report the result to the user, including:
- Model loaded and serving on port 30000
- How to stop: docker stop sglang-server && docker rm sglang-server

Key features

- RadixAttention: automatic KV cache reuse across requests that share a prefix. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5`
- Structured JSON output: use `response_format.json_schema` in API requests to constrain generation to schema-valid JSON. Set `max_tokens` high enough that the object is not truncated.
- Chunked prefill: add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token for requests queued behind long prompts.
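
The RadixAttention check above can be turned into a single number. This is a rough sketch that assumes the log lines carry a counter of the form `#cached-token: N`; the exact log format may differ between SGLang versions, and `sum_cached_tokens` is a hypothetical helper name.

```shell
# Sum every "cached-token: N" counter found on stdin.
# Assumption: server logs report prefix-cache hits as "#cached-token: N".
sum_cached_tokens() {
  grep -o 'cached-token: *[0-9][0-9]*' |
    awk -F': *' '{ total += $NF } END { print total + 0 }'
}
```

Sending the same long prompt twice and then running `docker logs sglang-server 2>&1 | sum_cached_tokens` should show the total grow on the second request if the shared prefix was served from cache.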

Tuning parameters

| Parameter | Default | Agent workloads | Throughput workloads |
|---|---|---|---|
| `--context-length` | 32768 | 32768-65536 | 8192-16384 |
| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 |
| `--chunked-prefill-size` | off | 4096-8192 | 8192 |
| `--enable-metrics` | off | Optional | Recommended |
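
One way to apply the table is to keep a flag set per workload. The sketch below picks one value from each range above; the specific picks and the `sglang_tuning_flags` name are illustrative assumptions, so adjust them to the model and memory budget at hand.

```shell
# Print tuning flags for a workload profile ("agent" or "throughput").
# Each value is one point chosen from the ranges in the tuning table.
sglang_tuning_flags() {
  case "$1" in
    agent)
      echo "--context-length 32768 --mem-fraction-static 0.85 --chunked-prefill-size 8192"
      ;;
    throughput)
      echo "--context-length 16384 --mem-fraction-static 0.88 --chunked-prefill-size 8192 --enable-metrics"
      ;;
    *)
      echo "usage: sglang_tuning_flags agent|throughput" >&2
      return 1
      ;;
  esac
}
```

In the `docker run` command, the output would replace the `--context-length` and `--mem-fraction-static` lines, for example `sglang serve --model-path "<MODEL>" --host 0.0.0.0 --port 30000 $(sglang_tuning_flags throughput)`.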

Structured output example

```
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "List three programming languages."}],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"}
                },
                "required": ["name", "primary_use"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }'
```
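
To confirm that a reply to the request above is usable, the message content can be parsed a second time as JSON. This sketch assumes `jq` is installed on the host and that the response follows the OpenAI-compatible shape (`choices[0].message.content`); `check_languages_reply` is a hypothetical helper name.

```shell
# Read a chat-completions response on stdin and succeed only if the
# assistant message parses as JSON with a "languages" array.
# Assumption: jq is available on the host.
check_languages_reply() {
  jq -r '.choices[0].message.content' |
    jq -e '(.languages | type) == "array"' >/dev/null
}
```

Piping the `curl` output into `check_languages_reply` gives a zero exit status for a schema-shaped reply and a non-zero status when the content is missing, truncated, or not JSON.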

Source

NVIDIA

NVIDIA/dgx-spark-playbooks

