Name: Deployment
Author: NVIDIA


updated:May 21, 2026 at 22:16

package.json

"author": "NVIDIA"

"repository": "NVIDIA/Model-Optimizer"

name	deployment
description	Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
license	Apache-2.0

Deployment Skill

Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

Quick Start

Prefer scripts/deploy.sh for standard local deployments — it handles quant detection, health checks, and server lifecycle. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment.

# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop

The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing.

Decision Flow

0. Check workspace (multi-user / Slack bot)

If MODELOPT_WORKSPACE_ROOT is set, read skills/common/workspace-management.md. Before creating a new workspace, check the current session for existing model workspaces — especially if deploying a checkpoint from a prior PTQ run:

ls "$MODELOPT_WORKSPACE_ROOT/<session_id>/" 2>/dev/null

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and cd into it. The checkpoint should be in that workspace's output directory.

1. Identify the checkpoint

Determine what the user wants to deploy:

- Local quantized checkpoint (from ptq skill or manual export): look for hf_quant_config.json in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: output/, outputs/, exported_model/, or the --export_path used in the PTQ command.
- HuggingFace model hub (e.g., nvidia/Llama-3.1-8B-Instruct-FP8): use directly.
- Unquantized model: deploy as-is (BF16) or suggest quantizing first with the ptq skill.

Note: This skill expects HF-format checkpoints (from PTQ with --export_fmt hf). TRT-LLM format checkpoints should be deployed directly with TRT-LLM — see references/trtllm.md.
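The search over common output locations can be sketched in Python. This is a minimal illustration, not part of the skill: the function name `find_quantized_checkpoints` is hypothetical, and looking one directory level below each output location is an assumption. A custom --export_path from the PTQ command is not covered and has to be checked separately.

```python
from pathlib import Path

# Output locations named in this step; a custom --export_path is not known here.
COMMON_OUTPUT_DIRS = ("output", "outputs", "exported_model")


def find_quantized_checkpoints(workspace):
    """Return directories under `workspace` that contain hf_quant_config.json."""
    root = Path(workspace)
    found = []
    for name in COMMON_OUTPUT_DIRS:
        base = root / name
        if not base.is_dir():
            continue
        # Assumption: the checkpoint sits in the output dir itself or one level below.
        subdirs = sorted(p for p in base.iterdir() if p.is_dir())
        for candidate in [base, *subdirs]:
            if (candidate / "hf_quant_config.json").is_file():
                found.append(candidate)
    return found
```

An empty result means no ModelOpt-exported checkpoint was found in those locations, not that the workspace holds no model at all.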

Check the quantization format if applicable:

cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"

If not found, also check config.json for a quantization_config section with quant_method: "modelopt". If neither exists, the checkpoint is unquantized.
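The two-stage check above can be expressed as a small Python helper. This is a sketch under stated assumptions: the function name `detect_modelopt_quantization` is hypothetical, and it only reports where the quantization metadata was found without interpreting the contents of hf_quant_config.json.

```python
import json
from pathlib import Path


def detect_modelopt_quantization(checkpoint_dir):
    """Return the file that marks the checkpoint as ModelOpt-quantized, or None."""
    ckpt = Path(checkpoint_dir)

    # First choice: the dedicated ModelOpt export file.
    if (ckpt / "hf_quant_config.json").is_file():
        return "hf_quant_config.json"

    # Fallback: a quantization_config section in config.json with quant_method "modelopt".
    config_path = ckpt / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text())
        quant_cfg = config.get("quantization_config") or {}
        if quant_cfg.get("quant_method") == "modelopt":
            return "config.json"

    # Neither marker exists: treat the checkpoint as unquantized.
    return None
```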

2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | vLLM | Widest ecosystem, easy setup, OpenAI-compatible |
| Best SGLang model support | SGLang | Strong DeepSeek/Llama 4 support |
| Maximum optimization | TRT-LLM | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | TRT-LLM AutoDeploy | Only option for AutoQuant checkpoints |

Check the support matrix in references/support-matrix.md to confirm the model + format + framework combination is supported.
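One way to encode the recommendation table is a short Python function. This is only a sketch: the function name, its boolean parameters, and the order in which the conditions are tested (most restrictive first) are assumptions, and the result still needs to be confirmed against the support matrix.

```python
def recommend_framework(autoquant=False, max_throughput=False, sglang_favored_model=False):
    """Map the situations in the recommendation table to a framework name."""
    if autoquant:
        # AutoQuant / mixed-precision checkpoints have only one option in the table.
        return "TRT-LLM AutoDeploy"
    if max_throughput:
        return "TRT-LLM"
    if sglang_favored_model:
        # For example DeepSeek or Llama 4 models, per the table.
        return "SGLang"
    # General use.
    return "vLLM"
```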

3. Check the environment

Read skills/common/environment-setup.md for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.

Then check that the deployment framework is installed:

python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"

If not installed, consult references/setup.md.
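The same installation check can be done from a single Python process without importing the frameworks, which avoids their slow import paths. This is a minimal sketch; the helper name `installed_frameworks` is hypothetical, and unlike the commands above it reports presence only, not versions.

```python
import importlib.util

# Top-level module names taken from the import commands in this step.
FRAMEWORK_MODULES = {"vLLM": "vllm", "SGLang": "sglang", "TRT-LLM": "tensorrt_llm"}


def installed_frameworks(modules=None):
    """Return {label: True/False} depending on whether each module can be found."""
    modules = FRAMEWORK_MODULES if modules is None else modules
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in modules.items()
    }


if __name__ == "__main__":
    for label, present in installed_frameworks().items():
        print(f"{label}: {'installed' if present else 'not installed'}")
```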

GPU memory estimate (to determine tensor parallelism):

- BF16: params × 2 bytes (8B ≈ 16 GB)
- FP8: params × 1 byte (8B ≈ 8 GB)
- FP4: params × 0.5 bytes (8B ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single-GPU memory, use tensor parallelism (--tp <num_gpus> with scripts/deploy.sh and SGLang, --tensor-parallel-size <num_gpus> with vLLM).
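The memory estimate can be turned into a rough tensor-parallel size with a few lines of Python. This is a back-of-the-envelope sketch: the function names are hypothetical, the 4 GB default overhead is the upper end of the range above, and rounding up to a power of two is an assumption rather than a framework requirement. Real KV-cache usage depends on context length and batch size.

```python
import math

# Bytes per parameter for each format, as listed in the estimate above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def estimate_memory_gb(params_billion, fmt, overhead_gb=4.0):
    """Weights (params x bytes per param) plus a flat overhead, in GB."""
    return params_billion * BYTES_PER_PARAM[fmt] + overhead_gb


def min_tensor_parallel(params_billion, fmt, gpu_memory_gb, overhead_gb=4.0):
    """Smallest power-of-two GPU count whose combined memory fits the estimate."""
    needed = estimate_memory_gb(params_billion, fmt, overhead_gb)
    gpus = math.ceil(needed / gpu_memory_gb)
    tp = 1
    while tp < gpus:
        tp *= 2  # assumption: prefer power-of-two tensor-parallel sizes
    return tp
```

Under these assumptions, a 70B BF16 model needs about 144 GB, so `min_tensor_parallel(70, "bf16", 80)` returns 2, while the FP4 version of the same model fits on one 80 GB GPU.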

4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|----------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

Quick-start commands (for common cases):

vLLM

# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <num_gpus> \
    --host 0.0.0.0 --port 8000

For NVFP4 checkpoints, use --quantization modelopt_fp4.
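When the server is launched from a script, the choice between the two quantization values can be made in code. The sketch below builds the argument list for the vLLM command shown above; the function name `vllm_serve_command` and its parameters are hypothetical, and only flags that appear in this section are used.

```python
def vllm_serve_command(checkpoint_path, num_gpus=1, nvfp4=False, port=8000):
    """Build the argv list for the vLLM OpenAI-compatible server."""
    # NVFP4 checkpoints use modelopt_fp4; other ModelOpt checkpoints use modelopt.
    quantization = "modelopt_fp4" if nvfp4 else "modelopt"
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", str(checkpoint_path),
        "--quantization", quantization,
        "--tensor-parallel-size", str(num_gpus),
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
```

The returned list can be passed directly to `subprocess.Popen`, which avoids shell quoting problems with checkpoint paths that contain spaces.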

SGLang

python -m sglang.launch_server \
    --model-path <checkpoint_path> \
    --quantization modelopt \
    --tp <num_gpus> \
    --host 0.0.0.0 --port 8000

TRT-LLM (direct)

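from tensorrt_llm import LLM, SamplingParams

# Load the checkpoint and generate from a single test prompt
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)  # generated continuation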

TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see references/trtllm.md.

5. Verify the deployment

After the server starts, verify it's healthy:

# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model_name>",
        "prompt": "The capital of France is",
        "max_tokens": 32
    }' | python -m json.tool

All checks must pass before reporting success to the user.
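Large models can take a while to load, so a single health check right after launch often fails with a refused connection. The Python sketch below polls /health until it returns 200 or a timeout expires. The function names, the 120-second timeout, and the 5-second interval are illustrative assumptions; the `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None


def wait_until_healthy(base_url, timeout_s=120.0, interval_s=5.0,
                       probe=http_status, sleep=time.sleep):
    """Poll <base_url>/health until it returns 200; return False on timeout."""
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_until_healthy("http://localhost:8000")` can gate the remaining checks, so that /v1/models and the test generation run only once the server is up.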

6. Remote deployment (SSH/SLURM)

If a cluster config exists (~/.config/modelopt/clusters.yaml or .claude/clusters.yaml), or the user mentions running on a remote machine:

Check container registry auth — before submitting any SLURM job with a container image, verify credentials exist on the cluster per skills/common/slurm-setup.md section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). Do not submit until auth is confirmed.

Source remote utilities:

source .claude/skills/common/remote_exec.sh
remote_load_cluster
remote_check_ssh
remote_detect_env

Sync the checkpoint (only if it was produced locally):

If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with remote_run "ls <checkpoint_path>/config.json". Only sync if the checkpoint is local:
```
remote_sync_to <local_checkpoint_path> <session_id>/<model>/checkpoints/
```
Deploy based on remote environment:
- SLURM — see skills/common/slurm-setup.md for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt). After submitting, register the job and set up monitoring per the monitor skill. Get the node hostname from squeue -j $JOBID -o %N.
- Bare metal / Docker — use remote_run to start the server directly:
```
remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
```

Verify remotely:

remote_run "curl -s http://localhost:8000/health"
remote_run "curl -s http://localhost:8000/v1/models"

Report the endpoint — include the remote hostname and port so the user can connect (e.g., http://<node_hostname>:8000). For SLURM, note that the port is only reachable from within the cluster network.
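For SLURM jobs, the endpoint can be derived from the output of the squeue command mentioned above. The following Python sketch assumes the default output, in which a NODELIST header line precedes the node name, and handles only single-node jobs. The function name `endpoint_from_squeue` is hypothetical.

```python
def endpoint_from_squeue(squeue_output, port=8000):
    """Turn the output of `squeue -j <job_id> -o %N` into an endpoint URL."""
    lines = [line.strip() for line in squeue_output.splitlines() if line.strip()]
    # Drop the header line that squeue prints by default.
    nodes = [line for line in lines if line != "NODELIST"]
    if not nodes:
        raise ValueError("no node assigned yet; the job may still be pending")
    node = nodes[0]
    if "[" in node or "," in node:
        # Compressed or multi-node lists need expanding before they can be used as a host.
        raise ValueError(f"multi-node list {node!r}; expand it with 'scontrol show hostnames'")
    return f"http://{node}:{port}"
```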

For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.

Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
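The error table lends itself to a simple log scan. The Python sketch below matches the symptom strings from the table against a log file's text, for example the contents of deploy.log. The function name `suggest_fixes` is hypothetical, and the match strings are copied from the table, so the exact wording in real framework logs may differ.

```python
# (symptom substring, suggested fix) pairs taken from the error table above.
KNOWN_ERRORS = [
    ("CUDA out of memory",
     "Increase --tensor-parallel-size or use a smaller model"),
    ('quantization="modelopt" not recognized',
     "Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10"),
    ("hf_quant_config.json not found",
     "Re-export with export_hf_checkpoint(), or remove the --quantization flag"),
    ("Connection refused",
     "Wait 30-60s for large models; check logs for errors"),
    ("modelopt_fp4 not supported",
     "Check support matrix in references/support-matrix.md"),
]


def suggest_fixes(log_text):
    """Return the fixes whose symptom string appears in the log (case-insensitive)."""
    lowered = log_text.lower()
    return [fix for symptom, fix in KNOWN_ERRORS if symptom.lower() in lowered]
```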

Unsupported Models

If the model is not in the validated support matrix (references/support-matrix.md), deployment may fail due to weight key mismatches, missing architecture mappings, or quantized/unquantized layer confusion. Read references/unsupported-models.md for the iterative debug loop: run → read error → diagnose → patch framework source → re-run. For kernel-level issues, escalate to the framework team rather than attempting fixes.

Success Criteria

- Server process is running and healthy (/health returns 200)
- Model is listed at /v1/models
- Test generation produces coherent output
- Server URL and port are reported to the user
- If benchmarking was requested, throughput/latency numbers are reported
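Two of these criteria can be checked mechanically from the parsed JSON of the /v1/models and /v1/completions responses. The sketch below assumes the OpenAI-style response shapes (a `data` list of models with `id` fields, and a `choices` list with `text` fields); the function name `failed_criteria` is hypothetical. It can only confirm that the output is non-empty, so judging coherence still requires reading the text.

```python
def failed_criteria(models_response, completion_response, model_name):
    """Return a list of unmet criteria; an empty list means both checks passed."""
    failures = []

    # Criterion: the model is listed at /v1/models.
    listed_ids = [m.get("id") for m in models_response.get("data", [])]
    if model_name not in listed_ids:
        failures.append(f"model {model_name!r} not listed at /v1/models")

    # Criterion: test generation produced output (non-empty text only).
    choices = completion_response.get("choices", [])
    text = choices[0].get("text", "") if choices else ""
    if not text.strip():
        failures.append("test generation returned no text")

    return failures
```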
