| name | vllm-bench-serve |
| description | Benchmark vLLM or OpenAI-compatible serving endpoints using vllm bench serve. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs. |
vLLM Bench Serve
Benchmark vLLM or any OpenAI-compatible serving endpoint using the vllm bench serve CLI. It measures throughput, latency (time to first token, TTFT, and time per output token, TPOT), and goodput under a configurable request load.
Reference: vLLM Bench Serve Documentation
Prerequisites
- vLLM installed (or any OpenAI-compatible server running)
- A vLLM server or API endpoint already serving a model
- Python environment with vLLM for the benchmark client
Quick Start
Basic benchmark against local vLLM server (default random dataset, 1000 prompts):
vllm bench serve \
--backend openai-chat \
--host 127.0.0.1 \
--port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions
Save results to JSON:
vllm bench serve \
--backend openai-chat \
--host 127.0.0.1 \
--port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--save-result \
--result-dir ./bench-results \
--metadata "version=0.6.0" "tp=1"
Note: When using --backend openai-chat, you must specify --endpoint /v1/chat/completions (default is /v1/completions).
Core Arguments
| Argument | Default | Description |
|---|---|---|
| --backend | openai | Backend type: openai, openai-chat, openai-embeddings, vllm, vllm-pooling, vllm-rerank, etc. |
| --host | 127.0.0.1 | Server host |
| --port | 8000 | Server port |
| --base-url | - | Alternative: full base URL instead of host:port |
| --endpoint | /v1/completions | API endpoint; use /v1/chat/completions for openai-chat |
| --model | (from /v1/models) | Model name |
| --num-prompts | 1000 | Number of prompts to process |
| --request-rate | inf | Requests per second; inf = burst all at once |
| --max-concurrency | - | Max concurrent requests (caps parallelism) |
| --num-warmups | 0 | Warmup requests before measuring |
Datasets
| --dataset-name | Use Case |
|---|---|
| random | Synthetic random prompts (default) |
| sharegpt | ShareGPT conversation format; requires --dataset-path |
| sonnet | Sonnet-style prompts |
| hf | HuggingFace dataset; requires --dataset-path (dataset ID) |
| custom / custom_mm | Custom dataset; requires --dataset-path |
| prefix_repetition | Prefix repetition benchmark |
| random-mm | Random multimodal (images/videos) |
| spec_bench | Spec bench dataset |
Dataset-specific options (examples):
--dataset-name random --random-input-len 1024 --random-output-len 128
--dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150
--dataset-name hf --dataset-path "lmarena-ai/VisionArena-Chat" --hf-split test
--input-len 512 --output-len 256
Load Control
--request-rate 10
--request-rate 10 --burstiness 0.5
--ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 50
--max-concurrency 32
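To build intuition for how --request-rate and --burstiness interact, the sketch below samples inter-arrival gaps following the model the vLLM documentation describes for this flag: burstiness 1 gives Poisson arrivals, other values give gamma-distributed gaps, with values below 1 burstier and values above 1 more uniform. It is a standalone illustration using Python's random module, not the benchmark client's code; sample_intervals is a hypothetical helper.

```python
import random
import statistics


def sample_intervals(request_rate, burstiness=1.0, n=10_000, seed=0):
    """Sample n inter-arrival gaps (seconds) for a target request rate.

    Gaps follow a gamma distribution with shape=burstiness and
    scale=1/(request_rate * burstiness), so the mean gap stays at
    1/request_rate. burstiness=1 reduces to exponential gaps (Poisson).
    """
    if request_rate <= 0 or burstiness <= 0:
        raise ValueError("request_rate and burstiness must be positive")
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


if __name__ == "__main__":
    for b in (0.5, 1.0, 4.0):
        gaps = sample_intervals(request_rate=10, burstiness=b)
        print(f"burstiness={b}: mean gap={statistics.mean(gaps):.4f}s, "
              f"stdev={statistics.stdev(gaps):.4f}s")
```

The mean gap is the same in every case; only the spread changes, which is why a lower burstiness stresses queueing more at the same average rate.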
Results and Metrics
| Argument | Description |
|---|---|
| --save-result | Save benchmark results to JSON |
| --save-detailed | Include per-request TTFT, TPOT, errors in JSON |
| --append-result | Append to existing result file |
| --result-dir | Directory for result files |
| --result-filename | Custom filename (default: {label}-{request_rate}qps-{model}-{timestamp}.json) |
| --percentile-metrics | Metrics for percentiles: ttft, tpot, itl, e2el (default: ttft,tpot,itl) |
| --metric-percentiles | Percentile values, e.g. 25,50,99 (default: 99) |
| --goodput | SLO for goodput: ttft:500 tpot:50 (ms) |
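Once runs are saved with --save-result, a small script can collect the headline numbers across files. The sketch below is one way to do that, assuming each file holds a single JSON object; the key names in SUMMARY_KEYS (such as mean_ttft_ms) are assumptions that should be checked against a result file from your vLLM version, and summarize / summarize_dir are hypothetical helpers. Missing keys come back as None rather than raising.

```python
import json
from pathlib import Path

# Assumed key names; verify against a result file from your vLLM version.
SUMMARY_KEYS = [
    "request_throughput",
    "output_throughput",
    "mean_ttft_ms",
    "p99_ttft_ms",
    "mean_tpot_ms",
    "p99_tpot_ms",
]


def summarize(result_path, keys=SUMMARY_KEYS):
    """Return {key: value or None} for one saved result file."""
    data = json.loads(Path(result_path).read_text())
    return {k: data.get(k) for k in keys}


def summarize_dir(result_dir):
    """Summarize every *.json file in result_dir, sorted by filename."""
    return {p.name: summarize(p) for p in sorted(Path(result_dir).glob("*.json"))}


if __name__ == "__main__":
    for name, row in summarize_dir("./bench-results").items():
        print(name, row)
```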
Sampling Parameters (OpenAI-compatible backends)
--temperature 0.7 --top-p 0.95 --top-k 50
--frequency-penalty 0 --presence-penalty 0 --repetition-penalty 1.0
Common Workflows
1. Throughput test with random dataset (burst):
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 500 --random-input-len 512 --random-output-len 128
2. Latency test with fixed QPS:
vllm bench serve --backend openai-chat --host 127.0.0.1 --port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--request-rate 5 --num-prompts 200 \
--save-result --percentile-metrics ttft,tpot --metric-percentiles 50,99
3. Benchmark against remote API (base-url):
vllm bench serve --backend openai-chat \
--base-url "https://api.example.com" \
--endpoint /v1/chat/completions \
--model my-model \
--header "Authorization=Bearer $API_KEY"
4. Run inside Docker (when vLLM client not on host):
docker exec <container-name> vllm bench serve \
--backend openai-chat --host 127.0.0.1 --port 8000 \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random --num-prompts 100
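The workflows above differ mostly in a handful of flags, so a request-rate sweep is easy to script. The following is a minimal sketch that only builds and prints the command lines, using flags that appear in this document; build_command and the ENDPOINTS mapping are hypothetical helpers, and the mapping covers just the two backends whose endpoints are stated here.

```python
import shlex

# Endpoint each backend expects, per the Note in Quick Start.
ENDPOINTS = {
    "openai": "/v1/completions",
    "openai-chat": "/v1/chat/completions",
}


def build_command(model, backend="openai-chat", host="127.0.0.1", port=8000,
                  request_rate=None, num_prompts=200, result_dir=None):
    """Return the argv list for one `vllm bench serve` run."""
    cmd = [
        "vllm", "bench", "serve",
        "--backend", backend,
        "--host", host,
        "--port", str(port),
        "--model", model,
        "--endpoint", ENDPOINTS[backend],
        "--num-prompts", str(num_prompts),
    ]
    if request_rate is not None:  # omit to keep the default (inf, burst)
        cmd += ["--request-rate", str(request_rate)]
    if result_dir is not None:
        cmd += ["--save-result", "--result-dir", result_dir]
    return cmd


if __name__ == "__main__":
    for rate in (1, 2, 5, 10):
        cmd = build_command("Qwen/Qwen2.5-1.5B-Instruct",
                            request_rate=rate, result_dir="./bench-results")
        print(shlex.join(cmd))
        # To execute instead of print: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting issues if it is later passed to subprocess.run.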
Troubleshooting
- Connection refused: Ensure the server is running and --host/--port or --base-url are correct.
- Model not found: Pass --model explicitly or ensure /v1/models returns the model.
- URL must end with chat/completions: Use --endpoint /v1/chat/completions when --backend openai-chat.
- Rate limit / 429: Reduce --request-rate or --max-concurrency.
- Ready check: Use --ready-check-timeout-sec 60 to wait for the endpoint before benchmarking.
- SSL: Use --insecure for self-signed certificates.
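When a benchmark is launched from a script right after starting the server, a pre-flight poll of /v1/models can rule out the "connection refused" and "model not found" cases before any load is sent. The sketch below is a standalone helper built on urllib from the standard library and is separate from the CLI's own ready check; wait_for_server and its probe parameter are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url, timeout_sec=60.0, interval_sec=2.0, probe=None):
    """Poll `{base_url}/v1/models` until it answers 200 or the timeout expires."""
    probe = probe or _http_probe
    deadline = time.monotonic() + timeout_sec
    while True:
        if probe(f"{base_url}/v1/models"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_sec)


def _http_probe(url):
    """Return True if a GET on url succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


if __name__ == "__main__":
    ready = wait_for_server("http://127.0.0.1:8000", timeout_sec=60)
    print("server ready" if ready else "server did not become ready")
```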
Notes
- For embeddings/rerank benchmarks, use --backend openai-embeddings, vllm-pooling, or vllm-rerank.
- --profile requires --profiler-config on the server for vLLM profiling.
- Goodput SLOs are useful for SLA-style analysis; see the DistServe paper for details.
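As a rough illustration of what a goodput SLO such as ttft:500 tpot:50 means, the sketch below counts the requests that meet every threshold and divides by the run duration. It is a simplified model of the idea rather than the benchmark's own implementation; the goodput function and the per-request dict layout are assumptions made for the example.

```python
def goodput(requests, slos_ms, duration_s):
    """Requests per second that satisfy every SLO.

    requests: list of dicts of per-request latencies in milliseconds,
              e.g. {"ttft": 320.0, "tpot": 41.5} (hypothetical layout)
    slos_ms:  dict mapping metric name to its maximum allowed value in ms
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1 for r in requests
        if all(r[metric] <= limit for metric, limit in slos_ms.items())
    )
    return good / duration_s


if __name__ == "__main__":
    sample = [
        {"ttft": 320.0, "tpot": 41.5},  # meets both SLOs
        {"ttft": 610.0, "tpot": 38.0},  # TTFT too slow
        {"ttft": 450.0, "tpot": 55.0},  # TPOT too slow
        {"ttft": 200.0, "tpot": 20.0},  # meets both SLOs
    ]
    print(goodput(sample, {"ttft": 500, "tpot": 50}, duration_s=2.0))  # 1.0
```

Two of the four sample requests meet both limits over a 2-second window, so the goodput is 1.0 requests per second even though raw throughput is 2.0.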