vllm-bench-random-synthetic

Name: Vllm Bench Random Synthetic
Author: vllm-project

Run vLLM performance benchmark using synthetic random data to measure throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and other key performance metrics. Use when the user wants to quickly test vLLM serving performance without downloading external datasets.


stars: 76

forks: 22

updated: April 3, 2026 at 14:11

SKILL.md


vLLM Benchmark with Random Synthetic Data

Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.

When to use

User wants to quickly benchmark vLLM serving performance
User wants to measure throughput and latency metrics without downloading datasets
User wants to test a vLLM deployment with synthetic workload
User wants baseline performance numbers for a specific model

Prerequisites

vLLM must be installed (pip install vllm)
A vLLM server must be running (or can be started as part of the benchmark)
For GPU models, NVIDIA GPU with appropriate drivers must be available
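
The prerequisites above can be checked before anything is started. This is a minimal sketch, assuming a POSIX shell; the helper names `have_cmd` and `preflight` are illustrative and not part of vLLM, and `nvidia-smi` is used only as a sign that an NVIDIA driver is installed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of vLLM): report which prerequisites are present.

# have_cmd NAME: succeed if NAME resolves to a command in the current shell.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# preflight: print one line per tool; return non-zero only if vllm is missing.
preflight() {
  rc=0
  if have_cmd vllm; then
    echo "vllm: found"
  else
    echo "vllm: missing (install with: pip install vllm)"
    rc=1
  fi
  if have_cmd nvidia-smi; then
    echo "nvidia-smi: found"
  else
    echo "nvidia-smi: missing (needed only for GPU models)"
  fi
  return "$rc"
}
```

Calling `preflight` before starting the server gives an early, readable failure instead of an error partway through the benchmark.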

Quick Start

The simplest way to run the benchmark:

# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct

# Run benchmark with random synthetic data
vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 10

Note: For online benchmarks against the chat API, pair --backend openai-chat with --endpoint /v1/chat/completions. The backend and the endpoint path must agree.

Parameters

Parameter	Description	Default
`--backend`	Backend type: `vllm`, `openai`, `openai-chat`	`vllm`
`--model`	Model name (must match the server)	Required
`--endpoint`	API endpoint path	`/v1/completions` (use `/v1/chat/completions` with `openai-chat`)
`--dataset-name`	Dataset to use	`random` (synthetic)
`--num-prompts`	Number of requests to send	`1000` (the examples here set it explicitly)
`--port`	Server port	`8000`
`--max-concurrency`	Maximum concurrent requests	Unset (no limit)
`--save-result`	Save results to file	Off
`--result-dir`	Directory to save results	`./`
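
To show how the parameters in the table combine, here is a small sketch that assembles the command line from a model name and a few environment variables. The function name `bench_cmd` and the variables `NUM_PROMPTS`, `PORT`, and `MAX_CONCURRENCY` are hypothetical conveniences, not vLLM settings; only flags listed in the table are emitted, and the fallback of 10 prompts mirrors the Quick Start.

```shell
#!/bin/sh
# Hypothetical wrapper: build a `vllm bench serve` command line as text.
# Usage: bench_cmd MODEL_NAME
bench_cmd() {
  model=$1
  num_prompts=${NUM_PROMPTS:-10}   # falls back to the Quick Start value
  port=${PORT:-8000}
  cmd="vllm bench serve --backend openai-chat --model $model"
  cmd="$cmd --endpoint /v1/chat/completions --dataset-name random"
  cmd="$cmd --num-prompts $num_prompts --port $port"
  # --max-concurrency is added only when MAX_CONCURRENCY is set.
  if [ -n "${MAX_CONCURRENCY:-}" ]; then
    cmd="$cmd --max-concurrency $MAX_CONCURRENCY"
  fi
  printf '%s\n' "$cmd"
}

# Print the command for review instead of running it.
bench_cmd Qwen/Qwen2.5-1.5B-Instruct
```

Printing rather than executing keeps the sketch safe to run on a machine without a server; the printed line can be pasted into a terminal once it looks right.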

Expected Output

When successful, you will see output like:

============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
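
The report is plain `label: value` text, so individual numbers can be pulled out with standard tools. The sketch below assumes that layout stays as shown above; the helper name `metric` is illustrative. It also recomputes output token throughput as generated tokens divided by benchmark duration, using three lines copied from the sample report.

```shell
#!/bin/sh
# Hypothetical helper: read a report on stdin and print the value for one label.
# Usage: metric "Mean TTFT (ms)" < report.txt
metric() {
  awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

# Three lines copied from the sample report above.
report='Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212'

duration=$(printf '%s\n' "$report" | metric "Benchmark duration (s)")
generated=$(printf '%s\n' "$report" | metric "Total generated tokens")

# Output token throughput = generated tokens / wall-clock duration.
awk -v g="$generated" -v d="$duration" 'BEGIN { printf "%.1f tok/s\n", g / d }'
```

This prints `382.7 tok/s`, slightly below the 382.89 in the report, presumably because the report divides by the unrounded duration rather than the displayed 5.78.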

Advanced Usage

With more prompts for better statistics

vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100

Save results to file

vllm bench serve \
  --backend openai-chat \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 50 \
  --save-result \
  --result-dir ./benchmark-results/
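
After a run with --save-result, the newest file in the result directory is usually the one of interest. The file naming scheme is not described here, so this sketch makes no assumption about it and simply picks the most recently modified entry; `latest_result` is a hypothetical helper name.

```shell
#!/bin/sh
# latest_result DIR: print the path of the most recently modified entry in DIR.
# Prints nothing and returns non-zero if DIR is empty or missing.
latest_result() {
  dir=$1
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  [ -n "$newest" ] && printf '%s/%s\n' "${dir%/}" "$newest"
}

# Example (commented out so the sketch has no side effects):
# cat "$(latest_result ./benchmark-results/)"
```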

Custom port and concurrency

vllm bench serve \
  --backend openai-chat \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name random \
  --num-prompts 100 \
  --port 8001 \
  --max-concurrency 4
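
A common follow-up is to repeat the same benchmark at several concurrency limits and compare the saved results. The sketch below is one way to do that, using only flags shown in this document; the `run` and `sweep` helpers and the `DRY_RUN`, `MODEL`, and `PORT` variables are illustrative names. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical sweep over --max-concurrency values.
# DRY_RUN=1 (the default here) prints each command; DRY_RUN=0 executes it.
MODEL=${MODEL:-meta-llama/Llama-3.1-8B-Instruct}
PORT=${PORT:-8001}

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# sweep N...: one benchmark per concurrency limit N.
sweep() {
  for c in "$@"; do
    run vllm bench serve \
      --backend openai-chat \
      --model "$MODEL" \
      --endpoint /v1/chat/completions \
      --dataset-name random \
      --num-prompts 100 \
      --port "$PORT" \
      --max-concurrency "$c" \
      --save-result \
      --result-dir ./benchmark-results/
  done
}

sweep 1 2 4 8
```

Keeping --num-prompts fixed across the sweep means differences between runs come from the concurrency limit rather than from the workload size.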

Model Recommendations

For quick testing (small models, fast):

Qwen/Qwen2.5-1.5B-Instruct (recommended for quick tests)
facebook/opt-125m
facebook/opt-350m

For realistic benchmarks (medium models):

Qwen/Qwen2.5-7B-Instruct
meta-llama/Llama-3.1-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.3

Workflow

1. Check if vLLM is installed: run `vllm --version`.
2. Check if a server is already running: run `curl http://localhost:8000/health`.
3. Start the vLLM server if needed: run `vllm serve <model-name>` and wait for "Application startup complete".
4. Run the benchmark: execute `vllm bench serve` with the appropriate parameters.
5. Review the results: check the throughput and latency metrics.
6. Clean up: if the agent skill started the vLLM server (not a pre-existing one), stop it after the benchmark completes using `kill <PID>`.
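
Waiting for the server to come up is the step most easily scripted. The sketch below polls the health endpoint from the workflow above instead of watching the log for "Application startup complete"; `wait_until` and `server_healthy` are hypothetical helper names, and the 300-second limit in the example is an arbitrary choice.

```shell
#!/bin/sh
# wait_until LIMIT_SECONDS CMD...: retry CMD about once per second until it
# succeeds or LIMIT_SECONDS retries have passed. Returns 0 on success, 1 on timeout.
wait_until() {
  limit=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# server_healthy: succeed when the health endpoint returns a success status.
server_healthy() {
  curl -sf "http://localhost:${PORT:-8000}/health"
}

# Example (commented out so the sketch has no side effects):
# wait_until 300 server_healthy && echo "server is ready"
```

Large models can take several minutes to load, so the limit should be generous for anything beyond the small test models.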

Troubleshooting

Server not responding:

Check if server is running: curl http://localhost:8000/health
Verify port matches: Use --port flag if server is on different port

Model not found:

Ensure model name matches exactly between server and benchmark
Check Hugging Face access: export HF_TOKEN=<your_token> if the model is gated or private

Out of memory:

Use a smaller model (e.g., Qwen/Qwen2.5-1.5B-Instruct)
Reduce --num-prompts or --max-concurrency

Connection refused:

Server may still be starting (wait for "Application startup complete")
Check firewall or network settings
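
The first two cases above can often be told apart from the exit status of the health check itself. This sketch maps a few of curl's documented exit codes to the hints in this section; the function names `explain_curl_exit` and `probe` are illustrative, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# explain_curl_exit CODE: map a curl exit status to a likely cause.
explain_curl_exit() {
  case "$1" in
    0)  echo "server responded" ;;
    6)  echo "could not resolve host: check the hostname" ;;
    7)  echo "could not connect: server not running, still starting, or wrong port" ;;
    22) echo "HTTP error status: server is up but the endpoint returned an error" ;;
    28) echo "timed out: check firewall or network settings" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}

# probe [PORT]: run the health check against localhost and explain the result.
probe() {
  curl -sf --max-time 5 "http://localhost:${1:-8000}/health" >/dev/null 2>&1
  explain_curl_exit "$?"
}
```

For example, `probe 8001` checks a server started on the custom port used in the advanced example.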

Notes

The random dataset generates synthetic prompts automatically
Benchmark duration scales with --num-prompts
For production benchmarking, use at least 100 prompts for stable statistics
Results may vary based on hardware, model size, and system load
First run may be slower due to model loading and compilation
Important: If the agent skill starts a vLLM server for benchmarking, it must stop the server after the benchmark completes to free up resources. Do not stop pre-existing servers that were already running before the benchmark.
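
One way to honor the rule above is to record a PID only when the script itself launches the server, and to stop only that PID on exit. This is a sketch under that assumption; `start_server`, `stop_own_server`, and `SERVER_PID` are hypothetical names.

```shell
#!/bin/sh
# SERVER_PID stays empty unless this script launched the server itself.
SERVER_PID=""

# start_server CMD...: launch CMD in the background and remember its PID.
start_server() {
  "$@" &
  SERVER_PID=$!
}

# stop_own_server: stop only the process recorded by start_server.
# A pre-existing server is never touched because its PID was never recorded.
stop_own_server() {
  if [ -n "$SERVER_PID" ]; then
    kill "$SERVER_PID" 2>/dev/null || true
    wait "$SERVER_PID" 2>/dev/null || true
    SERVER_PID=""
  fi
  return 0
}

# Run the cleanup when the script exits, including after a failed benchmark.
trap stop_own_server EXIT

# Example (commented out so nothing is launched by the sketch itself):
# curl -sf http://localhost:8000/health >/dev/null \
#   || start_server vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

Because the health check runs before `start_server`, a server that was already running leaves `SERVER_PID` empty and is left alone at exit.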

related-skills.json

same repository

vllm-bench-serve.md

from "vllm-project/vllm-skills"

Benchmark vLLM or OpenAI-compatible serving endpoints using vllm bench serve. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs.

2026-04-03, 76 stars

vllm-deploy-k8s.md

from "vllm-project/vllm-skills"

Deploy vLLM to Kubernetes (K8s) with GPU support, health probes, and OpenAI-compatible API endpoint. Use this skill whenever the user wants to deploy, run, or serve vLLM on a Kubernetes cluster, including creating deployments, services, checking existing deployments, or managing vLLM on K8s.

2026-04-03, 76 stars

vllm-deploy-simple.md

from "vllm-project/vllm-skills"

Quick install and deploy vLLM, start serving with a simple LLM, and test OpenAI API.

2026-04-03, 76 stars

vllm-prefix-cache-bench.md

from "vllm-project/vllm-skills"

This is a skill for benchmarking the efficiency of automatic prefix caching in vLLM using fixed prompts, real-world datasets, or synthetic prefix/suffix patterns. Use when the user asks to benchmark prefix caching hit rate, caching efficiency, or repeated-prompt performance in vLLM.

2026-04-03, 76 stars

package.json

"author": "vllm-project"

"repository": "vllm-project/vllm-skills"


