vllm-deploy-simple
Quick install and deploy vLLM, start serving with a simple LLM, and test OpenAI API.
| name | vllm-deploy-simple |
| description | Quick install and deploy vLLM, start serving with a simple LLM, and test OpenAI API. |
A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.
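This skill provides a streamlined workflow to:
- Install vLLM, using the package that matches the detected hardware
- Start the vLLM server with a chosen model, port, and GPU memory utilization
- Test the OpenAI-compatible API with a chat completion request
- Stop, restart, or check the status of the server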
If the user did not specify the venv path or asked to deploy in the current environment, create a venv in the current folder using uv with Python 3.12. If uv is not found, make a folder in this path and use python to create the virtual environment.
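The fallback described above could be sketched as follows. This is a minimal illustration, not the logic of scripts/quickstart.sh: the function names `venv_tool` and `create_venv` are hypothetical, and the caller is assumed to pass the target path explicitly.

```shell
# Sketch only: venv_tool and create_venv are hypothetical helpers,
# not functions taken from scripts/quickstart.sh.

# Print which tool would be used to create the virtual environment.
venv_tool() {
    if command -v uv >/dev/null 2>&1; then
        echo "uv"
    else
        echo "python3"
    fi
}

# Create a virtual environment at the given path.
create_venv() {
    local venv_path="$1"
    if [ "$(venv_tool)" = "uv" ]; then
        # Preferred: uv with Python 3.12
        uv venv --python 3.12 "${venv_path}"
    else
        # Fallback: make the folder, then use Python's built-in venv module
        mkdir -p "${venv_path}"
        python3 -m venv "${venv_path}"
    fi
}
```

Calling `create_venv /path/to/venv` would then pick uv when it is on the PATH and fall back to `python3 -m venv` otherwise.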
If the user did not specify the venv path, model, or port, use the default options:
# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh
Or with custom options:
# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv
# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000
# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6
# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
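This runs the complete workflow (the default all command): install vLLM, start the server, and test the API. Each step can also be run on its own.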
Install vLLM:
scripts/quickstart.sh install
# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv
Start the server:
scripts/quickstart.sh start
# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8
Test the API:
scripts/quickstart.sh test
# Or with custom port
scripts/quickstart.sh test --port 8000
Stop the server:
scripts/quickstart.sh stop
# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv
Check server status:
scripts/quickstart.sh status
Restart the server:
scripts/quickstart.sh restart
# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8
The script supports the following command-line options:
scripts/quickstart.sh [command] [OPTIONS]
Commands:
install - Install vLLM and dependencies
start - Start the vLLM server
stop - Stop the vLLM server
test - Test the OpenAI-compatible API
status - Show server status
restart - Restart the server
all - Run complete workflow (default)
Options:
--model MODEL Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
--port PORT Port to run server on (default: 8000)
--venv VENV_PATH Virtual environment path (default: .)
--gpu_memory_utilization VRAM GPU memory utilization (default: 0.8)
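To illustrate how a command and its options might be parsed with the documented defaults, here is a minimal sketch. The function name `parse_args` and the variable names are assumptions for illustration; the actual script may be structured differently. The loop accepts options before or after the command, as in scripts/quickstart.sh --port 8080 start --venv /path/to/venv.

```shell
# Sketch only: parse_args and these variable names are hypothetical.
# The default values mirror the documented ones.
COMMAND="all"
MODEL="Qwen/Qwen2.5-1.5B-Instruct"
PORT="8000"
VENV_PATH="."
GPU_MEMORY_UTILIZATION="0.8"

parse_args() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --model|--port|--venv|--gpu_memory_utilization)
                # Every option takes exactly one value
                if [ "$#" -lt 2 ]; then
                    echo "Missing value for $1" >&2
                    return 1
                fi
                case "$1" in
                    --model) MODEL="$2" ;;
                    --port) PORT="$2" ;;
                    --venv) VENV_PATH="$2" ;;
                    --gpu_memory_utilization) GPU_MEMORY_UTILIZATION="$2" ;;
                esac
                shift 2
                ;;
            install|start|stop|test|status|restart|all)
                COMMAND="$1"
                shift
                ;;
            *)
                echo "Unknown argument: $1" >&2
                return 1
                ;;
        esac
    done
}
```

After `parse_args "$@"`, a dispatcher could branch on `COMMAND` while the remaining variables hold either the user's values or the defaults.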
The script automatically detects your hardware and installs the appropriate vLLM version:
- NVIDIA GPU: detected via the nvidia-smi command
- AMD GPU: detected via the /dev/kfd and /dev/dri devices
- Google TPU: detected via the TPU_NAME environment variable or the gcloud command
For Google TPU, the script installs vllm-tpu instead of the standard vllm package.
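The detection checks listed above could be sketched as follows. The function names, the order of the checks, and the "cpu" fallback are assumptions made for illustration; only the individual checks and the vllm-tpu package name come from the description above.

```shell
# Sketch only: detect_backend and package_for_backend are hypothetical helpers.
# The check order and the "cpu" fallback are assumptions.

detect_backend() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia"
    elif [ -e /dev/kfd ] && [ -e /dev/dri ]; then
        echo "amd"
    elif [ -n "${TPU_NAME:-}" ] || command -v gcloud >/dev/null 2>&1; then
        echo "tpu"
    else
        echo "cpu"
    fi
}

# Map a backend name to the package to install.
package_for_backend() {
    case "$1" in
        tpu) echo "vllm-tpu" ;;
        *)   echo "vllm" ;;
    esac
}
```

For example, `package_for_backend "$(detect_backend)"` prints the package name for the current machine under these assumptions.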
The test script sends a simple chat completion request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [{"role": "user", "content": "Say hello!"}],
"max_tokens": 50
}'
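One way a test step could turn this request into a pass/fail result is to check the response for a choices field, which successful chat completion responses contain. The sketch below is illustrative only: `response_ok` and `run_api_test` are hypothetical names, and the actual test command may validate the response differently.

```shell
# Sketch only: response_ok and run_api_test are hypothetical helpers.

# Succeed when the JSON read from stdin contains a "choices" field.
response_ok() {
    grep -q '"choices"'
}

# Send the chat completion request and check the response.
# Arguments: port (default 8000), model (default Qwen/Qwen2.5-1.5B-Instruct).
run_api_test() {
    local port="${1:-8000}"
    local model="${2:-Qwen/Qwen2.5-1.5B-Instruct}"
    curl -s "http://localhost:${port}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"Say hello!\"}], \"max_tokens\": 50}" \
        | response_ok
}
```

With a server running, `run_api_test 8000 && echo "API test passed"` would report success only when a completion came back. If curl cannot connect, the check reads empty input and fails.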
Virtual environment not found:
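- Ensure the path passed to --venv exists and is a valid virtual environment
- Check that the activate script is present (bin/activate on Linux/macOS or Scripts/activate on Windows)
- Create one with uv venv /path/to/venv (suggested); or with Python's built-in venv module: python3 -m venv /path/to/venv
Server won't start:
- Check whether the port is already in use: lsof -i :8000
- Check GPU availability: nvidia-smi (for NVIDIA) or rocm-smi (for AMD)
- Verify the vLLM installation: python -c "import vllm; print(vllm.__version__)"
- Check the server log: $VENV_PATH/tmp/vllm-server.log
API returns errors:
- Check the server log: cat $VENV_PATH/tmp/vllm-server.log
- Check that the server is running: scripts/quickstart.sh status
Out of memory:
- Reduce the --gpu_memory_utilization value, for example to 0.6
Wrong backend detected:
- NVIDIA: ensure nvidia-smi is in your PATH
- Google TPU: set the TPU_NAME environment variable or install gcloud
Notes:
- Server logs are written to $VENV_PATH/tmp/vllm-server.log
- The server PID is stored in $VENV_PATH/tmp/vllm-server.pid for easy management
- The script uses uv if available, otherwise falls back to pip
- Options can be given before or after the command (e.g., scripts/quickstart.sh --port 8080 start --venv /path/to/venv)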
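As an illustration of how the PID file at $VENV_PATH/tmp/vllm-server.pid can support a status check, here is a minimal sketch. The function name `server_status` and its output strings are assumptions; the status command of the actual script may report differently.

```shell
# Sketch only: server_status is a hypothetical helper. Only the PID file
# location ($VENV_PATH/tmp/vllm-server.pid) comes from this document.

# Argument: the virtual environment path.
server_status() {
    local pid_file="${1}/tmp/vllm-server.pid"
    local pid
    if [ ! -f "${pid_file}" ]; then
        echo "stopped"
        return 1
    fi
    pid="$(cat "${pid_file}")"
    # kill -0 sends no signal; it only checks that the process exists
    # and can be signalled by the current user.
    if [ -n "${pid}" ] && kill -0 "${pid}" 2>/dev/null; then
        echo "running (PID ${pid})"
    else
        echo "stopped (stale PID file)"
        return 1
    fi
}
```

A stale PID file (the recorded process no longer exists) is reported separately, since it usually means the server exited without cleaning up.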