vllm-deploy-simple

Name: Vllm Deploy Simple
Author: vllm-project

// Quick install and deploy vLLM, start serving with a simple LLM, and test OpenAI API.

stars:76

forks:22

updated:April 3, 2026 at 14:11

SKILL.md

name: vllm-deploy-simple
description: Quick install and deploy vLLM, start serving with a simple LLM, and test OpenAI API.

vLLM Simple Deployment

A simple skill to quickly install vLLM, start a server, and validate the OpenAI-compatible API.

What this skill does

This skill provides a streamlined workflow to:

Detect hardware backend (NVIDIA CUDA, AMD ROCm, Google TPU, or CPU)
Install vLLM with appropriate backend support
Start the vLLM server with configurable model and port
Test the OpenAI-compatible API endpoint
Validate the deployment is working correctly
Support virtual environment isolation

Prerequisites

Python 3.10+
A GPU (NVIDIA CUDA or AMD ROCm; recommended), a Google TPU, or a CPU
pip or uv package manager
curl (for API testing)
Virtual environment (optional but recommended)

Usage

Create a venv

If the user did not specify a venv path, or asked to deploy in the current environment, create a venv in the current folder using uv with Python 3.12. If uv is not found, create a folder in this path and use Python's built-in venv module to create the virtual environment.
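
The fallback described above can be sketched as a small helper that only prints the command it would run. This is an illustrative sketch, not code taken from scripts/quickstart.sh: the function name `venv_create_cmd` and the `.venv` default path are assumptions.

```shell
# Hypothetical helper (not part of quickstart.sh): print the command that
# would create a virtual environment at the given path.
# Prefers uv with Python 3.12; falls back to Python's built-in venv module.
venv_create_cmd() {
  venv_path="${1:-.venv}"   # assumed default; pass the path you actually want
  if command -v uv >/dev/null 2>&1; then
    printf 'uv venv --python 3.12 %s\n' "$venv_path"
  else
    printf 'python3 -m venv %s\n' "$venv_path"
  fi
}

# Show which command would be used on this machine.
venv_create_cmd ./my-venv
```

After running the printed command, the environment can be activated on Linux/macOS with `. ./my-venv/bin/activate`.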

Run the complete workflow (suggested)

If the user did not specify the venv path, model, or port, use the default options:

# Default deployment options (--venv "." --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8)
scripts/quickstart.sh

Or with custom options:

# Use custom virtual environment
scripts/quickstart.sh --venv /path/to/venv

# Use custom model and port
scripts/quickstart.sh --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000

# Use custom GPU memory utilization
scripts/quickstart.sh --gpu_memory_utilization 0.6

# Combine all options
scripts/quickstart.sh --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8

This will:

Activate the virtual environment (if specified)
Detect hardware backend (CUDA/ROCm/TPU/CPU)
Install vLLM with appropriate backend support
Start the vLLM server in the background
Wait for the server to be ready
Test the API with a sample request
Display the server status
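
The "wait for the server to be ready" step can be pictured as a polling loop. The sketch below assumes the check is an HTTP GET against the `/health` endpoint of vLLM's OpenAI-compatible server; whether quickstart.sh uses this endpoint, and the function name `wait_for_server` with its retry defaults, are assumptions for illustration.

```shell
# Hypothetical helper: poll a URL until it returns an HTTP success status
# or the attempt budget is exhausted.
# Usage: wait_for_server URL [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_server() {
  url="$1"
  max_attempts="${2:-60}"   # assumed default
  delay="${3:-2}"           # assumed default, in seconds
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f: fail on HTTP errors; -s: no progress output; --max-time: cap each try.
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      echo "Server ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt will follow.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "Server did not become ready at $url" >&2
  return 1
}
```

For the default port this would be called as `wait_for_server http://localhost:8000/health`. A generous attempt budget is useful on first run, when the model still has to be downloaded.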

Run individual commands (for step-by-step usage or troubleshooting)

Install vLLM:

scripts/quickstart.sh install
# Or with virtual environment
scripts/quickstart.sh install --venv /path/to/venv

Start the server:

scripts/quickstart.sh start
# Or with custom options
scripts/quickstart.sh start --venv /path/to/venv --model "Qwen/Qwen2.5-1.5B-Instruct" --port 8000 --gpu_memory_utilization 0.8

Test the API:

scripts/quickstart.sh test
# Or with custom port
scripts/quickstart.sh test --port 8000

Stop the server:

scripts/quickstart.sh stop
# Or with virtual environment
scripts/quickstart.sh stop --venv /path/to/venv

Check server status:

scripts/quickstart.sh status

Restart the server:

scripts/quickstart.sh restart
# Or with custom options
scripts/quickstart.sh restart --venv /path/to/venv --port 8000 --gpu_memory_utilization 0.8

Configuration

The script supports the following command-line options:

scripts/quickstart.sh [command] [OPTIONS]

Commands:
  install  - Install vLLM and dependencies
  start    - Start the vLLM server
  stop     - Stop the vLLM server
  test     - Test the OpenAI-compatible API
  status   - Show server status
  restart  - Restart the server
  all      - Run complete workflow (default)

Options:
  --model MODEL                 Model to use (default: Qwen/Qwen2.5-1.5B-Instruct)
  --port PORT                   Port to run server on (default: 8000)
  --venv VENV_PATH              Virtual environment path (default: .)
  --gpu_memory_utilization VRAM GPU memory utilization (default: 0.8)
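
One way to accept a command and options in any order is a single loop over the arguments. The following is a minimal sketch of that idea using the commands, options, and defaults listed above; the function name `parse_args` and the variable names are hypothetical, and the real script may parse its arguments differently.

```shell
# Hypothetical parser: sets COMMAND, MODEL, PORT, VENV_PATH, GPU_MEM_UTIL.
parse_args() {
  # Defaults taken from the option list in this document.
  COMMAND="all"
  MODEL="Qwen/Qwen2.5-1.5B-Instruct"
  PORT="8000"
  VENV_PATH="."
  GPU_MEM_UTIL="0.8"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --model|--port|--venv|--gpu_memory_utilization)
        # Every option takes exactly one value.
        [ "$#" -ge 2 ] || { echo "Missing value for $1" >&2; return 1; }
        case "$1" in
          --model) MODEL="$2" ;;
          --port) PORT="$2" ;;
          --venv) VENV_PATH="$2" ;;
          --gpu_memory_utilization) GPU_MEM_UTIL="$2" ;;
        esac
        shift 2
        ;;
      install|start|stop|test|status|restart|all)
        # A bare word is the command, wherever it appears.
        COMMAND="$1"
        shift
        ;;
      *)
        echo "Unknown argument: $1" >&2
        return 1
        ;;
    esac
  done
}

parse_args --port 8080 start --venv /path/to/venv
echo "command=$COMMAND port=$PORT venv=$VENV_PATH"
```

Because the command is matched by name rather than by position, `--port 8080 start` and `start --port 8080` produce the same result.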

Hardware Backend Detection

The script automatically detects your hardware and installs the appropriate vLLM version:

NVIDIA CUDA: Detected via nvidia-smi command
AMD ROCm: Detected via /dev/kfd and /dev/dri devices
Google TPU: Detected via TPU_NAME environment variable or gcloud command
CPU: Fallback if no GPU/TPU detected

For Google TPU, the script installs vllm-tpu instead of the standard vllm package.
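
The detection rules above could be expressed roughly as follows. This is a sketch under assumptions: the function names and the order of the checks are illustrative, the gcloud-based TPU check is omitted because the text does not say how gcloud is queried, and only the vllm versus vllm-tpu distinction stated above is modeled (ROCm and CPU installs may need additional steps in practice).

```shell
# Hypothetical helper: print one of cuda, rocm, tpu, cpu.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"                       # NVIDIA: nvidia-smi is on PATH
  elif [ -e /dev/kfd ] && [ -d /dev/dri ]; then
    echo "rocm"                       # AMD: ROCm device nodes are present
  elif [ -n "${TPU_NAME:-}" ]; then
    echo "tpu"                        # Google TPU: TPU_NAME is set
  else
    echo "cpu"                        # fallback
  fi
}

# Hypothetical helper: map a backend name to the package named in this document.
package_for_backend() {
  case "$1" in
    tpu) echo "vllm-tpu" ;;
    *)   echo "vllm" ;;
  esac
}

backend="$(detect_backend)"
echo "backend=$backend package=$(package_for_backend "$backend")"
```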

API Testing

The test script sends a simple chat completion request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Say hello!"}],
    "max_tokens": 50
  }'
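
When the server runs with a non-default model or port, the same request can be parameterized. The helpers below are a hypothetical sketch (the names `chat_payload` and `send_chat` are not from the script), and they assume the model name and prompt contain no characters that need JSON escaping, such as double quotes or backslashes.

```shell
# Hypothetical helper: build the JSON body for a chat completion request.
# Usage: chat_payload MODEL PROMPT [MAX_TOKENS]
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %s}\n' \
    "$1" "$2" "${3:-50}"
}

# Hypothetical helper: send the request to a local server.
# Usage: send_chat PORT MODEL PROMPT
send_chat() {
  curl -s "http://localhost:$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$2" "$3")"
}

# Print the payload that would be sent for the default model.
chat_payload "Qwen/Qwen2.5-1.5B-Instruct" "Say hello!"
```

With a server running, `send_chat 8000 "Qwen/Qwen2.5-1.5B-Instruct" "Say hello!"` would issue the same request as the curl command above. The `model` field must match the model the server was started with.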

Troubleshooting

Virtual environment not found:

Ensure the path provided with --venv exists and is a valid virtual environment
Check that the activation script exists (bin/activate on Linux/macOS or Scripts/activate on Windows)
Install uv if it is missing, then create a new virtual environment with uv: uv venv /path/to/venv (suggested); or with Python's built-in venv module: python3 -m venv /path/to/venv

Server won't start:

Check if the port is already in use: lsof -i :8000
Verify GPU availability: nvidia-smi (for NVIDIA) or rocm-smi (for AMD)
Check vLLM installation: python -c "import vllm; print(vllm.__version__)"
Review server logs at $VENV_PATH/tmp/vllm-server.log

API returns errors:

Wait a few seconds for the model to load
Check server logs: cat $VENV_PATH/tmp/vllm-server.log
Verify the server is running: scripts/quickstart.sh status

Out of memory:

Use a smaller model (e.g., Qwen2.5-0.5B-Instruct)
Reduce the --gpu_memory_utilization value (for example, from 0.8 to 0.6)
Close other GPU-intensive applications

Wrong backend detected:

For NVIDIA: Ensure nvidia-smi is in your PATH
For AMD: Check that ROCm drivers are properly installed
For TPU: Set TPU_NAME environment variable or install gcloud

Notes

The server runs in the background and logs to $VENV_PATH/tmp/vllm-server.log
The PID is stored in $VENV_PATH/tmp/vllm-server.pid for easy management
The first run downloads the model (about 3 GB for Qwen2.5-1.5B-Instruct)
Subsequent runs will use the cached model
The script automatically detects and uses uv if available, otherwise falls back to pip
Virtual environment support allows isolation from system Python packages
Arguments can be specified in any order (e.g., scripts/quickstart.sh --port 8080 start --venv /path/to/venv)
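
The PID file mentioned above makes a liveness check straightforward. The following sketch shows one common approach using `kill -0`; the function name `server_running` is hypothetical, and the status command in quickstart.sh may be implemented differently.

```shell
# Hypothetical helper: succeed if the PID recorded in the file refers to a
# process that currently exists.
# Usage: server_running PID_FILE
server_running() {
  pid_file="$1"
  [ -f "$pid_file" ] || return 1
  pid="$(cat "$pid_file")"
  [ -n "$pid" ] || return 1
  # kill -0 sends no signal; it only checks that the process can be signaled.
  kill -0 "$pid" 2>/dev/null
}

# VENV_PATH defaults to "." here, matching the --venv default in this document.
if server_running "${VENV_PATH:-.}/tmp/vllm-server.pid"; then
  echo "vLLM server is running"
else
  echo "vLLM server is not running"
fi
```

A stale PID file (left behind after a crash) is reported as not running, because the recorded process no longer exists.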

related-skills.json

same repository

vllm-bench-random-synthetic.md

from "vllm-project/vllm-skills"

Run vLLM performance benchmark using synthetic random data to measure throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and other key performance metrics. Use when the user wants to quickly test vLLM serving performance without downloading external datasets.

Updated 2026-04-03, 76 stars

vllm-bench-serve.md

from "vllm-project/vllm-skills"

Benchmark vLLM or OpenAI-compatible serving endpoints using vllm bench serve. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs.

Updated 2026-04-03, 76 stars

vllm-deploy-k8s.md

from "vllm-project/vllm-skills"

Deploy vLLM to Kubernetes (K8s) with GPU support, health probes, and OpenAI-compatible API endpoint. Use this skill whenever the user wants to deploy, run, or serve vLLM on a Kubernetes cluster, including creating deployments, services, checking existing deployments, or managing vLLM on K8s.

Updated 2026-04-03, 76 stars

vllm-prefix-cache-bench.md

from "vllm-project/vllm-skills"

This is a skill for benchmarking the efficiency of automatic prefix caching in vLLM using fixed prompts, real-world datasets, or synthetic prefix/suffix patterns. Use when the user asks to benchmark prefix caching hit rate, caching efficiency, or repeated-prompt performance in vLLM.

Updated 2026-04-03, 76 stars

package.json

"author": "vllm-project"

"repository": "vllm-project/vllm-skills"

