vllm-deploy-k8s

Name: Vllm Deploy K8s
Author: vllm-project

// Deploy vLLM to Kubernetes (K8s) with GPU support, health probes, and OpenAI-compatible API endpoint. Use this skill whenever the user wants to deploy, run, or serve vLLM on a Kubernetes cluster, including creating deployments, services, checking existing deployments, or managing vLLM on K8s.

stars: 76

forks: 22

updated: April 3, 2026 at 14:11

File explorer

3 files

SKILL.md

read-only

related-skills.json

same repository

vllm-bench-random-synthetic.md

from "vllm-project/vllm-skills"

Run vLLM performance benchmark using synthetic random data to measure throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and other key performance metrics. Use when the user wants to quickly test vLLM serving performance without downloading external datasets.

2026-04-03 · 76 stars

vllm-bench-serve.md

from "vllm-project/vllm-skills"

Benchmark vLLM or OpenAI-compatible serving endpoints using vllm bench serve. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs.

2026-04-03 · 76 stars

vllm-deploy-simple.md

from "vllm-project/vllm-skills"

Quick install and deploy vLLM, start serving with a simple LLM, and test OpenAI API.

2026-04-03 · 76 stars

vllm-prefix-cache-bench.md

from "vllm-project/vllm-skills"

This is a skill for benchmarking the efficiency of automatic prefix caching in vLLM using fixed prompts, real-world datasets, or synthetic prefix/suffix patterns. Use when the user asks to benchmark prefix caching hit rate, caching efficiency, or repeated-prompt performance in vLLM.

2026-04-03 · 76 stars

package.json

"author": "vllm-project"

"repository": "vllm-project/vllm-skills"

$ useful --forSOC

Network and Computer Systems Administrators · Computer and Mathematical · 15-1244 · L4

name: vllm-deploy-k8s
description: Deploy vLLM to Kubernetes (K8s) with GPU support, health probes, and OpenAI-compatible API endpoint. Use this skill whenever the user wants to deploy, run, or serve vLLM on a Kubernetes cluster, including creating deployments, services, checking existing deployments, or managing vLLM on K8s.

vLLM Kubernetes Deployment

A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.

What this skill does

Deploy vLLM as a Kubernetes Deployment + Service with NVIDIA GPU support
Check if a vLLM deployment already exists before deploying
Check if the Hugging Face token secret exists, and ask the user for their token if not
Use the vllm/vllm-openai:latest image by default (user can specify a different version)
Provide sensible default configuration that users can customize (model, replicas, GPU count, extra vLLM flags, etc.)

Prerequisites

kubectl configured with access to a Kubernetes cluster
NVIDIA GPU Operator or device plugin installed on cluster nodes
Hugging Face token (required for gated models like Llama, optional for public models)

Deployment Steps

Step 1: Check HF token secret

Before deploying, check if the hf-token Kubernetes secret exists in the target namespace:

kubectl get secret hf-token -n <namespace>

If the secret exists: proceed to Step 2.
If the secret does not exist: ask the user to provide their Hugging Face token, then create the secret:

kubectl create secret generic hf-token --from-literal=HF_TOKEN="<user-provided-token>" -n <namespace>

This is required for gated models (e.g., meta-llama/Meta-Llama-3.1-8B). For public models, the secret is optional but recommended to avoid rate limits.
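The check-then-create flow in Step 1 can be sketched as one shell function. The helper name ensure_hf_secret is hypothetical (it is not part of the skill), and the sketch assumes kubectl already points at the intended cluster.

```shell
# Sketch only: ensure_hf_secret is a hypothetical helper, not part of the skill.
# Usage: ensure_hf_secret <namespace> <token>
ensure_hf_secret() {
  local namespace="$1" token="$2"

  # Step 1a: nothing to do if the secret is already present.
  if kubectl get secret hf-token -n "$namespace" >/dev/null 2>&1; then
    echo "hf-token already exists in $namespace"
    return 0
  fi

  # Step 1b: the secret is missing, so a token must be supplied by the user.
  if [ -z "$token" ]; then
    echo "hf-token is missing and no token was provided" >&2
    return 1
  fi

  kubectl create secret generic hf-token \
    --from-literal=HF_TOKEN="$token" -n "$namespace"
}
```

Passing the token as an argument keeps the function non-interactive. An agent would still ask the user for the token before calling it.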

Step 2: Check if deployment already exists

Before applying, check if a vLLM deployment already exists:

kubectl get deployment vllm -n <namespace>

If it exists: inform the user that the deployment already exists. Show the current image and status. Ask the user if they want to update it or skip.
If it does not exist: proceed to deploy.
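One way to gather the "current image and status" mentioned in Step 2 is with kubectl's jsonpath output. This is a minimal sketch: the helper name describe_vllm_deployment and its output format are assumptions, and it reads only the first container of the Deployment.

```shell
# Sketch only: describe_vllm_deployment is a hypothetical helper.
# Usage: describe_vllm_deployment <namespace>
describe_vllm_deployment() {
  local namespace="$1"

  if ! kubectl get deployment vllm -n "$namespace" >/dev/null 2>&1; then
    echo "not-found"
    return 0
  fi

  local image ready desired
  image=$(kubectl get deployment vllm -n "$namespace" \
    -o jsonpath='{.spec.template.spec.containers[0].image}')
  ready=$(kubectl get deployment vllm -n "$namespace" \
    -o jsonpath='{.status.readyReplicas}')
  desired=$(kubectl get deployment vllm -n "$namespace" \
    -o jsonpath='{.spec.replicas}')

  # readyReplicas is omitted from the status while no replica is ready,
  # so an empty value is reported as 0.
  echo "exists image=$image ready=${ready:-0}/$desired"
}
```

A result of not-found means it is safe to proceed to Step 3. Any other result should be shown to the user before asking whether to update or skip.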

Step 3: Deploy

Apply the template YAML files to deploy vLLM:

kubectl apply -f templates/vllm-service.yaml -n <namespace>
kubectl apply -f templates/vllm-deployment.yaml -n <namespace>

Step 4: Wait and verify

Wait for the deployment to roll out:

kubectl rollout status deployment/vllm -n <namespace> --timeout=600s

Verify the pod is running and ready:

kubectl get pods -n <namespace> -l app=vllm

Confirm the pod shows READY 1/1 and STATUS Running. If the pod is not ready yet, wait and check again. If it's in CrashLoopBackOff or Error, check the logs with kubectl logs -n <namespace> -l app=vllm.
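The wait-then-inspect logic of Step 4 could be combined as follows. The helper name wait_for_vllm is hypothetical, and the choice to print the last 50 log lines on failure is an assumption made for this sketch.

```shell
# Sketch only: wait_for_vllm is a hypothetical helper.
# Usage: wait_for_vllm <namespace> [timeout]   (timeout defaults to 600s)
wait_for_vllm() {
  local namespace="$1" timeout="${2:-600s}"

  # rollout status exits non-zero if the rollout does not finish in time.
  if kubectl rollout status deployment/vllm -n "$namespace" --timeout="$timeout"; then
    kubectl get pods -n "$namespace" -l app=vllm
    return 0
  fi

  echo "Rollout did not complete; recent pod logs follow:" >&2
  kubectl logs -n "$namespace" -l app=vllm --tail=50 >&2
  return 1
}
```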

Step 5: Print deployment summary

Once the pod is ready, print a summary message to the user in this format (replace placeholders with actual values):

🎉 **vLLM Deployment Successful!**

| Resource | Name | Status |
|----------|------|--------|
| Deployment | <deployment-name> | <ready>/<total> Ready |
| Service | <service-name> | ClusterIP:<port> |
| Pod | <pod-name> | Running |
| Image | <image> | |
| Model | <model> | |

&nbsp;

**To test the API, run these two commands in your terminal:**

**1. Open a port-forward** (this connects your local port <port> to the vLLM service inside the cluster):

kubectl port-forward svc/vllm-svc <port>:<port> -n <namespace>

**2. In a separate terminal**, send a test request to the OpenAI-compatible API:

curl -s http://localhost:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}' | python3 -m json.tool

If everything is working, you'll get a JSON response with the model's reply.

Default Configuration

The templates use the following defaults:

| Parameter | Default Value |
|-----------|---------------|
| Image | `vllm/vllm-openai:latest` |
| Model | `Qwen/Qwen2.5-1.5B-Instruct` |
| Port | `8000` |
| Replicas | `1` |
| GPU count | `1` |
| GPU memory utilization | `0.85` |
| Tensor parallel size | `1` |
| CPU request / limit | `12` / `128` |
| Memory request / limit | `100Gi` / `400Gi` |
| Shared memory (dshm) | `80Gi` |
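To make these defaults concrete, the function below prints a Deployment manifest that uses them. It is an illustrative sketch, not the contents of templates/vllm-deployment.yaml: the helper name render_vllm_deployment, the /health probe path, the probe timings, and the overall layout are assumptions.

```shell
# Sketch only: render_vllm_deployment is a hypothetical helper, and the manifest
# layout is assumed rather than copied from the skill's templates.
# Usage: render_vllm_deployment [model] [port]
render_vllm_deployment() {
  local model="${1:-Qwen/Qwen2.5-1.5B-Instruct}" port="${2:-8000}"
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["/bin/sh", "-c"]
          args:
            - >-
              vllm serve $model --port $port
              --gpu-memory-utilization 0.85 --tensor-parallel-size 1
          ports:
            - containerPort: $port
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
                  optional: true
          resources:
            requests: {cpu: "12", memory: 100Gi, nvidia.com/gpu: "1"}
            limits: {cpu: "128", memory: 400Gi, nvidia.com/gpu: "1"}
          startupProbe:
            httpGet: {path: /health, port: $port}
            periodSeconds: 10
            failureThreshold: 60
          readinessProbe:
            httpGet: {path: /health, port: $port}
          livenessProbe:
            httpGet: {path: /health, port: $port}
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 80Gi
EOF
}
```

The output can be validated without touching the cluster by piping it to kubectl apply --dry-run=client -f - -n <namespace>.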

Customization

When the user requests changes, modify the template YAML files before applying. The following can be customized:

Image version: Change image: vllm/vllm-openai:<version> in templates/vllm-deployment.yaml (default: latest). Use a specific version tag like v0.17.1 if the user requests it.
Model: Change the model name in the vllm serve command inside the Deployment args.
Extra vLLM flags: Append additional flags to the vllm serve command in the Deployment args (e.g., --max-model-len 4096, --kv-cache-dtype fp8, --enforce-eager, --generation-config vllm).
Replicas: Change replicas: in the Deployment spec.
GPU count: Change nvidia.com/gpu in both requests and limits under resources.
Tensor parallel size: Change the --tensor-parallel-size flag in the vllm serve command so that it matches the GPU count.
CPU/Memory resources: Change cpu and memory values under requests and limits.
Port: Change containerPort in the Deployment, port/targetPort in the Service, the port in all health probes (liveness, readiness, startup), AND add --port <port> to the vllm serve command in args. All four must match.
Namespace: Apply to a specific namespace using -n <namespace>.
Shared memory size: Change the sizeLimit of the dshm emptyDir volume.

Edit the template files using the Edit tool, then apply the modified templates.
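For scripted use outside an agent's Edit tool, the image-version change could be done with sed. This is a sketch under the assumption that the template contains a line of the form image: vllm/vllm-openai:<tag>; the helper name set_vllm_image_tag is hypothetical.

```shell
# Sketch only: set_vllm_image_tag is a hypothetical helper. It rewrites the tag
# on the vllm/vllm-openai image line and leaves every other line untouched.
# Usage: set_vllm_image_tag <template-file> <tag>
set_vllm_image_tag() {
  local file="$1" tag="$2" tmp
  tmp=$(mktemp) || return 1

  # Write to a temporary file first so a failed sed never truncates the template.
  if sed -E "s|(image:[[:space:]]*vllm/vllm-openai):[^[:space:]]+|\1:$tag|" \
      "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

For example, set_vllm_image_tag templates/vllm-deployment.yaml v0.17.1 pins the version mentioned above. After a port change, running grep -n on both templates for the new port number is a quick way to confirm that the container port, Service, probes, and --port flag all agree.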

Status Check

kubectl get deployment,svc,pods -n <namespace> -l app=vllm

Cleanup

When the user asks to clean up or delete the vLLM deployment, run the following steps:

Delete the Deployment and Service:

kubectl delete -f templates/vllm-deployment.yaml -n <namespace>
kubectl delete -f templates/vllm-service.yaml -n <namespace>

Ask the user if they also want to delete the HF token secret. If yes:

kubectl delete secret hf-token -n <namespace>

Verify everything is cleaned up:

kubectl get deployment,svc,pods -n <namespace> -l app=vllm

Print a summary message to the user:

vLLM deployment has been cleaned up from namespace <namespace>.
Deleted: Deployment/vllm, Service/vllm-svc
HF token secret: <kept/deleted>
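The cleanup steps above could be wrapped in a single function. The helper name cleanup_vllm is hypothetical, and adding --ignore-not-found (so that a repeated cleanup does not fail) is a choice made for this sketch rather than something the skill prescribes.

```shell
# Sketch only: cleanup_vllm is a hypothetical helper.
# Usage: cleanup_vllm <namespace> [yes|no]   (second argument: delete hf-token?)
cleanup_vllm() {
  local namespace="$1" delete_secret="${2:-no}" secret_state="kept"

  kubectl delete -f templates/vllm-deployment.yaml -n "$namespace" --ignore-not-found
  kubectl delete -f templates/vllm-service.yaml -n "$namespace" --ignore-not-found

  # Only remove the secret when the user explicitly agreed to it.
  if [ "$delete_secret" = "yes" ]; then
    if kubectl delete secret hf-token -n "$namespace" --ignore-not-found; then
      secret_state="deleted"
    fi
  fi

  # Show whatever is left so the user can confirm the namespace is clean.
  kubectl get deployment,svc,pods -n "$namespace" -l app=vllm

  echo "vLLM deployment has been cleaned up from namespace $namespace."
  echo "Deleted: Deployment/vllm, Service/vllm-svc"
  echo "HF token secret: $secret_state"
}
```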

Troubleshooting

Pod stuck in Pending: Usually no node can satisfy the GPU request. Check kubectl describe pod <pod-name> -n <namespace> for scheduling errors. Ensure the NVIDIA GPU Operator or device plugin is installed.
Pod OOMKilled: Increase the memory limits in the Deployment, or use a smaller model.
ImagePullBackOff: Check the image name and tag. Verify the node has access to Docker Hub or the container registry in use.
Startup probe failures (CrashLoopBackOff): The model download may be slow. Check logs with kubectl logs <pod-name> -n <namespace>. Ensure the hf-token secret exists for gated models. Increase failureThreshold on the startup probe if needed.
HF_TOKEN not working: Verify the secret exists with kubectl get secret hf-token -n <namespace>, and check that the token is valid.
GPU not detected in container: Ensure the nvidia.com/gpu resource is requested and the NVIDIA device plugin is running on the node.
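The symptom-to-action list above can be encoded as a lookup so a script can print the matching hint. This is a sketch: the helper name vllm_pod_hint and the shortened hint wording are assumptions, and the keys are the pod states named in this section.

```shell
# Sketch only: vllm_pod_hint is a hypothetical helper that maps a pod status or
# reason string to a condensed version of the matching troubleshooting hint.
# Usage: vllm_pod_hint <status-or-reason>
vllm_pod_hint() {
  case "$1" in
    Pending)
      echo "Check scheduling errors with kubectl describe pod; confirm GPU nodes and the NVIDIA device plugin." ;;
    OOMKilled)
      echo "Increase the memory limits in the Deployment, or use a smaller model." ;;
    ImagePullBackOff)
      echo "Check the image name and tag, and registry access from the node." ;;
    CrashLoopBackOff)
      echo "Check kubectl logs; confirm the hf-token secret and consider a higher startup failureThreshold." ;;
    *)
      echo "No specific hint; inspect kubectl describe pod and kubectl logs." ;;
  esac
}
```

The status string can be read from the STATUS column of kubectl get pods -n <namespace> -l app=vllm.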
