| name | model-serving-kubernetes |
| description | Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving. |
| license | MIT |
| metadata | {"author":"devops-skills","version":"1.0"} |
Model Serving on Kubernetes
Production ML model serving with KServe and Triton — canary deployments, autoscaling, and GPU-aware scheduling.
When to Use This Skill
Use this skill when:
- Serving scikit-learn, PyTorch, TensorFlow, or ONNX models at scale
- Implementing canary deployments and A/B testing for ML models
- Autoscaling inference pods based on request rate or GPU metrics
- Deploying LLMs with Triton or KServe on Kubernetes
- Managing multiple model versions with traffic splitting
Prerequisites
- Kubernetes 1.28+ with GPU nodes
- KServe installed (or Triton standalone)
kubectl and helm configured
- NVIDIA GPU Operator installed on cluster
KServe Installation
helm repo add kserve https://kserve.github.io/helm-charts
helm repo update
helm install kserve kserve/kserve \
--namespace kserve \
--create-namespace \
--set kserve.controller.gateway.ingressGateway.className=nginx
kubectl get pods -n kserve
kubectl get crd | grep kserve
Basic InferenceService (KServe)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: sklearn-iris
namespace: models
spec:
predictor:
sklearn:
storageUri: gs://kfserving-examples/models/sklearn/1.0/model
resources:
requests:
cpu: "1"
memory: 2Gi
limits:
cpu: "2"
memory: 4Gi
kubectl apply -f inference-service.yaml
kubectl get inferenceservice sklearn-iris -n models
curl -X POST http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict \
-H "Content-Type: application/json" \
-d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
GPU-Enabled LLM InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-3-8b
namespace: models
annotations:
serving.kserve.io/enable-prometheus-scraping: "true"
spec:
predictor:
containers:
- name: vllm-container
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--tensor-parallel-size"
- "1"
- "--gpu-memory-utilization"
- "0.90"
ports:
- containerPort: 8080
protocol: TCP
resources:
requests:
nvidia.com/gpu: "1"
memory: "20Gi"
cpu: "4"
limits:
nvidia.com/gpu: "1"
memory: "24Gi"
cpu: "8"
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
nodeSelector:
nvidia.com/gpu.present: "true"
transformer:
containers:
- name: kserve-container
image: kserve/kserve-transformer:latest
Canary Deployment (Traffic Splitting)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-3-8b
namespace: models
spec:
predictor:
canaryTrafficPercent: 20
containers:
- name: vllm-container
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct-v2"
resources:
limits:
nvidia.com/gpu: "1"
kubectl patch inferenceservice llama-3-8b -n models \
--type='json' \
-p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":50}]'
kubectl patch inferenceservice llama-3-8b -n models \
--type='json' \
-p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'
Autoscaling with KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: llama-scaler
namespace: models
spec:
scaleTargetRef:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
name: llama-3-8b
minReplicaCount: 1
maxReplicaCount: 5
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-server.monitoring:9090
metricName: kserve_request_count
threshold: "10"
query: |
sum(rate(kserve_request_count_total{namespace="models",
service="llama-3-8b"}[1m]))
NVIDIA Triton Inference Server
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-server
namespace: models
spec:
replicas: 2
selector:
matchLabels:
app: triton
template:
metadata:
labels:
app: triton
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.05-py3
args:
- "tritonserver"
- "--model-store=s3://my-model-store/models"
- "--model-control-mode=poll"
- "--repository-poll-secs=30"
- "--metrics-port=8002"
ports:
- containerPort: 8000
- containerPort: 8001
- containerPort: 8002
resources:
limits:
nvidia.com/gpu: "1"
readinessProbe:
httpGet:
path: /v2/health/ready
port: 8000
initialDelaySeconds: 30
Triton Model Repository Structure
s3://my-model-store/models/
├── text-classifier/
│ ├── config.pbtxt
│ ├── 1/
│ │ └── model.onnx
│ └── 2/
│ └── model.onnx # new version; auto-loaded
├── embedding-model/
│ ├── config.pbtxt
│ └── 1/
│ └── model.onnx
# config.pbtxt for ONNX model
name: "text-classifier"
backend: "onnxruntime"
max_batch_size: 64
dynamic_batching {
preferred_batch_size: [16, 32]
max_queue_delay_microseconds: 1000
}
input [
{ name: "input_ids" data_type: TYPE_INT64 dims: [-1] }
{ name: "attention_mask" data_type: TYPE_INT64 dims: [-1] }
]
output [
{ name: "logits" data_type: TYPE_FP32 dims: [-1] }
]
instance_group [
{ kind: KIND_GPU count: 2 } # 2 model instances on GPU
]
Model Management Commands
curl http://triton:8000/v2/models
curl -X POST http://triton:8000/v2/repository/models/text-classifier/load
curl -X POST http://triton:8000/v2/repository/models/text-classifier/unload
kubectl rollout status deployment/llama-3-8b-predictor -n models
kubectl get inferenceservice llama-3-8b -n models -w
Common Issues
| Issue | Cause | Fix |
|---|
InferenceService not ready | Model loading or OOM | Check predictor pod logs; increase memory limits |
| Canary stuck at 0% | KNative routing issue | Check kubectl get ksvc -n models |
| Triton missing model | S3 permissions or path | Verify IAM role; check --model-store path |
| Low GPU utilization | Dynamic batching off | Enable dynamic_batching in Triton config |
| Autoscaler not triggering | Prometheus query wrong | Test query in Prometheus UI |
Best Practices
- Use canary deployments for all model updates — roll back in seconds if metrics degrade.
- Enable Triton dynamic batching — it can increase GPU throughput 5–10× for small models.
- Store models in S3/GCS with versioned paths (
s3://bucket/model/v1/, v2/).
- Pin GPU node selectors to prevent model pods landing on CPU-only nodes.
- Monitor p99 latency and error rates per model version during canary rollouts.
Related Skills