Run any Skill in Manus with one click

model-serving-kubernetes

Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.

Run Skill in Manus

Stars30

Forks4

UpdatedMay 22, 2026 at 13:02

Source

BagelHole

BagelHole/DevOps-Security-Agent-Skills

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

Useful forSOC

Computer Network ArchitectsComputer and Mathematical Occupations15-1241L4

SKILL.md

readonly

More from this repository

same repository

ai-pipeline-orchestration

BagelHole/DevOps-Security-Agent-Skills

Orchestrate AI/ML pipelines for data ingestion, model training, batch inference, and RAG indexing using Prefect, Airflow, or Dagster. Build reliable, observable, and retriable workflows for production AI systems.

2026-05-2230

llm-caching

BagelHole/DevOps-Security-Agent-Skills

Implement multi-layer LLM caching with exact match, semantic similarity, and provider-side prompt caching. Reduce API costs by 30–70%, cut latency, and improve throughput using Redis, GPTCache, and provider caching APIs.

2026-05-2230

llm-cost-optimization

BagelHole/DevOps-Security-Agent-Skills

Reduce LLM API and infrastructure costs through model selection, prompt caching, batching, caching, quantization, and self-hosting strategies. Track spend by team and model, set budgets, and implement cost-aware routing.

2026-05-2230

sre-dashboards

BagelHole/DevOps-Security-Agent-Skills

Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.

2026-05-2230

vector-database-ops

BagelHole/DevOps-Security-Agent-Skills

Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.

2026-05-2230

openclaw-security-hardening

BagelHole/DevOps-Security-Agent-Skills

Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.

2026-05-2230

name	model-serving-kubernetes
description	Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.
license	MIT
metadata	{"author":"devops-skills","version":"1.0"}

Model Serving on Kubernetes

Production ML model serving with KServe and Triton — canary deployments, autoscaling, and GPU-aware scheduling.

When to Use This Skill

Use this skill when:

Serving scikit-learn, PyTorch, TensorFlow, or ONNX models at scale
Implementing canary deployments and A/B testing for ML models
Autoscaling inference pods based on request rate or GPU metrics
Deploying LLMs with Triton or KServe on Kubernetes
Managing multiple model versions with traffic splitting

Prerequisites

Kubernetes 1.28+ with GPU nodes
KServe installed (or Triton standalone)
kubectl and helm configured
NVIDIA GPU Operator installed on cluster

KServe Installation

# Install KServe with Helm
helm repo add kserve https://kserve.github.io/helm-charts
helm repo update

helm install kserve kserve/kserve \
  --namespace kserve \
  --create-namespace \
  --set kserve.controller.gateway.ingressGateway.className=nginx

# Verify
kubectl get pods -n kserve
kubectl get crd | grep kserve

Basic InferenceService (KServe)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: models
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi

kubectl apply -f inference-service.yaml

# Get inference service URL
kubectl get inferenceservice sklearn-iris -n models
# NAME           URL                                          READY   ...
# sklearn-iris   http://sklearn-iris.models.example.com       True

# Test prediction
curl -X POST http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'

GPU-Enabled LLM InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct"
      - "--tensor-parallel-size"
      - "1"
      - "--gpu-memory-utilization"
      - "0.90"
      ports:
      - containerPort: 8080
        protocol: TCP
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "20Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "24Gi"
          cpu: "8"
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10
      env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: token
    nodeSelector:
      nvidia.com/gpu.present: "true"
  transformer:
    containers:
    - name: kserve-container
      image: kserve/kserve-transformer:latest

Canary Deployment (Traffic Splitting)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
spec:
  predictor:
    canaryTrafficPercent: 20    # 20% to new version, 80% to stable
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct-v2"  # new model version
      resources:
        limits:
          nvidia.com/gpu: "1"

# Gradually increase canary traffic
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":50}]'

# Promote canary to stable
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'

Autoscaling with KEDA

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-scaler
  namespace: models
spec:
  scaleTargetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: llama-3-8b
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: kserve_request_count
      threshold: "10"
      query: |
        sum(rate(kserve_request_count_total{namespace="models",
            service="llama-3-8b"}[1m]))

NVIDIA Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.05-py3
        args:
        - "tritonserver"
        - "--model-store=s3://my-model-store/models"
        - "--model-control-mode=poll"        # auto-load new model versions
        - "--repository-poll-secs=30"
        - "--metrics-port=8002"
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # Metrics
        resources:
          limits:
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30

Triton Model Repository Structure

s3://my-model-store/models/
├── text-classifier/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx          # new version; auto-loaded
├── embedding-model/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx

# config.pbtxt for ONNX model
name: "text-classifier"
backend: "onnxruntime"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [16, 32]
  max_queue_delay_microseconds: 1000
}
input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [-1] }
  { name: "attention_mask" data_type: TYPE_INT64 dims: [-1] }
]
output [
  { name: "logits" data_type: TYPE_FP32 dims: [-1] }
]
instance_group [
  { kind: KIND_GPU count: 2 }   # 2 model instances on GPU
]

Model Management Commands

# List loaded models (Triton)
curl http://triton:8000/v2/models

# Load a new model version
curl -X POST http://triton:8000/v2/repository/models/text-classifier/load

# Unload a model
curl -X POST http://triton:8000/v2/repository/models/text-classifier/unload

# KServe — watch rollout status
kubectl rollout status deployment/llama-3-8b-predictor -n models
kubectl get inferenceservice llama-3-8b -n models -w

Common Issues

Issue	Cause	Fix
`InferenceService not ready`	Model loading or OOM	Check predictor pod logs; increase memory limits
Canary stuck at 0%	KNative routing issue	Check `kubectl get ksvc -n models`
Triton missing model	S3 permissions or path	Verify IAM role; check `--model-store` path
Low GPU utilization	Dynamic batching off	Enable `dynamic_batching` in Triton config
Autoscaler not triggering	Prometheus query wrong	Test query in Prometheus UI

Best Practices

Use canary deployments for all model updates — roll back in seconds if metrics degrade.
Enable Triton dynamic batching — it can increase GPU throughput 5–10× for small models.
Store models in S3/GCS with versioned paths (s3://bucket/model/v1/, v2/).
Pin GPU node selectors to prevent model pods landing on CPU-only nodes.
Monitor p99 latency and error rates per model version during canary rollouts.

Related Skills

vllm-server - vLLM for LLM serving
llm-inference-scaling - KEDA autoscaling
kubernetes-ops - Core Kubernetes operations
gpu-server-management - GPU nodes