| name | LiteLLM-Huawei-MaaS-Proxy |
| description | Deploy, configure, validate, troubleshoot, or extend an OpenAI-compatible API proxy backed by PostgreSQL, Prometheus, and Grafana, routing through Huawei ModelArts MaaS (ap-southeast-1) with multi-key load balancing. TRIGGER when the task involves LiteLLM proxy deployment, Docker Compose stack with litellm_config.yaml, Huawei MaaS model routing, virtual key or budget management, Prometheus/Grafana observability for LLM traffic, custom_callbacks.py TTFT/TPOT/ITL metrics, multi-key load balancing, or any reference to `LITELLM_MASTER_KEY`, `HUAWEI_MAAS_API_KEY`, or `docker compose` with this stack. |
LiteLLM Huawei MaaS Proxy
Deploy an OpenAI-compatible API proxy backed by PostgreSQL, Prometheus, and Grafana, routing through Huawei ModelArts MaaS (ap-southeast-1) with multi-key load balancing.
This repo ships runtime stack files for deterministic clone-and-run deployment. The file templates below serve as reference for understanding each file or building from scratch when git is not available.
When to Use
| Situation | Route |
|---|---|
| Deploy the full stack from scratch | Follow Deployment Workflow |
| Add or modify a model in the proxy | Follow Adding a new model |
| Troubleshoot a broken deployment | Follow Repair Playbook, then Common failure modes |
| Validate an existing deployment | Follow Validation Sequence |
| Manage virtual keys, budgets, or teams | Follow Virtual key management |
| Extend observability (custom metrics, dashboards) | Read Metrics and Grafana Dashboard sections |
| Backup, restore, or reset data | Follow Operations |
When NOT to use:
- Direct MaaS API calls without proxy (no spend tracking, no rate limiting)
- Non-Huawei LLM providers (this stack is MaaS-specific)
- Multi-host / Kubernetes deployment (this is a single-host Docker Compose stack)
Required Inputs
Confirm before making changes:
- Huawei MaaS API key — from ModelArts MaaS console. Mandatory main key.
- Additional MaaS API keys (optional) — for load balancing. Each extra key adds one deployment per model, so effective RPM/TPM per model is the per-key quota × the number of keys.
- Huawei MaaS API base — https://api-ap-southeast-1.modelarts-maas.com/openai/v1 (ap-southeast-1 / CN-Hong Kong). Do not swap regions without re-validating models and quotas.
- Explicit MaaS model IDs to expose (e.g. glm-5.1, deepseek-v4-flash). Verify in MaaS console — do not guess. If the user only gives one model, prefer explicit routing for that model instead of adding all five.
- LiteLLM listen port — default 4000. Override in docker-compose.yml ports section if colliding with an existing service.
- Prometheus listen port — default 9090.
- Grafana listen port — default 3000.
- Prometheus retention — default 15d. Adjust for disk capacity.
- Whether virtual keys already exist — if yes, LITELLM_SALT_KEY is immutable and cannot be changed.
- Docker 20.10+ with Compose V2 on the target host.
All of these are collected by ./scripts/init_env.sh (interactive, --auto, or --ci mode). The user can always choose manual .env editing instead.
Core Rules
- Never commit .env, real API keys, virtual keys, or bearer tokens. Secrets live in .env (gitignored) with 0600 permissions.
- Never change LITELLM_SALT_KEY after virtual keys exist. Recovery requires docker compose down -v and a fresh start.
- Model names are case-sensitive. Must match MaaS console exactly.
- MaaS is region-locked to ap-southeast-1.
- LiteLLM config is generated by scripts/generate_config.sh. Do not edit litellm_config.yaml directly — it will be overwritten. Edit .env and re-run generate_config.sh.
- LiteLLM config is read only at startup. Changes require docker compose restart litellm.
- Every model must have non-zero input_cost_per_token and output_cost_per_token for budget enforcement to work.
- Keep master key admin-only. Mint child virtual keys per team/service/environment.
- Make proxy the only egress path for MaaS traffic so budgets, rate limits, and spend logs stay centralized.
- STORE_MODEL_IN_DB: True — DB models take precedence over config file models.
- drop_params: True — unsupported parameters are silently dropped rather than causing errors.
- TTFT and ITL custom metrics are streaming-only.
- With N MaaS API keys, each model has N deployments. LiteLLM load-balances across them. Effective RPM/TPM = per-key × N.
- HUAWEI_MAAS_API_KEY is mandatory. Extra keys are optional. HUAWEI_MAAS_API_KEY_COUNT and HUAWEI_MAAS_API_KEY_N are set by init_env.sh.
Multi-Key Load Balancing
The proxy supports multiple MaaS API keys for increased throughput and resilience:
| Concept | Detail |
|---|---|
| Main key | HUAWEI_MAAS_API_KEY — mandatory, always required |
| Extra keys | Optional, collected by init_env.sh or set via HUAWEI_MAAS_EXTRA_API_KEYS (CI mode) |
| Internal vars | HUAWEI_MAAS_API_KEY_COUNT=N, HUAWEI_MAAS_API_KEY_0 through _N-1 — set by init_env.sh |
| Config generation | scripts/generate_config.sh reads .env and generates litellm_config.yaml with N deployments per model |
| Routing strategy | simple-shuffle (default): random selection across healthy deployments. Alternatives: least-busy, latency-based-routing |
| Cooldown | cooldown_time: 30, allowed_fails: 3 — failed deployments temporarily removed from rotation |
| Effective capacity | RPM/TPM per model = per-key × N keys |
| Backward compatible | Single key (N=1) = identical behavior to before |
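The capacity rule in the table reduces to a single multiplication. A minimal sketch, assuming the per-key RPM/TPM quotas listed in the Models table of this document; `PER_KEY_QUOTAS` and `effective_capacity` are illustrative names and not part of the repo:

```python
# Per-key quotas (RPM, TPM) copied from the Models table in this document.
PER_KEY_QUOTAS = {
    "glm-5.1": (30, 500_000),
    "deepseek-v4-flash": (3, 30_000),
}


def effective_capacity(model: str, key_count: int) -> tuple[int, int]:
    """Return (rpm, tpm) for one model across key_count deployments."""
    if key_count < 1:
        raise ValueError("at least one MaaS API key is required")
    rpm, tpm = PER_KEY_QUOTAS[model]
    return rpm * key_count, tpm * key_count


# Three keys give glm-5.1 three deployments.
print(effective_capacity("glm-5.1", 3))  # (90, 1500000)
```

This is an upper bound: it holds only while every deployment is healthy and out of cooldown.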
Adding/removing MaaS API keys
Add a key:
- Add HUAWEI_MAAS_API_KEY_N=<key> to .env (next index)
- Increment HUAWEI_MAAS_API_KEY_COUNT
- Run scripts/generate_config.sh
- Run docker compose restart litellm
Remove a key:
- Remove the HUAWEI_MAAS_API_KEY_N line from .env
- Re-number remaining keys to be contiguous (_0, _1, ...)
- Update HUAWEI_MAAS_API_KEY_COUNT
- Run scripts/generate_config.sh
- Run docker compose restart litellm
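The re-numbering step is the easiest one to get wrong by hand. A sketch of it, assuming .env has already been parsed into a dict; `reindex_keys` is a hypothetical helper, not something the repo ships, and it leaves the main HUAWEI_MAAS_API_KEY untouched:

```python
import re

# Matches HUAWEI_MAAS_API_KEY_0, _1, ... but not HUAWEI_MAAS_API_KEY_COUNT.
KEY_PATTERN = re.compile(r"^HUAWEI_MAAS_API_KEY_(\d+)$")


def reindex_keys(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with indexed MaaS keys renumbered 0..N-1."""
    indexed = sorted(
        (int(match.group(1)), value)
        for name, value in env.items()
        if (match := KEY_PATTERN.match(name))
    )
    # Keep every variable that is not an indexed key.
    result = {k: v for k, v in env.items() if not KEY_PATTERN.match(k)}
    for new_index, (_, value) in enumerate(indexed):
        result[f"HUAWEI_MAAS_API_KEY_{new_index}"] = value
    result["HUAWEI_MAAS_API_KEY_COUNT"] = str(len(indexed))
    return result


# Key 1 was deleted, leaving a gap between 0 and 2.
env = {
    "HUAWEI_MAAS_API_KEY_0": "<maas-api-key-a>",
    "HUAWEI_MAAS_API_KEY_2": "<maas-api-key-c>",
    "HUAWEI_MAAS_API_KEY_COUNT": "3",
}
print(reindex_keys(env)["HUAWEI_MAAS_API_KEY_COUNT"])  # 2
```

Relative order is preserved, so the surviving keys keep their original ordering after the gap is closed.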
Change routing strategy:
./scripts/generate_config.sh --routing-strategy=least-busy
docker compose restart litellm
Architecture
Client → LiteLLM (:4000) → Huawei MaaS (ap-southeast-1)
            │                   │
            │              ┌────┴────┐
            │              │  N API  │  (N = HUAWEI_MAAS_API_KEY_COUNT)
            │              │  keys   │  LiteLLM load-balances across N deployments
            │              └─────────┘
            ├── PostgreSQL (:5432) — keys, usage, spend
            ├── Prometheus (:9090) — /metrics scrape every 15s
            └── Grafana (:3000) — pre-built dashboard
Startup chain: PostgreSQL (pg_isready) → LiteLLM (/health/liveliness) → Prometheus (scrape) → Grafana.
Request flow: Client → LiteLLM:4000 → (router selects deployment) → Huawei MaaS. LiteLLM logs usage/spend to PostgreSQL, exposes /metrics for Prometheus, returns response.
With N MaaS API keys, each model has N deployments. LiteLLM's router distributes requests across deployments using the configured routing_strategy (default: simple-shuffle). Total effective RPM/TPM = per-key × N.
Multi-Key Load Balancing
The proxy supports multiple MaaS API keys for load balancing and increased throughput:
Environment variables
| Variable | Set by | Description |
|---|---|---|
| HUAWEI_MAAS_API_KEY | Manual / init_env.sh | Main MaaS API key (mandatory) |
| HUAWEI_MAAS_API_KEY_COUNT | init_env.sh | Total number of keys (1 + extra) |
| HUAWEI_MAAS_API_KEY_0 | init_env.sh | Indexed key 0 (same as main key) |
| HUAWEI_MAAS_API_KEY_N | init_env.sh | Indexed keys 1, 2, 3... |
| HUAWEI_MAAS_EXTRA_API_KEYS | Manual (CI mode) | Comma-separated extra keys |
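How the variables in this table relate to each other can be sketched in a few lines. This is an illustration of the mapping only; the actual logic inside init_env.sh is not shown in this document, and `build_key_vars` is a hypothetical name:

```python
def build_key_vars(main_key: str, extra_keys_csv: str = "") -> dict[str, str]:
    """Derive the count and indexed variables from the main key plus extras."""
    # HUAWEI_MAAS_EXTRA_API_KEYS is comma-separated; ignore empty entries.
    extras = [k.strip() for k in extra_keys_csv.split(",") if k.strip()]
    keys = [main_key, *extras]
    env = {
        "HUAWEI_MAAS_API_KEY": main_key,
        "HUAWEI_MAAS_API_KEY_COUNT": str(len(keys)),
    }
    for index, key in enumerate(keys):
        # Index 0 mirrors the main key, as the table states.
        env[f"HUAWEI_MAAS_API_KEY_{index}"] = key
    return env


for name, value in build_key_vars("<maas-api-key>", "<extra-key-1>,<extra-key-2>").items():
    print(f"{name}={value}")
```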
Config generation
- litellm_config.yaml is generated by scripts/generate_config.sh from litellm_config.yaml.example
- The generated file is gitignored — never edit it directly
- With N keys, each model has N deployments (e.g., glm-5.1 → glm-5.1--maas-key-0, glm-5.1--maas-key-1, ...)
- Total effective RPM/TPM = per-key × N
Router settings
| Setting | Value | Purpose |
|---|---|---|
| routing_strategy | simple-shuffle | Random selection across healthy deployments |
| cooldown_time | 30 | Seconds to remove failed deployment from rotation |
| allowed_fails | 3 | Failures before cooldown triggers |
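To make the interaction of these three settings concrete, here is a toy model of random selection with cooldown. It is not LiteLLM's router code, and the exact failure-counting semantics in LiteLLM may differ; `ToyRouter` and its methods are invented for illustration:

```python
import random


class ToyRouter:
    """Toy model: random pick across deployments that are not cooling down."""

    def __init__(self, deployments, allowed_fails=3, cooldown_time=30.0):
        self.deployments = list(deployments)
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {d: 0 for d in self.deployments}
        self.cooldown_until = {d: 0.0 for d in self.deployments}

    def healthy(self, now):
        """Deployments whose cooldown (if any) has expired at time `now`."""
        return [d for d in self.deployments if now >= self.cooldown_until[d]]

    def pick(self, now, rng=random):
        candidates = self.healthy(now)
        if not candidates:
            raise RuntimeError("all deployments are cooling down")
        return rng.choice(candidates)

    def record_failure(self, deployment, now):
        """Count a failure; start a cooldown once the threshold is reached."""
        self.fail_counts[deployment] += 1
        if self.fail_counts[deployment] >= self.allowed_fails:
            self.cooldown_until[deployment] = now + self.cooldown_time
            self.fail_counts[deployment] = 0


router = ToyRouter(["glm-5.1--maas-key-0", "glm-5.1--maas-key-1"])
for _ in range(3):
    router.record_failure("glm-5.1--maas-key-0", now=0.0)
print(router.healthy(now=10.0))  # ['glm-5.1--maas-key-1']
print(router.healthy(now=30.0))  # both deployments again
```

While one key is cooling down, effective capacity for that model drops to the remaining keys.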
Backward compatibility
Single key = identical behavior to before. No changes required for existing single-key deployments.
Codebase
.
├── README.md                               human-facing overview
├── SKILL.md                                agent-facing workflow (this file)
├── docker-compose.yml                      4-service orchestrator (references assets/config/)
├── assets/config/
│   ├── litellm_config.yaml.example         model catalog example (tracked in git)
│   ├── litellm_config.yaml                 generated config (gitignored)
│   ├── custom_callbacks.py                 TTFT/TPOT/ITL Prometheus histograms
│   ├── prometheus.yml                      15s scrape → litellm:4000
│   ├── .env.example                        environment template
│   └── grafana/
│       └── provisioning/
│           ├── datasources/prometheus.yml  auto-linked Prometheus datasource
│           └── dashboards/
│               ├── dashboards.yml          file-based provider, 30s refresh
│               └── litellm_overview.json   pre-built overview dashboard
├── references/
│   ├── architecture.md                     topology, services, volumes, environment
│   ├── metrics-and-dashboards.md           PromQL, custom metrics, Grafana panel config
│   ├── operations.md                       health checks, backup, restart, usage, endpoints
│   └── troubleshooting.md                  repair playbook, failure modes, common mistakes
├── scripts/
│   ├── init_env.sh                         interactive .env setup (manual or agent-guided)
│   ├── generate_config.sh                  generates litellm_config.yaml from .env
│   └── validate_e2e.sh                     12-step end-to-end validation
├── .env                                    actual secrets (gitignored)
└── .gitignore                              .env and litellm_config.yaml
File-by-file reference
| File | Role | Key details |
|---|---|---|
| docker-compose.yml | Service orchestration | YAML anchor, 4 services with healthcheck chain, named volumes, mounts from ./assets/config/ |
| assets/config/litellm_config.yaml.example | Model catalog example | openai/ prefix + MaaS endpoint, tpm/rpm per model, per-token pricing, tracked in git |
| assets/config/litellm_config.yaml | Generated config | Created by generate_config.sh, gitignored, N deployments per model |
| assets/config/custom_callbacks.py | Custom Prometheus metrics | PrometheusTTFTTPOTITL(CustomLogger), 3 histograms labeled by model, model_group, api_provider |
| assets/config/prometheus.yml | Scrape config | Single job litellm at 15s interval |
| assets/config/grafana/provisioning/datasources/prometheus.yml | Datasource | Prometheus type, proxy access, http://prometheus:9090 |
| assets/config/grafana/provisioning/dashboards/dashboards.yml | Dashboard provider | File-based, org 1, 30s update interval |
| assets/config/grafana/provisioning/dashboards/litellm_overview.json | Pre-built dashboard | UID litellm-overview, 10s auto-refresh, template variables: model, datasource, includes Deployment Load Balancing row |
| scripts/generate_config.sh | Config generator | Reads .env, generates litellm_config.yaml from template, creates N deployments per model |
Docker Compose Services
| Service | Image | Container name | Port | Healthcheck | Depends on |
|---|---|---|---|---|---|
| litellm | ghcr.io/berriai/litellm:v1.83.14-stable.patch.3 | litellm_proxy | 4000:4000 | GET /health/liveliness every 30s, 10s timeout, 3 retries, 40s start period | db (healthy) |
| db | postgres:16-alpine | litellm_pg_db | (internal 5432) | pg_isready every 5s, 5s timeout, 10 retries | — |
| prometheus | prom/prometheus:v3.3.1 | litellm_prometheus | 9090:9090 | GET /-/healthy every 15s, 5s timeout, 3 retries, 10s start period | litellm (healthy) |
| grafana | grafana/grafana:11.5.2 | litellm_grafana | 3000:3000 | GET /api/health every 15s, 5s timeout, 3 retries, 15s start period | prometheus (healthy) |
Volume mounts
| Service | Host path | Container path | Mode |
|---|---|---|---|
| litellm | ./assets/config/litellm_config.yaml | /app/config.yaml | ro (generated file) |
| litellm | ./assets/config/custom_callbacks.py | /app/custom_callbacks.py | ro |
| db | postgres_data volume | /var/lib/postgresql/data | rw |
| prometheus | ./assets/config/prometheus.yml | /etc/prometheus/prometheus.yml | ro |
| prometheus | prometheus_data volume | /prometheus | rw |
| grafana | ./assets/config/grafana/provisioning | /etc/grafana/provisioning | ro |
| grafana | grafana_data volume | /var/lib/grafana | rw |
Named volumes
| Volume name | Survives down? | Removed by |
|---|---|---|
| litellm_postgres_data | Yes | docker compose down -v |
| litellm_prometheus_data | Yes | docker compose down -v |
| litellm_grafana_data | Yes | docker compose down -v |
LiteLLM container environment
Set via env_file: .env plus explicit environment:
| Variable | Source | Value |
|---|---|---|
| DATABASE_URL | docker-compose | postgresql://llmproxy:${DB_PASSWORD}@db:5432/litellm |
| STORE_MODEL_IN_DB | docker-compose | True |
| LITELLM_MASTER_KEY | .env | Admin key, must start with sk- |
| LITELLM_SALT_KEY | .env | Key encryption salt |
| HUAWEI_MAAS_API_KEY | .env | Main Huawei MaaS API key |
| HUAWEI_MAAS_API_KEY_N | .env | Indexed MaaS API keys (0, 1, 2...) |
| HUAWEI_MAAS_API_KEY_COUNT | .env | Number of MaaS API keys |
| HUAWEI_MAAS_API_BASE | .env | https://api-ap-southeast-1.modelarts-maas.com/openai/v1 |
LiteLLM command: --config=/app/config.yaml
Deployment Workflow
Follow in order. Do not skip validation steps.
0. Preflight
docker --version
docker compose version
1. Install from monorepo
MONOREPO="https://github.com/binrogithub/1-3-Cloud-Adoption-Skills.git"
TEMP_DIR="/home/1-3-Cloud-Adoption-Skills"
LITELLM_DIR="/home/LiteLLM-Huawei-MaaS-Proxy"
git clone --depth 1 "$MONOREPO" "$TEMP_DIR"
cp -r "$TEMP_DIR/AI/AI-Coding/LiteLLM-Huawei-MaaS-Proxy" "$LITELLM_DIR"
rm -rf "$TEMP_DIR"
cd "$LITELLM_DIR"
2. Configure .env
Two paths — choose one:
Guided (recommended for agents and first-time deployers). Run exactly one of the three modes:
./scripts/init_env.sh          # interactive mode
./scripts/init_env.sh --auto   # auto mode
./scripts/init_env.sh --ci     # CI mode
The script writes .env with 0600 permissions, validates required values, and refuses to proceed if HUAWEI_MAAS_API_KEY is missing or placeholder. After collecting keys, it sets HUAWEI_MAAS_API_KEY_COUNT and indexed HUAWEI_MAAS_API_KEY_N vars.
Manual (full control over every value):
cp assets/config/.env.example .env
$EDITOR .env
chmod 600 .env
Or generate secrets individually:
python3 -c "import secrets; print('sk-' + secrets.token_urlsafe(32))"
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
2b. Generate config
After .env is configured:
./scripts/generate_config.sh
This reads HUAWEI_MAAS_API_KEY_COUNT and creates N deployments per model.
3. Pre-deploy validation
Before starting the stack, verify .env is complete:
source .env
for VAR in LITELLM_MASTER_KEY LITELLM_SALT_KEY DB_PASSWORD HUAWEI_MAAS_API_KEY; do
VAL="${!VAR:-}"
[ -z "$VAL" ] && echo "MISSING: $VAR" || echo "OK: $VAR (len=${#VAL})"
done
If any variable is missing or contains a placeholder, halt and fix before proceeding. The stack will fail at runtime with incomplete secrets.
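The shell loop above reports missing values but not placeholders. A sketch of a stricter check, assuming placeholders contain the string "change-me" as they do in .env.example; `parse_env` and `check_env` are hypothetical helpers, and the parser handles only simple NAME="value" lines:

```python
REQUIRED = (
    "LITELLM_MASTER_KEY",
    "LITELLM_SALT_KEY",
    "DB_PASSWORD",
    "HUAWEI_MAAS_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple NAME=value lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        env[name.strip()] = value.strip().strip('"').strip("'")
    return env


def check_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means .env looks complete."""
    problems = []
    for name in REQUIRED:
        value = env.get(name, "")
        if not value:
            problems.append(f"MISSING: {name}")
        elif "change-me" in value:
            problems.append(f"PLACEHOLDER: {name}")
    master = env.get("LITELLM_MASTER_KEY", "")
    if master and not master.startswith("sk-"):
        problems.append("LITELLM_MASTER_KEY must start with sk-")
    return problems


sample = 'LITELLM_MASTER_KEY="sk-change-me"\nDB_PASSWORD=""\n'
for problem in check_env(parse_env(sample)):
    print(problem)
```

In practice the text would come from reading the real .env file; the check never prints secret values, only variable names.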
4. Start the stack
docker compose up -d
5. Wait for healthy services
docker compose ps
All four services must show healthy or running. LiteLLM has a 40s start period.
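Instead of re-running docker compose ps by hand, the wait can be scripted. A sketch using only the standard library that polls the /health/liveliness route used by the container healthcheck; `wait_until_live`, its retry count, and its delay are assumptions chosen to cover the 40s start period:

```python
import time
import urllib.error
import urllib.request


def liveliness_ok(url="http://localhost:4000/health/liveliness", timeout=5.0) -> bool:
    """Return True when the proxy answers the liveness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_live(probe=liveliness_ok, attempts=12, delay=5.0, sleep=time.sleep) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_live()` with the defaults gives the stack roughly a minute to come up. A True result only means the LiteLLM process is up; per-model status still needs the authenticated /health check.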
6. Validate direct MaaS connectivity
curl -s https://api-ap-southeast-1.modelarts-maas.com/openai/v1/models \
-H "Authorization: Bearer $HUAWEI_MAAS_API_KEY" | jq '.data[].id'
Expect a list of model IDs. If 403, key is wrong or expired.
7. Validate LiteLLM health
curl -s http://localhost:4000/health/liveliness
curl -s http://localhost:4000/health \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.healthy_count, .unhealthy_count'
Expect unhealthy_count: 0.
8. Validate proxied chat completion
curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Reply with OK only."}]}' | jq '.choices[0].message.content'
9. Validate streaming
curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-v4-flash", "messages": [{"role": "user", "content": "Count to 3."}], "stream": true}' | head -5
Expect SSE chunks (data: {...}).
10. Validate Prometheus metrics
curl -s http://localhost:4000/metrics | grep -c "litellm_"
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
Expect metric count > 0 and Prometheus target health = up.
11. Validate Grafana dashboard
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000
Expect 200.
12. Validate virtual key minting
curl -s -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"models": ["glm-5.1"], "max_budget": 1.0, "duration": "1d"}' | jq '.key'
Expect a virtual key starting with sk-.
Validation Sequence
For an existing deployment, run in order:
- docker compose ps — all services healthy
- curl http://localhost:4000/health/liveliness — LiteLLM process up
- curl http://localhost:4000/health -H "Authorization: Bearer $LITELLM_MASTER_KEY" — upstream reachable per model
- Chat completion with master key on glm-5.1 — sync path
- Streaming completion — SSE path
- /key/generate — mints a virtual key
- Chat completion with virtual key — multi-user path and budget hooks
- /metrics | grep -c litellm_ — metrics flowing
- Prometheus targets — scraping
- http://localhost:3000 — Grafana reachable
Environment Reference
| Variable | Required | Default | Description |
|---|---|---|---|
| LITELLM_MASTER_KEY | Yes | — | Admin key, must start with sk- |
| LITELLM_SALT_KEY | Yes | — | Key encryption salt — immutable after first virtual key |
| DB_PASSWORD | Yes | — | PostgreSQL password for llmproxy user |
| HUAWEI_MAAS_API_KEY | Yes | — | Main MaaS API key from ModelArts console (CN-Hong Kong) |
| HUAWEI_MAAS_API_BASE | Yes | — | https://api-ap-southeast-1.modelarts-maas.com/openai/v1 |
| HUAWEI_MAAS_API_KEY_COUNT | Auto | 1 | Number of MaaS API keys (set by init_env.sh) |
| HUAWEI_MAAS_API_KEY_N | Auto | — | Indexed keys (0, 1, 2...). Set by init_env.sh. |
| HUAWEI_MAAS_EXTRA_API_KEYS | No | — | Comma-separated extra keys for CI mode |
| PROMETHEUS_RETENTION | No | 15d | Prometheus TSDB retention period |
| GRAFANA_PASSWORD | No | admin | Grafana admin password |
Endpoints
| Service | URL | Auth |
|---|---|---|
| LiteLLM API | http://localhost:4000 | Authorization: Bearer <key> |
| LiteLLM Admin UI | http://localhost:4000/ui | Login with LITELLM_MASTER_KEY |
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3000 | admin / GRAFANA_PASSWORD |
LiteLLM API routes
| Route | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI-compatible chat completions |
| /v1/models | GET | List available models |
| /health/liveliness | GET | Liveness probe (used by healthcheck) |
| /health | GET | Per-model health (auth required) |
| /metrics | GET | Prometheus metrics endpoint |
| /key/generate | POST | Generate scoped virtual key |
| /key/info | POST | Get key info |
| /key/update | POST | Update key settings |
| /key/delete | POST | Delete a key |
| /model/info | GET | Model details including pricing (auth required) |
| /ui | GET | Admin UI |
Models
| Name | Max tokens (in / out) | RPM | TPM | Cost per token in USD × 10⁻⁶ (in / out) |
|---|---|---|---|---|
| glm-5.1 | 192K / 128K | 30 | 500K | 1.078 / 3.774 |
| glm-5 | 192K / 64K | 30 | 500K | 0.809 / 2.965 |
| deepseek-v4-pro | 1M / 128K | 3 | 30K | 1.617 / 3.235 |
| deepseek-v4-flash | 1M / 128K | 3 | 30K | 0.135 / 0.270 |
| deepseek-v3.2 | 128K / 32K | 700 | 500K | 0.270 / 0.404 |
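Because the prices are per token, a request's spend is a plain weighted sum of its token counts. A small worked example using the glm-5.1 and deepseek-v4-flash prices from the table; `PRICES` and `request_cost` are illustrative names only:

```python
# (input_cost_per_token, output_cost_per_token) in USD, from the Models table.
PRICES = {
    "glm-5.1": (1.078e-6, 3.774e-6),
    "deepseek-v4-flash": (0.135e-6, 0.270e-6),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Spend in USD for one request at per-token pricing."""
    in_price, out_price = PRICES[model]
    return input_tokens * in_price + output_tokens * out_price


# 1000 input tokens and 500 output tokens on glm-5.1: about 0.002965 USD.
print(f"{request_cost('glm-5.1', 1000, 500):.6f}")
```

At that rate a virtual key with max_budget 1.0 covers a few hundred such requests, which is why zero or per-1K prices make budgets meaningless.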
Model configuration structure
Single key (backward compatible):
- model_name: <public-name>
  litellm_params:
    model: openai/<maas-model-name>
    api_base: os.environ/HUAWEI_MAAS_API_BASE
    api_key: os.environ/HUAWEI_MAAS_API_KEY
    tpm: <tokens-per-minute>
    rpm: <requests-per-minute>
  model_info:
    max_tokens: <total>
    max_input_tokens: <input>
    max_output_tokens: <output>
    input_cost_per_token: <price>
    output_cost_per_token: <price>
Multi-key (N keys → N deployments per model):
With N MaaS API keys, generate_config.sh creates N deployments per model:
- model_name: <public-name>
  litellm_params:
    model: openai/<maas-model-name>
    api_base: os.environ/HUAWEI_MAAS_API_BASE
    api_key: os.environ/HUAWEI_MAAS_API_KEY_0
    tpm: <tokens-per-minute>
    rpm: <requests-per-minute>
  model_info: { ... }
- model_name: <public-name>
  litellm_params:
    model: openai/<maas-model-name>
    api_base: os.environ/HUAWEI_MAAS_API_BASE
    api_key: os.environ/HUAWEI_MAAS_API_KEY_1
    tpm: <tokens-per-minute>
    rpm: <requests-per-minute>
  model_info: { ... }
LiteLLM's router load-balances across all deployments for the same model_name.
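The expansion from one model to N deployments can be sketched as a list comprehension over key indices. This mirrors the structure shown above but is not the code in generate_config.sh, which is not reproduced in this document; `build_deployments` is a hypothetical helper and model_info is omitted for brevity:

```python
def build_deployments(public_name: str, maas_model: str, key_count: int,
                      tpm: int, rpm: int) -> list[dict]:
    """Build one model_list entry per MaaS API key for a single model."""
    return [
        {
            "model_name": public_name,  # same public name on every entry
            "litellm_params": {
                "model": f"openai/{maas_model}",
                "api_base": "os.environ/HUAWEI_MAAS_API_BASE",
                # Only the key reference differs between deployments.
                "api_key": f"os.environ/HUAWEI_MAAS_API_KEY_{index}",
                "tpm": tpm,
                "rpm": rpm,
            },
        }
        for index in range(key_count)
    ]


for entry in build_deployments("glm-5.1", "glm-5.1", key_count=2, tpm=500_000, rpm=30):
    print(entry["litellm_params"]["api_key"])
```

The tpm and rpm values stay per-key on each entry; the router's view of total capacity comes from having N entries, not from scaling the numbers.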
Adding a new model
- Find model name and rate/price info in ModelArts MaaS console
- Add entry to model_list in assets/config/litellm_config.yaml.example following the structure above
- Ensure model_name and the model ID after the openai/ prefix match MaaS exactly (case-sensitive)
- Set tpm/rpm from MaaS console quotas
- Set non-zero input_cost_per_token and output_cost_per_token (per-token, not per-1K)
- Regenerate config: ./scripts/generate_config.sh
- Restart: docker compose restart litellm
- Verify: curl -s http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[].id'
- Confirm pricing: curl -s http://localhost:4000/model/info -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[] | {model: .model_name, input_cost: .input_cost_per_token, output_cost: .output_cost_per_token}'
Adding/removing MaaS API keys
To add a key:
- Add the new key to .env as HUAWEI_MAAS_API_KEY_N (next index)
- Increment HUAWEI_MAAS_API_KEY_COUNT
- Regenerate config: ./scripts/generate_config.sh
- Restart: docker compose restart litellm
- Verify deployment count: curl -s http://localhost:4000/model/info -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '[.data[].model_name] | length'
To remove a key:
- Remove the key from .env
- Decrement HUAWEI_MAAS_API_KEY_COUNT
- Re-index remaining keys if needed (keys must be 0, 1, 2... contiguous)
- Regenerate config: ./scripts/generate_config.sh
- Restart: docker compose restart litellm
To rotate an expired key:
- Replace the expired key value in .env
- Regenerate config: ./scripts/generate_config.sh
- Restart: docker compose restart litellm
Proxy settings
Configured in assets/config/litellm_config.yaml under litellm_settings:
| Setting | Value | Meaning |
|---|---|---|
| num_retries | 3 | Retry failed calls 3 times within same deployment |
| request_timeout | 600 | Raise TimeoutError after 600s (full request latency; matches LiteLLM default) |
| stream_timeout | 60 | Raise TimeoutError after 60s waiting for first token (TTFT only) |
| drop_params | True | Drop unsupported params instead of erroring |
| set_verbose | False | Suppress debug logging |
| callbacks | ["prometheus", "custom_callbacks.my_prometheus_logger"] | Built-in Prometheus + custom TTFT/TPOT/ITL |
Under general_settings:
| Setting | Value | Meaning |
|---|---|---|
| database_connection_pool_limit | 10 | Max DB connections |
| database_connection_timeout | 60 | DB connection timeout in seconds |
Usage
Chat completion
curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Hello!"}]}'
Streaming
curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-v4-flash", "messages": [{"role": "user", "content": "Count to 5."}], "stream": true}'
Thinking mode (DeepSeek)
curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-v4-pro", "messages": [{"role": "user", "content": "Solve step by step."}], "extra_body": {"thinking": {"type": "enabled"}}}'
Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-...")
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Virtual key management
curl -s -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"models": ["glm-5.1", "deepseek-v4-flash"], "max_budget": 10.0, "duration": "30d"}'
curl -s -X POST http://localhost:4000/key/info \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"key": "sk-..."}'
curl -s -X POST http://localhost:4000/key/update \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"key": "sk-...", "max_budget": 50.0}'
curl -s -X POST http://localhost:4000/key/delete \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"keys": ["sk-..."]}'
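The same key-minting call can be issued from Python without extra dependencies. A sketch that only builds the request for /key/generate with the fields used in the curl example above; `key_generate_request` is a hypothetical helper and the base URL assumes the default port:

```python
import json
import urllib.request


def key_generate_request(master_key: str, models: list[str], max_budget: float,
                         duration: str, base_url: str = "http://localhost:4000"):
    """Build (but do not send) a POST request for /key/generate."""
    payload = {"models": models, "max_budget": max_budget, "duration": duration}
    return urllib.request.Request(
        f"{base_url}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = key_generate_request("sk-<master-key>", ["glm-5.1"], 10.0, "30d")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then read the "key" field of the JSON reply.
```

Building the request separately from sending it keeps the master key out of shell history and makes the payload easy to inspect before it leaves the host.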
Metrics
Built-in LiteLLM metrics (on /metrics)
| Metric | Type | Description |
|---|---|---|
| litellm_proxy_total_requests_metric | counter | Total requests |
| litellm_request_total_latency_metric | histogram | End-to-end latency |
| litellm_llm_api_latency_metric | histogram | Upstream API latency only |
| litellm_spend_metric | counter | Cumulative spend (USD) |
| litellm_input_tokens_metric | counter | Input tokens |
| litellm_output_tokens_metric | counter | Output tokens |
| litellm_deployment_state | gauge | 0=healthy, 1=partial, 2=outage |
Deployment-level metrics (multi-key)
| Metric | Type | Labels | Description |
|---|---|---|---|
| litellm_deployment_success_responses_total | counter | litellm_model_name | Successful responses per deployment |
| litellm_deployment_failure_responses_total | counter | litellm_model_name, exception_status, exception_class | Failed responses per deployment |
| litellm_deployment_cooled_down | gauge | litellm_model_name | 1 if deployment is in cooldown |
| litellm_deployment_latency_per_output_token | histogram | litellm_model_name | Latency per output token by deployment |
Label note: Deployment metrics use litellm_model_name (includes deployment suffix like glm-5.1--maas-key-0), not model.
Custom metrics (custom_callbacks.py)
| Metric | Type | When | Bucket range |
|---|---|---|---|
| litellm_custom_ttft_seconds | histogram | Streaming only | 0.01s → 30s |
| litellm_custom_tpot_seconds | histogram | Always | 0.001s → 5s |
| litellm_custom_itl_seconds | histogram | Streaming only | 0.001s → 5s |
All custom metrics labeled: model, model_group, api_provider.
Custom callback internals
PrometheusTTFTTPOTITL(CustomLogger):
- TTFT = completion_start_time - api_call_start_time (streaming only, when > 0)
- TPOT = (end_time - start_time) / output_tokens (always, when output_tokens > 0)
- ITL = (end_time - completion_start_time) / (output_tokens - 1) (streaming only, when output_tokens > 1)
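The three formulas can be checked with a small standalone function. This is a sketch of the arithmetic only, with timestamps taken as seconds in floats; it is not the callback class itself, and `latency_metrics` is an illustrative name:

```python
from typing import Optional


def latency_metrics(api_call_start_time: float, completion_start_time: float,
                    start_time: float, end_time: float,
                    output_tokens: int, stream: bool) -> dict[str, Optional[float]]:
    """Compute TTFT, TPOT and ITL in seconds; None when a metric does not apply."""
    ttft = tpot = itl = None
    if stream:
        candidate = completion_start_time - api_call_start_time
        if candidate > 0:
            ttft = candidate
    if output_tokens > 0:
        tpot = (end_time - start_time) / output_tokens
    if stream and output_tokens > 1:
        itl = (end_time - completion_start_time) / (output_tokens - 1)
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


# Streaming call: first token after 0.5s, 5 tokens, finished at 2.5s.
print(latency_metrics(0.0, 0.5, 0.0, 2.5, output_tokens=5, stream=True))
# {'ttft': 0.5, 'tpot': 0.5, 'itl': 0.5}
```

TPOT divides the whole request duration by the token count, so it includes the wait for the first token; ITL excludes that wait, which is why the two diverge on slow-starting streams.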
Useful PromQL
rate(litellm_proxy_total_requests_metric[5m]) * 60
histogram_quantile(0.99, rate(litellm_request_total_latency_metric_bucket[5m]))
rate(litellm_spend_metric[1d])
rate(litellm_input_tokens_metric[5m])*60 + rate(litellm_output_tokens_metric[5m])*60
litellm_deployment_state == 2
histogram_quantile(0.95, rate(litellm_custom_ttft_seconds_bucket[5m]))
rate(litellm_custom_tpot_seconds_sum[5m]) / rate(litellm_custom_tpot_seconds_count[5m])
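Any of these expressions can also be evaluated outside Grafana through Prometheus's HTTP API. A sketch that builds an instant-query URL for the /api/v1/query endpoint, assuming the default Prometheus port from this stack; `instant_query_url` is a hypothetical helper:

```python
from urllib.parse import urlencode


def instant_query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """Return the Prometheus instant-query URL for a PromQL expression."""
    # urlencode escapes brackets, spaces and operators in the expression.
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


print(instant_query_url("rate(litellm_proxy_total_requests_metric[5m]) * 60"))
```

Fetching that URL with curl or urllib returns JSON whose data.result array holds one entry per matching series.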
Grafana Dashboard
Pre-built dashboard (litellm_overview.json):
- Auto-refresh: 10s
- Default time range: Last 1 hour
- Template variables: model (multi-select), datasource (Prometheus selector)
- Panel sections: Request Rates, Latency Percentiles, Spend, Token Rates, Deployment State, Custom TTFT/TPOT/ITL, Deployment Load Balancing
Deployment Load Balancing panels (multi-key)
When multiple MaaS API keys are configured, the dashboard includes:
| Panel | Description |
|---|---|
| Deployments Per Model | Shows N deployments per model (bar chart) |
| Request Distribution | Per-deployment request counts (pie/bar) |
| Cooldown Events | Deployments temporarily removed from rotation (time series) |
| Per-Deployment Latency | Latency breakdown by deployment (heatmap or time series) |
| Deployment Health | Current health status per deployment (stat panel) |
Access at http://localhost:3000, login with admin / GRAFANA_PASSWORD.
Operations
Health checks
docker compose ps
curl -s http://localhost:4000/health/liveliness
curl -s http://localhost:4000/health \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.healthy_count, .unhealthy_count'
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s https://api-ap-southeast-1.modelarts-maas.com/openai/v1/models \
-H "Authorization: Bearer $HUAWEI_MAAS_API_KEY"
Backup & restore
docker compose exec -T db pg_dump -U llmproxy litellm > backup_$(date +%Y%m%d).sql
cat backup_20260516.sql | docker compose exec -T db psql -U llmproxy litellm
Restart & reset
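docker compose restart litellm                   # restart LiteLLM only (e.g. after a config change)
docker compose down && docker compose up -d      # recreate containers; named volumes and data survive
docker compose down -v && docker compose up -d   # full reset: removes named volumes and all stored data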
Troubleshooting commands
docker compose logs litellm
docker compose logs -f litellm
docker compose logs db
docker compose exec db pg_isready -d litellm -U llmproxy
docker compose logs prometheus
docker compose logs grafana
docker volume ls | grep litellm
docker compose exec litellm env | grep -E '^(LITELLM|DB_|HUAWEI|STORE_)'
Repair Playbook
- Inspect state — docker compose ps and docker compose logs litellm --tail 50
- Inspect config — read assets/config/litellm_config.yaml and .env before editing
- Confirm environment — verify .env contains real MaaS key (not placeholder)
- Check DB — docker compose exec db pg_isready -d litellm -U llmproxy
- Check LiteLLM health — curl -s http://localhost:4000/health -H "Authorization: Bearer $LITELLM_MASTER_KEY"
- Fix the issue — see Common failure modes below
- Restart if config changed — docker compose restart litellm
- Re-validate — run scripts/validate_e2e.sh or Validation Sequence
Common failure modes
| Symptom | Cause | Fix |
|---|---|---|
| litellm keeps restarting | DB not ready or wrong DB_PASSWORD | Check docker compose logs db, verify .env |
| 401 from /v1/chat/completions | Wrong or missing API key | Verify Authorization: Bearer sk-... header |
| 404 model not found | Model name mismatch | Names are case-sensitive, must match MaaS console |
| No metrics in Prometheus | LiteLLM healthcheck failing | Check docker compose ps, ensure litellm healthy |
| LITELLM_SALT_KEY error | Salt changed after keys created | Use original salt; if lost, docker compose down -v |
| MaaS 403 | Wrong region or expired key | Verify key in ModelArts console, region must be ap-southeast-1 |
| Callback import error | custom_callbacks.py not mounted | Check volume mount in docker-compose.yml |
| unhealthy_count > 0 in /health | Upstream model unreachable | Check MaaS key, model ID, region; do not add wildcards |
| Budget not consumed | Zero input_cost_per_token / output_cost_per_token | Set non-zero pricing; verify with /model/info |
| Prometheus target down | LiteLLM not healthy or not started | Check healthcheck chain: db → litellm → prometheus |
| Grafana shows no data | Prometheus not scraping or wrong datasource | Check targets; verify datasource URL is http://prometheus:9090 |
| Virtual key 403 | Key expired, over budget, or model not in allow-list | Check key with /key/info |
| Intermittent TimeoutError | request_timeout too low for LLM calls (applies to full request latency, not just TTFT) | Increase request_timeout (default 600s); add stream_timeout for tighter TTFT deadline |
Sanitization Rules
- Never write real secrets into committed files. Use .env (gitignored) with 0600 permissions.
- In output or documentation, use placeholders: sk-<master-key>, <maas-api-key>, <db-password>.
- In configuration demos, read secrets from env vars (os.environ/...) or $VAR_NAME placeholders.
- Mask discovered keys as <prefix>...<suffix> (len=N) or ***redacted***.
- LiteLLM may print api_key values in startup logs. Scan after troubleshooting: docker compose logs litellm 2>&1 | grep -i 'api_key\|sk-'; set set_verbose: False if keys appear.
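The masking rule above is simple enough to apply mechanically. A sketch of one way to do it; the prefix and suffix lengths are arbitrary choices and `mask_secret` is a hypothetical helper:

```python
def mask_secret(value: str, prefix: int = 3, suffix: int = 4) -> str:
    """Mask a secret as <prefix>...<suffix> (len=N), or redact it if too short."""
    if len(value) <= prefix + suffix:
        # Too short to show any part without revealing most of it.
        return "***redacted***"
    return f"{value[:prefix]}...{value[-suffix:]} (len={len(value)})"


print(mask_secret("sk-abcdefghijklmnop1234"))  # sk-...1234 (len=23)
print(mask_secret("short"))                    # ***redacted***
```

Showing only a few characters plus the length is usually enough to tell two keys apart in a log without making either one recoverable.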
Common Mistakes
| Mistake | Why it's wrong | Correct approach |
|---|
Committing .env to git | Leaks all secrets | .env is gitignored; never git add .env |
Changing LITELLM_SALT_KEY after creating virtual keys | All existing keys unreadable | Keep original salt; if lost, full reset |
Giving clients the raw HUAWEI_MAAS_API_KEY | Bypasses spend tracking, rate limiting, audit | Mint virtual keys via /key/generate |
Using per-1K-token pricing in model_info | LiteLLM expects per-token pricing | Use input_cost_per_token (e.g. 0.000001078, not 0.001078) |
| Adding a model with zero pricing | Budgets don't consume spend | Set non-zero input_cost_per_token and output_cost_per_token |
| Guessing model names | MaaS model IDs are case-sensitive | Verify exact name in MaaS console |
| Editing config without restarting | Config is read at startup only | docker compose restart litellm after changes |
| Running docker compose down expecting data loss | Volumes survive down | Use docker compose down -v to destroy data |
| Checking /health/liveliness instead of /health for model status | Liveliness only checks process | Use /health with auth for model-level diagnostics |
| Setting request_timeout too low (e.g. 10s) | LLM calls routinely exceed 10s end-to-end, causing intermittent TimeoutErrors | Use request_timeout: 600 (default); use stream_timeout for tighter TTFT control |
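The per-1K-token pricing mistake is easiest to see with numbers. The sketch below is illustrative arithmetic only; the helper names `per_token_price` and `request_cost` are hypothetical, and the prices are the glm-5.1 values from the config template.

```python
def per_token_price(price: float, per_tokens: int) -> float:
    """Convert a price quoted per `per_tokens` tokens into a per-token price."""
    return price / per_tokens


def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_cost_per_token: float, output_cost_per_token: float) -> float:
    """Spend for one request when both prices are expressed per token."""
    return (prompt_tokens * input_cost_per_token
            + completion_tokens * output_cost_per_token)


if __name__ == "__main__":
    # A price quoted per 1K tokens must be divided by 1000 before it goes
    # into model_info.
    correct_input = per_token_price(0.001078, 1000)
    right = request_cost(1000, 500, correct_input, 0.000003774)
    # Pasting the per-1K input price unchanged inflates the input share 1000x.
    wrong = request_cost(1000, 500, 0.001078, 0.000003774)
    print(f"correct spend: {right:.6f}  with per-1K input price: {wrong:.6f}")
```

With the unconverted price, a virtual key budget would be exhausted roughly a thousand times faster than intended for input-heavy traffic.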
Reference: File Templates
Templates for reference — to understand each file's structure or build from scratch when git is not available. When cloning this repo, the actual files take precedence.
.gitignore
.env
.env.example
# ── Proxy Auth ───────────────────────────────────
LITELLM_MASTER_KEY="sk-change-me"
LITELLM_SALT_KEY="change-me-to-a-long-random-string"
# ── Database ─────────────────────────────────────
DB_PASSWORD="change-me-to-a-strong-password"
# ── Huawei MaaS ──────────────────────────────────
HUAWEI_MAAS_API_KEY="change-me-to-your-maas-api-key"
HUAWEI_MAAS_API_KEY_0="change-me-to-your-maas-api-key"
HUAWEI_MAAS_API_BASE="https://api-ap-southeast-1.modelarts-maas.com/openai/v1"
# ── Prometheus ───────────────────────────────────
PROMETHEUS_RETENTION="15d"
# ── Grafana ──────────────────────────────────────
GRAFANA_PASSWORD="change-me-to-a-strong-password"
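Before starting the stack, it can help to confirm that the populated `.env` has no leftover placeholders and is not readable by other users. The following is a minimal sketch under stated assumptions: `check_env_file` is a hypothetical helper (the repo's own setup script is `scripts/init_env.sh`), and it treats any value containing `change-me` as a placeholder because that is the marker used in `.env.example`.

```python
import stat
from pathlib import Path


def check_env_file(path: str) -> list[str]:
    """Return a list of human-readable problems found in a .env file."""
    problems = []
    env_path = Path(path)
    # Only the permission bits matter here (expected: owner read/write).
    mode = stat.S_IMODE(env_path.stat().st_mode)
    if mode != 0o600:
        problems.append(f"mode is {oct(mode)}, expected 0o600")
    for lineno, raw in enumerate(env_path.read_text().splitlines(), start=1):
        line = raw.strip()
        # Skip blanks, comments, and anything that is not NAME=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        value = value.strip().strip('"').strip("'")
        if not value or "change-me" in value:
            problems.append(f"{name.strip()} (line {lineno}) still holds a placeholder")
    return problems


if __name__ == "__main__":
    for problem in check_env_file(".env"):
        print(problem)
```

The check reports variable names and line numbers only, so its output is safe to paste into an operator note.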
docker-compose.yml
x-default: &default
  restart: unless-stopped
  logging:
    driver: "json-file"
    options:
      max-size: "10m"
      max-file: "3"
services:
  litellm:
    <<: *default
    container_name: litellm_proxy
    image: ghcr.io/berriai/litellm:v1.83.14-stable.patch.3
    ports:
      - "4000:4000"
    volumes:
      - ./assets/config/litellm_config.yaml:/app/config.yaml:ro
      - ./assets/config/custom_callbacks.py:/app/custom_callbacks.py:ro
    environment:
      DATABASE_URL: "postgresql://llmproxy:${DB_PASSWORD}@db:5432/litellm"
      STORE_MODEL_IN_DB: "True"
    env_file:
      - .env
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen('http://localhost:4000/health/liveliness')\""]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    command:
      - "--config=/app/config.yaml"
  db:
    <<: *default
    image: postgres:16-alpine
    container_name: litellm_pg_db
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: llmproxy
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -d litellm -U llmproxy"]
      interval: 5s
      timeout: 5s
      retries: 10
  prometheus:
    <<: *default
    image: prom/prometheus:v3.3.1
    container_name: litellm_prometheus
    volumes:
      - ./assets/config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION:-15d}"
    depends_on:
      litellm:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --spider -q http://localhost:9090/-/healthy || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 10s
  grafana:
    <<: *default
    image: grafana/grafana:11.5.2
    container_name: litellm_grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - ./assets/config/grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana_data:/var/lib/grafana
    depends_on:
      prometheus:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --spider -q http://localhost:3000/api/health || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 15s
volumes:
  postgres_data:
    name: litellm_postgres_data
  prometheus_data:
    name: litellm_prometheus_data
  grafana_data:
    name: litellm_grafana_data
litellm_config.yaml
model_list:
  - model_name: glm-5.1
    litellm_params:
      model: openai/glm-5.1
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 500000
      rpm: 30
    model_info:
      max_tokens: 198000
      max_input_tokens: 192000
      max_output_tokens: 128000
      input_cost_per_token: 0.000001078
      output_cost_per_token: 0.000003774
  - model_name: glm-5
    litellm_params:
      model: openai/glm-5
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 500000
      rpm: 30
    model_info:
      max_tokens: 198000
      max_input_tokens: 192000
      max_output_tokens: 64000
      input_cost_per_token: 0.000000809
      output_cost_per_token: 0.000002965
  - model_name: deepseek-v4-pro
    litellm_params:
      model: openai/deepseek-v4-pro
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 30000
      rpm: 3
    model_info:
      max_tokens: 1000000
      max_input_tokens: 1000000
      max_output_tokens: 128000
      input_cost_per_token: 0.000001617
      output_cost_per_token: 0.000003235
  - model_name: deepseek-v4-flash
    litellm_params:
      model: openai/deepseek-v4-flash
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 30000
      rpm: 3
    model_info:
      max_tokens: 1000000
      max_input_tokens: 1000000
      max_output_tokens: 128000
      input_cost_per_token: 0.000000135
      output_cost_per_token: 0.00000027
  - model_name: deepseek-v3.2
    litellm_params:
      model: openai/deepseek-v3.2
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 500000
      rpm: 700
    model_info:
      max_tokens: 160000
      max_input_tokens: 128000
      max_output_tokens: 32000
      input_cost_per_token: 0.00000027
      output_cost_per_token: 0.000000404
litellm_settings:
  num_retries: 3
  request_timeout: 600
  stream_timeout: 60
  drop_params: True
  set_verbose: False
  callbacks:
    - "prometheus"
    - custom_callbacks.my_prometheus_logger
  ui_theme_config:
    logo_url: "https://upload.wikimedia.org/wikipedia/en/thumb/0/04/Huawei_Standard_logo.svg/3840px-Huawei_Standard_logo.svg.png"
    favicon_url: "https://upload.wikimedia.org/wikipedia/en/thumb/0/04/Huawei_Standard_logo.svg/3840px-Huawei_Standard_logo.svg.png"
router_settings:
  routing_strategy: simple-shuffle
  num_retries: 3
  cooldown_time: 30
  allowed_fails: 3
general_settings:
  database_connection_pool_limit: 10
  database_connection_timeout: 60
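Multi-key load balancing relies on several `model_list` entries sharing one `model_name`, each with its own `api_key`, so the router can spread requests across them. The sketch below shows how one template entry could be expanded into one deployment per key. It is an illustration only: `expand_deployments` is a hypothetical helper, and how `scripts/generate_config.sh` actually maps the key variables from `.env` is not shown in this section.

```python
import copy


def expand_deployments(entry: dict, key_env_names: list[str]) -> list[dict]:
    """Return one deployment per key env var, all sharing entry['model_name']."""
    deployments = []
    for env_name in key_env_names:
        clone = copy.deepcopy(entry)  # leave the template entry untouched
        # LiteLLM resolves "os.environ/NAME" at startup, so no secret is
        # ever written into the generated config.
        clone["litellm_params"]["api_key"] = f"os.environ/{env_name}"
        deployments.append(clone)
    return deployments


if __name__ == "__main__":
    template = {
        "model_name": "deepseek-v4-pro",
        "litellm_params": {
            "model": "openai/deepseek-v4-pro",
            "api_base": "os.environ/HUAWEI_MAAS_API_BASE",
            "api_key": "os.environ/HUAWEI_MAAS_API_KEY",
            "tpm": 30000,
            "rpm": 3,
        },
    }
    expanded = expand_deployments(
        template, ["HUAWEI_MAAS_API_KEY", "HUAWEI_MAAS_API_KEY_0"]
    )
    total_rpm = sum(d["litellm_params"]["rpm"] for d in expanded)
    print(f"{len(expanded)} deployments, combined rpm limit {total_rpm}")
```

Because each deployment carries its own `rpm` and `tpm`, two distinct keys for a 3 RPM model give the model group a combined limit of 6 RPM, provided the keys really have independent upstream quotas.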
custom_callbacks.py
"""
Custom LiteLLM callback that emits TTFT, TPOT, and ITL as Prometheus histograms.
TTFT = completion_start_time - api_call_start_time (streaming only)
TPOT = total_latency / output_tokens
ITL = (end_time - completion_start_time) / (output_tokens - 1) (streaming only, requires output_tokens > 1)
"""
from datetime import datetime
from litellm.integrations.custom_logger import CustomLogger
from prometheus_client import Histogram


def _to_timestamp(val):
    """Convert datetime or numeric to a float unix timestamp."""
    if val is None:
        return None
    if isinstance(val, (int, float)):
        return float(val)
    if hasattr(val, "timestamp"):
        return val.timestamp()
    return None


class PrometheusTTFTTPOTITL(CustomLogger):
    """Custom callback that emits TTFT, TPOT, and ITL as Prometheus histograms."""

    def __init__(self):
        super().__init__()
        self.ttft = Histogram(
            "litellm_custom_ttft_seconds",
            "Time to first token in seconds (streaming only)",
            labelnames=["model", "model_group", "api_provider"],
            buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0),
        )
        self.tpot = Histogram(
            "litellm_custom_tpot_seconds",
            "Time per output token in seconds",
            labelnames=["model", "model_group", "api_provider"],
            buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 5.0),
        )
        self.itl = Histogram(
            "litellm_custom_itl_seconds",
            "Inter-token latency in seconds (average between successive tokens, streaming only)",
            labelnames=["model", "model_group", "api_provider"],
            buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 5.0),
        )

    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        try:
            stream = kwargs.get("stream", False)
            completion_start_time = kwargs.get("completion_start_time")
            api_call_start_time = kwargs.get("api_call_start_time")
            slo = kwargs.get("standard_logging_object") or {}
            model = slo.get("model") or kwargs.get("model", "unknown")
            model_group = slo.get("model_group") or model
            api_provider = slo.get("custom_llm_provider") or "unknown"
            labels = {"model": model, "model_group": model_group, "api_provider": api_provider}

            # Read completion_tokens from either a dict-like or attribute-style usage object.
            output_tokens = 0
            if response_obj is not None:
                usage = None
                if hasattr(response_obj, "get"):
                    usage = response_obj.get("usage")
                elif hasattr(response_obj, "usage"):
                    usage = response_obj.usage
                if usage is not None:
                    if isinstance(usage, dict):
                        output_tokens = usage.get("completion_tokens", 0) or 0
                    elif hasattr(usage, "completion_tokens"):
                        output_tokens = usage.completion_tokens or 0

            start_ts = _to_timestamp(start_time)
            end_ts = _to_timestamp(end_time)
            api_start_ts = _to_timestamp(api_call_start_time)
            comp_start_ts = _to_timestamp(completion_start_time)

            # TTFT: time from the upstream API call to the first streamed chunk.
            if stream and api_start_ts and comp_start_ts:
                ttft_seconds = comp_start_ts - api_start_ts
                if ttft_seconds > 0:
                    self.ttft.labels(**labels).observe(ttft_seconds)

            # TPOT: total request latency averaged over all output tokens.
            if output_tokens > 0 and start_ts and end_ts:
                total_latency = end_ts - start_ts
                tpot_seconds = total_latency / output_tokens
                self.tpot.labels(**labels).observe(tpot_seconds)

            # ITL: average gap between successive tokens after the first one.
            # end_ts is checked here so a missing end time cannot raise a TypeError.
            if stream and comp_start_ts and end_ts:
                streaming_duration = end_ts - comp_start_ts
                if streaming_duration > 0 and output_tokens > 1:
                    itl_seconds = streaming_duration / (output_tokens - 1)
                    self.itl.labels(**labels).observe(itl_seconds)
        except Exception as e:
            print(f"[PrometheusTTFTTPOTITL] Error: {e}")


my_prometheus_logger = PrometheusTTFTTPOTITL()
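The three formulas in the callback can be checked by hand with a standalone function that takes plain float timestamps. This is a sketch for reasoning about the numbers, not part of the shipped callback; `latency_metrics` and its argument names are hypothetical.

```python
def latency_metrics(start, api_start, first_token, end, output_tokens):
    """Return (ttft, tpot, itl) in seconds; a metric is None when undefined."""
    # TTFT: upstream call start to first streamed chunk.
    ttft = first_token - api_start
    # TPOT: whole-request latency averaged over every output token.
    tpot = (end - start) / output_tokens if output_tokens > 0 else None
    # ITL: time after the first token averaged over the remaining gaps.
    itl = (end - first_token) / (output_tokens - 1) if output_tokens > 1 else None
    return ttft, tpot, itl


if __name__ == "__main__":
    # A request that starts at t=0.0, reaches the upstream API at t=0.1,
    # streams its first token at t=0.6, and finishes at t=4.6 with 101 tokens.
    ttft, tpot, itl = latency_metrics(0.0, 0.1, 0.6, 4.6, 101)
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.4f}s ITL={itl:.4f}s")
```

TPOT includes the wait for the first token while ITL does not, so for short completions TPOT can be noticeably larger than ITL even when the stream itself is fast.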
prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "litellm"
    static_configs:
      - targets: ["litellm:4000"]
grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: LiteLLM
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
The pre-built litellm_overview.json dashboard is at assets/config/grafana/provisioning/dashboards/litellm_overview.json.
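Once Prometheus is scraping the proxy, the custom histograms can also be queried outside Grafana. The sketch below only builds a PromQL string and a query URL for the Prometheus HTTP API. The helper names are hypothetical, the grouping by `model_group` and the 5m window are assumptions, and the bundled dashboard may use different expressions.

```python
from urllib.parse import urlencode


def quantile_query(metric: str, q: float = 0.95, window: str = "5m") -> str:
    """PromQL for a latency quantile over a histogram, grouped by model_group."""
    # prometheus_client exposes histogram buckets as <metric>_bucket with an `le` label.
    return (
        f"histogram_quantile({q}, "
        f"sum by (le, model_group) (rate({metric}_bucket[{window}])))"
    )


def query_url(base_url: str, promql: str) -> str:
    """Instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    promql = quantile_query("litellm_custom_ttft_seconds")
    # Port 9090 is published by the compose file, so localhost works from the host.
    print(query_url("http://localhost:9090", promql))
```

The same expression with `litellm_custom_tpot_seconds` or `litellm_custom_itl_seconds` covers the other two histograms emitted by the callback.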
Bundled Resources
- references/architecture.md — topology, services, volumes
- references/metrics-and-dashboards.md — PromQL, custom metrics, Grafana
- references/operations.md — health checks, backup, restart, usage
- references/troubleshooting.md — repair playbook, failure modes
- scripts/init_env.sh — interactive .env setup (manual, agent-guided, or CI)
- scripts/generate_config.sh — generates litellm_config.yaml from .env
- scripts/validate_e2e.sh — 12-step end-to-end validation
Output Expectations
On completion, leave behind:
- `docker compose ps` showing all four services healthy
- `.env` populated with real secrets, chmod 600, no placeholders
- Validated: direct MaaS request, proxied request, streaming, metrics, Grafana, virtual key
- Operator note listing: endpoints, file paths, master key location, MaaS region, virtual keys created
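The "all four services healthy" expectation can be checked mechanically from `docker compose ps --format json`. The sketch below is hedged: `services_not_healthy` is a hypothetical helper, it assumes each JSON record carries `Service` and `Health` fields, and it accepts both a JSON array and one object per line because the output shape differs between Compose versions.

```python
import json

# Service names as defined in the compose template.
EXPECTED = ("db", "litellm", "prometheus", "grafana")


def services_not_healthy(ps_output: str, expected=EXPECTED) -> list[str]:
    """Return expected services that are missing or not reported as healthy."""
    text = ps_output.strip()
    if not text:
        rows = []
    elif text.startswith("["):
        rows = json.loads(text)  # single JSON array
    else:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    health = {row.get("Service"): row.get("Health") for row in rows}
    return [name for name in expected if health.get(name) != "healthy"]


if __name__ == "__main__":
    import subprocess

    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    pending = services_not_healthy(result.stdout)
    print("all healthy" if not pending else f"not healthy: {pending}")
```

A service still in its `start_period` reports `starting`, so it is listed as not healthy until its healthcheck passes.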
Verification Exit Criteria