litellm-huawei-maas-proxy

Deploy, configure, validate, troubleshoot, or extend an OpenAI-compatible API proxy backed by PostgreSQL, Prometheus, and Grafana, routing through Huawei ModelArts MaaS (ap-southeast-1) with multi-key load balancing. TRIGGER when the task involves LiteLLM proxy deployment, Docker Compose stack with litellm_config.yaml, Huawei MaaS model routing, virtual key or budget management, Prometheus/Grafana observability for LLM traffic, custom_callbacks.py TTFT/TPOT/ITL metrics, multi-key load balancing, or any reference to `LITELLM_MASTER_KEY`, `HUAWEI_MAAS_API_KEY`, or `docker compose` with this stack.

Updated: June 2, 2026 at 13:11

Source: binrogithub/1-3-Cloud-Adoption-Skills

LiteLLM Huawei MaaS Proxy

Deploy an OpenAI-compatible API proxy backed by PostgreSQL, Prometheus, and Grafana, routing through Huawei ModelArts MaaS (ap-southeast-1) with multi-key load balancing.

This repo ships runtime stack files for deterministic clone-and-run deployment. The file templates below serve as reference for understanding each file or building from scratch when git is not available.

When to Use

Situation	Route
Deploy the full stack from scratch	Follow Deployment Workflow
Add or modify a model in the proxy	Follow Adding a new model
Troubleshoot a broken deployment	Follow Repair Playbook, then Common failure modes
Validate an existing deployment	Follow Validation Sequence
Manage virtual keys, budgets, or teams	Follow Virtual key management
Extend observability (custom metrics, dashboards)	Read Metrics and Grafana Dashboard sections
Backup, restore, or reset data	Follow Operations

When NOT to use:

Direct MaaS API calls without proxy (no spend tracking, no rate limiting)
Non-Huawei LLM providers (this stack is MaaS-specific)
Multi-host / Kubernetes deployment (this is a single-host Docker Compose stack)

Required Inputs

Confirm before making changes:

Huawei MaaS API key — from ModelArts MaaS console. Mandatory main key.
Additional MaaS API keys (optional) — for load balancing. Each extra key multiplies effective RPM/TPM per model.
Huawei MaaS API base — https://api-ap-southeast-1.modelarts-maas.com/openai/v1 (ap-southeast-1 / CN-Hong Kong). Do not swap regions without re-validating models and quotas.
Explicit MaaS model IDs to expose (e.g. glm-5.1, deepseek-v4-flash). Verify in MaaS console — do not guess. If the user only gives one model, prefer explicit routing for that model instead of adding all five.
LiteLLM listen port — default 4000. Override in docker-compose.yml ports section if colliding with an existing service.
Prometheus listen port — default 9090.
Grafana listen port — default 3000.
Prometheus retention — default 15d. Adjust for disk capacity.
Whether virtual keys already exist — if yes, LITELLM_SALT_KEY is immutable and cannot be changed.
Docker 20.10+ with Compose V2 on the target host.

All of these are collected by ./scripts/init_env.sh (interactive, --auto, or --ci mode). The user can always choose manual .env editing instead.

Core Rules

Never commit .env, real API keys, virtual keys, or bearer tokens. Secrets live in .env (gitignored) with 0600 permissions.
Never change LITELLM_SALT_KEY after virtual keys exist. Recovery requires docker compose down -v and fresh start.
Model names are case-sensitive. Must match MaaS console exactly.
MaaS is region-locked to ap-southeast-1.
LiteLLM config is generated by scripts/generate_config.sh. Do not edit litellm_config.yaml directly — it will be overwritten. Edit .env and re-run generate_config.sh.
LiteLLM reads its config only at startup. Changes require docker compose restart litellm.
Every model must have non-zero input_cost_per_token and output_cost_per_token for budget enforcement to work.
Keep master key admin-only. Mint child virtual keys per team/service/environment.
Make proxy the only egress path for MaaS traffic so budgets, rate limits, and spend logs stay centralized.
STORE_MODEL_IN_DB: True — DB models take precedence over config file models.
drop_params: True — unsupported parameters silently dropped rather than causing errors.
TTFT and ITL custom metrics are streaming-only.
With N MaaS API keys, each model has N deployments. LiteLLM load-balances across them. Effective RPM/TPM = per-key × N.
HUAWEI_MAAS_API_KEY is mandatory. Extra keys are optional. HUAWEI_MAAS_API_KEY_COUNT and HUAWEI_MAAS_API_KEY_N are set by init_env.sh.

Multi-Key Load Balancing

The proxy supports multiple MaaS API keys for increased throughput and resilience:

Concept	Detail
Main key	`HUAWEI_MAAS_API_KEY` — mandatory, always required
Extra keys	Optional, collected by `init_env.sh` or set via `HUAWEI_MAAS_EXTRA_API_KEYS` (CI mode)
Internal vars	`HUAWEI_MAAS_API_KEY_COUNT=N`, `HUAWEI_MAAS_API_KEY_0` through `_N-1` — set by `init_env.sh`
Config generation	`scripts/generate_config.sh` reads `.env` and generates `litellm_config.yaml` with N deployments per model
Routing strategy	`simple-shuffle` (default): random selection across healthy deployments. Alternatives: `least-busy`, `latency-based-routing`
Cooldown	`cooldown_time: 30`, `allowed_fails: 3` — failed deployments temporarily removed from rotation
Effective capacity	RPM/TPM per model = per-key × N keys
Backward compatible	Single key (N=1) = identical behavior to before
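
The capacity rule in the table is a plain multiplication. A minimal sketch, using the per-key limits listed for glm-5.1 in the Models table; the function name is hypothetical and not part of the repo:

```python
def effective_capacity(rpm_per_key: int, tpm_per_key: int, n_keys: int) -> tuple[int, int]:
    """Return (rpm, tpm) available to one model when n_keys MaaS keys are configured."""
    if n_keys < 1:
        raise ValueError("HUAWEI_MAAS_API_KEY is mandatory, so n_keys must be >= 1")
    return rpm_per_key * n_keys, tpm_per_key * n_keys


# glm-5.1 is listed at 30 RPM / 500K TPM per key.
print(effective_capacity(30, 500_000, 3))  # (90, 1500000)
```

This is an upper bound: it assumes every key carries the same quota and that no deployment is in cooldown.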

Adding/removing MaaS API keys

Add a key:

Add HUAWEI_MAAS_API_KEY_N=<key> to .env (next index)
Increment HUAWEI_MAAS_API_KEY_COUNT
Run scripts/generate_config.sh
Run docker compose restart litellm

Remove a key:

Remove the HUAWEI_MAAS_API_KEY_N line from .env
Re-number remaining keys to be contiguous (_0, _1, ...)
Update HUAWEI_MAAS_API_KEY_COUNT
Run scripts/generate_config.sh
Run docker compose restart litellm
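
The re-numbering step is easy to get wrong by hand. The sketch below shows one way to make the indexed keys contiguous and update the count in one pass. It assumes .env holds plain KEY=value lines; `reindex_maas_keys` is a hypothetical helper, not a script shipped in the repo. It does not touch HUAWEI_MAAS_API_KEY itself, so if index 0 changes, update the main key to match.

```python
import re

KEY_LINE = re.compile(r"^HUAWEI_MAAS_API_KEY_(\d+)=(.*)$")
COUNT_LINE = re.compile(r"^HUAWEI_MAAS_API_KEY_COUNT=")


def reindex_maas_keys(env_text: str) -> str:
    """Renumber indexed keys to _0, _1, ... and rewrite the count to match."""
    keys = []    # (old_index, value) for each indexed key line
    other = []   # every other line, kept in its original order
    for line in env_text.splitlines():
        match = KEY_LINE.match(line)
        if match:
            keys.append((int(match.group(1)), match.group(2)))
        elif not COUNT_LINE.match(line):
            other.append(line)
    keys.sort()  # preserve the relative order of the surviving keys
    out = other + [f"HUAWEI_MAAS_API_KEY_COUNT={len(keys)}"]
    out += [f"HUAWEI_MAAS_API_KEY_{i}={value}" for i, (_, value) in enumerate(keys)]
    return "\n".join(out) + "\n"


# Key 1 was deleted, leaving a gap between _0 and _2.
sample = (
    "HUAWEI_MAAS_API_KEY=key-a\n"
    "HUAWEI_MAAS_API_KEY_COUNT=3\n"
    "HUAWEI_MAAS_API_KEY_0=key-a\n"
    "HUAWEI_MAAS_API_KEY_2=key-c\n"
)
print(reindex_maas_keys(sample))
```

After rewriting .env this way, the generate_config.sh and restart steps above still apply.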

Change routing strategy:

./scripts/generate_config.sh --routing-strategy=least-busy
docker compose restart litellm

Architecture

Client → LiteLLM (:4000) → Huawei MaaS (ap-southeast-1)
               │               │
               │          ┌────┴────┐
               │          │ N API   │  (N = HUAWEI_MAAS_API_KEY_COUNT)
               │          │ keys    │  LiteLLM load-balances across N deployments
               │          └────────┘
               ├── PostgreSQL (:5432)  — keys, usage, spend
               ├── Prometheus (:9090)  — /metrics scrape every 15s
               └── Grafana   (:3000)  — pre-built dashboard

Startup chain: PostgreSQL (pg_isready) → LiteLLM (/health/liveliness) → Prometheus (scrape) → Grafana.

Request flow: Client → LiteLLM:4000 → (router selects deployment) → Huawei MaaS. LiteLLM logs usage/spend to PostgreSQL, exposes /metrics for Prometheus, returns response.

With N MaaS API keys, each model has N deployments. LiteLLM's router distributes requests across deployments using the configured routing_strategy (default: simple-shuffle). Total effective RPM/TPM = per-key × N.

Multi-Key Load Balancing

The proxy supports multiple MaaS API keys for load balancing and increased throughput:

Environment variables

Variable	Set by	Description
`HUAWEI_MAAS_API_KEY`	Manual / init_env.sh	Main MaaS API key (mandatory)
`HUAWEI_MAAS_API_KEY_COUNT`	init_env.sh	Total number of keys (1 + extra)
`HUAWEI_MAAS_API_KEY_0`	init_env.sh	Indexed key 0 (same as main key)
`HUAWEI_MAAS_API_KEY_N`	init_env.sh	Indexed keys 1, 2, 3...
`HUAWEI_MAAS_EXTRA_API_KEYS`	Manual (CI mode)	Comma-separated extra keys

Config generation

litellm_config.yaml is generated by scripts/generate_config.sh from litellm_config.yaml.example
The generated file is gitignored — never edit it directly
With N keys, each model has N deployments (e.g., glm-5.1 → glm-5.1--maas-key-0, glm-5.1--maas-key-1, ...)
Total effective RPM/TPM = per-key × N

Router settings

Setting	Value	Purpose
`routing_strategy`	`simple-shuffle`	Random selection across healthy deployments
`cooldown_time`	30	Seconds to remove failed deployment from rotation
`allowed_fails`	3	Failures before cooldown triggers
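
To make the interaction between these three settings concrete, here is a toy model of random selection with cooldown. It is an illustration only, not LiteLLM's router code; in particular, the exact point at which LiteLLM counts failures and triggers a cooldown may differ from this simplification. The class and method names are hypothetical.

```python
import random


class ToyRouter:
    """Illustrative stand-in for a shuffle router with cooldown; not LiteLLM code."""

    def __init__(self, deployments, allowed_fails=3, cooldown_time=30.0, seed=None):
        self.deployments = list(deployments)
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fails = {d: 0 for d in self.deployments}
        self.cooled_until = {d: 0.0 for d in self.deployments}
        self.rng = random.Random(seed)

    def healthy(self, now):
        """Deployments whose cooldown (if any) has expired at time `now`."""
        return [d for d in self.deployments if self.cooled_until[d] <= now]

    def pick(self, now):
        """Choose uniformly at random among healthy deployments."""
        candidates = self.healthy(now)
        if not candidates:
            raise RuntimeError("all deployments are cooling down")
        return self.rng.choice(candidates)

    def record_failure(self, deployment, now):
        """Count a failure; start a cooldown once allowed_fails is reached."""
        self.fails[deployment] += 1
        if self.fails[deployment] >= self.allowed_fails:
            self.cooled_until[deployment] = now + self.cooldown_time
            self.fails[deployment] = 0


router = ToyRouter(["glm-5.1--maas-key-0", "glm-5.1--maas-key-1"], seed=0)
for _ in range(3):
    router.record_failure("glm-5.1--maas-key-0", now=0.0)
print(router.healthy(now=10.0))  # ['glm-5.1--maas-key-1']
print(router.healthy(now=30.0))  # both deployments again
```

With a single key there is only one deployment, so a cooldown leaves nothing to route to until it expires; extra keys are what make the cooldown useful.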

Backward compatibility

Single key = identical behavior to before. No changes required for existing single-key deployments.

Codebase

.
├── README.md                                       human-facing overview
├── SKILL.md                                        agent-facing workflow (this file)
├── docker-compose.yml                              4-service orchestrator (references assets/config/)
├── assets/config/
│   ├── litellm_config.yaml.example                model catalog example (tracked in git)
│   ├── litellm_config.yaml                         generated config (gitignored)
│   ├── custom_callbacks.py                         TTFT/TPOT/ITL Prometheus histograms
│   ├── prometheus.yml                              15s scrape → litellm:4000
│   ├── .env.example                                environment template
│   └── grafana/
│       └── provisioning/
│           ├── datasources/prometheus.yml           auto-linked Prometheus datasource
│           └── dashboards/
│               ├── dashboards.yml                   file-based provider, 30s refresh
│               └── litellm_overview.json            pre-built overview dashboard
├── references/
│   ├── architecture.md                              topology, services, volumes, environment
│   ├── metrics-and-dashboards.md                    PromQL, custom metrics, Grafana panel config
│   ├── operations.md                                health checks, backup, restart, usage, endpoints
│   └── troubleshooting.md                           repair playbook, failure modes, common mistakes
├── scripts/
│   ├── init_env.sh                                  interactive .env setup (manual or agent-guided)
│   ├── generate_config.sh                           generates litellm_config.yaml from .env
│   └── validate_e2e.sh                              12-step end-to-end validation
├── .env                                             actual secrets (gitignored)
└── .gitignore                                       .env and litellm_config.yaml

File-by-file reference

File	Role	Key details
`docker-compose.yml`	Service orchestration	YAML anchor, 4 services with healthcheck chain, named volumes, mounts from `./assets/config/`
`assets/config/litellm_config.yaml.example`	Model catalog example	`openai/` prefix + MaaS endpoint, `tpm`/`rpm` per model, per-token pricing, tracked in git
`assets/config/litellm_config.yaml`	Generated config	Created by `generate_config.sh`, gitignored, N deployments per model
`assets/config/custom_callbacks.py`	Custom Prometheus metrics	`PrometheusTTFTTPOTITL(CustomLogger)`, 3 histograms labeled by `model`, `model_group`, `api_provider`
`assets/config/prometheus.yml`	Scrape config	Single job `litellm` at 15s interval
`assets/config/grafana/provisioning/datasources/prometheus.yml`	Datasource	Prometheus type, proxy access, `http://prometheus:9090`
`assets/config/grafana/provisioning/dashboards/dashboards.yml`	Dashboard provider	File-based, org 1, 30s update interval
`assets/config/grafana/provisioning/dashboards/litellm_overview.json`	Pre-built dashboard	UID `litellm-overview`, 10s auto-refresh, template variables: `model`, `datasource`, includes Deployment Load Balancing row
`scripts/generate_config.sh`	Config generator	Reads `.env`, generates `litellm_config.yaml` from template, creates N deployments per model

Docker Compose Services

Service	Image	Container name	Port	Healthcheck	Depends on
`litellm`	`ghcr.io/berriai/litellm:v1.83.14-stable.patch.3`	`litellm_proxy`	`4000:4000`	`GET /health/liveliness` every 30s, 10s timeout, 3 retries, 40s start period	`db` (healthy)
`db`	`postgres:16-alpine`	`litellm_pg_db`	(internal 5432)	`pg_isready` every 5s, 5s timeout, 10 retries	—
`prometheus`	`prom/prometheus:v3.3.1`	`litellm_prometheus`	`9090:9090`	`GET /-/healthy` every 15s, 5s timeout, 3 retries, 10s start period	`litellm` (healthy)
`grafana`	`grafana/grafana:11.5.2`	`litellm_grafana`	`3000:3000`	`GET /api/health` every 15s, 5s timeout, 3 retries, 15s start period	`prometheus` (healthy)

Volume mounts

Service	Host path	Container path	Mode
`litellm`	`./assets/config/litellm_config.yaml`	`/app/config.yaml`	ro (generated file)
`litellm`	`./assets/config/custom_callbacks.py`	`/app/custom_callbacks.py`	ro
`db`	`postgres_data` volume	`/var/lib/postgresql/data`	rw
`prometheus`	`./assets/config/prometheus.yml`	`/etc/prometheus/prometheus.yml`	ro
`prometheus`	`prometheus_data` volume	`/prometheus`	rw
`grafana`	`./assets/config/grafana/provisioning`	`/etc/grafana/provisioning`	ro
`grafana`	`grafana_data` volume	`/var/lib/grafana`	rw

Named volumes

Volume name	Survives `down`?	Removed by
`litellm_postgres_data`	Yes	`docker compose down -v`
`litellm_prometheus_data`	Yes	`docker compose down -v`
`litellm_grafana_data`	Yes	`docker compose down -v`

LiteLLM container environment

Set via env_file: .env plus explicit environment:

Variable	Source	Value
`DATABASE_URL`	docker-compose	`postgresql://llmproxy:${DB_PASSWORD}@db:5432/litellm`
`STORE_MODEL_IN_DB`	docker-compose	`True`
`LITELLM_MASTER_KEY`	.env	Admin key, must start with `sk-`
`LITELLM_SALT_KEY`	.env	Key encryption salt
`HUAWEI_MAAS_API_KEY`	.env	Main Huawei MaaS API key
`HUAWEI_MAAS_API_KEY_N`	.env	Indexed MaaS API keys (0, 1, 2...)
`HUAWEI_MAAS_API_KEY_COUNT`	.env	Number of MaaS API keys
`HUAWEI_MAAS_API_BASE`	.env	`https://api-ap-southeast-1.modelarts-maas.com/openai/v1`

LiteLLM command: --config=/app/config.yaml

Deployment Workflow

Follow in order. Do not skip validation steps.

0. Preflight

docker --version          # expect 20.10+
docker compose version    # expect v2

1. Install from monorepo

MONOREPO="https://github.com/binrogithub/1-3-Cloud-Adoption-Skills.git"
TEMP_DIR="/home/1-3-Cloud-Adoption-Skills"
LITELLM_DIR="/home/LiteLLM-Huawei-MaaS-Proxy"

git clone --depth 1 "$MONOREPO" "$TEMP_DIR"
cp -r "$TEMP_DIR/AI/AI-Coding/LiteLLM-Huawei-MaaS-Proxy" "$LITELLM_DIR"
rm -rf "$TEMP_DIR"
cd "$LITELLM_DIR"

2. Configure `.env`

Two paths — choose one:

Guided (recommended for agents and first-time deployers):

./scripts/init_env.sh              # interactive — prompt for each secret, ask for number of extra MaaS keys
./scripts/init_env.sh --auto       # agent mode — auto-generate all, prompt for MaaS API key(s)
./scripts/init_env.sh --ci         # CI mode — all from env vars (including HUAWEI_MAAS_EXTRA_API_KEYS)

The script writes .env with 0600 permissions, validates required values, and refuses to proceed if HUAWEI_MAAS_API_KEY is missing or placeholder. After collecting keys, it sets HUAWEI_MAAS_API_KEY_COUNT and indexed HUAWEI_MAAS_API_KEY_N vars.

Manual (full control over every value):

cp assets/config/.env.example .env
$EDITOR .env                       # fill all secrets and HUAWEI_MAAS_API_KEY(s)
chmod 600 .env

Or generate secrets individually:

python3 -c "import secrets; print('sk-' + secrets.token_urlsafe(32))"   # MASTER_KEY
python3 -c "import secrets; print(secrets.token_urlsafe(32))"            # SALT_KEY, passwords

2b. Generate config

After .env is configured:

./scripts/generate_config.sh       # generates litellm_config.yaml from template and .env

This reads HUAWEI_MAAS_API_KEY_COUNT and creates N deployments per model.

3. Pre-deploy validation

Before starting the stack, verify .env is complete:

source .env
for VAR in LITELLM_MASTER_KEY LITELLM_SALT_KEY DB_PASSWORD HUAWEI_MAAS_API_KEY; do
  VAL="${!VAR:-}"
  [ -z "$VAL" ] && echo "MISSING: $VAR" || echo "OK: $VAR (len=${#VAL})"
done

If any variable is missing or contains a placeholder, halt and fix before proceeding. The stack will fail at runtime with incomplete secrets.
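
The shell loop above only detects empty values. A hedged Python sketch that also flags likely placeholders is shown below; the placeholder markers are assumptions (check them against the actual .env.example), and `check_env` is a hypothetical helper rather than part of the repo.

```python
REQUIRED = ("LITELLM_MASTER_KEY", "LITELLM_SALT_KEY", "DB_PASSWORD", "HUAWEI_MAAS_API_KEY")
# Assumed placeholder markers; adjust to whatever .env.example actually contains.
PLACEHOLDER_HINTS = ("changeme", "your-", "<")


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_env(text):
    """Return a list of human-readable problems; an empty list means OK."""
    values = parse_env(text)
    problems = []
    for name in REQUIRED:
        value = values.get(name, "")
        if not value:
            problems.append(f"MISSING: {name}")
        elif any(hint in value.lower() for hint in PLACEHOLDER_HINTS):
            problems.append(f"PLACEHOLDER: {name}")
    master = values.get("LITELLM_MASTER_KEY", "")
    if master and not master.startswith("sk-"):
        problems.append("LITELLM_MASTER_KEY must start with sk-")
    return problems


sample = "LITELLM_MASTER_KEY=sk-example\nDB_PASSWORD=changeme\n"
print(check_env(sample))
```

Against a real file, pass `open(".env").read()` instead of the inline sample. Unlike `source .env`, this never executes the file's contents.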

4. Start the stack

docker compose up -d

5. Wait for healthy services

docker compose ps

All four services must show healthy or running. LiteLLM has a 40s start period.

6. Validate direct MaaS connectivity

curl -s https://api-ap-southeast-1.modelarts-maas.com/openai/v1/models \
  -H "Authorization: Bearer $HUAWEI_MAAS_API_KEY" | jq '.data[].id'

Expect a list of model IDs. If 403, key is wrong or expired.

7. Validate LiteLLM health

curl -s http://localhost:4000/health/liveliness
curl -s http://localhost:4000/health \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.healthy_count, .unhealthy_count'

Expect unhealthy_count: 0.

8. Validate proxied chat completion

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Reply with OK only."}]}' | jq '.choices[0].message.content'

9. Validate streaming

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v4-flash", "messages": [{"role": "user", "content": "Count to 3."}], "stream": true}' | head -5

Expect SSE chunks (data: {...}).

10. Validate Prometheus metrics

curl -s http://localhost:4000/metrics | grep -c "litellm_"
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

Expect metric count > 0 and Prometheus target health = up.

11. Validate Grafana dashboard

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/api/health

Expect 200.

12. Validate virtual key minting

curl -s -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"models": ["glm-5.1"], "max_budget": 1.0, "duration": "1d"}' | jq '.key'

Expect a virtual key starting with sk-.

Validation Sequence

For an existing deployment, run in order:

docker compose ps — all services healthy
curl http://localhost:4000/health/liveliness — LiteLLM process up
curl http://localhost:4000/health -H "Authorization: Bearer $LITELLM_MASTER_KEY" — upstream reachable per model
Chat completion with master key on glm-5.1 — sync path
Streaming completion — SSE path
/key/generate — mints a virtual key
Chat completion with virtual key — multi-user path and budget hooks
/metrics | grep -c litellm_ — metrics flowing
Prometheus targets — scraping
http://localhost:3000 — Grafana reachable

Environment Reference

Variable	Required	Default	Description
`LITELLM_MASTER_KEY`	Yes	—	Admin key, must start with `sk-`
`LITELLM_SALT_KEY`	Yes	—	Key encryption salt — immutable after first virtual key
`DB_PASSWORD`	Yes	—	PostgreSQL password for `llmproxy` user
`HUAWEI_MAAS_API_KEY`	Yes	—	Main MaaS API key from ModelArts console (CN-Hong Kong)
`HUAWEI_MAAS_API_BASE`	Yes	—	`https://api-ap-southeast-1.modelarts-maas.com/openai/v1`
`HUAWEI_MAAS_API_KEY_COUNT`	Auto	1	Number of MaaS API keys (set by init_env.sh)
`HUAWEI_MAAS_API_KEY_N`	Auto	—	Indexed keys (0, 1, 2...). Set by init_env.sh.
`HUAWEI_MAAS_EXTRA_API_KEYS`	No	—	Comma-separated extra keys for CI mode
`PROMETHEUS_RETENTION`	No	`15d`	Prometheus TSDB retention period
`GRAFANA_PASSWORD`	No	`admin`	Grafana admin password

Endpoints

Service	URL	Auth
LiteLLM API	`http://localhost:4000`	`Authorization: Bearer <key>`
LiteLLM Admin UI	`http://localhost:4000/ui`	Login with `LITELLM_MASTER_KEY`
Prometheus	`http://localhost:9090`	None
Grafana	`http://localhost:3000`	admin / `GRAFANA_PASSWORD`

LiteLLM API routes

Route	Method	Description
`/v1/chat/completions`	POST	OpenAI-compatible chat completions
`/v1/models`	GET	List available models
`/health/liveliness`	GET	Liveness probe (used by healthcheck)
`/health`	GET	Per-model health (auth required)
`/metrics`	GET	Prometheus metrics endpoint
`/key/generate`	POST	Generate scoped virtual key
`/key/info`	POST	Get key info
`/key/update`	POST	Update key settings
`/key/delete`	POST	Delete a key
`/model/info`	GET	Model details including pricing (auth required)
`/ui`	GET	Admin UI

Models

Name	Max tokens (in / out)	RPM	TPM	Cost (in / out, USD per token)
`glm-5.1`	192K / 128K	30	500K	$1.078 / $3.774 × 10⁻⁶
`glm-5`	192K / 64K	30	500K	$0.809 / $2.965 × 10⁻⁶
`deepseek-v4-pro`	1M / 128K	3	30K	$1.617 / $3.235 × 10⁻⁶
`deepseek-v4-flash`	1M / 128K	3	30K	$0.135 / $0.270 × 10⁻⁶
`deepseek-v3.2`	128K / 32K	700	500K	$0.270 / $0.404 × 10⁻⁶
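
As a worked example of how per-token pricing turns into spend, the sketch below estimates the cost of one request. It assumes the × 10⁻⁶ factor in the table applies to both the input and the output price; `request_cost` is a hypothetical helper, and LiteLLM's own spend tracking remains the source of truth.

```python
# Per-token USD prices copied from the table above: (input, output).
PRICES = {
    "glm-5.1": (1.078e-6, 3.774e-6),
    "deepseek-v4-flash": (0.135e-6, 0.270e-6),
}


def request_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request at the listed per-token prices."""
    input_price, output_price = PRICES[model]
    return input_tokens * input_price + output_tokens * output_price


# 1,000 input tokens and 500 output tokens on glm-5.1.
print(f"{request_cost('glm-5.1', 1000, 500):.6f}")  # 0.002965
```

At these prices a `max_budget` of 1.0 on a virtual key covers roughly 337 such requests, which is why a zero price silently disables budget enforcement.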

Model configuration structure

Single key (backward compatible):

- model_name: <public-name>
  litellm_params:
    model: openai/<maas-model-name>
    api_base: os.environ/HUAWEI_MAAS_API_BASE
    api_key: os.environ/HUAWEI_MAAS_API_KEY
    tpm: <tokens-per-minute>
    rpm: <requests-per-minute>
  model_info:
    max_tokens: <total>
    max_input_tokens: <input>
    max_output_tokens: <output>
    input_cost_per_token: <price>
    output_cost_per_token: <price>

Multi-key (N keys → N deployments per model):

With N MaaS API keys, generate_config.sh creates N deployments per model:

- model_name: <public-name>
  litellm_params:
    model: openai/<maas-model-name>
    api_base: os.environ/HUAWEI_MAAS_API_BASE
    api_key: os.environ/HUAWEI_MAAS_API_KEY_0
    tpm: <tokens-per-minute>
    rpm: <requests-per-minute>
  model_info: { ... }

- model_name: <public-name>
  litellm_params:
    model: openai/<maas-model-name>
    api_base: os.environ/HUAWEI_MAAS_API_BASE
    api_key: os.environ/HUAWEI_MAAS_API_KEY_1
    tpm: <tokens-per-minute>
    rpm: <requests-per-minute>
  model_info: { ... }

# ... one entry per key

LiteLLM's router load-balances across all deployments for the same model_name.
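
The expansion from one model to N deployments can be sketched in a few lines. This mirrors the structure shown above but is not the logic of generate_config.sh, whose implementation is not reproduced here; `expand_deployments` is a hypothetical name, and model_info is omitted for brevity.

```python
def expand_deployments(public_name, maas_model, tpm, rpm, n_keys):
    """Build one model_list entry per MaaS key, all sharing the same model_name."""
    return [
        {
            "model_name": public_name,
            "litellm_params": {
                "model": f"openai/{maas_model}",
                "api_base": "os.environ/HUAWEI_MAAS_API_BASE",
                "api_key": f"os.environ/HUAWEI_MAAS_API_KEY_{index}",
                "tpm": tpm,
                "rpm": rpm,
            },
        }
        for index in range(n_keys)
    ]


entries = expand_deployments("glm-5.1", "glm-5.1", tpm=500_000, rpm=30, n_keys=2)
print([e["litellm_params"]["api_key"] for e in entries])
# ['os.environ/HUAWEI_MAAS_API_KEY_0', 'os.environ/HUAWEI_MAAS_API_KEY_1']
```

Every entry keeps the same `model_name`, which is what lets the router treat them as interchangeable deployments of one model.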

Adding a new model

Find model name and rate/price info in ModelArts MaaS console
Add entry to model_list in assets/config/litellm_config.yaml.example following the structure above
Ensure model_name matches MaaS exactly (case-sensitive)
Set tpm/rpm from MaaS console quotas
Set non-zero input_cost_per_token and output_cost_per_token (per-token, not per-1K)
Regenerate config: ./scripts/generate_config.sh
Restart: docker compose restart litellm
Verify: curl -s http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[].id'
Confirm pricing: curl -s http://localhost:4000/model/info -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[] | {model: .model_name, input_cost: .input_cost_per_token, output_cost: .output_cost_per_token}'

Adding/removing MaaS API keys

To add a key:

Add the new key to .env as HUAWEI_MAAS_API_KEY_N (next index)
Increment HUAWEI_MAAS_API_KEY_COUNT
Regenerate config: ./scripts/generate_config.sh
Restart: docker compose restart litellm
Verify deployment count: curl -s http://localhost:4000/model/info -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '[.data[].model_name] | length'

To remove a key:

Remove the key from .env
Decrement HUAWEI_MAAS_API_KEY_COUNT
Re-index remaining keys if needed (keys must be 0, 1, 2... contiguous)
Regenerate config: ./scripts/generate_config.sh
Restart: docker compose restart litellm

To rotate an expired key:

Replace the expired key value in .env
Regenerate config: ./scripts/generate_config.sh
Restart: docker compose restart litellm

Proxy settings

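Configured in assets/config/litellm_config.yaml under litellm_settings (the file is generated, so make lasting changes in litellm_config.yaml.example and re-run generate_config.sh):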

Setting	Value	Meaning
`num_retries`	3	Retry failed calls 3 times within same deployment
`request_timeout`	600	Raise TimeoutError after 600s (full request latency; matches LiteLLM default)
`stream_timeout`	60	Raise TimeoutError after 60s waiting for first token (TTFT only)
`drop_params`	True	Drop unsupported params instead of erroring
`set_verbose`	False	Suppress debug logging
`callbacks`	`["prometheus", "custom_callbacks.my_prometheus_logger"]`	Built-in Prometheus + custom TTFT/TPOT/ITL

Under general_settings:

Setting	Value	Meaning
`database_connection_pool_limit`	10	Max DB connections
`database_connection_timeout`	60	DB connection timeout in seconds

Usage

Chat completion

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-5.1", "messages": [{"role": "user", "content": "Hello!"}]}'

Streaming

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v4-flash", "messages": [{"role": "user", "content": "Count to 5."}], "stream": true}'

Thinking mode (DeepSeek)

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v4-pro", "messages": [{"role": "user", "content": "Solve step by step."}], "extra_body": {"thinking": {"type": "enabled"}}}'

Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-...")
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Virtual key management

curl -s -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"models": ["glm-5.1", "deepseek-v4-flash"], "max_budget": 10.0, "duration": "30d"}'

curl -s -X POST http://localhost:4000/key/info \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "sk-..."}'

curl -s -X POST http://localhost:4000/key/update \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "sk-...", "max_budget": 50.0}'

curl -s -X POST http://localhost:4000/key/delete \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-..."]}'
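
The same key-minting call can be made from Python with only the standard library. This is a sketch under the assumption that the proxy is reachable at http://localhost:4000 and that the response carries the new key in a `key` field, as the jq filter in the deployment workflow suggests. `build_key_request` is a hypothetical helper; it builds the request without sending it.

```python
import json
import urllib.request


def build_key_request(base_url, master_key, models, max_budget, duration):
    """Build (but do not send) a POST /key/generate request."""
    payload = {"models": models, "max_budget": max_budget, "duration": duration}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_key_request("http://localhost:4000", "sk-...", ["glm-5.1"], 1.0, "1d")
print(request.get_method(), request.full_url)
# To send it against a running proxy:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["key"])
```

Read the master key from the environment rather than hard-coding it, in line with the Core Rules.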

Metrics

Built-in LiteLLM metrics (on `/metrics`)

Metric	Type	Description
`litellm_proxy_total_requests_metric`	counter	Total requests
`litellm_request_total_latency_metric`	histogram	End-to-end latency
`litellm_llm_api_latency_metric`	histogram	Upstream API latency only
`litellm_spend_metric`	counter	Cumulative spend (USD)
`litellm_input_tokens_metric`	counter	Input tokens
`litellm_output_tokens_metric`	counter	Output tokens
`litellm_deployment_state`	gauge	0=healthy, 1=partial, 2=outage

Deployment-level metrics (multi-key)

Metric	Type	Labels	Description
`litellm_deployment_success_responses_total`	counter	`litellm_model_name`	Successful responses per deployment
`litellm_deployment_failure_responses_total`	counter	`litellm_model_name`, `exception_status`, `exception_class`	Failed responses per deployment
`litellm_deployment_cooled_down`	gauge	`litellm_model_name`	1 if deployment is in cooldown
`litellm_deployment_latency_per_output_token`	histogram	`litellm_model_name`	Latency per output token by deployment

Label note: Deployment metrics use litellm_model_name (includes deployment suffix like glm-5.1--maas-key-0), not model.
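
When post-processing these metrics outside Grafana, it can help to split the label back into model and key index. A small sketch, assuming the `--maas-key-<index>` suffix format quoted above; `split_deployment` is a hypothetical helper.

```python
SUFFIX = "--maas-key-"


def split_deployment(litellm_model_name):
    """Split 'glm-5.1--maas-key-0' into ('glm-5.1', 0); no suffix gives (name, None)."""
    model, sep, index = litellm_model_name.rpartition(SUFFIX)
    if not sep or not index.isdigit():
        return litellm_model_name, None
    return model, int(index)


print(split_deployment("glm-5.1--maas-key-0"))  # ('glm-5.1', 0)
print(split_deployment("deepseek-v4-flash"))    # ('deepseek-v4-flash', None)
```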

Custom metrics (`custom_callbacks.py`)

Metric	Type	When	Bucket range
`litellm_custom_ttft_seconds`	histogram	Streaming only	0.01s → 30s
`litellm_custom_tpot_seconds`	histogram	Always	0.001s → 5s
`litellm_custom_itl_seconds`	histogram	Streaming only	0.001s → 5s

All custom metrics labeled: model, model_group, api_provider.

Custom callback internals

PrometheusTTFTTPOTITL(CustomLogger):

TTFT = completion_start_time - api_call_start_time (streaming only, when > 0)
TPOT = (end_time - start_time) / output_tokens (always, when output_tokens > 0)
ITL = (end_time - completion_start_time) / (output_tokens - 1) (streaming only, when output_tokens > 1)
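
The three formulas can be checked with a small standalone function. This is a sketch of the arithmetic only, using the timestamp names from the list above as plain seconds; it is not the code in custom_callbacks.py, and `latency_metrics` is a hypothetical name.

```python
def latency_metrics(start_time, api_call_start_time, completion_start_time,
                    end_time, output_tokens, streaming):
    """Return the TTFT/TPOT/ITL values (seconds) that would be observed, or None."""
    ttft = tpot = itl = None
    if streaming:
        first_token_delay = completion_start_time - api_call_start_time
        if first_token_delay > 0:
            ttft = first_token_delay
        if output_tokens > 1:
            itl = (end_time - completion_start_time) / (output_tokens - 1)
    if output_tokens > 0:
        tpot = (end_time - start_time) / output_tokens
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


# Request starts at t=0.0, upstream call at t=0.25, first token at t=0.75,
# last token at t=2.75, 11 output tokens, streamed.
print(latency_metrics(0.0, 0.25, 0.75, 2.75, 11, streaming=True))
# {'ttft': 0.5, 'tpot': 0.25, 'itl': 0.2}
```

Note that TPOT divides the whole request duration by the token count, so it includes the time to first token, while ITL measures only the gaps between tokens after the first.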

Useful PromQL

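# Requests per minute
rate(litellm_proxy_total_requests_metric[5m]) * 60
# p99 end-to-end latency
histogram_quantile(0.99, rate(litellm_request_total_latency_metric_bucket[5m]))
# Average spend per second over the last day (use increase(litellm_spend_metric[1d]) for the daily total)
rate(litellm_spend_metric[1d])
# Tokens per minute (input + output)
rate(litellm_input_tokens_metric[5m])*60 + rate(litellm_output_tokens_metric[5m])*60
# Deployments in complete outage
litellm_deployment_state == 2
# p95 time to first token (streaming requests only)
histogram_quantile(0.95, rate(litellm_custom_ttft_seconds_bucket[5m]))
# Mean time per output token
rate(litellm_custom_tpot_seconds_sum[5m]) / rate(litellm_custom_tpot_seconds_count[5m])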
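
These expressions can also be run outside Grafana through Prometheus's HTTP API. The sketch below only builds the instant-query URL, taking care of URL encoding; it assumes Prometheus is on the default port from this stack, and `instant_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def instant_query_url(promql, base_url="http://localhost:9090"):
    """URL for Prometheus's instant-query endpoint, with the expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


print(instant_query_url("litellm_deployment_state == 2"))
# http://localhost:9090/api/v1/query?query=litellm_deployment_state+%3D%3D+2
```

Fetching the URL with `urllib.request.urlopen` or curl returns JSON with the matching series under `data.result`.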

Grafana Dashboard

Pre-built dashboard (litellm_overview.json):

Auto-refresh: 10s
Default time range: Last 1 hour
Template variables: model (multi-select), datasource (Prometheus selector)
Panel sections: Request Rates, Latency Percentiles, Spend, Token Rates, Deployment State, Custom TTFT/TPOT/ITL, Deployment Load Balancing

Deployment Load Balancing panels (multi-key)

When multiple MaaS API keys are configured, the dashboard includes:

Panel	Description
Deployments Per Model	Shows N deployments per model (bar chart)
Request Distribution	Per-deployment request counts (pie/bar)
Cooldown Events	Deployments temporarily removed from rotation (time series)
Per-Deployment Latency	Latency breakdown by deployment (heatmap or time series)
Deployment Health	Current health status per deployment (stat panel)

Access at http://localhost:3000, login with admin / GRAFANA_PASSWORD.

Operations

Health checks

docker compose ps
curl -s http://localhost:4000/health/liveliness
curl -s http://localhost:4000/health \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.healthy_count, .unhealthy_count'
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s https://api-ap-southeast-1.modelarts-maas.com/openai/v1/models \
  -H "Authorization: Bearer $HUAWEI_MAAS_API_KEY"
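
The jq filter above reads `healthy_count` and `unhealthy_count` from the `/health` response. A minimal Python sketch of the same check follows. The helper name `summarize_health` and the sample payload are hypothetical; only the two count fields are taken from the commands above.

```python
import json
from typing import Tuple


def summarize_health(body: str) -> Tuple[int, int, bool]:
    """Return (healthy, unhealthy, all_ok) from a /health JSON response body."""
    data = json.loads(body)
    healthy = int(data.get("healthy_count", 0))
    unhealthy = int(data.get("unhealthy_count", 0))
    # Treat the proxy as OK only if at least one model is healthy and none are unhealthy
    return healthy, unhealthy, healthy > 0 and unhealthy == 0


# Hypothetical response body, trimmed to the two fields used here
sample = '{"healthy_count": 4, "unhealthy_count": 1}'
print(summarize_health(sample))
```

In practice the body would come from an authenticated GET to http://localhost:4000/health, as in the curl command above.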

Backup & restore

docker compose exec -T db pg_dump -U llmproxy litellm > backup_$(date +%Y%m%d).sql
cat backup_20260516.sql | docker compose exec -T db psql -U llmproxy litellm

Restart & reset

docker compose restart litellm
docker compose down && docker compose up -d
docker compose down -v && docker compose up -d

Troubleshooting commands

docker compose logs litellm
docker compose logs -f litellm
docker compose logs db
docker compose exec db pg_isready -d litellm -U llmproxy
docker compose logs prometheus
docker compose logs grafana
docker volume ls | grep litellm
docker compose exec litellm env | grep -E '^(LITELLM|DB_|HUAWEI|STORE_)'

Repair Playbook

1. Inspect state: docker compose ps and docker compose logs litellm --tail 50
2. Inspect config: read assets/config/litellm_config.yaml and .env before editing
3. Confirm environment: verify that .env contains a real MaaS key, not a placeholder
4. Check DB: docker compose exec db pg_isready -d litellm -U llmproxy
5. Check LiteLLM health: curl -s http://localhost:4000/health -H "Authorization: Bearer $LITELLM_MASTER_KEY"
6. Fix the issue: see Common failure modes below
7. Restart if config changed: docker compose restart litellm
8. Re-validate: run scripts/validate_e2e.sh or follow the Validation Sequence

Common failure modes

Symptom	Cause	Fix
`litellm` keeps restarting	DB not ready or wrong `DB_PASSWORD`	Check `docker compose logs db`, verify `.env`
401 from `/v1/chat/completions`	Wrong or missing API key	Verify `Authorization: Bearer sk-...` header
404 model not found	Model name mismatch	Names are case-sensitive, must match MaaS console
No metrics in Prometheus	LiteLLM healthcheck failing	Check `docker compose ps`, ensure litellm healthy
`LITELLM_SALT_KEY` error	Salt changed after keys created	Use original salt; if lost, `docker compose down -v`
MaaS 403	Wrong region or expired key	Verify key in ModelArts console, region must be `ap-southeast-1`
Callback import error	`custom_callbacks.py` not mounted	Check volume mount in `docker-compose.yml`
`unhealthy_count > 0` in `/health`	Upstream model unreachable	Check MaaS key, model ID, region; do not add wildcards
Budget not consumed	Zero `input_cost_per_token` / `output_cost_per_token`	Set non-zero pricing; verify with `/model/info`
Prometheus target down	LiteLLM not healthy or not started	Check healthcheck chain: `db` → `litellm` → `prometheus`
Grafana shows no data	Prometheus not scraping or wrong datasource	Check targets; verify datasource URL is `http://prometheus:9090`
Virtual key 403	Key expired, over budget, or model not in allow-list	Check key with `/key/info`
Intermittent TimeoutError	`request_timeout` too low for LLM calls (applies to full request latency, not just TTFT)	Increase `request_timeout` (default 600s); add `stream_timeout` for tighter TTFT deadline

Sanitization Rules

Never write real secrets into committed files. Use .env (gitignored) with 0600 permissions.
In output or documentation, use placeholders: sk-<master-key>, <maas-api-key>, <db-password>.
In configuration demos, read secrets from env vars (os.environ/...) or $VAR_NAME placeholders.
Mask discovered keys as <prefix>...<suffix> (len=N) or ***redacted***.
LiteLLM may print api_key values in startup logs. After troubleshooting, scan for them with docker compose logs litellm 2>&1 | grep -i 'api_key\|sk-' and, if any keys appear, set set_verbose: False.
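
The masking rule above can be sketched in a few lines of Python. The helper name `mask_secret`, the choice of four visible characters at each end, and the sample key are assumptions made for illustration.

```python
def mask_secret(value: str, keep: int = 4) -> str:
    """Mask a secret as <prefix>...<suffix> (len=N); fully redact short values."""
    if len(value) <= keep * 2:
        # Too short to show both ends without revealing most of the secret
        return "***redacted***"
    return f"{value[:keep]}...{value[-keep:]} (len={len(value)})"


# Obviously fake key, used only to show the output shape
print(mask_secret("sk-example-not-a-real-key-1234"))  # sk-e...1234 (len=30)
```

Short values are redacted entirely because showing both ends would reveal most of the secret.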

Common Mistakes

Mistake	Why it's wrong	Correct approach
Committing `.env` to git	Leaks all secrets	`.env` is gitignored; never `git add .env`
Changing `LITELLM_SALT_KEY` after creating virtual keys	All existing keys unreadable	Keep original salt; if lost, full reset
Giving clients the raw `HUAWEI_MAAS_API_KEY`	Bypasses spend tracking, rate limiting, audit	Mint virtual keys via `/key/generate`
Using per-1K-token pricing in `model_info`	LiteLLM expects per-token pricing	Use `input_cost_per_token` (e.g. `0.000001078`, not `0.001078`)
Adding a model with zero pricing	Budgets don't consume spend	Set non-zero `input_cost_per_token` and `output_cost_per_token`
Guessing model names	MaaS model IDs are case-sensitive	Verify exact name in MaaS console
Editing config without restarting	Config is read at startup only	`docker compose restart litellm` after changes
Running `docker compose down` and expecting it to wipe data	Volumes survive `down`	Use `docker compose down -v` to destroy data
Checking `/health/liveliness` instead of `/health` for model status	Liveliness only checks process	Use `/health` with auth for model-level diagnostics
Setting `request_timeout` too low (e.g. 10s)	LLM calls routinely exceed 10s end-to-end, causing intermittent TimeoutErrors	Use `request_timeout: 600` (default); use `stream_timeout` for tighter TTFT control
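
The per-1K-token pricing mistake in the table above comes down to a factor of 1000. A minimal sketch of the conversion, assuming the provider quotes prices per 1,000 tokens; the helper name `per_token_price` is hypothetical.

```python
from decimal import Decimal


def per_token_price(price_per_1k_tokens: str) -> Decimal:
    """Convert a per-1K-token price into the per-token value used in model_info."""
    # Decimal avoids binary floating-point noise in very small prices
    return Decimal(price_per_1k_tokens) / Decimal(1000)


print(per_token_price("0.001078"))  # 0.000001078
```

Entering `0.001078` directly as `input_cost_per_token` would make every request appear 1000 times more expensive and exhaust budgets almost immediately.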

Reference: File Templates

These templates are for reference: use them to understand each file's structure or to build the stack from scratch when git is not available. When cloning this repo, the actual files take precedence.

`.gitignore`

.env

`.env.example`

# ── Proxy Auth ───────────────────────────────────
LITELLM_MASTER_KEY="sk-change-me"
LITELLM_SALT_KEY="change-me-to-a-long-random-string"

# ── Database ─────────────────────────────────────
DB_PASSWORD="change-me-to-a-strong-password"

# ── Huawei MaaS ──────────────────────────────────
HUAWEI_MAAS_API_KEY="change-me-to-your-maas-api-key"
HUAWEI_MAAS_API_KEY_0="change-me-to-your-maas-api-key"
HUAWEI_MAAS_API_BASE="https://api-ap-southeast-1.modelarts-maas.com/openai/v1"

# ── Prometheus ───────────────────────────────────
PROMETHEUS_RETENTION="15d"

# ── Grafana ──────────────────────────────────────
GRAFANA_PASSWORD="change-me-to-a-strong-password"
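
Every secret in the template above ships with a value containing `change-me`, so a leftover placeholder is easy to detect before deployment. The following is a minimal sketch of such a check. The function name `find_placeholders` and the sample values are hypothetical, and this is not the logic of scripts/init_env.sh.

```python
from typing import List


def find_placeholders(env_text: str) -> List[str]:
    """Return the names of variables whose value still contains 'change-me'."""
    offenders = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if "change-me" in value.strip().strip('"'):
            offenders.append(name.strip())
    return offenders


# Hypothetical .env content with one placeholder left in
sample = '''
# Proxy Auth
LITELLM_MASTER_KEY="sk-change-me"
DB_PASSWORD="example-value"
PROMETHEUS_RETENTION="15d"
'''
print(find_placeholders(sample))  # ['LITELLM_MASTER_KEY']
```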

`docker-compose.yml`

x-default: &default
  restart: unless-stopped
  logging:
    driver: "json-file"
    options:
      max-size: "10m"
      max-file: "3"

services:
  litellm:
    <<: *default
    container_name: litellm_proxy
    image: ghcr.io/berriai/litellm:v1.83.14-stable.patch.3
    ports:
      - "4000:4000"
    volumes:
      - ./assets/config/litellm_config.yaml:/app/config.yaml:ro
      - ./assets/config/custom_callbacks.py:/app/custom_callbacks.py:ro
    environment:
      DATABASE_URL: "postgresql://llmproxy:${DB_PASSWORD}@db:5432/litellm"
      STORE_MODEL_IN_DB: "True"
    env_file:
      - .env
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "python3 -c \"import urllib.request; urllib.request.urlopen('http://localhost:4000/health/liveliness')\""]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    command:
      - "--config=/app/config.yaml"

  db:
    <<: *default
    image: postgres:16-alpine
    container_name: litellm_pg_db
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: llmproxy
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -d litellm -U llmproxy"]
      interval: 5s
      timeout: 5s
      retries: 10

  prometheus:
    <<: *default
    image: prom/prometheus:v3.3.1
    container_name: litellm_prometheus
    volumes:
      - ./assets/config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION:-15d}"
    depends_on:
      litellm:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --spider -q http://localhost:9090/-/healthy || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 10s

  grafana:
    <<: *default
    image: grafana/grafana:11.5.2
    container_name: litellm_grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - ./assets/config/grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana_data:/var/lib/grafana
    depends_on:
      prometheus:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --spider -q http://localhost:3000/api/health || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 15s

volumes:
  postgres_data:
    name: litellm_postgres_data
  prometheus_data:
    name: litellm_prometheus_data
  grafana_data:
    name: litellm_grafana_data

`litellm_config.yaml`

model_list:

  # ───────── Huawei MaaS Models ───────────

  - model_name: glm-5.1
    litellm_params:
      model: openai/glm-5.1
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 500000
      rpm: 30
    model_info:
      max_tokens: 198000
      max_input_tokens: 192000
      max_output_tokens: 128000
      input_cost_per_token: 0.000001078
      output_cost_per_token: 0.000003774

  - model_name: glm-5
    litellm_params:
      model: openai/glm-5
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 500000
      rpm: 30
    model_info:
      max_tokens: 198000
      max_input_tokens: 192000
      max_output_tokens: 64000
      input_cost_per_token: 0.000000809
      output_cost_per_token: 0.000002965

  - model_name: deepseek-v4-pro
    litellm_params:
      model: openai/deepseek-v4-pro
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 30000
      rpm: 3
    model_info:
      max_tokens: 1000000
      max_input_tokens: 1000000
      max_output_tokens: 128000
      input_cost_per_token: 0.000001617
      output_cost_per_token: 0.000003235

  - model_name: deepseek-v4-flash
    litellm_params:
      model: openai/deepseek-v4-flash
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 30000
      rpm: 3
    model_info:
      max_tokens: 1000000
      max_input_tokens: 1000000
      max_output_tokens: 128000
      input_cost_per_token: 0.000000135
      output_cost_per_token: 0.00000027

  - model_name: deepseek-v3.2
    litellm_params:
      model: openai/deepseek-v3.2
      api_base: os.environ/HUAWEI_MAAS_API_BASE
      api_key: os.environ/HUAWEI_MAAS_API_KEY
      tpm: 500000
      rpm: 700
    model_info:
      max_tokens: 160000
      max_input_tokens: 128000
      max_output_tokens: 32000
      input_cost_per_token: 0.00000027
      output_cost_per_token: 0.000000404


litellm_settings:
  num_retries: 3
  request_timeout: 600
  stream_timeout: 60
  drop_params: True
  set_verbose: False
  callbacks:
    - "prometheus"
    - custom_callbacks.my_prometheus_logger
  ui_theme_config:
    logo_url: "https://upload.wikimedia.org/wikipedia/en/thumb/0/04/Huawei_Standard_logo.svg/3840px-Huawei_Standard_logo.svg.png"
    favicon_url: "https://upload.wikimedia.org/wikipedia/en/thumb/0/04/Huawei_Standard_logo.svg/3840px-Huawei_Standard_logo.svg.png"

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 3
  cooldown_time: 30
  allowed_fails: 3

general_settings:
  database_connection_pool_limit: 10
  database_connection_timeout: 60
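
To sanity-check the `model_info` prices in the template above, the cost of one request can be estimated as a token-weighted sum. This sketch assumes spend is simply prompt tokens times `input_cost_per_token` plus completion tokens times `output_cost_per_token`. The helper name `estimate_cost` and the token counts are hypothetical, and LiteLLM's own accounting may apply additional rules.

```python
from decimal import Decimal


def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_cost_per_token: str,
    output_cost_per_token: str,
) -> Decimal:
    """Estimate the cost of one request from per-token prices."""
    return (
        prompt_tokens * Decimal(input_cost_per_token)
        + completion_tokens * Decimal(output_cost_per_token)
    )


# glm-5.1 prices from the template above; the token counts are made up
cost = estimate_cost(1000, 500, "0.000001078", "0.000003774")
print(cost)
```

With these inputs the estimate is 0.001078 + 0.001887 = 0.002965, which gives a rough idea of how many such requests a virtual key budget would cover.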

`custom_callbacks.py`

"""
Custom LiteLLM callback that emits TTFT, TPOT, and ITL as Prometheus histograms.

TTFT  = completion_start_time - api_call_start_time  (streaming only, when > 0)
TPOT  = total_latency / output_tokens  (when output_tokens > 0)
ITL   = (end_time - completion_start_time) / (output_tokens - 1)  (streaming only, when output_tokens > 1)
"""

from litellm.integrations.custom_logger import CustomLogger
from prometheus_client import Histogram


def _to_timestamp(val):
    """Convert datetime or numeric to a float unix timestamp."""
    if val is None:
        return None
    if isinstance(val, (int, float)):
        return float(val)
    if hasattr(val, "timestamp"):
        return val.timestamp()
    return None


class PrometheusTTFTTPOTITL(CustomLogger):
    """Custom callback that emits TTFT, TPOT, and ITL as Prometheus histograms."""

    def __init__(self):
        super().__init__()

        self.ttft = Histogram(
            "litellm_custom_ttft_seconds",
            "Time to first token in seconds (streaming only)",
            labelnames=["model", "model_group", "api_provider"],
            buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0),
        )

        self.tpot = Histogram(
            "litellm_custom_tpot_seconds",
            "Time per output token in seconds",
            labelnames=["model", "model_group", "api_provider"],
            buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 5.0),
        )

        self.itl = Histogram(
            "litellm_custom_itl_seconds",
            "Inter-token latency in seconds (average between successive tokens, streaming only)",
            labelnames=["model", "model_group", "api_provider"],
            buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 5.0),
        )

    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        try:
            stream = kwargs.get("stream", False)
            completion_start_time = kwargs.get("completion_start_time")
            api_call_start_time = kwargs.get("api_call_start_time")

            slo = kwargs.get("standard_logging_object") or {}
            model = slo.get("model") or kwargs.get("model", "unknown")
            model_group = slo.get("model_group") or model
            api_provider = slo.get("custom_llm_provider") or "unknown"

            labels = {"model": model, "model_group": model_group, "api_provider": api_provider}

            output_tokens = 0
            if response_obj is not None:
                usage = None
                if hasattr(response_obj, "get"):
                    usage = response_obj.get("usage")
                elif hasattr(response_obj, "usage"):
                    usage = response_obj.usage
                if usage is not None:
                    if isinstance(usage, dict):
                        output_tokens = usage.get("completion_tokens", 0) or 0
                    elif hasattr(usage, "completion_tokens"):
                        output_tokens = usage.completion_tokens or 0

            start_ts = _to_timestamp(start_time)
            end_ts = _to_timestamp(end_time)
            api_start_ts = _to_timestamp(api_call_start_time)
            comp_start_ts = _to_timestamp(completion_start_time)

            if stream and api_start_ts and comp_start_ts:
                ttft_seconds = comp_start_ts - api_start_ts
                if ttft_seconds > 0:
                    self.ttft.labels(**labels).observe(ttft_seconds)

            if output_tokens > 0 and start_ts and end_ts:
                total_latency = end_ts - start_ts
                tpot_seconds = total_latency / output_tokens
                self.tpot.labels(**labels).observe(tpot_seconds)

                if stream and comp_start_ts:
                    streaming_duration = end_ts - comp_start_ts
                    if streaming_duration > 0 and output_tokens > 1:
                        itl_seconds = streaming_duration / (output_tokens - 1)
                        self.itl.labels(**labels).observe(itl_seconds)

        except Exception as e:
            print(f"[PrometheusTTFTTPOTITL] Error: {e}")


my_prometheus_logger = PrometheusTTFTTPOTITL()

`prometheus.yml`

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "litellm"
    static_configs:
      - targets: ["litellm:4000"]

`grafana/provisioning/datasources/prometheus.yml`

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

`grafana/provisioning/dashboards/dashboards.yml`

apiVersion: 1

providers:
  - name: LiteLLM
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false

The pre-built litellm_overview.json dashboard is at assets/config/grafana/provisioning/dashboards/litellm_overview.json.

Bundled Resources

references/architecture.md — topology, services, volumes
references/metrics-and-dashboards.md — PromQL, custom metrics, Grafana
references/operations.md — health checks, backup, restart, usage
references/troubleshooting.md — repair playbook, failure modes
scripts/init_env.sh — interactive .env setup (manual, agent-guided, or CI)
scripts/generate_config.sh — generates litellm_config.yaml from .env
scripts/validate_e2e.sh — 12-step end-to-end validation

Output Expectations

On completion, leave behind:

docker compose ps with all four services healthy
.env populated with real secrets, chmod 600, no placeholders
Validated: direct MaaS request, proxied request, streaming, metrics, Grafana, virtual key
Operator note listing: endpoints, file paths, master key location, MaaS region, virtual keys created
