Autonomyx LLM Gateway — end-to-end LiteLLM proxy setup for 13 providers: Ollama, vLLM, TGI, OpenAI, Claude, Gemini, Mistral, Groq, Fireworks, Together.ai, OpenRouter, Azure, Bedrock. Produces config.yaml, docker-compose.yml, .env.example, token-count test script, Prometheus/Grafana billing stack, Langflow wiring, and autonomyx-mcp wiring. Deploys to Coolify or generic Docker. ALWAYS trigger for: LiteLLM, LLM proxy, LLM gateway, model routing, virtual keys, token tracking, cost tracking, LLM billing, "route to multiple models", "unified LLM API", "OpenAI-compatible endpoint", "connect Langflow to LLMs", "LiteLLM config", "docker compose for LLM", or any request to configure, deploy, or extend multi-model LLM infra.
Autonomyx LLM Gateway Skill
Produces a complete, working LiteLLM proxy setup for the Autonomyx stack.
Outputs this skill always produces
| Artifact | Description |
| --- | --- |
| config.yaml | LiteLLM model list, routing, fallbacks, rate limits |
| docker-compose.yml | LiteLLM proxy + Postgres + Prometheus + Grafana |
| prometheus.yml | Scrape config for LiteLLM metrics endpoint |
| grafana-dashboard.json | Token + cost dashboard (importable) |
| .env.example | All required env vars, documented |
| test-tokens.sh | curl-based token count + cost verify script |
| langflow-integration.md | How to point Langflow custom LLM at the gateway |
| mcp-integration.md | How to wire autonomyx-mcp to use the gateway |
Step 1 — Gather Inputs
Before generating any file, confirm:
Which providers are active — check which API keys the user has. Generate model blocks only for confirmed providers. Include commented-out stubs for the rest.
Local model hosts — for Ollama/vLLM/TGI: ask for the host:port (default: host.docker.internal:11434 for Ollama, host.docker.internal:8000 for vLLM).
Coolify vs generic Docker — affects volume paths and network mode.
Master key — generate a secure default or let user supply one.
Postgres credentials — generate defaults or user-supplied.
If the user says "just generate defaults", use the values in references/defaults.md.
Step 2 — Generate config.yaml
Read references/config-template.md for the full annotated template.
Key rules:
Every model entry needs: model_name (alias), litellm_params.model (provider/model string), litellm_params.api_key (env var ref, never hardcoded)
Group models by provider in comments
Always include a router_settings block with routing_strategy: usage-based-routing
Always include fallbacks — local model → cloud fallback
Token counting: set max_tokens per model using values from references/model-limits.md
Budget limits: set max_budget and budget_duration in litellm_settings
Step 3 — Generate docker-compose.yml
Read references/docker-compose-template.md.
Services to include:
litellm — proxy container
litellm-db — Postgres 15
prometheus — scrapes /metrics on LiteLLM
grafana — token/cost dashboards
Coolify-specific:
Add labels block for Coolify reverse proxy (Traefik)
Use named volumes (not bind mounts)
Network: coolify external network
Generic Docker:
Use bind mounts for config files
Expose port 4000 directly
Step 4 — Generate .env.example
One block per provider. See references/env-vars.md for the full list. Rules:
Never emit real keys — always YOUR_KEY_HERE placeholders
Group: Local → OpenAI-compatible cloud → Enterprise
Include LITELLM_MASTER_KEY, DATABASE_URL, STORE_MODEL_IN_DB=True
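With the config, compose file, and .env in place, include a quick smoke test in the output so the user can confirm routing and token counting end to end. A minimal sketch, assuming the proxy listens on localhost:4000 and that the model alias used matches a model_name in the generated config.yaml:

```python
# smoke_test.py: minimal check that the gateway answers an OpenAI-compatible
# chat completion and reports token usage (assumes `pip install openai`).
import os

from openai import OpenAI

# The proxy speaks the OpenAI API, so the standard client works unchanged.
client = OpenAI(
    base_url="http://localhost:4000",          # LiteLLM proxy, not api.openai.com
    api_key=os.environ["LITELLM_MASTER_KEY"],  # or a virtual key
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; must match a model_name alias in config.yaml
    messages=[{"role": "user", "content": "Reply with the single word: pong"}],
)

print(resp.choices[0].message.content)
# usage is what LiteLLM's token tracking records for this request
print("prompt tokens:", resp.usage.prompt_tokens)
print("completion tokens:", resp.usage.completion_tokens)
```

test-tokens.sh covers the same check with curl; this is the equivalent for users who prefer the OpenAI SDK.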
Lago Billing Integration
Read references/lago-integration.md in full. Produce:
Lago docker-compose additions — append lago-api, lago-worker, lago-clock, lago-front, lago-db, lago-redis to the existing compose file. Add lago-db-data and lago-redis-data to the volumes block.
lago_callback.py — LiteLLM custom success callback that fires usage events to Lago per completed request (input tokens, output tokens, request count). Maps LiteLLM key_alias → Lago external_customer_id.
config.yaml addition — register LagoCallback under litellm_settings.success_callback.
Lago env vars — appended to .env.example (LAGO_API_KEY, LAGO_SECRET_KEY_BASE, encryption keys).
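A minimal sketch of the lago_callback.py described above, assuming a Lago event ingestion endpoint at /api/v1/events and an external_customer_id equal to the LiteLLM key_alias; verify both assumptions against references/lago-integration.md before shipping:

```python
# lago_callback.py (sketch): LiteLLM success callback that forwards usage to Lago.
# The Lago event envelope and customer mapping below are assumptions.
import os
import uuid

import httpx
from litellm.integrations.custom_logger import CustomLogger

LAGO_URL = os.environ.get("LAGO_API_URL", "http://lago-api:3000")
LAGO_API_KEY = os.environ.get("LAGO_API_KEY", "")


class LagoCallback(CustomLogger):
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        # The proxy attaches virtual-key metadata; key_alias is assumed to map 1:1
        # to the Lago external_customer_id.
        metadata = kwargs.get("litellm_params", {}).get("metadata", {}) or {}
        customer = metadata.get("user_api_key_alias") or "unmapped"
        usage = getattr(response_obj, "usage", None)

        event = {
            "event": {
                "transaction_id": str(uuid.uuid4()),
                "external_customer_id": customer,
                "code": "llm_usage",  # must match the billable metric code defined in Lago
                "properties": {
                    "input_tokens": getattr(usage, "prompt_tokens", 0),
                    "output_tokens": getattr(usage, "completion_tokens", 0),
                    "requests": 1,
                },
            }
        }
        async with httpx.AsyncClient() as client:
            await client.post(
                f"{LAGO_URL}/api/v1/events",
                json=event,
                headers={"Authorization": f"Bearer {LAGO_API_KEY}"},
                timeout=5,
            )


lago_callback = LagoCallback()
```

Per the step above, this class is then registered under litellm_settings.success_callback in config.yaml.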
Preflight Token Guard
Preflight metadata attached to the request for downstream logging
Dockerfile.litellm — extends the base LiteLLM image, adds transformers and sentencepiece
docker-compose update — mount preflight_guard.py, use custom image build
config.yaml update — register PreflightGuard under custom_callbacks
.env.example addition — DEFAULT_TPM_LIMIT
Redis upgrade note — show how to replace in-memory TPM with lago-redis
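The guard's core logic, sketched without the LiteLLM hook plumbing: pre-count tokens with a Hugging Face tokenizer and keep a rolling per-key token budget in memory. The tokenizer name and the 60-second window are illustrative defaults:

```python
# preflight_guard.py (core logic only): estimate prompt tokens before the call
# and enforce a per-key tokens-per-minute budget in memory.
import os
import time
from collections import defaultdict

from transformers import AutoTokenizer

DEFAULT_TPM_LIMIT = int(os.environ.get("DEFAULT_TPM_LIMIT", "100000"))
_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer works for estimates
_buckets: dict[str, list[tuple[float, int]]] = defaultdict(list)  # key -> [(timestamp, tokens)]


def count_prompt_tokens(messages: list[dict]) -> int:
    """Rough pre-call estimate of prompt tokens across all message contents."""
    text = "\n".join(str(m.get("content", "")) for m in messages)
    return len(_tokenizer.encode(text))


def check_tpm(key_alias: str, tokens: int, limit: int = DEFAULT_TPM_LIMIT) -> bool:
    """Return True if this request fits in the key's rolling 60-second token budget."""
    now = time.time()
    window = [(ts, n) for ts, n in _buckets[key_alias] if now - ts < 60]
    used = sum(n for _, n in window)
    if used + tokens > limit:
        _buckets[key_alias] = window
        return False
    window.append((now, tokens))
    _buckets[key_alias] = window
    return True
```

The Redis upgrade note then replaces the in-memory _buckets with a shared counter in lago-redis (for example an INCR plus EXPIRE per key per minute) so the limit holds across proxy replicas.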
Step 8.8 — Model Recommender
Read references/model-recommender.md in full. Produce:
recommender.py — FastAPI router registered in LiteLLM config.yaml under additional_routers. Exposes POST /recommend. Infers task type from prompt via Claude Haiku. Reads budget from LiteLLM Postgres + Lago. Reads latency/error rate from Prometheus. Returns ranked list with fit scores, budget remaining, tokens remaining, reset time.
model_registry.json — one entry per model in config.yaml, with cost/1k, capabilities, quality score, latency tier, privacy flag.
config.yaml addition — register recommender.router under general_settings.additional_routers.
docker-compose additions — mount recommender.py and model_registry.json into LiteLLM container. Add RECOMMENDER_INFERENCE_MODEL and PROMETHEUS_URL env vars.
Langflow wiring — HTTP Request node pattern to auto-select model per request.
Env vars — RECOMMENDER_INFERENCE_MODEL, PROMETHEUS_URL in .env.example.
Note: infer_task() in recommender.py must use the local classifier (see Step 8.9), not a cloud API; cloud models are used for auto-upgrade only.
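A minimal sketch of the ranking side of recommender.py; the registry field names and fit-score weighting are assumptions, and the budget, latency, and error-rate lookups against Postgres, Lago, and Prometheus are omitted here:

```python
# recommender.py (sketch): rank models from model_registry.json for a request.
# Task inference is delegated to the local classifier (Step 8.9).
import json

from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

with open("model_registry.json") as f:
    # Assumed fields per entry: model, capabilities, quality, latency_tier, cost_per_1k
    REGISTRY = json.load(f)


class RecommendRequest(BaseModel):
    prompt: str
    max_cost_per_1k: float | None = None


def infer_task(prompt: str) -> str:
    """Placeholder; replaced in Step 8.9 by the local classifier call."""
    return "chat"


@router.post("/recommend")
def recommend(req: RecommendRequest) -> dict:
    task = infer_task(req.prompt)
    candidates = [m for m in REGISTRY if task in m.get("capabilities", [])]
    if req.max_cost_per_1k is not None:
        candidates = [m for m in candidates if m["cost_per_1k"] <= req.max_cost_per_1k]

    def fit(m: dict) -> float:
        # Illustrative weighting: higher quality and lower cost score better,
        # latency tier breaks ties.
        return m["quality"] - 10 * m["cost_per_1k"] - 0.1 * m.get("latency_tier", 1)

    ranked = sorted(candidates, key=fit, reverse=True)
    return {
        "task": task,
        "recommendations": [
            {"model": m["model"], "fit_score": round(fit(m), 3)} for m in ranked
        ],
    }
```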
Step 8.9 — Local Classifier
Read references/local-classifier.md in full. Produce:
classifier/Dockerfile — Python 3.12-slim, installs sentence-transformers, scikit-learn, FastAPI. Serves on port 8100 (internal only).
classifier/classifier_server.py — FastAPI server. On startup: loads all-MiniLM-L6-v2, trains LogisticRegression on training_data.json, saves model to /app/model. Endpoints: POST /classify, POST /retrain, GET /health.
classifier/training_data.json — 80 seed examples across 8 task types (chat, code, reason, summarise, extract, vision, long_context, agent). Instruct user to extend to 100+ per class.
docker-compose addition — classifier service on coolify network, classifier-model named volume for model persistence.
Updated infer_task() in recommender.py — local classifier first via http://classifier:8100/classify. If confidence >= threshold (default 0.80): return local. If below threshold AND RECOMMENDER_MODE=auto AND cloud key present: upgrade to cheapest available cloud model (Haiku > gpt-4o-mini > groq/llama3). Fall back to local on any cloud error.
Env vars — CLASSIFIER_URL, CONFIDENCE_THRESHOLD, RECOMMENDER_MODE, AUTO_RETRAIN_ON_STARTUP in .env.example.
Verify commands — standalone classifier test + full /recommend test with RECOMMENDER_MODE=local and no cloud keys.
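The core of classifier_server.py, assuming training_data.json is a flat list of text/label pairs; model persistence and the /retrain endpoint are left out of this sketch:

```python
# classifier_server.py (core sketch): embed with all-MiniLM-L6-v2, train a
# LogisticRegression on the seed data, return a task label plus confidence.
import json

from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

app = FastAPI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

with open("training_data.json") as f:
    seed = json.load(f)  # assumed shape: [{"text": "...", "label": "code"}, ...]

clf = LogisticRegression(max_iter=1000)
clf.fit(embedder.encode([s["text"] for s in seed]), [s["label"] for s in seed])


class ClassifyRequest(BaseModel):
    text: str


@app.post("/classify")
def classify(req: ClassifyRequest) -> dict:
    probs = clf.predict_proba(embedder.encode([req.text]))[0]
    best = int(probs.argmax())
    return {"task": str(clf.classes_[best]), "confidence": float(probs[best])}


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}
```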
Step 8.10 — Local Model Catalogue
Read references/local-model-catalogue.md in full. Produce:
ollama-pull.sh — pulls all 6 Tier 1 models on startup. Prints Tier 2 manual pull commands. Prints Tier 3 vLLM note.
docker-compose additions — ollama service with OLLAMA_MAX_LOADED_MODELS=2, OLLAMA_FLASH_ATTENTION=1. vllm service commented out (GPU opt-in).
config.yaml additions — Tier 1 Ollama model entries active. Tier 2 and Tier 3 entries present but commented with # opt-in markers.
model_registry.json updates — add task_default_for, tier, and private: true to all local model entries.
Fallback chain — update router_settings.fallbacks in config.yaml: local Tier 1 → local Tier 2 → cloud fast → cloud flagship per task type.
VPS spec note — document minimum (16GB RAM, 60GB disk) and recommended (32GB RAM) in setup instructions.
Benchmark table — include in output so user understands why each model was chosen.
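ollama-pull.sh itself is a loop over the Ollama pull API; a Python rendering of the same loop, shown only to make the behaviour concrete. The Tier 1 model names come from references/local-model-catalogue.md and are passed as arguments rather than hardcoded, and the /api/pull payload key should be checked against the installed Ollama version:

```python
# pull_models.py: illustrative Python version of ollama-pull.sh.
# Usage: python pull_models.py <model> <model> ...  (Tier 1 list from the reference)
import sys

import requests

OLLAMA_URL = "http://localhost:11434"  # host.docker.internal:11434 from other containers

for model in sys.argv[1:]:
    print(f"pulling {model} ...")
    r = requests.post(
        f"{OLLAMA_URL}/api/pull",
        json={"model": model, "stream": False},  # blocks until the pull finishes
    )
    r.raise_for_status()
    print(model, r.json().get("status", "done"))

print("Tier 2 models: pull manually (see the reference). Tier 3: served by vLLM, not Ollama.")
```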
Step 8.11 — Profitability & Pricing
Read references/profitability.md in full. Produce:
Lago plan configs — three plans (Starter ₹999, SaaS Basic ₹14,999, Enterprise custom) with correct billable metric codes matching lago_callback.py.
complexity_score addition to recommender.py — 7 scoring signals, routes complexity > 0.8 to cloud or /think mode.
pricing.html — single-page pricing page with benchmark table, three tiers, free tier CTA. Autonomyx brand colours.
GPU phase config — vLLM docker-compose service with continuous batching (max-num-seqs: 256), speculative decoding (Qwen3-3B draft), prefix caching enabled.
Cost model comment block — in docker-compose.yml header, document current cost/token and revenue ceiling per deployment phase.
DPDP enterprise note — one-page markdown: what DPDP Act 2023 means for enterprise customers, what Autonomyx provides (DPA, data residency, audit trail).
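A sketch of the complexity_score addition; the seven signals below are illustrative stand-ins for the ones defined in references/profitability.md, and only the greater-than-0.8 routing rule comes from the step above:

```python
# complexity_score (sketch): the signals here are illustrative, not the
# reference set. Scores above 0.8 leave the local tier per the pricing model.
import re


def complexity_score(prompt: str) -> float:
    """Return 0..1; higher means the request likely needs a stronger model."""
    signals = [
        len(prompt) > 2000,                                                  # long prompt
        bool(re.search(r"```|def |class |SELECT ", prompt)),                 # code present
        bool(re.search(r"\bprove\b|\bderive\b|\bstep by step\b", prompt, re.I)),  # reasoning cues
        prompt.count("?") > 3,                                               # many sub-questions
        bool(re.search(r"\d{4,}", prompt)),                                  # heavy numeric content
        bool(re.search(r"\bjson\b|\bschema\b|\byaml\b", prompt, re.I)),      # structured output
        len(prompt.split("\n")) > 40,                                        # multi-document input
    ]
    return sum(signals) / len(signals)


def route(prompt: str) -> str:
    # Per the profitability reference: very complex requests go to cloud or /think mode.
    return "cloud-or-think" if complexity_score(prompt) > 0.8 else "local"
```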
Step 8.12 — Langfuse Integration
Read references/langfuse-integration.md in full. Produce:
docker-compose additions — langfuse-server, langfuse-db, langfuse-redis on Coolify network with Traefik labels at traces.openautonomyx.com. Keycloak OIDC SSO wired.
Extended lago_callback.py — add per-tenant Langfuse trace routing. Each virtual key alias maps to a Langfuse project via LANGFUSE_TENANT_KEYS env var. Falls back to default project if key not mapped.
Extended kc_lago_sync.py — on GROUP_CREATE: create Langfuse org + project alongside Lago customer + LiteLLM key. Store returned API keys for tenant use.
Langfuse Keycloak OIDC client — add to realm setup commands in keycloak-integration.md.
Env vars — all Langfuse vars in .env.example with generation commands.
Multi-tenancy claims table — what you can and cannot claim, for sales and marketing use.
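The per-tenant routing added to lago_callback.py can be sketched as a small client cache; the LANGFUSE_TENANT_KEYS format (a JSON map keyed by key_alias) is an assumption, and the constructor shown is the Langfuse v2 Python SDK:

```python
# Per-tenant Langfuse routing (sketch): unknown key aliases fall back to the
# default project, as the step above requires.
import json
import os

from langfuse import Langfuse

HOST = os.environ.get("LANGFUSE_HOST", "https://traces.openautonomyx.com")
# Assumed format: {"tenant-a": {"public_key": "...", "secret_key": "..."}, ...}
TENANT_KEYS = json.loads(os.environ.get("LANGFUSE_TENANT_KEYS", "{}"))

_default = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=HOST,
)
_clients: dict[str, Langfuse] = {}


def langfuse_for(key_alias: str) -> Langfuse:
    """Return a Langfuse client bound to the tenant's project, or the default project."""
    creds = TENANT_KEYS.get(key_alias)
    if not creds:
        return _default
    if key_alias not in _clients:
        _clients[key_alias] = Langfuse(
            public_key=creds["public_key"], secret_key=creds["secret_key"], host=HOST
        )
    return _clients[key_alias]
```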
Step 8.13 — Model Improvement
Read references/model-improvement.md in full. Produce:
improvement/anonymiser.py — PII stripping with Indian-specific patterns (PAN, Aadhaar, Indian mobile, email, card numbers). Required before any trace is stored for training.
Extended lago_callback.py — opt-in check before logging to improvement dataset. Segment lookup from TENANT_SEGMENTS env var.
improvement/rag_middleware.py — RAG enrichment using nomic-embed-text via Ollama + SurrealDB vector search. Per-tenant, per-segment collection scoping.
improvement/ingest.py — document ingestion pipeline: chunk → embed → store in SurrealDB with tenant_id isolation.
improvement/finetune.py — QLoRA fine-tuning with Unsloth on opt-in traces. Produces LoRA adapter per segment. GPU-only, run on RunPod A100 not VPS.
SurrealDB schema — vector collections per segment with tenant_id isolation and MTREE index.
model_registry.json additions — fine_tuned_for field, segment-specific fine-tuned model entries.
Opt-in ToS language — plain English, suitable for SaaS ToS insertion.
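The anonymiser's core is a set of regexes applied before any trace is stored; a sketch with the Indian-specific patterns named above (patterns are deliberately broad and should be tuned against real traces):

```python
# anonymiser.py (core patterns): strip Indian-specific PII before a trace is
# stored for training.
import re

PATTERNS = {
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "AADHAAR": re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"),
    "IN_MOBILE": re.compile(r"(?:\+91[\s-]?)?\b[6-9]\d{9}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}


def anonymise(text: str) -> str:
    """Replace every PII match with a typed placeholder, e.g. <PAN>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


if __name__ == "__main__":
    print(anonymise("Reach me on 9876543210, PAN ABCDE1234F."))
    # -> "Reach me on <IN_MOBILE>, PAN <PAN>."
```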
Step 8.14 — Human Feedback Capture
Read references/human-feedback.md in full. Produce:
feedback.py — FastAPI router mounted on LiteLLM. POST /feedback accepts trace_id, score (0/1), comment, virtual_key, source. Routes to correct Langfuse tenant project. Returns score ID.
Registered in config.yaml — under additional_routers alongside recommender.router.
Feedback widget — self-contained JS + CSS snippet, zero external dependencies. Thumbs up → immediate submit. Thumbs down → shows a comment box. The customer embeds it after each AI response, using the response id as the trace_id.
Python SDK snippet — submit_feedback() function for developer-side feedback.
JavaScript SDK snippet — submitFeedback() for frontend integration.
Feedback volume targets — month-by-month targets to DPO readiness.
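A sketch of the submit_feedback() helper named above; the request fields mirror the feedback.py contract, and the score_id response field is an assumption:

```python
# submit_feedback() (sketch): developer-side helper posting a score to the
# gateway's /feedback route defined in feedback.py.
import requests

GATEWAY_URL = "http://localhost:4000"  # or the deployed gateway hostname


def submit_feedback(
    trace_id: str,
    score: int,                  # 1 = thumbs up, 0 = thumbs down
    virtual_key: str,
    comment: str | None = None,
    source: str = "python-sdk",
) -> str:
    resp = requests.post(
        f"{GATEWAY_URL}/feedback",
        json={
            "trace_id": trace_id,
            "score": score,
            "comment": comment,
            "virtual_key": virtual_key,
            "source": source,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["score_id"]  # assumed field name for the returned score ID
```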
Step 8.15 — Local Language Translation
Read references/translation.md in full. Produce:
translator_server.py — FastAPI sidecar on port 8200. Loads: fastText LID (917KB, auto-download), IndicTrans2 distilled 200M ×2 directions (MIT licence), Opus-MT pairs lazy-loaded on first use (Apache 2.0). Endpoints: POST /translate, GET /languages, GET /health.
docker-compose addition — translator service, 8GB mem_limit, translator-models volume for model caching.
translation_middleware.py — detect language → if native Qwen3: route direct → if not: pivot translate to English → call LLM → translate response back.
Language support matrix — which languages go native vs IndicTrans2 vs Opus-MT. Include RAM allocation update (4GB added).
Licence guardrail — explicitly note NLLB-200 and SeamlessM4T are CC-BY-NC — do NOT include them.
Env vars — TRANSLATOR_URL, INDICTRANS2_DEVICE in .env.example.
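The pivot logic in translation_middleware.py, sketched with the /translate payload shape and the set of natively handled languages as assumptions; language detection itself happens in the translator sidecar via fastText LID:

```python
# translation_middleware.py (sketch): route native languages straight to the
# LLM, pivot everything else through English and back.
import os

import requests

TRANSLATOR_URL = os.environ.get("TRANSLATOR_URL", "http://translator:8200")
NATIVE = {"en", "zh", "es", "fr", "de"}  # illustrative set of languages the serving model handles directly


def translate(text: str, source: str, target: str) -> str:
    r = requests.post(
        f"{TRANSLATOR_URL}/translate",
        json={"text": text, "source": source, "target": target},  # assumed payload shape
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["text"]


def run_with_translation(prompt: str, lang: str, call_llm) -> str:
    """lang comes from the sidecar's language detection; call_llm is the normal gateway call."""
    if lang in NATIVE:
        return call_llm(prompt)
    english_prompt = translate(prompt, source=lang, target="en")    # pivot in
    english_answer = call_llm(english_prompt)
    return translate(english_answer, source="en", target=lang)      # pivot out
```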
Step 8.16 — Two-Node Setup
Read references/two-node-setup.md in full. Produce:
Updated docker-compose.yml for 96GB node — remove Langfuse, Lago, Keycloak services. Add Qwen2.5-14B to Ollama pull list. Update mem_limit: 76g. Update env vars pointing to 48GB node endpoints.
docker-compose-secondary.yml — Langfuse + Lago + Keycloak stack for 48GB node. Include all volumes, healthchecks, Traefik labels.
Migration script — migrate-to-secondary.sh: pg_dump from 96GB, scp, restore on 48GB, verify, stop old containers.
Updated config.yaml — Qwen2.5-14B model entry, updated fallback chain with 14B in extract/summarise path.
Updated model_registry.json — Qwen2.5-14B entry with always_on: true, tier: 2.
Updated RAM maps — both nodes showing three operating states and headroom.
Network options — public TLS vs private network, env var changes per option.
K8s not-needed rationale — document explicitly so future team members understand the decision.