| name | ai-llm-backend |
| description | Build LLM features on the backend — deterministic agent loops (round-trip every tool call by id), RAG over a vector store, token/cost accounting, streaming, eval harness, and prompt-injection defense (treat all model context as untrusted). Use when adding an AI feature, building RAG, or wiring an agent loop. Not for the AI streaming UI on the frontend (use frontend-toolkit's AI integration) or general boundary input parsing (use data-validation). |
| license | MIT |
AI / LLM Backend
Purpose
Build production LLM features that are deterministic where they must be, cost-controlled, observable, and safe against prompt injection — rather than a fragile prompt glued to an API call.
Universal — agent-loop discipline, RAG architecture, token accounting, streaming, eval, and treating model context as untrusted are LLM-backend principles independent of the model vendor; pgvector/Postgres is the default vector store.
Procedure
-
Distinguish workflows from agents
- Workflow — predefined LLM call sequence (classify → extract → format); deterministic, cheaper, debuggable; prefer this
- Agent — model dynamically chooses tools/steps in a loop; powerful but less predictable; use only when the path genuinely can't be predefined
-
Make the agent loop deterministic and bounded
- Round-trip every tool call correctly: each tool-use block gets a matching tool-result carrying the SAME tool-call id — mismatches corrupt the conversation (Anthropic names this field
tool_use_id; OpenAI calls it tool_call_id)
- Cap loop iterations (no infinite tool-calling); cap tool-result token size (compaction)
- Invest in the tool interface (clear schemas + descriptions) as much as the prompt
-
RAG: keep the vector store in the existing database
- One datastore avoids operating a second system; an embeddings column alongside your data is enough for most workloads
- Use an approximate-nearest-neighbour (ANN) index tuned for the recall/latency balance you need
- Chunk deliberately, store source metadata for citations, retrieve top-k then re-rank
- Pin the embedding model id with each stored vector — changing the embedding model invalidates every existing vector (different model = different vector space). Plan a reindex (or dual-write embeddings during a window) before swapping; this is a one-way migration and the #1 RAG operational landmine
-
Account for tokens and cost per call — and survive provider limits
- Log input/output tokens + model + cost for every LLM call
- Set per-user / per-session budgets; alert on spikes (a runaway agent loop or prompt-injection can explode cost)
- Handle 429/503 from the provider with exponential backoff + jitter (see
resilience-patterns); cap parallel in-flight calls per key; for high-availability paths, define a model-fallback chain (primary → secondary → cached/degraded)
- For deterministic prompts (temperature = 0, same input) cache the response — see
caching-strategy (this is what makes evals cheap to re-run)
-
Stream responses
- SSE /
ReadableStream for token-by-token output (pairs with frontend-toolkit AI streaming)
- Handle mid-stream cancellation (client disconnect → stop generation → stop billing)
-
Treat ALL model-context content as untrusted (prompt injection is structural)
- User input, retrieved documents, tool results — all can carry injection; you can't fully "patch" it
- Channel separation is the structural defense: keep user-supplied content in the
user role, never concatenated into the system prompt or a tool description. Same for retrieved docs — wrap each as a user message with a clear "untrusted retrieved content" boundary
- Defenses: never let the model's raw output trigger privileged actions without a gate; validate/parse tool arguments (see
data-validation); a Human-in-the-loop gate for high-stakes actions; least-privilege tools
-
Build an eval harness
- A fixed test set of inputs + expected properties; score outputs (exact, rubric, LLM-judge)
- Run on prompt/model changes — regressions in LLM features are invisible without evals
-
Validate (validation loop)
- Run the eval set; if quality drops below threshold on a prompt/model change → revert or fix and re-run
- Inject a prompt-injection payload via a retrieved doc → verify it can't trigger a privileged action
- Force a tool error → verify the loop handles it (tool-call id still round-tripped, doesn't hang)
Anti-patterns
| ❌ Anti-pattern | ✅ Correct |
|---|
| Agent loop with no iteration cap | Bounded loop + tool-result compaction |
| Mismatched/ignored tool-call id | Round-trip every tool call by id |
| Trusting retrieved docs / tool output as safe | Treat all context as untrusted; gate privileged actions |
| No token/cost logging | Per-call token + cost accounting + budgets |
| Shipping prompt changes with no eval | Eval harness gates prompt/model changes |
| Standing up a second vector DB when your DB can store vectors | Vector store in the existing database |
| Swapping embedding models without a reindex plan | Pin the embedding model id with each vector; reindex (or dual-write) before swap |
| User text concatenated into the system prompt | Channel separation: user content in the user role only |
| No backoff / fallback on provider 429 / 503 | Exponential backoff + jitter + parallel-call cap; multi-model fallback for critical paths |
Severity tiers
| Tier | Examples | Action SLA |
|---|
| Critical | Prompt injection can trigger a privileged action (delete data, send money); unbounded agent loop / cost; raw model output executed; embedding model swapped with no reindex (RAG silently returns garbage) | Block release; fix immediately |
| Major | No token/cost accounting; no eval harness; tool-call-id mishandling causing failures; user content mixed into the system prompt (collapses channel-separation defense) | Fix this sprint |
| Minor | Suboptimal chunking; ANN index params untuned; missing stream cancellation; no multi-model fallback | Schedule within 2 sprints |
Completion Criteria
Output
- AI feature code: agent loop / RAG pipeline / streaming endpoint
- Eval harness: test set + scoring + CI integration
- Cost dashboard: per-feature token/cost metrics
- Commit format:
feat(ai): RAG over <corpus> with pgvector / feat(ai): eval harness for <feature>
Implementation
TypeScript + Postgres(pgvector) + Anthropic SDK (default)
- Agent loop: Anthropic SDK tool-use; match
tool_use_id on every tool_result; cap iterations
- RAG:
pgvector extension, vector column, HNSW index (USING hnsw (embedding vector_cosine_ops)); embeddings via the model provider
- Streaming: SSE from a NestJS endpoint or
ReadableStream; pairs with frontend-toolkit ai-llm UI
- Cost: log
usage (input/output tokens) per call to observability
- Eval: a test suite of prompts + assertions (run in CI)
Other stacks
- Python / FastAPI: Anthropic/OpenAI SDK; pgvector via SQLAlchemy or
pgvector-python; LangChain/LlamaIndex optional (prefer thin)
- Go: provider SDKs; pgvector via
pgx
- Universal: agent-loop discipline, prompt-injection-is-untrusted, token accounting, and eval are vendor-agnostic; pgvector is Postgres (alternatives: Qdrant/Weaviate, but prefer one datastore)
Related skills
data-validation — tool inputs and model outputs are untrusted — parse them
observability-setup — token/cost/latency are first-class metrics for AI features
caching-strategy — cache embeddings and deterministic completions
Reference
- Key insight encoded: Distinguish workflows (predefined, prefer) from agents (dynamic); treat ALL model-context content as untrusted (prompt injection is structural, not patchable) and defend with channel separation — user content stays in the user role, never concatenated into the system prompt; make the loop deterministic — round-trip every tool call by id (
tool_use_id), cap tool-result tokens, stream with explicit cost accounting. Two operational landmines: changing the embedding model invalidates all stored vectors (plan the reindex), and provider rate-limit / outage handling needs explicit backoff + multi-model fallback for critical paths.