con un clic
ai-observability-promptfoo
Testing and evaluation framework for LLM prompts and applications -- promptfooconfig.yaml, assertions, model-graded evals, red teaming, CI/CD integration, custom providers, and comparative evaluation
Menú
Testing and evaluation framework for LLM prompts and applications -- promptfooconfig.yaml, assertions, model-graded evals, red teaming, CI/CD integration, custom providers, and comparative evaluation
Hugging Face Inference SDK patterns for TypeScript/Node.js — InferenceClient setup, chat completion, text generation, streaming, embeddings, image generation, audio transcription, translation, summarization, and Inference Endpoints
LiteLLM proxy server setup, TypeScript client patterns via OpenAI SDK, model routing, fallbacks, load balancing, spend tracking, virtual keys, and production deployment
Serverless GPU compute platform for AI model deployment — web endpoints, GPU functions, model serving, and TypeScript client patterns
Local LLM inference with the Ollama JavaScript client -- chat, streaming, tool calling, vision, embeddings, structured output, model management, and OpenAI-compatible endpoint
Replicate SDK patterns for TypeScript/Node.js -- client setup, predictions, streaming, webhooks, file handling, model versioning, deployments, and training
Together AI SDK patterns for TypeScript — client setup, chat completions, streaming, structured output, function calling, embeddings, image generation, fine-tuning, and OpenAI-compatible endpoints
| name | ai-observability-promptfoo |
| description | Testing and evaluation framework for LLM prompts and applications -- promptfooconfig.yaml, assertions, model-graded evals, red teaming, CI/CD integration, custom providers, and comparative evaluation |
Quick Guide: Use promptfoo for systematic LLM evaluation. Define prompts, providers, and test cases in
promptfooconfig.yaml. Use assertion types (contains,is-json,llm-rubric,similar,cost,latency) to validate outputs. Usepromptfoo evalto run (exits with code 100 on test failures),promptfoo viewfor results UI. Use model-graded assertions (llm-rubric,factuality) for subjective quality. Usepromptfoo redteam runfor security scanning. Use--shareflag orpromptfoo shareto share results. All provider API keys come from environment variables -- never hardcode them.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST define test cases with explicit assert arrays -- tests without assertions only capture output without validating it)
(You MUST use llm-rubric for subjective quality evaluation -- do NOT rely solely on deterministic assertions for natural language output)
(You MUST set threshold on similarity and model-graded assertions -- omitting thresholds uses defaults that may not match your quality bar)
(You MUST use environment variables for all API keys -- never hardcode keys in promptfooconfig.yaml or provider configs)
(You MUST verify promptfoo eval exit code in CI pipelines -- it returns exit code 100 on test failures, exit code 1 on other errors)
</critical_requirements>
Auto-detection: promptfoo, promptfooconfig, promptfooconfig.yaml, promptfoo eval, promptfoo view, promptfoo redteam, llm-rubric, model-graded-closedqa, promptfoo share, promptfoo cache, assertion type, LLM evaluation, prompt testing, red teaming, PROMPTFOO_CONFIG
When to use:
Key patterns covered:
promptfooconfig.yaml structure (prompts, providers, tests, defaultTest)evaluate() function)When NOT to use:
Promptfoo brings test-driven development to LLM applications. Instead of manually checking outputs, you define expected behaviors as assertions and run them systematically across prompts and providers.
Core principles:
promptfooconfig.yaml. No code required for standard evaluations.contains, is-json, equals) for structured output; model-graded assertions (llm-rubric, factuality) for subjective quality.promptfoo eval exits with code 100 on test failures, making it a natural CI quality gate.Every promptfoo project starts with promptfooconfig.yaml. Three required sections: prompts, providers, tests.
# promptfooconfig.yaml
description: "Translation quality evaluation"
prompts:
- "Convert the following to {{language}}: {{input}}"
providers:
- openai:gpt-4o
- anthropic:messages:claude-sonnet-4-6
tests:
- vars:
language: French
input: Hello world
assert:
- type: icontains
value: "bonjour"
- type: llm-rubric
value: "Output is a natural French translation, not word-for-word"
Why good: Declarative config, multi-provider comparison, both deterministic and model-graded assertions
# BAD: Tests without assertions
tests:
- vars:
language: French
input: Hello world
# No assert array -- output is captured but never validated
Why bad: Tests without assertions only log output, they never fail -- you lose the entire point of automated evaluation
See: examples/core.md for prompts from files, provider config, defaultTest, variable loading from CSV
Use for outputs with predictable, verifiable structure.
assert:
# String matching
- type: contains
value: "error"
- type: icontains # case-insensitive
value: "success"
- type: not-contains
value: "internal server error"
- type: starts-with
value: "{"
- type: regex
value: "\\d{4}-\\d{2}-\\d{2}" # date pattern
# Structured output
- type: is-json
- type: contains-json
- type: is-valid-openai-tools-call
# Performance
- type: cost
threshold: 0.01 # max $0.01 per call
- type: latency
threshold: 5000 # max 5 seconds
Why good: Fast, deterministic, no LLM cost for evaluation, catches structural regressions immediately
# BAD: Using llm-rubric for JSON validation
assert:
- type: llm-rubric
value: "Output must be valid JSON"
Why bad: Expensive (requires LLM call), slower, non-deterministic -- is-json does this deterministically for free
See: examples/core.md for all deterministic assertion types with examples
Use for subjective quality where deterministic checks cannot capture intent.
assert:
- type: llm-rubric
value: "Response is helpful, accurate, and conversational in tone"
provider: openai:gpt-4o
- type: factuality
value: "The capital of France is Paris. It has a population of ~2.1 million."
provider: openai:gpt-4o
- type: similar
value: "The weather in Paris is sunny today"
threshold: 0.8
- type: model-graded-closedqa
value: "Paris is the capital of France"
provider: openai:gpt-4o
Why good: Evaluates subjective quality that deterministic assertions cannot capture, configurable grading provider
# BAD: No threshold on similar assertion
assert:
- type: similar
value: "expected output"
# Missing threshold -- uses default which may be too lenient or strict
Why bad: Default similarity threshold may not match your quality bar, always set it explicitly
See: examples/model-graded.md for llm-rubric with custom providers, context evaluation, factuality, custom grading prompts
Use redteam section to scan for security vulnerabilities.
# promptfooconfig.yaml
targets:
- openai:gpt-4o
redteam:
purpose: "Customer support chatbot for an e-commerce platform"
numTests: 10
plugins:
- harmful
- pii
- contracts
- hallucination
- prompt-extraction
strategies:
- jailbreak
- prompt-injection
Why good: Declarative security scanning, purpose provides context for realistic attacks, composable plugins and strategies
# BAD: Red team without purpose
redteam:
plugins:
- harmful
# Missing purpose -- attacks will be generic and less effective
Why bad: Without purpose, the red team generator creates generic attacks that miss application-specific vulnerabilities
See: examples/red-teaming.md for presets (OWASP, NIST), advanced strategies, multi-turn attacks
Use when your LLM integration is not a direct API call (RAG pipelines, agent chains, custom middleware).
// providers/my-app.ts
import type {
ApiProvider,
ProviderOptions,
ProviderResponse,
CallApiContextParams,
} from "promptfoo";
// NOTE: default export required by promptfoo's file:// provider loader
export default class MyAppProvider implements ApiProvider {
private config: Record<string, unknown>;
constructor(options: ProviderOptions) {
this.config = options.config || {};
}
id(): string {
return "my-app-provider";
}
async callApi(
prompt: string,
context?: CallApiContextParams,
): Promise<ProviderResponse> {
// Call your application's LLM pipeline
const result = await myApp.processQuery(prompt);
return {
output: result.answer,
tokenUsage: {
total: result.totalTokens,
prompt: result.promptTokens,
completion: result.completionTokens,
},
cost: result.cost,
};
}
}
# promptfooconfig.yaml
providers:
- file://providers/my-app.ts
Why good: Type-safe, full control over LLM pipeline, reports token usage and cost for assertions
See: examples/custom-providers.md for inline function providers, programmatic API, CI/CD integration
Run evaluations in CI with quality gates.
# .github/workflows/llm-eval.yml
name: LLM Eval
on:
pull_request:
paths:
- "prompts/**"
- "promptfooconfig.yaml"
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "22"
- uses: actions/cache@v4
with:
path: ~/.cache/promptfoo
key: ${{ runner.os }}-promptfoo-v1
- name: Run eval
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: npx promptfoo@latest eval -o results.json --share
Why good: Caches LLM responses across runs, promptfoo eval exits with code 100 on test failures (CI fails automatically), --share generates a shareable results URL
See: examples/custom-providers.md for npm scripts, quality gate thresholds, programmatic evaluation
<decision_framework>
What are you validating?
+-- Exact or structural match?
| +-- Exact text -> equals
| +-- Contains substring -> contains / icontains
| +-- Regex pattern -> regex
| +-- Valid JSON -> is-json
| +-- Valid function call -> is-valid-openai-tools-call
| +-- Cost under budget -> cost (with threshold)
| +-- Response time -> latency (with threshold)
+-- Subjective quality?
| +-- General quality criteria -> llm-rubric
| +-- Factual accuracy against ground truth -> factuality
| +-- Semantic similarity -> similar (with threshold)
| +-- Closed-domain QA accuracy -> model-graded-closedqa
| +-- RAG context fidelity -> context-faithfulness
+-- Custom logic?
+-- JavaScript function -> javascript
+-- Python function -> python
+-- External service -> webhook
What are you testing?
+-- Prompt quality and correctness?
| +-- Use promptfoo eval with test cases and assertions
+-- Security vulnerabilities?
| +-- Use promptfoo redteam run with plugins and strategies
+-- Both?
+-- Run eval for quality, redteam for security -- separate configs or sections
How does your LLM integration work?
+-- Direct API call to OpenAI/Anthropic/etc?
| +-- Use built-in provider: openai:gpt-4o, anthropic:messages:claude-sonnet-4-6
+-- Custom pipeline (RAG, agents, middleware)?
| +-- Use custom TypeScript provider: file://providers/my-app.ts
+-- HTTP endpoint?
| +-- Use HTTP provider: id: https://api.example.com/chat
+-- Multiple providers to compare?
+-- List all in providers array -- promptfoo runs tests against each
</decision_framework>
<red_flags>
High Priority Issues:
assert arrays (output is captured but never validated -- tests always "pass")promptfoo eval exit code in CI (promptfoo eval exits 100 on test failures -- ensure your CI pipeline treats non-zero exit codes as failures)promptfooconfig.yaml (use environment variables)llm-rubric for checks that is-json or contains can do deterministically (wastes money and adds non-determinism)purpose (generic attacks miss application-specific vulnerabilities)Medium Priority Issues:
threshold on similar assertions (default may not match your quality bar)model-graded-closedqa when llm-rubric would be simpler (closedqa is for specific ground-truth QA)provider on model-graded assertions (uses default which may not be the grader you want)numTests: 5 in production scans (too few for comprehensive coverage)Common Mistakes:
prompts (the LLM prompt templates) with tests (the evaluation cases) -- prompts define what to send, tests define what to checkequals for natural language output (LLM output is non-deterministic, use llm-rubric or similar){{variable}} syntax in prompts (promptfoo uses Nunjucks templating, not ${variable})defaultTest that should only apply to specific tests (assertions in defaultTest apply to ALL tests)file:// paths without the prefix (promptfoo treats bare paths as literal strings, not file references)Gotchas & Edge Cases:
promptfoo eval caches LLM responses by default -- use promptfoo cache clear or --no-cache to force fresh calls--share uploads results to promptfoo's servers -- do not use with sensitive data unless self-hostingstrategies wrap plugins output -- a plugin generates the malicious content, a strategy delivers it (e.g., via jailbreak encoding)defaultTest.assert merges with per-test assertions, it does not replace them -- both arrays runinput becomes {{input}} in promptstransform in test options runs JavaScript on the output before assertions -- useful for extracting JSON from markdown-wrapped responsesconfig: key for model parameters (temperature, max_tokens), not top-level fieldsweight property on assertions affects scoring in the results UI but does not change pass/fail behavior</red_flags>
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST define test cases with explicit assert arrays -- tests without assertions only capture output without validating it)
(You MUST use llm-rubric for subjective quality evaluation -- do NOT rely solely on deterministic assertions for natural language output)
(You MUST set threshold on similarity and model-graded assertions -- omitting thresholds uses defaults that may not match your quality bar)
(You MUST use environment variables for all API keys -- never hardcode keys in promptfooconfig.yaml or provider configs)
(You MUST verify promptfoo eval exit code in CI pipelines -- it returns exit code 100 on test failures, exit code 1 on other errors)
Failure to follow these rules will produce untested, insecure, or falsely-passing LLM evaluation pipelines.
</critical_reminders>