| name | promptfoo-evals |
| description | Write, refine, run, and QA promptfoo evaluation suites: promptfooconfig.yaml, prompts, providers, vars, tests, assertions, model-graded rubrics, transforms, datasets, exports, and CI gates. Use for non-redteam eval coverage, regression tests, or new eval matrices. Do not use for adversarial redteam plugin or strategy setup.
|
Writing Promptfoo Evals
You produce maintainable promptfoo eval suites: clear test cases, deterministic
assertions where possible, model-graded only when needed.
See references/cheatsheet.md for the full assertion and provider reference.
For deep questions about promptfoo features, consult https://www.promptfoo.dev/llms-full.txt
Inputs (infer from repo context if not provided)
- What is being evaluated (prompt, agent, endpoint, RAG pipeline)?
- What are the inputs and outputs (text, JSON, multi-turn chat, tool calls)?
- What does "good" look like (acceptance criteria, failure modes)?
If context is insufficient, scaffold with TODO markers and starter tests.
Workflow
1. Find or create the eval suite
Search for existing configs: promptfooconfig.yaml, promptfooconfig.yml,
or any promptfoo/evals folder. Extend existing suites when possible.
For new suites, use this layout (unless the repo uses another convention):
evals/<suite-name>/
promptfooconfig.yaml
prompts/
tests/
Always add # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
at the top of config files.
2. Write prompts
- Put prompts in
prompts/*.txt (plain) or prompts/*.json (chat format)
- Reference via
file://prompts/main.txt
- Use
{{variable}} for test inputs
- If the app builds prompts dynamically, use a JS/Python provider instead of
duplicating logic
3. Choose providers
Pick the simplest option that matches the real system:
| Scenario | Provider pattern |
|---|
| Compare models | openai:chat:gpt-4.1-mini, anthropic:messages:claude-sonnet-4-6 |
| Test an HTTP API | id: https with config.url, config.body, and transformResponse |
| Test local code | file://provider.py or file://provider.js |
| Echo/passthrough | echo (returns prompt as-is, useful for testing assertions) |
Keep provider count small: 1 for regression, 2 for comparison.
For JSON output, add response_format to the provider config:
config:
temperature: 0
response_format:
type: json_object
4. Write tests
Use file-based tests so they scale: tests: file://tests/*.yaml
For larger suites, use dataset-backed tests:
tests: file://tests.csv
tests: file://generate_tests.py:create_tests
Every test should have:
description - short, specific
vars - the inputs
assert - validations (when automatable)
Cover: happy paths, edge cases, known regressions, safety/refusal checks,
output format compliance.
5. Add assertions
Deterministic first (fast, reliable, free):
equals, contains, icontains, regex, is-json, contains-json,
starts-with, cost, latency, javascript, python
Model-graded sparingly (slow, costs money, non-deterministic):
llm-rubric, factuality, answer-relevance, context-faithfulness
Assertions support optional weight (for scoring relative importance) and
metric (named score in reports). threshold is assertion-specific: for
graded assertions it is usually a minimum score (0-1), while for assertions
like cost/latency it is a maximum allowed value.
For model-graded assertions, explicitly set the grader provider so grading is
stable across runs:
defaultTest:
options:
provider: openai:gpt-5-mini
tests:
- description: 'Model-graded quality check'
assert:
- type: llm-rubric
value: 'Accurate and concise'
Hallucination / faithfulness pattern:
When checking that output is grounded in source material, include the source in
the rubric so the grader can compare. Use context-faithfulness when you have
a context var, or inline the source in the llm-rubric value:
assert:
- type: llm-rubric
value: |
The summary only states facts from this source article:
"{{article}}"
It does not add, infer, or fabricate any claims.
JSON output pattern:
assert:
- type: is-json
value:
type: object
required: [name, score]
- type: javascript
value: 'JSON.parse(output).score >= 0.8'
Transform pattern (preprocess output before assertions):
When models wrap JSON in markdown fences or add preamble text, use
options.transform on the test to clean output before assertions run:
options:
transform: "output.replace(/```json\\n?|```/g, '').trim()"
Use defaultTest for assertions shared across all tests (cost limits, format
checks, etc.).
6. Validate and run
Before finishing, validate and provide run commands. Always use --no-cache
during development to avoid stale results. Only run eval if credentials are
available and safe to call.
npx promptfoo@latest validate config -c <config>
npx promptfoo@latest eval -c <config> -o output.json --no-cache --no-share
For CI/non-UI workflows, prefer the -o output.json command and inspect
success, score, and error fields.
If working in the promptfoo repo itself, prefer the local build:
source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c <config>
npm run local -- eval -c <config> -o output.json --no-cache --no-share
Add --env-file .env only when the eval needs local credentials and that file
exists.
Do not run npm run local -- view unless explicitly asked.
Common mistakes
apiKey: $OPENAI_API_KEY
apiKey: '{{env.OPENAI_API_KEY}}'
- type: llm-rubric
value: 'Only contains info from the original article'
- type: llm-rubric
value: |
Only states facts from: "{{article}}"
Output contract
When done, state:
- What the suite evaluates (1-3 bullets)
- Files created/modified (paths)
- How to run (copy-pastable commands)
- Required env vars
- TODOs left behind (only if unavoidable)