| name | promptfoo-evals |
| description | Write, refine, run, and QA non-redteam promptfoo eval suites after the target or provider already works: prompts, vars, test cases, assertions, model-graded rubrics, transforms, datasets, output exports, filters, and CI gates. Use for regression tests and eval-suite authoring. Do not use for connecting a new target/provider, mapping HTTP requests or auth, smoke-testing an endpoint, or redteam plugin/strategy setup; use `promptfoo-provider-setup` for connection work instead.
|
Promptfoo Evals
Build a small eval that answers one product question clearly, run it with fresh
results, then inspect the exported artifact before expanding.
Read references/eval-patterns.md when you need concrete YAML patterns,
assertion examples, or CI snippets.
For deep promptfoo feature questions that are not covered here, consult
https://www.promptfoo.dev/llms-full.txt.
Inputs
Infer these from the repo or user prompt:
- Behavior being evaluated and what "good" means.
- Target/provider already configured, or whether
promptfoo-provider-setup is
needed first.
- Prompt shape and variables.
- Test data source: inline cases, CSV/JSON, generated data, production examples,
or hand-picked regressions.
- Assertion style: deterministic checks, structured output validation,
JavaScript assertions, model-graded rubrics, or a mix.
- Output needs: JSON export, comparison, CI gate, or human triage.
If the provider does not work yet, switch to promptfoo-provider-setup. If the
task is adversarial security scanning, switch to promptfoo-redteam-setup or
promptfoo-redteam-run.
Workflow
1. State the eval question
Search for existing configs first: promptfooconfig.yaml,
promptfooconfig.yml, or repo evals/promptfoo directories. Extend an
existing suite when possible.
Write one sentence for the behavior under test, then choose 3-10 starter cases.
Include both ordinary success cases and edge cases that have broken before.
For new suites, prefer this layout unless the repo already has a convention:
evals/<suite-name>/
promptfooconfig.yaml
prompts/
tests/
2. Choose assertions
Prefer deterministic assertions first:
- Exact or substring behavior:
equals, contains, icontains, regex
- Structured output:
is-json, contains-json, javascript
- Numeric or score-like outputs:
javascript returning a boolean or score
- Semantic quality:
llm-rubric with an explicit grader provider when possible
Use model-graded assertions sparingly for qualities that deterministic checks
cannot capture. Configure a local or explicit grader for reproducible QA.
3. Write the config
Include:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
- A short
description
- Field order:
description, env, prompts, providers, defaultTest,
scenarios, tests
prompts via file://prompts/*.txt or file://prompts/*.json when prompts
are more than a one-line smoke test
tests: file://tests/*.yaml for suites that will grow beyond a few cases
defaultTest only for shared assertions/options
options.transform when parsing JSON once makes assertions cleaner
- Stable metric names only when they help compare dashboards over time
Keep secrets as {{env.VAR}}; do not commit .env values.
When checking faithfulness or hallucination with llm-rubric, inline the source
material in the rubric via {{variable}} so the grader can actually compare.
4. Validate and run
From the promptfoo repo:
source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c path/to/promptfooconfig.yaml
npm run local -- eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share
Outside the repo:
npx promptfoo@latest validate config -c path/to/promptfooconfig.yaml
npx promptfoo@latest eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share
Inspect the output JSON for results.stats, response.output, score,
gradingResult.reason, and error.
5. Iterate deliberately
- Add cases when a failure represents real expected behavior.
- Tighten assertions when false positives pass.
- Use
--filter-pattern, --filter-metadata, or --filter-failing for focused
reruns.
- Keep
--no-cache while developing so you are not validating stale outputs.
- Use
--no-share unless the user asks for a shareable URL.
Common Mistakes
- type: llm-rubric
value: Is this good?
- type: llm-rubric
value: >-
The answer must cite the requested invoice id, state approved/denied, and
avoid inventing fields not present in the tool result.
- type: javascript
value: ['billing', 'technical'].includes(output.category)
- type: javascript
value: "['billing', 'technical'].includes(output.category)"
prompts:
- 'Classify: {{text}}. Return {"category": "..."} JSON.'
prompts:
- file://./prompts/classify.txt
assert:
- type: javascript
value: JSON.parse(output).status === 'approved'
options:
transform: JSON.parse(output)
assert:
- type: javascript
value: output.status === 'approved'
Output Contract
When done, state:
- Eval question and target/provider used
- Files created or changed
- Assertion strategy and why
- Validation/eval commands run
- Result stats and any failures/errors
- Required environment variables
- Follow-up cases or assertions to add next