Run any Skill in Manus with one click

$pwd:

promptfoo-evals

Name: Promptfoo Evals
Author: promptfoo

// Write, refine, run, and QA non-redteam promptfoo eval suites after the target or provider already works: prompts, vars, test cases, assertions, model-graded rubrics, transforms, datasets, output exports, filters, and CI gates. Use for regression tests and eval-suite authoring. Do not use for connecting a new target/provider, mapping HTTP requests or auth, smoke-testing an endpoint, or redteam plugin/strategy setup; use `promptfoo-provider-setup` for connection work instead.

Run Skill in Manus

$ git log --oneline --stat

stars:21,532

forks:1,893

updated:May 6, 2026 at 18:16

File Explorer

3 files

SKILL.md

readonly

name

promptfoo-evals

description

Write, refine, run, and QA non-redteam promptfoo eval suites after the target or provider already works: prompts, vars, test cases, assertions, model-graded rubrics, transforms, datasets, output exports, filters, and CI gates. Use for regression tests and eval-suite authoring. Do not use for connecting a new target/provider, mapping HTTP requests or auth, smoke-testing an endpoint, or redteam plugin/strategy setup; use `promptfoo-provider-setup` for connection work instead.

Promptfoo Evals

Build a small eval that answers one product question clearly, run it with fresh results, then inspect the exported artifact before expanding.

Read references/eval-patterns.md when you need concrete YAML patterns, assertion examples, or CI snippets. For deep promptfoo feature questions that are not covered here, consult https://www.promptfoo.dev/llms-full.txt.

Inputs

Infer these from the repo or user prompt:

Behavior being evaluated and what "good" means.
Target/provider already configured, or whether promptfoo-provider-setup is needed first.
Prompt shape and variables.
Test data source: inline cases, CSV/JSON, generated data, production examples, or hand-picked regressions.
Assertion style: deterministic checks, structured output validation, JavaScript assertions, model-graded rubrics, or a mix.
Output needs: JSON export, comparison, CI gate, or human triage.

If the provider does not work yet, switch to promptfoo-provider-setup. If the task is adversarial security scanning, switch to promptfoo-redteam-setup or promptfoo-redteam-run.

Workflow

1. State the eval question

Search for existing configs first: promptfooconfig.yaml, promptfooconfig.yml, or repo evals/promptfoo directories. Extend an existing suite when possible.

Write one sentence for the behavior under test, then choose 3-10 starter cases. Include both ordinary success cases and edge cases that have broken before.

For new suites, prefer this layout unless the repo already has a convention:

evals/<suite-name>/
  promptfooconfig.yaml
  prompts/
  tests/

2. Choose assertions

Prefer deterministic assertions first:

Exact or substring behavior: equals, contains, icontains, regex
Structured output: is-json, contains-json, javascript
Numeric or score-like outputs: javascript returning a boolean or score
Semantic quality: llm-rubric with an explicit grader provider when possible

Use model-graded assertions sparingly for qualities that deterministic checks cannot capture. Configure a local or explicit grader for reproducible QA.

3. Write the config

Include:

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
A short description
Field order: description, env, prompts, providers, defaultTest, scenarios, tests
prompts via file://prompts/*.txt or file://prompts/*.json when prompts are more than a one-line smoke test
tests: file://tests/*.yaml for suites that will grow beyond a few cases
defaultTest only for shared assertions/options
options.transform when parsing JSON once makes assertions cleaner
Stable metric names only when they help compare dashboards over time

Keep secrets as {{env.VAR}}; do not commit .env values. When checking faithfulness or hallucination with llm-rubric, inline the source material in the rubric via {{variable}} so the grader can actually compare.

4. Validate and run

From the promptfoo repo:

source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c path/to/promptfooconfig.yaml
npm run local -- eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share

Outside the repo:

npx promptfoo@latest validate config -c path/to/promptfooconfig.yaml
npx promptfoo@latest eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share

Inspect the output JSON for results.stats, response.output, score, gradingResult.reason, and error.

5. Iterate deliberately

Add cases when a failure represents real expected behavior.
Tighten assertions when false positives pass.
Use --filter-pattern, --filter-metadata, or --filter-failing for focused reruns.
Keep --no-cache while developing so you are not validating stale outputs.
Use --no-share unless the user asks for a shareable URL.

Common Mistakes

# WRONG: vague rubric with no examples or grader control
- type: llm-rubric
  value: Is this good?

# BETTER: concrete success criteria
- type: llm-rubric
  value: >-
    The answer must cite the requested invoice id, state approved/denied, and
    avoid inventing fields not present in the tool result.

# WRONG: unquoted JS expression that starts with [ or { is parsed as YAML flow
- type: javascript
  value: ['billing', 'technical'].includes(output.category)

# BETTER: quote any assertion value that begins with [, {, *, &, or !
- type: javascript
  value: "['billing', 'technical'].includes(output.category)"

# WRONG: inline prompts that contain JSON-like braces are misread as file paths
prompts:
  - 'Classify: {{text}}. Return {"category": "..."} JSON.'

# BETTER: move non-trivial prompts (JSON examples, multi-line, quotes) to a file
prompts:
  - file://./prompts/classify.txt

# WRONG: reparsing JSON in every assertion
assert:
  - type: javascript
    value: JSON.parse(output).status === 'approved'

# BETTER: parse once for the test
options:
  transform: JSON.parse(output)
assert:
  - type: javascript
    value: output.status === 'approved'

Output Contract

When done, state:

Eval question and target/provider used
Files created or changed
Assertion strategy and why
Validation/eval commands run
Result stats and any failures/errors
Required environment variables
Follow-up cases or assertions to add next

related-skills.json

same repository

redteam-plugin-development.md

from "promptfoo/promptfoo"

Standards for creating redteam plugins and graders. Use when creating new plugins, writing graders, or modifying attack templates.

2026-05-0621.5k

search-params.md

from "promptfoo/promptfoo"

URL search param and hash state management. Use when adding or modifying URL search params, working with useSearchParams, setSearchParams, useSearchParamState, or navigate() with query strings or hash fragments, or fixing browser back/forward button issues.

2026-05-0621.5k

promptfoo-evals.md

from "promptfoo/promptfoo"

Write, refine, run, and QA promptfoo evaluation suites: promptfooconfig.yaml, prompts, providers, vars, tests, assertions, model-graded rubrics, transforms, datasets, exports, and CI gates. Use for non-redteam eval coverage, regression tests, or new eval matrices. Do not use for adversarial redteam plugin or strategy setup.

2026-05-0621.5k

promptfoo-redteam-setup.md

from "promptfoo/promptfoo"

Create or refine promptfoo redteam setup configs: purpose, targets, plugins, strategies, frameworks, multi-input target inputs, policy text, grader guidance, contexts, and static-code-derived target/threat mapping. Use when preparing a red team scan plan from live probes, code evidence, or provider configs, or when generating adversarial test cases for QA. Do not use for basic provider wiring alone or for running/evaluating an already-generated redteam scan.

2026-05-0621.5k

discount-review.md

from "promptfoo/promptfoo"

Inspect the discount policy fixture with a repeatable review checklist and helper script.

2026-05-0421.5k

review-standards.md

from "promptfoo/promptfoo"

Use this skill when asked to review authentication code for security issues.

2026-04-2921.5k

package.json

"author": "promptfoo"

"repository": "promptfoo/promptfoo"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations15-1253L4

name

promptfoo-evals

description

Promptfoo Evals

Build a small eval that answers one product question clearly, run it with fresh results, then inspect the exported artifact before expanding.

Inputs

Infer these from the repo or user prompt:

Behavior being evaluated and what "good" means.
Target/provider already configured, or whether promptfoo-provider-setup is needed first.
Prompt shape and variables.
Test data source: inline cases, CSV/JSON, generated data, production examples, or hand-picked regressions.
Assertion style: deterministic checks, structured output validation, JavaScript assertions, model-graded rubrics, or a mix.
Output needs: JSON export, comparison, CI gate, or human triage.

If the provider does not work yet, switch to promptfoo-provider-setup. If the task is adversarial security scanning, switch to promptfoo-redteam-setup or promptfoo-redteam-run.

Workflow

1. State the eval question

Search for existing configs first: promptfooconfig.yaml, promptfooconfig.yml, or repo evals/promptfoo directories. Extend an existing suite when possible.

Write one sentence for the behavior under test, then choose 3-10 starter cases. Include both ordinary success cases and edge cases that have broken before.

For new suites, prefer this layout unless the repo already has a convention:

evals/<suite-name>/
  promptfooconfig.yaml
  prompts/
  tests/

2. Choose assertions

Prefer deterministic assertions first:

Exact or substring behavior: equals, contains, icontains, regex
Structured output: is-json, contains-json, javascript
Numeric or score-like outputs: javascript returning a boolean or score
Semantic quality: llm-rubric with an explicit grader provider when possible

Use model-graded assertions sparingly for qualities that deterministic checks cannot capture. Configure a local or explicit grader for reproducible QA.

3. Write the config

Include:

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
A short description
Field order: description, env, prompts, providers, defaultTest, scenarios, tests
prompts via file://prompts/*.txt or file://prompts/*.json when prompts are more than a one-line smoke test
tests: file://tests/*.yaml for suites that will grow beyond a few cases
defaultTest only for shared assertions/options
options.transform when parsing JSON once makes assertions cleaner
Stable metric names only when they help compare dashboards over time

4. Validate and run

From the promptfoo repo:

source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c path/to/promptfooconfig.yaml
npm run local -- eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share

Outside the repo:

npx promptfoo@latest validate config -c path/to/promptfooconfig.yaml
npx promptfoo@latest eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share

Inspect the output JSON for results.stats, response.output, score, gradingResult.reason, and error.

5. Iterate deliberately

Add cases when a failure represents real expected behavior.
Tighten assertions when false positives pass.
Use --filter-pattern, --filter-metadata, or --filter-failing for focused reruns.
Keep --no-cache while developing so you are not validating stale outputs.
Use --no-share unless the user asks for a shareable URL.

Common Mistakes

# WRONG: vague rubric with no examples or grader control
- type: llm-rubric
  value: Is this good?

# BETTER: concrete success criteria
- type: llm-rubric
  value: >-
    The answer must cite the requested invoice id, state approved/denied, and
    avoid inventing fields not present in the tool result.

# WRONG: unquoted JS expression that starts with [ or { is parsed as YAML flow
- type: javascript
  value: ['billing', 'technical'].includes(output.category)

# BETTER: quote any assertion value that begins with [, {, *, &, or !
- type: javascript
  value: "['billing', 'technical'].includes(output.category)"

# WRONG: inline prompts that contain JSON-like braces are misread as file paths
prompts:
  - 'Classify: {{text}}. Return {"category": "..."} JSON.'

# BETTER: move non-trivial prompts (JSON examples, multi-line, quotes) to a file
prompts:
  - file://./prompts/classify.txt

# WRONG: reparsing JSON in every assertion
assert:
  - type: javascript
    value: JSON.parse(output).status === 'approved'

# BETTER: parse once for the test
options:
  transform: JSON.parse(output)
assert:
  - type: javascript
    value: output.status === 'approved'

Output Contract

When done, state:

Eval question and target/provider used
Files created or changed
Assertion strategy and why
Validation/eval commands run
Result stats and any failures/errors
Required environment variables
Follow-up cases or assertions to add next

promptfoo-evals

Promptfoo Evals

Inputs

Workflow

1. State the eval question

2. Choose assertions

3. Write the config

4. Validate and run

5. Iterate deliberately

Common Mistakes

Output Contract

More from this repository

More from this repository

Promptfoo Evals

Inputs

Workflow

1. State the eval question

2. Choose assertions

3. Write the config

4. Validate and run

5. Iterate deliberately

Common Mistakes

Output Contract