Name: Promptfoo Evals
Author: promptfoo

Write, refine, run, and QA promptfoo evaluation suites: promptfooconfig.yaml, prompts, providers, vars, tests, assertions, model-graded rubrics, transforms, datasets, exports, and CI gates. Use for non-redteam eval coverage, regression tests, or new eval matrices. Do not use for adversarial redteam plugin or strategy setup.


Stars: 21,532
Forks: 1,893
Updated: May 6, 2026, 18:16

Repository: promptfoo/promptfoo


name: promptfoo-evals

Writing Promptfoo Evals

You produce maintainable promptfoo eval suites: clear test cases, deterministic assertions where possible, and model-graded assertions only when needed.

See references/cheatsheet.md for the full assertion and provider reference. For deep questions about promptfoo features, consult https://www.promptfoo.dev/llms-full.txt

Inputs (infer from repo context if not provided)

What is being evaluated (prompt, agent, endpoint, RAG pipeline)?
What are the inputs and outputs (text, JSON, multi-turn chat, tool calls)?
What does "good" look like (acceptance criteria, failure modes)?

If context is insufficient, scaffold with TODO markers and starter tests.

Workflow

1. Find or create the eval suite

Search for existing configs: promptfooconfig.yaml, promptfooconfig.yml, or any promptfoo/evals folder. Extend existing suites when possible.

For new suites, use this layout (unless the repo uses another convention):

evals/<suite-name>/
  promptfooconfig.yaml
  prompts/
  tests/

Always add # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json at the top of config files.
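The layout and schema header above can be scaffolded with a small script. This is a minimal sketch, not part of promptfoo itself: the function name scaffold_suite, the starter config contents, and the placeholder prompt file are all assumptions chosen for illustration.

```python
from pathlib import Path

# Schema comment quoted from the guidance above.
SCHEMA_HEADER = "# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json"

# Hypothetical starter config; every value is a placeholder to be replaced.
STARTER_CONFIG = f"""{SCHEMA_HEADER}
description: TODO describe what this suite evaluates
prompts:
  - file://prompts/main.txt
providers:
  - echo  # TODO replace with the real provider
tests: file://tests/*.yaml
"""


def scaffold_suite(root: Path, suite_name: str) -> Path:
    """Create evals/<suite-name>/ with prompts/ and tests/; never overwrite existing files."""
    suite_dir = root / "evals" / suite_name
    (suite_dir / "prompts").mkdir(parents=True, exist_ok=True)
    (suite_dir / "tests").mkdir(parents=True, exist_ok=True)

    prompt_path = suite_dir / "prompts" / "main.txt"
    if not prompt_path.exists():
        prompt_path.write_text("TODO write the prompt. Input: {{input}}\n", encoding="utf-8")

    config_path = suite_dir / "promptfooconfig.yaml"
    if not config_path.exists():
        config_path.write_text(STARTER_CONFIG, encoding="utf-8")
    return config_path
```

Calling scaffold_suite(Path("."), "support-bot") would create evals/support-bot/ in the current directory. Existing files are left alone, so the script is safe to rerun in a repo that already has a suite.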

2. Write prompts

Put prompts in prompts/*.txt (plain) or prompts/*.json (chat format)
Reference via file://prompts/main.txt
Use {{variable}} for test inputs
If the app builds prompts dynamically, use a JS/Python provider instead of duplicating logic

3. Choose providers

Pick the simplest option that matches the real system:

| Scenario | Provider pattern |
| --- | --- |
| Compare models | `openai:chat:gpt-4.1-mini`, `anthropic:messages:claude-sonnet-4-6` |
| Test an HTTP API | `id: https` with `config.url`, `config.body`, and `transformResponse` |
| Test local code | `file://provider.py` or `file://provider.js` |
| Echo/passthrough | `echo` (returns prompt as-is, useful for testing assertions) |

Keep provider count small: 1 for regression, 2 for comparison.
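For the "test local code" row, a Python provider might look like the sketch below. It assumes promptfoo's Python provider entry point call_api(prompt, options, context) returning a dict with an output key. The build_prompt helper, the tone config key, and the question var are hypothetical stand-ins for the app's own prompt-building logic.

```python
# provider.py (hypothetical file, referenced from the config as file://provider.py)
from typing import Any, Dict


def build_prompt(question: str, tone: str) -> str:
    """Stand-in for the application's real prompt-building code (hypothetical)."""
    return f"Answer in a {tone} tone.\nQuestion: {question}"


def call_api(prompt: str, options: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Entry point promptfoo is assumed to call once per test case."""
    test_vars = context.get("vars") or {}
    config = options.get("config") or {}

    final_prompt = build_prompt(
        test_vars.get("question", prompt),  # fall back to the rendered prompt
        config.get("tone", "neutral"),      # hypothetical provider config key
    )

    # A real provider would call the application or model here.
    # This sketch returns the built prompt so assertions can be tested offline.
    return {"output": final_prompt}
```

Reusing the app's own builder this way avoids duplicating prompt logic in prompts/*.txt, which is the situation step 2 warns about.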

For JSON output, add response_format to the provider config:

config:
  temperature: 0
  response_format:
    type: json_object

4. Write tests

Keep tests in files so the suite scales, for example tests: file://tests/*.yaml

For larger suites, use dataset-backed tests:

tests: file://tests.csv
# or
tests: file://generate_tests.py:create_tests

Every test should have:

description - short, specific
vars - the inputs
assert - validations (when automatable)

Cover: happy paths, edge cases, known regressions, safety/refusal checks, output format compliance.
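A generator referenced as file://generate_tests.py:create_tests could be sketched as follows. It assumes promptfoo accepts a list of test-case dicts from the named function. The cases, the question var, and the expected substrings are made up for illustration.

```python
# generate_tests.py (hypothetical file)
from typing import Any, Dict, List

# Hypothetical cases; in practice these might come from a CSV, a database, or fixtures.
CASES = [
    ("happy path: refund question", "How do I get a refund?", "refund"),
    ("edge case: empty input", "", "clarify"),
]


def create_tests() -> List[Dict[str, Any]]:
    """Return one test case per row, each with description, vars, and assert."""
    tests = []
    for description, question, expected in CASES:
        tests.append(
            {
                "description": description,
                "vars": {"question": question},
                # Deterministic, case-insensitive substring check.
                "assert": [{"type": "icontains", "value": expected}],
            }
        )
    return tests
```

Each generated case carries the three fields listed above, so generated and hand-written tests can live in the same suite.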

5. Add assertions

Deterministic first (fast, reliable, free): equals, contains, icontains, regex, is-json, contains-json, starts-with, cost, latency, javascript, python

Model-graded sparingly (slow, costs money, non-deterministic): llm-rubric, factuality, answer-relevance, context-faithfulness

Assertions support optional weight (for scoring relative importance) and metric (named score in reports). threshold is assertion-specific: for graded assertions it is usually a minimum score (0-1), while for assertions like cost/latency it is a maximum allowed value.

For model-graded assertions, explicitly set the grader provider so grading is stable across runs:

defaultTest:
  options:
    provider: openai:gpt-5-mini

tests:
  - description: 'Model-graded quality check'
    assert:
      - type: llm-rubric
        value: 'Accurate and concise'
        # Optional per-assertion override:
        # provider: anthropic:messages:claude-sonnet-4-6

Hallucination / faithfulness pattern: When checking that output is grounded in source material, include the source in the rubric so the grader can compare. Use context-faithfulness when you have a context var, or inline the source in the llm-rubric value:

assert:
  - type: llm-rubric
    value: |
      The summary only states facts from this source article:
      "{{article}}"
      It does not add, infer, or fabricate any claims.

JSON output pattern:

assert:
  - type: is-json
    value: # optional JSON Schema
      type: object
      required: [name, score]
  - type: javascript
    value: 'JSON.parse(output).score >= 0.8'
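The same check can be written as a python assertion when a failure reason is useful in reports. This is a sketch under assumptions: it relies on promptfoo calling get_assert(output, context) in a file referenced via type: python with value: file://assert_score.py, and on a dict result with pass, score, and reason keys. The file name is hypothetical.

```python
# assert_score.py (hypothetical file)
import json
from typing import Any, Dict


def _fail(reason: str) -> Dict[str, Any]:
    return {"pass": False, "score": 0.0, "reason": reason}


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass only when output is a JSON object with name and a numeric score >= 0.8."""
    try:
        data = json.loads(output)
    except (TypeError, ValueError) as exc:
        return _fail(f"Output is not valid JSON: {exc}")

    if not isinstance(data, dict):
        return _fail("Output JSON is not an object")

    missing = [key for key in ("name", "score") if key not in data]
    if missing:
        return _fail(f"Missing required keys: {missing}")

    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return _fail("score is not a number")

    if score < 0.8:
        return _fail(f"score {score} is below 0.8")
    return {"pass": True, "score": 1.0, "reason": f"score {score} meets the 0.8 minimum"}
```

Unlike the one-line javascript assertion, this version reports a failure instead of throwing when the output is not valid JSON.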

Transform pattern (preprocess output before assertions): When models wrap JSON in markdown fences or add preamble text, use options.transform on the test to clean output before assertions run:

options:
  transform: "output.replace(/```json\\n?|```/g, '').trim()"
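For reference, a Python equivalent of that fence-stripping transform is sketched below. It assumes promptfoo can load a transform from a file such as file://transform.py and call a function named get_transform(output, context); treat the file name and entry-point name as assumptions to confirm against the promptfoo docs. Like the one-liner, it only removes fence markers and does not strip preamble text.

```python
# transform.py (hypothetical file)
import re
from typing import Any, Dict

# Built programmatically so this example does not contain a literal code fence.
FENCE = "`" * 3

# Mirrors the JavaScript pattern: an opening json fence (with optional newline) or a bare fence.
_FENCE_RE = re.compile(f"{re.escape(FENCE)}json\\n?|{re.escape(FENCE)}")


def get_transform(output: str, context: Dict[str, Any]) -> str:
    """Remove markdown fence markers and surrounding whitespace from model output."""
    return _FENCE_RE.sub("", output).strip()
```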

Use defaultTest for assertions shared across all tests (cost limits, format checks, etc.).

6. Validate and run

Before finishing, validate and provide run commands. Always use --no-cache during development to avoid stale results. Only run eval if credentials are available and safe to call.

npx promptfoo@latest validate config -c <config>
npx promptfoo@latest eval -c <config> -o output.json --no-cache --no-share

For CI/non-UI workflows, prefer the -o output.json command and inspect success, score, and error fields.
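A CI gate over output.json could be sketched like this. The nesting is an assumption: the sketch expects per-test rows under results.results, each with success, score, and error fields, so check it against a real export before relying on it. The function names are hypothetical.

```python
import json
from typing import Any, Dict, List


def failed_rows(report: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return rows that did not succeed or that carry an error."""
    results = report.get("results", {})
    # Assumed shape: {"results": {"results": [...]}}; tolerate a flat list too.
    rows = results.get("results", []) if isinstance(results, dict) else results
    return [row for row in rows if not row.get("success") or row.get("error")]


def gate(path: str = "output.json") -> int:
    """Print failing rows and return a process exit code (0 = all passed)."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    failures = failed_rows(report)
    for row in failures:
        print(f"FAIL score={row.get('score')} error={row.get('error')}")
    return 1 if failures else 0
```

In a CI step, passing the return value of gate() to sys.exit fails the job when any row fails or errors.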

If working in the promptfoo repo itself, prefer the local build:

source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c <config>
npm run local -- eval -c <config> -o output.json --no-cache --no-share

Add --env-file .env only when the eval needs local credentials and that file exists.

Do not run npm run local -- view unless explicitly asked.

Common mistakes

# ❌ WRONG — shell-style env vars don't work in YAML configs
apiKey: $OPENAI_API_KEY

# ✅ CORRECT — use Nunjucks syntax with quotes
apiKey: '{{env.OPENAI_API_KEY}}'

# ❌ WRONG — rubric references "the article" but grader can't see it
- type: llm-rubric
  value: 'Only contains info from the original article'

# ✅ CORRECT — inline the source so the grader can compare
- type: llm-rubric
  value: |
    Only states facts from: "{{article}}"
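The first mistake above is easy to catch mechanically. The following is a rough sketch of a line-based lint, not a promptfoo feature: the function name is hypothetical, and the regex only flags simple key: $VAR or key: ${VAR} values rather than parsing YAML.

```python
import re
from typing import List, Tuple

# Matches YAML values such as `apiKey: $OPENAI_API_KEY` or `apiKey: ${OPENAI_API_KEY}`.
_SHELL_VAR = re.compile(
    r"^\s*(?:-\s*)?[A-Za-z_][\w-]*:\s*['\"]?\$\{?[A-Za-z_]\w*\}?['\"]?\s*(?:#.*)?$"
)


def find_shell_style_vars(config_text: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for values that look like shell variables."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip comments, including the schema header
        if _SHELL_VAR.match(line):
            hits.append((number, line.strip()))
    return hits
```

Any hit should be rewritten to the quoted Nunjucks form shown above, for example '{{env.OPENAI_API_KEY}}'.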

Output contract

When done, state:

What the suite evaluates (1-3 bullets)
Files created/modified (paths)
How to run (copy-pastable commands)
Required env vars
TODOs left behind (only if unavoidable)
