| name | output-dev-eval-testing |
| description | Create offline evaluation tests for Output SDK workflows using @outputai/evals. Use when implementing test evaluators with verify(), creating dataset YAML files, building eval workflows, or running workflow tests via CLI. |
| allowed-tools | ["Bash","Read","Write","Edit"] |
Offline Evaluation Testing
Overview
The @outputai/evals package provides an offline evaluation framework for testing workflow quality using datasets and evaluators. This is complementary to the runtime evaluator() from @outputai/core:
| Aspect | Runtime Evaluators (@outputai/core) | Offline Eval Tests (@outputai/evals) |
|---|
| When | During workflow execution | After execution, at test time |
| Where | evaluators.ts in workflow folder | tests/evals/ in workflow folder |
| Purpose | Live quality scoring with confidence | Dataset-driven pass/fail verification |
| Triggered by | Workflow orchestration | output workflow test CLI command |
| Returns | EvaluationBooleanResult, etc. | Verdict helpers (pass/partial/fail) |
Use offline eval testing when you want to validate workflow behavior against known datasets, build regression test suites, or assess subjective quality with LLM judges.
When to Use This Skill
- Creating files in
tests/evals/ or tests/datasets/
- Writing evaluators that use
verify() from @outputai/evals
- Creating YAML dataset files for test cases
- Building eval workflows with
evalWorkflow()
- Running
output workflow test commands
- Setting up ground truth data for evaluators
Directory Structure
Add a tests/ directory inside the workflow folder:
src/workflows/{workflow_name}/
āāā workflow.ts
āāā steps.ts
āāā evaluators.ts # Runtime evaluators (optional)
āāā types.ts
āāā tests/
āāā datasets/
ā āāā happy_path.yml
ā āāā edge_case.yml
āāā evals/
āāā evaluators.ts # Offline eval test evaluators
āāā workflow.ts # Eval workflow definition
āāā judge_topic@v1.prompt # LLM judge prompts (optional)
Creating Evaluators with verify()
Import verify and Verdict from @outputai/evals (not @outputai/core):
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';
verify() Signature
verify(options, checkFn)
Options:
name ā unique evaluator identifier (snake_case)
input ā Zod schema for the workflow input (optional, defaults to z.any())
output ā Zod schema for the workflow output (optional, defaults to z.any())
Check function receives:
{
input,
output,
context: {
ground_truth: Record<string, unknown>
}
}
Returns: any Verdict helper result.
Basic Example
import { verify, Verdict } from '@outputai/evals';
import { z } from '@outputai/core';
export const evaluateSum = verify(
{
name: 'evaluate_sum',
input: z.object({ values: z.array(z.number()) }),
output: z.object({ result: z.number() })
},
({ input, output }) =>
Verdict.equals(output.result, input.values.reduce((a, b) => a + b, 0))
);
Using Ground Truth
Ground truth values come from the dataset YAML and are available via context.ground_truth:
export const lengthCheck = verify(
{ name: 'length_check', input: blogInput, output: blogOutput },
({ output, context }) =>
Verdict.gte(output.blog_post.length, Number(context.ground_truth.min_length ?? 100))
);
Verdict Helpers
All deterministic helpers return results with confidence 1.0.
Equality & Comparison
| Method | Description |
|---|
Verdict.equals(actual, expected) | Strict equality (===) |
Verdict.closeTo(actual, expected, tolerance) | Within numeric tolerance |
Verdict.gt(actual, threshold) | Greater than |
Verdict.gte(actual, threshold) | Greater than or equal |
Verdict.lt(actual, threshold) | Less than |
Verdict.lte(actual, threshold) | Less than or equal |
Verdict.inRange(actual, min, max) | Within inclusive range |
String & Array
| Method | Description |
|---|
Verdict.contains(haystack, needle) | String includes substring |
Verdict.matches(value, pattern) | Regex match |
Verdict.includesAll(actual, expected) | Array contains all expected values |
Verdict.includesAny(actual, expected) | Array contains at least one expected value |
Boolean
| Method | Description |
|---|
Verdict.isTrue(value) | Value is true |
Verdict.isFalse(value) | Value is false |
Manual Verdicts
| Method | Description |
|---|
Verdict.pass(reasoning?) | Explicit pass |
Verdict.partial(confidence, reasoning?, feedback?) | Partial pass with confidence |
Verdict.fail(reasoning, feedback?) | Explicit fail |
LLM Judge Evaluators
Before writing a judge prompt, identify the specific failure mode via error analysis (output-eval-error-analysis). Design the judge following output-eval-judge-prompt. After writing it, validate against human labels using output-eval-validate-judge.
For subjective quality assessments, use judge functions with .prompt files:
import { verify, judgeVerdict, judgeScore, judgeLabel } from '@outputai/evals';
export const evaluateTopic = verify(
{ name: 'evaluate_topic', input: blogInput, output: blogOutput },
async ({ input, output, context }) =>
judgeVerdict({
prompt: 'judge_topic@v1',
variables: {
blog_title: output.title,
blog_post: output.blog_post,
required_topic: String(context.ground_truth.required_topic ?? input.topic)
}
})
);
export const evaluateQuality = verify(
{ name: 'evaluate_quality', input: blogInput, output: blogOutput },
async ({ input, output }) =>
judgeScore({
prompt: 'judge_quality@v1',
variables: { blog_title: output.title, blog_post: output.blog_post, topic: input.topic }
})
);
export const evaluateTone = verify(
{ name: 'evaluate_tone', input: blogInput, output: blogOutput },
async ({ output }) =>
judgeLabel({
prompt: 'judge_tone@v1',
variables: { blog_title: output.title, blog_post: output.blog_post }
})
);
Judge .prompt File Format
Judge prompt files live alongside evaluators in tests/evals/:
---
provider: anthropic
model: claude-haiku-4-5-20251001
temperature: 0
maxTokens: 1000
---
<system>
You are an evaluation judge. Assess whether a blog post is faithfully about the required topic.
Return a JSON object with:
- verdict: "pass" if the blog clearly focuses on the topic, "partial" if it mentions the topic but lacks depth, "fail" if it is not about the topic
- reasoning: a brief explanation of your judgment
</system>
<user>
Required topic: {{ required_topic }}
Blog title: {{ blog_title }}
Blog post:
{{ blog_post }}
Judge whether this blog post is faithfully about the required topic.
</user>
Creating Eval Workflows
The eval workflow wires evaluators together and defines how to interpret results.
import { evalWorkflow } from '@outputai/evals';
import { evaluateSum } from './evaluators.js';
export default evalWorkflow({
name: 'simple_eval',
evals: [
{
evaluator: evaluateSum,
criticality: 'required',
interpret: { type: 'boolean' }
}
]
});
Eval Definition Fields
Each entry in the evals array has:
evaluator ā the function created by verify()
criticality ā 'required' (affects pass/fail) or 'informational' (reported but doesn't block)
interpret ā how to convert the evaluator's return value into a verdict
Interpret Types
| Type | Evaluator Returns | Mapping |
|---|
{ type: 'boolean' } | Verdict.equals(), Verdict.gte(), etc. | true = pass, false = fail |
{ type: 'verdict' } | judgeVerdict() or Verdict.pass/partial/fail() | Direct pass-through |
{ type: 'number', pass: 0.7, partial: 0.4 } | judgeScore() | >=pass = pass, >=partial = partial, else fail |
{ type: 'string', pass: ['a', 'b'], partial: ['c'] } | judgeLabel() | Label in pass list = pass, in partial list = partial, else fail |
Full Example with Mixed Evaluators
export default evalWorkflow({
name: 'blog_generator_eval',
evals: [
{
evaluator: lengthOfOutput,
criticality: 'required',
interpret: { type: 'boolean' }
},
{
evaluator: evaluateTopic,
criticality: 'required',
interpret: { type: 'verdict' }
},
{
evaluator: evaluateQuality,
criticality: 'required',
interpret: { type: 'number', pass: 0.7, partial: 0.4 }
},
{
evaluator: evaluateContent,
criticality: 'informational',
interpret: { type: 'boolean' }
},
{
evaluator: evaluateTone,
criticality: 'informational',
interpret: { type: 'string', pass: ['professional', 'informative'], partial: ['casual'] }
}
]
});
Naming Convention
The eval workflow name must end in _eval and match the pattern {workflow_name}_eval. The CLI resolves this automatically ā output workflow test blog_generator looks for blog_generator_eval.
Dataset Files
For methodology on designing diverse datasets that cover failure-prone regions, see output-eval-dataset-design.
Datasets are YAML files in tests/datasets/. Each file represents one test case.
Basic Format
name: basic_input
input:
values:
- 1
- 2
- 3
- 4
- 5
last_output:
output:
result: 15
executionTimeMs: 100
date: '2026-02-13T00:00:00.000Z'
With Ground Truth
Ground truth provides expected values for evaluators. You can set global values and per-evaluator overrides:
name: stripe_blog
input:
topic: "Stripe the payment processor"
requirements: "Include a link to https://stripe.com/en-gb/pricing"
last_output:
output:
title: "Stripe: The Modern Payment Processing Platform"
blog_post: |
Stripe has revolutionized online payment processing...
executionTimeMs: 5000
date: '2026-02-16T00:00:00.000Z'
ground_truth:
notes: "Known good case"
evals:
length_of_output:
min_length: 100
evaluate_topic:
required_topic: "Stripe the payment processor"
evaluate_content:
required_content: "https://stripe.com/en-gb/pricing"
The ground_truth.evals.<evaluator_name> values are merged with the top-level ground truth and passed to the evaluator via context.ground_truth.
CLI Commands
output workflow test <workflow_name>
Runs evaluations against all datasets for a workflow.
| Flag | Description |
|---|
--cached | Use cached output from dataset files (skip workflow execution) |
--save | Run workflow fresh and save output + eval results back to dataset files |
--dataset <names> | Comma-separated list of dataset names to run (default: all) |
--format <type> | Output format: text (default) or json |
Execution flow:
- Loads all dataset YAML files from
tests/datasets/
- Without
--cached: executes the workflow for each dataset to get fresh output
- Sends all datasets to the
{workflow_name}_eval workflow
- Reports per-dataset and per-evaluator verdicts
- Exits with code 1 if any required evaluator fails
output workflow dataset list <workflow_name>
Lists all datasets for a workflow with their cached status.
| Flag | Description |
|---|
--format <type> | Output format: table (default), text, or json |
output workflow dataset generate <workflow_name> [scenario]
Generates a new dataset file by running the workflow.
| Flag | Description |
|---|
--input <json> | Workflow input as a JSON string or file path |
--name <name> | Dataset filename (defaults to scenario name) |
--trace <path> | Generate from a local trace file instead of running the workflow |
--download | Download traces from S3 and convert to datasets |
--limit <n> | Max traces to download from S3 (default: 5) |
Common Usage
output workflow dataset generate my_workflow --input '{"key": "value"}' --name my_test
output workflow dataset generate my_workflow basic
output workflow test my_workflow --cached
output workflow test my_workflow --save
output workflow test my_workflow --dataset happy_path,edge_case
output workflow dataset list my_workflow
Typical Workflow
npm run output:dev
output workflow dataset generate blog_generator --input '{"topic": "AI"}' --name ai_post
output workflow test blog_generator --save
output workflow test blog_generator --cached
output workflow dataset list blog_generator
Verification Checklist
Related Skills
output-dev-evaluator-function ā Runtime evaluators using evaluator() from @outputai/core
output-dev-scenario-file ā Creating scenario JSON files for workflow execution
output-dev-folder-structure ā Understanding project directory layout
output-dev-prompt-file ā Creating .prompt files for LLM operations
output-eval-error-analysis ā Identify failure modes before building evaluators
output-eval-judge-prompt ā Design effective LLM judge prompts
output-eval-dataset-design ā Generate diverse test datasets
output-eval-validate-judge ā Validate LLM judges against human labels
output-eval-audit ā Audit an existing eval suite for trustworthiness