---
name: agent-eval
description: Create, run, and maintain TypeScript evals with @ls-stack/agent-eval. Use when adding eval coverage for an LLM or agent workflow, updating *.eval.ts files, checking eval results, configuring agent-evals.config.ts, inspecting saved .agent-evals run artifacts, or wiring product source code with evalTracer spans.
---
# Agent Eval
Local-first eval runner for LLM and agent systems. Evals are strict TypeScript
modules named *.eval.ts, discovered from agent-evals.config.ts, and
executed through the CLI (agent-evals run) or local app (agent-evals app).
Runs persist to .agent-evals/ so results, traces, and caches survive across
processes.
This skill covers the mental model and conventions. For exhaustive field lists
(config options, eval shape, column formats, score/chart/stats options, trace
display rules), read the TypeScript declarations shipped with the package:
AgentEvalsConfig, EvalDefinition, EvalCase, EvalOutputs,
EvalColumnOverride, EvalDeriveConfig, EvalScoreDef,
EvalManualScoreDef, EvalTraceTree, TraceSpanInfo, and z are exported
from @ls-stack/agent-eval.
.d.ts files land in node_modules/@ls-stack/agent-eval/dist/.
- CLI surface:
agent-evals --help and agent-evals <command> --help.
Unknown help targets exit non-zero instead of falling back to global help.
- The CLI automatically loads
.env from the current workspace. Shell-provided
environment variables win; pass --no-env to disable .env loading once.
- Unfiltered
agent-evals run is disabled by default; use --eval or --case
for targeted CLI runs, or --tags-filter <expr> to run cases matching tags.
Set allowCliRunAll: true in
agent-evals.config.ts to opt into run-all CLI behavior.
- agent-evals run --temporary persists a run like normal history, but deletes
it before the next run starts. Temporary runs appear in show-runs while
present; normal runs are never deleted by temporary-run cleanup.
- agent-evals app watches agent-evals.config.ts and reloads config in
place when the runner is idle. If config changes during an active run, the
reload applies after the current run reaches a terminal state.
Enumerated field lists in this document may lag behind the types; treat the
types as the source of truth when they disagree.
## Where tracing lives
Tracing belongs in the product source code, not in the eval file. The eval
file wires up cases and scoring; the real evalTracer.span(...) calls sit
inside the workflow, agent, or tool functions that both production and evals
invoke.
evalTracer, evalSpan, output helpers, evalLog, evalAssert, and
evalExpect are ambient no-ops when called outside an eval case scope, so
leaving them in
production paths is safe — they only record anything when the product code runs
inside an eval's execute. Use isInEvalScope() to branch on eval-only behavior in shared code
(e.g. skip a real network side effect): it returns null outside eval-owned
work and returns 'env', 'cases', 'eval', 'derive', 'outputsSchema', or
'scorer' during runner phases. Top-level modules imported while a run is being
prepared see 'env'; code called from execute sees 'eval'. Use
getEvalCaseInput() to read the current case input, or
getEvalCaseInput('customer.tier') for nested dot-path access; outside a case
scope it returns undefined. Use nextEvalId() inside eval-scoped code when a
stable generated id is needed; it includes the eval file, eval id, case id, and
a per-case sequence number, and throws outside an eval case scope.
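A minimal sketch of eval-aware branching in shared product code (sendReceiptEmail is a placeholder for a real side effect, and the crypto.randomUUID fallback for production paths is an assumption):

```ts
import { getEvalCaseInput, isInEvalScope, nextEvalId } from '@ls-stack/agent-eval';

declare function sendReceiptEmail(result: { approved: boolean }): Promise<void>;

export async function finalizeRefund(result: { approved: boolean }) {
  const scope = isInEvalScope();
  if (scope === null) {
    // Real network side effect: only runs outside eval-owned work.
    await sendReceiptEmail(result);
  }
  // nextEvalId() throws outside an eval case scope, so fall back in production.
  const reviewId = scope === 'eval' ? nextEvalId() : crypto.randomUUID();
  // Dot-path read of the current case input; undefined outside a case scope.
  const expectedTier = getEvalCaseInput('customer.tier');
  return { reviewId, expectedTier };
}
```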
Use evalLog(level, ...args) for intentional per-case logs. The runner also
captures console.log, console.info, console.warn, and console.error
during case-owned phases by default; log arguments are stored as JSON-safe
values. Logs inside cached operations are not replayed from cache hits.
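For example (the 'warn' level name here is an assumption mirroring the captured console methods):

```ts
import { evalLog } from '@ls-stack/agent-eval';

// Stored as JSON-safe values on the current case.
evalLog('warn', 'fallback model used', { model: 'gpt-4o-mini' });
```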
Use eval tags to target related coverage without naming every case:
AgentEvalsConfig.tags applies workspace-wide tags, defineEval({ tags })
adds eval tags, case.tags adds case-only tags, and removeTags disables a
configured global tag for one eval. CLI filters support Vitest-style tag
expressions such as agent-evals run --tags-filter "refunds && !slow".
Inside eval-scoped code, use matchesEvalTags('tag') or
matchesEvalTags({ all, any, not }); it uses typed exact tag names and returns
false outside a case scope. Projects can narrow tag names with a .d.ts
module augmentation:
```ts
import '@ls-stack/agent-eval';

declare module '@ls-stack/agent-eval' {
  interface AgentEvalTagRegistry {
    tags: 'refunds' | 'media' | 'manual' | 'slow';
  }
}
```
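A sketch of how the tag layers compose; the array shapes for tags and removeTags are assumptions from the documented field names, and the timeout branch is illustrative:

```ts
import { defineEval, matchesEvalTags } from '@ls-stack/agent-eval';

defineEval({
  id: 'refund-workflow',
  tags: ['refunds'],
  removeTags: ['manual'], // drop a configured global tag for this eval only
  cases: [
    { id: 'bulk-import', input: { message: 'Refund order #123' }, tags: ['slow'] },
  ],
  execute: async () => {
    // Typed exact tag names; returns false outside a case scope.
    if (matchesEvalTags({ any: ['slow'] })) {
      // e.g. raise a per-case timeout for slow-tagged cases
    }
  },
});
```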
## Product code (instrumented once, reused everywhere)
```ts
import {
  captureEvalSpanError,
  evalAssert,
  evalExpect,
  evalSpan,
  evalTracer,
  getEvalCaseInput,
  mergeEvalOutput,
  nextEvalId,
  setEvalOutput,
} from '@ls-stack/agent-eval';

// llm, applyRefund, and RefundInput are this product's own modules/types.
export async function runRefundWorkflow(input: RefundInput) {
  return evalTracer.span(
    { kind: 'agent', name: 'refund-workflow' },
    async () => {
      evalSpan.setAttribute('input', input);
      const plan = await evalTracer.span(
        {
          kind: 'llm',
          name: 'plan-refund',
          cache: {
            namespace: 'refund-workflow.plan-refund',
            key: { prompt: input.message, model: 'gpt-4o-mini' },
          },
        },
        async () => {
          let text: string;
          let usage: { inputTokens: number; outputTokens: number };
          try {
            ({ text, usage } = await llm.complete(input.message));
          } catch (error) {
            // Recoverable failure: mark the span errored, then continue via fallback.
            captureEvalSpanError(error);
            ({ text, usage } = await llm.completeWithFallback(input.message));
          }
          evalSpan.setAttributes({
            model: 'gpt-4o-mini',
            provider: 'openai',
            usage,
          });
          const expectedLocale = getEvalCaseInput('locale');
          if (typeof expectedLocale === 'string') {
            evalSpan.setAttribute('expectedLocale', expectedLocale);
          }
          evalSpan.incrementAttribute('llmCalls', 1);
          evalSpan.appendToAttribute('models', 'gpt-4o-mini');
          return text;
        },
      );
      const result = await applyRefund(plan);
      const reviewId = nextEvalId();
      setEvalOutput('response', result.finalText);
      setEvalOutput('reviewId', reviewId);
      mergeEvalOutput('metadata', { approved: result.approved });
      evalAssert(result.approved, 'refund workflow should approve the case');
      evalExpect(result.finalText).toMatch(/refund/i);
      evalSpan.setAttribute('output', { result, reviewId });
      return result;
    },
  );
}
```
Span kind values are open-ended strings. Use familiar kinds such as
agent, tool, llm, api, retrieval, scorer, or checkpoint when they
fit, and preserve external tracer kinds such as mastra.workflow.step when they
are more specific. Only the input and output span attributes are promoted
automatically in the trace tree; use traceDisplay for other span attributes
such as model or usage. Eval-level LLM usage outputs, columns, stats, and
charts are derived from matching LLM spans by default. Prefer
llmCalls.pricing for LLM-call cost display; built-in costs ignore span
costUsd attributes.
Use captureEvalSpanError(error) for recoverable errors on the active
evalTracer.span(...), such as optional model/tool failures that fall back and
continue. You can pass one error, multiple error arguments, or an array. The
span is still marked error. Pass 'warning' or { level: 'warning' } as the
final argument for diagnostics that should not change an otherwise successful
span's status.
If a span callback throws, the SDK automatically marks that span as error,
stores the thrown error on it, and rethrows so the case errors. Use that for
terminal failures; use captureEvalSpanError(...) for recoverable failures that
continue through fallback logic.
Fire-and-forget spans started during execute are awaited before outputs,
deriveFromTracing, scores, and trace data are finalized, so void evalTracer.span(...) is safe when the span result is not needed. Register
non-span promises with startEvalBackgroundJob(promise). The runner only waits
for settlement; promise and span errors keep their normal behavior. Use
waitForBackgroundJob: false on a span, or waitForBackgroundJobs: false on an
eval definition, when background work should not delay finalization.
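A sketch under these rules (warmKnowledgeBase and flushAnalytics are placeholders for real background work):

```ts
import { evalTracer, startEvalBackgroundJob } from '@ls-stack/agent-eval';

declare function warmKnowledgeBase(): Promise<void>;
declare function flushAnalytics(): Promise<void>;

// Fire-and-forget span: awaited automatically before finalization.
// Opt out per span with waitForBackgroundJob: false.
void evalTracer.span({ kind: 'tool', name: 'warm-kb' }, () => warmKnowledgeBase());

// Non-span promise: register it so the runner waits for settlement.
startEvalBackgroundJob(flushAnalytics());
```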
Eval Date APIs use a shifted wall clock by default: new Date() and
Date.now() start at 2026-04-10T00:00:00.000Z during case generation,
execution, tracing, derived outputs, and scorers, then continue advancing with
real elapsed time. Set startTime on a specific defineEval(...) to use
another initial clock value, or set startTime: 'now' for that eval to use the
real current clock. Timers are not faked, so async waits still run normally.
Set freezeTime: true to keep Date APIs frozen until they are moved manually.
Use evalTime.startTime to read the captured wall-clock start as a Dayjs
object, and evalTime.dayjs(...) to create other Dayjs date objects. Use
evalTime.advance(amount, unit) inside an eval to move the shifted clock
forward with Dayjs add(...) units. It throws for evals with
startTime: 'now', unless freezeTime: true is also set.
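For example (assuming evalTime is exported from @ls-stack/agent-eval alongside the other helpers):

```ts
import { evalTime } from '@ls-stack/agent-eval';

// Captured wall-clock start (2026-04-10T00:00:00.000Z by default) as Dayjs.
const start = evalTime.startTime;

// Move the shifted clock forward; new Date() and Date.now() follow.
evalTime.advance(2, 'day');

// Build other Dayjs values relative to the eval clock.
const dueDate = evalTime.dayjs(start).add(30, 'day');
```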
For libraries or observability exporters that already emit span lifecycle
events, use evalTracer.startSpan(...), evalTracer.updateSpan(...),
evalTracer.endSpan(...), or evalTracer.recordSpan(...) to translate those
events into the eval trace tree without wrapping the upstream work in a
callback. Pass the upstream span id and parent id when available so saved trace
JSON and deriveFromTracing use the recorded hierarchy.
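A heavily hedged sketch of that translation; the option names passed to startSpan/endSpan here (id, parentId, attributes) are assumptions, so verify them against the TypeScript declarations:

```ts
import { evalTracer } from '@ls-stack/agent-eval';

// Translate an upstream exporter's lifecycle events into the eval trace tree.
function onUpstreamStart(e: { spanId: string; parentSpanId?: string; name: string }) {
  evalTracer.startSpan({ id: e.spanId, parentId: e.parentSpanId, kind: 'llm', name: e.name });
}

function onUpstreamEnd(e: { spanId: string; output?: unknown }) {
  evalTracer.endSpan({ id: e.spanId, attributes: { output: e.output } });
}
```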
## Eval file (thin)
```ts
import { defineEval, z } from '@ls-stack/agent-eval';
import { runRefundWorkflow } from '../src/workflows/refundWorkflow.ts';
// RefundInput is assumed to be exported next to the workflow.
import type { RefundInput } from '../src/workflows/refundWorkflow.ts';

const outputsSchema = z.object({
  response: z.string(),
  costUsd: z.number().optional(),
  toolCalls: z.number(),
  llmTurns: z.number(),
});

type RefundOutputs = z.infer<typeof outputsSchema>;

defineEval<RefundInput, RefundOutputs>({
  id: 'refund-workflow',
  cases: [
    { id: 'simple-text', input: { message: 'I want a refund for order #123' } },
  ],
  outputsSchema,
  execute: async ({ input }) => {
    await runRefundWorkflow(input);
  },
  deriveFromTracing: ({ trace }) => ({
    toolCalls: trace.findSpansByKind('tool').length,
  }),
  scores: {
    mentionsRefund: {
      passThreshold: 1,
      compute: ({ outputs }) => (/refund/i.test(outputs.response) ? 1 : 0),
    },
  },
});
```
execute usually just calls the product code. Push any placeholder
evalTracer.span(...) wrappers out of the eval and into the product module
they describe so production runs get the same trajectory. Only keep tracing
inside execute when the behavior being measured is eval-specific (e.g. a
judge-only sub-step with no production analogue).
Case id values anchor historical runs, caches, and manual scores — keep them
stable. See EvalDefinition / EvalCase in the types for every supported
field.
## Manual input
Use manualInput instead of cases when each run should pause for the user
to type values:
```ts
import { defineEval, z } from '@ls-stack/agent-eval';

const inputSchema = z.object({
  name: z.string().min(1),
  tone: z.enum(['friendly', 'formal']),
  notes: z.string().max(500).optional(),
  sendEmail: z.boolean().default(false),
});

defineEval<z.infer<typeof inputSchema>>({
  id: 'manual-input-greeting',
  manualInput: {
    schema: inputSchema,
    title: 'Greet someone',
    submitLabel: 'Greet',
    fields: { notes: { multiline: true, rows: 4 } },
  },
  execute: ({ input, setOutput }) => {
    setOutput('greeting', `Hi, ${input.name}!`);
  },
});
```
manualInput configures the local app form descriptor derived from the schema
(z.string -> text, z.enum -> select, z.boolean -> checkbox, etc.; nested
shapes fall back to JSON input). The CLI accepts --input '<json>' for a
single targeted eval or --input-file <path> mapping eval keys/ids to inputs.
Each run produces one synthetic case <evalId>-manual with the validated
submission; mixing manualInput with cases is rejected at discovery time.
For file or image fields, set { asFile: true, accept?, maxSizeBytes? } and
type the field with manualInputFileValueSchema. The runtime value carries
{ name, mimeType, sizeBytes, sha256, path }, where path is a
workspace-relative run artifact. Use readManualInputFile(value) when bytes,
Blob, File, text, or parsed JSON are needed. In CLI runs, provide path
objects such as
{ "image": { "path": "./screenshot.png" } }; the CLI stages the file before
starting the run.
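A sketch of a file-backed field, assuming manualInputFileValueSchema and readManualInputFile are exported from the package; the exact return shape of readManualInputFile lives in the types:

```ts
import {
  defineEval,
  manualInputFileValueSchema,
  readManualInputFile,
  z,
} from '@ls-stack/agent-eval';

const inputSchema = z.object({
  // Runtime value carries { name, mimeType, sizeBytes, sha256, path }.
  image: manualInputFileValueSchema,
});

defineEval<z.infer<typeof inputSchema>>({
  id: 'manual-input-screenshot',
  manualInput: {
    schema: inputSchema,
    fields: { image: { asFile: true, accept: 'image/*', maxSizeBytes: 5_000_000 } },
  },
  execute: async ({ input, setOutput }) => {
    // Access staged bytes/Blob/text when needed; treat this call as a sketch.
    await readManualInputFile(input.image);
    setOutput('imageName', input.image.name);
  },
});
```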
## Scoring
Every score returns a normalized 0..1 value. Pass/fail is per-score: a case
fails if any score with a passThreshold falls below that threshold, if an
assertion fails, or if the case errors. Scores without passThreshold are
informational.
Score functions run in their own trace scope, separate from the execution
trace, so LLM-as-judge scorers can use evalTracer.span(...) and cached spans
without polluting the agent trajectory. Outputs set inside a scorer stay
private to that score. Spanless evalTracer.cache(...) calls made directly
inside a scorer are stored on that score trace's cacheRefs payload.
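A sketch of a cached LLM-as-judge scorer under these rules; judgeLlm and the inline execute are placeholders:

```ts
import { defineEval, evalTracer } from '@ls-stack/agent-eval';

declare const judgeLlm: { score: (prompt: string) => Promise<number> };

defineEval<{ message: string }>({
  id: 'judged-replies',
  cases: [{ id: 'basic', input: { message: 'Where is my refund?' } }],
  execute: async ({ input, setOutput }) => {
    setOutput('response', `We are looking into: ${input.message}`);
  },
  scores: {
    politeness: {
      passThreshold: 0.7,
      // Runs in its own trace scope, so the cached judge span never
      // pollutes the agent trajectory.
      compute: ({ outputs }) =>
        evalTracer.span(
          {
            kind: 'llm',
            name: 'judge-politeness',
            cache: { namespace: 'judges.politeness', key: { text: outputs.response } },
          },
          () => judgeLlm.score(`Rate politeness from 0 to 1:\n${outputs.response}`),
        ),
    },
  },
});
```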
manualScores declares score columns that reviewers fill in after a run.
Pending values keep the eval in an unscored state instead of failing.
See EvalScoreDef / EvalManualScoreDef in the types for the full shape
(format, threshold, column overrides).
## Outputs, columns, trace display
setEvalOutput(key, value) writes reviewable data for the case. Values are
stored as received: primitives, objects/arrays, explicit file refs, and
native Blob/File values. columns.format only controls visualization.
Non-JSON runtime values such as Date, Map, Set, BigInt, typed arrays,
and class instances use the tagged value serializer instead of a string
fallback. Native Blob/File values are copied to run artifacts because
saved run files are JSON. Inside execute, prefer the context
setOutput(key, value) helper when writing schema-backed outputs; it is
typed from the eval's outputs generic. Keep setEvalOutput for shared
workflow code that does not receive the execute context.
- Use
incrementEvalOutput(key, delta) for numeric totals,
appendToEvalOutput(key, value) for arrays that preserve existing scalar
values, and mergeEvalOutput(key, patch) for shallow object updates.
evalSpan has matching incrementAttribute, appendToAttribute, and
mergeAttribute helpers for span attributes.
outputsSchema validates final outputs after execute and
deriveFromTracing, before computed scores. For Zod object schemas, only
declared keys are passed to the schema; parsed fields merge back into the raw
output map, so defaults/transforms apply to configured fields and
unconfigured outputs stay visible as before. Validation failures fail the case
and skip computed scores. When you pass a narrowed outputs type as the second
defineEval generic, outputsSchema is required.
columns overrides the display for output and score keys (label, format,
alignment, visibility). The set of supported formats is declared by the
ColumnFormat union and EvalColumnOverride in the types. Global
columns in agent-evals.config.ts apply to every eval; eval-level
columns override matching global keys. Use hideIfNoValue: true to hide a
column when every row is missing the value, null, or an empty string; 0
and false still count as values. Use format: 'image', 'html', 'pdf',
'audio', 'video', or 'file' for Blob/File outputs or repoFile(...)
references that should render as reviewable artifacts. Persisted Blob/File
artifacts include byte sizes in their run artifact refs; pass the optional
repoFile(..., ..., sizeBytes) hint when a repository file card should show
a size.
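A sketch of global column overrides; the output key names are illustrative, and the full format list lives in the ColumnFormat union:

```ts
import type { AgentEvalsConfig } from '@ls-stack/agent-eval';

const columns: AgentEvalsConfig['columns'] = {
  response: { label: 'Final reply' },
  screenshot: { format: 'image', hideIfNoValue: true },
  report: { format: 'pdf' },
};
```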
deriveFromTracing can be authored globally in agent-evals.config.ts or
locally on one eval. Prefer the keyed map form for shared metrics:
deriveFromTracing: { toolCalls: ({ trace }) => trace.findSpansByKind('tool').length }.
The older object-returning function form remains supported. Global
derivations run first; runtime outputs are never overwritten, and eval-level
derivations only fill keys still missing after global derivations. In keyed
form, return undefined to omit one output for that case.
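For example, a shared keyed-map derivation (firstToolName is illustrative and assumes TraceSpanInfo exposes a name):

```ts
import type { EvalTraceTree } from '@ls-stack/agent-eval';

const deriveFromTracing = {
  toolCalls: ({ trace }: { trace: EvalTraceTree }) =>
    trace.findSpansByKind('tool').length,
  // Returning undefined omits this output for the case.
  firstToolName: ({ trace }: { trace: EvalTraceTree }) =>
    trace.findSpansByKind('tool')[0]?.name,
};
```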
traceDisplay promotes selected span attributes into the trace tree and
detail pane; it supports aggregation across subtrees (scope, mode) and
user-defined transform(...) for derived views (e.g. currency conversion).
See the TraceDisplayInputConfig type.
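An illustrative sketch only; the attribute-path-keyed shape and the option values here are assumptions, so check TraceDisplayInputConfig before relying on them:

```ts
const traceDisplay = {
  model: {}, // promote the attribute as-is
  'usage.outputTokens': { scope: 'subtree', mode: 'sum' }, // aggregate a subtree
  costUsd: {
    // Derived view, e.g. a rough currency conversion for display.
    transform: (value: unknown) => (typeof value === 'number' ? value * 0.92 : value),
  },
};
```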
llmCalls (in agent-evals.config.ts) configures how LLM-call spans are
summarized for review. Defaults to kind: 'llm' spans with model,
usage.*, latencyMs, input, output, etc. read from conventional
attribute paths. latencyMs is time to first token; duration, total tokens,
output tokens/sec, and USD costs are derived. Common overrides:
- kinds broadens the span filter.
- attributes.<field> maps non-default primitive span shapes.
- pricing (model-keyed) derives USD costs from token counts, with nested
providers entries for provider-specific rates.
- costCurrencies shows converted cost columns in the expanded breakdown table
only.
- derivedAttributes persists computed values back onto matching LLM spans
before trace consumers run. It can be a keyed map for one-off fields or one
callback that returns multiple path/value pairs; derived keys are dot-paths
under span.attributes, and returning undefined skips one span or one returned
key.
- metrics surfaces arbitrary user metrics (format: 'string' | 'number' |
'duration' | 'json' | 'boolean', placements: ['header' | 'body']).
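A sketch of an llmCalls block in agent-evals.config.ts; the per-million pricing field names are assumptions patterned after the documented cacheCreationInput1hUsdPerMillion key, so confirm them in the types:

```ts
const llmCalls = {
  // Broaden the span filter beyond the default kind: 'llm'.
  kinds: ['llm', 'mastra.llm'],
  pricing: {
    'gpt-4o-mini': {
      inputUsdPerMillion: 0.15, // assumed field name
      outputUsdPerMillion: 0.6, // assumed field name
    },
  },
  metrics: {
    retries: { format: 'number', placements: ['body'] },
  },
};
```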
- Default usage config derives missing eval outputs from matching LLM/API spans
before
outputsSchema and scores run: apiCalls, costUsd, llmTurns,
inputTokens, outputTokens, totalTokens, cachedInputTokens,
cacheCreationInputTokens, reasoningTokens, and llmDurationMs. Authored
outputs and column overrides win. Default usage columns, stats, and charts
use hideIfNoValue: true. Default LLM usage charts configure cost, input
tokens, and output tokens separately and use dedupeConsecutiveValues: true
to skip repeated adjacent chart values. totalTokens is input + output only;
cache read/write tokens stay separate and affect costUsd at their own
rates.
Derived base input cost uses inputTokens - cachedInputTokens - cacheCreationInputTokens so cache details are not double-counted.
cacheCreationInputTokens is the total cache-write count; optional
cacheCreationInput1hTokens only splits that total for 1-hour write pricing
via cacheCreationInput1hUsdPerMillion. llmDurationMs sums elapsed matched
LLM span durations; it is not time-to-first-token latency.
Remove defaults globally or per eval with removeDefaultConfig: true or a
key list such as
removeDefaultConfig: ['apiCalls', 'reasoningTokens'].
apiCalls (in agent-evals.config.ts) configures how API-call spans are
summarized for review. Defaults to kind: 'api', 'http', 'http.client',
and 'fetch' spans with method, url, statusCode, request,
response, requestBody, responseBody, headers, durationMs, and
error read from conventional attribute paths. Override kinds or
attributes.<field> for external tracers, add derivedAttributes as a
keyed map or object-returning callback for computed persisted API span
attributes, and add metrics with the same formats and placements as
LLM-call metrics.
runLogs (in agent-evals.config.ts) controls case log capture. Use
runLogs: { captureConsole: false } to keep console output in the terminal
without persisting console calls to case details. Manual evalLog(...) calls
are still captured. Captured log locations store the selected user-facing
source frame and the full JavaScript stack so agents can inspect additional
frames in persisted artifacts when diagnosing where a log came from.
Stats rows and history charts can be authored via stats / charts on the eval
definition. Global stats in agent-evals.config.ts combine with eval-level
stats. Native stat kinds include cases, passRate, duration, and
cacheHits; cacheHits shows Agent Eval operation-level cache hits over total
cache operations (hits/total) from spans and evalTracer.cache(...) refs, not
LLM provider prompt-cache read tokens such as cachedInputTokens. Usage stats
and LLM usage charts are added by default unless removed with
removeDefaultConfig. Column stats can override format and numberFormat,
otherwise they inherit from the matching column. Number formats use
maxDecimalPlaces to cap decimals and minDecimalPlaces to pad trailing
zeroes. Without maxDecimalPlaces, the default cap is 3 decimal places. Stats
and charts support hideIfNoValue: true. Charts support
dedupeConsecutiveValues: true to omit consecutive points whose plotted metrics
and tooltip extras match the previous kept point.
Their shapes live in the types; no need to memorize the option set.
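A rough sketch of authored stats and charts; the array shape and the column key are assumptions, and the exact shapes live in the types:

```ts
const stats = [
  { kind: 'passRate' },
  { kind: 'cacheHits' },
  // Column stats can override the inherited number formatting.
  { column: 'costUsd', numberFormat: { maxDecimalPlaces: 4 }, hideIfNoValue: true },
];

const charts = [
  { column: 'costUsd', dedupeConsecutiveValues: true, hideIfNoValue: true },
];
```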
## Cached operations
Wrap a costly pure span in cache: { namespace, key } so later runs replay its
recorded effects without re-executing:
```ts
await evalTracer.span(
  {
    kind: 'llm',
    name: 'plan-refund',
    cache: {
      namespace: 'refund-workflow.plan-refund',
      key: { prompt: input.message, model: 'gpt-4o-mini' },
    },
  },
  async () => {
    const result = await llm.complete(input.message);
    evalSpan.setAttributes({
      model: 'gpt-4o-mini',
      provider: 'openai',
      usage: result.usage,
      output: result,
    });
    return result;
  },
);
```
Use evalTracer.cache(...) for pure values that should not create their own
trace span:
```ts
const context = await evalTracer.cache(
  { name: 'receipt-audit-context', key: { orderId: input.orderId } },
  async () => {
    const result = await loadReceiptContext(input);
    evalSpan.setAttribute('receiptContext', result);
    evalSpan.mergeAttribute('receiptSummary', { orderId: input.orderId });
    return result;
  },
);
```
Mental model:
- Only SDK-mediated effects replay on a hit: sub-spans, checkpoints,
output helper calls, span attributes. External side
effects (HTTP, DB writes, file I/O) do not replay — cache only pure
functions of the key.
- evalTracer.cache(...) does not create a span. When it runs inside an active
span, that span gets a cache.refs entry with the value cache name, key,
namespace, and hit/miss status. When called directly from the case body
(no surrounding span), the ref is recorded on the case detail's cacheRefs
array. When called directly from a scorer, the ref is recorded on that
scoring trace's cacheRefs array.
- Cache identity is the namespace plus the authored key. Source-file
fingerprints are tracked for run freshness separately, but do not participate
in cache-key hashing.
- Cached spans require an explicit
cache.namespace. Value caches can also set
an explicit namespace; prefer doing that when the cache is part of a
documented workflow. Matching namespaces share entries across operations/evals
that use the same authored key.
- Per eval,
cache: { read?: boolean; store?: boolean } controls whether
authored cached operations may read or persist entries. Both default to
true. Use read: false to always execute instead of replaying hits, and
store: false to allow reads while preventing misses/refreshes from writing
cache or raw-key debug files. Run-level bypass/refresh controls still take
precedence.
- Authored eval ids are unique within one eval file. The exact eval identity is
the workspace-relative file path plus eval id, so the same id can be reused in
different files. Case ids must be unique within one eval; duplicate case ids
are reported as run errors.
- Cache keys should be deterministic primitives, arrays, and plain objects.
Buffer, ArrayBuffer, and typed arrays hash by bytes. Native Blob/File
keys use stable metadata by default (type, size, plus
name/lastModified for File) and do not read file bytes. Add
serializeFileBytes: true to a cached span or evalTracer.cache(...) call
when byte-level cache invalidation is required.
- Cache entries are stored as one Brotli-compressed JSON file per key under
.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br; each namespace
is capped at 100 entries by default. Configure cache.maxEntriesPerNamespace
for the default cap and cache.maxEntriesByNamespace for exact
namespace-specific caps (see the sketch after this list).
- Nested cached JSON values at or above roughly 10K JSON characters are stored
as content-addressed Brotli blobs under
.agent-evals/cache-blobs/ and
referenced from cache JSON by sha256. Identical large payloads share the same
blob.
- Authored raw cache keys are stored for debugging under
.agent-evals/cache-debug/<sanitizedNamespace>/<keyHash>.json. This folder
may include prompts, user inputs, full serialized cache payloads, or other
sensitive data, should be gitignored, and is not needed for cache reuse.
- Cached payloads use JSON-safe tagged serialization, so return values and
recorded SDK effects preserve richer built-ins such as
Date, Map, Set,
typed arrays, URL, Headers, Blob, and File on hits. Undefined values
are omitted by default instead of being written to cache files; direct
serializer callers can pass
{ preserveUndefined: true } when explicit undefined wrappers are needed.
Cache keys still use the deterministic key-hashing rules above.
- Cache mode per run is controlled by CLI flags (see
agent-evals run --help).
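A sketch of the cache caps referenced above; whether they sit under a top-level cache key in agent-evals.config.ts is an assumption, so verify via AgentEvalsConfig:

```ts
const cache = {
  maxEntriesPerNamespace: 200, // default cap is 100 entries per namespace
  maxEntriesByNamespace: { 'refund-workflow.plan-refund': 50 },
};
```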
## Artifacts
Run output lives under .agent-evals/runs/<run-id>/. Cache metadata lives under
.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br. Do not rely on a
specific cache filename when authoring evals; configure cache namespaces
manually in eval code, then use agent-evals cache list to inspect the
persisted namespace/key entries. Files in a run directory include run metadata,
a run summary, per-case results, and per-case trace JSON. Inspect run files when
debugging persisted output, costs, columns, traces, or failures; inspect cache
entries when debugging replayed span/value-cache results.
Targeted evals in run.json are recorded by exact evalKeys
(filePath + evalId) rather than authored eval ids, so duplicate eval ids stay
unambiguous in saved history.
Temporary runs use the same directory layout, but are removed before the next
run of any kind starts.
Use agent-evals show-runs when you need stable file
paths before reading saved output:
```sh
agent-evals show-runs
agent-evals show-runs latest --json
jq . .agent-evals/runs/<run-id>/summary.json
jq -s . .agent-evals/runs/<run-id>/cases.jsonl
jq . .agent-evals/runs/<run-id>/case-details/<case-id>.json
jq . .agent-evals/runs/<run-id>/traces/<case-id>.json
```
Run ids can be full timestamp ids, short ids such as r0 from
agent-evals show-runs, or latest. show-runs is only an artifact index;
the files themselves remain the source of truth for detailed results and
traces.
## Module mocking
For true module replacement inside an eval, register mock.module(...) from
node:test before dynamically importing the module graph. Agent Evals enables
Node's --experimental-test-module-mocks flag automatically for CLI and app
runs. Use dynamic
import(...) inside execute — static imports happen too early.
```ts
import { mock } from 'node:test';
import { defineEval } from '@ls-stack/agent-eval';

defineEval({
  id: 'module-mock-demo',
  cases: [{ id: 'mocked-dependency', input: { customerId: 'vip-100' } }],
  execute: async ({ input, setOutput }) => {
    mock.module('../src/customerLookup.ts', {
      namedExports: { lookupCustomer: async () => ({ segment: 'vip' }) },
    });
    // Dynamic import so the mock applies before the module graph loads.
    const { runWorkflow } = await import('../src/workflow.ts');
    const result = await runWorkflow(input);
    setOutput('segment', result.segment);
  },
});
```
## Workflow checklist
When adding or changing evals:
- Put the tracing + ambient SDK calls in the product code that runs in both
production and evals. Keep eval files thin.
- Use realistic cases drawn from real product flows; avoid placeholder inputs.
- evalAssert for hard invariants and truthy type narrowing, evalExpect
for non-trivial comparisons, scores for graded signals, passThreshold
only on scores that should gate pass/fail.
- Surface reviewable values through execute-context
setOutput or ambient
setEvalOutput in shared workflow code, and shape them with columns
formats from the ColumnFormat type.
- Promote high-signal span attributes with
traceDisplay.
- Cache costly pure spans with
cache: { namespace, key } and pure spanless
values with evalTracer.cache(...); never cache operations whose external
side effects you depend on.
- Sanity-check after changes:
agent-evals list, then
agent-evals run --eval <id>; use --file <path|glob> to target one file
when multiple files use the same eval id.
- Locate saved artifacts with
agent-evals show-runs latest --json, then read
the relevant summary.json, cases.jsonl, case-details/<case-id>.json,
or traces/<case-id>.json file directly.