Run any Skill in Manus with one click

$pwd:

pxi-eval-dataset

Name: Pxi Eval Dataset
Author: Arize-ai

// Generate synthetic evaluation datasets for the PXI eval harness (evals/pxi/). Use whenever the user asks to create, author, draft, expand, or audit an eval dataset for a PXI tool, skill, or behavior — including phrases like "write evals for <tool>", "test PXI behavior", "synthetic dataset for PXI", "cover this tool with eval examples", or "find gaps in our PXI eval coverage". Inspects whichever evaluators currently live under evals/pxi/evaluators/ at use time and pauses to recommend a new evaluator if the behavior under test can't be scored by what already exists.

Run Skill in Manus

$ git log --oneline --stat

stars:9,927

forks:905

updated:May 22, 2026 at 17:12

File Explorer

2 files

SKILL.md

readonly

related-skills.json

same repository

phoenix-pxi.md

from "Arize-ai/phoenix"

Development guide for the Phoenix PXI agent. Use when modifying PXI-specific frontend or backend behavior, extending PXI tool wiring, updating PXI runtime capabilities, or changing the PXI agent request/dispatch flow. Start here for PXI-specific workflows, then read the relevant resource file for the layer you are changing.

2026-05-309.9k

phoenix-frontend.md

from "Arize-ai/phoenix"

Frontend development guidelines for the Phoenix AI observability platform. Use when writing, reviewing, or modifying React components, TypeScript code, styles, or UI features in the app/ directory. Triggers on any frontend task — new components, UI changes, styling, accessibility fixes, form handling, or component refactoring. Also use when the user asks about frontend conventions or component patterns for this project. For design system rules (error display, layout, dialogs, tokens), use the phoenix-design skill.

2026-05-299.9k

phoenix-design.md

from "Arize-ai/phoenix"

Design system conventions for the Phoenix frontend — layout, dialogs, error display, BEM CSS class naming, and CSS design tokens. Use when building UI, naming CSS classes, creating or consuming tokens, handling errors, or designing dialog interactions in app/src/.

2026-05-299.9k

debug-trace.md

from "Arize-ai/phoenix"

Diagnose failure modes by systematically investigating traces. Trigger when the user explicitly asks for cross-trace diagnosis: "what's going wrong?", "were there errors?", "debug this", "where is my agent struggling?". Do NOT trigger on: (1) advice questions ("what should I do?"), (2) statistical questions ("what's the average latency?"), (3) summarize requests, (4) trace filtering ("show me traces with errors"), (5) vague questions ("is there a problem?"), (6) unrelated requests.

2026-05-289.9k

playground.md

from "Arize-ai/phoenix"

Author, edit, or iterate on prompts in the Phoenix prompt playground. Load before any playground tool call, including single-shot prompt rewrites.

2026-05-289.9k

phoenix-cli.md

from "Arize-ai/phoenix"

Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, structure trace review with open coding and axial coding, inspect datasets, review experiments, query annotation configs, and use the GraphQL API. Use whenever the user is analyzing traces or spans, investigating LLM/agent failures, deciding what to do after instrumenting an app, building failure taxonomies, choosing what evals to write, or asking "what's going wrong", "what kinds of mistakes", or "where do I focus" — even without naming a technique.

2026-05-279.9k

package.json

"author": "Arize-ai"

"repository": "Arize-ai/phoenix"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations15-1253L4

name	pxi-eval-dataset
description	Generate synthetic evaluation datasets for the PXI eval harness (evals/pxi/). Use whenever the user asks to create, author, draft, expand, or audit an eval dataset for a PXI tool, skill, or behavior — including phrases like "write evals for <tool>", "test PXI behavior", "synthetic dataset for PXI", "cover this tool with eval examples", or "find gaps in our PXI eval coverage". Inspects whichever evaluators currently live under evals/pxi/evaluators/ at use time and pauses to recommend a new evaluator if the behavior under test can't be scored by what already exists.
license	Apache-2.0
metadata	{"author":"ehutton@arize.com","version":"0.1.0","internal":true}

pxi-eval-dataset

Produce a small, well-targeted YAML dataset that drops into evals/pxi/datasets/<name>.yaml, runs through evals/pxi/harness/run_experiment.py, and is scored by deterministic code evaluators under evals/pxi/evaluators/.

The aim is a minimal but representative set of synthetic examples — think unit tests, not a benchmark. 10–50 examples, each covering a distinct dimension. Add more only when a new example tests something no existing example does.

Every example must be scorable by deterministic / heuristic / code logic. Every example must include a non-empty splits: list. Use splits: [regression] for small committed regression suites unless the user explicitly asks for a dev or val dataset; keep regression, dev, and val disjoint.

Workflow

1. Identify the target

Confirm with the user what's under test: a specific PXI tool name (e.g. set_time_range), a skill, or a higher-level behavior. For tools:

Read the tool definition at src/phoenix/server/agents/toolsets/external/tools/<name>.py — the ToolDefinition's description and parameters_json_schema are what the LLM actually sees.
Search the phoenix src for any server-side implementation function that backs the tool. (External tools execute browser-side and may have none — that's fine, the docstring + schema are the spec.)
Read src/phoenix/server/agents/toolsets/__init__.py::build_toolset and src/phoenix/server/agents/toolsets/external/__init__.py::build_external_toolset to learn the availability conditions — which ChatContext must be present for the tool to be exposed (e.g. set_spans_filter only exists when deps.contexts.project is set).

2. Survey the evaluators that currently exist

List evals/pxi/evaluators/ and read each module. For every evaluator, note:

the evaluator name and @create_evaluator(...) decorator,
what fields it consumes from expected (e.g. expected.tools.required, expected.tool_call_args[<tool>]),
its score / label semantics and any helpful failure metadata,
the class of assertion it supports (tool selection, tool arguments, assistant text, multi-call sequencing, ...).

Also peek at evals/pxi/tests/test_evaluators.py for canonical input shapes.

Do NOT hard-code knowledge of which evaluators exist — re-read the directory every time. The set will grow.

3. Match the target to the evaluators — pause if a gap exists

Decide whether the behavior under test can be scored by what's there today.

Yes → continue. The chosen evaluator names go in the dataset's evaluators: field (required, top-level). The runner uses ONLY the listed evaluators; unrelated ones are not invoked, so the dashboard stays free of vacuous passes from evaluators that don't apply to this dataset. Available names live in evals/pxi/evaluators/__init__.py (EVALUATORS_BY_NAME).
No → stop and summarize the gap to the user. Propose the shape of a new evaluator:
- name (snake_case),
- the expected.<field> it would read,
- score / label semantics,
- whether it's tool-call-shaped, text-shaped, structural, etc.
Then ask: "Should I add this evaluator before we generate examples?" If yes:
- implement it under evals/pxi/evaluators/<file>.py,
- add unit-test coverage to evals/pxi/tests/test_evaluators.py,
- export it from evals/pxi/evaluators/__init__.py,
- then continue.
If no, scope the dataset down to assertions the existing evaluators can score, and tell the user what coverage that costs.

4. Enumerate coverage dimensions

Walk this checklist for the target. For each row, write down which queries you'll add to cover it. Skip rows that don't apply (e.g. booleans on a tool with no boolean field).

Parameter coverage — every required field, every optional field, every meaningful field combination.
Value coverage — every enum literal; for strings: empty, whitespace, special characters, very long; for numbers: zero, negative, boundary; for booleans: both polarities.
Combination coverage — fields that interact (e.g. a filter condition paired with a scope toggle, two args that must agree).
Negative coverage — queries where the tool should NOT be called.
Ambiguity coverage — borderline queries that test correctness under uncertainty.

Polarity and difficulty targets are stated once in step 5 below; aim for those across the whole dataset, not within each row.

If you cannot fill at least 10 rows, the target is probably too narrow — confirm scope with the user.

5. Draft queries

For each coverage dimension, write one or more queries that feel like they came from a real Phoenix user. Mix across:

Voice: imperative ("show me LLM spans"), declarative ("I want to see only errors"), question ("what spans took over 5s?"), fragment ("LLM only").
Polish: clean prose, terse fragments, casual typos, abbreviations, incomplete sentences.
Personas: new user setting up Phoenix on a new project; engineer debugging an agent or RAG pipeline; PM exploring trace quality; AI engineer writing evals; annotator marking traces; researcher exploring trends. Sample across them — don't sound like one author.
Difficulty: ~30% obvious, ~50% moderate, ~20% ambiguous / tricky.
Polarity: at least 30% negative (tool should NOT be called).

Anti-patterns to avoid:

Don't paraphrase the tool's docstring.
Don't use technical jargon a real user wouldn't reach for.
Don't write 20 minor variations of the same intent.

6. Annotate expected outputs — subprocess annotation

Ground truth must be generated in a fresh subprocess, not inline. When annotation happens in the same context that drafted the queries, the agent builds a prior from its own examples: it anchors on condition patterns it established early, collapses ambiguous cases into overconfident annotations, and fills in tool_call_args based on what earlier examples "look like" rather than what the tool spec actually says. This is context rot — annotation signal weakens as the dataset grows.

The remedy: annotate each example in a separate subprocess whose context contains the full PXI toolset, the author's design intent for the query, and the schema contract. Each annotation is an independent reasoning trace from the toolset spec.

Prepare once, reuse across examples. Before launching annotation subprocesses, extract the full toolset spec once (see section 1 below) and cache it. Every annotation subprocess across the dataset reuses the same toolset block; only the query and design-intent block change per-example.

Subprocess context (four sections, in this order)

Full toolset spec — the complete toolset the PXI agent sees at runtime, NOT just the tool under test. For each tool, include the name, description, and parameters_json_schema. The annotator has to decide whether the right answer is the focal tool, a different tool, or no tool at all — and cannot do that without seeing every tool's spec. This is the single biggest hedge against annotator hallucination: with only one tool visible, the annotator will force-fit it to ambiguous or negative queries.

Source the toolset from build_toolset and build_external_toolset in src/phoenix/server/agents/toolsets/ (and src/phoenix/server/agents/toolsets/external/). The eval harness configures ChatContext such that all external tools are available — see evals/pxi/harness/agent_task.py. List every tool that survives that filtering, not just the focal one.

Mark the focal tool explicitly at the top of this section (★ FOCAL TOOL: <name> — this dataset evaluates this tool's behavior). The annotator's job is still to reason from the spec for whichever tool best fits the query, but the highlight tells them which tool's evaluator family will judge the answer.
Evaluator contract — provide the dataset-specific expected-block schema derived from the selected evaluators (what each expected field means, matching semantics, and which fields may be omitted or expressed as variants). Add the explicit instruction: if you are unsure about a value, omit the key rather than guessing.
Query and design intent — the input.query string, plus the dataset author's framing if assigned:
- difficulty: obvious | moderate | tricky
- polarity: positive | negative
- category: <coverage dimension> (the taxonomy from step 4 — parameter, value, combination, negative, ambiguity, plus any finer-grained category like span_kind, status, latency, negative_chitchat, negative_wrong_tool, etc.)
- Any notes: the author wrote
Frame this for the annotator as the dataset author's intent, NOT ground truth. The annotator should reason from the toolset spec and may disagree with the author — that disagreement is one of the most valuable signals this protocol produces, because it surfaces mis-classified queries.
Task instruction (goal-oriented, adapt only the schema reference):
"Your goal: determine the correct expected: block for this dataset example — the expected output needed to validate whether the PXI agent took the right actions in response to the user's query, given the full toolset spec, evaluation goal, evaluator set, and expected-block schema.

Constraints:
- Follow the expected-block schema provided for this dataset. The schema is determined by the evaluation goal and selected evaluators; do not assume every dataset validates only tool selection or tool arguments.
- When the schema includes tool expectations, consider every tool in the toolset spec, not just the focal one. The right answer may be the focal tool, a different tool, or no tool at all.
- Use only values directly implied by the relevant spec and the query. Omit any key whose value the spec doesn't determine — don't invent a plausible default.
- The author's design intent (difficulty, polarity, category, notes) is a hint, not ground truth. If your reading of the spec contradicts the author's labeled polarity, output what the spec implies and record the disagreement in metadata.notes.
- Return exactly one best expected: block. If multiple outputs seem valid, choose the single strongest answer and mention the alternatives in your brief reasoning; the orchestrator decides later whether variants belong in the final example.
Output format (exactly two parts, in this order):
1. Brief reasoning — one or two sentences explaining which expected output you concluded should be used and why. Just enough that an orchestrator comparing three independent annotations can see where you agree or diverge with the others.
2. The expected: block as YAML matching the schema reference provided in this prompt. No other prose."
This frames the task as a goal with constraints, not a procedural recipe. Cross-model fan-out is only useful if each annotator reasons independently — prescribing the exact reasoning steps collapses that diversity. The brief-reasoning requirement gives the orchestrator enough signal to adjudicate disagreements without dictating how each annotator gets there.

Expected-block schema and annotation rules

Before launching annotation subprocesses, define the expected-block schema for this dataset from the evaluation goal and selected evaluators. Include that schema verbatim in every annotation prompt. For example, a dataset that validates tool selection and tool arguments may use expected.tools and expected.tool_call_args, while another dataset may validate response text, refusal behavior, or some other evaluator-specific field.

When the schema includes tool expectations, use these conventions:

Use tools.required for the must-be-called tool(s).
Use tools.exact_match: true for strict-no-extras cases.
For a pure-negative case where no tool should be called, set tools.required: [] and tools.exact_match: true. Do not enumerate every tool in tools.forbidden; the tool list can grow and make that brittle. Do not supply tool_call_args.
Use tool_call_args[<tool>] only when argument correctness matters; leave it off for "any args are fine" cases — especially ambiguous cases where you want to assert the tool fires but not pin its args.

How many opinions, and which models

Collect three independent opinions for every dataset example: Sonnet, Opus, and Codex. Even obvious and pure-negative examples get the full fan-out so the dataset has a consistent annotation provenance and easy-to-audit agreement metadata.

Cross-model diversity — different families, different training, and (for Codex) different agent runtime — catches single-model biases that seed-level diversity misses.

How each opinion is invoked. The Claude Code Agent tool is Anthropic-only, so the OpenAI-side opinion comes from Codex via a Bash wrapper. Using Codex (rather than a one-shot Chat Completions call) gives the OpenAI annotator the same kind of environment access a Claude subagent has — sandbox-constrained file reads, repo navigation, the ability to double-check its reading of the toolset against the actual source. This keeps the three opinions genuinely symmetric.

Opinion	Invocation
Sonnet	`Agent` tool, `model: "sonnet"`
Opus	`Agent` tool, `model: "opus"`
Codex	`Bash` tool, piping the prompt into `.agents/skills/pxi-eval-dataset/scripts/annotate_via_codex.sh`

The Codex wrapper at .agents/skills/pxi-eval-dataset/scripts/annotate_via_codex.sh reads a prompt on stdin and writes Codex's final message to stdout. It runs codex exec with --sandbox read-only (the annotator can read the repo but cannot modify anything), --ephemeral (no session persistence between invocations), and --skip-git-repo-check. Authentication is whatever Codex is already configured with (codex login status); no OPENAI_API_KEY plumbing required. Example invocation:

cat /tmp/annotation_prompt.txt | .agents/skills/pxi-eval-dataset/scripts/annotate_via_codex.sh
# Or pin a specific model:
cat /tmp/annotation_prompt.txt | .agents/skills/pxi-eval-dataset/scripts/annotate_via_codex.sh --model o3

Run all three opinions in parallel. Send a single message containing two Agent tool calls (Sonnet, Opus) and one Bash tool call (the Codex wrapper) — they will run concurrently. Do not share context between them.

Orchestrator subprocess (required)

After the three annotation subprocesses return, run a fourth fresh subprocess as the orchestrator. Its job is to merge the candidates into a single expected: block, with authority to drop spec-invalid proposals. Use a separate subprocess so the orchestrator sees anonymized candidate blocks and cannot be biased by knowing which model produced which annotation.

Orchestrator subprocess context:

The full toolset spec (same as the annotation subprocess — focal tool marked, all other available tools included). The orchestrator needs this to judge "spec-valid" claims, especially for negative cases where the right answer is a non-focal tool.
The query and design intent block (same as the annotation subprocess — difficulty, polarity, category, notes). The orchestrator should also be willing to override the author's polarity if the toolset spec and the majority of annotation opinions disagree with it; flag any such override with metadata.notes: 'orchestrator overrode author polarity'.
All three candidate expected: blocks (anonymized — do not tell the orchestrator which model produced which proposal; this prevents reputation bias)
The dataset-specific expected-block schema

Orchestrator decision rules:

Merge portions of the expected: block where the models agree.
Drop any value that is invalid under the tool spec, evaluator contract, or dataset-specific expected-block schema. Justify each rejection against a specific clause.
Where models disagree but multiple outputs are valid and the schema supports variants, include variants only for the specific fields where variants are meaningful.
Where variants do not make sense for the field, use the majority answer.
Where there is no valid majority and the schema allows omission, omit the disputed optional field rather than over-constraining the expected output.
For tool-call argument variants, keep at most three variants. If there are many valid argument shapes, prefer omitting tool_call_args[<tool>] so the evaluator checks tool selection without pinning arbitrary arguments.

The orchestrator must justify any "spec-invalid" rejection against a specific clause of the tool spec. It does NOT vote: it adjudicates against the spec.

Confidence metadata — surface noisy annotations

After orchestration, write the annotation provenance into the example's metadata. This makes noisy / low-confidence ground truth easy to audit later.

metadata:
  category: preset
  annotation:
    agreement: high     # high | medium | low — see rule below
    n_opinions: 3       # number of annotation subprocesses run

Agreement categorization rule:

high — all three opinions produced equivalent expected: blocks (equal tools.{required,forbidden} lists after sorting, and per-tool tool_call_args dicts equal after applying _normalize_arg_value from evals/pxi/evaluators/tools.py to each value).
medium — majority agreed; minority differed but the orchestrator was able to merge (as variants) or drop (as spec-invalid).
low — no majority; the orchestrator either fell back to omit-args or made a judgment call with weak signal.

Why this matters: an agreement: low example may still have a correct ground truth, but it's a candidate for human review — the models hedged, which can indicate either a genuinely ambiguous query or a spec the LLMs interpreted differently. A follow-up audit script can flag these for inspection.

Equality-detection caveat: the heuristic catches DSL clause reordering (because it reuses _normalize_arg_value) but does NOT catch semantically-equivalent rewrites (e.g. latency_ms >= 5000 vs latency_ms > 4999). Such cases will surface as agreement: medium or low, which is arguably the right signal — the model hedged.

Multiple-correct arg shapes: variant lists

When more than one set of arguments is reasonable per the spec, write tool_call_args[<tool>] as a list of dicts instead of a single dict. The observed call passes if any variant matches under the same subset rules as the single-dict form. This lets each variant pin a different combination of keys (e.g. a preset variant vs. a custom variant with startTime):

tool_call_args:
  set_time_range:
    - { timeRangeKey: 1d }                  # preset variant
    - { timeRangeKey: 7d }                  # preset variant
    - { timeRangeKey: custom }              # custom range, key omitted

Use variants for genuinely-ambiguous queries (e.g. "show me recent traces" when only a small number of choices are defensible). Keep the list to at most three variants. If there are many valid choices, omit the argument assertion instead. Do not use variants to paper over agent inconsistency on queries that have one clear correct answer.

Applying orchestrator results back to an existing dataset

When the orchestrator returns adjudicated expected: blocks (or new metadata.annotation blocks during a backfill), prefer targeted Edit operations on the specific examples that change. Do NOT re-serialize the whole YAML file with pyyaml or ruamel.yaml:

pyyaml.safe_dump strips comments and rewrites indentation (e.g. sequence-indent=2 and unquoted list-item style) — a 38-example file comes back with a 1000+ line diff that obscures the real semantic change.
ruamel.yaml can round-trip, but only when you pin both preserve_quotes=True AND the exact indent(mapping=M, sequence=S, offset=O) settings of the source file. Mismatched indent settings silently corrupt list-item structure (e.g. child fields land at the list-item column instead of inside the item). Recover requires a full revert.

The safe pattern for bulk merges:

Load the orchestrator's results into Python (any YAML library is fine for reading).
Build a {example_id: (new_expected, new_metadata)} map.
Iterate over each changed example and apply a focused Edit call targeting the example's existing block as old_string and the merged block as new_string. Each Edit's diff stays scoped to the one example.
Validate with load_dataset after the batch.

This preserves section-header comments by construction and produces a diff that reviews cleanly.

Section-header comments are convenience navigation only. The load-bearing per-example commentary lives in metadata.annotation.notes and metadata.triage.notes, which round-trip safely through any merge approach. If a bulk rewrite loses section comments, the dataset is still fully self-documenting — no need to reconstruct them.

7. Save and validate

Save to evals/pxi/datasets/<name>.yaml. Then:

# Parse + validate schema:
uv run python -c "from evals.pxi.harness.datasets import load_dataset; load_dataset('<name>')"

# Run the experiment end-to-end against the real PXI agent:
uv run python -m evals.pxi.harness.run_experiment --dataset <name>

Run the experiment end-to-end and triage every failure before calling the dataset done. Don't assume a failed example means a broken agent — eval suites have three plausible failure sources, and mixing them up is the easiest way to ship bad ground truth or chase phantom agent regressions.

Failure triage — three categories

For each failed example, classify the failure into one bucket:

Category	What it looks like	What to do
Dataset / annotation issue	Agent's actual output is valid and reasonable per the toolset spec, but doesn't match the dataset's `expected:` block. The dataset author (or annotation protocol) was wrong, too strict, or missed a valid variant.	Fix the dataset: relax the expectation, add a variant via the variant-list syntax, omit `tool_call_args` if the case is ambiguous, or correct an outright wrong annotation. Note in `metadata.notes` if a polarity flip is involved.
Genuine agent issue	The harness caught a real problem — the agent called the wrong tool, missed a tool, hallucinated arg values, or violated the spec.	Leave the example as-is; the failure is doing its job. Surface the failure to whoever owns the agent's behavior (file an issue, add to the regression log).
Harness / evaluator issue	The evaluator's matching logic, the runner's plumbing, or the Phoenix-side experiment integration is broken. Symptom: the agent's output and the expected block agree by any reasonable reading, but the evaluator labels it failing — or vice versa.	Fix the evaluator or harness, add a unit test in `test_evaluators.py` covering the case, then re-run. Do NOT paper over a harness bug by editing the dataset.

Triage workflow

Pull the failed example's id, observed tool calls, and evaluator label from the experiment output (the runner prints these per evaluator).
Re-read the focal tool's parameters_json_schema and the relevant evaluator code to decide which category the failure belongs in. The categorization is not always obvious — when genuinely uncertain, prefer "harness/evaluator issue" and read the evaluator first, since a broken evaluator silently corrupts both of the other categories.
Record the triage decision per failed example before fixing anything. A short list — {example_id: category} — is enough. Without this list, the fix loop tends to oscillate (fix dataset → evaluator flips on a different example → "fix" evaluator → first example breaks again).
Fix in order: harness/evaluator issues first, dataset issues second, leave genuine agent issues for last (or escalate rather than fixing them yourself).
Re-run the full experiment after each batch of fixes. Stop when the only remaining failures are genuine agent issues.

Other things to look for during validation

Examples that all pass trivially (probably duplicate coverage — delete or harden).
Coverage gaps the agent surfaces (e.g. an enum value not represented in any example).
Examples flagged annotation.agreement: low in metadata — these are exactly the candidates for human review even if they passed.

For local-only experimentation use the harness env vars (see evals/pxi/harness/README.md). There is no --limit flag — keep the dataset small while iterating, or commit a temporary copy.

YAML gotchas that cost an iteration if you miss them:

Each example must have splits: [...]. For the normal small regression datasets this skill creates, set splits: [regression]. The harness rejects missing or empty split lists.
The dataset validator uses ConfigDict(extra="forbid") on EvalDataset, so a typo in a top-level key (example instead of examples) raises rather than being silently ignored.
Quote any scalar that begins with ", [, {, :, *, &, !, or %, and any value containing a leading/trailing colon or # — e.g. condition: "status_code == 'ERROR'". Plain DSL strings without punctuation can stay unquoted, but quoting consistently is cheaper than diagnosing a parser error.
Example ids must be unique across the dataset; duplicate ids fail validation with the offending ids listed.

Dataset schema reference

Reference the live datasets in evals/pxi/datasets/ for current dataset shapes, examples, and naming conventions. Do not copy a schema from this skill into prompts. Instead, derive the expected-block schema from the selected evaluators and include that schema in each annotation and orchestration prompt.

Validator is in evals/pxi/harness/datasets.py. Matching semantics in evals/pxi/evaluators/tools.py:

Subset match on tool_call_args: an observed call passes if it has every expected key with a matching value; extra keys are ignored.
Variant match when tool_call_args[<tool>] is a list of dicts: the observed call passes if ANY variant matches. Use for genuinely ambiguous queries where multiple arg shapes are equally correct.
Any-of match when a tool is called multiple times: ANY call can satisfy the expected args.
Order-independent conjunctions: string values joined with and are normalized to a frozenset, so "span_kind == 'LLM' and latency_ms >= 5000" matches "latency_ms >= 5000 and span_kind == 'LLM'". Pure-or expressions are normalized the same way; mixed and/or fall back to exact string comparison.

Out of scope (today)

Multi-turn evaluation — the harness invokes the agent once per example. If the behavior under test is conversational, raise that as a separate task before writing examples.

pxi-eval-dataset

More from this repository

More from this repository

pxi-eval-dataset

Workflow

1. Identify the target

2. Survey the evaluators that currently exist

3. Match the target to the evaluators — pause if a gap exists

4. Enumerate coverage dimensions

5. Draft queries

6. Annotate expected outputs — subprocess annotation

Subprocess context (four sections, in this order)

Expected-block schema and annotation rules

How many opinions, and which models

Orchestrator subprocess (required)

Confidence metadata — surface noisy annotations

Multiple-correct arg shapes: variant lists

Applying orchestrator results back to an existing dataset

7. Save and validate

Failure triage — three categories

Triage workflow

Other things to look for during validation

Dataset schema reference

Out of scope (today)

pxi-eval-dataset

Workflow

1. Identify the target

2. Survey the evaluators that currently exist

3. Match the target to the evaluators — pause if a gap exists

4. Enumerate coverage dimensions

5. Draft queries

6. Annotate expected outputs — subprocess annotation

Subprocess context (four sections, in this order)

Expected-block schema and annotation rules

How many opinions, and which models

Orchestrator subprocess (required)

Confidence metadata — surface noisy annotations

Multiple-correct arg shapes: variant lists

Applying orchestrator results back to an existing dataset

7. Save and validate

Failure triage — three categories

Triage workflow

Other things to look for during validation

Dataset schema reference

Out of scope (today)