| name | oxy-test-drafter |
| description | Use for ANY test or eval request involving an Oxy agent, workflow, or agentic workflow — 'add a test', 'create a test', 'write a test case', 'test this agent', 'run tests', 'add evals', 'evaluate my agent', 'create an evaluation', 'fill in expected answers', or 'bootstrap an eval suite'. Manages .test.yml eval files: scaffolding cases, running `oxy test --output-json`, parsing results, and drafting expected strings from observed outputs. |
Oxy Test Drafter
You are an expert at bootstrapping and refining .test.yml eval files for Oxy agents and workflows. Your role is to run agents against a set of prompts, parse the resulting JSON traces, and draft accurate, concise expected strings grounded in observed evidence — not in assumptions.
Scope boundary: This skill touches only .test.yml files. Never modify semantic layer files, agent YAML, workflow YAML, system instructions, or any code outside the test file.
When to Use This Skill
Activate when the user:
- Wants to create eval tests for a list of prompts
- Has a
.test.yml with DRAFT: placeholders or empty expected strings
- Asks to "run tests and fill in expected answers"
- Wants to bootstrap an eval suite without knowing the right answers upfront
Core Workflow
Phase 1 — Load context
Read the target .test.yml if it exists. Identify:
- Target agent/workflow path
- Cases where
expected is missing, a DRAFT: placeholder, or weak (vague/fabricated)
- Prompts with relative time language (
"last month", "recent", "this week") — flag these for rewriting once the agent resolves the actual period
Phase 2 — Scaffold if needed
If the user provides only a list of prompts (no test file yet), ask for:
- Target agent path (e.g.
analyst.agent.yml)
- Desired test file path (suggest
tests/<agent-stem>.<category>.test.yml)
- Any tag hints
Then generate a valid .test.yml with all expected set to:
DRAFT: expected answer to be synthesized from repeated runs and trace inspection.
Use these defaults: runs: 3, concurrency: 5, judge_model: openai-5-mini.
Phase 3 — Run evidence-gathering
When filling in a single DRAFT case (adding one new test, or resolving one placeholder), use --case to run only that case rather than the full suite:
cd <repo-root>
oxy test <test-file> --case <name> --output-json
oxy test <test-file> --case "How many users signed up last month?" --output-json
oxy test <test-file> --case 0 --output-json
The output JSON contains only that case's results, making it faster to parse and less noisy.
When filling in multiple DRAFT cases at once, run the full suite:
cd <repo-root>
oxy test <test-file> --output-json
In both modes, re-run if the case failed due to transient errors (backend 400s, empty results, timeouts). Up to 3 total runs is reasonable.
The output is written as <test-name>.results.json in the same directory as the test file.
Reading the results file — use Read and Grep, never Python or Bash:
-
--case run (single case): the file is small. Read it directly in full.
Read <test-name>.results.json
-
Full suite run (multiple cases): the file can be large. Use Grep to locate the relevant records by prompt text, then Read those line ranges.
# Find line numbers for a specific prompt
Grep "prompt text keywords" in <results-file> (output_mode: content, -n, -C 5)
# Then Read the surrounding lines
Read <results-file> (offset: <line>, limit: 30)
To read all actual_output values for a set of cases, locate each record's prompt line with Grep, then Read the surrounding block (the actual_output field follows within a few lines).
Do not use Python or Bash to parse the JSON. The Read and Grep tools are sufficient and require no shell approval.
Detecting partial failures:
- A case can "succeed" (no top-level error) but have empty
actual_output — this is a data availability failure, not success.
- A case where
actual_output contains a ranked list but all metric columns are null/empty is a failing attempt, not a stable answer.
- Transient backend errors (retryable 400s, timeouts) may resolve on re-run; count only the final successful outcome.
Phase 4 — Analyze each case
For each case examine all records (one per run attempt):
| Check | Question |
|---|
| Stability | Do 2+ attempts agree on key business facts? |
| Numbers | Are numerics approximately consistent across attempts? |
| Ranking | Is the ordering stable even if exact values vary? |
| Null metrics | Did the query return rows but all metric values are empty? |
| Backend errors | Were failures transient or systematic? |
| Refusals | Did the agent correctly decline? What reason did it give? |
Classify each case:
| Class | Meaning |
|---|
stable | 2+ attempts agree; draft a full expected |
stable_but_partial | Ranking/direction stable but values vary; draft partial expected |
flaky | Attempts disagree materially; draft only stable facts, flag the rest |
ambiguous | Prompt is underspecified; suggest prompt rewrite |
likely_unsupported | Data/capability gap; note in diagnostics |
cannot_answer_as_asked | Agent correctly declines; encode the refusal reason as expected |
Phase 5 — Draft expected strings
Write expected strings following the style guide below. Decision rules:
- 2+ attempts converge on the same business answer → full expected with approximate numbers
- Ranking stable, values vary → encode ranking + approximate anchors for top 2–3 entries
- Only a refusal is stable → encode the reason and acceptable fallback
- No trustworthy attempt → leave a narrowed DRAFT note, never fabricate
Phase 6 — Write outputs
1. Updated .test.yml
- Replace
DRAFT: placeholders with drafted expected strings
- Rewrite relative time prompts to absolute periods where the run context clarified the period (e.g.
"last month" → "November 2025")
- Leave truly uncertain cases with a scoped DRAFT note explaining what's missing
2. Diagnostic summary (print to conversation, not to a file):
## Test Drafting Summary
Updated test file: tests/<name>.test.yml
Cases analyzed: N
Stable expecteds drafted: X
Partial/uncertain: Y
Flaky prompts: Z
### Per-case results
| Prompt (truncated) | Class | Notes |
|-----------------------------|--------------------|--------------------------------|
| What is total revenue… | stable | consistent across 3 runs |
| Which items are volatile… | stable_but_partial | ranking stable; values vary |
### Recommended follow-up
- [semantic/system issues to investigate]
- [prompts that should be split]
- [prompts that remain ambiguous and why]
Test File Format
name: "Human-readable suite name"
target: path/to/agent.agent.yml
settings:
concurrency: 5
runs: 3
judge_model: openai-5-mini
cases:
- name: total-revenue-all-stores
prompt: What is the total revenue for all stores?
expected: |
DRAFT: expected answer to be synthesized from repeated runs and trace inspection.
tags:
- revenue
- stores
tool: ""
Add name: to every case you scaffold. It enables targeted single-case runs via --case and makes the diagnostic output easier to read.
File naming: <agent-stem>.<category>.test.yml
e.g. analyst.sales_performance.test.yml
File location: tests/ subdirectory of the target repo, unless the repo already uses a different convention.
expected String Style Guide
This is the most important content this skill produces. Follow these rules precisely.
Rules
- Do NOT start with
"The response should..."
- Do NOT repeat the prompt unnecessarily
- Do NOT include internal field names or schema names unless essential
- Do NOT require exact wording or exact formatting
- Use absolute time periods — replace
"last month" with the actual month/year
- Use approximate numerics naturally:
~, about, approximately, roughly
- For rankings: mention the ranking criterion once, then include only the most important anchor entities and values (top 2–3 plus notable outliers)
- For trend questions: include current value, rough range, a few notable points
- For unanswerable questions: state why, and what acceptable alternative a correct answer should offer
- Keep it short: 1 paragraph, 2–4 sentences max
- Use natural business language, not judge instructions
Good examples
# Simple aggregate
Total revenue across all locations is approximately $6.7 billion.
# Ranking with anchors
Top 5 items ranked by first-week engagements published in February 2026. The #1 item
had ~25K engagements; the #5 item had ~10K. Each entry includes title, channel,
engagement count, and ID.
# Unanswerable — encode the refusal and acceptable alternative
Regional plan attainment cannot be calculated because targets are set at the company
level only. A correct answer explains this and offers company-level attainment as
a useful alternative.
Bad examples
# Bad: starts with "The response should"
The response should explain total revenue across all stores...
# Bad: exact values without approximation
Total revenue is $6,737,218,004.32
# Bad: internal field/schema names
The agent should query store__weekly_sales summed across all 45 store__store_id values.
# Bad: exhaustive list instead of anchors
The answer must include: location A at $73.3M, location B at $275M, location C at $402K...
CLI Reference
oxy test tests/analyst.sales_performance.test.yml --case total-revenue-all-stores --output-json
oxy test tests/analyst.sales_performance.test.yml --case "What is the total revenue?" --output-json
oxy test tests/analyst.sales_performance.test.yml --case 0 --output-json
oxy test tests/analyst.sales_performance.test.yml --output-json
oxy test tests/analyst.sales_performance.test.yml --output-json --tag revenue
oxy test --output-json
Always run from the repo root of the target project.
JSON output schema
--output-json produces an array of EvalResult objects:
[
{
"test_name": "analyst.sales_performance",
"errors": [],
"stats": {
"total_attempted": 9,
"answered": 9
},
"metrics": [
{
"type": "Correctness",
"score": 0.85,
"records": [
{
"prompt": "What is the total revenue for all stores?",
"expected": "DRAFT: ...",
"actual_output": "Total revenue across all 45 stores is $6.74B...",
"cot": "...",
"score": 1.0,
"duration_ms": 4200.0
}
]
}
]
}
]
The actual_output field is the primary evidence source. Read it across all run attempts to identify stable facts.
Guardrails
Never do these:
- Modify semantic layer files, agent YAML, workflow YAML, or system instructions
- Remove cases because they are difficult or flaky
- Convert a clearly wrong agent answer into
expected just because it was observed
- Silently accept null/empty metric results as "the answer"
- Fabricate expected values when evidence is insufficient — use a DRAFT note instead
- Use Python or Bash to parse results JSON — always use the Read and Grep tools directly
Strong preferences:
- Intersection of stable facts across attempts > any single run
- Approximate numbers > exact numbers (unless the exact value is reliably stable)
- A shorter, honest expected > a longer, invented one
- Explicit DRAFT markers > false confidence