| name | waxa-eval |
| description | Use when iterating on a skill's prompt with the waxa CLI (https://github.com/mizchi/skills/tree/main/tools/waxa) — authoring scenarios, choosing graders, interpreting unclear-points, advancing a ledger, and judging convergence. Encodes the four-stage iteration pattern observed in real iter loops (structural fix → grader breadth → surface-form coverage → residual unclear) and the scenario-design pitfalls (blank-slate executor's network limit, prompt expectation explicitness, regex coverage of Japanese/English surface forms). Meta-skill: do NOT auto-invoke for routine skill edits; fires only when the user explicitly runs waxa or asks for a skill-quality eval. |
waxa-eval
Empirical evaluation loop for skill prompts, codified from real iter runs.
This skill is the operating manual for waxa; the CLI itself lives in
tools/waxa/ (see its README for argument-level reference).
waxa extends microsoft/waza with
empirical-prompt-tuning semantics on top: forced Self-report, ledger
across iterations, four grader types, and convergence detection.
When to invoke
Explicit user request only:
- "evaluate this skill with waxa"
- "iterate on
<skill> until it converges"
- "add a waxa scenario for
<scenario>"
- "interpret these unclear-points / Self-report"
When NOT to use:
- One-off skill edits (typo fix, tightening one sentence) — direct edit, no eval cycle.
- A skill is brand-new and has no scenarios yet — author the SKILL.md first; only run eval once you have a stable target to measure.
- The user did not ask for evaluation — meta-skill, no auto-invoke.
The four-stage iteration pattern
Repeated across the skill-selector, nix-setup, and skill-finder evals. Treat it as the default trajectory; it converges in 3-4 iterations when the skill is structurally sound.
| Stage | What you fix | Typical Iter | Symptom of being on this stage |
|---|
| 1. Structural fix | A whole behavior is missing — the skill never tells the executor to do X. | 1 → 2 | LLM grader and surface graders both fail; pass rate jumps after one edit. |
| 2. Grader breadth | Skill is correct; surface graders are too narrow (regex only matches one phrasing). | 2 → 3 | LLM grader passes; surface grader fails on the same axis. Widen the regex; do not "force" the LLM to use one phrasing. |
| 3. Surface-form coverage | LLM uses Japanese / synonym / abbreviation that the regex didn't anticipate. | 3 → 4 | One trial passes, the other fails on the same grader. The miss is a surface form, not a content gap. |
| 4. Residual unclear | What's left is structural to the eval setup (blank-slate executor cannot run real tools), not the skill body. | terminal | Self-report flags "Execution stuck — cannot run web fetch / external command". Mark in the ledger; do not chase. |
A skill that needs more than 4 iterations probably has a deeper design issue. Re-read the skill body before adding more graders.
Authoring scenarios (tasks/*.yaml)
Two roles, both mandatory:
- Typical (median) — the most common shape of the user request. Should pass at convergence; the gating cost is "did the skill engage at all?"
- Edge — a known failure mode the skill is supposed to handle. Examples: an out-of-scope request the skill should refuse, a sibling skill's territory the skill should defer to.
Pitfall: blank-slate executor cannot run external tools
waxa invokes the executor with claude -p --no-session-persistence --disable-slash-commands and a system prompt that explicitly forbids external tools. The executor cannot run apm view, fetch a web URL, or list files. State this constraint in the scenario prompt when the workflow would otherwise require it:
inputs:
prompt: |
<task description>
Real network fetches and `apm view` calls are not available;
narrate the workflow you would run and the candidates you would
expect, with rubric application and waxa eval / pinning notes.
Without this clarification, the executor will mark the unrunnable phase as "Execution: stuck" in Self-report. That is a false signal; the skill is not the problem.
Pitfall: prompts that don't bind expected output shape
If the prompt says "give me a list of skills" but the grader regex needs the literal token playwright-test, the executor may produce a synonym that satisfies the user but fails the regex. Either:
- Make the prompt enumerate what the answer must contain (less natural), or
- Pair every surface grader with a semantic LLM grader (preferred — see below).
Grader selection
Four types. Use them in pairs.
| Type | Use for | Cost | Failure mode |
|---|
text (regex) | Surface-literal token presence | ~free | Too narrow; misses synonyms / Japanese / abbreviation. Default to broad alternation. |
code (JS expr) | Structural assertions on the output (line count, JSON shape) | ~free | Brittle to formatting drift; reserve for shape, not content. |
self-report | Phase trace / unclear-points discipline | ~free | LLM may always self-report "OK"; pair with another grader. |
llm (LLM-as-Judge) | Semantic equivalence and rubric checks | costly | Non-deterministic; check pass rate across trials_per_task ≥ 2. |
Standard pair pattern
For each behavioral axis, write two graders:
- A surface
text grader with deliberately broad alternation. Goal: deterministic, fast, catches obvious misses.
- An
llm grader with a multi-clause rubric. Goal: catches semantic equivalents and judges nuance.
When the LLM grader passes but the surface grader fails on the same axis, widen the surface regex (Stage 2). Do not make the prompt force a specific phrasing — that distorts the skill's natural output.
Regex coverage that bites in practice
- Japanese vs English:
スクレイプ vs scrape, 非互換 vs incompatible, 信頼 vs trust. Always include both.
- Pinning vocabulary:
pin to a tag, pinned via SHA, resolved with apm view, tag-pinned. Use a single broad alternation.
- Synonyms for "stop":
stop, do not adopt, refuse, reject, defer, redirect.
A useful template:
- name: pinning_vocab
type: text
config:
regex_match:
- "(?i)(pin(ned|ning)?\\s*(to|via|with)?\\s*(a\\s+)?(tag|sha|ref|version))|(apm\\s+view)|(adopt.*after.*eval)"
Self-report — what to read for
Each trial appends a structured Self-report block. Read these fields:
- Phase trace —
Understanding / Planning / Execution / Formatting. A "stuck" entry is a triage signal: was the prompt unclear, or is the skill missing instruction for that phase?
- Unclear points — most useful diagnostic. Each has
Issue / Cause / General Fix Rule. The Rule is what you commit to the ledger.
- Discretionary fill-ins — places the executor improvised. Each one is a candidate for "the skill should specify this".
- Retries — should be 0 in a healthy run. > 0 = the skill required clarification mid-task.
A clean run has all phases OK, unclear=0, fill-ins=(none), retries=0. The next reasonable goal after unclear=0 is unclear=0 again — convergence is two consecutive runs at this state.
Ledger pattern (evals/<skill>/ledger.yaml)
One entry per iteration. Schema (working fields only):
iterations:
- id: iter-N
date: YYYY-MM-DD
scope: <full or per-task>
overall_pass_rate: <number>
per_task: [...]
unclear_points:
- issue: ...
cause: ...
general_fix_rule: ...
actions:
- <one-line summary of the change made>
convergence:
status: <converging | near_convergence | converged>
rationale: |
<why you stopped — be specific about what residuals remain>
Extracting a General Fix Rule
When you see an unclear-point, ask: what class of mistake produced this? Not "this exact thing went wrong" — what shape of error would always produce this kind of failure? The class-level rule is what generalizes; the trace is just the example.
Issue: "skill never told executor to stop when catalog has a fit"
Cause: "When NOT to use" stated alternative but didn't enforce halt
General Fix Rule:
Meta-skills that delegate to a sibling must add a mandatory
pre-flight check that explicitly halts the workflow (not merely
advises an alternative).
The rule is reusable across skills. Future evals that hit a similar shape should refer back to existing rules instead of re-discovering.
Convergence
empirical-prompt-tuning defines convergence as two consecutive runs with zero unclear-points. In practice:
- A residual unclear-point that is structural to the eval setup (e.g., "cannot fetch URLs") does not block convergence. Note it in the ledger and exclude from the convergence count.
- A residual that is non-deterministic (one trial flags it, the next does not) is a soft signal. If the skill is otherwise stable and the unclear is below the action threshold, declare convergence.
- A residual that is deterministic and content-related must be addressed before declaring convergence — this is Stage 1 work, not closure.
Mark the ledger:
convergence:
status: converged
rationale: |
Iter 1 (X%) → Iter 2 (Y%) → ... → Iter N (Z%).
Residual <name> is <reason it does not block adoption>.
Test layout convention
From waxa 0.2.0, eval files live inside the skill directory, mirroring agentskills.io's evaluating-skills layout. This lets a single skill repo carry its own eval suite and ship as a self-contained unit:
<skill>/ # the target skill (distribution unit)
├── SKILL.md
└── evals/
├── eval.yaml # config + top-level graders + task glob
├── ledger.yaml # iter history (created on first `iterate` run)
└── tasks/
├── scenario-typical.yaml # median — passes at convergence
└── scenario-edge.yaml # known failure mode — exercises the rule
Workspace (per-iteration runs) lands outside the skill at <workspace-root>/results/<skill>/iteration-N/<task-id>/<with_skill|without_skill>/{output-trial-*.txt, timing.json, grading.json} plus benchmark.json. <workspace-root> is the .waxa.yaml / .waza.yaml directory when present, otherwise the skill directory's parent. Add results/ to .gitignore.
waxa init scaffolds eval.yaml and the two task templates with TODO markers; ledger.yaml is generated when iteration starts.
The pre-0.2.0 monorepo layout (<repo-root>/evals/<skill>/eval.yaml + <repo-root>/<skill>/SKILL.md) is still auto-detected, so old evals keep working; new ones should use skill-local.
Running the loop
Bare minimum:
npx @mizchi/waxa init [--skill <name>] [--force]
npx @mizchi/waxa <skill>/evals/eval.yaml
npx @mizchi/waxa <skill>/evals/eval.yaml --task <task-id>
npx @mizchi/waxa <skill>/evals/eval.yaml --baseline
npx @mizchi/waxa iterate <skill>/evals/eval.yaml --max 4
npx @mizchi/waxa audit <skill>/ [--no-apm] [--json]
The npm package bundles references/empirical-prompt-tuning.md so the methodology is on disk wherever waxa is installed. After npx @mizchi/waxa first runs, the file lives at <node_modules>/@mizchi/waxa/references/empirical-prompt-tuning.md.
--baseline — is the skill earning its keep?
--baseline runs every task twice per trial (with_skill and without_skill), then prints a Delta line and writes both configs into iteration-N/<task-id>/. This is the agentskills.io-style "does the skill body actually improve over a blank-slate model?" check. Skills that add tokens / latency without moving pass rate are visible here in a way they aren't in single-config runs.
Per-iteration cost (claude-opus-4-8, 3 scenarios × 2 trials): ~3-5 minutes wall time. Run iterations sequentially; do not launch parallel waxa processes against the same eval (they fight for the API and the lockfile). Single-task runs (--task <id>) are useful for confirming a small change without re-running the whole suite.
trials_per_task: 2 is the floor — a single trial cannot distinguish "the skill is unstable" from "the LLM had a bad sample." Bump to 3 only if you suspect non-determinism on a critical axis.
Common pitfalls
| Mistake | Fix |
|---|
| Re-running iterations without editing the skill | If the skill body did not change, the result is just a different LLM sample. Edit before re-running. |
| Using only LLM graders | They are non-deterministic and slow. Pair with surface graders. |
| Adding graders after every failure | Graders are not the safety net; the skill is. Add a grader only when an unmeasured behavior matters. |
Treating executor: mock as real | The current tools/waxa/src/cli.ts ignores the field; all runs go through claude -p. Plan budget accordingly. |
| Iterating past 4 without re-reading the skill | If you're past iter 4 and still chasing failures, the skill body has a structural gap. Stop iterating; rewrite the section. |
| Not committing the ledger | The ledger is the durable artifact. Future evals build on past General Fix Rules. Lose it and you lose the institutional memory. |
Scope vs empirical-prompt-tuning
This skill is the implementation guide; empirical-prompt-tuning is the methodology. They are complementary — neither subsumes the other, because the dispatch mechanisms differ (Task-tool subagent vs external CLI process).
| Concern | waxa-eval | empirical-prompt-tuning |
|---|
| Methodology / first principles (bias-free executor, Self-report rationale) | (referenced) | owns |
CLI operation (waxa run / iterate / variant / compare) | owns | n/a |
| Scenario YAML authoring | owns | n/a |
| Grader selection (text / code / self-report / llm) | owns | n/a |
| Iter pattern derived from real iter loops (4-stage signature) | owns | n/a |
Ledger schema (ledger.yaml) and General Fix Rule extraction | owns | shared |
| Convergence definition (2 consecutive zero-unclear) | shared | owns the canonical version |
| Iter 0 description / body consistency check (static, no dispatch) | n/a | owns |
tool_uses relative-value analysis (only possible via Task-tool subagent) | n/a | owns |
[critical] tag in the requirements checklist | n/a | owns |
| Fix-propagation patterns (conservative / overshoot / zero-shoot) | n/a | owns |
| Pairwise-comparison caveats (counterbalance ordering) | (referenced) | owns |
| Structural review mode (text consistency only, not execution) | n/a | owns |
| Environment constraints (Task tool unavailable in current session) | n/a | owns |
| Red-flag table (self-reread rationalization, etc.) | n/a | owns |
Use empirical-prompt-tuning when: evaluating inside a Claude Code session via Task-tool subagent dispatch; needing tool_uses measurement of the executor; running the Iter 0 static consistency check; writing the [critical]-tagged requirements checklist.
Use waxa-eval (this skill) when: running the eval as an external CLI process; persisting iteration history as YAML for repeatability or CI; encoding scenarios that re-run after every skill edit; gating an external skill candidate before adoption (cf. skill-finder).
A real flow often uses both: empirical for the in-session Iter 0 + first dispatch to confirm direction, then waxa-eval for the iter loop and durable ledger.
Related
empirical-prompt-tuning — methodology this skill operationalizes; see scope table above for responsibility split
superpowers:writing-skills — TDD framing for skills; pairs with this skill (write skill → eval → iterate)
skill-finder — uses waxa-eval as the adoption gate for cross-source candidates
tools/waxa/README.md — CLI argument reference
tools/waxa/RFC-waza.md — upstream waza compatibility notes