一键在 Manus 中运行任何 Skill

$pwd:

waxa-eval

Name: Waxa Eval
Author: mizchi

// Use when iterating on a skill's prompt with the waxa CLI (https://github.com/mizchi/skills/tree/main/tools/waxa) — authoring scenarios, choosing graders, interpreting unclear-points, advancing a ledger, and judging convergence. Encodes the four-stage iteration pattern observed in real iter loops (structural fix → grader breadth → surface-form coverage → residual unclear) and the scenario-design pitfalls (blank-slate executor's network limit, prompt expectation explicitness, regex coverage of Japanese/English surface forms). Meta-skill: do NOT auto-invoke for routine skill edits; fires only when the user explicitly runs waxa or asks for a skill-quality eval.

在 Manus 中运行

$ git log --oneline --stat

stars:244

forks:4

updated:2026年5月29日 14:10

SKILL.md

readonly

related-skills.json

同仓库

ts2moonbit-migration.md

from "mizchi/skills"

Migrate a TypeScript codebase to MoonBit on the `js` target while keeping the same JavaScript API contract, using mizchi/js (JS/Web/Node bindings), mizchi/x (cross-target async backend: process/fs/http/ws), and mizchi/npm_typed (npm bindings). Use when porting a TS library, Node service, npm package, or Cloudflare Worker to MoonBit and the existing JS/TS consumers, `.d.ts`, and tests must keep working unchanged.

2026-05-29244

sql-lint.md

from "mizchi/skills"

Static lint for sqlc-style SQL catalogs. Catches duplicate query names, missing semicolons, SELECT *, double-wildcard LIKE, and other cheap-but-easy-to-miss mistakes. Optional sqlfluff integration for full style coverage.

2026-05-29244

sql-schema-audit.md

from "mizchi/skills"

Index coverage and N+1 review aids for SQLite/D1 schemas with a sqlc catalog. Surfaces unused indexes (with FK CASCADE awareness so cascade-load-bearing indexes are not flagged), queries that scan tables without index help, and `for`-loops calling generated SQL fns.

2026-05-29244

sql-security.md

from "mizchi/skills"

SQL injection screening for host code (MoonBit / TS / Rust) plus secretlint setup notes. Flags single-line template-literal or string-concat SQL builders, regardless of value source — the scanner is line-based and does NOT trace data flow, so a clean scan is not proof of safety (multi-line template literals are missed) and every hit needs a manual review or an explicit `// sql-security: ok` opt-out.

2026-05-29244

cloudflare-deploy.md

from "mizchi/skills"

Deploy applications and infrastructure to Cloudflare using Workers, Pages, and related platform services. Use when the user asks to deploy, host, publish, or set up a project on Cloudflare.

2026-05-29244

skill-finder.md

from "mizchi/skills"

Use ONLY when the user explicitly asks to discover or evaluate a skill from OUTSIDE the curated catalog — e.g., "find a Stripe skill", "is there a skill for X?", "evaluate this candidate before adopting". Meta-skill, complementary to skill-selector: do NOT auto-invoke on routine apm-management, project-init, or catalog-resident requests. Cross-source survey across vetted registries (Anthropic official, claude-skill-registry, VoltAgent/awesome-agent-skills, ComposioHQ, Superpowers, GitHub topic) with a mandatory waxa-eval adoption gate.

2026-05-29244

package.json

"author": "mizchi"

"repository": "mizchi/skills"

打开 GitHub 仓库查看创作者相关仓库

$ install --global

$ download --local

在 Manus 中运行

$ useful --forSOC

软件开发工程师计算机与数学类职业15-1252L4

name

waxa-eval

description

Use when iterating on a skill's prompt with the waxa CLI (https://github.com/mizchi/skills/tree/main/tools/waxa) — authoring scenarios, choosing graders, interpreting unclear-points, advancing a ledger, and judging convergence. Encodes the four-stage iteration pattern observed in real iter loops (structural fix → grader breadth → surface-form coverage → residual unclear) and the scenario-design pitfalls (blank-slate executor's network limit, prompt expectation explicitness, regex coverage of Japanese/English surface forms). Meta-skill: do NOT auto-invoke for routine skill edits; fires only when the user explicitly runs waxa or asks for a skill-quality eval.

waxa-eval

Empirical evaluation loop for skill prompts, codified from real iter runs. This skill is the operating manual for waxa; the CLI itself lives in tools/waxa/ (see its README for argument-level reference).

waxa extends microsoft/waza with empirical-prompt-tuning semantics on top: forced Self-report, ledger across iterations, four grader types, and convergence detection.

When to invoke

Explicit user request only:

"evaluate this skill with waxa"
"iterate on <skill> until it converges"
"add a waxa scenario for <scenario>"
"interpret these unclear-points / Self-report"

When NOT to use:

One-off skill edits (typo fix, tightening one sentence) — direct edit, no eval cycle.
A skill is brand-new and has no scenarios yet — author the SKILL.md first; only run eval once you have a stable target to measure.
The user did not ask for evaluation — meta-skill, no auto-invoke.

The four-stage iteration pattern

Repeated across the skill-selector, nix-setup, and skill-finder evals. Treat it as the default trajectory; it converges in 3-4 iterations when the skill is structurally sound.

Stage	What you fix	Typical Iter	Symptom of being on this stage
1. Structural fix	A whole behavior is missing — the skill never tells the executor to do X.	1 → 2	LLM grader and surface graders both fail; pass rate jumps after one edit.
2. Grader breadth	Skill is correct; surface graders are too narrow (regex only matches one phrasing).	2 → 3	LLM grader passes; surface grader fails on the same axis. Widen the regex; do not "force" the LLM to use one phrasing.
3. Surface-form coverage	LLM uses Japanese / synonym / abbreviation that the regex didn't anticipate.	3 → 4	One trial passes, the other fails on the same grader. The miss is a surface form, not a content gap.
4. Residual unclear	What's left is structural to the eval setup (blank-slate executor cannot run real tools), not the skill body.	terminal	Self-report flags "Execution stuck — cannot run web fetch / external command". Mark in the ledger; do not chase.

A skill that needs more than 4 iterations probably has a deeper design issue. Re-read the skill body before adding more graders.

Authoring scenarios (`tasks/*.yaml`)

Two roles, both mandatory:

Typical (median) — the most common shape of the user request. Should pass at convergence; the gating cost is "did the skill engage at all?"
Edge — a known failure mode the skill is supposed to handle. Examples: an out-of-scope request the skill should refuse, a sibling skill's territory the skill should defer to.

Pitfall: blank-slate executor cannot run external tools

waxa invokes the executor with claude -p --no-session-persistence --disable-slash-commands and a system prompt that explicitly forbids external tools. The executor cannot run apm view, fetch a web URL, or list files. State this constraint in the scenario prompt when the workflow would otherwise require it:

inputs:
  prompt: |
    <task description>
    Real network fetches and `apm view` calls are not available;
    narrate the workflow you would run and the candidates you would
    expect, with rubric application and waxa eval / pinning notes.

Without this clarification, the executor will mark the unrunnable phase as "Execution: stuck" in Self-report. That is a false signal; the skill is not the problem.

Pitfall: prompts that don't bind expected output shape

If the prompt says "give me a list of skills" but the grader regex needs the literal token playwright-test, the executor may produce a synonym that satisfies the user but fails the regex. Either:

Make the prompt enumerate what the answer must contain (less natural), or
Pair every surface grader with a semantic LLM grader (preferred — see below).

Grader selection

Four types. Use them in pairs.

Type	Use for	Cost	Failure mode
`text` (regex)	Surface-literal token presence	~free	Too narrow; misses synonyms / Japanese / abbreviation. Default to broad alternation.
`code` (JS expr)	Structural assertions on the output (line count, JSON shape)	~free	Brittle to formatting drift; reserve for shape, not content.
`self-report`	Phase trace / unclear-points discipline	~free	LLM may always self-report "OK"; pair with another grader.
`llm` (LLM-as-Judge)	Semantic equivalence and rubric checks	costly	Non-deterministic; check pass rate across `trials_per_task ≥ 2`.

Standard pair pattern

For each behavioral axis, write two graders:

A surface text grader with deliberately broad alternation. Goal: deterministic, fast, catches obvious misses.
An llm grader with a multi-clause rubric. Goal: catches semantic equivalents and judges nuance.

When the LLM grader passes but the surface grader fails on the same axis, widen the surface regex (Stage 2). Do not make the prompt force a specific phrasing — that distorts the skill's natural output.

Regex coverage that bites in practice

Japanese vs English: スクレイプ vs scrape, 非互換 vs incompatible, 信頼 vs trust. Always include both.
Pinning vocabulary: pin to a tag, pinned via SHA, resolved with apm view, tag-pinned. Use a single broad alternation.
Synonyms for "stop": stop, do not adopt, refuse, reject, defer, redirect.

A useful template:

- name: pinning_vocab
  type: text
  config:
    regex_match:
      - "(?i)(pin(ned|ning)?\\s*(to|via|with)?\\s*(a\\s+)?(tag|sha|ref|version))|(apm\\s+view)|(adopt.*after.*eval)"

Self-report — what to read for

Each trial appends a structured Self-report block. Read these fields:

Phase trace — Understanding / Planning / Execution / Formatting. A "stuck" entry is a triage signal: was the prompt unclear, or is the skill missing instruction for that phase?
Unclear points — most useful diagnostic. Each has Issue / Cause / General Fix Rule. The Rule is what you commit to the ledger.
Discretionary fill-ins — places the executor improvised. Each one is a candidate for "the skill should specify this".
Retries — should be 0 in a healthy run. > 0 = the skill required clarification mid-task.

A clean run has all phases OK, unclear=0, fill-ins=(none), retries=0. The next reasonable goal after unclear=0 is unclear=0 again — convergence is two consecutive runs at this state.

Ledger pattern (`evals/<skill>/ledger.yaml`)

One entry per iteration. Schema (working fields only):

iterations:
  - id: iter-N
    date: YYYY-MM-DD
    scope: <full or per-task>
    overall_pass_rate: <number>
    per_task: [...]
    unclear_points:
      - issue: ...
        cause: ...
        general_fix_rule: ...        # this is the durable artifact
    actions:
      - <one-line summary of the change made>

convergence:
  status: <converging | near_convergence | converged>
  rationale: |
    <why you stopped — be specific about what residuals remain>

Extracting a General Fix Rule

When you see an unclear-point, ask: what class of mistake produced this? Not "this exact thing went wrong" — what shape of error would always produce this kind of failure? The class-level rule is what generalizes; the trace is just the example.

Issue:    "skill never told executor to stop when catalog has a fit"
Cause:    "When NOT to use" stated alternative but didn't enforce halt
General Fix Rule:
  Meta-skills that delegate to a sibling must add a mandatory
  pre-flight check that explicitly halts the workflow (not merely
  advises an alternative).

The rule is reusable across skills. Future evals that hit a similar shape should refer back to existing rules instead of re-discovering.

Convergence

empirical-prompt-tuning defines convergence as two consecutive runs with zero unclear-points. In practice:

A residual unclear-point that is structural to the eval setup (e.g., "cannot fetch URLs") does not block convergence. Note it in the ledger and exclude from the convergence count.
A residual that is non-deterministic (one trial flags it, the next does not) is a soft signal. If the skill is otherwise stable and the unclear is below the action threshold, declare convergence.
A residual that is deterministic and content-related must be addressed before declaring convergence — this is Stage 1 work, not closure.

Mark the ledger:

convergence:
  status: converged
  rationale: |
    Iter 1 (X%) → Iter 2 (Y%) → ... → Iter N (Z%).
    Residual <name> is <reason it does not block adoption>.

Test layout convention

From waxa 0.2.0, eval files live inside the skill directory, mirroring agentskills.io's evaluating-skills layout. This lets a single skill repo carry its own eval suite and ship as a self-contained unit:

<skill>/                                # the target skill (distribution unit)
├── SKILL.md
└── evals/
    ├── eval.yaml                       # config + top-level graders + task glob
    ├── ledger.yaml                     # iter history (created on first `iterate` run)
    └── tasks/
        ├── scenario-typical.yaml       # median — passes at convergence
        └── scenario-edge.yaml          # known failure mode — exercises the rule

Workspace (per-iteration runs) lands outside the skill at <workspace-root>/results/<skill>/iteration-N/<task-id>/<with_skill|without_skill>/{output-trial-*.txt, timing.json, grading.json} plus benchmark.json. <workspace-root> is the .waxa.yaml / .waza.yaml directory when present, otherwise the skill directory's parent. Add results/ to .gitignore.

waxa init scaffolds eval.yaml and the two task templates with TODO markers; ledger.yaml is generated when iteration starts.

The pre-0.2.0 monorepo layout (<repo-root>/evals/<skill>/eval.yaml + <repo-root>/<skill>/SKILL.md) is still auto-detected, so old evals keep working; new ones should use skill-local.

Running the loop

Bare minimum:

# Scaffold the eval skeleton (run inside the skill's own dir).
npx @mizchi/waxa init [--skill <name>] [--force]

# Single eval pass.
npx @mizchi/waxa <skill>/evals/eval.yaml

# Single task.
npx @mizchi/waxa <skill>/evals/eval.yaml --task <task-id>

# Single eval with baseline (with_skill vs without_skill, reports Delta).
npx @mizchi/waxa <skill>/evals/eval.yaml --baseline

# Iteration loop (auto re-runs while pass rate improves; writes ledger.yaml).
npx @mizchi/waxa iterate <skill>/evals/eval.yaml --max 4

# Audit a skill directory (apm audit hidden-Unicode scan + waxa native
# quality checks: frontmatter, body length, When-NOT-to-use, suspicious
# scripts, LICENSE).
npx @mizchi/waxa audit <skill>/ [--no-apm] [--json]

The npm package bundles references/empirical-prompt-tuning.md so the methodology is on disk wherever waxa is installed. After npx @mizchi/waxa first runs, the file lives at <node_modules>/@mizchi/waxa/references/empirical-prompt-tuning.md.

`--baseline` — is the skill earning its keep?

--baseline runs every task twice per trial (with_skill and without_skill), then prints a Delta line and writes both configs into iteration-N/<task-id>/. This is the agentskills.io-style "does the skill body actually improve over a blank-slate model?" check. Skills that add tokens / latency without moving pass rate are visible here in a way they aren't in single-config runs.

Per-iteration cost (claude-opus-4-8, 3 scenarios × 2 trials): ~3-5 minutes wall time. Run iterations sequentially; do not launch parallel waxa processes against the same eval (they fight for the API and the lockfile). Single-task runs (--task <id>) are useful for confirming a small change without re-running the whole suite.

trials_per_task: 2 is the floor — a single trial cannot distinguish "the skill is unstable" from "the LLM had a bad sample." Bump to 3 only if you suspect non-determinism on a critical axis.

Common pitfalls

Mistake	Fix
Re-running iterations without editing the skill	If the skill body did not change, the result is just a different LLM sample. Edit before re-running.
Using only LLM graders	They are non-deterministic and slow. Pair with surface graders.
Adding graders after every failure	Graders are not the safety net; the skill is. Add a grader only when an unmeasured behavior matters.
Treating `executor: mock` as real	The current `tools/waxa/src/cli.ts` ignores the field; all runs go through `claude -p`. Plan budget accordingly.
Iterating past 4 without re-reading the skill	If you're past iter 4 and still chasing failures, the skill body has a structural gap. Stop iterating; rewrite the section.
Not committing the ledger	The ledger is the durable artifact. Future evals build on past General Fix Rules. Lose it and you lose the institutional memory.

Scope vs `empirical-prompt-tuning`

This skill is the implementation guide; empirical-prompt-tuning is the methodology. They are complementary — neither subsumes the other, because the dispatch mechanisms differ (Task-tool subagent vs external CLI process).

Concern	waxa-eval	empirical-prompt-tuning
Methodology / first principles (bias-free executor, Self-report rationale)	(referenced)	owns
CLI operation (`waxa run` / `iterate` / `variant` / `compare`)	owns	n/a
Scenario YAML authoring	owns	n/a
Grader selection (text / code / self-report / llm)	owns	n/a
Iter pattern derived from real iter loops (4-stage signature)	owns	n/a
Ledger schema (`ledger.yaml`) and General Fix Rule extraction	owns	shared
Convergence definition (2 consecutive zero-unclear)	shared	owns the canonical version
Iter 0 description / body consistency check (static, no dispatch)	n/a	owns
`tool_uses` relative-value analysis (only possible via Task-tool subagent)	n/a	owns
`[critical]` tag in the requirements checklist	n/a	owns
Fix-propagation patterns (conservative / overshoot / zero-shoot)	n/a	owns
Pairwise-comparison caveats (counterbalance ordering)	(referenced)	owns
Structural review mode (text consistency only, not execution)	n/a	owns
Environment constraints (Task tool unavailable in current session)	n/a	owns
Red-flag table (self-reread rationalization, etc.)	n/a	owns

Use empirical-prompt-tuning when: evaluating inside a Claude Code session via Task-tool subagent dispatch; needing tool_uses measurement of the executor; running the Iter 0 static consistency check; writing the [critical]-tagged requirements checklist.

Use waxa-eval (this skill) when: running the eval as an external CLI process; persisting iteration history as YAML for repeatability or CI; encoding scenarios that re-run after every skill edit; gating an external skill candidate before adoption (cf. skill-finder).

A real flow often uses both: empirical for the in-session Iter 0 + first dispatch to confirm direction, then waxa-eval for the iter loop and durable ledger.

empirical-prompt-tuning — methodology this skill operationalizes; see scope table above for responsibility split
superpowers:writing-skills — TDD framing for skills; pairs with this skill (write skill → eval → iterate)
skill-finder — uses waxa-eval as the adoption gate for cross-source candidates
tools/waxa/README.md — CLI argument reference
tools/waxa/RFC-waza.md — upstream waza compatibility notes

waxa-eval

waxa-eval

When to invoke

The four-stage iteration pattern

Authoring scenarios (`tasks/*.yaml`)

Pitfall: blank-slate executor cannot run external tools

Pitfall: prompts that don't bind expected output shape

Grader selection

Standard pair pattern

Regex coverage that bites in practice

Self-report — what to read for

Ledger pattern (`evals/<skill>/ledger.yaml`)

Extracting a General Fix Rule

Convergence

Test layout convention

Running the loop

`--baseline` — is the skill earning its keep?

Common pitfalls

Scope vs `empirical-prompt-tuning`

Related

waxa-eval

When to invoke

The four-stage iteration pattern

Authoring scenarios (`tasks/*.yaml`)

Pitfall: blank-slate executor cannot run external tools

Pitfall: prompts that don't bind expected output shape

Grader selection

Standard pair pattern

Regex coverage that bites in practice

Self-report — what to read for

Ledger pattern (`evals/<skill>/ledger.yaml`)

Extracting a General Fix Rule

Convergence

Test layout convention

Running the loop

`--baseline` — is the skill earning its keep?

Common pitfalls

Scope vs `empirical-prompt-tuning`

Related

waxa-eval

同仓库更多 Skills

同仓库更多 Skills

waxa-eval

When to invoke

The four-stage iteration pattern

Authoring scenarios (tasks/*.yaml)

Pitfall: blank-slate executor cannot run external tools

Pitfall: prompts that don't bind expected output shape

Grader selection

Standard pair pattern

Regex coverage that bites in practice

Self-report — what to read for

Ledger pattern (evals/<skill>/ledger.yaml)

Extracting a General Fix Rule

Convergence

Test layout convention

Running the loop

--baseline — is the skill earning its keep?

Common pitfalls

Scope vs empirical-prompt-tuning

Related

waxa-eval

When to invoke

The four-stage iteration pattern

Authoring scenarios (tasks/*.yaml)

Pitfall: blank-slate executor cannot run external tools

Pitfall: prompts that don't bind expected output shape

Grader selection

Standard pair pattern

Regex coverage that bites in practice

Self-report — what to read for

Ledger pattern (evals/<skill>/ledger.yaml)

Extracting a General Fix Rule

Convergence

Test layout convention

Running the loop

--baseline — is the skill earning its keep?

Common pitfalls

Scope vs empirical-prompt-tuning

Related

Authoring scenarios (`tasks/*.yaml`)

Ledger pattern (`evals/<skill>/ledger.yaml`)

`--baseline` — is the skill earning its keep?

Scope vs `empirical-prompt-tuning`

Authoring scenarios (`tasks/*.yaml`)

Ledger pattern (`evals/<skill>/ledger.yaml`)

`--baseline` — is the skill earning its keep?

Scope vs `empirical-prompt-tuning`