Name: Espdl Quantize
Author: espressif


stars: 1,029

forks: 205

updated: 21 May 2026, 13:25


package.json

"author": "espressif"

"repository": "espressif/esp-dl"


name

espdl-quantize

description

Iteratively tune esp-ppq QuantizationSetting to recover post-quantization accuracy on ESP-DL targets. Drives a closed loop of "baseline -> calibration × TQT(default) cartesian product -> distribution-aware residual fixes -> agent-driven open exploration -> re-evaluate" in the current Python environment, using a minimal user contract (calib dataloader + evaluate function). Generic across architectures (ResNet / EfficientNet / ViT / DETR / YOLO / LSTM and any esp-ppq-supported graph) — the search procedure is distribution-driven and does not depend on a specific network family. Method ordering is accuracy-first with a soft penalty for passes that slow down on-device inference; once the prescribed Phase-1/2/3 sequence exhausts, the skill hands control to the agent (Phase 5) with a structured history of improving levers + the per-iteration error artifacts to read, so the agent can compose multi-knob iterations (lever stacking, calibration cross-pollination, ablation, cost-trim) without a rigid template. LSQ on POWER_OF_2 targets is auto-disabled (silent degenerate; use TQT instead) and esp32p4 layer-wise equalization is warn-only (esp-ppq officially "Not recommend" for per-channel weights but empirically can still help on some models). Use this skill whenever the user wants to improve a quantized esp-dl/.espdl model's accuracy, debug high quantization error, choose between calibration algorithms, decide on equalization / bias correction / weight split / mixed precision / TQT / blockwise reconstruction settings, or run an automated search over esp-ppq quantization options. Also triggers for "esp-ppq 量化调参", "降低量化误差", "Layerwise quantization error 分析", "QuantizationSettingFactory.espdl_setting 怎么调", "混合精度量化", "量化精度恢复", "espdl 量化优化".

ESP-DL Quantization Tuning Skill

This skill turns the human "stare at error report, guess setting, rerun" loop into a structured, distribution-aware search. The user owns data loading and evaluation; the skill owns QuantizationSettingFactory.espdl_setting() and the iteration loop.

About <SKILL_DIR> in shell snippets below. This skill is agent-directory agnostic — it may be installed as .cursor/skills/espdl-quantize/, .opencode/skills/espdl-quantize/, or under any other agent's skills folder. Whenever you see <SKILL_DIR> in a shell command, substitute the absolute path of the directory containing this SKILL.md (the agent runtime gives you that path when it loads the skill). Setting it once at the start of a session makes the rest copy-pasteable:
SKILL_DIR=/abs/path/to/espdl-quantize   # the directory holding this SKILL.md
All in-skill markdown links (e.g. [scripts/run_iteration.py](scripts/run_iteration.py)) are already relative to <SKILL_DIR> and need no substitution.

Generality boundary

This skill is a general framework for any esp-ppq-supported graph: ResNet, EfficientNet, ViT, DETR, YOLO family, LSTM, custom CV/NLP backbones — anything espdl_quantize_torch / espdl_quantize_onnx can ingest. The Phase 2 calib×TQT cartesian product, the Phase 3 lever ordering, and the four Composition discipline rules are all distribution-driven; none of them depends on a specific network structure or family. The Worked example at the end uses MobileNet-V2 on ESP32-P4 as an empirical demonstration — its concrete numbers are illustrative, not a model-selection threshold.

Why this skill exists

esp-ppq exposes a dozen tunable passes (calibration algorithm, layerwise equalization, bias correction, weight split, mixed precision via dispatching table, TQT, LSQ, blockwise reconstruction, ...). Each has 2-6 parameters. Trying them by hand is slow and error-prone.

What this skill brings to the table:

Knowledge — every esp-ppq method's principle, parameters, applicable scenarios, and anti-patterns are codified in references/ppq_methods.md.
A decision rulebook — given the top-K worst layers' input/weight/output distributions, references/decision_playbook.md maps observed patterns to candidate methods.
A fixed harness — scripts/run_iteration.py takes the user's contract module plus a JSON setting and emits structured artifacts (metrics, layerwise error, per-layer stats) so the agent only has to read JSON to make the next decision.
A search state machine — scripts/compare_iterations.py inspects what's already been tried and tells the agent which iteration to run next via comparison.json["next_step_hint"]. The hint embeds a complete setting.json template so the agent only has to fill in the rationale.
Target-aware safety net — the harness detects passes that conflict with the target's quantization policy:
- LSQ on POWER_OF_2 targets (esp32p4 / esp32s3 / c) — auto-disabled. esp-ppq's LSQDelegator silently disables continuous-scale training under POWER_OF_2, so the pass would degenerate to weight-only tuning while paying full TQT-level PC time. Use TQT instead — it trains log2_scale and is POWER_OF_2-native.
- Layer-wise equalization on esp32p4 — warn-only (changed in this revision). esp-ppq officially marks the combination as "Not recommend" (see esp-ppq/md_doc/Passes/LayerwiseEqualization.md, "Usage" section), but empirical runs show some MobileNet-family / depthwise-separable networks still benefit. The harness now lets the pass run when equalization.enabled=true and emits a strong warning; the agent should treat it as a Phase 3 lever to try only after the calib×TQT cartesian product has settled.

What the user has to provide

A single Python module (typically named user_quant.py) that exports:

QUANT_CONFIG dict — model path, input shape, target chip, bits, primary_metric, etc.
create_calib_dataloader() — returns the calibration DataLoader.
evaluate(quant_graph) — returns a dict whose keys include QUANT_CONFIG["primary_metric"].
For torch flow only: get_torch_model() — returns the nn.Module.
Optional: collate_fn(batch) and evaluate_fast(quant_graph).

The full contract spec is in references/contract.md. Two ready-to-copy examples live in assets/user_quant_torch_example.py and assets/user_quant_onnx_example.py.

The skill never edits the contract module. All iteration state lives under outputs/.
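As a rough illustration of the contract above, the following sketch checks a module object for the required exports. It is not the harness's own `--check-contract` logic; the helper name `contract_problems`, the `torch_flow` flag, and the `top1` metric name are hypothetical, and only the export names (`QUANT_CONFIG`, `create_calib_dataloader`, `evaluate`, `get_torch_model`) come from the list above.

```python
from types import SimpleNamespace

REQUIRED_CALLABLES = ("create_calib_dataloader", "evaluate")


def contract_problems(module, torch_flow=False):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    config = getattr(module, "QUANT_CONFIG", None)
    if not isinstance(config, dict):
        problems.append("QUANT_CONFIG must be a dict")
    elif "primary_metric" not in config:
        problems.append("QUANT_CONFIG is missing 'primary_metric'")
    # get_torch_model() is only required for the torch flow.
    required = REQUIRED_CALLABLES + (("get_torch_model",) if torch_flow else ())
    for name in required:
        if not callable(getattr(module, name, None)):
            problems.append(f"missing callable: {name}")
    return problems


# Stand-in for an imported user_quant module (illustrative values only).
fake_module = SimpleNamespace(
    QUANT_CONFIG={"primary_metric": "top1"},
    create_calib_dataloader=lambda: [],
    evaluate=lambda quant_graph: {"top1": 0.0},
)
print(contract_problems(fake_module))                   # []
print(contract_problems(fake_module, torch_flow=True))  # ['missing callable: get_torch_model']
```

The authoritative check remains `run_iteration.py --check-contract`; a helper like this is only useful for a quick pre-flight in a notebook or test.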

High-level flow

flowchart TD
    contract[user_quant.py] --> harness[run_iteration.py]
    setjson[outputs/iter_N/setting.json<br/>written by agent] --> harness
    harness --> ppqapi["esp_ppq.api.espdl_quantize_torch / _onnx"]
    ppqapi --> graph["esp-ppq BaseGraph<br/>(esp_ppq.IR.BaseGraph)"]
    graph --> lwerr["layerwise_error_analyse"]
    graph --> stat["statistical_analyse"]
    graph --> evalfn[user.evaluate]
    lwerr --> art[outputs/iter_N/]
    stat --> art
    evalfn --> art
    art --> compare[compare_iterations.py]
    compare --> hint["comparison.json<br/>next_step_hint"]
    hint --> phase{phase?}
    phase -- "1 / 2 / 3" --> agentTpl["agent fills rationale on embedded template"]
    phase -- "5 (open exploration)" --> agentFree["agent reads phase5_signals + artifacts,<br/>writes setting.json from scratch"]
    phase -- "4 (final)" --> finalize["outputs/best/ + outputs/final_report.md"]
    agentTpl -.writes next setting.-> setjson
    agentFree -.writes next setting.-> setjson

In Phases 1-3 the agent's job each round shrinks to: read comparison.json["next_step_hint"], copy the embedded setting.json template, fill in the rationale field, run the harness. In Phase 5 the script stops prescribing settings — the agent reads phase5_signals plus the per-iteration error artifacts and writes the next setting.json from scratch (see "Phase 5 — Agent-driven exploration" below).

Phases

Phase 0 — Validate contract and environment

Important — ignore Docker / image / /work mentions you may see elsewhere. Some user projects (and user_quant.py itself) still carry comments left over from an older Docker-based workflow — phrases like "build the image", "Phase 0 — docker 准备", /work inside Docker, or docker run --gpus all. Those are legacy text only, not steps to execute. The skill now runs entirely in the current Python interpreter.

Before any quantization, do these once per session:

Make sure the current Python environment has esp_ppq (with the [cpu] extra), torch, plus the small set of helpers the harness needs:
```
pip install -e "<path/to/esp-ppq>[cpu]"
pip install -r "$SKILL_DIR/assets/extra_requirements.txt"
```
The skill is environment-agnostic — it does not require Docker. As long as python -c "import esp_ppq, torch, onnx, onnxsim, pandas, scipy, tqdm" succeeds, you are ready to go.

Validate the user's contract module imports cleanly and exposes the required functions/keys:

python "$SKILL_DIR/scripts/run_iteration.py" \
  --user-quant <path/to/user_quant.py> \
  --output-dir <path/to/user_project>/outputs/contract_check \
  --check-contract

Make sure the iteration workdir exists (default: <contract_dir>/outputs/). The harness creates it on first run.

The working directory for python should be the directory containing user_quant.py (or any directory — the harness resolves relative paths in QUANT_CONFIG against the contract module's directory).
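The import check from the first step can also be done without importing the heavy packages. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the module list is copied from the `python -c "import ..."` line above.

```python
import importlib.util

REQUIRED_MODULES = ["esp_ppq", "torch", "onnx", "onnxsim", "pandas", "scipy", "tqdm"]


def missing_modules(names):
    """Return the top-level module names that cannot be located (nothing is imported)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("ready" if not missing else f"missing: {', '.join(missing)}")
```

This only confirms that the packages are discoverable. A broken install can still fail at import time, so the full `python -c "import ..."` check stays the final word.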

Phase 1 — Baseline (iter_0)

Run the default QuantizationSettingFactory.espdl_setting() once. The agent does NOT propose any settings here — the harness uses a built-in baseline JSON when --baseline is passed.

python "$SKILL_DIR/scripts/run_iteration.py" \
  --user-quant <path/to/user_quant.py> \
  --output-dir <path/to/user_project>/outputs/iter_0 \
  --baseline

After it finishes, read these files:

outputs/iter_0/metrics.json — what evaluate() returned, plus _primary shortcut.
outputs/iter_0/layerwise_error.json — {op_name: snr} sorted descending by error. Covers only is_computing_op (Conv / Gemm / ConvTranspose / MatMul / Attention / PPQBiasFusedMatMul / LSTM); the SNR is the isolated contribution of that op when it alone is quantized.
outputs/iter_0/layer_stats.json — statistical_analyse filtered by the layerwise top-K (legacy artifact; same coverage as layerwise).
outputs/iter_0/layer_stats_full.json — (new) the full statistical_analyse output: every non-passive op's per-input/per-output distribution + cumulative SNR. This is the only artifact that includes Add / Concat / Resize / AveragePool / Sigmoid / Softmax / GRU / LayerNorm.
outputs/iter_0/non_computing_hot_ops.json — (new) the top-K non-COMPUTING_OP layers ranked by max per-variable SNR, plus inputs_float_std_ratio (max/min Float Std of input variables, used by playbook rule R8).
outputs/iter_0/graphwise_jumps.json — (new) adjacent computing-op pairs whose cumulative SNR gap is not explained by the downstream op's isolated contribution. Lists the intervening non-computing ops as suspected culprits.
outputs/iter_0/console.log — full stdout/stderr.

Tell the user the baseline numbers, the top-5 error layer names from layerwise, and (if non-empty) the top-3 entries from non_computing_hot_ops.json. The state machine in compare_iterations.py decides when to finalize — do not stop here on your own even if iter_0 looks like it hit target_metric; run the comparison once and let next_step_hint["phase"] == "phase-4-final-report" confirm.
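A minimal sketch of pulling the top-5 layer names out of the `{op_name: value}` mapping described for `layerwise_error.json`. It assumes that a larger value means a larger error, which is how the file is described as being sorted; the helper name and the sample op names and numbers are made up.

```python
import json


def top_error_layers(layerwise_error, k=5):
    """Return the k (op_name, value) pairs with the largest error value."""
    ranked = sorted(layerwise_error.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Inline stand-in for the contents of outputs/iter_0/layerwise_error.json.
sample = json.loads('{"conv_a": 0.02, "conv_b": 0.31, "gemm_c": 0.11}')
for name, value in top_error_layers(sample, k=2):
    print(f"{name}: {value:.2f}")
```

Re-sorting rather than trusting file order keeps the helper correct even if the JSON was rewritten by another tool.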

Phase 2 — Calibration × TQT(default) cartesian product (mandatory)

This phase must run three iterations in strict sequence, each enabling exactly two fields: calib_algorithm and tqt_optimization (with the esp-ppq default schedule). No other pass is enabled. The cartesian product is what makes the search robust — calibration in esp-dl quantization is not separable from the training pass: a calibration that regresses standalone may become the strongest base when paired with TQT, and vice versa. See the Worked example below for the empirical case that motivated this design.

The TQT default schedule is strict:

{
  "lr": 1e-5,
  "steps": 500,
  "block_size": 4,
  "is_scale_trainable": true,
  "gamma": 0.0,
  "int_lambda": 0.0,
  "collecting_device": "cuda"
}

Iteration sequence:

| Iter | calib_algorithm | other passes | Purpose |
| --- | --- | --- | --- |
| `iter_1` | `kl` | TQT(default) | Pair iter_0 (kl-only) and iter_1 (kl+TQT) to read off the TQT-on-kl delta. If iter_1 hits target, stop. |
| `iter_2` | `mse` | TQT(default) | Same with mse. If it hits target, stop. |
| `iter_3` | `percentile` | TQT(default) | Same with percentile. Often the hidden winner on heavy-tailed activations even when standalone percentile would regress. |

The way to drive this is to run scripts/compare_iterations.py between iterations:

python "$SKILL_DIR/scripts/compare_iterations.py" \
  --output-dir <path/to/user_project>/outputs

comparison.json["next_step_hint"] will be phase-2-calib-tqt-sweep until all three calibrations are covered with TQT(default), and the embedded setting.json template can be copied verbatim into outputs/iter_<N>/setting.json (only fill in the rationale).

Critical: iterations are strictly sequential; never run two in parallel. There is a single GPU, calibration-data downloads can race, and any of the three legs can short-circuit the rest if it hits target_metric. If you spawn parallel subagents, the search breaks.
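The three Phase-2 settings can be pictured as follows. This is a sketch only: in practice the template embedded in `comparison.json["next_step_hint"]` is authoritative. The key names `iteration_id`, `rationale`, and the `enabled` flag inside `tqt_optimization` are assumptions about the `setting.json` layout; the TQT values are the default schedule quoted above.

```python
import copy
import json

# TQT default schedule, copied from the block above.
TQT_DEFAULT = {
    "lr": 1e-5,
    "steps": 500,
    "block_size": 4,
    "is_scale_trainable": True,
    "gamma": 0.0,
    "int_lambda": 0.0,
    "collecting_device": "cuda",
}
PHASE2_CALIBS = ["kl", "mse", "percentile"]  # iter_1, iter_2, iter_3


def phase2_settings():
    """Build one setting per calibration, each with only calib + TQT(default) enabled."""
    settings = []
    for index, calib in enumerate(PHASE2_CALIBS, start=1):
        settings.append({
            "iteration_id": f"iter_{index}",  # assumed key name
            "calib_algorithm": calib,
            "tqt_optimization": {"enabled": True, **copy.deepcopy(TQT_DEFAULT)},
            "rationale": "",  # filled in by the agent
        })
    return settings


print(json.dumps(phase2_settings()[0], indent=2))
```

The settings are generated as a list but are meant to be run one at a time, with `compare_iterations.py` in between, so that an early target hit can stop the sweep.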

Phase 3 — Residual fixes from best-so-far

After Phase 2 the best-so-far iteration is the one with the highest primary_value among iter_0..3. comparison.json["next_step_hint"] switches to phase-3-residual (or phase-3-pivot if the last two iterations both regressed vs best).

Each Phase 3 iteration must mutate from comparison.json["best_iteration"]'s setting.json and change exactly one thing. The lever order below is the linear default for deploy_runtime_priority="balanced"; lever 3a-3 is conditional (entered only when one of two specific signals fires); the speed-priority reorder is described under Accuracy-first method ordering below.

| Lever | Tier | On-device cost | What changes | When to use |
| --- | --- | --- | --- | --- |
| 3a-1 | A | 0 | TQT `steps: 500 → 1000` (lr=1e-5, block_size=4 unchanged) | Phase-2 winner is TQT-based and gap to target is non-trivial. One knob only (Composition discipline #2). |
| 3a-2 | A | 0 | TQT `lr: 1e-5 → 5e-5, steps: 1000 → 2000` (block_size=4 unchanged) | 3a-1 already gave a positive net effect. Do NOT push beyond this on the lr/steps axis (lr=1e-4 / steps=4000 consistently regress on representative reproducers). |
| 3a-3 | A | 0 | TQT `block_size: 4 → 2` (lr/steps from best-so-far unchanged) | CONDITIONAL: enter only on one of these two signals. (1) Unstable fallback: last iter was 3a-1 or 3a-2, regressed by < 0.5% relative AND introduced a new layer into the top-5 error list (TQT joint training perturbed a previously-quiet layer). (2) Gap-shrink after convergence: 3a-1/3a-2 both improved on best AND none of the R5/R8/R3 structural triggers match in best's `layer_stats.json` / `non_computing_hot_ops.json`. Smaller block_size = closer to layerwise = more stable. Do not try block_size=1 (full layerwise, no upside) or block_size≥6 (overlaps lever 3g, unstable). |
| 3b | A | 0 | `bias_correct.enabled=true` | A top-error op's output row shows ` |
| 3c | A | 0 | `fusion_alignment.align_elementwise_to = 'Align to Large'` (and friends) | R8 trigger fires on best's `non_computing_hot_ops.json`: a Concat/Add/Sub/Mul/Resize/AveragePool entry whose `max_snr ∈ (0.20, 0.30]` (primary Goldilocks band), OR `inputs_float_std_ratio > 5` (legacy reinforcement, fires outside the band too). Skipped above the band (`max_snr > 0.30`, residual too severe; use 3a-3/3d/3e), below the band (too little to fix), or via top-3 activation veto (`Relu/Swish/Sigmoid` `max_snr > 1.2× candidate`, activation-dominated; fix via TQT/int16). Constants live in `compare_iterations.py`; see references/decision_playbook.md §R8 for the empirical calibration. |
| 3d | A | 0 | enable `equalization` (full lever-3d template; do not abbreviate to `enabled=true` only, because esp-ppq defaults to `opt_level=1` while the template recommends `opt_level=2`; see Common pitfalls) | Conv→activation→Conv chain with weight per-channel `max/mean > 5`. Per-tensor weight targets (`esp32s3 / c`) are the canonical use case; on `esp32p4` the pass is warn-only: esp-ppq officially marks it "Not recommend" for per-channel weights, but it can empirically help on some MobileNet-family / depthwise-separable nets. |
| 3e | B | + | `dispatching_table` int16 on top 1-3 worst layers | One layer's SNR > 2× median of the top-5; structural fixes failed. ≤10% of total ops. Permanent on-device runtime cost (~2× cycles + ~2× activation memory on promoted ops). |
| 3f | B | + | `weight_split` on a single Conv with weight kurtosis > 10 | Equalization didn't fix it (or wasn't applicable on esp32p4). ≤3 split layers. Permanent on-device runtime cost (one extra Add op per split layer). |
| 3g | C | 0 | `blockwise_reconstruction` (last resort, stacked on top of best) | Tier A + Tier B all plateaued and gap > 5% absolute. GPU strongly recommended. The lever template no longer disables TQT: the engine runs `TrainedQuantizationThresholdPass` and then `AdaroundPass` sequentially (see `esp-ppq/esp_ppq/quantization/quantizer/base.py`), so the two passes coexist in the pipeline. PC quantization time roughly doubles vs the prior best, but accuracy attribution stays clean (blockwise is the only new variable). LSQ × {TQT, blockwise} remains hard-rejected by `apply_setting._check_mutex` because LSQ silently degenerates on POWER_OF_2 targets. |

The state machine in compare_iterations.py automatically picks the next lever per the table above and the deploy_runtime_priority knob. The agent's job each Phase-3 iteration shrinks to: read comparison.json["next_step_hint"]["advice"], copy the embedded change snippet onto best-so-far's setting.json, fill rationale with the specific layer-stats observation that drove the choice, run the harness.
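The "mutate from best and change one lever" rule can be sanity-checked by diffing the candidate setting against best-so-far. The sketch below is a hypothetical helper, not part of the skill's scripts; the setting keys in the example are assumptions about the `setting.json` layout. Note that a single lever may legitimately touch more than one field (3a-2 changes both `lr` and `steps`), so the output is something to review, not a hard pass/fail.

```python
_MISSING = object()


def flatten(setting, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} pairs."""
    flat = {}
    for key, value in setting.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def changed_paths(best, candidate, ignore=("rationale", "iteration_id")):
    """Return the sorted dotted paths whose values differ between two settings."""
    a, b = flatten(best), flatten(candidate)
    return sorted(
        path
        for path in a.keys() | b.keys()
        if path not in ignore and a.get(path, _MISSING) != b.get(path, _MISSING)
    )


best = {
    "calib_algorithm": "percentile",
    "tqt_optimization": {"enabled": True, "lr": 1e-5, "steps": 500, "block_size": 4},
    "rationale": "phase 2 winner",
}
candidate = {
    "calib_algorithm": "percentile",
    "tqt_optimization": {"enabled": True, "lr": 1e-5, "steps": 1000, "block_size": 4},
    "rationale": "lever 3a-1",
}
print(changed_paths(best, candidate))  # ['tqt_optimization.steps']
```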

Stop conditions are no longer the agent's call — when any of the following hold, the state machine yields. Two of them finalise (Phase 4); two of them hand control to the agent (Phase 5). Note that the Phase-3 cap fires at 5 iterations even though the linear-order list has 8 levers — see "Why _PHASE3_CAP=5 leaves untried linear-order levers" in the Phase 5 section below for the trade-off and how Phase 5 picks up the slack.

| Stop condition | Routes to | Why |
| --- | --- | --- |
| `primary_metric` reached `target_metric` | phase-4-final-report | Target met; further poking is a waste. |
| Plateau: last 3 iterations all within 0.1% relative of best | phase-4-final-report | Accuracy stopped moving; Phase 5 has the same problem. The window is 3 (not 2) because real iteration histories often have a single sub-0.1% wobble in the middle of an otherwise-improving run; requiring 3 consecutive flat iterations rules out that false positive. |
| `_PHASE3_CAP` (= 5) Phase-3 iterations run after Phase 2 AND target NOT reached | phase-5-agent-driven | The linear search ran out of cap-budget but the metric is still moving, so let the agent explore. The unfilled tail of the linear list shows up in `phase5_signals.untried_phase3_levers`. |
| All linear-order Phase-3 levers (3a-1, 3a-2, 3b, 3c, 3d, 3e, 3f, 3g) tried or correctly skipped AND target NOT reached | phase-5-agent-driven | Same idea: the prescribed list is exhausted, but the model has more accuracy to give. |
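The plateau rule and the first three routes can be sketched as below. This is an interpretation under stated assumptions, not the code in `compare_iterations.py`: it assumes a higher primary metric is better, that "best" is the maximum over the whole history, and that conditions are checked in the order shown. The function names and the `phase-3-residual` fallthrough are illustrative, and the "all levers tried" route is left out.

```python
def is_plateau(history, window=3, rel_tol=1e-3):
    """True when the last `window` values are all within rel_tol (0.1%) of the best value."""
    if len(history) < window:
        return False
    best = max(history)
    if best == 0:
        return False
    return all(abs(best - value) / abs(best) <= rel_tol for value in history[-window:])


def route(history, target, phase3_iters, cap=5):
    """Pick the next phase from the metric history and the Phase-3 iteration count."""
    if history and max(history) >= target:
        return "phase-4-final-report"   # target reached
    if is_plateau(history):
        return "phase-4-final-report"   # accuracy stopped moving
    if phase3_iters >= cap:
        return "phase-5-agent-driven"   # cap hit, metric still moving
    return "phase-3-residual"


print(is_plateau([70.0, 71.5, 71.52, 71.49, 71.51]))          # True
print(route([70.0, 70.5, 71.0], target=75.0, phase3_iters=5))  # phase-5-agent-driven
```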

Lever 3c is correctly skipped when R8 doesn't fire on best-so-far's non_computing_hot_ops.json (see "R8 trigger" below): the data says fusion alignment would regress, so the state machine doesn't burn an iteration on it and instead advances to 3d. The skip is recorded in comparison.json["next_step_hint"] so the agent (and the human reviewing later) sees why the lever was bypassed.
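One possible reading of the R8 trigger from the lever table, written as a predicate. The thresholds (band `(0.20, 0.30]`, std ratio `> 5`, veto factor `1.2`) are the ones quoted in the table. Everything else is an assumption: the list-of-dicts shape, the `op_type` field name, and especially the choice to apply the activation veto to the std-ratio rule as well as the band rule. The real constants and logic live in `compare_iterations.py`.

```python
BAND_LOW, BAND_HIGH = 0.20, 0.30   # Goldilocks band is (low, high]
STD_RATIO_LIMIT = 5.0
VETO_FACTOR = 1.2
ACTIVATIONS = {"Relu", "Swish", "Sigmoid"}
CANDIDATE_TYPES = {"Concat", "Add", "Sub", "Mul", "Resize", "AveragePool"}


def r8_fires(hot_ops):
    """hot_ops: worst-first list of dicts with op_type, max_snr, inputs_float_std_ratio."""
    # Largest SNR among activation ops in the top-3 entries (0.0 if there are none).
    top3_activation_snr = max(
        (op["max_snr"] for op in hot_ops[:3] if op["op_type"] in ACTIVATIONS),
        default=0.0,
    )
    for op in hot_ops:
        if op["op_type"] not in CANDIDATE_TYPES:
            continue
        in_band = BAND_LOW < op["max_snr"] <= BAND_HIGH
        skewed = op.get("inputs_float_std_ratio", 0.0) > STD_RATIO_LIMIT
        vetoed = top3_activation_snr > VETO_FACTOR * op["max_snr"]
        if (in_band or skewed) and not vetoed:
            return True
    return False


print(r8_fires([{"op_type": "Add", "max_snr": 0.25, "inputs_float_std_ratio": 1.0}]))  # True
print(r8_fires([{"op_type": "Add", "max_snr": 0.40, "inputs_float_std_ratio": 1.0}]))  # False
```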

Phase 5 — Agent-driven exploration

Phase 5 exists because the Phase-3 linear search is, by design, surgically narrow. Each Phase-3 lever changes exactly one thing on top of best-so-far. That's the right shape when you're hunting for the next single biggest fix, but it can't find combinations: a recipe that needs `percentile` + TQT + equalization + 3-layer int16 + bias_correct (the example_quantize_mobilenetv2_esp32p4/outputs/ winning configuration, iter_14) is unreachable from any single Phase-3 lever applied to any single Phase-2 leg. Worse, Composition discipline #4 (calibration is not separable from the training pass) means even Phase 2 cannot tell you whether `percentile` becomes the best calibration once equalization + int16 + bias are stacked on top of it.

In Phase 5 the state machine yields and the agent drives. The contract:

The hint in comparison.json["next_step_hint"]["advice"] is meta-guidance, not a setting.json template. There is no iteration_id skeleton to fill in — you write the next setting.json from scratch.
The hint is paired with a structured comparison.json["next_step_hint"]["phase5_signals"] block that summarises history: which iterations improved over their prior best (and by how much), which regressed (so you don't re-stack their changes), what calibration algorithms haven't been tried on top of the current lever stack, and pointers to the on-disk artifacts to consult before proposing the next change.
Each Phase-5 iteration must still mutate from best-so-far (Composition discipline #1) and stop escalating after one regression (#3). Discipline #2 (one knob per iteration) is relaxed — see Composition discipline #5 below.

Inspiration patterns (these are starting points; let the actual data decide which one applies on your model):

STACK improving levers. When phase5_signals.improving_levers lists ≥2 entries (e.g. iter_5 enabled tqt_optimization, iter_9 enabled equalization), the first natural Phase-5 iteration is one that turns BOTH on at once. The mobilenetv2-p4 path went iter_11 (TQT + equalization + int16x3) → iter_13 (added calibration swap, +0.55%) → iter_14 (added bias_correct, +0.05%, final best).
CROSS-POLLINATE CALIB. When phase5_signals.untried_calib_swaps is non-empty, run a single iteration that swaps the calibration on top of the current lever stack. The Phase-2 cartesian product evaluates calib × TQT(default) in isolation; once 3-4 levers are stacked the ranking can flip — exactly what happened on mobilenetv2-p4 when percentile, which had previously been considered a calibration loser, became the winner once stacked with TQT + equalization + int16.
ABLATE. Once Phase 5 finds a new best, drop one component at a time and check whether accuracy stays above target. This produces a "minimal recipe": fewer passes, fewer surprises, shorter PC quantization time. Two ablation directions are particularly useful — drop the highest-cost component (e.g. one int16 op, or weight_split) for on-device speed; drop a Tier-A pass (TQT off, equalization off) to test which passes are actually load-bearing.
DIVE INTO ARTIFACTS. When the above three don't produce an obvious next move, open best's layerwise_error.json, layer_stats_full.json, non_computing_hot_ops.json, and graphwise_jumps.json. Pick a layer with a concrete distribution observation (e.g. "Conv layer X has Float Std skew >5 but the weight per-channel max/mean is only 2.3 — equalization won't help; this is a high-variance activation, try TQT escalation or int16 on this single op") and write the next iteration around that observation. Cite the file + the number in rationale.
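The STACK pattern amounts to merging several lever changes onto best-so-far's setting in one iteration. A minimal sketch with a recursive dict merge follows; the helper names are hypothetical and the setting keys (`equalization.opt_level`, `bias_correct.enabled`, and so on) are assumptions about the `setting.json` layout based on the field names used in this document.

```python
import copy


def deep_merge(base, patch):
    """Return a copy of base with patch applied; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def stack_levers(best_setting, lever_patches):
    """Apply each lever patch in order on top of best-so-far (best is not modified)."""
    stacked = best_setting
    for patch in lever_patches:
        stacked = deep_merge(stacked, patch)
    return stacked


best = {"calib_algorithm": "kl", "tqt_optimization": {"enabled": True, "steps": 1000}}
patches = [
    {"equalization": {"enabled": True, "opt_level": 2}},
    {"bias_correct": {"enabled": True}},
]
stacked = stack_levers(best, patches)
print(sorted(stacked))  # ['bias_correct', 'calib_algorithm', 'equalization', 'tqt_optimization']
```

Because the merge copies rather than mutates, the same `best` dict can be reused for an ablation iteration that drops one patch from the list.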

Stop signals (each → finalize). Two are auto, two are agent-driven:

1. primary_metric reached target_metric: compare_iterations.py AUTO-finalizes via phase-4-final-report. final_report.md records Stop reason category: target_reached.
2. Plateau: last 3 iterations all within 0.1% relative of best. compare_iterations.py AUTO-finalizes via phase-4-final-report. final_report.md records Stop reason category: plateau plus the three plateau values.
3. User-given iteration budget reached: agent runs --finalize --force-finalize NOW, regardless of phase and regardless of remaining untried patterns/levers. User budget is the hard ceiling. --force-finalize is the explicit opt-in that confirms "this early stop is intentional"; final_report.md records Stop reason category: force_finalize_phase5 plus the untried lists so the user can see what was skipped.
4. Coverage-exhausted "ran out of ideas" (STRICT). Only fires when ALL the following hold simultaneously:
   - the user did NOT give a specific iteration budget;
   - phase5_signals.untried_phase5_patterns is empty (every pattern attempted at least once);
   - phase5_signals.untried_phase3_levers is empty (every linear-order Phase-3 lever either tried or correctly skipped by its trigger);
   - phase5_signals.untried_5beta_reapply is empty (every calib swap re-tested on the current deepest stack; see "5β-reapply" below);
   - the most recent iterations did not produce a new best.

If signal (3) is in play, signal (4) is disabled. Keep iterating until the user budget is met, drawing fresh variations from the untried lists. Phase 5 has NO hard iteration cap from the state machine; the user is the cap.
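The strict conditions for stop signal (4) can be written as a single predicate. This is a sketch of the rule as stated above, not the skill's implementation; the function name and the `user_budget` / `recent_new_best` parameters are hypothetical, while the three `untried_*` keys are the `phase5_signals` fields named in the list.

```python
UNTRIED_KEYS = (
    "untried_phase5_patterns",
    "untried_phase3_levers",
    "untried_5beta_reapply",
)


def coverage_exhausted(signals, user_budget=None, recent_new_best=False):
    """True only when every condition for stop signal (4) holds."""
    if user_budget is not None:
        return False  # a user budget (signal 3) disables signal (4)
    if any(signals.get(key) for key in UNTRIED_KEYS):
        return False  # something is still untried
    return not recent_new_best


done = {key: [] for key in UNTRIED_KEYS}
print(coverage_exhausted(done))                              # True
print(coverage_exhausted(done, user_budget=20))              # False
print(coverage_exhausted({**done, "untried_phase3_levers": ["3f"]}))  # False
```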

The comparison.json["early_finalize_command"] field always contains the one-line --finalize invocation; the stdout "Tip" block reprints it after every run.

Hard-reject of premature --finalize: if you run --finalize while still in phase-5-agent-driven and neither target nor plateau is met AND you do NOT pass --force-finalize, compare_iterations.py PRINTS THE REJECTION BLOCK, REFUSES TO WRITE outputs/best/ or outputs/final_report.md, and EXITS WITH CODE 1. The agent must either (a) re-run without --finalize and keep iterating, or (b) pass --force-finalize to confirm intentional early stop. This is the operational enforcement of the user-budget contract in signal (3); see "How premature finalize is prevented" below for the rationale.

5β-reapply (the high-leverage coverage gap that needs explicit tracking): Composition discipline #4 says calib-only ranking does not predict the combined ranking, but Phase 2 runs before any levers are on. The corollary is that an early 5β CROSS-POLLINATE attempt on a shallow stack can produce a misleading verdict — the same calib swap on the current deepest stack may behave very differently. The canonical example is example_quantize_mobilenetv2_esp32p4 iter_13: percentile lost to kl on the Phase-2 stack but won by +0.55% on the deepest lever stack. The skill tracks this via phase5_signals.untried_5beta_reapply: a list of calibrations that appeared as 5β targets earlier in history but were not re-tested on the current best's stack. The Phase 5 hint surfaces the list explicitly, and stop signal (4) is blocked while it is non-empty.

Tunable-params soft advisory: the Phase 5 hint includes a "Tunable parameters in current best" section listing the parameter knobs available inside each enabled pass (TQT lr / steps / block_size, blockwise lr / steps / block_size, equalization opt_level / iterations, fusion_alignment direction, percentile calibration percentile) with common value ranges drawn from references/ppq_methods.md. This is a SOFT advisory — NOT part of coverage. The agent reads the section, decides whether the layerwise / non_computing_hot_ops data justifies a knob change, and proposes the next iteration accordingly. Tuning a parameter within an already-enabled pass is a valid Phase-5 move; not every variation requires turning a pass on/off.

User-budget enforcement in Phase 5

Phase 5's "no hard cap" property means the state machine will keep emitting hints forever if you let it. The user budget is what bounds the loop. Concretely:

Track the user-budget count in your head (or in scratchpad). Increment after each iteration completes.
After the N-th iteration where N == user budget, run the --finalize command and stop.
Never invoke stop signal (4) when a user budget is in play — even if all patterns and levers are covered. Spend the remaining budget on variations of attempted patterns: re-stack improving levers in different combinations, try the same pattern on a different layer subset, ablate a different component, dive into a different artifact than last time. Variation under user budget is REQUIRED; fabricating defensible variations is part of the Phase 5 contract.
The mechanism that prevents the wrong call here is rationale citation discipline #5: every iteration must name the iter id(s) whose data motivated the change. If you cannot name any prior iter that motivates the next change AND the user budget remains, look at a different artifact, find a number you can name, and use that as your rationale — do not finalize.

Why `_PHASE3_CAP=5` leaves untried linear-order levers (by design)

The Phase-3 linear-order lever list has 8 entries (3a-1, 3a-2, 3b, 3c, 3d, 3e, 3f, 3g), but _PHASE3_CAP=5 caps the number of structured single-knob Phase-3 iterations at 5. The trade-off this cap encodes:

Pro: Phase 5 can start exploring sooner (cross-pollination + ablation + stacking are higher-leverage moves than the tail of the linear list in many real cases — see the mobilenetv2-p4 iter_13 +0.55% jump).
Pro: when target_metric is hit early in Phase 3, the cap is irrelevant — the short-circuit fires first.
Con: levers near the end of the linear order (typically 3d / 3e / 3f / 3g) are often left untried when the cap fires. The 3a-3 conditional path can also occupy a cap slot, making this worse.

The skill closes the gap by treating those untried levers as first-class Phase 5 coverage targets. phase5_signals.untried_phase3_levers lists them by id; the hint explicitly directs the agent to STACK each onto best-so-far as a Phase-5 iteration before stop signal (4) can fire. Functionally a Phase-3 single-knob mutation and a Phase-5 STACK iteration produce the same setting (both mutate from best + flip one lever), so coverage is preserved — just under a different label and with a slightly larger lever-stack baseline.
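To make the cap arithmetic concrete, the sketch below computes which linear-order levers remain once some have been tried or skipped. The lever ids are the eight listed above; the helper name and the example history are illustrative, not taken from a real run.

```python
LINEAR_ORDER = ["3a-1", "3a-2", "3b", "3c", "3d", "3e", "3f", "3g"]


def untried_phase3_levers(tried, skipped=()):
    """Return linear-order levers that were neither tried nor correctly skipped."""
    done = set(tried) | set(skipped)
    return [lever for lever in LINEAR_ORDER if lever not in done]


# Five Phase-3 iterations (the cap) plus one trigger-based skip of 3c.
print(untried_phase3_levers(["3a-1", "3a-2", "3b", "3d", "3e"], skipped=["3c"]))  # ['3f', '3g']
```

With eight levers and a cap of five, at least two levers stay untried unless triggers skip some, which is why Phase 5 inherits them as coverage targets.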

Worked example: in example_quantize_mobilenetv2_bad_esp32s3/outputs/, the 20-iteration run only produced 12 iterations because the agent (a) interpreted the old "Pretending there's a 5th idea when there isn't one is worse than finalising" line as permission to bail, and (b) had no visibility into the fact that 3f weight_split and 3g blockwise_reconstruction were untouched and 5gamma ABLATE + 5delta DIVE-INTO-ARTIFACTS were untried Phase-5 patterns. Under the current contract:

The "Pretending..." sentence is gone — replaced by the hard rule that user budget trumps signal (4).
The hint now surfaces untried_phase3_levers=[3f, 3g] and untried_phase5_patterns=[5gamma, 5delta] as named targets.
Premature --finalize in phase-5 (target not met, no plateau) is HARD REJECTED: compare_iterations.py refuses to write outputs/best/ or outputs/final_report.md and exits with code 1. The agent must either continue iterating or explicitly pass --force-finalize. The earlier soft-warning version of this guardrail was a load-bearing failure mode that produced both this 12-of-20 run and the later 18-of-30 / 21-of-30 runs in example_quantize_mobilenetv2_esp32p4_tmp/outputs/.

These three together make the iteration-budget mismatch self-correcting.

How premature finalize is prevented

The hard-reject contract for --finalize is the operational enforcement of the user-budget rule in stop signal (3). Concretely:

The agent (or user) runs compare_iterations.py --finalize while phase == phase-5-agent-driven, and the script determines that target_metric is NOT reached and that the recent iterations are NOT a plateau.
compare_iterations.py prints the rejection block, listing untried_phase5_patterns, untried_phase3_levers, and untried_5beta_reapply.
comparison.json is still written (so the agent can re-read the hint), but outputs/best/ and outputs/final_report.md are NOT touched.
The script exits with code 1.
The agent picks ONE of:
- re-run without --finalize and continue iterating (use the printed untried lists as concrete next targets); OR
- pass --force-finalize alongside --finalize to confirm intentional early stop. The resulting final_report.md has a ## Stop reason section with category force_finalize_phase5 plus the untried lists so the user can see exactly what was skipped.

Why hard-reject (not soft warning): the soft warning was ignored both in the bad-esp32s3 12-of-20 run and in the esp32p4_tmp 18-of-30 / 21-of-30 runs. Making the reject load-bearing means the agent literally cannot produce a final_report.md by accident in phase-5: a positive write requires --force-finalize, which the agent will only emit when the user budget rationale is solid.
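The reject contract above reduces to a small decision function. The sketch below is an assumption-laden illustration, not the code in compare_iterations.py: `FinalizeDecision` and `decide_finalize` are hypothetical names, and only the stop-reason categories that the text ties to a specific condition are modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FinalizeDecision:
    write_outputs: bool                 # may outputs/best/ and final_report.md be written?
    exit_code: int
    stop_reason: Optional[str] = None   # category for the "## Stop reason" section
    untried: dict = field(default_factory=dict)


def decide_finalize(phase, target_reached, plateau, force_finalize, untried):
    """Hypothetical model of what happens when --finalize is passed."""
    premature = (
        phase == "phase-5-agent-driven" and not target_reached and not plateau
    )
    if premature and not force_finalize:
        # Hard reject: nothing final is written, the untried lists are surfaced.
        return FinalizeDecision(False, 1, None, untried)
    if premature:
        reason = "force_finalize_phase5"
    elif target_reached:
        reason = "target_reached"
    elif plateau:
        reason = "plateau"
    else:
        reason = None  # the manual_finalize_* categories are not modelled here
    return FinalizeDecision(True, 0, reason, untried)


gaps = {"untried_phase3_levers": ["3f", "3g"], "untried_phase5_patterns": ["5gamma", "5delta"]}
print(decide_finalize("phase-5-agent-driven", False, False, False, gaps).exit_code)  # 1
```

Passing `force_finalize=True` with the same inputs flips the decision to a write with category `force_finalize_phase5`, and the untried lists travel along so the report can show what was skipped.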

Phase 4 — Final report

Two ways to enter Phase 4:

State-machine trigger (machine view): comparison.json["next_step_hint"]["phase"] == "phase-4-final-report". The state machine emits this when target reached, plateau, Phase-3 cap, or all linear-order levers tried (see Phase 3's stop-condition list).
User-budget trigger (human view): the user gave the agent a specific iteration budget, with phrasings like "iterate 3 times", "迭代 3 轮" (iterate 3 rounds), "只跑 N 轮" (run only N rounds), "iterate N times", "最多 N 轮" (at most N rounds), "only N iterations". When this budget is hit, the user-budget trigger always wins even if the state machine still wants to keep going.

Why the auto-finalize is bullet-proof. compare_iterations.py writes outputs/best/ and outputs/final_report.md whenever either trigger fires:

Automatic when phase == "phase-4-final-report" — every invocation of compare_iterations.py --output-dir <outputs> checks this and finalizes if true. Agents reading the script's stdout will see a [compare] phase-4 detected; finalize results: ... block.
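On demand via the --finalize flag at any time, regardless of phase, with one exception: in Phase 5 with target_metric unmet and no plateau, --finalize alone is rejected and must be accompanied by --force-finalize (see "How premature finalize is prevented"). This is the escape hatch for the user-budget case: agents should copy the command from comparison.json["early_finalize_command"] (or the printed "Tip: how to wrap up at any time" block at the bottom of compare_iterations.py's stdout) and run it after the last user-budgeted iteration completes.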

The generated final_report.md carries an HTML marker comment on its first line. Subsequent finalize runs detect the marker → safely refresh the report (no data loss). If an agent has hand-edited the file (and removed the marker), subsequent finalize preserves it untouched unless --force is passed. Sections ## Key findings and ## Remaining gap (if target not met) are seeded with auto-bullets but agents are explicitly invited (via the marker comment) to expand them with concrete distribution interpretations from layer_stats.json / non_computing_hot_ops.json / graphwise_jumps.json.
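The marker-based preserve check can be pictured as follows. This is a sketch under stated assumptions: `MARKER_PREFIX` and `may_overwrite_report` are hypothetical, and the real script may match the marker line differently.

```python
# Hypothetical marker prefix; the real first-line HTML comment may differ.
MARKER_PREFIX = "<!-- auto-generated"


def may_overwrite_report(existing_text, force=False):
    """Decide whether a finalize run may rewrite final_report.md.

    existing_text is None when no report exists yet.
    """
    if existing_text is None or force:
        return True
    lines = existing_text.splitlines()
    first_line = lines[0] if lines else ""
    # Marker present: the file is still script-owned and safe to refresh.
    # Marker missing: treat the file as hand-edited and preserve it.
    return first_line.startswith(MARKER_PREFIX)
```

With this shape, an agent that wants its edits to survive automated refreshes removes the marker line, and `--force` is the only way to regenerate over it.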

Iteration history table — new columns. The auto-generated table now includes:

rank: dense ranking by primary_value (1 = best). Recomputed from disk on every finalize, so adding more iterations later won't reverse the relative order of any pre-existing pair (this is asserted by the unit tests). The ranks appear in the report and in comparison.json["iteration_ranks"].
affects inference speed: "No" for almost all settings; "Yes (...)" only when the iteration enables dispatching_table int16 promotion or weight_split (the only two passes with permanent on-device runtime cost; see "On-device runtime cost cheat-sheet"). When the best iteration has affects inference speed = Yes, inspect the rank-2/3 rows: if they trade < 0.1% accuracy for affects inference speed = No, the user may prefer the runner-up for production deployment.
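The two columns can be illustrated together. The sketch below assumes a higher primary_value is better and uses made-up rows; `dense_ranks` and `suggest_production_pick` are hypothetical helpers, and the 0.1 threshold and the rank-2/3 window are the ones quoted above.

```python
def dense_ranks(values):
    """Map iteration id -> dense rank (1 = best; ties share a rank, no gaps)."""
    distinct = sorted(set(values.values()), reverse=True)  # assumes higher is better
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {it: rank_of[value] for it, value in values.items()}


def suggest_production_pick(rows, max_trade=0.1, max_rank=3):
    """Prefer a zero-runtime-cost runner-up when it gives up < max_trade accuracy."""
    ranks = dense_ranks({r["iter"]: r["value"] for r in rows})
    ordered = sorted(rows, key=lambda r: ranks[r["iter"]])
    best = ordered[0]
    if not best["affects_speed"]:
        return best["iter"]
    for row in ordered[1:]:
        if ranks[row["iter"]] > max_rank:
            break
        if not row["affects_speed"] and best["value"] - row["value"] < max_trade:
            return row["iter"]
    return best["iter"]


# Illustrative rows only; these are not results from a real run.
rows = [
    {"iter": "iter_a", "value": 72.30, "affects_speed": True},
    {"iter": "iter_b", "value": 72.25, "affects_speed": False},
    {"iter": "iter_c", "value": 71.70, "affects_speed": False},
]
print(dense_ranks({r["iter"]: r["value"] for r in rows}))  # {'iter_a': 1, 'iter_b': 2, 'iter_c': 3}
print(suggest_production_pick(rows))  # iter_b
```

Because a dense rank depends only on how many distinct better values exist, adding later iterations can shift rank numbers but never swaps the order of two existing rows.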

Recommended agent workflow after finalize:

Read outputs/final_report.md. The Summary, Iteration history (with rank + speed columns), Best setting, Python snippet are auto-generated; expand ## Key findings and ## Remaining gap with concrete bullets if the model warrants.
Run a single full-eval re-check to replace the iteration loop's evaluate_fast() number with the user's real evaluate(): python {SKILL_DIR}/scripts/run_iteration.py --user-quant <...> --setting outputs/best/setting.json --output-dir outputs/iter_<NEW> --use-full-eval. If the resulting <primary_metric> differs from what's in the Summary, update the Summary line in final_report.md. The marker line keeps the file refreshable; once you remove the marker, subsequent automated runs won't clobber your edits unless --force is passed.
If you ever need to regenerate the report from scratch (e.g. after fixing a bug in an iteration), run: python {SKILL_DIR}/scripts/compare_iterations.py --output-dir <outputs> --finalize --force.

Legacy fallback. The pre-auto-finalize workflow (manually run --write-best, hand-write outputs/final_report.md) is still supported for completeness — if for any reason compare_iterations.py does not emit the artifacts (e.g. broken iteration data on disk), the agent can fall back to that flow. The --write-best flag now writes only outputs/best/; the report remains the agent's responsibility in that fallback.

Final-report template (auto-emitted by the script, for reference / audit):

<!-- auto-generated marker line — agents may edit Key findings / Remaining gap -->
# Final Report: <model> on <target>

## Summary
- Best iteration: iter_<N>
- <primary_metric>: <value> (target_metric=<target or "not set">)
- _Note: value comes from evaluate_fast(); run --use-full-eval to refresh._
- On-device speed cost vs baseline (best): <No | Yes (...)>
- Other metrics: <copy from outputs/best/metrics.json>

## Iteration history
| iter | method changed | <primary_metric> | delta | outcome | rank | affects inference speed |

## Best setting
<inline the FULL outputs/best/setting.json>

## Python snippet
<auto-translated QuantizationSettingFactory.espdl_setting() recipe>

## Key findings
<auto-bullets — agents extend with concrete distribution interpretations>

## Remaining gap (if target not met)
<auto-bullets — agents replace boilerplate with model-specific recommendations>

Composition discipline (read before every iteration)

These rules govern the iteration loop. If any of them is violated, the current iteration is discarded, and the agent rolls back to best-so-far and re-runs.

1. Mutate from best-so-far, not the last iteration. Always start the next setting.json from comparison.json["best_iteration"]["dir"]/setting.json. The most recent run can be a regression you should not inherit.
2. One new method (or one parameter change) per iteration (Phases 1-3 only). Calibration algorithm and tqt_optimization with the default schedule are treated as the conjoined Phase 2 base: they enter and leave together inside Phase 2. Outside Phase 2 and inside Phase 3, change exactly one knob. If two changes are stacked and the iteration regresses, you can't tell which one hurt. (Phase 5 relaxes this rule; see #5.)
3. Stop escalating after one regression. If iter_N raises a TQT hyper-parameter (or tightens a lever) and the metric drops, do not push further on that axis; pivot to a different lever from the Phase 3 table.
4. Never retire a calibration algorithm based on its calib-only score. Calibration is the input distribution shaper for downstream passes (especially TQT in esp-dl, where POWER_OF_2 makes TQT the only available training-based pass). Calib-only accuracy does not predict combined accuracy: percentile may regress standalone but become the strongest TQT base, because tail-clipping leaves more "training space" for TQT to recover. Phase 2 must always evaluate calibration with calib × TQT(default) cartesian product, never with calib-only ranking. The same principle applies in Phase 5: calibration ranked against the Phase-3 lever stack can flip vs the Phase-2 cartesian ranking, so re-test untried calibrations after the lever stack settles. See the Worked example below for the concrete reproducer.
5. Phase 5 multi-knob changes are allowed iff the rationale cites historical evidence. The "one knob per iteration" rule (#2) is relaxed in Phase 5 (an iteration may stack 2+ passes), but only when each pass change names the specific iteration whose data motivates it (e.g. "iter_5 showed lever 3a-3 stabilised the perturbed layer; iter_9 showed equalization improved bottleneck Conv chains; combining them tests whether the gains compound"). Without citations, a multi-knob change is a guess and must be split into single-knob steps as in Phase 3. The reason: in Phase 3 the linear search makes attribution clear; in Phase 5 the only mechanism keeping attribution honest is the rationale itself.
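Rule 1 ("mutate from best-so-far") can be sketched as a small loader. The key path comparison.json["best_iteration"]["dir"] and the rationale field come from the text; the helper names, and the assumption that a relative "dir" is resolved against the outputs directory, are illustrative only.

```python
import json
from pathlib import Path


def load_best_setting(outputs_dir):
    """Load setting.json of the best-so-far iteration, not of the latest one."""
    outputs = Path(outputs_dir)
    comparison = json.loads((outputs / "comparison.json").read_text())
    best_dir = Path(comparison["best_iteration"]["dir"])
    if not best_dir.is_absolute():
        best_dir = outputs / best_dir  # assumption: relative dirs live under outputs/
    return json.loads((best_dir / "setting.json").read_text())


def next_setting_from_best(outputs_dir, knob, value, rationale):
    """Single-knob mutation for Phases 1-3: copy best, change exactly one key."""
    setting = load_best_setting(outputs_dir)
    setting[knob] = value
    setting["rationale"] = rationale
    return setting
```

A call might look like `next_setting_from_best("outputs", "bias_correct", {"enabled": True}, "iter_7 layer_stats show a mean shift")`, where the knob payload shape is again an assumption about the setting schema.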

Operating principles

Always look at distributions before changing settings

The iter-0 layerwise table tells you which layers hurt, but not why. The layer_stats.json tells you why. Don't propose equalization=True because the doc says it helps with depthwise convs — propose it because the layer's per-channel weight max/mean ratio is > 5 and the layer is part of a Conv→activation→Conv chain. The references/decision_playbook.md formalises this; follow it unless you have a strong reason not to.

esp-ppq three-function coverage table

esp-ppq exposes three error analysers; each has a distinct scope and they are not interchangeable. The harness invokes all three every iteration; the playbook combines them.

| Function | Scope | What is filtered out | SNR semantics | Output artifact |
| --- | --- | --- | --- | --- |
| `layerwise_error_analyse` | `is_computing_op` only (Conv/Gemm/ConvTranspose/MatMul/Attention/PPQBiasFusedMatMul/LSTM) | Everything else, including all activations / Concat / Add / Resize / Pool / Sigmoid / Softmax | Isolated: quantize this op only, leave the rest in FP, measure the final-output SNR delta. | `layerwise_error.json` |
| `graphwise_error_analyse` | Same `is_computing_op` set | Same | Cumulative: quantize everything up to and including this op, leave downstream in FP, measure the final-output SNR. Larger than layerwise; difference reveals interactions. | `graphwise_error.json` |
| `statistical_analyse` | All `QuantableOperation` minus `PASSIVE_OPERATIONS` (`MaxPool / Slice / Reshape / Transpose / Identity / Squeeze / Unsqueeze / Cast / GatherND / Scatter / Pad / Tile / NonMaxSuppression / RoiAlign / TopK / Resize / Split / Flatten / DepthToSpace / SpaceToDepth`) | Passive ops only, so coverage includes Add / Concat / AveragePool / Sigmoid / Softmax / GRU / LayerNorm / Resize and many other non-COMPUTING_OP types | Per-variable cumulative SNR: distribution stats (Float Mean/Std/Skew/Kurt, Quant counterparts, Noise Mean/Std, Noise:Signal ratio) on every input/output tensor. SNR is computed against the FP reference at that variable, not at final output. | `layer_stats_full.json`, plus filtered `layer_stats.json` and aggregated `non_computing_hot_ops.json` |

Coverage gap and how the playbook closes it. Because layerwise filters to COMPUTING_OP, a Concat with mismatched input scales or a Resize after a wide-range sigmoid will never show up in layerwise_error.json — yet they can dominate the final-output error. Two harness-side products bridge the gap:

non_computing_hot_ops.json ranks non-COMPUTING_OP layers by their max per-variable SNR (from statistical_analyse) and adds inputs_float_std_ratio (max/min Float Std across input variables) for direct R8 trigger detection.
graphwise_jumps.json computes graphwise[op_next] − graphwise[op_prev] − layerwise[op_next] for every adjacent pair of computing ops in the simplified graph. When this excess is > 0.02 (overridable via QUANT_CONFIG["graphwise_intervening_excess_threshold"]), the non-computing ops sitting between op_prev and op_next are flagged as suspected culprits.
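The excess formula for graphwise_jumps.json can be written out directly. This is a minimal sketch assuming both analysers yield one scalar error per computing op; the function name, the record keys, and the example numbers are illustrative, while the formula and the 0.02 default follow the text.

```python
DEFAULT_EXCESS_THRESHOLD = 0.02  # overridable via QUANT_CONFIG per the text


def graphwise_jumps(op_order, graphwise, layerwise, threshold=DEFAULT_EXCESS_THRESHOLD):
    """Flag adjacent computing-op pairs whose cumulative error grows by more
    than op_next's own isolated error can explain."""
    jumps = []
    for op_prev, op_next in zip(op_order, op_order[1:]):
        excess = graphwise[op_next] - graphwise[op_prev] - layerwise[op_next]
        if excess > threshold:
            # The non-computing ops sitting between the pair are the suspects.
            jumps.append({"op_prev": op_prev, "op_next": op_next, "excess": excess})
    return jumps


# Illustrative numbers only.
order = ["conv1", "conv2", "conv3"]
graphwise = {"conv1": 0.010, "conv2": 0.060, "conv3": 0.070}
layerwise = {"conv1": 0.010, "conv2": 0.015, "conv3": 0.008}
for jump in graphwise_jumps(order, graphwise, layerwise):
    print(jump["op_prev"], "->", jump["op_next"], round(jump["excess"], 3))  # conv1 -> conv2 0.035
```

Here the conv1 to conv2 step adds 0.050 of cumulative error while conv2 alone accounts for 0.015, so whatever sits between them (a Concat, Add, or Resize, for instance) is worth inspecting in non_computing_hot_ops.json.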

The decision_playbook reads all three artifacts before proposing any Phase-3 lever.

Accuracy-first method ordering (with a soft penalty for on-device runtime cost)

Pick which pass to try next by expected accuracy gain on this model, not by how long the quantization step takes on the PC/server. PC time is a one-shot cost; on-device latency and final accuracy are what the user lives with. Apply a soft penalty to passes that slow down inference on the ESP target so they get used surgically, not blanket-on.

Tier A — High accuracy, no on-device runtime cost (the Phase 2 base + most Phase 3 levers):

TQT (tqt_optimization) — POWER_OF_2-native gradient training of log2_scale. The strongest single pass for esp-dl POWER_OF_2 targets when calibration alone plateaus. The Phase 2 cartesian product uses the conservative default schedule (lr=1e-5, steps=500, block_size=4); Phase 3 escalates in three stages: 3a-1 (steps 500→1000), 3a-2 (steps 1000→2000, lr 1e-5→5e-5), 3a-3 (CONDITIONAL — block_size 4→2 for stability when 3a-1/3a-2 perturb a quiet layer or all other levers don't apply).
Calibration algorithm (calib_algorithm) — {kl, mse, percentile} are the three options Phase 2 sweeps (each combined with TQT(default)). The interplay between calibration and TQT is non-monotonic, see Composition discipline #4.
Bias correction — fixes systematic mean-shift error baked into Conv/Gemm bias.
Equalization — large gains on Conv→Conv chains (depthwise/pointwise pairs in MobileNet, bottleneck residuals in ResNet, inverted residuals in MobileNet-V2/V3, etc.). On per-tensor weight targets (esp32s3 / c) the canonical Tier A pass; on esp32p4 per-channel weight target it is warn-only (esp-ppq officially "Not recommend" for per-channel; empirically can still help on some networks — see references/ppq_methods.md §1 for the full rationale).
Fusion alignment (fusion_setting) — surgical fix for Concat/Add/Resize layers with mismatched input ranges.
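The TQT escalation stages listed under Tier A can be captured as plain overrides on the default schedule. The values are the ones quoted above; the dictionary layout and `apply_tqt_stage` are assumptions for illustration and do not reflect the real setting.json schema.

```python
# Phase 2 default schedule, as quoted in the text.
TQT_DEFAULT = {"lr": 1e-5, "steps": 500, "block_size": 4}

# Per-stage overrides; each stage changes only the listed hyper-parameters.
TQT_STAGES = {
    "3a-1": {"steps": 1000},
    "3a-2": {"steps": 2000, "lr": 5e-5},
    "3a-3": {"block_size": 2},  # conditional stage, used for stability
}


def apply_tqt_stage(current, stage):
    """Return a new schedule with the stage's overrides applied to `current`."""
    return {**current, **TQT_STAGES[stage]}


after_3a1 = apply_tqt_stage(TQT_DEFAULT, "3a-1")
after_3a2 = apply_tqt_stage(after_3a1, "3a-2")
print(after_3a1)  # {'lr': 1e-05, 'steps': 1000, 'block_size': 4}
print(after_3a2)  # {'lr': 5e-05, 'steps': 2000, 'block_size': 4}
```

Which schedule is passed as `current` is governed by the composition rules: it should be the TQT schedule of best-so-far, and a stage that regresses is not escalated further.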

Tier B — High accuracy, but slows on-device inference (apply surgically; gain must justify the slowdown — typically only on the worst 1-3 layers):

Mixed precision via dispatching_table (→ int16) — promote the very worst layer(s) from int8 to int16. Cap at ≤10% of total ops, ideally 1-3.
Weight split (weight_split) — only on layers whose weight outliers couldn't be handled by equalization. Inserts an extra Add op per split layer at runtime.
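The "apply surgically" guidance for int16 promotion amounts to a capped top-k selection. In this sketch the error scores, the function name, and the integer percentage arithmetic are assumptions; the caps (at most 10% of ops, ideally 1-3 layers) come from the text.

```python
def pick_int16_candidates(layer_errors, total_ops, max_layers=3, max_percent=10):
    """Pick the worst layers for int16 promotion without exceeding either cap.

    layer_errors maps op name -> error score, where larger means worse
    (an assumption about how the layerwise artifact is summarised).
    """
    budget = min(max_layers, total_ops * max_percent // 100)
    ranked = sorted(layer_errors.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:budget]]


# Illustrative scores only.
errors = {"/conv_a": 0.12, "/conv_b": 0.03, "/conv_c": 0.30, "/conv_d": 0.08}
print(pick_int16_candidates(errors, total_ops=60))  # ['/conv_c', '/conv_a', '/conv_d']
print(pick_int16_candidates(errors, total_ops=15))  # ['/conv_c']
```

The returned names still have to be cross-checked against the simplified graph before they go into dispatching_table.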

Tier C — Last resort (when Tier A+B plateau and the user accepts long offline training):

Blockwise reconstruction (blockwise_reconstruction) — biggest hammer. Scales are frozen by default (POWER_OF_2-safe); only weights move. GPU strongly recommended; CPU runs are hours per iteration. Coexists with TQT in the esp-ppq pipeline — the engine runs TrainedQuantizationThresholdPass and then AdaroundPass sequentially, so the two passes do not conflict. Combining doubles PC quantization time but improves accuracy attribution when used in 3g (blockwise is the only new variable on top of best). LSQ × {TQT, blockwise} remains hard-rejected (LSQ silently degenerates on POWER_OF_2 targets).

Tier D — Auto-disabled on POWER_OF_2 targets:

LSQ (lsq_optimization) — esp-ppq's LSQDelegator disables scale training under POWER_OF_2, so LSQ would silently degenerate to weight-only tuning while paying TQT-level PC time. The harness in scripts/apply_setting.py detects this conflict and disables LSQ with a clear warning. Use TQT instead.

See references/ppq_methods.md §11 (compatibility matrix) and §12 (target-policy compatibility) for the full method-vs-method and method-vs-target compatibility rules.

On-device runtime cost cheat-sheet

The state machine reorders Phase-3 levers based on QUANT_CONFIG["deploy_runtime_priority"] (default "balanced"; "speed" defers all +-cost levers behind every zero-cost lever; "pc_time" is reserved for future use). The cost column on each lever:

| Pass / Lever | On-device cost | Why |
| --- | --- | --- |
| `calib_algorithm` (kl/mse/percentile/minmax/isotone) | 0 | Only changes scale values, not the runtime graph. |
| `tqt_optimization` (any 3a-1/3a-2/3a-3 schedule) | 0 | Trains `log2_scale` offline; output is plain POWER_OF_2 scales. |
| `bias_correct` | 0 | Adjusts the bias tensor in-place; same op count, same memory. |
| `fusion_alignment` | 0 | Forces input scales to align; emits identical kernels. |
| `equalization` | 0 | Folds a diagonal scale into adjacent weights; runtime graph unchanged. |
| `blockwise_reconstruction` | 0 | Same as TQT: output is just adjusted weights/scales. PC-time is huge though. |
| `lsq_optimization` | 0 if it ran | Auto-disabled on POWER_OF_2; not a real option here. |
| `dispatching_table` (→ int16 promotion) | + | Each promoted op pays ~2× cycles + ~2× activation memory permanently. |
| `weight_split` | + | Inserts an extra `Add` op per split layer (and the underlying Conv loses its bias-fusion). |
| Mixed-precision via per-op rules in `dispatching_table` (FP fallback) | ++ | Any op that can't run native int8/16 falls back to esp-dl FP path; multi-x latency hit. |

The reordering logic: under deploy_runtime_priority="speed", lever 3g (blockwise reconstruction, zero on-device cost but biggest PC-time) is moved ahead of 3e/3f (the two +-cost levers); under "balanced", 3e/3f come before 3g because their PC-time is shorter and their gains are more typical. The full ordering tables live in references/decision_playbook.md §"On-device cost reordering".
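The reordering can be sketched as a stable partition by on-device cost. The lever ids and the two +-cost levers (3e int16 promotion, 3f weight_split) come from the text; `order_levers` is a hypothetical name, and the reserved "pc_time" mode is deliberately not modelled.

```python
LINEAR_ORDER = ["3a-1", "3a-2", "3b", "3c", "3d", "3e", "3f", "3g"]
PLUS_COST_LEVERS = {"3e", "3f"}  # the only levers with permanent on-device cost


def order_levers(priority="balanced"):
    """Return the Phase-3 lever order for a deploy_runtime_priority value."""
    if priority == "balanced":
        return list(LINEAR_ORDER)
    if priority == "speed":
        # Stable partition: every zero-cost lever first, +-cost levers last.
        zero_cost = [lv for lv in LINEAR_ORDER if lv not in PLUS_COST_LEVERS]
        plus_cost = [lv for lv in LINEAR_ORDER if lv in PLUS_COST_LEVERS]
        return zero_cost + plus_cost
    raise ValueError(f"priority {priority!r} is not modelled in this sketch")


print(order_levers("speed"))  # ['3a-1', '3a-2', '3b', '3c', '3d', '3g', '3e', '3f']
```

Under "speed", 3g ends up ahead of 3e/3f even though it is the most expensive pass in PC time, which matches the accuracy-first, runtime-penalised ordering described above.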

Use `evaluate_fast()` during iteration if available

If the contract exposes evaluate_fast(), the harness automatically calls it during iteration. Run the full evaluate() only once at the end on the chosen best iteration. Encourage the user to provide both if their full eval is slow (>10 min).
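The selection rule fits in a few lines. The function names `evaluate` and `evaluate_fast` and the `--use-full-eval` switch come from the text; `pick_evaluator` and the stand-in contract object are assumptions made for the example.

```python
from types import SimpleNamespace


def pick_evaluator(contract, use_full_eval=False):
    """Prefer evaluate_fast() during iteration; fall back to evaluate()."""
    fast = getattr(contract, "evaluate_fast", None)
    if not use_full_eval and callable(fast):
        return fast
    return contract.evaluate


# Stand-in for the user's contract module, for illustration only.
contract = SimpleNamespace(
    evaluate=lambda: "full",
    evaluate_fast=lambda: "fast",
)
print(pick_evaluator(contract)())                      # fast
print(pick_evaluator(contract, use_full_eval=True)())  # full
```

A contract that only defines `evaluate` still works; every iteration then pays the full evaluation cost.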

Names in `dispatching_table` must match simplified ONNX

esp-ppq simplifies the ONNX graph before quantization. Op names in the simplified graph sometimes differ from the original. The harness saves outputs/iter_<N>/simplified_ops.json listing every op name in the simplified graph; cross-check candidate names against that file before adding them to dispatching_table. If the name is missing, the dispatch will silently no-op.
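The cross-check is easy to automate before an iteration is submitted. The sketch assumes simplified_ops.json is a JSON array of op-name strings and that dispatching_table is a mapping keyed by op name; both shapes are assumptions, so adjust them to the real files.

```python
import json
from pathlib import Path


def missing_dispatch_names(dispatching_table, simplified_ops_path):
    """Return dispatching_table keys that do not exist in the simplified graph."""
    known = set(json.loads(Path(simplified_ops_path).read_text()))
    return sorted(name for name in dispatching_table if name not in known)
```

An empty result means every candidate name matches; any name that is returned would otherwise be dispatched as a silent no-op.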

Common pitfalls

Overwriting the user's ONNX: espdl_quantize_onnx saves the simplified ONNX back to the input path. The harness works around this by copying the user's ONNX to a temporary file in outputs/iter_<N>/_input.onnx before calling the API. Don't bypass this.
Stale calibration: the dataloader is rebuilt every iteration — calibration is fast enough that caching across iterations isn't worth the bug surface.
Setting interested_layers=None vs []: in esp-ppq None and [] both mean "all layers" for some passes but only one is accepted by others. The harness normalises this via scripts/apply_setting.py — agents writing JSON should use null or omit the field; never use empty arrays for interested_layers.
TQT/blockwise on CPU: these passes are 10-100x faster on GPU. If device=cpu and the user asks for TQT/blockwise, warn them the iteration will take a long time before submitting.
LSQ on POWER_OF_2 targets: the harness will skip LSQ and warn — don't re-enable it in the next iteration. The accuracy you wanted from LSQ comes from TQT here.
Calling --finalize in Phase 5 without --force-finalize: the script HARD REJECTS this by design (exit code 1; outputs/best/ and outputs/final_report.md are not written) when target_metric is not reached and no plateau is detected. The rejection block surfaces untried_phase5_patterns, untried_phase3_levers, and untried_5beta_reapply so the agent can either keep iterating or, if the user budget really is exhausted, re-invoke with --force-finalize to confirm intentional early stop. The earlier soft-warning version was the load-bearing failure mode behind the 12-of-20, 18-of-30 and 21-of-30 premature-stop runs.
Skipping the Phase 2 cartesian product: this is the #1 search failure on esp-dl targets. Calib-only sweep does not work — see Composition discipline #4 and the Worked example. Always run all three legs (or short-circuit on target_metric).
Two changes per iteration: tempting when one of the changes "looks safe", but it produces confounded experiments. The classic failure mode: enabling percentile and bias_correct together, then blaming percentile when the iteration regresses (see the Worked example). One knob at a time.
Abbreviating Phase-3 lever templates (especially equalization). Writing only {"equalization": {"enabled": true}} is not equivalent to enabling lever 3d — the resolved opt_level falls back to esp-ppq's default 1, which does not cross Add/Sub branches and therefore silently skips every residual / inverted-residual chain (ResNet bottlenecks, MobileNet-V2/V3 inverted residuals, etc.) — exactly the layers lever 3d was designed to reach. The skill's lever-3d template recommends opt_level=2. Always copy comparison.json["next_step_hint"]["advice"]'s Template change snippet verbatim and only fill in rationale; the harness now emits a warning when opt_level is unset on an enabled equalization pass, and the warning is your signal that the iteration didn't actually test what its rationale claims.
Python environment drift: since the skill runs in the current Python interpreter (no Docker isolation), make sure the user's esp_ppq install matches the esp-dl tag they plan to deploy with.
Skipping --finalize when the user gave an iteration budget: the user says "iterate 3 times" / "迭代 3 轮", agent runs iter_0/1/2, sees comparison.json still pointing at phase-2-calib-tqt-sweep ("run mse next"), and accidentally runs a 4th iteration. User-budget stop > state-machine stop. As soon as the user-budgeted iteration count is reached, run the command in comparison.json["early_finalize_command"] (or copy from the "Tip: how to wrap up at any time" block in stdout) so outputs/best/ and outputs/final_report.md are produced regardless of phase. Especially important in Phase 5, which has no hard iteration cap — the only state-machine stop signals there are target_metric and plateau, so the user's budget is the primary guardrail.
Treating Phase 5 hints like a template: the Phase-5 advice text is meta-guidance, the phase5_signals block is data summary. Neither is a fillable setting.json skeleton. If you find yourself looking for "iteration_id" to paste, you're confusing Phase 3 with Phase 5 — go read phase5_signals and the on-disk error artifacts, then write the next setting.json yourself.
Phase 5 rationale without citations: a multi-knob Phase 5 iteration whose rationale is "trying equalization + bias_correct together" violates Composition discipline #5. Cite the iteration numbers whose data motivates each pass (e.g. "iter_9 showed equalization improved by +0.45 on bottleneck Convs; iter_7 layer_stats showed /features.../Conv output has |Noise Mean|=0.18 > 0.1×Noise Std=0.07 — bias_correct triggers"). Without citations the agent loses attribution and the iteration is a guess wearing a multi-knob disguise.
Bailing on Phase 5 stop signal (4) while user budget remains: this is the failure mode that bit example_quantize_mobilenetv2_bad_esp32s3/outputs/ — the agent ran 12 of 20 requested iterations, tried 3 Phase-5 attempts that regressed, and signed off with "out of ideas". Stop signal (4) is strict: it requires the user to NOT have given a budget AND untried_phase5_patterns, untried_phase3_levers, and untried_5beta_reapply to all be empty. If a user budget is in play, signal (4) is disabled — the agent must keep iterating, drawing from the untried lists or variations of attempted patterns, until the budget is hit. The hard-reject of premature --finalize enforced by compare_iterations.py is the operational guardrail; without --force-finalize, the script literally cannot write the final report while phase=phase-5 + target not met + no plateau.
Confusing on-device-cost with pc-time-cost: the new affects inference speed column refers strictly to deployment-time cost (does the quantized model run slower on the ESP target?). Long PC quantization time (e.g. blockwise reconstruction) does not show up here — see the "On-device runtime cost cheat-sheet" for the full per-pass mapping. When picking between rank=1 (best accuracy) and rank=2 (runner-up), the only reason to defer to runner-up is if rank=1's affects inference speed is Yes and the accuracy delta is small enough to live with.
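Several of the pitfalls above are mechanical enough to check before an iteration is submitted. The sketch below is a hypothetical pre-flight lint, not part of the harness: it covers the empty interested_layers array, an enabled equalization pass without opt_level, and a Phase 5 rationale that cites no iteration id. The setting layout is assumed from the snippets quoted in this section.

```python
import re

ITER_CITATION = re.compile(r"\biter_\d+\b")


def lint_setting(setting, phase):
    """Return human-readable problems found in a candidate setting.json dict."""
    problems = []

    def walk(node, path):
        if not isinstance(node, dict):
            return
        for key, value in node.items():
            if key == "interested_layers" and value == []:
                problems.append(f"{path}{key}: use null or omit the field, never []")
            walk(value, f"{path}{key}.")

    walk(setting, "")

    equalization = setting.get("equalization") or {}
    if equalization.get("enabled") and "opt_level" not in equalization:
        problems.append("equalization: enabled without opt_level")

    rationale = setting.get("rationale") or ""
    if phase == "phase-5-agent-driven" and not ITER_CITATION.search(rationale):
        problems.append("rationale: Phase 5 change cites no prior iteration id")
    return problems


bad = {
    "equalization": {"enabled": True, "interested_layers": []},
    "rationale": "trying equalization + bias_correct together",
}
for problem in lint_setting(bad, "phase-5-agent-driven"):
    print(problem)
```

The lint only catches the shape of a mistake; whether a cited iteration actually supports the change still has to be judged from the artifacts.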

Worked example: MobileNet-V2 on ESP32-P4

The effect described in Composition discipline #4 is not specific to MobileNet or to esp32p4. Any "calibration tightly coupled with a training-based pass" scenario in esp-dl quantization can exhibit it: a calibration regresses standalone but rebounds to the strongest base once paired with TQT(default). The numbers below use MobileNet-V2 on ESP32-P4 as a concrete demonstration only.

Reproducer artifacts (from a real run): example_quantize_mobilenetv2_esp32p4/outputs/, particularly outputs/comparison.json and outputs/iter_10/setting.json.

Key empirical observations

iter_0 baseline (kl, no extras): top1 = 71.15%.
iter_2 (kl + TQT(default)): top1 = 71.15% — TQT on kl produced no improvement.
iter_6 (kl + TQT aggressive lr=1e-4 steps=2000): top1 = 71.30% — the best the old skill version could find, after 6 rounds of escalation.
iter_10 (percentile + TQT(default)): top1 = 71.475% — the highest result, found only when the user manually forced this combination.

Standalone percentile would have looked bad. A clean percentile-only (no TQT) iteration would have regressed below the kl baseline (heavy-tailed activations + tighter scale = legitimate signal clipped). The old skill version inferred from a confounded iter_1 = percentile + bias_correct regression that "percentile is bad", then spent iter_2..9 on kl-only and never re-tested percentile. Even a Phase-2-style calib-only sweep (mse alone, percentile alone) would have repeated the verdict and condemned percentile a second time.

Why percentile + TQT wins anyway. Percentile clips the heavy tail at p=0.9999, giving TQT a tighter, more uniform scale grid to optimise on. TQT then recovers the clipped information through log2_scale adjustment per layer. kl-fitted scales already "compromise" with the tail, leaving less room for TQT to move; mse sits in between.

Lesson encoded into Phase 2. The skill now mandates the calib × TQT(default) cartesian product, which means three iterations: kl + TQT(default), mse + TQT(default), percentile + TQT(default). No calibration is judged by its standalone score; the combined result on the primary metric (top1, top5, or whatever the contract reports) is the ranking signal. Composition discipline #4 codifies this rule.
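The Phase 2 sweep can be sketched as a generator of three settings plus a ranking on the combined metric. The calibration names and the default TQT schedule come from the text; the key names in the generated dicts are assumptions about the setting.json schema, and the two scores used for ranking are the kl and percentile results quoted in the observations above (no mse score is reported there, so none is invented).

```python
import copy

TQT_DEFAULT = {"enabled": True, "lr": 1e-5, "steps": 500, "block_size": 4}
PHASE2_CALIBS = ("kl", "mse", "percentile")


def phase2_legs(base_setting):
    """Build one setting per calibration, each paired with TQT(default)."""
    legs = []
    for calib in PHASE2_CALIBS:
        setting = copy.deepcopy(base_setting)
        setting["calib_algorithm"] = calib            # assumed key name
        setting["tqt_optimization"] = dict(TQT_DEFAULT)  # assumed key layout
        setting["rationale"] = f"Phase 2 leg: {calib} + TQT(default)"
        legs.append(setting)
    return legs


def rank_legs(combined_scores):
    """Rank calibrations by the combined calib + TQT metric, best first."""
    return sorted(combined_scores, key=combined_scores.get, reverse=True)


print([leg["calib_algorithm"] for leg in phase2_legs({})])  # ['kl', 'mse', 'percentile']
print(rank_legs({"kl": 71.15, "percentile": 71.475}))       # ['percentile', 'kl']
```

The point of the design is that `rank_legs` never sees a calib-only score, so a calibration that looks weak on its own cannot be retired before TQT has had a chance to use it.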

Follow-on (Phase 5 reachability). The same project went on to produce iter_14 (top1 = 72.30%, the final best, beat target 71.878%) — but only after the Phase-3 linear search had finished. The trajectory was:

iter_5 (3a-3 TQT block_size=4→2): +0.30%, became best at 71.45%.
iter_9 (3d equalization opt_level=2): +0.45%, became best at 71.60%.
iter_10/iter_11 (3e int16 promotion on 2→3 worst layers): +0.075%, became best at 71.70%.
End of Phase-3 linear levers. State machine emits phase-5-agent-driven because target_metric=71.878% was still unmet. phase5_signals.improving_levers listed iter_5 (3a-3), iter_9 (3d), iter_10/11 (3e); phase5_signals.untried_calib_swaps listed mse, percentile (best was still on kl).
iter_13 (Phase 5 — calib cross-pollination on top of the lever stack: kl→percentile while keeping TQT + equalization + int16x3): +0.55% to 72.25%. The Phase-2 isolated ranking had said percentile regressed (iter_3 percentile + TQT only was 71.0%); once stacked the verdict flipped.
iter_14 (Phase 5 — stack: added bias_correct on top of iter_13): +0.05% to 72.30%, target met.

iter_13 and iter_14 were unreachable from any single Phase-3 lever applied to any single Phase-2 leg — they required composing improving Phase-3 levers with a calibration re-test. This is the kind of move Composition discipline #5 sanctions.

Reference index

references/ppq_methods.md — every esp-ppq method, principle, parameters, when to use, what to avoid, plus the target-policy compatibility matrix.
references/decision_playbook.md — distribution pattern to candidate method mapping, plus the Multi-iteration strategy table aligned with the Phases above.
references/contract.md — what the user's user_quant.py must expose.
references/setting_json_schema.md — JSON schema the agent writes each iteration, with Phase 2 / Phase 3 templates.
scripts/run_iteration.py — the harness; one quantize + analyse + evaluate pass.
scripts/apply_setting.py — pure JSON → QuantizationSetting mapping; performs target-policy compatibility checks (LSQ on POWER_OF_2 → auto-disable; esp32p4 equalization → warn-only).
scripts/compare_iterations.py: cross-iteration diff + next-step state machine driving the Phase 1 → 2 → 3 → 5 procedure.
- Finalize flags: auto-finalises on phase-4 (writes outputs/best/ + outputs/final_report.md); --finalize forces finalize at any time (escape hatch when the user stops early at a fixed iteration budget); --force overrides the marker-based preserve check on hand-edited reports.
- Phase-5 hints: surface a phase5_signals block of historical improving / regressing levers + untried calibrations + top error layers + non-computing hot ops + coverage tracking, plus a soft "Tunable parameters" advisory section. Purely data, no template. The coverage-tracking fields are phase5_pattern_coverage, untried_phase5_patterns, untried_phase3_levers (the lever ids that _PHASE3_CAP=5 left uncovered), and untried_5beta_reapply (calibrations tried earlier on a shallower stack that need re-evaluating on the current best's deeper stack).
- Coverage helpers: _classify_phase5_patterns, _phase5_cutoff_iter_id, _phase5_pattern_coverage, _untried_phase3_levers, _untried_phase5_patterns, _phase5_lever_stack, _phase5_5beta_attempts, _untried_5beta_reapply are the building blocks; the hard-reject of premature --finalize in main() consumes them and refuses to write outputs unless --force-finalize confirms intentional early stop.
- Stop reason: the finalize_reason recorded in comparison.json and the ## Stop reason section of final_report.md document the category (target_reached / plateau / force_finalize_phase5 / manual_finalize_phase4 / manual_finalize_pre_phase5) and any untried coverage gaps.
- R8 (lever 3c) trigger constants: _R8_MAX_SNR_PRIMARY (lower band) / _R8_MAX_SNR_UPPER (upper band) / _R8_STD_RATIO_REINFORCE / _R8_ACTIVATION_VETO_RATIO live at the top of the script for one-place tuning.
- Hindsight regression test: tests/hindsight_r8_examples.py parametrises over every example_quantize_*/outputs/ checked in alongside the skill.
assets/user_quant_torch_example.py — copy/edit for torch model flows.
assets/user_quant_onnx_example.py — copy/edit for ONNX model flows.
assets/extra_requirements.txt — extra Python packages the harness needs.

name

espdl-quantize

description

ESP-DL Quantization Tuning Skill

About <SKILL_DIR> in shell snippets below. This skill is agent-directory agnostic — it may be installed as .cursor/skills/espdl-quantize/, .opencode/skills/espdl-quantize/, or under any other agent's skills folder. Whenever you see <SKILL_DIR> in a shell command, substitute the absolute path of the directory containing this SKILL.md (the agent runtime gives you that path when it loads the skill). Setting it once at the start of a session makes the rest copy-pasteable:
SKILL_DIR=/abs/path/to/espdl-quantize   # the directory holding this SKILL.md
All in-skill markdown links (e.g. [scripts/run_iteration.py](scripts/run_iteration.py)) are already relative to <SKILL_DIR> and need no substitution.

Generality boundary

Why this skill exists

What this skill brings to the table:

Knowledge — every esp-ppq method's principle, parameters, applicable scenarios, and anti-patterns are codified in references/ppq_methods.md.
A decision rulebook — given the top-K worst layers' input/weight/output distributions, references/decision_playbook.md maps observed patterns to candidate methods.
A fixed harness — scripts/run_iteration.py takes the user's contract module plus a JSON setting and emits structured artifacts (metrics, layerwise error, per-layer stats) so the agent only has to read JSON to make the next decision.
A search state machine — scripts/compare_iterations.py inspects what's already been tried and tells the agent which iteration to run next via comparison.json["next_step_hint"]. The hint embeds a complete setting.json template so the agent only has to fill in the rationale.
Target-aware safety net — the harness detects passes that conflict with the target's quantization policy:
- LSQ on POWER_OF_2 targets (esp32p4 / esp32s3 / c) — auto-disabled. esp-ppq's LSQDelegator silently disables continuous-scale training under POWER_OF_2, so the pass would degenerate to weight-only tuning while paying full TQT-level PC time. Use TQT instead — it trains log2_scale and is POWER_OF_2-native.
- Layer-wise equalization on esp32p4 — warn-only (changed in this revision). esp-ppq officially marks the combination as "Not recommend" (see esp-ppq/md_doc/Passes/LayerwiseEqualization.md, "Usage" section), but empirical runs show some MobileNet-family / depthwise-separable networks still benefit. The harness now lets the pass run when equalization.enabled=true and emits a strong warning; the agent should treat it as a Phase 3 lever to try only after the calib×TQT cartesian product has settled.
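The two target-aware rules above can be sketched as a small pre-flight function. This is an illustration only: the real logic lives in scripts/apply_setting.py, the function name `apply_target_policy` is hypothetical, and the `lsq_optimization.enabled` key layout is an assumption (only `equalization.enabled` is spelled out in the text).

```python
import copy
import warnings

# Targets the text lists as POWER_OF_2.
POWER_OF_2_TARGETS = {"esp32p4", "esp32s3", "c"}

def apply_target_policy(setting, target):
    """Return a copy of `setting` adjusted for `target`, warning on conflicts."""
    out = copy.deepcopy(setting)
    lsq = out.get("lsq_optimization", {})
    if target in POWER_OF_2_TARGETS and lsq.get("enabled"):
        # Auto-disabled: TQT is the POWER_OF_2-native alternative.
        lsq["enabled"] = False
        warnings.warn("LSQ disabled on POWER_OF_2 target; use TQT instead")
    if target == "esp32p4" and out.get("equalization", {}).get("enabled"):
        # Warn-only: the pass is left enabled and still runs.
        warnings.warn("equalization on esp32p4 is marked 'Not recommend' by esp-ppq")
    return out
```

The input setting is copied rather than mutated so the caller can still diff the requested setting against the one that actually ran.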

What the user has to provide

A single Python module (typically named user_quant.py) that exports:

QUANT_CONFIG dict — model path, input shape, target chip, bits, primary_metric, etc.
create_calib_dataloader() — returns the calibration DataLoader.
evaluate(quant_graph) — returns a dict whose keys include QUANT_CONFIG["primary_metric"].
For torch flow only: get_torch_model() — returns the nn.Module.
Optional: collate_fn(batch) and evaluate_fast(quant_graph).

The full contract spec is in references/contract.md. Two ready-to-copy examples live in assets/user_quant_torch_example.py and assets/user_quant_onnx_example.py.

The skill never edits the contract module. All iteration state lives under outputs/.
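As a rough illustration of the contract listed above, the sketch below checks a module-like object for the required exports. It is not the implementation behind --check-contract; the helper name `contract_problems` and the placeholder metric name `top1` are assumptions made for the example.

```python
from types import SimpleNamespace

REQUIRED_CALLABLES = ("create_calib_dataloader", "evaluate")

def contract_problems(module, torch_flow=False):
    """List human-readable problems with a user_quant-style module."""
    problems = []
    config = getattr(module, "QUANT_CONFIG", None)
    if not isinstance(config, dict):
        problems.append("QUANT_CONFIG must be a dict")
    elif "primary_metric" not in config:
        problems.append("QUANT_CONFIG lacks 'primary_metric'")
    # get_torch_model() is only required for the torch flow.
    names = REQUIRED_CALLABLES + (("get_torch_model",) if torch_flow else ())
    for name in names:
        if not callable(getattr(module, name, None)):
            problems.append(f"missing callable: {name}")
    return problems

# Stand-in for an imported user_quant.py (all values are placeholders).
fake = SimpleNamespace(
    QUANT_CONFIG={"primary_metric": "top1"},
    create_calib_dataloader=lambda: [],
    evaluate=lambda quant_graph: {"top1": 0.0},
)
print(contract_problems(fake))
print(contract_problems(fake, torch_flow=True))
```

The first call reports no problems; the second reports the missing get_torch_model(), which only the torch flow requires.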

High-level flow

flowchart TD
    contract[user_quant.py] --> harness[run_iteration.py]
    setjson[outputs/iter_N/setting.json<br/>written by agent] --> harness
    harness --> ppqapi["esp_ppq.api.espdl_quantize_torch / _onnx"]
    ppqapi --> graph["esp-ppq BaseGraph<br/>(esp_ppq.IR.BaseGraph)"]
    graph --> lwerr["layerwise_error_analyse"]
    graph --> stat["statistical_analyse"]
    graph --> evalfn[user.evaluate]
    lwerr --> art[outputs/iter_N/]
    stat --> art
    evalfn --> art
    art --> compare[compare_iterations.py]
    compare --> hint["comparison.json<br/>next_step_hint"]
    hint --> phase{phase?}
    phase -- "1 / 2 / 3" --> agentTpl["agent fills rationale on embedded template"]
    phase -- "5 (open exploration)" --> agentFree["agent reads phase5_signals + artifacts,<br/>writes setting.json from scratch"]
    phase -- "4 (final)" --> finalize["outputs/best/ + outputs/final_report.md"]
    agentTpl -.writes next setting.-> setjson
    agentFree -.writes next setting.-> setjson

Phases

Phase 0 — Validate contract and environment

Important — ignore Docker / image / /work mentions you may see elsewhere. Some user projects (and user_quant.py itself) still carry comments left over from an older Docker-based workflow — phrases like "build the image", "Phase 0 — docker 准备", /work inside Docker, or docker run --gpus all. Those are legacy text only, not steps to execute. The skill now runs entirely in the current Python interpreter.

Before any quantization, do these once per session:

Make sure the current Python environment has esp_ppq (with the [cpu] extra), torch, plus the small set of helpers the harness needs:
```
pip install -e <path/to/esp-ppq>[cpu]
pip install -r "$SKILL_DIR/assets/extra_requirements.txt"
```
The skill is environment-agnostic — it does not require Docker. As long as python -c "import esp_ppq, torch, onnx, onnxsim, pandas, scipy, tqdm" succeeds, you are ready to go.
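The same import check can be run from Python so that every missing package is reported at once instead of stopping at the first failure. A minimal sketch using only the standard library; the helper name `missing_modules` is illustrative and not part of the skill.

```python
import importlib.util

# Packages named in the import check above.
REQUIRED = ["esp_ppq", "torch", "onnx", "onnxsim", "pandas", "scipy", "tqdm"]

def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("ready" if not missing else f"missing: {', '.join(missing)}")
```

`find_spec` only locates a module without importing it, so this check is fast but does not catch packages that are installed yet fail at import time.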

Validate the user's contract module imports cleanly and exposes the required functions/keys:

python "$SKILL_DIR/scripts/run_iteration.py" \
  --user-quant <path/to/user_quant.py> \
  --output-dir <path/to/user_project>/outputs/contract_check \
  --check-contract

Make sure the iteration workdir exists (default: <contract_dir>/outputs/). The harness creates it on first run.

The working directory for python should be the directory containing user_quant.py (or any directory — the harness resolves relative paths in QUANT_CONFIG against the contract module's directory).

Phase 1 — Baseline (iter_0)

Run the default QuantizationSettingFactory.espdl_setting() once. The agent does NOT propose any settings here — the harness uses a built-in baseline JSON when --baseline is passed.

python "$SKILL_DIR/scripts/run_iteration.py" \
  --user-quant <path/to/user_quant.py> \
  --output-dir <path/to/user_project>/outputs/iter_0 \
  --baseline

After it finishes, read these files:

outputs/iter_0/metrics.json — what evaluate() returned, plus _primary shortcut.
outputs/iter_0/layerwise_error.json — {op_name: snr} sorted descending by error. Covers only is_computing_op (Conv / Gemm / ConvTranspose / MatMul / Attention / PPQBiasFusedMatMul / LSTM); the SNR is the isolated contribution of that op when it alone is quantized.
outputs/iter_0/layer_stats.json — statistical_analyse filtered by the layerwise top-K (legacy artifact; same coverage as layerwise).
outputs/iter_0/layer_stats_full.json — (new) the full statistical_analyse output: every non-passive op's per-input/per-output distribution + cumulative SNR. This is the only artifact that includes Add / Concat / Resize / AveragePool / Sigmoid / Softmax / GRU / LayerNorm.
outputs/iter_0/non_computing_hot_ops.json — (new) the top-K non-COMPUTING_OP layers ranked by max per-variable SNR, plus inputs_float_std_ratio (max/min Float Std of input variables, used by playbook rule R8).
outputs/iter_0/graphwise_jumps.json — (new) adjacent computing-op pairs whose cumulative SNR gap is not explained by the downstream op's isolated contribution. Lists the intervening non-computing ops as suspected culprits.
outputs/iter_0/console.log — full stdout/stderr.
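A short sketch of reading the first of these artifacts. It assumes layerwise_error.json is a flat {op_name: value} mapping as described above, and that a larger value means more error; the function names are illustrative.

```python
import json
from pathlib import Path

def worst_layers(layerwise_error, k=5):
    """Return the k (op_name, value) pairs with the largest error value."""
    return sorted(layerwise_error.items(), key=lambda kv: kv[1], reverse=True)[:k]

def load_worst_layers(iter_dir, k=5):
    """Read <iter_dir>/layerwise_error.json and return its k worst entries."""
    data = json.loads(Path(iter_dir, "layerwise_error.json").read_text())
    return worst_layers(data, k)
```

For example, `load_worst_layers("outputs/iter_0", k=5)` would give the layers to look up next in layer_stats.json.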

Phase 2 — Calibration × TQT(default) cartesian product (mandatory)

The TQT default schedule is strict:

{
  "lr": 1e-5,
  "steps": 500,
  "block_size": 4,
  "is_scale_trainable": true,
  "gamma": 0.0,
  "int_lambda": 0.0,
  "collecting_device": "cuda"
}

Iteration sequence:

Iter	calib_algorithm	other passes	Purpose
`iter_1`	`kl`	TQT(default)	Pair iter_0(kl-only) and iter_1(kl+TQT) to read off the TQT-on-kl delta. If iter_1 hits target, stop.
`iter_2`	`mse`	TQT(default)	Same with mse. If hits target, stop.
`iter_3`	`percentile`	TQT(default)	Same with percentile. Often the hidden winner on heavy-tailed activations even when standalone percentile would regress.

The way to drive this is to run scripts/compare_iterations.py between iterations:

python "$SKILL_DIR/scripts/compare_iterations.py" \
  --output-dir <path/to/user_project>/outputs

Critical: iterations are strictly sequential; never run two in parallel. The legs share a single GPU, can race on the calibration-data download, and any of the three can short-circuit the rest by hitting target_metric. Spawning parallel subagents breaks the search.
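The three Phase-2 legs can be pictured as the settings below. This is a sketch under assumptions: the template embedded in comparison.json["next_step_hint"] remains the source of truth, and nesting the schedule under `tqt_optimization` with an `enabled` flag is a guess at the setting.json layout.

```python
import copy

# TQT default schedule, copied from the JSON above.
TQT_DEFAULT = {
    "lr": 1e-5, "steps": 500, "block_size": 4, "is_scale_trainable": True,
    "gamma": 0.0, "int_lambda": 0.0, "collecting_device": "cuda",
}

def phase2_settings():
    """Yield (iteration_id, setting) for the three calib x TQT(default) legs."""
    for i, calib in enumerate(("kl", "mse", "percentile"), start=1):
        yield f"iter_{i}", {
            "calib_algorithm": calib,
            # Assumed layout: schedule nested under tqt_optimization.
            "tqt_optimization": {"enabled": True, **copy.deepcopy(TQT_DEFAULT)},
        }

for iteration_id, setting in phase2_settings():
    print(iteration_id, setting["calib_algorithm"])
```

Only the calibration algorithm varies between legs, which is what makes the TQT-on-calibration delta readable against iter_0.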

Phase 3 — Residual fixes from best-so-far

Lever	Tier	On-device cost	What changes	When to use
3a-1	A	0	TQT `steps: 500 → 1000` (lr=1e-5, block_size=4 unchanged)	Phase-2 winner is TQT-based and gap to target is non-trivial. One knob only — Composition discipline #2.
3a-2	A	0	TQT `lr: 1e-5 → 5e-5, steps: 1000 → 2000` (block_size=4 unchanged)	3a-1 already gave a positive net effect. Do NOT push beyond this on the lr/steps axis (lr=1e-4 / steps=4000 stably regress on representative reproducers).
3a-3	A	0	TQT `block_size: 4 → 2` (lr/steps from best-so-far unchanged)	CONDITIONAL — enter only on one of these two signals: (1) unstable fallback — last iter was 3a-1 or 3a-2, regressed by < 0.5% relative AND introduced a new layer into the top-5 error list (TQT joint training perturbed a previously-quiet layer); or (2) gap-shrink after convergence — 3a-1/3a-2 both improved on best AND none of R5/R8/R3 structural triggers match in best's `layer_stats.json` / `non_computing_hot_ops.json`. Smaller block_size = closer to layerwise = more stable. Do not try block_size=1 (full layerwise, no upside) or block_size≥6 (overlaps lever 3g, unstable).
3b	A	0	`bias_correct.enabled=true`	A top-error op's output row shows `
3c	A	0	`fusion_alignment.align_elementwise_to = 'Align to Large'` (and friends)	R8 trigger fires on best's `non_computing_hot_ops.json`: a Concat/Add/Sub/Mul/Resize/AveragePool entry whose `max_snr ∈ (0.20, 0.30]` (primary Goldilocks band), OR `inputs_float_std_ratio > 5` (legacy reinforcement, fires outside the band too). Skipped above the band (`max_snr > 0.30` — residual too severe; use 3a-3/3d/3e), below the band (too little to fix), or via top-3 activation veto (`Relu/Swish/Sigmoid` `max_snr > 1.2× candidate` — activation-dominated, fix via TQT/int16). Constants live in `compare_iterations.py`; see references/decision_playbook.md §R8 for the empirical calibration.
3d	A	0	enable `equalization` (full lever-3d template; do not abbreviate to `enabled=true` only — esp-ppq defaults `opt_level=1` while the template recommends `opt_level=2`, see Common pitfalls)	Conv→activation→Conv chain with weight per-channel `max/mean > 5`. Per-tensor weight targets (`esp32s3 / c`) are the canonical use case; on `esp32p4` the pass is warn-only — esp-ppq officially "Not recommend" for per-channel weights but it can empirically help on some MobileNet-family / depthwise-separable nets.
3e	B	+	`dispatching_table` int16 on top 1-3 worst layers	One layer's SNR > 2× median of the top-5; structural fixes failed. ≤10% of total ops. Permanent on-device runtime cost (~2× cycles + ~2× activation memory on promoted ops).
3f	B	+	`weight_split` on a single Conv with weight kurtosis > 10	Equalization didn't fix it (or wasn't applicable on esp32p4). ≤3 split layers. Permanent on-device runtime cost (one extra Add op per split layer).
3g	C	0	`blockwise_reconstruction` (last resort, stacked on top of best)	Tier A + Tier B all plateaued and gap > 5% absolute. GPU strongly recommended. The lever template no longer disables TQT — the engine runs `TrainedQuantizationThresholdPass` and then `AdaroundPass` sequentially (see `esp-ppq/esp_ppq/quantization/quantizer/base.py`), so the two passes coexist in the pipeline. PC quantization time roughly doubles vs the prior best, but accuracy attribution stays clean (blockwise is the only new variable). LSQ × {TQT, blockwise} remains hard-rejected by `apply_setting._check_mutex` because LSQ silently degenerates on POWER_OF_2 targets.
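The lever 3c (R8) trigger in the table can be read as the predicate below. This is one interpretation of the prose, not the code in compare_iterations.py: the constant names here are illustrative, and applying the activation veto to both the band path and the std-ratio path is an assumption.

```python
R8_BAND = (0.20, 0.30)          # primary band: lower bound exclusive, upper inclusive
R8_STD_RATIO_REINFORCE = 5.0    # legacy reinforcement threshold
R8_ACTIVATION_VETO_RATIO = 1.2  # veto when an activation dominates the candidate

def r8_fires(candidate_snr, std_ratio, top_activation_snrs):
    """Decide whether lever 3c is worth trying for one non-computing op."""
    if any(a > R8_ACTIVATION_VETO_RATIO * candidate_snr for a in top_activation_snrs):
        return False  # activation-dominated: the table points to TQT / int16
    in_band = R8_BAND[0] < candidate_snr <= R8_BAND[1]
    return in_band or std_ratio > R8_STD_RATIO_REINFORCE
```

Here `candidate_snr` is the entry's max_snr, `std_ratio` its inputs_float_std_ratio, and `top_activation_snrs` the max_snr values of the top-3 activation ops.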

Stop condition	Routes to	Why
`primary_metric` reached `target_metric`	phase-4-final-report	Target met; further iterations would be wasted effort.
Plateau: last 3 iterations all within 0.1% relative of best	phase-4-final-report	Accuracy stopped moving; Phase 5 has the same problem. The window is 3 (not 2) because real iteration histories often have a single sub-0.1% wobble in the middle of an otherwise-improving run; requiring 3 consecutive flat iterations rules out that false-positive.
`_PHASE3_CAP` (= 5) Phase-3 iterations run after Phase 2 AND target NOT reached	phase-5-agent-driven	The linear search ran out of cap-budget but the metric is still moving — let the agent explore. The unfilled tail of the linear list shows up in `phase5_signals.untried_phase3_levers`.
All linear-order Phase-3 levers (3a-1, 3a-2, 3b, 3c, 3d, 3e, 3f, 3g) tried or correctly skipped AND target NOT reached	phase-5-agent-driven	Same idea — the prescribed list is exhausted, but the model has more accuracy to give.
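The routing in the stop-condition table can be sketched as follows. Assumptions: a higher primary metric is better, the history is non-empty, and the label `phase-3` for "keep going" is a placeholder of this example rather than a value the state machine emits.

```python
PHASE3_CAP = 5
PLATEAU_WINDOW = 3
PLATEAU_REL_TOL = 0.001  # 0.1% relative

def is_plateau(history, best):
    """True when the last PLATEAU_WINDOW values all sit within tolerance of best."""
    if len(history) < PLATEAU_WINDOW or best == 0:
        return False
    return all(abs(v - best) / abs(best) <= PLATEAU_REL_TOL
               for v in history[-PLATEAU_WINDOW:])

def route(history, target, phase3_runs, levers_exhausted):
    """Pick the next phase from the primary-metric history."""
    best = max(history)
    if target is not None and best >= target:
        return "phase-4-final-report"
    if is_plateau(history, best):
        return "phase-4-final-report"
    if phase3_runs >= PHASE3_CAP or levers_exhausted:
        return "phase-5-agent-driven"
    return "phase-3"
```

Checking the target and the plateau before the cap mirrors the table: both finalize conditions take priority over handing control to Phase 5.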

Phase 5 — Agent-driven exploration

A multi-lever stack that ends with bias_correct (the example_quantize_mobilenetv2_esp32p4/outputs/ winning configuration, iter_14) is unreachable from any single Phase-3 lever applied to any single Phase-2 leg. Worse, Composition discipline #4 (calibration is not separable from the training pass) means even Phase 2 cannot tell you whether percentile becomes the best calibration once equalization + int16 + bias are stacked on top of it.

In Phase 5 the state machine yields and the agent drives. The contract:

The hint in comparison.json["next_step_hint"]["advice"] is meta-guidance, not a setting.json template. There is no iteration_id skeleton to fill in — you write the next setting.json from scratch.
The hint is paired with a structured comparison.json["next_step_hint"]["phase5_signals"] block that summarises history: which iterations improved over their prior best (and by how much), which regressed (so you don't re-stack their changes), what calibration algorithms haven't been tried on top of the current lever stack, and pointers to the on-disk artifacts to consult before proposing the next change.
Each Phase-5 iteration must still mutate from best-so-far (Composition discipline #1) and stop escalating after one regression (#3). Discipline #2 (one knob per iteration) is relaxed — see Composition discipline #5 below.

Inspiration patterns (these are starting points; let the actual data decide which one applies on your model):

STACK improving levers. When phase5_signals.improving_levers lists ≥2 entries (e.g. iter_5 enabled tqt_optimization, iter_9 enabled equalization), the first natural Phase-5 iteration is one that turns BOTH on at once. The mobilenetv2-p4 path went iter_11 (TQT + equalization + int16x3) → iter_13 (added calibration swap, +0.55%) → iter_14 (added bias_correct, +0.05%, final best).
CROSS-POLLINATE CALIB. When phase5_signals.untried_calib_swaps is non-empty, run a single iteration that swaps the calibration on top of the current lever stack. The Phase-2 cartesian product evaluates calib × TQT(default) in isolation; once 3-4 levers are stacked the ranking can flip — exactly what happened on mobilenetv2-p4 when percentile, which had previously been considered a calibration loser, became the winner once stacked with TQT + equalization + int16.
ABLATE. Once Phase 5 finds a new best, drop one component at a time and check whether accuracy stays above target. This produces a "minimal recipe": fewer passes, fewer surprises, shorter PC quantization time. Two ablation directions are particularly useful — drop the highest-cost component (e.g. one int16 op, or weight_split) for on-device speed; drop a Tier-A pass (TQT off, equalization off) to test which passes are actually load-bearing.
DIVE INTO ARTIFACTS. When the above three don't produce an obvious next move, open best's layerwise_error.json, layer_stats_full.json, non_computing_hot_ops.json, and graphwise_jumps.json. Pick a layer with a concrete distribution observation (e.g. "Conv layer X has Float Std skew >5 but the weight per-channel max/mean is only 2.3 — equalization won't help; this is a high-variance activation, try TQT escalation or int16 on this single op") and write the next iteration around that observation. Cite the file + the number in rationale.

Stop signals (each → finalize). Two are auto, two are agent-driven:

1. primary_metric reached target_metric: compare_iterations.py AUTO-finalizes via phase-4-final-report. final_report.md records Stop reason category: target_reached.
2. Plateau: last 3 iterations all within 0.1% relative of best. compare_iterations.py AUTO-finalizes via phase-4-final-report. final_report.md records Stop reason category: plateau plus the three plateau values.
3. User-given iteration budget reached: the agent runs --finalize --force-finalize NOW, regardless of phase and regardless of remaining untried patterns/levers. User budget is the hard ceiling. --force-finalize is the explicit opt-in that confirms "this early stop is intentional"; final_report.md records Stop reason category: force_finalize_phase5 plus the untried lists so the user can see what was skipped.
4. Coverage-exhausted "ran out of ideas": STRICT. Only fires when ALL the following hold simultaneously:
- the user did NOT give a specific iteration budget;
- phase5_signals.untried_phase5_patterns is empty (every pattern attempted at least once);
- phase5_signals.untried_phase3_levers is empty (every linear-order Phase-3 lever either tried or correctly skipped by its trigger);
- phase5_signals.untried_5beta_reapply is empty (every calib swap re-tested on the current deepest stack — see "5β-reapply" below);
- the most recent iterations did not produce a new best.
If signal (3) is in play, signal (4) is disabled. Keep iterating until the user budget is met, drawing fresh variations from the untried lists. Phase 5 has NO hard iteration cap from the state machine; the user is the cap.
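The strict conditions for stop signal (4) reduce to a single predicate. A sketch, assuming `signals` is the phase5_signals block loaded as a dict and that `recent_new_best` is computed by the caller; the function name is illustrative.

```python
UNTRIED_KEYS = (
    "untried_phase5_patterns",
    "untried_phase3_levers",
    "untried_5beta_reapply",
)

def coverage_exhausted(signals, user_budget, recent_new_best):
    """Stop signal (4): every condition must hold at the same time."""
    if user_budget is not None:
        return False  # signal (3) in play disables signal (4)
    nothing_left = all(len(signals[key]) == 0 for key in UNTRIED_KEYS)
    return nothing_left and not recent_new_best
```

Indexing with `signals[key]` rather than `.get` is deliberate: a missing list raises instead of being mistaken for full coverage.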

The comparison.json["early_finalize_command"] field always contains the one-line --finalize invocation; the stdout "Tip" block reprints it after every run.

User-budget enforcement in Phase 5

Phase 5's "no hard cap" property means the state machine will keep emitting hints forever if you let it. The user budget is what bounds the loop. Concretely:

Track the user-budget count in your head (or in scratchpad). Increment after each iteration completes.
After the N-th iteration where N == user budget, run the --finalize command and stop.
Never invoke stop signal (4) when a user budget is in play — even if all patterns and levers are covered. Spend the remaining budget on variations of attempted patterns: re-stack improving levers in different combinations, try the same pattern on a different layer subset, ablate a different component, dive into a different artifact than last time. Variation under user budget is REQUIRED; fabricating defensible variations is part of the Phase 5 contract.
The mechanism that prevents the wrong call here is rationale citation discipline #5: every iteration must name the iter id(s) whose data motivated the change. If you cannot name any prior iter that motivates the next change AND the user budget remains, look at a different artifact, find a number you can name, and use that as your rationale — do not finalize.

Why `_PHASE3_CAP=5` leaves untried linear-order levers (by design)

Pro: Phase 5 can start exploring sooner (cross-pollination + ablation + stacking are higher-leverage moves than the tail of the linear list in many real cases — see the mobilenetv2-p4 iter_13 +0.55% jump).
Pro: when target_metric is hit early in Phase 3, the cap is irrelevant — the short-circuit fires first.
Con: levers near the end of the linear order (typically 3d / 3e / 3f / 3g) are often left untried when the cap fires. The 3a-3 conditional path can also occupy a cap slot, making this worse.

The "Pretending..." sentence is gone — replaced by the hard rule that user budget trumps signal (4).
The hint now surfaces untried_phase3_levers=[3f, 3g] and untried_phase5_patterns=[5gamma, 5delta] as named targets.
Premature --finalize in phase-5 (target not met, no plateau) is HARD REJECTED: compare_iterations.py refuses to write outputs/best/ / outputs/final_report.md and exits with code 1. The agent must either continue iterating or explicitly pass --force-finalize. The earlier soft-warning version of this guardrail was a load-bearing failure mode that produced both this 12-of-20 run and the later 18-of-30 / 21-of-30 runs in example_quantize_mobilenetv2_esp32p4_tmp/outputs/.

These three together make the iteration-budget mismatch self-correcting.

How premature finalize is prevented

The hard-reject contract for --finalize is the operational enforcement of the user-budget rule in stop signal (3). Concretely:

The agent (or user) runs compare_iterations.py --finalize while phase == phase-5-agent-driven, and the script determines that target_metric is NOT reached and the recent iterations are NOT a plateau.
compare_iterations.py prints the rejection block, listing untried_phase5_patterns, untried_phase3_levers, and untried_5beta_reapply.
comparison.json is still written (so the agent can re-read the hint), but outputs/best/ and outputs/final_report.md are NOT touched.
The script exits with code 1.
The agent picks ONE of:
- re-run without --finalize and continue iterating (use the printed untried lists as concrete next targets); OR
- pass --force-finalize alongside --finalize to confirm intentional early stop. The resulting final_report.md has a ## Stop reason section with category force_finalize_phase5 plus the untried lists so the user can see exactly what was skipped.
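The decision in the steps above can be sketched as a pure function. It models only the outcome (exit code and whether outputs/best/ and final_report.md are written), not the actual code in compare_iterations.py; the function name is hypothetical.

```python
def finalize_gate(phase, target_reached, plateau, force_finalize):
    """Return (exit_code, writes_outputs) for a --finalize request."""
    premature = (
        phase == "phase-5-agent-driven"
        and not target_reached
        and not plateau
    )
    if premature and not force_finalize:
        return 1, False  # rejected: only comparison.json is refreshed
    return 0, True       # outputs/best/ and final_report.md are written
```

For example, `finalize_gate("phase-5-agent-driven", False, False, False)` returns `(1, False)`, while adding `force_finalize=True` returns `(0, True)`.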

Phase 4 — Final report

Two ways to enter Phase 4:

State-machine trigger (machine view): comparison.json["next_step_hint"]["phase"] == "phase-4-final-report". The state machine emits this when the target is reached or the metric plateaus. Hitting the Phase-3 cap or exhausting the linear-order levers routes to Phase 5 instead (see Phase 3's stop-condition list).
User-budget trigger (human view): the user gave the agent a specific iteration budget — phrasings like "iterate 3 times", "迭代 3 轮", "只跑 N 轮", "iterate N times", "最多 N 轮", "only N iterations". When this budget is hit, the user-budget trigger always wins even if the state machine still wants to keep going.

Why the auto-finalize is bullet-proof. compare_iterations.py writes outputs/best/ and outputs/final_report.md whenever either trigger fires:

Automatic when phase == "phase-4-final-report" — every invocation of compare_iterations.py --output-dir <outputs> checks this and finalizes if true. Agents reading the script's stdout will see a [compare] phase-4 detected; finalize results: ... block.
On demand via the --finalize flag at any time, regardless of phase. In Phase 5 with the target unmet and no plateau, --finalize must be paired with --force-finalize or it is rejected (see "How premature finalize is prevented"). This is the escape hatch for the user-budget case: agents should copy the command from comparison.json["early_finalize_command"] (or the printed "Tip: how to wrap up at any time" block at the bottom of compare_iterations.py's stdout) and run it after the last user-budgeted iteration completes.

Iteration history table — new columns. The auto-generated table now includes:

rank: dense ranking by primary_value (1 = best). Recomputed from disk on every finalize, so adding more iterations later won't reverse the relative order of any pre-existing pair (this is asserted by the unit tests). The column is visible in the report and in comparison.json["iteration_ranks"].
affects inference speed — "No" for almost all settings; "Yes (...)" only when the iteration enables dispatching_table int16 promotion or weight_split (the only two passes with permanent on-device runtime cost; see "On-device runtime cost cheat-sheet"). When the best iteration has affects inference speed = Yes, inspect the rank-2/3 rows — if they trade < 0.1% accuracy for affects inference speed = No, the user may prefer the runner-up for production deployment.
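Both columns can be illustrated in a few lines. The sketch assumes a higher primary_value is better and a non-zero best value, and it reads "< 0.1%" as relative to the best; the source does not say whether the figure is relative or absolute. Function names are illustrative, and unlike the rank-2/3 guidance above, the runner-up search is not limited to those two rows.

```python
def dense_ranks(values):
    """Map iteration name -> dense rank (1 = best, ties share a rank)."""
    order = sorted(set(values.values()), reverse=True)
    rank_of = {v: i for i, v in enumerate(order, start=1)}
    return {name: rank_of[v] for name, v in values.items()}

def cheaper_runner_up(values, affects_speed, max_rel_loss=0.001):
    """Best speed-neutral iteration within max_rel_loss of the best, else None."""
    best_name = max(values, key=values.get)
    if not affects_speed[best_name]:
        return None  # the best already has no on-device cost
    best = values[best_name]
    candidates = [
        name for name, v in values.items()
        if not affects_speed[name] and (best - v) / abs(best) < max_rel_loss
    ]
    return max(candidates, key=values.get) if candidates else None
```

Dense ranking depends only on the set of distinct values, which is why adding iterations later can shift rank numbers but never swaps the order of two existing rows.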

Recommended agent workflow after finalize:

Read outputs/final_report.md. The Summary, Iteration history (with rank + speed columns), Best setting, Python snippet are auto-generated; expand ## Key findings and ## Remaining gap with concrete bullets if the model warrants.
Run a single full-eval re-check to replace the iteration loop's evaluate_fast() number with the user's real evaluate(): python "$SKILL_DIR/scripts/run_iteration.py" --user-quant <...> --setting outputs/best/setting.json --output-dir outputs/iter_<NEW> --use-full-eval. If the resulting <primary_metric> differs from what's in the Summary, update the Summary line in final_report.md. The marker line keeps the file refreshable; once you remove the marker, subsequent automated runs won't clobber your edits unless --force is passed.
If you ever need to regenerate the report from scratch (e.g. after fixing a bug in an iteration), run: python "$SKILL_DIR/scripts/compare_iterations.py" --output-dir <outputs> --finalize --force.

Legacy fallback. The pre-auto-finalize workflow (manually run --write-best, hand-write outputs/final_report.md) is still supported for completeness — if for any reason compare_iterations.py does not emit the artifacts (e.g. broken iteration data on disk), the agent can fall back to that flow. The --write-best flag now writes only outputs/best/; the report remains the agent's responsibility in that fallback.

Final-report template (auto-emitted by the script, for reference / audit):

<!-- auto-generated marker line — agents may edit Key findings / Remaining gap -->
# Final Report: <model> on <target>

## Summary
- Best iteration: iter_<N>
- <primary_metric>: <value> (target_metric=<target or "not set">)
- _Note: value comes from evaluate_fast(); run --use-full-eval to refresh._
- On-device speed cost vs baseline (best): <No | Yes (...)>
- Other metrics: <copy from outputs/best/metrics.json>

## Iteration history
| iter | method changed | <primary_metric> | delta | outcome | rank | affects inference speed |

## Best setting
<inline the FULL outputs/best/setting.json>

## Python snippet
<auto-translated QuantizationSettingFactory.espdl_setting() recipe>

## Key findings
<auto-bullets — agents extend with concrete distribution interpretations>

## Remaining gap (if target not met)
<auto-bullets — agents replace boilerplate with model-specific recommendations>

Composition discipline (read before every iteration)

These rules govern the iteration loop. Violating any of them = current iteration is discarded, agent rolls back to best-so-far and re-runs.

1. Mutate from best-so-far, not the last iteration. Always start the next setting.json from comparison.json["best_iteration"]["dir"]/setting.json. The most recent run can be a regression you should not inherit.
2. One new method (or one parameter change) per iteration (Phases 1-3 only). Calibration algorithm and tqt_optimization with the default schedule are treated as the conjoined Phase 2 base: they enter and leave together inside Phase 2. Outside Phase 2 and inside Phase 3, change exactly one knob. If two changes are stacked and the iteration regresses, you can't tell which one hurt. (Phase 5 relaxes this rule; see #5.)
3. Stop escalating after one regression. If iter_N raises a TQT hyper-parameter (or tightens a lever) and the metric drops, do not push further on that axis; pivot to a different lever from the Phase 3 table.
4. Never retire a calibration algorithm based on its calib-only score. Calibration is the input distribution shaper for downstream passes (especially TQT in esp-dl, where POWER_OF_2 makes TQT the only available training-based pass). Calib-only accuracy does not predict combined accuracy: percentile may regress standalone but become the strongest TQT base, because tail-clipping leaves more "training space" for TQT to recover. Phase 2 must always evaluate calibration with calib × TQT(default) cartesian product, never with calib-only ranking. The same principle applies in Phase 5: calibration ranked against the Phase-3 lever stack can flip vs the Phase-2 cartesian ranking, so re-test untried calibrations after the lever stack settles. See Worked example below for the concrete reproducer.
5. Phase 5 multi-knob changes are allowed iff the rationale cites historical evidence. The "one knob per iteration" rule (#2) is relaxed in Phase 5 (an iteration may stack 2+ passes), but only when each pass change names the specific iteration whose data motivates it (e.g. "iter_5 showed lever 3a-3 stabilised the perturbed layer; iter_9 showed equalization improved bottleneck Conv chains; combining them tests whether the gains compound"). Without citations, a multi-knob change is a guess and must be split into single-knob steps as in Phase 3. The reason: in Phase 3 the linear search makes attribution clear; in Phase 5 the only mechanism keeping attribution honest is the rationale itself.
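The first two rules (start from best-so-far, change one knob) can be sketched as a small helper. The dotted-key addressing, the helper name, and placing `rationale` as a top-level key of setting.json are assumptions made for illustration.

```python
import copy

def mutate_from_best(best_setting, dotted_key, value, rationale):
    """Copy best-so-far and change exactly one knob addressed by a dotted key."""
    setting = copy.deepcopy(best_setting)
    node = setting
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node.setdefault(part, {})  # create intermediate dicts if absent
    node[leaf] = value
    setting["rationale"] = rationale
    return setting

best = {"calib_algorithm": "percentile", "tqt_optimization": {"steps": 500}}
nxt = mutate_from_best(best, "tqt_optimization.steps", 1000,
                       "lever 3a-1: steps 500 -> 1000 on the Phase-2 winner")
print(nxt["tqt_optimization"]["steps"], best["tqt_optimization"]["steps"])
```

The deep copy keeps the best-so-far setting untouched, so a regression in the new iteration never leaks back into the baseline being mutated.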

Operating principles

Always look at distributions before changing settings

esp-ppq three-function coverage table

esp-ppq exposes three error analysers; each has a distinct scope and they are not interchangeable. The harness invokes all three every iteration; the playbook combines them.

Function	Scope	What is filtered out	SNR semantics	Output artifact
`layerwise_error_analyse`	`is_computing_op` only (Conv/Gemm/ConvTranspose/MatMul/Attention/PPQBiasFusedMatMul/LSTM)	Everything else, including all activations / Concat / Add / Resize / Pool / Sigmoid / Softmax	Isolated — quantize this op only, leave the rest in FP, measure the final-output SNR delta.	`layerwise_error.json`
`graphwise_error_analyse`	Same `is_computing_op` set	Same	Cumulative — quantize everything up to and including this op, leave downstream in FP, measure the final-output SNR. Larger than layerwise; difference reveals interactions.	`graphwise_error.json`
`statistical_analyse`	All `QuantableOperation` minus `PASSIVE_OPERATIONS` (`MaxPool / Slice / Reshape / Transpose / Identity / Squeeze / Unsqueeze / Cast / GatherND / Scatter / Pad / Tile / NonMaxSuppression / RoiAlign / TopK / Resize / Split / Flatten / DepthToSpace / SpaceToDepth`)	Passive ops only — includes Add / Concat / AveragePool / Sigmoid / Softmax / GRU / LayerNorm / Resize and many other non-COMPUTING_OP types	Per-variable cumulative SNR — distribution stats (Float Mean/Std/Skew/Kurt, Quant counterparts, Noise Mean/Std, Noise:Signal ratio) on every input/output tensor. SNR is computed against the FP reference at that variable, not at final output.	`layer_stats_full.json`, plus filtered `layer_stats.json` and aggregated `non_computing_hot_ops.json`

non_computing_hot_ops.json ranks non-COMPUTING_OP layers by their max per-variable SNR (from statistical_analyse) and adds inputs_float_std_ratio (max/min Float Std across input variables) for direct R8 trigger detection.
graphwise_jumps.json computes graphwise[op_next] − graphwise[op_prev] − layerwise[op_next] for every adjacent pair of computing ops in the simplified graph. When this excess is > 0.02 (overridable via QUANT_CONFIG["graphwise_intervening_excess_threshold"]), the non-computing ops sitting between op_prev and op_next are flagged as suspected culprits.

The decision_playbook reads all three artifacts before proposing any Phase-3 lever.
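The graphwise_jumps.json excess formula above is simple enough to restate in code. A sketch assuming `graphwise` and `layerwise` are {op_name: value} dicts and `pairs` lists adjacent computing ops in graph order; the function name is illustrative, and collecting the intervening non-computing ops is left out.

```python
def graphwise_jumps(pairs, graphwise, layerwise, threshold=0.02):
    """Return (op_prev, op_next, excess) for adjacent pairs whose excess > threshold."""
    flagged = []
    for op_prev, op_next in pairs:
        # Cumulative gap not explained by op_next's isolated contribution.
        excess = graphwise[op_next] - graphwise[op_prev] - layerwise[op_next]
        if excess > threshold:
            flagged.append((op_prev, op_next, excess))
    return flagged

graphwise = {"conv1": 0.01, "conv2": 0.08, "conv3": 0.09}
layerwise = {"conv1": 0.01, "conv2": 0.02, "conv3": 0.01}
pairs = [("conv1", "conv2"), ("conv2", "conv3")]
print(graphwise_jumps(pairs, graphwise, layerwise))
```

With these made-up numbers only the conv1 to conv2 pair is flagged (excess about 0.05), pointing at whatever non-computing ops sit between the two.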

Accuracy-first method ordering (with a soft penalty for on-device runtime cost)

Tier A — High accuracy, no on-device runtime cost (the Phase 2 base + most Phase 3 levers):

TQT (tqt_optimization) — POWER_OF_2-native gradient training of log2_scale. The strongest single pass for esp-dl POWER_OF_2 targets when calibration alone plateaus. The Phase 2 cartesian product uses the conservative default schedule (lr=1e-5, steps=500, block_size=4); Phase 3 escalates in three stages: 3a-1 (steps 500→1000), 3a-2 (steps 1000→2000, lr 1e-5→5e-5), 3a-3 (CONDITIONAL — block_size 4→2 for stability when 3a-1/3a-2 perturb a quiet layer or all other levers don't apply).
Calibration algorithm (calib_algorithm) — {kl, mse, percentile} are the three options Phase 2 sweeps (each combined with TQT(default)). The interplay between calibration and TQT is non-monotonic, see Composition discipline #4.
Bias correction — fixes systematic mean-shift error baked into Conv/Gemm bias.
Equalization — large gains on Conv→Conv chains (depthwise/pointwise pairs in MobileNet, bottleneck residuals in ResNet, inverted residuals in MobileNet-V2/V3, etc.). On per-tensor weight targets (esp32s3 / c) the canonical Tier A pass; on esp32p4 per-channel weight target it is warn-only (esp-ppq officially "Not recommend" for per-channel; empirically can still help on some networks — see references/ppq_methods.md §1 for the full rationale).
Fusion alignment (fusion_setting) — surgical fix for Concat/Add/Resize layers with mismatched input ranges.
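
The TQT escalation stages listed above can be written down as data. The sketch below is an assumption-laden illustration: `TqtSchedule` and `escalate` are hypothetical names, not part of esp-ppq or the harness, and the real setting.json layout is defined in references/setting_json_schema.md. The lr / steps / block_size values are the ones quoted in the TQT entry.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TqtSchedule:
    """Hypothetical container for the three TQT knobs named in the text."""
    lr: float = 1e-5      # Phase 2 conservative default
    steps: int = 500
    block_size: int = 4


def escalate(schedule: TqtSchedule, stage: str) -> TqtSchedule:
    """Apply one Phase-3 TQT escalation stage to an existing schedule."""
    if stage == "3a-1":
        return replace(schedule, steps=1000)           # steps 500 -> 1000
    if stage == "3a-2":
        return replace(schedule, steps=2000, lr=5e-5)  # steps -> 2000, lr -> 5e-5
    if stage == "3a-3":
        # Conditional stage: only the block size changes, for stability.
        return replace(schedule, block_size=2)
    raise ValueError(f"unknown TQT stage: {stage}")


default = TqtSchedule()
stage_1 = escalate(default, "3a-1")
stage_2 = escalate(stage_1, "3a-2")
```

Keeping each stage as a single-field delta on the previous schedule mirrors the one-knob-at-a-time discipline the skill asks for.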

Tier B — High accuracy, but slows on-device inference (apply surgically; gain must justify the slowdown — typically only on the worst 1-3 layers):

Mixed precision via dispatching_table (→ int16) — promote the very worst layer(s) from int8 to int16. Cap at ≤10% of total ops, ideally 1-3.
Weight split (weight_split) — only on layers whose weight outliers couldn't be handled by equalization. Inserts an extra Add op per split layer at runtime.
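
The int16 promotion cap can be sketched as a small selection helper. This is a hedged illustration only: `pick_int16_candidates` and its per-layer error scores are hypothetical, and the actual promotion is expressed through `dispatching_table` in setting.json. The "≤10% of total ops, ideally 1-3" limits are taken from the Tier B entry above.

```python
import math
from typing import Dict, List


def pick_int16_candidates(
    layer_error: Dict[str, float],
    total_ops: int,
    max_layers: int = 3,
    cap_fraction: float = 0.10,
) -> List[str]:
    """Pick the worst layers for int8 -> int16 promotion under both caps.

    layer_error maps layer name -> error score (higher is worse); which
    metric feeds it is left open here.
    """
    budget = min(max_layers, math.floor(total_ops * cap_fraction))
    ranked = sorted(layer_error, key=layer_error.get, reverse=True)
    return ranked[:max(budget, 0)]


# Made-up scores for illustration.
errors = {"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.01}
small_graph = pick_int16_candidates(errors, total_ops=25)   # 10% cap allows 2
large_graph = pick_int16_candidates(errors, total_ops=100)  # capped at 3 layers
```

On a 25-op graph the 10% cap is the binding limit; on a 100-op graph the "ideally 1-3" limit takes over.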

Tier C — Last resort (when Tier A+B plateau and the user accepts long offline training):

Blockwise reconstruction (blockwise_reconstruction) — biggest hammer. Scales are frozen by default (POWER_OF_2-safe); only weights move. GPU strongly recommended; CPU runs are hours per iteration. Coexists with TQT in the esp-ppq pipeline — the engine runs TrainedQuantizationThresholdPass and then AdaroundPass sequentially, so the two passes do not conflict. Combining doubles PC quantization time but improves accuracy attribution when used in 3g (blockwise is the only new variable on top of best). LSQ × {TQT, blockwise} remains hard-rejected (LSQ silently degenerates on POWER_OF_2 targets).

Tier D — Auto-disabled on POWER_OF_2 targets:

LSQ (lsq_optimization) — esp-ppq's LSQDelegator disables scale training under POWER_OF_2, so LSQ would silently degenerate to weight-only tuning while paying TQT-level PC time. The harness in scripts/apply_setting.py detects this conflict and disables LSQ with a clear warning. Use TQT instead.

See references/ppq_methods.md §11 (compatibility matrix) and §12 (target-policy compatibility) for the full method-vs-method and method-vs-target compatibility rules.

On-device runtime cost cheat-sheet

| Pass / Lever | On-device cost | Why |
| --- | --- | --- |
| `calib_algorithm` (kl/mse/percentile/minmax/isotone) | 0 | Only changes scale values, not the runtime graph. |
| `tqt_optimization` (any 3a-1/3a-2/3a-3 schedule) | 0 | Trains `log2_scale` offline; output is plain POWER_OF_2 scales. |
| `bias_correct` | 0 | Adjusts the bias tensor in-place; same op count, same memory. |
| `fusion_alignment` | 0 | Forces input scales to align; emits identical kernels. |
| `equalization` | 0 | Folds a diagonal scale into adjacent weights; runtime graph unchanged. |
| `blockwise_reconstruction` | 0 | Same as TQT: output is just adjusted weights/scales. PC-time is huge though. |
| `lsq_optimization` | 0 if it ran | Auto-disabled on POWER_OF_2; not a real option here. |
| `dispatching_table` (→ int16 promotion) | + | Each promoted op pays ~2× cycles + ~2× activation memory permanently. |
| `weight_split` | + | Inserts an extra `Add` op per split layer (and the underlying Conv loses its bias-fusion). |
| Mixed-precision via per-op rules in `dispatching_table` (FP fallback) | ++ | Any op that can't run native int8/16 falls back to esp-dl FP path; multi-x latency hit. |
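
The cheat-sheet can also be held as a lookup when scripting around the harness. The sketch below is an assumption, not harness code: the dictionary keys `dispatching_table_int16` and `dispatching_table_fp_fallback` are invented labels for the two `dispatching_table` rows, and encoding "+" as 1 and "++" as 2 is an arbitrary choice made for this example.

```python
from typing import Iterable

# 0 = no on-device cost, 1 = "+", 2 = "++" (encoding chosen for this sketch).
ON_DEVICE_COST = {
    "calib_algorithm": 0,
    "tqt_optimization": 0,
    "bias_correct": 0,
    "fusion_alignment": 0,
    "equalization": 0,
    "blockwise_reconstruction": 0,   # huge PC time, but no runtime cost
    "lsq_optimization": 0,           # auto-disabled on POWER_OF_2 anyway
    "dispatching_table_int16": 1,    # hypothetical label for int16 promotion
    "weight_split": 1,
    "dispatching_table_fp_fallback": 2,  # hypothetical label for FP fallback
}


def affects_inference_speed(enabled_levers: Iterable[str]) -> bool:
    """True if any enabled lever has a non-zero on-device cost."""
    return any(ON_DEVICE_COST[name] > 0 for name in enabled_levers)


offline_only = affects_inference_speed(["tqt_optimization", "equalization"])
with_split = affects_inference_speed(["tqt_optimization", "weight_split"])
```

An unknown lever name raises `KeyError` here on purpose, so a typo cannot be silently treated as free.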

Use `evaluate_fast()` during iteration if available

Names in `dispatching_table` must match simplified ONNX

Common pitfalls

Overwriting the user's ONNX: espdl_quantize_onnx saves the simplified ONNX back to the input path. The harness works around this by copying the user's ONNX to a temporary file in outputs/iter_<N>/_input.onnx before calling the API. Don't bypass this.
Stale calibration: the dataloader is rebuilt every iteration — calibration is fast enough that caching across iterations isn't worth the bug surface.
Setting interested_layers=None vs []: in esp-ppq None and [] both mean "all layers" for some passes but only one is accepted by others. The harness normalises this via scripts/apply_setting.py — agents writing JSON should use null or omit the field; never use empty arrays for interested_layers.
TQT/blockwise on CPU: these passes are 10-100x faster on GPU. If device=cpu and the user asks for TQT/blockwise, warn them the iteration will take a long time before submitting.
LSQ on POWER_OF_2 targets: the harness will skip LSQ and warn — don't re-enable it in the next iteration. The accuracy you wanted from LSQ comes from TQT here.
Calling --finalize in Phase 5 without --force-finalize: the script HARD REJECTS this by design (exit code 1, no outputs written) when target_metric is not reached and no plateau is detected. The rejection block surfaces untried_phase5_patterns, untried_phase3_levers, and untried_5beta_reapply so the agent can either keep iterating or, if the user budget really is exhausted, re-invoke with --force-finalize to confirm intentional early stop. The earlier soft-warning version was the load-bearing failure mode behind the 12-of-20, 18-of-30 and 21-of-30 premature-stop runs.
Skipping the Phase 2 cartesian product: this is the #1 search failure on esp-dl targets. Calib-only sweep does not work — see Composition discipline #4 and the Worked example. Always run all three legs (or short-circuit on target_metric).
Two changes per iteration: tempting when one of the changes "looks safe", but it produces confounded experiments. The classic failure mode: enabling percentile and bias_correct together, then blaming percentile when the iteration regresses (see the Worked example). One knob at a time.
Abbreviating Phase-3 lever templates (especially equalization): writing only {"equalization": {"enabled": true}} is not equivalent to enabling lever 3d. The resolved opt_level falls back to esp-ppq's default of 1, which does not cross Add/Sub branches and therefore silently skips every residual / inverted-residual chain (ResNet bottlenecks, MobileNet-V2/V3 inverted residuals, etc.), exactly the layers lever 3d was designed to reach. The skill's lever-3d template recommends opt_level=2. Always copy the Template change snippet from comparison.json["next_step_hint"]["advice"] verbatim and only fill in rationale. The harness now emits a warning when opt_level is unset on an enabled equalization pass; that warning is your signal that the iteration didn't actually test what its rationale claims.
Python environment drift: since the skill runs in the current Python interpreter (no Docker isolation), make sure the user's esp_ppq install matches the esp-dl tag they plan to deploy with.
Skipping --finalize when the user gave an iteration budget: the user says "iterate 3 times" / "迭代 3 轮", agent runs iter_0/1/2, sees comparison.json still pointing at phase-2-calib-tqt-sweep ("run mse next"), and accidentally runs a 4th iteration. User-budget stop > state-machine stop. As soon as the user-budgeted iteration count is reached, run the command in comparison.json["early_finalize_command"] (or copy from the "Tip: how to wrap up at any time" block in stdout) so outputs/best/ and outputs/final_report.md are produced regardless of phase. Especially important in Phase 5, which has no hard iteration cap — the only state-machine stop signals there are target_metric and plateau, so the user's budget is the primary guardrail.
Treating Phase 5 hints like a template: the Phase-5 advice text is meta-guidance, the phase5_signals block is data summary. Neither is a fillable setting.json skeleton. If you find yourself looking for "iteration_id" to paste, you're confusing Phase 3 with Phase 5 — go read phase5_signals and the on-disk error artifacts, then write the next setting.json yourself.
Phase 5 rationale without citations: a multi-knob Phase 5 iteration whose rationale is "trying equalization + bias_correct together" violates Composition discipline #5. Cite the iteration numbers whose data motivates each pass (e.g. "iter_9 showed equalization improved by +0.45 on bottleneck Convs; iter_7 layer_stats showed /features.../Conv output has |Noise Mean|=0.18 > 0.1×Noise Std=0.07 — bias_correct triggers"). Without citations the agent loses attribution and the iteration is a guess wearing a multi-knob disguise.
Bailing on Phase 5 stop signal (4) while user budget remains: this is the failure mode that bit example_quantize_mobilenetv2_bad_esp32s3/outputs/ — the agent ran 12 of 20 requested iterations, tried 3 Phase-5 attempts that regressed, and signed off with "out of ideas". Stop signal (4) is strict: it requires the user to NOT have given a budget AND untried_phase5_patterns, untried_phase3_levers, and untried_5beta_reapply to all be empty. If a user budget is in play, signal (4) is disabled — the agent must keep iterating, drawing from the untried lists or variations of attempted patterns, until the budget is hit. The hard-reject of premature --finalize enforced by compare_iterations.py is the operational guardrail; without --force-finalize, the script literally cannot write the final report while phase=phase-5 + target not met + no plateau.
Confusing on-device cost with PC-time cost: the new "affects inference speed" column refers strictly to deployment-time cost (does the quantized model run slower on the ESP target?). Long PC quantization time (e.g. blockwise reconstruction) does not show up here; see the "On-device runtime cost cheat-sheet" for the full per-pass mapping. When picking between rank=1 (best accuracy) and rank=2 (runner-up), the only reason to defer to the runner-up is if rank=1's "affects inference speed" value is Yes and the accuracy delta is small enough to live with.
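
The rank=1 versus rank=2 rule in the last pitfall can be sketched as a small decision function. Everything here is illustrative: `Candidate`, `choose`, and the `tolerable_delta` threshold are hypothetical, since the text leaves "small enough to live with" to the user.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float                 # metric in percent, higher is better
    affects_inference_speed: bool   # the Yes/No column described above


def choose(rank1: Candidate, rank2: Candidate, tolerable_delta: float) -> Candidate:
    """Defer to the runner-up only when rank1 is slower on device and the
    accuracy given up is within tolerable_delta (a user-chosen value)."""
    delta = rank1.accuracy - rank2.accuracy
    if rank1.affects_inference_speed and delta <= tolerable_delta:
        return rank2
    return rank1


# Made-up numbers for illustration.
best = Candidate("iter_a", accuracy=80.0, affects_inference_speed=True)
runner_up = Candidate("iter_b", accuracy=79.9, affects_inference_speed=False)
picked = choose(best, runner_up, tolerable_delta=0.2)
```

With a 0.2-point tolerance the runner-up wins here; tightening the tolerance to 0.05 would keep the slower but more accurate rank=1 result.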

Worked example: MobileNet-V2 on ESP32-P4

The phenomenon shown here is not specific to MobileNet or to esp32p4. Any "calibration tightly coupled with a training-based pass" scenario in esp-dl quantization can exhibit it: a calibration regresses standalone but rebounds to the strongest base once paired with TQT(default). The numbers below use MobileNet-V2 on ESP32-P4 as a concrete demonstration only.

Reproducer artifacts (from a real run): example_quantize_mobilenetv2_esp32p4/outputs/, particularly outputs/comparison.json and outputs/iter_10/setting.json.

Key empirical observations

iter_0 baseline (kl, no extras): top1 = 71.15%.
iter_2 (kl + TQT(default)): top1 = 71.15% — TQT on kl produced no improvement.
iter_6 (kl + TQT aggressive lr=1e-4 steps=2000): top1 = 71.30% — the best the old skill version could find, after 6 rounds of escalation.
iter_10 (percentile + TQT(default)): top1 = 71.475% — the highest result, found only when the user manually forced this combination.

iter_5 (3a-3 TQT block_size=4→2): +0.30%, became best at 71.45%.
iter_9 (3d equalization opt_level=2): +0.45%, became best at 71.60%.
iter_10/iter_11 (3e int16 promotion on 2→3 worst layers): +0.075%, became best at 71.70%.
End of Phase-3 linear levers. State machine emits phase-5-agent-driven because target_metric=71.878% was still unmet. phase5_signals.improving_levers listed iter_5 (3a-3), iter_9 (3d), iter_10/11 (3e); phase5_signals.untried_calib_swaps listed mse, percentile (best was still on kl).
iter_13 (Phase 5 — calib cross-pollination on top of the lever stack: kl→percentile while keeping TQT + equalization + int16x3): +0.55% to 72.25%. The Phase-2 isolated ranking had said percentile regressed (iter_3 percentile + TQT only was 71.0%); once stacked the verdict flipped.
iter_14 (Phase 5 — stack: added bias_correct on top of iter_13): +0.05% to 72.30%, target met.

Reference index

references/ppq_methods.md — every esp-ppq method, principle, parameters, when to use, what to avoid, plus the target-policy compatibility matrix.
references/decision_playbook.md — distribution pattern to candidate method mapping, plus the Multi-iteration strategy table aligned with the Phases above.
references/contract.md — what the user's user_quant.py must expose.
references/setting_json_schema.md — JSON schema the agent writes each iteration, with Phase 2 / Phase 3 templates.
scripts/run_iteration.py — the harness; one quantize + analyse + evaluate pass.
scripts/apply_setting.py — pure JSON → QuantizationSetting mapping; performs target-policy compatibility checks (LSQ on POWER_OF_2 → auto-disable; esp32p4 equalization → warn-only).
scripts/compare_iterations.py — cross-iteration diff + next-step state machine driving the Phase 1 → 2 → 3 → 5 procedure. Auto-finalises on phase-4 (writes outputs/best/ + outputs/final_report.md); --finalize forces finalize at any time (escape hatch when the user stops early at a fixed iteration budget); --force overrides the marker-based preserve check on hand-edited reports. Phase-5 hints surface a phase5_signals block of historical improving / regressing levers + untried calibrations + top error layers + non-computing hot ops + coverage tracking (phase5_pattern_coverage, untried_phase5_patterns, untried_phase3_levers — the lever ids that _PHASE3_CAP=5 left uncovered, untried_5beta_reapply — calibrations tried earlier on a shallower stack that need re-evaluating on the current best's deeper stack) plus a soft "Tunable parameters" advisory section. Purely data, no template. Coverage helpers _classify_phase5_patterns, _phase5_cutoff_iter_id, _phase5_pattern_coverage, _untried_phase3_levers, _untried_phase5_patterns, _phase5_lever_stack, _phase5_5beta_attempts, _untried_5beta_reapply are the building blocks; the hard-reject of premature --finalize in main() consumes them and refuses to write outputs unless --force-finalize confirms intentional early stop. The finalize_reason recorded in comparison.json and the ## Stop reason section of final_report.md document the category (target_reached / plateau / force_finalize_phase5 / manual_finalize_phase4 / manual_finalize_pre_phase5) and any untried coverage gaps. R8 (lever 3c) trigger constants _R8_MAX_SNR_PRIMARY (lower band) / _R8_MAX_SNR_UPPER (upper band) / _R8_STD_RATIO_REINFORCE / _R8_ACTIVATION_VETO_RATIO live at the top of the script for one-place tuning. Hindsight regression test: tests/hindsight_r8_examples.py parametrises over every example_quantize_*/outputs/ checked in alongside the skill.
assets/user_quant_torch_example.py — copy/edit for torch model flows.
assets/user_quant_onnx_example.py — copy/edit for ONNX model flows.
assets/extra_requirements.txt — extra Python packages the harness needs.
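
The finalize gating described for scripts/compare_iterations.py can be summarised in a short sketch. This is an interpretation, not the script's code: the function, its arguments, and the exact mapping from conditions to `finalize_reason` categories are assumptions inferred from the category names and the hard-reject rule. Only the category strings and the Phase-5 rejection behaviour come from the text.

```python
class FinalizeRejected(Exception):
    """Stands in for the script exiting with code 1 and writing no outputs."""


def finalize_reason(
    phase: str,
    target_reached: bool,
    plateau: bool,
    force_finalize: bool,
) -> str:
    """Return the finalize_reason category, or raise if finalize is premature."""
    if target_reached:
        return "target_reached"
    if plateau:
        return "plateau"
    if phase.startswith("phase-5"):
        # Phase 5 with target unmet and no plateau: hard reject unless forced.
        if force_finalize:
            return "force_finalize_phase5"
        raise FinalizeRejected("target not met and no plateau; use --force-finalize")
    if phase.startswith("phase-4"):
        return "manual_finalize_phase4"
    # Assumed: any earlier phase counts as a pre-Phase-5 manual stop.
    return "manual_finalize_pre_phase5"


early_stop = finalize_reason("phase-2", False, False, False)
forced = finalize_reason("phase-5", False, False, True)
```

Raising an exception in place of a process exit keeps the sketch testable; the real script reports the untried coverage lists alongside the rejection.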
