tune-ci-thresholds

Name: Tune Ci Thresholds
Author: sgl-project

// Run CI tests N times per stage on the H20 CI-reproduction host, produce a per-metric worst-of-N observation report, and (on user confirmation) write the worst-of-N values back into the test files as new baselines. Use when recalibrating CI thresholds after an engine update. Currently supports qwen3-omni-v1 and s2-pro-v1; extensible via models/<name>/config.yaml.

stars:296

forks:131

updated: 2026-05-23 21:24

package.json

"author": "sgl-project"

"repository": "sgl-project/sglang-omni"

tune-ci-thresholds

Scope

This skill is for the H20 CI-reproduction host only (the same image CI uses, frankleeeee/sglang-omni:dev; the container name varies). Numbers from environments that differ meaningfully from CI (different GPU model, different image, different pinned sglang/torch) are not comparable and must not drive threshold changes. If you just want to run the tests locally, use pytest directly — this skill is not for that.

The skill is observation-first: it runs tests N times and produces a worst-of-N report. After the report is shown, it offers a one-shot apply step that writes the worst-of-N values directly into the test files as the new P95 baselines and accuracy / WER thresholds — only if the user explicitly confirms. The skill still does NOT re-run apply_slack separately, generate patch files, or commit / push anything; if the user rejects the apply prompt, the test files stay untouched and the user picks values manually from the report.

Models

Each supported model has a config under models/<name>/:

config.yaml: HF model id, datasets, default venv, test globs, per-test extra env, stage-key naming, and metric_sources (per-test result-JSON paths that tune.py reads to get metric values)
stages.yaml: generated by tune.py --model <name> discover

List what's configured:

python .claude/skills/tune-ci-thresholds/tune.py models-list

Today: qwen3-omni-v1, s2-pro-v1. To add another model, drop in a new models/<name>/config.yaml and run tune.py --model <name> discover. No Python code changes are needed unless the new model emits metrics with a constant-naming convention not covered by match_metric() in tune.py; in that case the matcher has to grow first.

Prerequisites (I verify, I do not create)

Running inside the CI-reproduction container (image frankleeeee/sglang-omni:dev or equivalent). The container name is not checked — rely on the image being correct.
Venv ready; default path comes from the selected model's config.yaml, overridable via --venv-python or $TUNE_VENV_PYTHON
Branch checked out, dependencies installed
Model weights and datasets from the config cached locally. precheck lists each asset as ✓ / ✗ and, on any miss, prints the exact huggingface-cli download … commands to run.
Env vars under auto_env in the model's config.yaml are set automatically at tune.py startup. The user does NOT need to export them. Proxy env vars (http_proxy etc.) are left alone — the tests' own disable_proxy() helper strips them for loopback calls, matching real CI.
No GPU processes holding memory at precheck time. If all GPUs are busy, precheck fails with the busy PID list and the user must free them. During tune.py run, the tool runs delete_gpu_process.sh and waits until each selected GPU is ≤ 2048 MiB before every pytest invocation and retry — this matches CI's per-stage cleanup, but only inside an active calibration run. Precheck itself never kills processes.

If anything's off, precheck fails with an actionable message; fix it yourself and retry.

Invocation

/tune-ci-thresholds — default model, all stages, 5 repeats
/tune-ci-thresholds --model qwen3-omni-v1 --stages mmsu_accuracy --repeats 3
/tune-ci-thresholds --resume <run-dir> — continue an interrupted run

Common Qwen3-Omni V1 presets:

# Full threshold stages, excluding docs smoke tests.
python .claude/skills/tune-ci-thresholds/tune.py --model qwen3-omni-v1 run \
  --stages mmmu,mmmu_talker,mmsu,mmsu_talker,tts,videoamme,videoamme_talker,videomme,videomme_talker \
  --repeats 5 --output-dir .tune-runs/<timestamp>_qwen3-omni-v1_cuda-graph_no-docs_r5

Environment and networking notes

Some CI-reproduction hosts need outbound network proxies or a HuggingFace mirror. Keep those values environment-specific and do not commit real proxy hosts, ports, usernames, tokens, or personal paths into this skill.
Prefer explicit environment variables in the same shell command that starts tune.py when a long run may be backgrounded. Use placeholders in docs and replace them only in the local shell: TUNE_VENV_PYTHON=<venv-python>, ALL_PROXY=<proxy-url>, HTTP_PROXY=<proxy-url>, HTTPS_PROXY=<proxy-url>, NO_PROXY=localhost,127.0.0.1,::1, HF_ENDPOINT=<hf-endpoint>, HF_HOME=<hf-cache-dir>, and HF_HUB_DISABLE_XET=1 when the environment needs them.
Do not wrap pytest with proxychains4: it can proxy loopback health checks and make local server startup look broken. Use proxy env vars plus NO_PROXY for local addresses.
If HuggingFace cache locks appear, inspect active pytest/server/download processes first. Only stop processes from the current calibration run.

Performance optimization checks

When recalibrating after performance work, first identify what changed since the last comparable calibration. Use the previous report's provenance commit, the current precheck.json commit, or a user-provided baseline, then inspect the commit range before judging the numbers:
```
git log --oneline <previous-calibration-commit>..<current-calibration-commit>
git diff --stat <previous-calibration-commit>..<current-calibration-commit>
```
From that range, list the performance-sensitive changes and their expected enablement signals. Examples: CUDA Graph replay, torch.compile, fused kernels, batching/concurrency changes, cache changes, scheduler changes, or preprocessing/audio/video pipeline changes.
Do not infer that an optimization is active from config alone. For every relevant optimization, look for runtime evidence in logs, metrics, or profiler output that proves the optimized path actually ran. For example, CUDA Graph may require cuda graph: True decode logs; a future torch.compile change may require compile/cache-hit logs or other project-specific evidence.
If performance is unexpectedly flat or worse, inspect both configuration and propagation through server args, runners, schedulers, and stage factories before applying thresholds. An optimization being configured and the optimized path actually being used are different things.
In the final report, separate accuracy, WER, and speed conclusions. Explain which stages match the expected optimization gains and which remain dominated by other work such as preprocessing, long prefill, audio synthesis, ASR, or video decoding.

Monitoring, failures, and completeness (mandatory)

Agent polling — never blind-wait

Maximum idle poll interval: 120 seconds (2 minutes). Never use block_until_ms ≥ 50 minutes or any equivalent long sleep while a calibration run is active. Long blind waits hide server crashes and waste hours.
While tune.py run is in progress, every 120s at most:
1. Run python tune.py status --run-dir <run-dir> and read JSON.
2. Tail <run-dir>/run.log and the active <run-dir>/_pytest/<test>/run{k}.log (last ~30 lines).
3. Report ok/total completeness and GPU memory to yourself.
If status shows pytest_active: false, "complete" is not yet true, and the last log lines show a crash, OOM, or server startup failure, do not keep waiting; resume immediately:
```
python tune.py --model <M> run --output-dir <run-dir> --resume
```
If GPU memory is > 2048 MiB on any GPU needed for the next run, do not start another pytest — wait for tune.py cleanup or run status until memory drops.
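The polling contract above can be sketched in Python. This is a minimal illustration, assuming the status subcommand prints its JSON snapshot to stdout; the fields complete and pytest_active come from the text above, while CRASH_MARKERS, fetch_status, next_action, and poll are hypothetical names and not part of tune.py.

```python
import json
import subprocess
import time

POLL_SECONDS = 120  # upper bound on idle time between polls

# Hypothetical crash markers; the real watchdog signatures live in tune.py.
CRASH_MARKERS = ("out of memory", "segmentation fault", "server startup failed")


def fetch_status(run_dir: str) -> dict:
    """Run `python tune.py status --run-dir <run-dir>` and parse the JSON.

    Assumption: the snapshot is written to stdout.
    """
    out = subprocess.run(
        ["python", "tune.py", "status", "--run-dir", run_dir],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def next_action(status: dict, log_tail: str) -> str:
    """Return 'done', 'resume', or 'wait' for one polling tick."""
    if status.get("complete") is True:
        return "done"
    crashed = any(marker in log_tail.lower() for marker in CRASH_MARKERS)
    if not status.get("pytest_active", False) and crashed:
        return "resume"
    return "wait"


def poll(run_dir: str, read_log_tail) -> str:
    """Poll until the run completes or needs a resume; never sleep past 120 s."""
    while True:
        action = next_action(fetch_status(run_dir), read_log_tail())
        if action != "wait":
            return action
        time.sleep(POLL_SECONDS)
```

Keeping next_action free of I/O makes the decision rule easy to check on its own; poll only adds the bounded sleep around it.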

tune.py built-in safeguards (v0.4+)

GPU hard gate (≤ 2 GiB): no pytest restart unless every selected GPU has memory.used <= 2048 MiB and no compute apps. Enforced at:
1. _ensure_gpus_free(): kill stale processes, poll up to 10 min
2. _pick_gpus_for_launch(): select GPUs only after cleanup
3. _launch_gpu_gate(): recheck 3s before pytest Popen; if memory rose, abort the launch and clean up again
4. After every run and before every retry: _ensure_gpus_free() again
Never launch on 17 GiB stale contexts. If the gate fails, the run aborts that attempt and retries only after memory drops.
Pytest watchdog: polls every 30s; kills pytest early when the log shows server crash signatures (OOM, segfault, router/worker death).
Auto-retry passes: after the first pass, run automatically re-executes any stage-run whose metrics are incomplete (up to --max-passes, default 10), with GPU cleanup between passes.
Per-run retries: up to 4 attempts for OOM / crash / GPU-not-clear failures before marking a stage-run incomplete.
status subcommand: machine-readable snapshot for agent polling.
report gate: refuses to write report.md unless every stage × repeat has complete metrics (130/130 for full qwen3, etc.).
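The memory half of the GPU gate can be illustrated with a short Python sketch. It assumes the standard nvidia-smi query flags for index and memory.used; query_gpu_memory and gpus_over_limit are hypothetical helpers, not the _ensure_gpus_free() implementation, and the "no compute apps" condition is not covered here.

```python
import subprocess

LIMIT_MIB = 2048  # gate threshold stated above


def query_gpu_memory() -> str:
    """Return nvidia-smi CSV rows of the form '<index>, <memory.used in MiB>'."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout


def gpus_over_limit(csv_text: str, selected: set[int],
                    limit_mib: int = LIMIT_MIB) -> dict[int, int]:
    """Map each selected GPU index above the limit to its used MiB."""
    busy = {}
    for line in csv_text.strip().splitlines():
        index_str, used_str = line.split(",")
        index, used = int(index_str), int(used_str)
        if index in selected and used > limit_mib:
            busy[index] = used
    return busy
```

An empty result means the memory condition holds for every selected GPU; a GPU at exactly 2048 MiB still passes, matching the <= 2048 MiB wording.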

Completeness is a hard prerequisite for thresholds

Never show the apply prompt (step 9) or write thresholds unless tune.py status --run-dir <run-dir> returns "complete": true.
Partial runs may exist on disk for debugging, but they are not valid calibration artifacts. Do not infer worst-of-N from missing runs.
If completeness fails after --max-passes, relay the missing list from status JSON and --resume — do not proceed to apply.
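A minimal sketch of why completeness comes first: worst-of-N is only defined once all N observations exist. The names worst_op and per_run_raw follow the apply-plan fields mentioned in this document; the function itself is hypothetical and is not taken from tune.py.

```python
def worst_of_n(per_run_raw: list[float], worst_op: str, repeats: int) -> float:
    """Return the worst observation, refusing to work on a partial run."""
    if len(per_run_raw) != repeats:
        raise ValueError(
            f"expected {repeats} runs, got {len(per_run_raw)}; resume the run first"
        )
    if worst_op == "min":  # threshold is a lower bound, e.g. throughput
        return min(per_run_raw)
    if worst_op == "max":  # threshold is an upper bound, e.g. latency or WER
        return max(per_run_raw)
    raise ValueError(f"unknown worst_op: {worst_op!r}")
```

For example, worst_of_n([12.4, 11.9, 12.1], "min", 3) picks 11.9, while the same list with repeats=5 raises instead of guessing.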

Resume

On interruptions or failed stage-runs, resume with the same --output-dir --resume; completed stage-runs are skipped, incomplete ones are purged and re-run automatically.
Do not rerun completed repeats from scratch unless the run directory is corrupt.

Steps I follow

1. Run python .claude/skills/tune-ci-thresholds/tune.py models-list to discover available models. Then for the selected model, run python tune.py --model <M> stages-list to read the per-test-file bases (e.g. mmmu, mmmu_talker, mmsu, mmsu_talker, tts, ...) and group aliases such as @accuracy, @speed, and @wer.
2. One-time parameter prompt. If the invocation omits --model, --stages, or --repeats, collect missing fields from the user exactly once. After this, do not ask the user anything else for the rest of the run.

Use two mechanisms together:

A. Plain text prompt for stages, because the base list (often more than six entries) does not fit within AskUserQuestion's 4-option cap. Print a single message listing every base from tune.py --model <M> stages-list, then wait for the user's reply on the next turn. Format:
```
Which tests should I calibrate? Reply with one or more of:
  ALL                          (every stage)
  mmmu                         tests/test_model/test_qwen3_omni_mmmu_ci.py — acc + speed
  mmmu_talker                  tests/test_model/test_qwen3_omni_mmmu_talker_ci.py — acc + wer + speed
  mmsu                         tests/test_model/test_qwen3_omni_mmsu_ci.py — acc + speed
  mmsu_talker                  tests/test_model/test_qwen3_omni_mmsu_talker_ci.py — acc + wer + speed
  videomme                     tests/test_model/test_qwen3_omni_videomme_ci.py — acc + speed
  videomme_talker              tests/test_model/test_qwen3_omni_videomme_talker_ci.py — acc + wer + speed
  videoamme                    tests/test_model/test_qwen3_omni_videoamme_ci.py — acc + speed
  videoamme_talker             tests/test_model/test_qwen3_omni_videoamme_talker_ci.py — acc + wer + speed
  tts                          tests/test_model/test_qwen3_omni_tts_ci.py — speed + wer
Shortcuts: @accuracy, @speed, @wer (metric-group aliases).
Combine with commas (e.g. "mmmu,mmsu" or "mmmu,@wer").
```
Parse the user's free-text reply (trim whitespace, split on commas) and pass verbatim to --stages; tune.py handles expansion.

B. AskUserQuestion for model and repeats — both are small finite sets. Put both in a single AskUserQuestion call (two questions). Skip any field already specified by the invocation.
- model: list the names from models-list. If only one is available and no --model given, skip asking (just use it).
- repeats: options 1 (smoke) / 2 / 3 / 5 (default).
If the invocation already has --stages, --model, and --repeats, skip step 2 entirely.

When passing --stages to tune.py run, bases (mmmu), exact stage keys (mmmu_accuracy), and @group aliases are all accepted and expanded automatically.
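The reply handling for the stages prompt (trim whitespace, split on commas, pass the tokens through unchanged) is small enough to show in full. This is a sketch; parse_stage_reply is a hypothetical helper, and expansion of bases and @group aliases is left to tune.py as described above.

```python
def parse_stage_reply(reply: str) -> str:
    """Normalize a free-text reply into a value for --stages."""
    tokens = [token.strip() for token in reply.split(",")]
    tokens = [token for token in tokens if token]  # drop empty fragments
    if not tokens:
        raise ValueError("no stages given")
    return ",".join(tokens)
```

The tokens themselves are never rewritten, so a base such as mmmu and an alias such as @wer reach tune.py exactly as the user typed them.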
3. Run python tune.py --model <M> precheck --output-dir <run-dir>. On failure, relay the message verbatim and stop. <run-dir> must live under .tune-runs/<timestamp>_<label>/ at the repo root (e.g. .tune-runs/20260423T050000Z_mmsu_r3/). That path is already gitignored. Do NOT point <run-dir> inside .claude/skills/ or anywhere else under version control: run artifacts can be large and must not leak into commits.
4. State the plan in one line: Running <M>: <stages>, <N> repeats, est. <T>. No further confirmation.
5. Before launching run, tell the user the output dir and the log paths, plus the 2-minute polling contract:
```
Output dir: <run-dir>
tail -f <run-dir>/run.log                               # tune.py progress
tail -f <run-dir>/_pytest/<test>/run1.log               # pytest subprocess
Agent polls every ≤120s:
  python tune.py status --run-dir <run-dir>
```
Then run python tune.py --model <M> run --stages ... --repeats N --output-dir <run-dir>. While the subprocess runs, poll with status every ≤120s; never blind-wait ≥50 min. On crash or incomplete metrics, --resume immediately.
6. When tune.py run exits 0, verify completeness once more: python tune.py status --run-dir <run-dir> must show "complete": true. Then open <run-dir>/report.md.
7. For every {{CONTEXT:<stage_key>}} placeholder:
   a. Load models/<M>/stages.yaml; find that stage's test path and context_vars.
   b. Read the test file; extract the literal numeric value of each listed constant (e.g. MAX_SAMPLES = 2000 → 2000).
   c. Load precheck.json for GPU count + model.
   d. Replace the placeholder with one line, e.g.: — <N>× <gpu_model> from precheck.json, 2000 samples, max_tokens=32, concurrency=8, 5 runs. If the stage is the docs stage (no threshold constants), write — <N>× <gpu_model>, docs smoke, <N> runs.
   e. If a context var is not found in the file, write ?. Never guess or copy from another stage.
8. Tell the user the report path. Treat <run-dir>/report.md as the canonical calibration artifact: it must keep the full per-run tables, worst-of-N rows, provenance, context lines, and (after apply) the applied-changes table. Do not replace it with a lightweight summary.
9. Apply prompt, strictly after the entire run is done AND complete. This prompt is the LAST thing the skill does, and must only fire once ALL of the following have completed for the whole --stages set: tune.py run has exited with exit code 0, tune.py status --run-dir <run-dir> shows "complete": true (every stage × repeat has full metrics, e.g. 130/130 for full qwen3), report.md has been written, every {{CONTEXT:...}} placeholder in step 7 has been resolved, and step 8 has shown the user the report path. Never ask between stages, between repeats, on partial failure, or while any pytest subprocess is still alive: the user may be running unattended for an hour or more and must not be interrupted mid-run. If the run was aborted, the completeness check failed, or any stage-run is missing metrics, skip step 9 entirely.

Use AskUserQuestion to ask exactly once which apply mode to use:
- report: only the report, no test files touched
- smart: auto-tighten speed thresholds; ask per metric for acc/wer and any speed metric that would loosen
- full: write worst-of-N for every metric, no further prompts
If the user picks report, stop without touching any file.
For smart and full, first run python tune.py apply-plan --run-dir <run-dir> to get a JSON with, per metric: source_kind (bare / nested), symbol, subkey, concurrency, worst_op, per_run_raw, worst_raw, worst_rounded (display-only), write_value (the literal to write), current_raw, and direction (tightens / loosens / equal / unknown).

Which value to write:
- wer and accuracy: always write_value (= worst_raw exactly). Never round WER/accuracy to display.digits. Report percentages use 2 decimal places for readability, but test-file literals must preserve the full observed float so that a max-bound WER threshold is never accidentally tightened (e.g. 0.010596 → 0.01).
- speed: use write_value from apply-plan (rounded unless that would tighten beyond worst_raw). Never re-round or multiply by scale.
Bounded write rules (enforced in write_value):
- worst_op == "min": written value must be <= worst_raw
- worst_op == "max": written value must be >= worst_raw
If display rounding would violate either bound, write_value falls back to worst_raw with full precision.
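The bounded write rules can be restated as a few lines of Python. This is only an illustration of the stated bounds, assuming a per-metric digits setting and the group names accuracy, wer, and speed; bounded_write_value is hypothetical, and in practice the value to write is always the write_value field that apply-plan already computed.

```python
def bounded_write_value(worst_raw: float, worst_op: str,
                        group: str, digits: int) -> float:
    """Round for display only when rounding cannot tighten the threshold."""
    if group in ("accuracy", "wer"):
        return worst_raw  # never rounded: keep the full observed float
    rounded = round(worst_raw, digits)
    if worst_op == "min" and rounded > worst_raw:
        return worst_raw  # rounding up would raise a lower bound
    if worst_op == "max" and rounded < worst_raw:
        return worst_raw  # rounding down would lower an upper bound
    return rounded
```

With the WER example from the text, bounded_write_value(0.010596, "max", "wer", 2) returns 0.010596 unchanged rather than 0.01.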
Mode full: for every metric in every non-docs stage, edit the test file using the rules in (b) below, no questions asked.

Mode smart: classify each metric:
- auto-apply iff stage_group == "speed" AND direction == "tightens". Edit using rules in (b).
- auto-skip iff direction == "equal" (nothing to do).
- interactive otherwise, i.e. all accuracy and wer metrics, plus any speed metric that would loosen the threshold. For each interactive metric, fire AskUserQuestion (one per metric) showing:
  - the per-run raw values from per_run_raw
  - the current literal in the test file (current_raw)
  - the proposed value (write_value, full-precision for wer/acc)
  - the direction tag
  with options:
  1. Keep current: leave the literal as-is
  2. Apply worst-of-N (<write_value>): write write_value
  3. Custom value: the user supplies a number; validate that it parses as a float, then write exactly that raw value (not the display-scaled value)
  Always include the "Other" free-text fallback (the AskUserQuestion harness adds it automatically).
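The three-way split used by smart mode can be written as a tiny decision function. This sketch mirrors the rules as stated (auto-apply, auto-skip, interactive) and nothing more; classify_smart is a hypothetical name, and the group and direction strings are the ones used in this document.

```python
def classify_smart(stage_group: str, direction: str) -> str:
    """Decide how smart mode treats one metric."""
    if stage_group == "speed" and direction == "tightens":
        return "auto-apply"   # safe: CI only gets stricter
    if direction == "equal":
        return "auto-skip"    # nothing to change
    return "interactive"      # accuracy, wer, or a loosening speed metric
```

Note that a tightening accuracy or WER metric is still interactive: only speed thresholds are tightened without asking.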
(b) Edit rules (used by both full and smart's auto-apply path, and after the user accepts in interactive prompts):
- Write write_value from apply-plan — never worst_rounded directly, and never re-format with display.digits.
- Bare source (no [...]), e.g. MMMU_MIN_ACCURACY: replace the RHS literal of MMMU_MIN_ACCURACY = <old> with write_value.
- Nested source, e.g. _MMMU_P95['throughput_qps']: use the concurrency field from apply-plan output, then replace the entry under _MMMU_P95[<C>]["throughput_qps"] with write_value. If concurrency is null (no CONCURRENCY symbol in the test file) and the dict has a single key, fall back to that key; if multiple keys exist and concurrency is null, abort the apply step for that metric and warn the user.
- For any metric whose direction came back unknown (couldn't parse current literal — usually means the test file diverged from stages.yaml), do not edit; warn and continue.
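The key-selection rule for nested sources reads more easily as code. A minimal sketch, assuming the dict keys of the P95 table have already been read from the test file; resolve_concurrency_key is a hypothetical helper, and returning None stands for "abort the apply step for this metric and warn the user".

```python
from typing import Optional, Sequence


def resolve_concurrency_key(dict_keys: Sequence[int],
                            concurrency: Optional[int]) -> Optional[int]:
    """Pick which entry of a nested P95 dict should be edited."""
    if concurrency is not None:
        return concurrency      # apply-plan resolved the CONCURRENCY symbol
    if len(dict_keys) == 1:
        return dict_keys[0]     # unambiguous fallback: the only key
    return None                 # several keys and no concurrency: abort
```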
After all edits across all stages, do two things:

(c) Append an "Applied changes" section to <run-dir>/report.md so the artifact records what was actually written. Use the Edit tool to insert this block immediately before the existing ## Provenance heading:
```
## Applied changes

| Stage | Metric | Old | New | Direction |
|-------|--------|-----|-----|-----------|
| <stage_key> | <source> | <current_raw> | <new_raw> | <direction> |
...
```
Rules:
- Include only metrics that were actually edited. Rows for "Keep current" choices, mode-report runs, and equal / unknown skips are omitted.
- Stage is the stage_key (e.g. mmsu_accuracy).
- Metric is the literal metric.source from apply-plan — bare (MMSU_MIN_ACCURACY) or nested (_MMSU_P95[8]['throughput_qps'] with the resolved concurrency substituted in).
- Old / New are raw numeric values (matching what's in the test file, not display-scaled). Trim trailing zeros for readability.
- Direction describes the effect on CI strictness, derived from worst_op and the sign of new - old:
  - worst_op == "min" (threshold is a lower bound, e.g. throughput_qps): new > old → tightens, new < old → loosens.
  - worst_op == "max" (threshold is an upper bound, e.g. latency_mean_s, rtf_mean, WER_..._MAX): new < old → tightens, new > old → loosens.
  Format the cell as tightens (Δ%) or loosens (Δ%), where Δ% is the signed percent change of the raw value relative to the old raw value, e.g. tightens (+2.1%), loosens (-7.9%). Use one decimal place. Direction MUST come from worst_op (not from the sign of Δ alone): for max-bounded metrics, a negative Δ% is a tightening.
- If nothing was edited (all kept / all skipped), do not append the section at all.
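The Direction cell rule above can be checked with a short Python sketch. The function name direction_cell is hypothetical; the logic follows the stated rule that the label comes from worst_op while the percentage is the signed change relative to the old raw value.

```python
def direction_cell(old: float, new: float, worst_op: str) -> str:
    """Format the Direction cell, e.g. 'tightens (+2.1%)'."""
    if new == old:
        raise ValueError("equal values are skipped, not reported")
    if old == 0:
        raise ValueError("percent change is undefined for old == 0")
    higher = new > old
    if worst_op == "min":      # lower bound: a higher value is stricter
        label = "tightens" if higher else "loosens"
    elif worst_op == "max":    # upper bound: a lower value is stricter
        label = "loosens" if higher else "tightens"
    else:
        raise ValueError(f"unknown worst_op: {worst_op!r}")
    pct = (new - old) / old * 100.0
    return f"{label} ({pct:+.1f}%)"
```

For a max-bounded metric, direction_cell(1.0, 0.921, "max") yields a tightening with a negative percentage, which is the case the rule calls out.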
(d) List every changed <file>:<symbol> = <new> tuple in one chat message. If the user has explicitly authorized commit/push, continue to the version-control step below; otherwise stop.
Optional version-control step — only with explicit user authorization.
- Keep .tune-runs/ local and uncommitted.
- If the calibration evidence should be committed, copy the final <run-dir>/report.md (after context replacement and any applied-changes section) to a stable path under docs/calibration/ and commit that raw observation report. A short summary under docs/ is optional, but it must not replace the raw per-run report.
- Commit only threshold/test edits, skill/config changes, and requested calibration reports / summaries under docs/.
- Run repository pre-commit hooks normally; do not bypass hooks.
- Push only the current feature/calibration branch, never main.
- Provide a PR description with: summary, calibration run directory, CUDA Graph evidence, worst-of-N highlights, threshold-apply policy, and test/pre-commit verification.

What I do not do

Set up container / venv / caches
Check out branches, install packages
Run apply_slack or generate patch files
Commit or push without explicit user authorization
Edit test files outside of the explicit apply prompt (step 9)
Write ad-hoc apply scripts that re-round metrics — always use apply-plan's write_value field when editing test files
Round WER or accuracy thresholds to display.digits (report-only)
Ask mid-run for confirmation. (I may ask once up front for missing model/stages/repeats — step 2 — and once at the end for the apply prompt — step 9. No other questions.)

Files in this skill

.claude/skills/tune-ci-thresholds/
├── SKILL.md
├── tune.py                              # CLI; METRIC_SPECS + JSON extractor
│                                        # subcommands: run, report, status,
│                                        # apply-plan, precheck, discover,
│                                        # models-list, stages-list
└── models/
    ├── qwen3-omni-v1/                   # v1 pipeline (qwen3-omni)
    │   ├── config.yaml
    │   └── stages.yaml
    └── s2-pro-v1/                       # v1 pipeline (FishAudio S2-Pro,
        ├── config.yaml                  #   uses per-test-file `variants`)
        └── stages.yaml

How metric values get read

tune.py spawns pytest with --basetemp=<fresh dir>/_pytest/<test>/basetemp_run{k}. Each test writes its result JSON (mmmu_results.json, speed_results.json, …) under that dir at a deterministic path. After pytest exits, tune.py loads those JSONs and pulls each metric by dotted key. Nothing is parsed from stdout — the test doesn't need to print anything.

The metric_sources block in config.yaml declares, per test file:

json_file — path relative to pytest basetemp (the default file for every metric in this test)
paths — {metric_key: "dotted.path"}, or "file::dotted.path" inline if the metric lives in a different JSON than the default
variants — optional; for tests that produce parallel result trees (e.g. nonstream / stream voice-clone in the same pytest run). Each variant entry has constant_filter (regex matched against the bare constant name with any leading underscore stripped), json_file, sample_counts, paths — same shape as the file-level fields. Constants matching a variant's filter are routed only to that variant; stage keys become <base>_<variant>_<group> (e.g. tts_nonstream_speed, tts_stream_wer). The bare base (tts) still resolves to all variants via the alias system.
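The two path forms in metric_sources (a plain dotted path against the default JSON, or file::dotted.path for a different file) can be resolved as sketched below. The helper names and the sample keys in the usage note are hypothetical; only the "::" separator and the dotted-key lookup come from the description above.

```python
import json
from pathlib import Path
from typing import Any


def split_source(spec: str, default_file: str) -> tuple[str, str]:
    """Split 'file::dotted.path' or fall back to the test's default JSON."""
    if "::" in spec:
        file_name, dotted = spec.split("::", 1)
        return file_name, dotted
    return default_file, spec


def get_dotted(data: Any, dotted: str) -> Any:
    """Walk nested dicts by a dotted key such as 'summary.accuracy'."""
    current = data
    for part in dotted.split("."):
        current = current[part]
    return current


def read_metric(basetemp: Path, default_file: str, spec: str) -> Any:
    """Load the right result JSON under basetemp and pull one metric."""
    file_name, dotted = split_source(spec, default_file)
    with open(basetemp / file_name, encoding="utf-8") as fh:
        return get_dotted(json.load(fh), dotted)
```

For instance, split_source("speed_results.json::summary.throughput_qps", "mmmu_results.json") selects speed_results.json, while a bare "summary.accuracy" stays with the default file.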

Regenerating stages.yaml

If a test file's sha256 no longer matches models/<M>/stages.yaml, run will warn. Regenerate with:

python tune.py --model <M> discover

This is deterministic (AST + config lookup, no LLM calls).
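Reading a threshold or context constant out of a test file without executing it is a natural fit for Python's ast module. The sketch below is an assumption about how such a lookup could work, not the discover implementation; read_constant is a hypothetical helper that returns "?" when a constant is missing or is not a plain literal, in line with the "never guess" rule for context variables.

```python
import ast


def read_constant(source: str, name: str):
    """Return the literal bound to a module-level name, or "?" if unavailable."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        elif (isinstance(node, ast.AnnAssign)
              and isinstance(node.target, ast.Name)
              and node.value is not None):
            targets = [node.target.id]
        else:
            continue
        if name in targets:
            try:
                return ast.literal_eval(node.value)
            except ValueError:
                return "?"  # assigned from a call or expression, not a literal
    return "?"
```

Because the file is parsed rather than imported, the lookup has no side effects and gives the same answer on every run.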

Adding a new model

1. Create models/<new-name>/config.yaml mirroring qwen3-omni-v1/config.yaml, including metric_sources entries for every test file the model owns.
2. Run python tune.py --model <new-name> discover.
3. Any metric that shows up as NEEDS_CONFIG means the constant was recognized but the config is missing a paths entry for it. Add the dotted JSON key and re-run discover.

Adding a new metric to an existing model

If a new test file adds a threshold constant:

Matching an existing naming pattern (*_ACC_MIN, *_WER_MAX_CORPUS, nested _*_P95[*].<known_key>) → discover picks it up for free. Add its JSON dotted key under metric_sources.<test_file>.paths.
New nested-dict key (e.g. _*_P95[*].ttft_ms) → add to _NESTED and METRIC_SPECS in tune.py.
New naming pattern (e.g. *_BLEU_MIN) → extend match_metric() and METRIC_SPECS in tune.py.
Metric lives in a different JSON than the test's default → use the <file>::<dotted.path> inline form in metric_sources.<test>.paths.

Threshold constants whose name match_metric() doesn't recognize are silently ignored — extend match_metric() if you add a new pattern.
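To make the "silently ignored" behavior concrete, here is a rough sketch of a name matcher. It is not the real match_metric(): the regexes are built only from the naming examples above (*_ACC_MIN, *_WER_MAX_CORPUS), and the group labels and the name match_metric_sketch are assumptions.

```python
import re
from typing import Optional

# Patterns taken from the naming examples above; the groups are assumptions.
_PATTERNS = [
    (re.compile(r"^_?\w+_ACC_MIN$"), "accuracy"),
    (re.compile(r"^_?\w+_WER_MAX_CORPUS$"), "wer"),
]


def match_metric_sketch(constant_name: str) -> Optional[str]:
    """Return the metric group for a constant name, or None if unrecognized."""
    for pattern, group in _PATTERNS:
        if pattern.match(constant_name):
            return group
    return None  # unrecognized names are dropped without any warning
```

A constant following a new convention, such as the *_BLEU_MIN example, returns None here until a matching pattern is added, which is why the matcher has to be extended before discover can see it.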

name	tune-ci-thresholds
description	Run CI tests N times per stage on the H20 CI-reproduction host, produce a per-metric worst-of-N observation report, and (on user confirmation) write the worst-of-N values back into the test files as new baselines. Use when recalibrating CI thresholds after an engine update. Currently supports qwen3-omni-v1 and s2-pro-v1; extensible via models/<name>/config.yaml.

tune-ci-thresholds

Scope

Models

Each supported model has a config under models/<name>/:

config.yaml — hf model id, datasets, default venv, test globs, per-test extra env, stage-key naming, and metric_sources (per-test result-JSON paths that tune.py reads to get metric values)
stages.yaml — generated by tune.py discover --model <name>

List what's configured:

python .claude/skills/tune-ci-thresholds/tune.py models-list

Prerequisites (I verify, I do not create)

Running inside the CI-reproduction container (image frankleeeee/sglang-omni:dev or equivalent). The container name is not checked — rely on the image being correct.
Venv ready; default path comes from the selected model's config.yaml, overridable via --venv-python or $TUNE_VENV_PYTHON
Branch checked out, dependencies installed
Model weights and datasets from the config cached locally. precheck lists each asset as ✓ / ✗ and, on any miss, prints the exact huggingface-cli download … commands to run.
Env vars under auto_env in the model's config.yaml are set automatically at tune.py startup. The user does NOT need to export them. Proxy env vars (http_proxy etc.) are left alone — the tests' own disable_proxy() helper strips them for loopback calls, matching real CI.
No GPU processes holding memory at precheck time. If all GPUs are busy, precheck fails with the busy PID list and the user must free them. During tune.py run, the tool runs delete_gpu_process.sh and waits until each selected GPU is ≤ 2048 MiB before every pytest invocation and retry — this matches CI's per-stage cleanup, but only inside an active calibration run. Precheck itself never kills processes.

If anything's off, precheck fails with an actionable message; fix it yourself and retry.

Invocation

/tune-ci-thresholds — default model, all stages, 5 repeats
/tune-ci-thresholds --model qwen3-omni-v1 --stages mmsu_accuracy --repeats 3
/tune-ci-thresholds --resume <run-dir> — continue an interrupted run

Common Qwen3-Omni V1 presets:

# Full threshold stages, excluding docs smoke tests.
python .claude/skills/tune-ci-thresholds/tune.py --model qwen3-omni-v1 run \
  --stages mmmu,mmmu_talker,mmsu,mmsu_talker,tts,videoamme,videoamme_talker,videomme,videomme_talker \
  --repeats 5 --output-dir .tune-runs/<timestamp>_qwen3-omni-v1_cuda-graph_no-docs_r5

Environment and networking notes

Some CI-reproduction hosts need outbound network proxies or a HuggingFace mirror. Keep those values environment-specific and do not commit real proxy hosts, ports, usernames, tokens, or personal paths into this skill.
Prefer explicit environment variables in the same shell command that starts tune.py when a long run may be backgrounded. Use placeholders in docs and replace them only in the local shell: TUNE_VENV_PYTHON=<venv-python>, ALL_PROXY=<proxy-url>, HTTP_PROXY=<proxy-url>, HTTPS_PROXY=<proxy-url>, NO_PROXY=localhost,127.0.0.1,::1, HF_ENDPOINT=<hf-endpoint>, HF_HOME=<hf-cache-dir>, and HF_HUB_DISABLE_XET=1 when the environment needs them.
Do not wrap pytest with proxychains4: it can proxy loopback health checks and make local server startup look broken. Use proxy env vars plus NO_PROXY for local addresses.
If HuggingFace cache locks appear, inspect active pytest/server/download processes first. Only stop processes from the current calibration run.

Performance optimization checks

When recalibrating after performance work, first identify what changed since the last comparable calibration. Use the previous report's provenance commit, the current precheck.json commit, or a user-provided baseline, then inspect the commit range before judging the numbers:
```
git log --oneline <previous-calibration-commit>..<current-calibration-commit>
git diff --stat <previous-calibration-commit>..<current-calibration-commit>
```
From that range, list the performance-sensitive changes and their expected enablement signals. Examples: CUDA Graph replay, torch.compile, fused kernels, batching/concurrency changes, cache changes, scheduler changes, or preprocessing/audio/video pipeline changes.
Do not infer that an optimization is active from config alone. For every relevant optimization, look for runtime evidence in logs, metrics, or profiler output that proves the optimized path actually ran. For example, CUDA Graph may require cuda graph: True decode logs; a future torch.compile change may require compile/cache-hit logs or other project-specific evidence.
If performance is unexpectedly flat or worse, inspect both configuration and propagation through server args, runners, schedulers, and stage factories before applying thresholds. An optimization being configured and the optimized path actually being used are different things.
In the final report, separate accuracy, WER, and speed conclusions. Explain which stages match the expected optimization gains and which remain dominated by other work such as preprocessing, long prefill, audio synthesis, ASR, or video decoding.

Monitoring, failures, and completeness (mandatory)

Agent polling — never blind-wait

Maximum idle poll interval: 120 seconds (2 minutes). Never use block_until_ms ≥ 50 minutes or any equivalent long sleep while a calibration run is active. Long blind waits hide server crashes and waste hours.
While tune.py run is in progress, every 120s at most:
1. Run python tune.py status --run-dir <run-dir> and read JSON.
2. Tail <run-dir>/run.log and the active <run-dir>/_pytest/<test>/run{k}.log (last ~30 lines).
3. Report ok/total completeness and GPU memory to yourself.
If status shows pytest_active: false while "complete" is not yet true, and the last log lines show a crash, OOM, or server startup failure, do not keep waiting; resume immediately:
```
python tune.py --model <M> run --output-dir <run-dir> --resume
```
If GPU memory is > 2048 MiB on any GPU needed for the next run, do not start another pytest. Wait for tune.py cleanup, or keep polling status, until memory drops.
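The polling contract above can be sketched as one status read plus a pure decision function. This is an illustration under assumptions: it assumes the status subcommand prints its JSON to stdout, the `pytest_active` and `complete` keys are the ones named in this section, and the crash markers and function names are hypothetical stand-ins for whatever the real logs contain.

```python
import json
import subprocess

# Assumed crash markers, for illustration only; match them to the real log text.
CRASH_MARKERS = ("out of memory", "segmentation fault", "server startup failed")


def read_status(run_dir):
    """Run the status subcommand and parse its output (assumed to be JSON on stdout)."""
    result = subprocess.run(
        ["python", "tune.py", "status", "--run-dir", run_dir],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def next_action(status, log_tail):
    """Decide what to do after one poll: 'done', 'resume', or 'poll'."""
    if status.get("complete") is True:
        return "done"
    crashed = any(marker in log_tail.lower() for marker in CRASH_MARKERS)
    if status.get("pytest_active") is False and crashed:
        # Pytest is gone and the log shows a failure: rerun with --resume now.
        return "resume"
    return "poll"  # keep polling, at most 120 seconds apart
```

Keeping the decision separate from the subprocess call makes it easy to check the rules against recorded status snapshots without a live run.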

tune.py built-in safeguards (v0.4+)

GPU hard gate (≤ 2 GiB): no pytest restart unless every selected GPU has memory.used <= 2048 MiB and no compute apps. Enforced at:
1. _ensure_gpus_free() — kill stale processes, poll up to 10 min
2. _pick_gpus_for_launch() — select GPUs only after cleanup
3. _launch_gpu_gate() — recheck 3s before pytest Popen; if memory rose, abort launch and cleanup again
4. After every run / before every retry — _ensure_gpus_free() again
Never launch on 17 GiB stale contexts. If the gate fails, the run aborts that attempt and retries only after memory drops.
Pytest watchdog: polls every 30s; kills pytest early when the log shows server crash signatures (OOM, segfault, router/worker death).
Auto-retry passes: after the first pass, run automatically re-executes any stage-run whose metrics are incomplete (up to --max-passes, default 10), with GPU cleanup between passes.
Per-run retries: up to 4 attempts for OOM / crash / GPU-not-clear failures before marking a stage-run incomplete.
status subcommand: machine-readable snapshot for agent polling.
report gate: refuses to write report.md unless every stage × repeat has complete metrics (130/130 for full qwen3, etc.).
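The memory half of the GPU hard gate can be sketched as below. This is not the code in tune.py: the function names are hypothetical, the input is assumed to be the text printed by `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`, and the "no compute apps" condition is left out.

```python
GPU_FREE_LIMIT_MIB = 2048  # threshold quoted in this section


def parse_memory_used(csv_text):
    """Parse 'index, memory.used' rows into {gpu_index: used_mib}."""
    used = {}
    for line in csv_text.strip().splitlines():
        index, mem = (field.strip() for field in line.split(","))
        used[int(index)] = int(mem)
    return used


def gate_passes(used_mib, selected, limit=GPU_FREE_LIMIT_MIB):
    """True only if every selected GPU is at or below the memory limit."""
    return all(used_mib[gpu] <= limit for gpu in selected)
```

For example, a GPU still holding a 17 GiB stale context (17408 MiB) fails the gate, while one at exactly 2048 MiB passes.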

Completeness is a hard prerequisite for thresholds

Never show the apply prompt (step 9) or write thresholds unless tune.py status --run-dir <run-dir> returns "complete": true.
Partial runs may exist on disk for debugging, but they are not valid calibration artifacts. Do not infer worst-of-N from missing runs.
If completeness still fails after --max-passes, relay the missing list from the status JSON and rerun with --resume; do not proceed to apply.

Resume

After an interruption or failed stage-runs, rerun with the same --output-dir plus --resume; completed stage-runs are skipped, and incomplete ones are purged and re-run automatically.
Do not rerun completed repeats from scratch unless the run directory is corrupt.

Steps I follow

Run python .claude/skills/tune-ci-thresholds/tune.py models-list to discover available models. Then for the selected model, run python tune.py --model <M> stages-list to read the per-test-file bases (e.g. mmmu, mmmu_talker, mmsu, mmsu_talker, tts, ...) and group aliases such as @accuracy, @speed, and @wer.
One-time parameter prompt. If the invocation omits --model, --stages, or --repeats, collect missing fields from the user exactly once. After this, do not ask the user anything else for the rest of the run.

Use two mechanisms together:

A. Plain text prompt for stages — because the base list (up to 6+) does not fit in AskUserQuestion's 4-option cap. Print a single message listing every base from tune.py --model <M> stages-list, then wait for the user's reply on the next turn. Format:
```
Which tests should I calibrate? Reply with one or more of:
  ALL                          (every stage)
  mmmu                         tests/test_model/test_qwen3_omni_mmmu_ci.py — acc + speed
  mmmu_talker                  tests/test_model/test_qwen3_omni_mmmu_talker_ci.py — acc + wer + speed
  mmsu                         tests/test_model/test_qwen3_omni_mmsu_ci.py — acc + speed
  mmsu_talker                  tests/test_model/test_qwen3_omni_mmsu_talker_ci.py — acc + wer + speed
  videomme                     tests/test_model/test_qwen3_omni_videomme_ci.py — acc + speed
  videomme_talker              tests/test_model/test_qwen3_omni_videomme_talker_ci.py — acc + wer + speed
  videoamme                    tests/test_model/test_qwen3_omni_videoamme_ci.py — acc + speed
  videoamme_talker             tests/test_model/test_qwen3_omni_videoamme_talker_ci.py — acc + wer + speed
  tts                          tests/test_model/test_qwen3_omni_tts_ci.py — speed + wer
Shortcuts: @accuracy, @speed, @wer (metric-group aliases).
Combine with commas (e.g. "mmmu,mmsu" or "mmmu,@wer").
```
Parse the user's free-text reply (trim whitespace, split on commas) and pass verbatim to --stages; tune.py handles expansion.
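A minimal sketch of that parsing step, assuming a hypothetical helper name. It only trims and re-joins tokens; bases, exact stage keys, and @group aliases are passed through untouched because tune.py does the expansion.

```python
def parse_stage_reply(reply):
    """Trim whitespace, split on commas, and return the value for --stages."""
    tokens = [token.strip() for token in reply.split(",")]
    tokens = [token for token in tokens if token]  # drop empties from stray commas
    if not tokens:
        raise ValueError("no stages given")
    return ",".join(tokens)
```

A reply such as `" mmmu , @wer "` becomes `"mmmu,@wer"`, and `"ALL"` is returned unchanged.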

B. AskUserQuestion for model and repeats — both are small finite sets. Put both in a single AskUserQuestion call (two questions). Skip any field already specified by the invocation.
- model: list the names from models-list. If only one is available and no --model given, skip asking (just use it).
- repeats: options 1 (smoke) / 2 / 3 / 5 (default).
If the invocation already has --stages, --model, and --repeats, skip step 2 entirely.

When passing --stages to tune.py run, bases (mmmu), exact stage keys (mmmu_accuracy), and @group aliases are all accepted and expanded automatically.
Run python tune.py --model <M> precheck --output-dir <run-dir>. On failure, relay the message verbatim and stop. <run-dir> must live under .tune-runs/<timestamp>_<label>/ at the repo root (e.g. .tune-runs/20260423T050000Z_mmsu_r3/). That path is already gitignored. Do NOT point <run-dir> inside .claude/skills/ or anywhere else under version control — run artifacts can be large and must not leak into commits.
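The run-directory convention can be sketched as follows. The helper names are hypothetical; the timestamp format is inferred from the example path 20260423T050000Z_mmsu_r3 and is assumed to be UTC.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def make_run_dir(label, now):
    """Build .tune-runs/<timestamp>_<label> from a timezone-aware datetime."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return str(PurePosixPath(".tune-runs") / f"{stamp}_{label}")


def is_allowed_run_dir(path):
    """Accept only repo-relative paths whose first component is .tune-runs."""
    parts = PurePosixPath(path).parts
    return len(parts) >= 2 and parts[0] == ".tune-runs"
```

Rejecting anything outside .tune-runs/ up front is a cheap guard against run artifacts landing under .claude/skills/ or another tracked path.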
State plan in one line: Running <M>: <stages>, <N> repeats, est. <T>. No further confirmation.
Before launching run, tell the user the output dir and the log paths, plus the 2-minute polling contract:
```
Output dir: <run-dir>
tail -f <run-dir>/run.log                               # tune.py progress
tail -f <run-dir>/_pytest/<test>/run1.log               # pytest subprocess
Agent polls every ≤120s:
  python tune.py status --run-dir <run-dir>
```
Then run python tune.py --model <M> run --stages ... --repeats N --output-dir <run-dir>. While the subprocess runs, poll with status every ≤120s — never blind-wait ≥50 min. On crash or incomplete metrics, --resume immediately.
When tune.py run exits 0, verify completeness once more: python tune.py status --run-dir <run-dir> must show "complete": true. Then open <run-dir>/report.md.
For every {{CONTEXT:<stage_key>}} placeholder:
a. Load models/<M>/stages.yaml; find that stage's test path and context_vars.
b. Read the test file; extract the literal numeric value of each listed constant (e.g. MAX_SAMPLES = 2000 → 2000).
c. Load precheck.json for GPU count + model.
d. Replace the placeholder with one line, e.g.: — <N>× <gpu_model> from precheck.json, 2000 samples, max_tokens=32, concurrency=8, 5 runs. If the stage is the docs stage (no threshold constants), write — <N>× <gpu_model>, docs smoke, <N> runs.
e. If a context var is not found in the file, write ?. Never guess or copy from another stage.
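Extracting the literal value of each context variable can be done with Python's ast module, without importing the test file. The sketch below is an assumption about how one might do it, not the skill's actual code: the function name is hypothetical, it handles plain module-level assignments only, and it returns "?" for anything missing or non-literal.

```python
import ast


def extract_constants(source, names):
    """Map each requested constant to its numeric literal, or '?' if not found."""
    found = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue  # annotated assignments and other statements are not handled
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id in names:
                try:
                    value = ast.literal_eval(node.value)
                except (ValueError, TypeError):
                    continue  # computed value, not a literal
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    found[target.id] = value
    return {name: found.get(name, "?") for name in names}
```

Because a computed or missing constant yields "?", the context line never carries a guessed value.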
Tell the user the report path. Treat <run-dir>/report.md as the canonical calibration artifact: it must keep the full per-run tables, worst-of-N rows, provenance, context lines, and (after apply) the applied-changes table. Do not replace it with a lightweight summary.
Apply prompt: strictly after the entire run is done AND complete. This prompt is the LAST thing the skill does, and must only fire once ALL of the following hold for the whole --stages set:
- tune.py run has exited with exit code 0
- tune.py status --run-dir <run-dir> shows "complete": true (every stage × repeat has full metrics, e.g. 130/130 for full qwen3)
- report.md has been written
- every {{CONTEXT:...}} placeholder in step 7 has been resolved
- step 8 has shown the user the report path
Never ask between stages, between repeats, on partial failure, or while any pytest subprocess is still alive: the user may be running unattended for an hour+ and must not be interrupted mid-run. If the run was aborted, the completeness check failed, or any stage-run is missing metrics, skip step 9 entirely.

Use AskUserQuestion to ask exactly once which apply mode to use:
- report — only the report, no test files touched
- smart — auto-tighten speed thresholds; ask per metric for acc/wer and any speed metric that would loosen
- full — write worst-of-N for every metric, no further prompts
If the user picks report, stop without touching any file.
For smart and full, first run python tune.py apply-plan --run-dir <run-dir> to get a JSON with, per metric: source_kind (bare / nested), symbol, subkey, concurrency, worst_op, per_run_raw, worst_raw, worst_rounded (display-only), write_value (the literal to write), current_raw, and direction (tightens / loosens / equal / unknown).

Which value to write:
- wer and accuracy: always write_value (= worst_raw exactly). Never round WER/accuracy to display.digits — report percentages use 2 decimal places for readability, but test-file literals must preserve the full observed float so a max-bound WER threshold is never accidentally tightened (e.g. 0.010596 → 0.01).
- speed: use write_value from apply-plan (rounded unless that would tighten beyond worst_raw). Never re-round or multiply by scale.
Bounded write rules (enforced in write_value):
- worst_op == "min": written value must be <= worst_raw
- worst_op == "max": written value must be >= worst_raw
If display rounding would violate either bound, write_value falls back to worst_raw with full precision.
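The bounded write rules can be illustrated with a short function. This is a sketch of the rule as stated here, not the apply-plan implementation: the function name and the `digits` parameter (standing in for display.digits) are assumptions. For WER and accuracy the text above says to write worst_raw exactly, so rounding of this kind applies to speed metrics only.

```python
def bounded_write_value(worst_raw, worst_op, digits):
    """Round for display only when the rounded value does not tighten past worst_raw."""
    if worst_op not in ("min", "max"):
        raise ValueError(f"unexpected worst_op: {worst_op!r}")
    rounded = round(worst_raw, digits)
    if worst_op == "min" and rounded <= worst_raw:
        return rounded  # lower bound: must stay at or below the observed worst
    if worst_op == "max" and rounded >= worst_raw:
        return rounded  # upper bound: must stay at or above the observed worst
    return worst_raw  # rounding would tighten the threshold, keep full precision
```

With the example from the text, a max-bounded 0.010596 rounded to two digits would become 0.01, which is below the observed worst, so the function returns 0.010596 unchanged.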
Mode full: for every metric in every non-docs stage, edit the test file using the rules in (b) below, no questions asked.

Mode smart: classify each metric:
- auto-apply iff stage_group == "speed" AND direction == "tightens". Edit using rules in (b).
- auto-skip iff direction == "equal" (nothing to do).
- interactive otherwise — i.e. all accuracy and wer metrics, plus any speed metric that would loosen the threshold. For each interactive metric, fire AskUserQuestion (one per metric) showing:
  - the per-run raw values from per_run_raw
  - the current literal in the test file (current_raw)
  - the proposed value (write_value — full-precision for wer/acc)
  - direction tag with options:
  1. Keep current — leave the literal as-is
  2. Apply worst-of-N (<write_value>) — write write_value
  3. Custom value — the user supplies a number; validate that it parses as a float, then write exactly that raw value (not the display-scaled value)
  Always include the "Other" free-text fallback (the AskUserQuestion harness adds it automatically).
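The smart-mode classification above reduces to a small decision function. This is a sketch with a hypothetical function name and return labels; the `unknown` branch reflects the edit rule that metrics with an unparseable current literal are warned about and not edited.

```python
def classify_metric(stage_group, direction):
    """Return 'auto-apply', 'auto-skip', 'interactive', or 'warn-skip'."""
    if direction == "unknown":
        return "warn-skip"  # current literal could not be parsed: warn, do not edit
    if direction == "equal":
        return "auto-skip"  # nothing to do
    if stage_group == "speed" and direction == "tightens":
        return "auto-apply"
    # All accuracy and wer metrics, plus speed metrics that would loosen.
    return "interactive"
```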
(b) Edit rules (used by both full and smart's auto-apply path, and after the user accepts in interactive prompts):
- Write write_value from apply-plan — never worst_rounded directly, and never re-format with display.digits.
- Bare source (no [...]), e.g. MMMU_MIN_ACCURACY: replace the RHS literal of MMMU_MIN_ACCURACY = <old> with write_value.
- Nested source, e.g. _MMMU_P95['throughput_qps']: use the concurrency field from apply-plan output, then replace the entry under _MMMU_P95[<C>]["throughput_qps"] with write_value. If concurrency is null (no CONCURRENCY symbol in the test file) and the dict has a single key, fall back to that key; if multiple keys exist and concurrency is null, abort the apply step for that metric and warn the user.
- For any metric whose direction came back unknown (couldn't parse current literal — usually means the test file diverged from stages.yaml), do not edit; warn and continue.
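Two of the edit rules lend themselves to a short illustration: replacing the right-hand literal of a bare constant, and resolving the concurrency key for a nested source. Both helpers are hypothetical sketches. The regex handles only simple `NAME = <number>` lines at module level, and anything else raises so the caller can warn instead of guessing.

```python
import re


def replace_bare_literal(source, symbol, write_value):
    """Replace the numeric RHS of `symbol = <old>` and keep the rest of the line."""
    pattern = re.compile(
        rf"^(?P<lhs>{re.escape(symbol)}\s*=\s*)(?P<old>[-+]?[0-9][0-9_.eE+-]*)",
        re.MULTILINE,
    )
    new_source, count = pattern.subn(
        lambda m: m.group("lhs") + repr(write_value), source
    )
    if count != 1:
        raise ValueError(f"expected exactly one assignment of {symbol}, found {count}")
    return new_source


def resolve_concurrency_key(concurrency, dict_keys):
    """Pick the nested-dict key, or None when the choice is ambiguous."""
    if concurrency is not None:
        return concurrency
    if len(dict_keys) == 1:
        return dict_keys[0]  # single key: safe fallback
    return None  # several keys and no concurrency: abort this metric and warn
```

Using `repr` for the new literal keeps the full float precision, in line with the rule against re-formatting with display.digits.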
After all edits across all stages, do two things:

(c) Append an "Applied changes" section to <run-dir>/report.md so the artifact records what was actually written. Use the Edit tool to insert this block immediately before the existing ## Provenance heading:
```
## Applied changes

| Stage | Metric | Old | New | Direction |
|-------|--------|-----|-----|-----------|
| <stage_key> | <source> | <current_raw> | <new_raw> | <direction> |
...
```
Rules:
- Include only metrics that were actually edited. Rows for "Keep current" choices, mode-report runs, and equal / unknown skips are omitted.
- Stage is the stage_key (e.g. mmsu_accuracy).
- Metric is the literal metric.source from apply-plan — bare (MMSU_MIN_ACCURACY) or nested (_MMSU_P95[8]['throughput_qps'] with the resolved concurrency substituted in).
- Old / New are raw numeric values (matching what's in the test file, not display-scaled). Trim trailing zeros for readability.
- Direction describes the effect on CI strictness — derived from worst_op and the sign of new - old:
  - worst_op == "min" (threshold is a lower bound, e.g. throughput_qps): new > old → tightens, new < old → loosens.
  - worst_op == "max" (threshold is an upper bound, e.g. latency_mean_s, rtf_mean, WER_..._MAX): new < old → tightens, new > old → loosens.
  Format the cell as tightens (Δ%) or loosens (Δ%) where Δ% is the signed percent change of the raw value relative to the old raw value, e.g. tightens (+2.1%), loosens (-7.9%). Use one decimal place. Direction MUST come from worst_op (not from sign-of-Δ alone): for max-bounded metrics, a negative Δ% is a tightening.
- If nothing was edited (all kept / all skipped), do not append the section at all.
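The Direction cell rule can be written as a small function. This is a sketch with a hypothetical name; it assumes the old value is non-zero and that unchanged metrics have already been filtered out, since they are omitted from the table.

```python
def direction_cell(worst_op, old, new):
    """Format 'tightens (Δ%)' or 'loosens (Δ%)' from worst_op and the raw values."""
    if worst_op not in ("min", "max"):
        raise ValueError(f"unexpected worst_op: {worst_op!r}")
    if new == old:
        raise ValueError("unchanged metrics are omitted from the table")
    if old == 0:
        raise ValueError("percent change is undefined for old == 0")
    delta_pct = (new - old) / old * 100.0
    raised = new > old
    # Raising a lower bound tightens CI; raising an upper bound loosens it.
    tightens = raised if worst_op == "min" else not raised
    label = "tightens" if tightens else "loosens"
    return f"{label} ({delta_pct:+.1f}%)"
```

For a max-bounded metric going from 1.0 to 0.921 this yields `tightens (-7.9%)`, which shows why the label cannot be derived from the sign of Δ alone.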
(d) List every changed <file>:<symbol> = <new> tuple in one chat message. If the user has explicitly authorized commit/push, continue to the version-control step below; otherwise stop.
Optional version-control step — only with explicit user authorization.
- Keep .tune-runs/ local and uncommitted.
- If the calibration evidence should be committed, copy the final <run-dir>/report.md (after context replacement and any applied-changes section) to a stable path under docs/calibration/ and commit that raw observation report. A short summary under docs/ is optional, but it must not replace the raw per-run report.
- Commit only threshold/test edits, skill/config changes, and requested calibration reports / summaries under docs/.
- Run repository pre-commit hooks normally; do not bypass hooks.
- Push only the current feature/calibration branch, never main.
- Provide a PR description with: summary, calibration run directory, CUDA Graph evidence, worst-of-N highlights, threshold-apply policy, and test/pre-commit verification.

What I do not do

Set up container / venv / caches
Check out branches, install packages
Run apply_slack or generate patch files
Commit or push without explicit user authorization
Edit test files outside of the explicit apply prompt (step 9)
Write ad-hoc apply scripts that re-round metrics — always use apply-plan's write_value field when editing test files
Round WER or accuracy thresholds to display.digits (report-only)
Ask mid-run for confirmation. (I may ask once up front for missing model/stages/repeats — step 2 — and once at the end for the apply prompt — step 9. No other questions.)

Files in this skill

.claude/skills/tune-ci-thresholds/
├── SKILL.md
├── tune.py                              # CLI; METRIC_SPECS + JSON extractor
│                                        # subcommands: run, report, status,
│                                        # apply-plan, precheck, discover
└── models/
    ├── qwen3-omni-v1/                   # v1 pipeline (qwen3-omni)
    │   ├── config.yaml
    │   └── stages.yaml
    └── s2-pro-v1/                       # v1 pipeline (FishAudio S2-Pro,
        ├── config.yaml                  #   uses per-test-file `variants`)
        └── stages.yaml

How metric values get read

The metric_sources block in config.yaml declares, per test file:

- json_file — path relative to pytest basetemp (the default file for every metric in this test)
- paths — {metric_key: "dotted.path"}, or "file::dotted.path" inline if the metric lives in a different JSON than the default
- variants — optional; for tests that produce parallel result trees (e.g. nonstream / stream voice-clone in the same pytest run). Each variant entry has constant_filter (regex matched against the bare constant name with any leading underscore stripped), json_file, sample_counts, paths — same shape as the file-level fields. Constants matching a variant's filter are routed only to that variant; stage keys become <base>_<variant>_<group> (e.g. tts_nonstream_speed, tts_stream_wer). The bare base (tts) still resolves to all variants via the alias system.
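Resolving a paths entry to a value can be sketched in two steps: split off an optional `file::` prefix, then walk the dotted path through the loaded JSON. The helper names and the file names in the usage note are hypothetical, and the real extractor in tune.py may differ.

```python
def split_source(spec, default_file):
    """Return (json_file, dotted_path) for a plain or 'file::dotted.path' entry."""
    if "::" in spec:
        file_name, dotted = spec.split("::", 1)
        return file_name, dotted
    return default_file, spec


def dig(data, dotted):
    """Follow a dotted path through nested dicts; raises KeyError if a key is missing."""
    node = data
    for key in dotted.split("."):
        node = node[key]
    return node
```

For instance, `split_source("speed.json::summary.throughput_qps", "results.json")` selects speed.json, while a plain `"summary.wer"` falls back to the default results.json.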

Regenerating stages.yaml

If a test file's sha256 no longer matches the one recorded in models/<M>/stages.yaml, the run subcommand warns. Regenerate with:

python tune.py --model <M> discover

This is deterministic (AST + config lookup, no LLM calls).
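The staleness check behind that warning amounts to hashing the test file and comparing it with a recorded digest. A minimal sketch, assuming hypothetical helper names; how the digest is stored inside stages.yaml is not shown in this document, so it is passed in as a plain string here.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Hex sha256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_stale(path, recorded_sha256):
    """True when the file on disk no longer matches the recorded digest."""
    return file_sha256(path) != recorded_sha256.lower()
```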

Adding a new model

Create models/<new-name>/config.yaml mirroring qwen3-omni-v1/config.yaml, including metric_sources entries for every test file the model owns.
Run python tune.py --model <new-name> discover.
Any metric that shows up as NEEDS_CONFIG means the constant was recognized but the config is missing a paths entry for it — add the dotted JSON key and re-run discover.

Adding a new metric to an existing model

If a new test file adds a threshold constant:

Matching an existing naming pattern (*_ACC_MIN, *_WER_MAX_CORPUS, nested _*_P95[*].<known_key>) → discover picks it up for free. Add its JSON dotted key under metric_sources.<test_file>.paths.
New nested-dict key (e.g. _*_P95[*].ttft_ms) → add to _NESTED and METRIC_SPECS in tune.py.
New naming pattern (e.g. *_BLEU_MIN) → extend match_metric() and METRIC_SPECS in tune.py.
Metric lives in a different JSON than the test's default → use the <file>::<dotted.path> inline form in metric_sources.<test>.paths.

Threshold constants whose name match_metric() doesn't recognize are silently ignored — extend match_metric() if you add a new pattern.
