evaluate-zephyr-perf
Run a perf gate on a PR that touches lib/zephyr internals.
| name | evaluate-zephyr-perf |
| description | Run a perf gate on a PR that touches lib/zephyr internals. |
Perf gate for Zephyr-internals PRs. The agent reads the diff, picks a max_gate,
submits a treatment ferry, fetches the matching scheduled-baseline perf report,
compares the two JSON reports against the threshold table, and posts one
canonical comment to the PR.
The control side is always the latest successful scheduled
marin-canary-datakit-tier<N> workflow run on main. Each tier's Capture perf report step uploads a datakit-tier<N>-perf-report workflow artifact (90-day
retention) and mirrors the same JSON to
gs://marin-us-central1/infra/datakit/ferry_perf/. The agent never submits a
baseline ferry of its own.
Triggers only on changes to Zephyr internals (lib/zephyr/src/zephyr/**).
Datakit / dedup / normalize / tokenize live in lib/marin/... and are out of
scope. If a PR touches both, gate the Zephyr part; the datakit canaries cover
the rest.
The agent may, without asking:
- […] max_gate.
- […] max_gate allows.

The agent does not open follow-up issues, even on ❌ fail. The PR comment
is the artifact; the author owns the response.

Ask before:
- […] lib/iris/config/marin.yaml.

Run when both hold:
- […] lib/zephyr/src/zephyr/** (or lib/zephyr/pyproject.toml), AND
- […]

Out of scope (do not trigger):
- lib/marin/src/marin/processing/classification/deduplication/** (dedup)
- lib/marin/src/marin/datakit/normalize/** (normalize)
- lib/marin/src/marin/processing/tokenize/** (tokenize)
- lib/fray/**: flag it in the PR comment but don't auto-gate; ask the reviewer
- […] *.md, lib/zephyr/AGENTS.md, lib/zephyr/OPS.md)
- […] lib/zephyr/tests/**)

There is no path-glob script; the agent makes the scope call from the diff. When in doubt, ask the reviewer.
| Gate | Tier workflow | Schedule | Ferry / coverage | Wall-time |
|---|---|---|---|---|
| skip | — | — | All-trivial diff (comments, docstrings, type hints, renames). Reviewer must concur. | — |
| 1 — smoke | marin-canary-datakit-tier1.yaml | daily 06:30 UTC | experiments.ferries.datakit_ferry (FineWeb-Edu sample/10BT). End-to-end pass at small scale. | ~30–60 min |
| 2 — long-tail stress | marin-canary-datakit-tier2.yaml | daily 07:00 UTC | experiments.ferries.datakit_tier2_skewed_ferry. Synthetic skewed doc-length distribution (log-normal mean ~5 KB body + Pareto tail + ~100 mega-docs in [128 MB, 256 MB]) — exercises spill, scatter, consolidate under buffer pressure. | ~2.5 h |
| 3 — nemotron | marin-canary-datakit-tier3.yaml | weekly Mon 01:00 UTC | experiments.ferries.datakit_nemotron_ferry with quality=high, max_files=1000. Production-scale bulk filtered web within the GH 6h cap. Runs in europe-west4, non-preemptible. | ~3 h |
Gate 1 always runs first, regardless of max_gate. If Gate 1 passes and
max_gate >= 2, escalate to Gate 2; if Gate 2 passes and max_gate >= 3,
escalate to Gate 3. If any gate fails, post the verdict and stop — no point
burning bigger budget on a regression already proven at smaller scale.
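The escalation rule above can be sketched as a small loop. This is a minimal illustration, not the agent's actual implementation: `run_gate` is a hypothetical callback that runs one gate and returns a verdict string, and stopping on any non-pass verdict (including warn and inconclusive) is an assumption, since the text only says escalation happens when a gate passes.

```python
from __future__ import annotations

from typing import Callable


def run_gate_ladder(max_gate: int, run_gate: Callable[[int], str]) -> list[tuple[int, str]]:
    """Run Gate 1 first, then escalate one tier at a time up to max_gate.

    run_gate is a hypothetical callback: it executes a single gate and
    returns "pass", "warn", "inconclusive", or "fail".
    """
    results: list[tuple[int, str]] = []
    for gate in range(1, max_gate + 1):
        verdict = run_gate(gate)
        results.append((gate, verdict))
        # Assumption: only a clean pass justifies spending a bigger budget.
        if verdict != "pass":
            break
    return results
```

A `max_gate` of "skip" would be handled before this function is called, since no ferry is submitted in that case.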
The gate is not chosen mechanically from file paths. The agent reads the
diff, judges (see Assess the diff), and confirms max_gate with the
reviewer before submitting any ferry. The reviewer can override with a
different max_gate (or skip) in the confirmation reply. There are no PR
labels — confirmation is a chat exchange in the invoking session.
Baseline freshness: tier1/tier2 baselines are <24h old; the tier3 baseline can be up to a week old (weekly schedule). Surface the baseline age in the comment.
gh pr diff <PR_NUMBER> # PRs
git diff <merge_base>...<head> # local
For each touched zephyr file, answer five yes/no questions:
| # | Question | Yes if… |
|---|---|---|
| 1 | Trivial? | comment/docstring/whitespace-only, rename, pure type-hint, log-string text, dead-code removal with no callers. |
| 2 | Affects shuffle? | scatter pipeline (hashing, fanout, combiner, byte-range sidecar), partitioning, k-way merge, chunk routing. |
| 3 | Affects memory? | buffer sizes, in-memory accumulation, chunk shapes, spill thresholds, retained references, RPC payload size. |
| 4 | Affects CPU? | hot loops, serialization paths, sort/merge inner loops, polling intervals, lock contention, JSON/parquet read/write. |
| 5 | Changes zephyr design? | new public API, changed actor protocol, changed stage semantics, changed .result() ordering, changed retry/error classification, changed plan/fusion rules. |
Use lib/zephyr/AGENTS.md and diff context to identify files most likely to
matter (scatter / planner / executor / sort / spill have the highest prior of
perf impact) — judgment, not a path-glob rule.
Decision (max_gate — Gate 1 always runs first regardless):
- […] max_gate = "skip".
- […] max_gate = "3".
- […] max_gate = "1".

Record the answers and a one-line rationale per file:
{
"max_gate": "3",
"rationale": "shuffle.py: changes scatter combiner from per-key to per-shard buffer (memory + CPU)",
"per_file": {
"lib/zephyr/src/zephyr/shuffle.py": {
"trivial": false, "shuffle": true, "memory": true, "cpu": true, "design": false,
"summary": "scatter combiner buffering changed"
}
}
}
Render this assessment as a small table in the final PR comment.
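One way to render the assessment JSON as a Markdown table is sketched below. The column labels and the yes/no wording are assumptions; only the field names (`per_file`, `trivial`, `shuffle`, `memory`, `cpu`, `design`, `summary`) come from the example above.

```python
from __future__ import annotations

# Assumed column labels for the five yes/no questions.
LABELS = {
    "trivial": "Trivial",
    "shuffle": "Shuffle",
    "memory": "Memory",
    "cpu": "CPU",
    "design": "Design",
}


def render_assessment_table(assessment: dict) -> str:
    """Build a Markdown table with one row per touched file."""
    header = "| File | " + " | ".join(LABELS.values()) + " | Summary |"
    divider = "|" + "---|" * (len(LABELS) + 2)
    rows = [header, divider]
    for path, answers in assessment["per_file"].items():
        cells = ["yes" if answers[key] else "no" for key in LABELS]
        rows.append(f"| {path} | " + " | ".join(cells) + f" | {answers['summary']} |")
    return "\n".join(rows)
```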
Confirm max_gate with the reviewer

Before submitting any ferry, post the assessment to the reviewer in the invoking chat session (not as a PR comment) and wait for confirmation:
🤖 Zephyr perf gate assessment
Proposed max_gate: <skip|1|2|3>
Rationale: <one-line summary>
Per-file:
- <path>: <one-line summary>
- ...
Reply "go" to run, or override with "max_gate=<skip|1|2|3>".
On override, record it in the assessment JSON rationale field (e.g.
"reviewer override: max_gate=2 — only minhash sensitivity matters"), then
proceed. If the reviewer says skip, post a one-line PR comment that the gate
was waived and stop — no ferry.
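The reply handling can be sketched as a small parser. This is an illustration under assumptions: the function name is hypothetical, and the accepted reply forms are limited to the ones named in the prompt template ("go", "max_gate=<skip|1|2|3>") plus a bare "skip".

```python
from __future__ import annotations

import re


def resolve_max_gate(proposed: str, reply: str) -> tuple[str, bool]:
    """Return (max_gate, overridden) for a reviewer reply.

    Raises ValueError for replies that are neither a confirmation nor an
    override, so the agent asks again instead of guessing.
    """
    text = reply.strip().lower()
    if text == "go":
        return proposed, False
    match = re.search(r"max_gate\s*=\s*(skip|[123])\b", text)
    if match:
        gate = match.group(1)
        return gate, gate != proposed
    if text == "skip":
        return "skip", proposed != "skip"
    raise ValueError(f"unrecognized reviewer reply: {reply!r}")
```

When the second element is true, the override text belongs in the assessment's rationale field, as described above.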
Each gate compares against the latest successful scheduled tier-N run on
main. Capture run id, head SHA, and createdAt for comment provenance; the
artifact download happens inside each gate (step 5d).
gh run list --repo marin-community/marin \
--workflow=marin-canary-datakit-tier<N>.yaml \
--branch=main --status=success --limit=1 \
--json databaseId,headSha,createdAt -q '.[0]'
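The baseline age shown in the comment can be derived from the `createdAt` value returned above. A minimal sketch, assuming the timestamp is an ISO 8601 UTC string ending in `Z`; the function name is hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timezone


def baseline_age_days(created_at: str, now: datetime | None = None) -> float:
    """Age of the baseline run in days, given its createdAt timestamp."""
    # Older Python versions reject a trailing "Z", so normalize it to an offset.
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    current = now if now is not None else datetime.now(timezone.utc)
    return (current - created).total_seconds() / 86400
```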
Iris bundles the working directory; submit the treatment from a worktree at the PR head.
PR_NUMBER=<PR_NUMBER>
TS=$(date -u +%Y%m%dT%H%M%SZ)
WT_DIR="../.zephyr_perf_worktrees"
mkdir -p "$WT_DIR"
TREATMENT_WT="$WT_DIR/${PR_NUMBER}-${TS}-treatment"
# Resolve the PR head commit and check it out in a detached worktree.
TREATMENT_SHA=$(gh pr view "$PR_NUMBER" --json headRefOid -q .headRefOid)
git worktree add "$TREATMENT_WT" "$TREATMENT_SHA"
Confirm the treatment compiles, type-checks, and passes the zephyr suite before paying for ferries:
( cd "$TREATMENT_WT" && \
./infra/pre-commit.py lib/zephyr/ && \
uv run pyrefly && \
uv run pytest lib/zephyr/tests/ )
Treatment-only — there is no control worktree. CI is assumed green on main; if
it isn't, the broken commit is upstream of the gate's concerns — call that out
separately.
If any of these fail, stop here. Do not submit ferries. Post a halt comment using the same sentinel as the verdict (so re-runs upsert in place):
PR=<PR_NUMBER>
REPO=marin-community/marin
BODY=$(mktemp)
cat > "$BODY" <<EOF
<!-- zephyr-perf-gate -->
🤖 ## Zephyr perf gate — halted (local tests failed)
Treatment worktree (\`$TREATMENT_SHA\`) failed lint / pyrefly / zephyr tests.
Ferries were not submitted.
Fix the failing tests and the gate will re-run.
EOF
EXISTING=$(gh api --paginate "repos/$REPO/issues/$PR/comments" \
--jq '.[] | select(.body | startswith("<!-- zephyr-perf-gate -->")) | .id' | head -1)
if [ -n "$EXISTING" ]; then
gh api --method PATCH "repos/$REPO/issues/comments/$EXISTING" -F "body=@$BODY"
else
gh api --method POST "repos/$REPO/issues/$PR/comments" -F "body=@$BODY"
fi
Same protocol for all three gates; substitute the tier-N specifics.
a. Submit the treatment ferry.
mkdir -p /tmp/zephyr-perf/<PR>
uv run python scripts/ci/submit_perf_run.py \
--gate <N> --pr <PR_NUMBER> --cwd "$TREATMENT_WT" \
--status-out gs://marin-us-central1/tmp/ttl=7d/zephyr-perf/<PR>/treatment-g<N>.json \
> /tmp/zephyr-perf/<PR>/submit-g<N>.json
TREATMENT_JOB_ID=$(jq -r .job_id < /tmp/zephyr-perf/<PR>/submit-g<N>.json)
submit_perf_run.py mirrors the iris CLI shape used by the tier-N workflow YAML
(region, memory, disk, cpu, priority, preemptibility, extra env vars). Drift
between this script and the tier YAML breaks parity — keep them in lockstep.
b. Babysit until terminal. Delegate to babysit-zephyr (or babysit-job for the outer Iris job). Don't poll tightly — sleep ≥ 10 min between checks for Gate 1, ≥ 15 min for Gate 2, ≥ 20 min for Gate 3. If the leg flakes (worker pool wedged, coord zombie), escalate to debug; do not silently retry — a flaky run masks a real regression.
c. Collect the treatment perf report. Use the same script the scheduled workflows use, so the JSON matches the baseline structurally:
uv run python scripts/ci/collect_perf_metrics.py \
--job-id "$TREATMENT_JOB_ID" \
--status gs://marin-us-central1/tmp/ttl=7d/zephyr-perf/<PR>/treatment-g<N>.json \
--out /tmp/zephyr-perf/<PR>/treatment-g<N>-perf-report.json
d. Pull the baseline perf report.
RUN_ID=$(gh run list --repo marin-community/marin \
--workflow=marin-canary-datakit-tier<N>.yaml \
--branch=main --status=success --limit=1 \
--json databaseId -q '.[0].databaseId')
mkdir -p /tmp/zephyr-perf/<PR>/baseline-g<N>
gh run download "$RUN_ID" --name datakit-tier<N>-perf-report \
--dir /tmp/zephyr-perf/<PR>/baseline-g<N>
# → /tmp/zephyr-perf/<PR>/baseline-g<N>/perf-report.json
If the artifact is missing/unreadable (rare — 90d retention), fall back to the
GCS mirror
gs://marin-us-central1/infra/datakit/ferry_perf/report_*_tier<N>/perf_report.json
and gsutil cp it locally.
e. Compare and write the verdict. Read both JSONs and write the verdict comment by hand following the threshold table. There is no compare script — judgment about cached steps, multi-attempt churn, and infra noise lives in the agent.
wall_seconds_total is the launcher-task wall time (max duration_ms
across the launcher's own tasks, in seconds). It excludes time spent in
JOB_STATE_PENDING / JOB_STATE_BUILDING waiting for capacity — that queue
wait isn't a perf signal (see Failure modes). stage_wall_seconds is the
actual pipeline-step work, derived from the iris job tree.
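The definition of wall_seconds_total can be illustrated with a short sketch. The task record shape is an assumption: only the `duration_ms` field is named in the text, and the function name is chosen here for illustration.

```python
from __future__ import annotations


def wall_seconds_total(launcher_tasks: list[dict]) -> float:
    """Longest launcher-task duration, converted from milliseconds to seconds.

    Queue time in PENDING/BUILDING is not part of any task's duration_ms,
    so it never enters this number.
    """
    return max(task["duration_ms"] for task in launcher_tasks) / 1000.0
```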
| Signal | ✅ Pass | ⚠ Warn | ❌ Hard fail |
|---|---|---|---|
| wall_seconds_total delta, (treatment − baseline) / baseline | ≤ +5% | +5–10% | > +10% |
| Per-step stage_wall_seconds delta (any stage) | ≤ +5% | +5–10% | > +10% |
| Any new entry in infra_failures (treatment > baseline in any bucket: oom, hardware_fault, scheduling_timeout, application_failure, other) | — | — | any |
| failed_shards strictly higher in treatment | — | — | any |
| peak_worker_memory_mb delta | ≤ +5% | +5–15% | > +15% |
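The numeric rows of the table can be read as the following sketch. It is an aid for applying the thresholds by hand, not a compare script; the function names are hypothetical and the boundary handling (exactly +5% passes, exactly +10% warns) follows the table's "≤" and ">" signs.

```python
from __future__ import annotations

# (warn_above, fail_above) as fractions, copied from the threshold table.
THRESHOLDS = {
    "wall_seconds_total": (0.05, 0.10),
    "stage_wall_seconds": (0.05, 0.10),
    "peak_worker_memory_mb": (0.05, 0.15),
}


def relative_delta(baseline: float, treatment: float) -> float:
    """(treatment - baseline) / baseline, as in the table's first row."""
    if baseline <= 0:
        raise ValueError("baseline must be positive (cached steps are handled separately)")
    return (treatment - baseline) / baseline


def classify_metric(name: str, baseline: float, treatment: float) -> str:
    """Return "pass", "warn", or "fail" for one numeric signal."""
    warn_above, fail_above = THRESHOLDS[name]
    delta = relative_delta(baseline, treatment)
    if delta > fail_above:
        return "fail"
    if delta > warn_above:
        return "warn"
    return "pass"


def new_infra_failures(baseline: dict, treatment: dict) -> list[str]:
    """Buckets where the treatment count exceeds baseline; any hit is a hard fail."""
    buckets = sorted(set(baseline) | set(treatment))
    return [b for b in buckets if treatment.get(b, 0) > baseline.get(b, 0)]
```

For example, a baseline of 1872 s against a treatment of 1924 s is a delta of about +2.8%, which passes.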
Mark ⚠ inconclusive (not pass/warn/fail) and re-submit the treatment when
any of these holds:
- treatment.preemption_count materially higher than baseline (e.g. > 3 over
  baseline, or > 0 when baseline is 0). Stage durations split across attempts
  aren't comparable.
- treatment.task_state_counts.preempted > 0.
- treatment.infra_failures.hardware_fault > 0 (TPU/CPU bad-node retry).

Do not call a regression on a single preempted or hardware-flaky run.
If a step appears in either report's cached_steps, its
stage_wall_seconds[step] is 0.0 and the delta is meaningless. Render "—" in
the per-step table; do not count toward the verdict. Note the cache hit in a
footnote.
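The cached-step rule can be sketched as follows. The report shape is an assumption inferred from the text: `cached_steps` is taken to be a list of step names and `stage_wall_seconds` a mapping from step name to seconds. The function name is hypothetical.

```python
from __future__ import annotations


def stage_delta_cells(baseline: dict, treatment: dict) -> dict[str, str]:
    """Per-step delta strings for the stage table; cached steps render as "—"."""
    cached = set(baseline.get("cached_steps", [])) | set(treatment.get("cached_steps", []))
    base_stages = baseline["stage_wall_seconds"]
    treat_stages = treatment["stage_wall_seconds"]
    cells: dict[str, str] = {}
    for step, base_seconds in base_stages.items():
        if step not in treat_stages:
            continue  # only steps present in both reports are comparable
        # A cached step reports 0.0, so its delta is meaningless; the zero
        # check also guards the division below.
        if step in cached or base_seconds == 0.0:
            cells[step] = "—"
            continue
        delta = (treat_stages[step] - base_seconds) / base_seconds
        cells[step] = f"{delta * 100:+.1f}%"
    return cells
```

Steps rendered as "—" are left out of the verdict and mentioned in a footnote.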
Before writing the comment, walk both JSONs:
- ferry_module match (else mis-wired).
- iris_job_id differs (else comparing a run to itself).
- task_state_counts totals roughly equal (large divergence usually means one
  side did less work; flag it).

The comment must begin with the sentinel and 🤖.
<!-- zephyr-perf-gate -->
🤖 ## Zephyr perf gate — Gate <N> (<tier name>)
**Verdict:** ✅ pass | ⚠ warn | ⚠ inconclusive | ❌ fail
**Baseline:** scheduled run [#<RUN_ID>](<url>), sha=`<sha>`, age=<N>d
**Hard fails:** … (omit if none)
**Warns:** … (omit if none)
### Diff assessment
(per-file table with the five yes/no answers + one-line summary, plus the
overall rationale — rendered from the step 1 assessment JSON)
### Run summary
| | Baseline | Treatment |
|---|---|---|
| Iris job | `<id>` | `<id>` |
| Status | succeeded | succeeded |
| Total wall-time | 31m 12s | 32m 04s (+2.8%) |
| Peak worker memory (MB) | 14202 | 14180 |
### Stage timings
| Stage | Baseline | Treatment | Δ | Verdict |
|---|---|---|---|---|
| download | 12s | 12s | +0% | ✅ |
| normalize | 14m 05s | 14m 30s | +3.0% | ✅ |
| minhash | 6m 50s | 7m 01s | +2.7% | ✅ |
### Infra
| | Baseline | Treatment |
|---|---|---|
| Preemptions | 0 | 0 |
| Failed shards | 0 | 0 |
| Infra failures | (none) | (none) |
| Task states | succeeded=42 | succeeded=42 |
<details><summary>Raw treatment report</summary>
(JSON contents of /tmp/zephyr-perf/<PR>/treatment-g<N>-perf-report.json)
</details>
If the gate returns ❌ fail, stop — post the verdict, don't escalate.
Sentinel-marked so re-runs replace rather than stack — find the existing comment, then patch or post:
PR=<PR_NUMBER>
REPO=marin-community/marin
BODY=/tmp/zephyr-perf/$PR/comment.md
EXISTING=$(gh api --paginate "repos/$REPO/issues/$PR/comments" \
--jq '.[] | select(.body | startswith("<!-- zephyr-perf-gate -->")) | .id' | head -1)
if [ -n "$EXISTING" ]; then
gh api --method PATCH "repos/$REPO/issues/comments/$EXISTING" -F "body=@$BODY"
else
gh api --method POST "repos/$REPO/issues/$PR/comments" -F "body=@$BODY"
fi
No separate issue is filed on ❌ fail. The author decides next steps.
git worktree remove "$TREATMENT_WT"
Wipe stale worktrees from earlier runs:
shopt -s nullglob
for wt in ../.zephyr_perf_worktrees/${PR_NUMBER}-*; do
git worktree remove --force "$wt"
done
shopt -u nullglob
- […] gs://marin-us-central1/infra/datakit/ferry_perf/report_*_tier<N>/perf_report.json.
  If both are unreachable, post a comment explaining the gap and ping the
  reviewer; do not submit a baseline ferry of your own.
- […] treatment.infra_failures.oom > baseline.infra_failures.oom is enough;
  surface the worker-pool death log with the OOM line in the comment.
- […] max_gate override (step 1a); re-run at the forced gate.
- […] preemption_count and task_state_counts.preempted materially higher than
  baseline. Mark verdict ⚠ inconclusive, surface the churn, re-submit treatment.
- […] JOB_STATE_PENDING/JOB_STATE_BUILDING before any stage starts. Not a perf
  signal: the gate measures stage wall-times. Note the queue wait if notable
  (>30 min); does not affect the verdict.
- […] ⚠ inconclusive rather than ⚠ warn. Hard-fail thresholds (>+10%, new
  infra failures) still apply.
- […] infra_failures.hardware_fault. Same handling as preemptions: count toward
  churn, re-run if pervasive.
- […] stage_wall_seconds is 0.0 and the step is in cached_steps. Surface "—"
  for delta; do not penalize. Note in a footnote.
- […] main is older than a week, surface the age prominently so the reviewer
  can decide whether to trust the comparison.

- babysit-zephyr: monitoring each run while in flight.
- babysit-job: the outer Iris job lifecycle.
- debug: when a leg flakes and the cause is unclear.