evaluate-zephyr-perf

Stars: 1,129

Forks: 133

Updated: June 19, 2026 at 21:26

Run a perf gate on a PR that touches lib/zephyr internals.


Source

marin-community

marin-community/marin


SKILL.md


name	evaluate-zephyr-perf
description	Run a perf gate on a PR that touches lib/zephyr internals.

Skill: Zephyr Perf Gate

Perf gate for Zephyr-internals PRs. The agent reads the diff, picks a max_gate, submits a treatment ferry, fetches the matching scheduled-baseline perf report, compares the two JSON reports against the threshold table, and posts one canonical comment to the PR.

The control side is always the latest successful scheduled marin-canary-datakit-tier<N> workflow run on main. Each tier's Capture perf report step uploads a datakit-tier<N>-perf-report workflow artifact (90-day retention) and mirrors the same JSON to gs://marin-us-central1/infra/datakit/ferry_perf/. The agent never submits a baseline ferry of its own.

Triggers only on changes to Zephyr internals (lib/zephyr/src/zephyr/**). Datakit / dedup / normalize / tokenize live in lib/marin/... and are out of scope. If a PR touches both, gate the Zephyr part; the datakit canaries cover the rest.

Autonomy

The agent may, without asking:

Read the PR diff and decide scope + max_gate.
Create a temporary git worktree at the PR head SHA.
Submit Iris ferry jobs at the same priority as the matching tier workflow.
Poll job state and pull coordinator logs.
Post one canonical comment on the PR (sentinel-marked, idempotent).
Escalate up the gate ladder (1 → 2 → 3) when the prior gate passes and max_gate allows.

The agent does not open follow-up issues — even on ❌ fail. The PR comment is the artifact; the author owns the response.

Ask before:

Re-running on a different cluster than lib/iris/config/marin.yaml.
Stopping a ferry that has not crossed its tier wall-time.

Trigger / Scope

Run when both hold:

The PR diff has at least one non-test, non-docs file under lib/zephyr/src/zephyr/** (or lib/zephyr/pyproject.toml), AND
The reviewer asked for a perf gate (comment or @-mention).

Out of scope (do not trigger):

lib/marin/src/marin/processing/classification/deduplication/** (dedup)
lib/marin/src/marin/datakit/normalize/** (normalize)
lib/marin/src/marin/processing/tokenize/** (tokenize)
lib/fray/** — flag it in the PR comment but don't auto-gate; ask the reviewer
Docs-only diffs (*.md, lib/zephyr/AGENTS.md, lib/zephyr/OPS.md)
Test-only changes (lib/zephyr/tests/**)

There is no path-glob script — the agent makes the scope call from the diff. When in doubt, ask the reviewer.

Gate ladder

Gate	Tier workflow	Schedule	Ferry / coverage	Wall-time
skip	—	—	All-trivial diff (comments, docstrings, type hints, renames). Reviewer must concur.	—
1 — smoke	`marin-canary-datakit-tier1.yaml`	daily 06:30 UTC	`experiments.ferries.datakit_ferry` (FineWeb-Edu sample/10BT). End-to-end pass at small scale.	~30–60 min
2 — long-tail stress	`marin-canary-datakit-tier2.yaml`	daily 07:00 UTC	`experiments.ferries.datakit_tier2_skewed_ferry`. Synthetic skewed doc-length distribution (log-normal mean ~5 KB body + Pareto tail + ~100 mega-docs in [128 MB, 256 MB]) — exercises spill, scatter, consolidate under buffer pressure.	~2.5 h
3 — nemotron	`marin-canary-datakit-tier3.yaml`	weekly Mon 01:00 UTC	`experiments.ferries.datakit_nemotron_ferry` with `quality=high`, `max_files=1000`. Production-scale bulk filtered web within the GH 6h cap. Runs in europe-west4, non-preemptible.	~3 h

Gate 1 always runs first, regardless of max_gate. If Gate 1 passes and max_gate >= 2, escalate to Gate 2; if Gate 2 passes and max_gate >= 3, escalate to Gate 3. If any gate fails, post the verdict and stop — no point burning bigger budget on a regression already proven at smaller scale.
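
The escalation rule can be sketched as a small loop. This is a minimal illustration, not part of the skill: `run_ladder` and the `run_gate` callback are hypothetical names, and stopping on any verdict other than "pass" (including warn and inconclusive) is an assumption, since the text only spells out the pass and fail cases.

```python
from typing import Callable, List, Tuple


def run_ladder(max_gate: str, run_gate: Callable[[int], str]) -> List[Tuple[int, str]]:
    """Run Gate 1 first, then escalate while the prior gate passes."""
    if max_gate == "skip":
        return []  # gate waived: no ferry is submitted
    ceiling = int(max_gate)
    results: List[Tuple[int, str]] = []
    for gate in (1, 2, 3):
        if gate > ceiling:
            break
        # Hypothetical callback: submit the ferry, babysit it, compare reports.
        verdict = run_gate(gate)
        results.append((gate, verdict))
        if verdict != "pass":
            break  # assumption: only a clean pass justifies a bigger budget
    return results
```

With `max_gate="3"` and a failure at Gate 2, the loop returns the Gate 1 and Gate 2 results and never reaches Gate 3.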

The gate is not chosen mechanically from file paths. The agent reads the diff, judges (see Assess the diff), and confirms max_gate with the reviewer before submitting any ferry. The reviewer can override with a different max_gate (or skip) in the confirmation reply. There are no PR labels — confirmation is a chat exchange in the invoking session.

Baseline freshness: tier1/tier2 baselines are <24h old; the tier3 baseline can be up to a week old (weekly schedule). Surface the baseline age in the comment.

Workflow

1. Assess the diff

gh pr diff <PR_NUMBER>          # PRs
git diff <merge_base>...<head>  # local

For each touched zephyr file, answer five yes/no questions:

#	Question	Yes if…
1	Trivial?	comment/docstring/whitespace-only, rename, pure type-hint, log-string text, dead-code removal with no callers.
2	Affects shuffle?	scatter pipeline (hashing, fanout, combiner, byte-range sidecar), partitioning, k-way merge, chunk routing.
3	Affects memory?	buffer sizes, in-memory accumulation, chunk shapes, spill thresholds, retained references, RPC payload size.
4	Affects CPU?	hot loops, serialization paths, sort/merge inner loops, polling intervals, lock contention, JSON/parquet read/write.
5	Changes zephyr design?	new public API, changed actor protocol, changed stage semantics, changed `.result()` ordering, changed retry/error classification, changed plan/fusion rules.

Use lib/zephyr/AGENTS.md and diff context to identify files most likely to matter (scatter / planner / executor / sort / spill have the highest prior of perf impact) — judgment, not a path-glob rule.

Decision (max_gate — Gate 1 always runs first regardless):

All-trivial (q1 yes everywhere, q2–q5 no everywhere) → max_gate = "skip".
Any q2 / q3 / q4 / q5 = yes anywhere → max_gate = "3".
Otherwise → max_gate = "1".

Record the answers and a one-line rationale per file:

{
  "max_gate": "3",
  "rationale": "shuffle.py: changes scatter combiner from per-key to per-shard buffer (memory + CPU)",
  "per_file": {
    "lib/zephyr/src/zephyr/shuffle.py": {
      "trivial": false, "shuffle": true, "memory": true, "cpu": true, "design": false,
      "summary": "scatter combiner buffering changed"
    }
  }
}

Render this assessment as a small table in the final PR comment.
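
The decision rule can be expressed over the same `per_file` shape as the assessment JSON. This is a sketch only: `decide_max_gate` is a hypothetical helper, the skill ships no such script, and the five answers themselves still come from the agent's reading of the diff.

```python
from typing import Dict

# q2 to q5 from the question table; q1 is the "trivial" key.
IMPACT_KEYS = ("shuffle", "memory", "cpu", "design")


def decide_max_gate(per_file: Dict[str, Dict[str, object]]) -> str:
    """Map per-file yes/no answers to a proposed max_gate."""
    answers = list(per_file.values())
    # Any q2/q3/q4/q5 = yes anywhere -> "3".
    if any(a[key] for a in answers for key in IMPACT_KEYS):
        return "3"
    # q1 yes everywhere (and q2 to q5 no everywhere, checked above) -> "skip".
    if answers and all(a["trivial"] for a in answers):
        return "skip"
    # Otherwise -> "1". An empty assessment also lands here (assumption).
    return "1"
```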

1a. Confirm `max_gate` with the reviewer

Before submitting any ferry, post the assessment to the reviewer in the invoking chat session (not as a PR comment) and wait for confirmation:

🤖 Zephyr perf gate assessment

Proposed max_gate: <skip|1|2|3>
Rationale: <one-line summary>

Per-file:
- <path>: <one-line summary>
- ...

Reply "go" to run, or override with "max_gate=<skip|1|2|3>".

On override, record it in the assessment JSON rationale field (e.g. "reviewer override: max_gate=2 — only minhash sensitivity matters"), then proceed. If the reviewer says skip, post a one-line PR comment that the gate was waived and stop — no ferry.
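
Interpreting the reviewer's reply could look like the sketch below. `resolve_max_gate` is a hypothetical name, and treating anything other than "go", "skip", or a `max_gate=` override as unclear is an assumption; the skill leaves the exchange to the chat session.

```python
import re
from typing import Optional

_OVERRIDE = re.compile(r"max_gate\s*=\s*(skip|1|2|3)\b")


def resolve_max_gate(reply: str, proposed: str) -> Optional[str]:
    """Return the max_gate to run with, or None when the reply is unclear."""
    text = reply.strip().lower()
    match = _OVERRIDE.search(text)
    if match:
        return match.group(1)  # explicit override wins
    if text == "skip":
        return "skip"          # reviewer waived the gate
    if text == "go":
        return proposed        # reviewer accepted the proposal
    return None                # ask again rather than guess
```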

2. Locate the scheduled baseline

Each gate compares against the latest successful scheduled tier-N run on main. Capture run id, head SHA, and createdAt for comment provenance; the artifact download happens inside each gate (step 5d).

gh run list --repo marin-community/marin \
  --workflow=marin-canary-datakit-tier<N>.yaml \
  --branch=main --status=success --limit=1 \
  --json databaseId,headSha,createdAt -q '.[0]'
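
The `createdAt` value returned above is what the baseline age in the comment is derived from. A minimal sketch, assuming an ISO 8601 UTC timestamp such as `2026-06-19T07:00:00Z`; the helper names are hypothetical, and the 24-hour limit for tier1/tier2 is inferred from the freshness note rather than stated as a rule.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Expected freshness per gate: daily schedules for tier1/tier2, weekly for tier3.
FRESHNESS = {1: timedelta(hours=24), 2: timedelta(hours=24), 3: timedelta(days=7)}


def baseline_age(created_at: str, now: Optional[datetime] = None) -> timedelta:
    """Age of the baseline run, from its createdAt timestamp."""
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return (now or datetime.now(timezone.utc)) - created


def is_stale(gate: int, age: timedelta) -> bool:
    """True when the baseline is older than its schedule would suggest."""
    return age > FRESHNESS[gate]
```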

3. Set up the treatment worktree

Iris bundles the working directory; submit the treatment from a worktree at the PR head.

TS=$(date -u +%Y%m%dT%H%M%SZ)
WT_DIR="../.zephyr_perf_worktrees"
mkdir -p "$WT_DIR"
TREATMENT_WT="$WT_DIR/${PR_NUMBER}-${TS}-treatment"

# Resolve the PR head SHA directly; no temp file is needed.
TREATMENT_SHA=$(gh pr view "$PR_NUMBER" --json headRefOid -q .headRefOid)
git worktree add "$TREATMENT_WT" "$TREATMENT_SHA"

4. Run zephyr tests on the treatment worktree

Confirm the treatment compiles, type-checks, and passes the zephyr suite before paying for ferries:

( cd "$TREATMENT_WT" && \
  ./infra/pre-commit.py lib/zephyr/ && \
  uv run pyrefly && \
  uv run pytest lib/zephyr/tests/ )

Treatment-only — there is no control worktree. CI is assumed green on main; if it isn't, the broken commit is upstream of the gate's concerns — call that out separately.

If any of these fail, stop here. Do not submit ferries. Post a halt comment using the same sentinel as the verdict (so re-runs upsert in place):

PR=<PR_NUMBER>
REPO=marin-community/marin
BODY=$(mktemp)
cat > "$BODY" <<EOF
<!-- zephyr-perf-gate -->
🤖 ## Zephyr perf gate — halted (local tests failed)

Treatment worktree (\`$TREATMENT_SHA\`) failed lint / pyrefly / zephyr tests.
Ferries were not submitted.

Fix the failing tests and the gate will re-run.
EOF
EXISTING=$(gh api --paginate "repos/$REPO/issues/$PR/comments" \
  --jq '.[] | select(.body | startswith("<!-- zephyr-perf-gate -->")) | .id' | head -1)
if [ -n "$EXISTING" ]; then
  gh api --method PATCH "repos/$REPO/issues/comments/$EXISTING" -F "body=@$BODY"
else
  gh api --method POST  "repos/$REPO/issues/$PR/comments"      -F "body=@$BODY"
fi

5. Run a gate (Gate 1 always; Gates 2/3 conditional on prior pass)

Same protocol for all three gates; substitute the tier-N specifics.

a. Submit the treatment ferry.

mkdir -p /tmp/zephyr-perf/<PR>
uv run python scripts/ci/submit_perf_run.py \
  --gate <N> --pr <PR_NUMBER> --cwd "$TREATMENT_WT" \
  --status-out gs://marin-us-central1/tmp/ttl=7d/zephyr-perf/<PR>/treatment-g<N>.json \
  > /tmp/zephyr-perf/<PR>/submit-g<N>.json
TREATMENT_JOB_ID=$(jq -r .job_id < /tmp/zephyr-perf/<PR>/submit-g<N>.json)

submit_perf_run.py mirrors the iris CLI shape used by the tier-N workflow YAML (region, memory, disk, cpu, priority, preemptibility, extra env vars). Drift between this script and the tier YAML breaks parity — keep them in lockstep.

b. Babysit until terminal. Delegate to babysit-zephyr (or babysit-job for the outer Iris job). Don't poll tightly — sleep ≥ 10 min between checks for Gate 1, ≥ 15 min for Gate 2, ≥ 20 min for Gate 3. If the leg flakes (worker pool wedged, coord zombie), escalate to debug; do not silently retry — a flaky run masks a real regression.

c. Collect the treatment perf report. Use the same script the scheduled workflows use, so the JSON matches the baseline structurally:

uv run python scripts/ci/collect_perf_metrics.py \
  --job-id "$TREATMENT_JOB_ID" \
  --status gs://marin-us-central1/tmp/ttl=7d/zephyr-perf/<PR>/treatment-g<N>.json \
  --out /tmp/zephyr-perf/<PR>/treatment-g<N>-perf-report.json

d. Pull the baseline perf report.

RUN_ID=$(gh run list --repo marin-community/marin \
  --workflow=marin-canary-datakit-tier<N>.yaml \
  --branch=main --status=success --limit=1 \
  --json databaseId -q '.[0].databaseId')
mkdir -p /tmp/zephyr-perf/<PR>/baseline-g<N>
gh run download "$RUN_ID" --name datakit-tier<N>-perf-report \
  --dir /tmp/zephyr-perf/<PR>/baseline-g<N>
# → /tmp/zephyr-perf/<PR>/baseline-g<N>/perf-report.json

If the artifact is missing/unreadable (rare — 90d retention), fall back to the GCS mirror gs://marin-us-central1/infra/datakit/ferry_perf/report_*_tier<N>/perf_report.json and gsutil cp it locally.

e. Compare and write the verdict. Read both JSONs and write the verdict comment by hand following the threshold table. There is no compare script — judgment about cached steps, multi-attempt churn, and infra noise lives in the agent.

Threshold table (apply per gate)

wall_seconds_total is the launcher-task wall time (max duration_ms across the launcher's own tasks, in seconds). It excludes time spent in JOB_STATE_PENDING / JOB_STATE_BUILDING waiting for capacity — that queue wait isn't a perf signal (see Failure modes). stage_wall_seconds is the actual pipeline-step work, derived from the iris job tree.

Signal	✅ Pass	⚠ Warn	❌ Hard fail
`wall_seconds_total` delta (treatment − baseline) / baseline	≤ +5%	+5–10%	> +10%
Per-step `stage_wall_seconds` delta (any stage)	≤ +5%	+5–10%	> +10%
Any new entry in `infra_failures` (treatment > baseline in any bucket: `oom`, `hardware_fault`, `scheduling_timeout`, `application_failure`, `other`)	—	—	any
`failed_shards` strictly higher in treatment	—	—	any
`peak_worker_memory_mb` delta	≤ +5%	+5–15%	> +15%
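
The percentage bands in the table reduce to one relative-delta computation and two cut-offs. The sketch below is illustrative: the function names are hypothetical, and since there is no compare script the agent still applies judgment about noise and cached steps around it.

```python
def relative_delta(baseline: float, treatment: float) -> float:
    """(treatment - baseline) / baseline, as a fraction."""
    return (treatment - baseline) / baseline


def classify(delta: float, warn_above: float = 0.05, fail_above: float = 0.10) -> str:
    """Map a relative delta onto the pass / warn / fail bands."""
    if delta > fail_above:
        return "fail"
    if delta > warn_above:
        return "warn"
    return "pass"


# Wall-time bands are the defaults; peak_worker_memory_mb uses a +15% hard-fail line.
wall = classify(relative_delta(1872.0, 1924.0))                      # 31m 12s vs 32m 04s
memory = classify(relative_delta(14202.0, 14180.0), fail_above=0.15)
```

A delta of exactly +10% falls in the warn band here, because the hard-fail column is strictly greater than +10%.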

Inconclusive (infra noise, not a code regression)

Mark ⚠ inconclusive (not pass/warn/fail) and re-submit the treatment when any of these holds:

treatment.preemption_count materially higher than baseline (e.g. > 3 over baseline, or > 0 when baseline is 0). Stage durations split across attempts aren't comparable.
treatment.task_state_counts.preempted > 0.
treatment.infra_failures.hardware_fault > 0 (TPU/CPU bad-node retry).
The treatment ran on a visibly different cluster generation than the baseline.

Do not call a regression on a single preempted or hardware-flaky run.
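
The first three conditions can be read straight off the reports. A sketch, assuming each report is a nested dict addressed by the dotted names used above; `is_inconclusive` is a hypothetical helper, the "more than 3 over baseline" cut-off is the example figure from the list rather than a fixed rule, and the cluster-generation check is left out because no report field for it is named here.

```python
def is_inconclusive(baseline: dict, treatment: dict) -> bool:
    """True when infra noise makes the comparison unreliable."""
    base_pre = baseline.get("preemption_count", 0)
    treat_pre = treatment.get("preemption_count", 0)
    # Preemptions where the baseline had none, or materially more of them.
    if base_pre == 0 and treat_pre > 0:
        return True
    if treat_pre - base_pre > 3:
        return True
    # Any preempted task in the treatment run.
    if treatment.get("task_state_counts", {}).get("preempted", 0) > 0:
        return True
    # Any bad-node retry in the treatment run.
    if treatment.get("infra_failures", {}).get("hardware_fault", 0) > 0:
        return True
    return False
```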

Cached steps

If a step appears in either report's cached_steps, its stage_wall_seconds[step] is 0.0 and the delta is meaningless. Render "—" in the per-step table; do not count toward the verdict. Note the cache hit in a footnote.
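
Rendering one cell of the per-step delta column might look like this. It is a sketch under assumptions: `cached_steps` is taken to be a list of step names and `stage_wall_seconds` a mapping from step name to seconds, and `step_delta_cell` is a hypothetical helper.

```python
NO_DELTA = "\u2014"  # the dash rendered for cached steps


def step_delta_cell(step: str, baseline: dict, treatment: dict) -> str:
    """Format the delta cell for one stage, or a dash when it was cached."""
    cached = set(baseline.get("cached_steps", [])) | set(treatment.get("cached_steps", []))
    base = baseline["stage_wall_seconds"].get(step, 0.0)
    treat = treatment["stage_wall_seconds"].get(step, 0.0)
    # A cached step reports 0.0 seconds, so its delta carries no signal.
    if step in cached or base == 0:
        return NO_DELTA
    return f"{(treat - base) / base * 100:+.1f}%"
```

Steps that return the dash should also be excluded from the verdict and mentioned in a footnote.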

Agent self-check

Before writing the comment, walk both JSONs:

Treatment and baseline ferry_module match (else mis-wired).
iris_job_id differs (else comparing a run to itself).
task_state_counts totals roughly equal (large divergence usually means one side did less work — flag it).
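
The three checks can be sketched as a function that returns a list of problems to flag. `self_check` is a hypothetical helper, and the 10% tolerance for "roughly equal" is an assumption chosen for illustration.

```python
from typing import List


def self_check(baseline: dict, treatment: dict, tolerance: float = 0.10) -> List[str]:
    """Return human-readable problems; an empty list means the reports are comparable."""
    problems: List[str] = []
    if baseline["ferry_module"] != treatment["ferry_module"]:
        problems.append("ferry_module mismatch: reports are mis-wired")
    if baseline["iris_job_id"] == treatment["iris_job_id"]:
        problems.append("identical iris_job_id: comparing a run to itself")
    base_total = sum(baseline["task_state_counts"].values())
    treat_total = sum(treatment["task_state_counts"].values())
    # Hypothetical tolerance: flag when totals diverge by more than 10%.
    if abs(treat_total - base_total) > tolerance * max(base_total, 1):
        problems.append("task totals diverge: one side may have done less work")
    return problems
```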

Comment shape (canonical)

The comment must begin with the sentinel and 🤖.

<!-- zephyr-perf-gate -->
🤖 ## Zephyr perf gate — Gate <N> (<tier name>)

**Verdict:** ✅ pass | ⚠ warn | ⚠ inconclusive | ❌ fail

**Baseline:** scheduled run [#<RUN_ID>](<url>), sha=`<sha>`, age=<N>d

**Hard fails:** … (omit if none)
**Warns:** … (omit if none)

### Diff assessment

(per-file table with the five yes/no answers + one-line summary, plus the
overall rationale — rendered from the step 1 assessment JSON)

### Run summary

| | Baseline | Treatment |
|---|---|---|
| Iris job | `<id>` | `<id>` |
| Status | succeeded | succeeded |
| Total wall-time | 31m 12s | 32m 04s (+2.8%) |
| Peak worker memory (MB) | 14202 | 14180 |

### Stage timings

| Stage | Baseline | Treatment | Δ | Verdict |
|---|---|---|---|---|
| download | 12s | 12s | +0% | ✅ |
| normalize | 14m 05s | 14m 30s | +3.0% | ✅ |
| minhash | 6m 50s | 7m 01s | +2.7% | ✅ |

### Infra

| | Baseline | Treatment |
|---|---|---|
| Preemptions | 0 | 0 |
| Failed shards | 0 | 0 |
| Infra failures | (none) | (none) |
| Task states | succeeded=42 | succeeded=42 |

<details><summary>Raw treatment report</summary>

(JSON contents of /tmp/zephyr-perf/<PR>/treatment-g<N>-perf-report.json)

</details>

If the gate returns ❌ fail, stop — post the verdict, don't escalate.

6. Post one canonical comment

Sentinel-marked so re-runs replace rather than stack — find the existing comment, then patch or post:

PR=<PR_NUMBER>
REPO=marin-community/marin
BODY=/tmp/zephyr-perf/$PR/comment.md
EXISTING=$(gh api --paginate "repos/$REPO/issues/$PR/comments" \
  --jq '.[] | select(.body | startswith("<!-- zephyr-perf-gate -->")) | .id' | head -1)

if [ -n "$EXISTING" ]; then
  gh api --method PATCH "repos/$REPO/issues/comments/$EXISTING" -F "body=@$BODY"
else
  gh api --method POST  "repos/$REPO/issues/$PR/comments"      -F "body=@$BODY"
fi
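
The sentinel lookup is what makes the post idempotent. The same selection, sketched in Python over an already-fetched list of issue comments (each a dict with `id` and `body`); `find_existing_comment` is a hypothetical name and the network calls are left out.

```python
from typing import Iterable, Optional

SENTINEL = "<!-- zephyr-perf-gate -->"


def find_existing_comment(comments: Iterable[dict]) -> Optional[int]:
    """Return the id of the first sentinel-marked comment, or None."""
    for comment in comments:
        if (comment.get("body") or "").startswith(SENTINEL):
            return comment["id"]
    return None
```

A returned id means PATCH that comment; `None` means POST a new one.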

No separate issue is filed on ❌ fail. The author decides next steps.

7. Clean up

git worktree remove "$TREATMENT_WT"

Wipe stale worktrees from earlier runs:

shopt -s nullglob
for wt in ../.zephyr_perf_worktrees/${PR_NUMBER}-*; do
  git worktree remove --force "$wt"
done
shopt -u nullglob

Failure modes

Treatment flakes: re-submit treatment; do not call the gate on a single failed run. If it flakes again, escalate to debug.
Baseline artifact missing/unreadable: try the GCS mirror gs://marin-us-central1/infra/datakit/ferry_perf/report_*_tier<N>/perf_report.json. If both are unreachable, post a comment explaining the gap and ping the reviewer; do not submit a baseline ferry of your own.
Treatment OOMs at a stage the baseline didn't: hard fail. treatment.infra_failures.oom > baseline.infra_failures.oom is enough — surface the worker-pool death log with the OOM line in the comment.
Agent says out of scope but the reviewer disagrees: the reviewer re-invokes with an explicit max_gate override (step 1a); re-run at the forced gate.
Iris worker preemptions during the run: spot-VM preemptions inflate wall-time, retry counts, and churn. Signals: preemption_count and task_state_counts.preempted materially higher than baseline. Mark verdict ⚠ inconclusive, surface the churn, re-submit treatment.
Cluster scheduling delay (queue wait, not pipeline wall-time): the job sits in JOB_STATE_PENDING/JOB_STATE_BUILDING before any stage starts. Not a perf signal — the gate measures stage wall-times. Note the queue wait if notable (>30 min); does not affect the verdict.
Cluster contention / mixed worker generations: a competing large job or an autoscaler bringing up a different machine type/zone can shift baseline timing 10–30% with no code change. If the wall-time delta is in the warn band and the contention signal is plausible, mark ⚠ inconclusive rather than ⚠ warn. Hard-fail thresholds (>+10%, new infra failures) still apply.
TPU/CPU bad-node retries: surface as infra_failures.hardware_fault. Same handling as preemptions — count toward churn, re-run if pervasive.
Cached steps in baseline but not treatment (or vice versa): the step's stage_wall_seconds is 0.0 and the step is in cached_steps. Surface "—" for delta; do not penalize. Note in a footnote.
Stale tier3 baseline (>1 week old): tier3 runs weekly. If the latest successful tier3 run on main is older than a week, surface the age prominently so the reviewer can decide whether to trust the comparison.

Composes with

babysit-zephyr — monitoring each run while in flight.
babysit-job — the outer Iris job lifecycle.
debug — when a leg flakes and the cause is unclear.

