triage-canary
Triage a failed canary ferry run (CI-invoked).
| name | triage-canary |
| description | Triage a failed canary ferry run (CI-invoked). |
Triage a failed canary ferry run. Diagnose root cause, file a GitHub issue, write a Slack summary. Diagnosis and reporting only — no code changes, no PRs.
| Variable | Description |
|---|---|
| `CANARY_LANE` | `gpu` (CoreWeave) or `tpu` (GCP) |
| `CANARY_JOB_ID` | Iris job ID |
| `CANARY_RUN_ID` | W&B run ID |
| `IRIS_CONFIG` | Path to Iris cluster config |
| `IRIS_NAMESPACE` | Kubernetes namespace (CW only) |
| `WANDB_ENTITY` | W&B entity |
| `WANDB_PROJECT` | W&B project |
| `GHA_RUN_URL` | Full URL to the GitHub Actions run |
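As a rough illustration, the variables above could be read and sanity-checked once at the start of triage. This is only a sketch: the `CanaryEnv` dataclass and `load_canary_env` helper are hypothetical names, and treating every variable except `IRIS_NAMESPACE` as required is an assumption, not something the skill states.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: all variables except IRIS_NAMESPACE must be set.
REQUIRED = (
    "CANARY_LANE", "CANARY_JOB_ID", "CANARY_RUN_ID", "IRIS_CONFIG",
    "WANDB_ENTITY", "WANDB_PROJECT", "GHA_RUN_URL",
)


@dataclass(frozen=True)
class CanaryEnv:
    """Hypothetical container for the triage inputs."""
    lane: str
    job_id: str
    run_id: str
    iris_config: str
    iris_namespace: Optional[str]
    wandb_entity: str
    wandb_project: str
    gha_run_url: str


def load_canary_env(environ: Mapping[str, str] = os.environ) -> CanaryEnv:
    """Read the triage variables and fail early on missing or invalid values."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise KeyError("missing canary variables: " + ", ".join(missing))

    lane = environ["CANARY_LANE"]
    if lane not in ("gpu", "tpu"):
        raise ValueError(f"CANARY_LANE must be 'gpu' or 'tpu', got {lane!r}")

    # IRIS_NAMESPACE only applies to the CoreWeave (gpu) lane, where it
    # defaults to iris-ci.
    namespace = environ.get("IRIS_NAMESPACE") or None
    if lane == "gpu" and namespace is None:
        namespace = "iris-ci"

    return CanaryEnv(
        lane=lane,
        job_id=environ["CANARY_JOB_ID"],
        run_id=environ["CANARY_RUN_ID"],
        iris_config=environ["IRIS_CONFIG"],
        iris_namespace=namespace,
        wandb_entity=environ["WANDB_ENTITY"],
        wandb_project=environ["WANDB_PROJECT"],
        gha_run_url=environ["GHA_RUN_URL"],
    )
```

Failing early on a missing variable keeps a misconfigured workflow from being mistaken for a canary failure.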
The cluster is still live. Collect signal now — it will be torn down after you.
- `.venv/bin/iris --config=$IRIS_CONFIG job list --json`
- GPU lane (CoreWeave): kubeconfig `~/.kube/coreweave-iris`, namespace `$IRIS_NAMESPACE` (defaults to `iris-ci`; the canary shares this namespace with PR CI).
- Get pod status, controller logs, task pod logs, warning events, pod describe.
- Filter by `iris.job_id=<CANARY_JOB_ID with '/' replaced by '.'>` so you only see this canary's pods, not co-tenant CI pods. Example: `kubectl -n iris-ci get pods -l iris.job_id=runner.iris-run-job-abc123`
- `iris process logs` and `iris job list`.
- `scripts/ci/validate_canary_metrics.py` if you need the validation output.

Classify into one of: infra/scheduling, training crash, metric regression, controller bug, data/storage.
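The label filter described above (the job ID with `/` replaced by `.`) can be sketched as a small helper. The function names here are hypothetical, and the input job ID `runner/iris-run-job-abc123` is inferred from the example selector rather than taken from a real run.

```python
from typing import List


def job_label_selector(job_id: str) -> str:
    """Build the iris.job_id label selector: '/' in the job ID becomes '.'."""
    return "iris.job_id=" + job_id.replace("/", ".")


def kubectl_get_pods_argv(job_id: str, namespace: str = "iris-ci") -> List[str]:
    """Return the kubectl argv that lists only this canary's pods."""
    return [
        "kubectl", "-n", namespace,
        "get", "pods",
        "-l", job_label_selector(job_id),
    ]


if __name__ == "__main__":
    # Hypothetical job ID, chosen to match the example selector in the text.
    print(" ".join(kubectl_get_pods_argv("runner/iris-run-job-abc123")))
```

Building the argv as a list, rather than a shell string, avoids quoting problems if the command is later passed to `subprocess.run`.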
Use hypothesis-driven diagnosis: state hypothesis, gather evidence, narrow. Attempt to reproduce the issue locally and minimally. Triple check that you're narrowing down on the same issue as the one that actually broke the canary.
Follow the file-issue skill. Use the bug-report template.
- Title: `[canary-{lane}] {short failure description}`
- Labels: `bug`, `agent-generated`, `canary`
- Use `--body-file` with a temp file (see file-issue skill for the pattern).

Write `slack_message.md` to the repo root. The workflow reads this file and sends it to Slack. Always write this file, even if issue creation failed.
Format — keep to 4 lines max:
:red_circle: *{GPU|TPU} Canary failed* — {one-line summary}
*Root cause:* {category} — {1 sentence}
*Issue:* {github issue URL}
*GHA run:* {GHA_RUN_URL}
If the root cause is unclear, say so: write "root cause unclear" and list your best-guess signals.
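A minimal sketch of producing the four-line message and writing it to the repo root might look like the following. The helper names are hypothetical, and the placeholder text used when no issue URL is available is an assumption; the skill only requires that the file is always written.

```python
from pathlib import Path
from typing import Optional

# The dash used in the message format, written as an escape for clarity.
DASH = "\u2014"


def build_slack_message(
    lane: str,
    summary: str,
    category: str,
    cause: str,
    issue_url: Optional[str],
    gha_run_url: str,
) -> str:
    """Render the four-line Slack summary in the format given above."""
    if lane not in ("gpu", "tpu"):
        raise ValueError(f"lane must be 'gpu' or 'tpu', got {lane!r}")
    # Assumption: a short placeholder stands in when issue creation failed.
    issue = issue_url or "issue creation failed"
    lines = [
        f":red_circle: *{lane.upper()} Canary failed* {DASH} {summary}",
        f"*Root cause:* {category} {DASH} {cause}",
        f"*Issue:* {issue}",
        f"*GHA run:* {gha_run_url}",
    ]
    return "\n".join(lines) + "\n"


def write_slack_message(repo_root: str, message: str) -> Path:
    """Write slack_message.md to the repo root and return its path."""
    path = Path(repo_root) / "slack_message.md"
    path.write_text(message, encoding="utf-8")
    return path
```

Keeping message construction separate from the file write makes it straightforward to call `write_slack_message` from a `finally` block, so the file exists even when an earlier step raised.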