triage-canary

Stars: 1,129

Forks: 133

Updated: June 19, 2026 at 21:26

Triage a failed canary ferry run (CI-invoked).

Source: marin-community/marin

SKILL.md

name: triage-canary
description: Triage a failed canary ferry run (CI-invoked).

Skill: Triage Canary

Triage a failed canary ferry run. Diagnose root cause, file a GitHub issue, write a Slack summary. Diagnosis and reporting only — no code changes, no PRs.

Inputs (environment variables)

| Variable | Description |
| --- | --- |
| `CANARY_LANE` | `gpu` (CoreWeave) or `tpu` (GCP) |
| `CANARY_JOB_ID` | Iris job ID |
| `CANARY_RUN_ID` | W&B run ID |
| `IRIS_CONFIG` | Path to Iris cluster config |
| `IRIS_NAMESPACE` | Kubernetes namespace (CW only) |
| `WANDB_ENTITY` | W&B entity |
| `WANDB_PROJECT` | W&B project |
| `GHA_RUN_URL` | Full URL to the GitHub Actions run |
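
A minimal sketch of reading the variables above into one validated object, so later steps fail early on a missing input. The `CanaryInputs` class and `load_inputs` helper are hypothetical names, not part of the repository. The W&B run URL layout (`https://wandb.ai/<entity>/<project>/runs/<run id>`) is an assumption based on the usual W&B URL scheme, and treating every variable except `IRIS_NAMESPACE` as required is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: everything except IRIS_NAMESPACE must be set.
REQUIRED = (
    "CANARY_LANE", "CANARY_JOB_ID", "CANARY_RUN_ID", "IRIS_CONFIG",
    "WANDB_ENTITY", "WANDB_PROJECT", "GHA_RUN_URL",
)


@dataclass(frozen=True)
class CanaryInputs:
    """Hypothetical container for the skill's environment inputs."""
    lane: str
    job_id: str
    run_id: str
    iris_config: str
    namespace: Optional[str]
    wandb_entity: str
    wandb_project: str
    gha_run_url: str

    @property
    def wandb_run_url(self) -> str:
        # Assumption: standard W&B run URL layout.
        return (f"https://wandb.ai/{self.wandb_entity}/"
                f"{self.wandb_project}/runs/{self.run_id}")


def load_inputs(env: Mapping[str, str] = os.environ) -> CanaryInputs:
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    lane = env["CANARY_LANE"]
    if lane not in ("gpu", "tpu"):
        raise ValueError(f"CANARY_LANE must be 'gpu' or 'tpu', got {lane!r}")
    # The namespace only applies to the GPU (CoreWeave) lane; it defaults to iris-ci.
    namespace = env.get("IRIS_NAMESPACE") or ("iris-ci" if lane == "gpu" else None)
    return CanaryInputs(
        lane=lane,
        job_id=env["CANARY_JOB_ID"],
        run_id=env["CANARY_RUN_ID"],
        iris_config=env["IRIS_CONFIG"],
        namespace=namespace,
        wandb_entity=env["WANDB_ENTITY"],
        wandb_project=env["WANDB_PROJECT"],
        gha_run_url=env["GHA_RUN_URL"],
    )
```

Calling `load_inputs()` with no argument reads the live process environment; passing a plain dict makes the helper easy to exercise outside CI.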

Steps

1. Gather diagnostics

The cluster is still live. Collect signal now; it will be torn down after you finish.

- Iris job state via `.venv/bin/iris --config=$IRIS_CONFIG job list --json`.
- GPU lane: you have `kubectl` at `~/.kube/coreweave-iris`, namespace `$IRIS_NAMESPACE` (defaults to `iris-ci`; the canary shares this namespace with PR CI). Get pod status, controller logs, task pod logs, warning events, and pod describe output. Filter by `iris.job_id=<CANARY_JOB_ID with '/' replaced by '.'>` so you see only this canary's pods, not co-tenant CI pods. Example: `kubectl -n iris-ci get pods -l iris.job_id=runner.iris-run-job-abc123`.
- TPU lane: use `iris process logs` and `iris job list`.
- Re-run `scripts/ci/validate_canary_metrics.py` if you need the validation output.
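
The GPU-lane label filter above can be sketched as follows. This only builds the `kubectl` command lines; it does not run them. The helper names are hypothetical. Reading `~/.kube/coreweave-iris` as a kubeconfig file passed via `--kubeconfig` is an assumption, as are the sample job ID `runner/iris-run-job-abc123` and the `--tail=200` limit. Controller logs are left out because the controller's pod name or label is not given here.

```python
import os
import shlex
from typing import List


def job_label_selector(job_id: str) -> str:
    """Build the label selector: '/' in the job ID is replaced by '.'."""
    return "iris.job_id=" + job_id.replace("/", ".")


def gpu_diagnostic_commands(
    job_id: str,
    namespace: str = "iris-ci",
    kubeconfig: str = "~/.kube/coreweave-iris",  # assumption: this is a kubeconfig path
) -> List[List[str]]:
    selector = job_label_selector(job_id)
    base = ["kubectl", "--kubeconfig", os.path.expanduser(kubeconfig), "-n", namespace]
    return [
        base + ["get", "pods", "-l", selector, "-o", "wide"],            # pod status
        base + ["describe", "pods", "-l", selector],                     # pod describe
        base + ["logs", "-l", selector, "--all-containers", "--tail=200"],  # task pod logs
        # Events are not filtered by the pod label, so this lists every
        # warning in the shared namespace, including co-tenant CI pods.
        base + ["get", "events", "--field-selector", "type=Warning"],
    ]


if __name__ == "__main__":
    for cmd in gpu_diagnostic_commands("runner/iris-run-job-abc123"):
        print(shlex.join(cmd))
```

Because the namespace is shared with PR CI, the warning events need manual filtering by pod name after collection.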

2. Identify root cause

Classify into one of: infra/scheduling, training crash, metric regression, controller bug, data/storage.

Use hypothesis-driven diagnosis: state a hypothesis, gather evidence, and narrow down. Attempt to reproduce the issue locally with a minimal case. Triple-check that you are narrowing in on the same issue that actually broke the canary.
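
One way to keep the hypothesis loop above honest is to record each hypothesis with its evidence and drop any that the evidence contradicts. The sketch below is illustrative only: the five category values come from the list above, while the `Hypothesis` record and the `narrow` helper are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RootCause(Enum):
    """The five categories named in this step."""
    INFRA_SCHEDULING = "infra/scheduling"
    TRAINING_CRASH = "training crash"
    METRIC_REGRESSION = "metric regression"
    CONTROLLER_BUG = "controller bug"
    DATA_STORAGE = "data/storage"


@dataclass
class Hypothesis:
    statement: str
    category: RootCause
    supporting: List[str] = field(default_factory=list)     # log lines, events, metrics
    contradicting: List[str] = field(default_factory=list)  # evidence against

    def is_viable(self) -> bool:
        # Viable only with at least one supporting item and nothing against it.
        return bool(self.supporting) and not self.contradicting


def narrow(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Keep only hypotheses that the collected evidence still allows."""
    return [h for h in hypotheses if h.is_viable()]
```

If `narrow` returns an empty list, that is itself a result: report the root cause as unclear and list the signals collected so far.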

3. File a GitHub issue

Follow the file-issue skill. Use the bug-report template.

- Title: `[canary-{lane}] {short failure description}`
- Labels: `bug`, `agent-generated`, `canary`
- The body must include a "Canary run context" section with: lane, job ID, GHA run URL, W&B run URL, date.
- Support your claims with data (e.g. runtime logs).
- Keep the issue concise and maximally readable for humans.
- Use GFM (for example, collapsible `<details>` blocks) to make the details (e.g. log traces, code to reproduce the issue) optional and declutter the issue.
- Use `--body-file` with a temp file (see the file-issue skill for the pattern).
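
The requirements above can be sketched as a body builder plus a `gh issue create` command line. This is not the file-issue skill's actual pattern, which is not shown here; the helper names, the `##` heading level, and the single collapsible log block are assumptions, and the bug-report template's own sections are omitted. The command is only constructed, not executed.

```python
import tempfile
from typing import List

LABELS = ["bug", "agent-generated", "canary"]
FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this example readable


def issue_title(lane: str, summary: str) -> str:
    return f"[canary-{lane}] {summary}"


def issue_body(lane: str, job_id: str, gha_run_url: str, wandb_run_url: str,
               run_date: str, log_excerpt: str) -> str:
    return "\n".join([
        "## Canary run context",
        "",
        f"- Lane: {lane}",
        f"- Job ID: {job_id}",
        f"- GHA run: {gha_run_url}",
        f"- W&B run: {wandb_run_url}",
        f"- Date: {run_date}",
        "",
        # Collapsible block: the log stays available without cluttering the issue.
        "<details>",
        "<summary>Log excerpt</summary>",
        "",
        FENCE,
        log_excerpt,
        FENCE,
        "",
        "</details>",
        "",
    ])


def gh_issue_command(title: str, body: str) -> List[str]:
    # Write the body to a temp file and pass it with --body-file.
    with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as handle:
        handle.write(body)
        body_path = handle.name
    return ["gh", "issue", "create", "--title", title,
            "--label", ",".join(LABELS), "--body-file", body_path]
```

Passing the body through a file avoids shell-quoting problems with multi-line log excerpts.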

4. Write `slack_message.md`

Write to the repo root. The workflow reads this file and sends it to Slack. Always write this file, even if issue creation failed.

Format — keep to 4 lines max:

:red_circle: *{GPU|TPU} Canary failed* — {one-line summary}
*Root cause:* {category} — {1 sentence}
*Issue:* {github issue URL}
*GHA run:* {GHA_RUN_URL}

If the root cause is unclear, say so: write "root cause unclear" and add your best-guess signals.
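
A minimal sketch of producing the four-line message and writing it to the repo root. The function names are hypothetical, and the placeholder used when no issue URL exists is an assumption; the format here does not specify what to write in that case.

```python
from pathlib import Path
from typing import Optional

SEP = "\u2014"  # the dash separator used in the documented format


def slack_message(lane: str, summary: str, category: Optional[str],
                  explanation: str, issue_url: Optional[str],
                  gha_run_url: str) -> str:
    cause = category or "root cause unclear"
    # Assumption: placeholder text for the case where issue creation failed.
    issue = issue_url or "not filed (issue creation failed)"
    lines = [
        f":red_circle: *{lane.upper()} Canary failed* {SEP} {summary}",
        f"*Root cause:* {cause} {SEP} {explanation}",
        f"*Issue:* {issue}",
        f"*GHA run:* {gha_run_url}",
    ]
    return "\n".join(lines) + "\n"


def write_slack_message(repo_root: Path, text: str) -> Path:
    if len(text.splitlines()) > 4:
        raise ValueError("slack message must be at most 4 lines")
    path = repo_root / "slack_message.md"
    path.write_text(text, encoding="utf-8")
    return path
```

Call `write_slack_message` in a `finally` block or equivalent, so the file exists even when an earlier step raised.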
