oncall-runner

Stars: 0

Forks: 0

Updated: June 24, 2026 at 03:26

Use the moment a page fires and you are the oncall — an alert ID lands in PagerDuty, a "who's oncall, this just paged" Slack ping, or you wake up to a firing alert and need to triage before you understand it. Fetches the alert by ID, runs the standard first-five-minutes checks (is it flapping, did a deploy just land, is this cause or symptom), and emits a structured finding you can paste into the incident channel. Run this before manual digging.

Source

az9713

az9713/skill-best-practices

File explorer

2 files

SKILL.md

name

oncall-runner

description

Oncall Runner

Overview

When a page fires, the first five minutes are wasted the same way every time: pulling up the alert, figuring out if it already auto-resolved, checking whether a deploy just landed, and deciding whether the thing that paged is the cause or just a downstream symptom. This skill does that mechanically so you can think.

scripts/oncall.py takes an alert ID, fetches the alert, runs the standard checks, and prints a structured finding. It is deliberately read-only — it investigates and reports; it never remediates. You decide what to do with the finding.

When to Use

Use this skill when:

A page just fired and you are oncall — you have an alert ID (from PagerDuty, Opsgenie, or the alert webhook) and need a fast triage.
Someone pings the oncall channel with "this just paged, what is it?" and an alert ID or link.
You're handed a stale page — woke up to a firing alert and need to know if it's still real or already auto-resolved.

Do NOT use this skill when:

You don't have an alert ID — this skill keys off the alert, not free-text symptoms. For symptom-first investigation of the inference API, use inference-api-debugging.
You're already past triage and deep in root-causing a confirmed incident.
You need to correlate logs for a specific request — use log-correlator.

How to Use

Run the script with the alert ID:

python scripts/oncall.py --alert-id PD-48213

The script will:

1. Fetch the alert by ID and print its current state, severity, and the metric/threshold that tripped it.
2. Run the standard checks:
   - Dedup / flapping: how many times has this alert fired and resolved in the lookback window? A high flap count means the underlying signal is noisy.
   - Auto-resolve status: did it already clear on its own? A page that's already resolved is a different decision than one still firing.
   - Recent deploy correlation: was there a deploy to the affected service in the minutes before the alert? This is the single highest-value signal.
   - Cause vs symptom: does the alert sit downstream of another currently firing alert? A 5xx page is often a symptom of a saturation or OOM page.
3. Emit a structured finding: severity, the checks' results, a cause/symptom call, and a recommended next step.
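The flap and deploy checks above can be sketched roughly as follows. This is an illustration, not the code in scripts/oncall.py: the event tuples, the function names, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def count_flaps(events, now, lookback):
    """Count fire-then-resolve cycles whose resolve lands inside the window.

    `events` is a hypothetical list of (timestamp, state) tuples, oldest
    first, where state is "firing" or "resolved".
    """
    cutoff = now - lookback
    flaps = 0
    previous_state = None
    for timestamp, state in events:
        if state == "resolved" and previous_state == "firing" and timestamp >= cutoff:
            flaps += 1
        previous_state = state
    return flaps


def deploys_before(deploy_times, fired_at, lookback):
    """Return deploys that landed between fired_at - lookback and fired_at."""
    return [t for t in deploy_times if fired_at - lookback <= t <= fired_at]


# Made-up sample data: two fire/resolve cycles, then a fresh firing.
now = datetime(2026, 6, 24, 3, 26)
events = [
    (now - timedelta(minutes=50), "firing"),
    (now - timedelta(minutes=45), "resolved"),
    (now - timedelta(minutes=20), "firing"),
    (now - timedelta(minutes=18), "resolved"),
    (now - timedelta(minutes=2), "firing"),
]
deploys = [now - timedelta(hours=3), now - timedelta(minutes=4)]

print(count_flaps(events, now, timedelta(minutes=30)))  # 1
print(len(deploys_before(deploys, now - timedelta(minutes=2), timedelta(minutes=30))))  # 1
```

Widening the window to 60 minutes would count both cycles, which is why the lookback setting changes how noisy an alert looks.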

Paste the finding into the incident channel. If the finding says the alert is a likely symptom of another alert, investigate that alert next (re-run the script with its ID).
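To make "structured finding" concrete, here is one possible shape for it. The field names, the text layout, and the second alert ID (PD-48190) are hypothetical; the actual output of scripts/oncall.py may look different.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Finding:
    # All field names are assumptions made for illustration.
    alert_id: str
    severity: str
    state: str
    flap_count: int
    recent_deploy: bool
    upstream_alert: Optional[str]
    next_step: str

    def to_text(self):
        """Render a block suitable for pasting into a chat channel."""
        if self.upstream_alert:
            call = f"likely symptom of {self.upstream_alert}"
        else:
            call = "likely cause"
        return "\n".join([
            f"Alert {self.alert_id} [{self.severity}] is {self.state}",
            f"Flaps in window: {self.flap_count}",
            f"Deploy shortly before alert: {'yes' if self.recent_deploy else 'no'}",
            f"Call: {call}",
            f"Next step: {self.next_step}",
        ])

    def to_json(self):
        """Render the same data as JSON for chaining into other tools."""
        return json.dumps(asdict(self), sort_keys=True)


finding = Finding(
    alert_id="PD-48213",
    severity="high",
    state="firing",
    flap_count=0,
    recent_deploy=True,
    upstream_alert="PD-48190",  # made-up upstream alert ID
    next_step="investigate PD-48190",
)
print(finding.to_text())
```

Keeping the text and JSON renderings on one object means the pasted summary and the machine-readable output cannot drift apart.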

Add --lookback 30m to widen the flap/deploy window, or --json to get machine-readable output for chaining.
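A value like 30m has to be turned into a duration before it can bound the flap/deploy window. The sketch below shows one way to do that; the accepted units (s, m, h) and the function name are assumptions, and the real script may accept other formats.

```python
import re
from datetime import timedelta

# Hypothetical unit suffixes; the real script's grammar is not documented here.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_lookback(text):
    """Convert a string such as "30m" or "2h" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported lookback value: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_lookback("30m"))  # 0:30:00
```

Rejecting unknown formats loudly is a deliberate choice in this sketch: silently falling back to a default window during triage would make the flap count misleading.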

Gotchas

ALWAYS check the flap count before paging humans. An alert that has fired and auto-resolved six times in the last hour is a noisy alert, not six incidents. Escalating a flapping alert as if it were a fresh incident burns the team's trust and their sleep. The script surfaces flap count first for exactly this reason — read it before you act.
An already-resolved page still deserves a look, but not a war room. Auto-resolved doesn't mean nothing happened: the service may have shed load or recovered on a retry. Note it, check the deploy correlation, but don't spin up an incident for a green alert.
Deploy correlation is necessary but not sufficient. A deploy landing right before the alert is the prime suspect, but coincidences happen — a traffic spike can coincide with an unrelated deploy. The script reports the correlation; you still confirm causation (e.g. does rolling back fix it, does the timing line up to the second).
Distinguish cause from symptom before declaring root cause. The alert that paged is frequently the loudest symptom, not the cause. A latency or 5xx page often sits downstream of a saturation, OOM, or dependency alert that fired first and quieter. The script checks for upstream firing alerts — if it finds one, investigate that one, not the one that woke you.
The script is read-only on purpose. It will never ack, resolve, silence, or remediate. That's your call — automating remediation off a possibly-flapping signal is how outages get worse.
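The cause-versus-symptom gotcha amounts to walking upstream from the paged service and asking whether anything it depends on is also firing. A minimal sketch, assuming a hand-written dependency map and made-up service names (how the real script discovers upstream alerts is not described here):

```python
from collections import deque


def find_upstream_firing(service, depends_on, firing):
    """Return firing services upstream of `service`, nearest first.

    `depends_on` maps a service to the services it calls; `firing` is the
    set of services with a currently firing alert. Both are hypothetical
    inputs for this sketch.
    """
    seen = {service}
    queue = deque(depends_on.get(service, []))
    found = []
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        if current in firing:
            found.append(current)
        queue.extend(depends_on.get(current, []))
    return found


# Made-up topology: the gateway pages, but the GPU pool is also firing.
depends_on = {
    "api-gateway": ["inference-api"],
    "inference-api": ["model-cache", "gpu-pool"],
    "model-cache": [],
    "gpu-pool": [],
}
firing = {"api-gateway", "gpu-pool"}

upstream = find_upstream_firing("api-gateway", depends_on, firing)
verdict = "likely symptom" if upstream else "likely cause"
print(verdict, upstream)  # likely symptom ['gpu-pool']
```

The breadth-first order means the nearest firing dependency is listed first, which is usually the one to investigate next. Like the script itself, this only reads state and never acks or silences anything.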

Files

SKILL.md — this file: when to trigger, how to run the script, gotchas.
scripts/oncall.py — fetches an alert by ID, runs the standard first-response checks (flap/dedup, auto-resolve, deploy correlation, cause-vs-symptom), and emits a structured finding. Run this first when a page fires.

More from this repository

adversarial-review

az9713/skill-best-practices

Use when a change is written and "looks done" but has not had a hostile second pass before merge — especially diffs touching auth, money, migrations, concurrency, or anything the author is quietly unsure about. Spawns a fresh-eyes reviewer subagent that sees ONLY the diff and the spec, collects findings, drives fixes, and re-dispatches until findings degrade to nitpicks. Reach for this instead of self-reviewing; the author is the worst reviewer of their own diff.

2026-06-240

babysit-pr

az9713/skill-best-practices

Use when a PR is open and green-but-blocked, or red on CI for reasons that smell like flake: a timed-out test runner, a transient network 500 in a setup step, a check that passed locally but failed in CI. Reach for this whenever someone says "this PR keeps failing CI but the test is flaky", "can you babysit this PR to merge", "it's just a flaky check, retry it", or wants a PR shepherded through retries, conflict resolution, and auto-merge without sitting on it manually. Prefer this over hand-clicking "Re-run failed jobs" in the GitHub UI, which gives no signal on flaky-vs-real and forgets to enable auto-merge.

2026-06-240

billing-lib

az9713/skill-best-practices

Use when writing or reviewing code that meters API token usage, bills accounts, issues invoices, applies credit grants, or computes balances with the internal `billing` library — especially around retries, mid-cycle plan changes, cache-read vs cache-write token pricing, or any place where double-billing or rounding drift would be a problem.

2026-06-240

checkout-verifier

az9713/skill-best-practices

Use when an API-credits checkout or paid-plan upgrade needs to be proven end-to-end against Stripe test mode — confirming a card charge actually creates the invoice and subscription in the right state, reproducing a "I paid but my credits didn't show up" report, checking that a declined or 3DS card fails the way the UI claims, or wiring a billing smoke test into CI so a checkout regression is caught before a customer's money is.

2026-06-240

cherry-pick-prod

az9713/skill-best-practices

Use when a specific fix that's already on main needs to land on a production/release branch without dragging along everything else — a hotfix to backport, a "cherry-pick this commit onto release-2.4", a "we need just that one PR on prod" request. Reach for this whenever someone wants to port one or a few commits to a release branch and open a PR for it, especially before doing it by hand in their main checkout, which pollutes their working tree and routinely leaves conflict markers committed or loses the original commit's provenance.

2026-06-240

code-style

az9713/skill-best-practices

Use when writing or editing code in this org's Python or JS/TS, especially before committing or opening a PR — and proactively the moment a diff adds an import, an except/catch, or any logging. Enforces the style rules Claude gets wrong by default: import grouping, error-wrapping (no bare except / empty catch), no leftover debug prints, explicit over clever. Runs scripts/check_style.sh (ruff, mypy --strict, eslint + grep guards) which exits nonzero so it drops into a pre-commit hook or CI.

2026-06-240
