
deploy-inference-gateway


Last updated: 24 June 2026 at 03:26

Use when shipping a change to the inference gateway and it needs to go out safely — build, smoke test, then ramp real traffic in stages while watching error rate and p99 against a baseline, rolling back automatically on regression. Reach for this whenever someone says "deploy the inference gateway", "roll out the new gateway build", "canary the gateway change", "ship this with a gradual rollout", or wants a traffic-shifted release with automatic rollback rather than flipping 100% of traffic at once. Prefer this over a manual `kubectl set image` / full-cut deploy, which gives no canary signal and no automatic rollback when latency or errors regress.


Source: az9713/skill-best-practices



Deploy Inference Gateway

Overview

The inference gateway sits in front of every model request, so a bad deploy doesn't degrade one feature — it degrades all of them at once. A full-cut deploy ("set the image, pray") finds out about a regression from the on-call pager. The safer shape is a canary: shift a small slice of real traffic to the new build, compare its error rate and tail latency to the old build serving the rest, and only widen the slice if the new build is at least as good. If it regresses, roll back before most users ever touched it.

This skill runs that loop. It builds and smoke-tests the new revision, then ramps traffic through 5% → 25% → 50% → 100%, holding at each stage long enough for the metrics to be statistically meaningful, comparing the canary's error rate and p99 latency to a baseline captured from the same model version. A regression at any stage triggers an automatic rollback that drains in-flight requests before tearing the canary down, so no request is killed mid-flight. On regression it also points you at the inference-api-debugging runbook skill to triage why.

The guiding principle: a canary is only as trustworthy as its comparison. Same model version on both sides, a long-enough window, warmed weights — get those wrong and the canary will happily wave a bad build through.

When to Use

Reach for this when:

- You're deploying a new inference-gateway build/revision and want a staged, metric-gated rollout instead of an all-at-once cut.
- A config or routing change to the gateway needs to be validated against live traffic before it owns 100%.
- You want automatic rollback wired in: if error rate or p99 regresses past threshold, the deploy should undo itself without a human in the loop.
- You're codifying the team's release procedure so every gateway deploy goes out the same safe way.

Do NOT use this when:

- The change is a pure-data or non-serving change (a dashboard, a docs update, a batch job). There's no traffic to canary, so this is pure overhead.
- You need an emergency rollback of an already-bad live deploy. That's a straight revert to the last-good revision, not a staged canary. Cut traffic back to the known-good build directly.
- The new build serves a different model version than the baseline. Canary comparison assumes like-for-like; a model-version change makes the error/latency delta meaningless (see Gotchas). Validate model changes through the eval harness first, then deploy the serving change separately.

Running it

```shell
cd .claude/skills/deploy-inference-gateway

export GATEWAY_IMAGE="registry.internal/inference-gateway:$(git rev-parse --short HEAD)"
export GATEWAY_NAMESPACE="inference"
export BASELINE_REVISION="stable"          # the currently-live, known-good revision
export PROM_BASE="https://prom.internal"   # where error-rate / p99 metrics live

./scripts/deploy.sh
```

Tunable gates (all optional, shown with defaults):

```shell
export STAGES="5 25 50 100"        # traffic % ramp
export STAGE_HOLD_S=300            # hold per stage so the metric window is significant
export ERROR_RATE_REGRESSION=1.5   # rollback if canary error rate > baseline * this
export P99_REGRESSION=1.2          # rollback if canary p99 > baseline * this
export WARMUP_S=90                 # warm model weights before any traffic
export MIN_REQUESTS=500            # don't judge a stage until it has served this many
```
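To make the two multiplicative gates concrete, here is a small worked sketch. The function names `rollback_threshold` and `is_regression` are hypothetical and not taken from `scripts/deploy.sh`; the baseline values (0.4% error rate, 800 ms p99) are made-up inputs chosen only to show the arithmetic with the default factors.

```shell
#!/usr/bin/env bash
# Illustrative only: these helpers are hypothetical, not part of deploy.sh.

# Print the value above which a canary metric counts as a regression.
rollback_threshold() {  # usage: rollback_threshold <baseline> <factor>
  awk -v b="$1" -v f="$2" 'BEGIN { printf "%g\n", b * f }'
}

# Exit 0 when the canary metric is strictly greater than baseline * factor.
is_regression() {  # usage: is_regression <canary> <baseline> <factor>
  awk -v c="$1" -v b="$2" -v f="$3" 'BEGIN { exit !(c > b * f) }'
}

rollback_threshold 0.004 1.5   # error-rate limit for a 0.4% baseline: prints 0.006
rollback_threshold 800 1.2     # p99 limit for an 800 ms baseline: prints 960
```

With these example inputs, a canary at 0.5% errors and 900 ms p99 would pass both gates, while 1000 ms p99 would trip the latency gate. One property of a purely multiplicative gate: a baseline of exactly 0 turns any non-zero canary value into a regression. The source does not say whether the script guards against that case.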

The script logs each stage transition, the canary-vs-baseline comparison at every hold, and — on regression — the rollback and drain. Exit code is non-zero if it rolled back, so it composes into CI/CD that should fail the pipeline on a bad deploy.
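Because the exit code carries the outcome, a pipeline step can stay thin. The wrapper below is a minimal sketch of one way to do that; `run_gateway_deploy` is a hypothetical name, and the only behavior assumed about `deploy.sh` is the documented one (non-zero exit when it rolled back).

```shell
#!/usr/bin/env bash
# Hypothetical CI wrapper: runs the given command and propagates its exit code.

run_gateway_deploy() {  # usage: run_gateway_deploy <command> [args...]
  local rc=0
  "$@" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "gateway deploy promoted"
  else
    # Non-zero covers a rollback; the pipeline should stop here.
    echo "gateway deploy did not promote (exit $rc)" >&2
  fi
  return "$rc"
}

# In a pipeline step, from the skill directory:
#   run_gateway_deploy ./scripts/deploy.sh
```

Taking the command as arguments keeps the wrapper testable with a stand-in command and avoids hard-coding the script path.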

How the rollout loop works

1. Build + smoke test. Pull/verify GATEWAY_IMAGE, deploy it as a zero-traffic canary revision, and hit its health check plus a handful of representative inference requests directly (bypassing the load balancer). If smoke fails, the canary never sees real traffic: abort here.
2. Warm the weights. Model-serving processes are slow on their first requests (lazy weight load, cold caches, JIT). Send WARMUP_S of synthetic warmup to the canary before shifting any real traffic, so the first real users don't eat cold-start latency and trip a false p99 regression.
3. Capture the baseline. Read the baseline revision's error rate and p99 over a recent window, from the same model version the canary serves, so the comparison is apples-to-apples.
4. Ramp through STAGES. At each stage: shift that % of traffic to the canary, hold STAGE_HOLD_S, and once the stage has served MIN_REQUESTS, compare canary error rate and p99 to baseline. If canary_error_rate > baseline * ERROR_RATE_REGRESSION or canary_p99 > baseline * P99_REGRESSION, roll back. Otherwise advance.
5. Promote at 100%. Make the canary the new stable revision and retire the old one.
6. Rollback path (any stage). Shift traffic back to baseline, then drain: stop new traffic to the canary but let in-flight requests finish (respecting the connection drain timeout) before scaling it to zero. Emit a pointer to the inference-api-debugging runbook with the captured metrics so triage starts with data, not a blank page.
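The ramp-and-gate control flow described above can be sketched in a few lines. This is not the contents of `scripts/deploy.sh`: every helper here (`shift_traffic`, `canary_metrics`, `rollback`, `promote`, `exceeds`, `ramp`) is a hypothetical stub that only echoes what a real driver would do, and the metric values are invented so the loop can run without a cluster.

```shell
#!/usr/bin/env bash
# Sketch of the ramp loop. All helpers are hypothetical stubs.

STAGES="${STAGES:-5 25 50 100}"
ERROR_RATE_REGRESSION="${ERROR_RATE_REGRESSION:-1.5}"
P99_REGRESSION="${P99_REGRESSION:-1.2}"

# Stubs: a real driver would talk to the traffic router and PROM_BASE here.
shift_traffic()  { echo "traffic: ${1}% -> canary"; }
canary_metrics() { echo "0.004 810"; }  # "<error_rate> <p99_ms>" for the stage
rollback()       { echo "rollback: traffic -> baseline, drain in-flight, scale canary to zero"; }
promote()        { echo "promote: canary is the new stable revision"; }

# Exit 0 when canary > baseline * factor.
exceeds() { awk -v c="$1" -v b="$2" -v f="$3" 'BEGIN { exit !(c > b * f) }'; }

ramp() {  # usage: ramp <baseline_error_rate> <baseline_p99_ms>
  local base_err="$1" base_p99="$2" pct metrics err p99
  for pct in $STAGES; do  # unquoted on purpose: split the stage list on spaces
    shift_traffic "$pct"
    # A real driver would hold STAGE_HOLD_S and wait for MIN_REQUESTS here.
    metrics="$(canary_metrics "$pct")"
    err="${metrics% *}"
    p99="${metrics#* }"
    # Gate on both metrics: either one regressing triggers the rollback path.
    if exceeds "$err" "$base_err" "$ERROR_RATE_REGRESSION" ||
       exceeds "$p99" "$base_p99" "$P99_REGRESSION"; then
      rollback
      return 1
    fi
  done
  promote
}

ramp 0.004 800
```

With the stubbed metrics the canary stays under both thresholds at every stage, so the run ends in promotion. Swapping `canary_metrics` for one that returns a higher p99 at a later stage shows the early exit: the loop stops at that stage and later stages never receive traffic.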

Gotchas

ALWAYS treat these as real, observed failure modes — each has either passed a bad build or rolled back a good one.

- A too-short metric window lies. At 5% traffic, a few minutes may be tens of requests, which is not enough to distinguish a real error-rate regression from noise. Judging a stage before it has served MIN_REQUESTS produces both false rollbacks (good build, unlucky sample) and false promotions (bad build, too few samples to show it). Hold each stage until both STAGE_HOLD_S has elapsed and MIN_REQUESTS have been served before comparing.
- Cold weights look like a latency regression. The canary's first requests are slow because weights load lazily and caches are cold, not because the build is worse. Skipping warmup makes the 5% stage trip a p99 regression and roll back a perfectly good deploy. Always warm (WARMUP_S) before shifting real traffic, and don't count warmup traffic in the canary-vs-baseline comparison.
- The baseline must be the same model version. Comparing a canary serving model v7 against a baseline serving v6 measures the model change, not the deploy. The error/latency delta becomes meaningless and you'll either roll back a fine gateway change or wave through a real regression. If the model version differs, stop: this skill is for serving/gateway changes; route model changes through the eval harness first.
- Rollback must drain, not kill. Scaling the canary straight to zero on rollback severs in-flight inference requests: users see truncated streams and 5xxs caused by the rollback itself, which looks like the regression got worse. Always stop new traffic first, let in-flight requests finish within the drain timeout, then scale down. A clean rollback should be invisible to anyone not mid-request.
- Error rate and p99 can move independently. A build can keep error rate flat while quietly regressing tail latency (a new sync call on the hot path), or spike errors while latency looks fine. Gate on both; checking only one lets the other class of regression through.
- A regressing canary needs triage, not just a rollback. Rolling back stops the bleeding but loses the signal. On rollback the script captures the offending metrics and points at the inference-api-debugging runbook skill so the next step is "here's the error signature and the latency profile", not "it failed, figure it out." Don't re-deploy the same build without reading that runbook.
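The first gotcha comes down to arithmetic: how long a stage takes to reach MIN_REQUESTS depends on the total request rate and the stage's traffic share. The sketch below works that out and encodes the "both conditions" rule. `seconds_to_min_requests` and `ready_to_judge` are hypothetical helpers, and the request rates are example inputs, not measurements from any real gateway.

```shell
#!/usr/bin/env bash
# Illustrative helpers, not part of deploy.sh. Assumes positive rate and percentage.

# Seconds until a stage has served <min_requests>, rounded up, given the total
# gateway request rate (req/s) and the stage's traffic percentage.
seconds_to_min_requests() {  # usage: <total_rps> <stage_pct> <min_requests>
  awk -v r="$1" -v p="$2" -v m="$3" 'BEGIN {
    s = (m * 100) / (r * p)
    t = int(s); if (t < s) t++
    print t
  }'
}

# A stage may be judged only when BOTH the hold time and the request count are met.
ready_to_judge() {  # usage: <elapsed_s> <served_requests>
  [ "$1" -ge "${STAGE_HOLD_S:-300}" ] && [ "$2" -ge "${MIN_REQUESTS:-500}" ]
}

seconds_to_min_requests 20 5 500   # 20 req/s total, 5% stage: prints 500
seconds_to_min_requests 2 5 500    # 2 req/s total, 5% stage: prints 5000
```

At 20 req/s the 5% stage sees 1 req/s, so the default 300-second hold yields only about 300 requests and the stage needs 500 seconds before it can be judged. At 2 req/s the same stage needs 5000 seconds, which suggests that on low-traffic gateways MIN_REQUESTS, not STAGE_HOLD_S, sets the real stage length.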

Files

scripts/deploy.sh — the canary rollout driver. Builds + smoke-tests the new revision as a zero-traffic canary, warms weights, captures a same-model-version baseline, ramps traffic 5/25/50/100 holding for significance at each stage, compares error rate + p99 to baseline, auto-rolls-back with in-flight drain on regression (pointing at the inference-api-debugging runbook), and promotes at 100%. Referenced by Running it above.

