| name | deploy-inference-gateway |
| description | Use when shipping a change to the inference gateway and it needs to go out safely — build, smoke test, then ramp real traffic in stages while watching error rate and p99 against a baseline, rolling back automatically on regression. Reach for this whenever someone says "deploy the inference gateway", "roll out the new gateway build", "canary the gateway change", "ship this with a gradual rollout", or wants a traffic-shifted release with automatic rollback rather than flipping 100% of traffic at once. Prefer this over a manual `kubectl set image` / full-cut deploy, which gives no canary signal and no automatic rollback when latency or errors regress. |
Deploy Inference Gateway
Overview
The inference gateway sits in front of every model request, so a bad deploy
doesn't degrade one feature — it degrades all of them at once. A full-cut deploy
("set the image, pray") finds out about a regression from the on-call pager. The
safer shape is a canary: shift a small slice of real traffic to the new build,
compare its error rate and tail latency to the old build serving the rest, and
only widen the slice if the new build is at least as good. If it regresses,
roll back before most users ever touched it.
This skill runs that loop. It builds and smoke-tests the new revision, then ramps
traffic through 5% → 25% → 50% → 100%, holding at each stage long enough for
the metrics to be statistically meaningful, comparing the canary's error rate and
p99 latency to a baseline captured from the same model version. A regression at
any stage triggers an automatic rollback that drains in-flight requests before
tearing the canary down, so requests that finish within the drain timeout are
not killed mid-flight. On regression it
also points you at the inference-api-debugging runbook skill to triage why.
The guiding principle: a canary is only as trustworthy as its comparison. Same
model version on both sides, a long-enough window, warmed weights — get those
wrong and the canary will happily wave a bad build through.
When to Use
Reach for this when:
- You're deploying a new inference-gateway build/revision and want a staged,
metric-gated rollout instead of an all-at-once cut.
- A config or routing change to the gateway needs to be validated against live
traffic before it owns 100%.
- You want automatic rollback wired in — if error rate or p99 regresses past
threshold, the deploy should undo itself without a human in the loop.
- You're codifying the team's release procedure so every gateway deploy goes out
the same safe way.
Do NOT use this when:
- The change is a pure-data or non-serving change (a dashboard, a docs update, a
batch job) — there's no traffic to canary, so this is pure overhead.
- You need an emergency rollback of an already-bad live deploy — that's a
straight revert to the last-good revision, not a staged canary. Cut traffic
back to the known-good build directly.
- The new build serves a different model version than the baseline. Canary
comparison assumes like-for-like; a model-version change makes the
error/latency delta meaningless (see Gotchas). Validate model changes through
the eval harness first, then deploy the serving change separately.
Running it
cd .claude/skills/deploy-inference-gateway
export GATEWAY_IMAGE="registry.internal/inference-gateway:$(git rev-parse --short HEAD)"
export GATEWAY_NAMESPACE="inference"
export BASELINE_REVISION="stable"
export PROM_BASE="https://prom.internal"
./scripts/deploy.sh
Tunable gates (all optional, shown with defaults):
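export STAGES="5 25 50 100"        # traffic percentages to ramp through, in order
export STAGE_HOLD_S=300            # minimum hold per stage, in seconds
export ERROR_RATE_REGRESSION=1.5   # roll back if canary error rate > 1.5x baseline
export P99_REGRESSION=1.2          # roll back if canary p99 > 1.2x baseline
export WARMUP_S=90                 # seconds of synthetic warmup before real traffic
export MIN_REQUESTS=500            # requests a stage must serve before it is judged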
The script logs each stage transition, the canary-vs-baseline comparison at every
hold, and — on regression — the rollback and drain. Exit code is non-zero if it
rolled back, so it composes into CI/CD that should fail the pipeline on a bad
deploy.
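Since the exit code is what a pipeline sees, a CI step can gate on it directly. The following is a minimal sketch, assuming a bash CI step; `run_gated_deploy` is a hypothetical helper name, not part of the skill, and it takes the deploy command as an argument so the wrapper can be exercised without a cluster.

```shell
# Hypothetical CI wrapper (not part of the skill): run the deploy driver and
# pass its exit code through so the pipeline fails when the deploy did not promote.
run_gated_deploy() {
  # Default to the skill's driver; pass another command to dry-run the wrapper.
  local deploy_cmd="${1:-./scripts/deploy.sh}"
  local rc=0
  "$deploy_cmd" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "gateway deploy promoted"
  else
    echo "gateway deploy did not promote (exit ${rc})" >&2
  fi
  return "$rc"
}
```

Called with no arguments, it runs `./scripts/deploy.sh`; any non-zero return fails the step.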
How the rollout loop works
- Build + smoke test. Pull/verify GATEWAY_IMAGE, deploy it as a zero-traffic
canary revision, then run its health check plus a handful of representative
inference requests against it directly (bypassing the load balancer). If smoke
fails, the canary never sees real traffic; the deploy aborts here.
- Warm the weights. Model-serving processes are slow on their first requests
(lazy weight load, cold caches, JIT). Send WARMUP_S seconds of synthetic warmup
traffic to the canary before shifting any real traffic, so the first real users
don't eat cold-start latency and trip a false p99 regression.
- Capture the baseline. Read the baseline revision's error rate and p99 over
a recent window — from the same model version the canary serves — so the
comparison is apples-to-apples.
- Ramp through STAGES. At each stage: shift that % of traffic to the canary,
hold until STAGE_HOLD_S has elapsed and the stage has served MIN_REQUESTS,
then compare canary error rate and p99 to baseline. If
canary_error_rate > baseline_error_rate * ERROR_RATE_REGRESSION or
canary_p99 > baseline_p99 * P99_REGRESSION, roll back. Otherwise advance.
- Promote at 100%. Make the canary the new stable revision and retire the
old one.
- Rollback path (any stage). Shift traffic back to baseline, then drain:
stop new traffic to the canary but let in-flight requests finish (respecting
the connection drain timeout) before scaling it to zero. Emit a pointer to the
inference-api-debugging runbook with the captured metrics so triage starts
with data, not a blank page.
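The gate comparison and the stage loop above can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the contents of scripts/deploy.sh: `canary_regressed`, `ramp`, and the `stage_metrics` hook are hypothetical names, the metrics are passed in as plain numbers, and traffic shifting, holding, and draining are left out.

```shell
# Defaults mirror the tunables listed under "Running it".
STAGES="${STAGES:-5 25 50 100}"
ERROR_RATE_REGRESSION="${ERROR_RATE_REGRESSION:-1.5}"
P99_REGRESSION="${P99_REGRESSION:-1.2}"

# Succeeds (exit 0) when the canary breaches either gate.
# Args: canary_error_rate baseline_error_rate canary_p99 baseline_p99
canary_regressed() {
  awk -v ce="$1" -v be="$2" -v cp="$3" -v bp="$4" \
      -v er="$ERROR_RATE_REGRESSION" -v pr="$P99_REGRESSION" \
      'BEGIN { exit !((ce + 0 > be * er) || (cp + 0 > bp * pr)) }'
}

# Walks STAGES in order and stops at the first regression.
# stage_metrics is a HYPOTHETICAL hook the caller must define: given a traffic
# percentage, it prints "canary_err baseline_err canary_p99 baseline_p99"
# once the stage has been held.
ramp() {
  local pct metrics
  for pct in $STAGES; do
    metrics="$(stage_metrics "$pct")"
    # Word splitting of $metrics into four arguments is intended.
    if canary_regressed $metrics; then
      echo "rollback at ${pct}%"
      return 1
    fi
    echo "stage ${pct}% passed"
  done
  echo "promote"
}
```

One consequence of a multiplicative gate is worth knowing: if the baseline error rate is exactly 0, any canary error at all trips it. The source does not say how deploy.sh treats that case.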
Gotchas
ALWAYS treat these as real, observed failure modes — each has either passed a bad
build or rolled back a good one.
- A too-short metric window lies. At 5% traffic, a few minutes may be tens of
requests — not enough to distinguish a real error-rate regression from noise.
Judging a stage before it has served MIN_REQUESTS produces both false
rollbacks (good build, unlucky sample) and false promotions (bad build, too few
samples to show it). Hold each stage until both STAGE_HOLD_S has elapsed
and MIN_REQUESTS have been served before comparing.
- Cold weights look like a latency regression. The canary's first requests
are slow because weights load lazily and caches are cold — not because the
build is worse. Skipping warmup makes the 5% stage trip a p99 regression and
roll back a perfectly good deploy. Always warm (WARMUP_S) before shifting real
traffic, and don't count warmup traffic in the baseline comparison.
- The baseline must be the same model version. Comparing a canary serving
model v7 against a baseline serving v6 measures the model change, not the
deploy. The error/latency delta becomes meaningless and you'll either roll back
a fine gateway change or wave through a real one. If the model version differs,
stop — this skill is for serving/gateway changes; route model changes through
the eval harness first.
- Rollback must drain, not kill. Scaling the canary straight to zero on
rollback severs in-flight inference requests — users see truncated streams and
5xxs caused by the rollback itself, which looks like the regression got
worse. Always stop new traffic first, let in-flight requests finish within the
drain timeout, then scale down. A clean rollback should be invisible to anyone
not mid-request.
- Error rate and p99 can move independently. A build can keep error rate flat
while quietly regressing tail latency (a new sync call on the hot path), or
spike errors while latency looks fine. Gate on both — checking only one lets
the other class of regression through.
- A regressing canary needs triage, not just a rollback. Rolling back stops
the bleeding but loses the signal. On rollback the script captures the offending
metrics and points at the inference-api-debugging runbook skill so the next
step is "here's the error signature and the latency profile", not "it failed,
figure it out." Don't re-deploy the same build without reading that runbook.
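Two of the gotchas above reduce to cheap guards that run before any comparison. To see why the sample-size guard matters: at 40 requests, a single failure already reads as a 2.5% error rate. The sketch below assumes the elapsed time, request count, and model versions have already been obtained; `stage_ready` and `same_model_version` are hypothetical helper names, not functions known to exist in scripts/deploy.sh.

```shell
# Defaults mirror the tunables listed under "Running it".
STAGE_HOLD_S="${STAGE_HOLD_S:-300}"
MIN_REQUESTS="${MIN_REQUESTS:-500}"

# Succeeds only when BOTH the hold time has elapsed and enough requests
# have been served. Args: elapsed_seconds served_requests
stage_ready() {
  [ "$1" -ge "$STAGE_HOLD_S" ] && [ "$2" -ge "$MIN_REQUESTS" ]
}

# Succeeds only when canary and baseline report the same, non-empty model
# version. Args: canary_model_version baseline_model_version
# (how the versions are looked up is not shown here)
same_model_version() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

With the defaults, `stage_ready 300 40` fails even though the hold time has elapsed, because 40 requests is below MIN_REQUESTS, and `same_model_version v7 v6` fails, which is the signal to stop and go through the eval harness instead.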
Files
scripts/deploy.sh — the canary rollout driver. Builds + smoke-tests the new
revision as a zero-traffic canary, warms weights, captures a same-model-version
baseline, ramps traffic 5/25/50/100 holding for significance at each stage,
compares error rate + p99 to baseline, auto-rolls-back with in-flight drain on
regression (pointing at the inference-api-debugging runbook), and promotes at
100%. Referenced by Running it above.