| name | inference-api-debugging |
| description | Use when the inference API is paging — a p99 latency spike, a 429 surge, an elevated 5xx rate, a model returning gibberish, or a "the API is slow" Slack thread. Maps each symptom to the exact tools and query patterns that isolate the cause across metrics, logs, and traces, then emits a structured incident report. Reach for this the moment a dashboard or alert points at the highest-traffic inference API; do not hand-roll ad hoc queries. |
Inference API Debugging
Overview
The inference API is the highest-traffic surface at the company: every external
customer request and most internal traffic flows through one gateway into a fleet
of model servers backed by GPUs. When it degrades, the blast radius is everyone.
This skill exists so you do not start from a blank page during an incident. It
maps a symptom (the thing the alert or Slack thread is screaming about) to the
tools that hold the answer and the exact query patterns to run against
them, then funnels everything into a single structured report you can paste into
the incident channel.
The investigation always follows the same spine:
- Confirm the symptom — is it real, and is it API-wide or scoped to one model?
- Walk the symptom→tool map in references/runbook.md: run the queries it
prescribes, in order.
- Separate cause from symptom — a 5xx spike is usually downstream of
something (GPU OOM, a bad deploy, a saturated rate limiter).
- Emit the structured report using the template at the end of the runbook.
When to Use
Use this skill when you see any of these symptoms on the inference API:
- p99 / p95 latency spike — the latency SLO alert fired, or customers report
"the API is slow."
- 429 surge — a jump in rate-limit rejections, quota-exceeded errors, or a
customer escalation about being throttled.
- Elevated 5xx rate — 500/502/503 above baseline, gateway error-rate alert.
- Model returning garbage — empty completions, truncated output, wrong model
responding, or a quality regression report.
- Throughput drop — tokens/sec or requests/sec fell off a cliff while latency
may look normal.
Do NOT use this skill when:
- The symptom is on a different service (billing UI, console, training cluster).
This skill's queries target the inference gateway and model-server fleet only.
- You already have a confirmed root cause and just need to remediate — go straight
to the relevant runbook/playbook for that fix.
- You are doing capacity planning or a non-incident investigation; this is a
break-glass tool, not a reporting tool.
How to Use
- Read references/runbook.md. It contains the full symptom→tool→query
table. Find the row matching your symptom.
- Run the prescribed queries in order. Each row gives you a metric query, a
log query, and (where relevant) a trace lookup. Run them and record the
numbers; you will need them for the report.
- Always split aggregate vs per-model. The single most common mistake on
this API is reading the aggregate dashboard and missing that one model is
dragging the whole p99. Every latency/error query has a group by model
variant in the runbook. Use it.
- Trace the worst offender. Once you have a slow or failing model, pull a
representative trace (the runbook gives the trace query). The trace tells you
which hop (gateway, queue, model server, tokenizer) owns the time or the
error.
- Classify cause vs symptom. Use the "common cause chains" section of the
runbook. Example: elevated 5xx → trace shows model-server hop → model-server
logs show CUDA OOM → root cause is GPU OOM surfacing as a generic 500, not a
gateway bug.
- Emit the structured report. Fill in the template at the bottom of
references/runbook.md and post it to the incident channel.
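The per-model split and the worst-offender trace step can be illustrated offline with a minimal Python sketch. It is not the runbook's query syntax: the record shapes, the model names `model-a` and `model-b`, the sample numbers, and the helpers `percentile`, `p99_by_model`, and `slowest_hop` are all hypothetical, chosen only to show why the aggregate p99 can look healthy while one low-volume model is timing out.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_model(requests):
    """requests: iterable of (model, latency_ms). Returns {model: p99 latency in ms}."""
    by_model = defaultdict(list)
    for model, latency_ms in requests:
        by_model[model].append(latency_ms)
    return {model: percentile(latencies, 99) for model, latencies in by_model.items()}


def slowest_hop(spans):
    """spans: iterable of (hop, duration_ms) from one trace. Returns the hop that owns the most time."""
    totals = defaultdict(float)
    for hop, duration_ms in spans:
        totals[hop] += duration_ms
    return max(totals, key=totals.get)


# Hypothetical data: a healthy high-volume model and a sick low-volume one.
requests = [("model-a", 120)] * 990 + [("model-b", 30_000)] * 10

aggregate_p99 = percentile([latency for _, latency in requests], 99)
per_model_p99 = p99_by_model(requests)

# Hypothetical trace for one slow model-b request; hop names follow the list above.
trace = [("gateway", 4.0), ("queue", 12.0), ("model_server", 29_800.0), ("tokenizer", 3.0)]
worst = slowest_hop(trace)

print(aggregate_p99, per_model_p99, worst)
```

In this made-up sample the aggregate p99 stays at the healthy model's latency because the sick model contributes only 1% of traffic, while the per-model view exposes it immediately. The same reasoning is why the group by model variant should run before any conclusion is drawn from the aggregate.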
Gotchas
- ALWAYS use p99 (or p95), never the average, for latency. Averages hide the
tail. A mean latency of 400ms can sit on top of a p99 of 30s when one model or
one GPU node is timing out. The SLO and the customer pain both live in the tail.
- ALWAYS split per-model before concluding. Aggregate metrics blend a healthy
high-volume model with a sick low-volume one. The aggregate can look fine, or
look uniformly bad, while the truth is one model. Run the group by model
variant first, not last.
- GPU OOM shows up as a generic 5xx, not as "OOM." A CUDA out-of-memory on a
model server gets caught and returned as a plain 500/503 at the gateway. If you
see 5xx with no obvious gateway cause, go straight to model-server logs and grep
for CUDA out of memory / OOM. Do not assume the gateway is the problem.
- Label cardinality is capped — don't group by request_id or customer_id in the
metrics system. High-cardinality labels get dropped or rejected by the metrics
backend, so a group by request_id query silently returns nothing or
partial data. Use metrics for low-cardinality dimensions (model, region,
status_code) and pivot to logs/traces for anything per-request.
- A 429 surge is not always a problem on our side. It can be a single customer
hammering a key (check the per-key quota table) or an intentional limit doing
its job. Confirm whether the rejections are concentrated on one key before
paging the rate-limiter team.
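Two of the gotchas above come down to simple arithmetic, sketched here in Python on made-up numbers. The helpers `percentile` and `top_key_share`, the key names, and the sample data are hypothetical; in a real incident the inputs come from the runbook's queries and the per-key quota table, not from in-memory lists.

```python
import math
from collections import Counter
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def top_key_share(rejected_keys):
    """rejected_keys: one API key id per 429. Returns (key, fraction of all 429s on that key)."""
    counts = Counter(rejected_keys)
    if not counts:
        raise ValueError("no rejections")
    key, hits = counts.most_common(1)[0]
    return key, hits / sum(counts.values())


# Hypothetical latencies in ms: 989 fast requests and 11 that time out at 30 s.
latencies_ms = [70] * 989 + [30_000] * 11
mean_ms = mean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

# Hypothetical 429s: one key accounts for most of the rejections.
rejections = ["key-1"] * 90 + ["key-2"] * 6 + ["key-3"] * 4
noisy_key, share = top_key_share(rejections)

print(f"mean={mean_ms:.2f}ms p99={p99_ms}ms")
print(f"{noisy_key} owns {share:.0%} of 429s")
```

With these numbers the mean sits just under 400 ms while the p99 is the full 30 s timeout, which is the shape the first gotcha warns about. The second half shows the check to make before paging the rate-limiter team: if one key owns most of the rejections, the limit is likely doing its job.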
Files
SKILL.md — this file: when to trigger, the investigation spine, gotchas.
references/runbook.md — the symptom→tool→query-pattern table, common cause
chains, and the structured incident report template. Read this during an
incident.