inference-api-debugging

Updated June 24, 2026 at 03:26

Use when the inference API is paging — a p99 latency spike, a 429 surge, an elevated 5xx rate, a model returning gibberish, or a "the API is slow" Slack thread. Maps each symptom to the exact tools and query patterns that isolate the cause across metrics, logs, and traces, then emits a structured incident report. Reach for this the moment a dashboard or alert points at the highest-traffic inference API; do not hand-roll ad hoc queries.

Source: az9713/skill-best-practices

Inference API Debugging

Overview

The inference API is the highest-traffic surface at the company: every external customer request and most internal traffic flows through one gateway into a fleet of model servers backed by GPUs. When it degrades, the blast radius is everyone.

This skill exists so you do not start from a blank page during an incident. It maps a symptom (the thing the alert or Slack thread is screaming about) to the tools that hold the answer and the exact query patterns to run against them, then funnels everything into a single structured report you can paste into the incident channel.

The investigation always follows the same spine:

1. Confirm the symptom: is it real, and is it API-wide or scoped to one model?
2. Walk the symptom→tool map in references/runbook.md: run the queries it prescribes, in order.
3. Separate cause from symptom: a 5xx spike is usually downstream of something else (GPU OOM, a bad deploy, a saturated rate limiter).
4. Emit the structured report using the template at the end of the runbook.

When to Use

Use this skill when you see any of these symptoms on the inference API:

p99 / p95 latency spike — the latency SLO alert fired, or customers report "the API is slow."
429 surge — a jump in rate-limit rejections, quota-exceeded errors, or a customer escalation about being throttled.
Elevated 5xx rate — 500/502/503 above baseline, gateway error-rate alert.
Model returning garbage — empty completions, truncated output, wrong model responding, or a quality regression report.
Throughput drop — tokens/sec or requests/sec fell off a cliff while latency may look normal.

Do NOT use this skill when:

The symptom is on a different service (billing UI, console, training cluster). This skill's queries target the inference gateway and model-server fleet only.
You already have a confirmed root cause and just need to remediate — go straight to the relevant runbook/playbook for that fix.
You are doing capacity planning or a non-incident investigation; this is a break-glass tool, not a reporting tool.

How to Use

1. Read references/runbook.md. It contains the full symptom→tool→query table. Find the row matching your symptom.
2. Run the prescribed queries in order. Each row gives you a metric query, a log query, and (where relevant) a trace lookup. Record the numbers as you go; you will need them for the report.
3. Always split aggregate vs per-model. The single most common mistake on this API is reading the aggregate dashboard and missing that one model is dragging the whole p99. Every latency/error query in the runbook has a "group by model" variant; use it.
4. Trace the worst offender. Once you have a slow or failing model, pull a representative trace (the runbook gives the trace query). The trace tells you which hop (gateway, queue, model server, tokenizer) owns the time or the error.
5. Classify cause vs symptom. Use the "common cause chains" section of the runbook. Example: elevated 5xx → trace shows model-server hop → model-server logs show CUDA OOM → root cause is GPU OOM surfacing as a generic 500, not a gateway bug.
6. Emit the structured report. Fill in the template at the bottom of references/runbook.md and post it to the incident channel.
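
The aggregate-vs-per-model split can be illustrated with a small sketch. This is not the runbook's query. It assumes you have exported raw (model, latency_ms) samples from logs or traces, and the model names, sample counts, and the `percentile` and `p99_by_model` helpers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def p99_by_model(samples):
    """Group (model, latency_ms) samples by model and return each model's p99."""
    by_model = defaultdict(list)
    for model, latency_ms in samples:
        by_model[model].append(latency_ms)
    return {model: percentile(vals, 99) for model, vals in by_model.items()}


# Hypothetical data: a healthy high-volume model and a sick low-volume one.
samples = [("model-a", 100)] * 985 + [("model-b", 30_000)] * 15

latencies = [ms for _, ms in samples]
aggregate_mean = mean(latencies)
aggregate_p99 = percentile(latencies, 99)
per_model_p99 = p99_by_model(samples)

print(f"aggregate mean: {aggregate_mean} ms")
print(f"aggregate p99:  {aggregate_p99} ms")
for model, p99 in sorted(per_model_p99.items()):
    print(f"{model} p99: {p99} ms")
```

With this made-up data the aggregate mean is 548.5 ms and the aggregate p99 is 30 s, which on its own says nothing about where the tail comes from. The per-model split shows model-a at 100 ms and model-b at 30 s, so the second model owns the entire tail.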

Gotchas

ALWAYS use p99 (or p95), never the average, for latency. Averages hide the tail. A mean latency of 400ms can sit on top of a p99 of 30s when one model or one GPU node is timing out. The SLO and the customer pain both live in the tail.
ALWAYS split per-model before concluding. Aggregate metrics blend a healthy high-volume model with a sick low-volume one. The aggregate can look fine, or look uniformly bad, while the real problem is one model. Run the "group by model" variant first, not last.
GPU OOM shows up as a generic 5xx, not as "OOM." A CUDA out-of-memory error on a model server gets caught and returned as a plain 500/503 at the gateway. If you see 5xx with no obvious gateway cause, go straight to the model-server logs and grep for "CUDA out of memory" or "OOM"; do not assume the gateway is the problem.
Label cardinality is capped: do not group by request_id or customer_id in the metrics system. High-cardinality labels get dropped or rejected by the metrics backend, so a "group by request_id" query silently returns nothing or partial data. Use metrics for low-cardinality dimensions (model, region, status_code) and pivot to logs/traces for anything per-request.
A 429 surge is not always a problem on our side. It can be a single customer hammering a key (check the per-key quota table) or an intentional limit doing its job. Confirm whether the rejections are concentrated on one key before paging the rate-limiter team.
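
The "concentrated on one key" check for a 429 surge can be sketched as a simple tally. This is only an illustration: it assumes you have already pulled the API key of each rejected request from logs or the per-key quota table (not from metrics, where per-key labels would hit the cardinality cap). The key names, the `top_key_share` helper, and the 80% threshold are hypothetical.

```python
from collections import Counter


def top_key_share(rejected_keys):
    """Return (api_key, share) for the key with the most 429s; share is in [0, 1]."""
    counts = Counter(rejected_keys)
    if not counts:
        return None, 0.0
    key, hits = counts.most_common(1)[0]
    return key, hits / sum(counts.values())


# Hypothetical input: one entry per request rejected with a 429.
rejections = ["key-17"] * 90 + ["key-03"] * 6 + ["key-42"] * 4

key, share = top_key_share(rejections)
# The 0.8 cutoff is an illustrative assumption, not a documented policy.
concentrated = share >= 0.8

print(f"top key: {key}, share of 429s: {share:.0%}, concentrated: {concentrated}")
```

Here one key accounts for 90% of the rejections, which points at a single noisy customer rather than a rate-limiter fault. A flat distribution across many keys would be the case worth escalating.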

Files

SKILL.md — this file: when to trigger, the investigation spine, gotchas.
references/runbook.md — the symptom→tool→query-pattern table, common cause chains, and the structured incident report template. Read this during an incident.

