debug-ml-inference
// Debug ML inference issues — latency spikes, wrong predictions, event loop blocking
| field | value |
|---|---|
| name | debug-ml-inference |
| description | Debug ML inference issues — latency spikes, wrong predictions, event loop blocking |
| allowed-tools | ["Read","Grep","Glob","Bash(kubectl:*)","Bash(grep:*)","Bash(curl:*)"] |
| when_to_use | Use when an ML service has inference errors, high latency, or incorrect predictions. Examples: 'inference is slow', 'predictions are wrong', '5xx from predict endpoint', 'latency spike in Grafana', 'SHAP errors' |
| argument-hint | <service-name> [symptom description] |
| arguments | ["service-name"] |
| authorization_mode | {"collect_traces":"AUTO","diagnose":"AUTO","propose_fix":"AUTO","apply_fix_dev":"AUTO","apply_fix_staging":"CONSULT","apply_fix_prod":"STOP","escalation_triggers":[{"p1_alert_active":"STOP"},{"error_budget_exhausted":"CONSULT"},{"root_cause_unclear":"CONSULT"}]} |
Systematically diagnose and fix ML inference issues in production FastAPI services.
$service-name: Name of the ML service to debug (e.g., bankchurn)

Identify the root cause of the inference issue and either fix it or provide a specific remediation plan with commands. Every check must produce evidence (command + output).
Run this diagnostic before deep debugging — most inference issues match one of these patterns:
| # | Check | Command | Pass If |
|---|---|---|---|
| D-01 | Multiple workers | grep -rn "workers" $service-name/Dockerfile $service-name/k8s/ | --workers absent or exactly 1 |
| D-02 | Memory HPA | grep -n "memory" $service-name/k8s/base/*hpa* | Empty output |
| D-03 | Sync predict | grep -rnE "\.predict\|predict_proba" $service-name/app/ | Direct model calls only inside sync helpers delegated by run_in_executor |
| D-04 | TreeExplainer | grep -rn "TreeExplainer" $service-name/ | None, or only in try/fallback |
| D-05 | == pinning | grep "==" $service-name/requirements.txt | No ML packages with == |
| D-06 | Suspiciously high metric | Check MLflow: primary > 0.99? | Below 0.99 |
| D-07 | SHAP background | Check background has both classes | Both high/low probs |
| D-08 | Uniform PSI bins | grep -rnE "np.linspace\|uniform" $service-name/src/*/monitoring/ | Uses np.percentile |
| D-09 | Missing heartbeat | grep -n "heartbeat" $service-name/k8s/ monitoring/ | Alert rule exists |
| D-10 | tfstate in git | git ls-files \| grep tfstate | Empty output |
| D-11 | Model in Docker | grep -nE "COPY.*model\|ADD.*model" $service-name/Dockerfile | No matches |
| D-12 | No quality gates | grep -rn "quality_gate|should_promote" $service-name/src/ | Gate logic exists |
| D-21/D-22 | Blocking prediction logs | grep -rn "log_prediction" $service-name/app/ | Logging is fire-and-forget and errors are swallowed |
| D-23 | Probe split | grep -rnE '"/health"\|"/ready"' $service-name/app/ $service-name/k8s/ | /health is liveness, /ready gates on model + warm-up |
| D-24 | SHAP rebuild per request | grep -rn "KernelExplainer" $service-name/app/ | Built once during artifact load/warm-up, not inside endpoint |
Success criteria: All serving checks run, plus any D-13..D-32 checks that match the symptom. Any violation is fixed or documented before proceeding to deep debugging.
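Check D-08 above prefers percentile bins over uniform (np.linspace) bins for PSI. A minimal pure-Python sketch of why, with hypothetical helper names (not the service's actual monitoring code):

```python
import math

def percentile_edges(baseline, n_bins=10):
    # Edges at baseline quantiles: each bin holds ~len(baseline)/n_bins
    # points, so no bin is empty even on skewed features. Uniform
    # (np.linspace) edges can leave bins empty and make PSI unstable.
    s = sorted(baseline)
    step = (len(s) - 1) / n_bins
    return [s[round(i * step)] for i in range(1, n_bins)]

def psi(baseline, current, n_bins=10, eps=1e-4):
    # Population Stability Index over percentile bins of the baseline.
    edges = percentile_edges(baseline, n_bins)

    def shares(xs):
        counts = [0] * n_bins
        for x in xs:
            counts[sum(1 for e in edges if x > e)] += 1
        return [max(c / len(xs), eps) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Identical distributions score 0; a large shift in the current window pushes PSI well above the usual 0.1/0.25 alert thresholds.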
Before treating the symptom as service-specific, verify the scaffolded contract still holds:
pytest tests/test_fastapi_template_contract.py -v
curl -f http://${ENDPOINT}/health
curl -f http://${ENDPOINT}/ready
curl -s http://${ENDPOINT}/metrics | head
If auth is enabled, include the same X-API-Key or Bearer token used by
clients when testing /predict and /model/info. A failure here means
the service has drifted from the template contract; fix that first.
Classify the issue into one of: latency spike, wrong predictions, errors/5xx, or event loop blocking.
Success criteria: Symptom classified with supporting evidence (logs, metrics, or user report).
The #1 cause of ML inference latency in FastAPI is blocking the event loop.
grep -r "run_in_executor" $service-name/app/
grep -r "sync_predict\|_sync_predict" $service-name/app/
If model.predict() is called directly in an async def endpoint → wrap it:
loop = asyncio.get_running_loop()
return await loop.run_in_executor(_inference_executor, partial(_sync_predict, data))
Success criteria: Confirmed predict calls are wrapped in run_in_executor, or fix applied.
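Put together, the wrapped endpoint looks roughly like this — a sketch assuming a module-level executor named _inference_executor and a sync helper _sync_predict, as in the snippet above:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# One shared pool per process, created once at import/startup.
_inference_executor = ThreadPoolExecutor(max_workers=4)

def _sync_predict(data):
    # Stand-in for model.predict(data); runs in a worker thread so a
    # slow model call no longer blocks the event loop.
    return [sum(row) for row in data]

async def predict(data):
    # Body of an async FastAPI endpoint (FastAPI itself is omitted so
    # the sketch stays self-contained).
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _inference_executor, partial(_sync_predict, data)
    )
```

The executor is deliberately module-level: creating a new ThreadPoolExecutor per request would add thread-spawn latency to every call.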
grep -r "workers" $service-name/Dockerfile k8s/base/$service-name-deployment.yaml
If --workers N where N > 1 under K8s → change to 1 worker. HPA handles scaling.
Success criteria: Confirmed single worker (or fix applied).
grep -r "joblib.load\|pickle.load\|load_model" $service-name/app/
Model should be loaded ONCE at startup (lifespan handler or module level), not per-request.
Success criteria: Model load is confirmed at startup, not inside endpoint functions.
If /predict?explain=true is slow:
Check the nsamples parameter in KernelExplainer (the default, 2*K + 2048, can be excessive).
Success criteria: SHAP configuration verified or issue identified with fix.
kubectl top pod -l app=$service-name -n ml-services
kubectl describe pod -l app=$service-name -n ml-services | grep -A5 "Limits\|Requests"
If CPU is at limit → HPA should scale. If not scaling → check HPA target.
Success criteria: Resource usage confirmed within limits or bottleneck identified.
kubectl logs -l app=$service-name -n ml-services --tail=100 | grep -i "SchemaError\|validation"
SchemaError means input data violates the Pandera schema → upstream data change.
Success criteria: No validation errors in logs, or schema mismatch identified.
Never propose --workers N as a fix; always use ThreadPoolExecutor + HPA.

| Issue | Anti-Pattern | Fix | ADR |
|---|---|---|---|
| CPU thrashing | uvicorn --workers N | Single worker + HPA | D-01 |
| HPA stuck | Memory-based metric | CPU-only HPA | D-02 |
| Event loop block | model.predict() in async | run_in_executor | D-03 |
| SHAP errors | TreeExplainer on ensemble | KernelExplainer | D-04 |
The skill is complete when ALL of the following hold:
- A regression test was added under tests/regression/ to prevent recurrence
- ops/audit.jsonl was updated with the timeline (symptom → diagnosis → fix or plan)
- Any handoff to /rollback or /incident was made explicit

A "fix" without evidence is a guess. The skill blocks until evidence is collected.
- .windsurf/workflows/incident.md
- rollback (if a deploy is the cause)
- performance-degradation-rca (if a metric regression precedes)