| name | new-inference-service |
| description | Use when standing up a brand-new inference microservice — a model-serving endpoint, a GPU-backed scoring API, a new FastAPI + Temporal worker — and it must match the org's production conventions out of the gate. Triggers on "new inference service", "spin up a serving endpoint", "scaffold a model API", or any greenfield service that will run on the GPU fleet and needs health probes, auth, structured logging, and graceful Temporal shutdown wired correctly before the first commit. |
New Inference Service
Overview
Every inference service at the org shares a non-negotiable spine: liveness and
readiness probes the orchestrator scrapes, auth middleware on every route,
structured JSON logging the log pipeline can parse, a Temporal worker for
long-running activities, and a deploy manifest that requests GPUs correctly.
Hand-rolled services tend to get at least one of these subtly wrong: a missing
/readyz means traffic hits a cold model, and a worker that doesn't drain on
SIGTERM drops in-flight activities mid-deploy.
This skill scaffolds the whole spine from templates with one command, so the
service starts life production-shaped. The scaffold encodes the requirements
that code review would otherwise have to catch by hand.
When to Use
Reach for this when:
- You're creating a NEW service that serves model inference or scores requests
on the GPU fleet, and nothing exists yet.
- An existing service needs to be split and one half becomes its own deployable.
- You want the FastAPI app, Temporal worker, Dockerfile, and deploy manifest to
agree on names, ports, and probe paths without you reconciling them by hand.
Do NOT use this for:
- Adding a route to a service that already exists — just write the route; the
spine is already there.
- A pure batch job with no HTTP surface and no Temporal worker — this scaffold
carries weight you don't need.
- Non-GPU stateless CRUD apps: use create-app instead; it's the right template
for internal apps without model-serving concerns.
Running it
python .claude/skills/new-inference-service/scripts/scaffold.py \
--name embeddings-router \
--dest services/embeddings-router
--name becomes the service identifier everywhere (app title, log
service field, Temporal task queue, k8s labels, image name). It must be
lowercase-hyphenated; the script rejects anything else so the name stays valid
as both a Python module hint and a DNS label.
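The name rule above can be sketched as a small validator. This is an illustration only, not the code in scaffold.py: the regex, the 63-character DNS label cap, and the helper names `is_valid_service_name` and `module_hint` are assumptions made for the example.

```python
import re

# Assumed pattern: lowercase alphanumeric segments joined by single hyphens,
# starting with a letter. Not taken from scaffold.py.
NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
MAX_DNS_LABEL = 63  # DNS labels are limited to 63 characters


def is_valid_service_name(name: str) -> bool:
    """Return True if name is lowercase-hyphenated and fits in a DNS label."""
    return len(name) <= MAX_DNS_LABEL and NAME_RE.fullmatch(name) is not None


def module_hint(name: str) -> str:
    """Hypothetical helper: derive an importable name by swapping '-' for '_'."""
    return name.replace("-", "_")
```

Under these assumptions, embeddings-router passes, while Embeddings_Router, a name with a leading hyphen, or a name with a doubled hyphen is rejected.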
This creates:
services/embeddings-router/
  app/
    main.py            # FastAPI app: /healthz, /readyz, auth middleware, logging
    worker.py          # Temporal worker with graceful drain on SIGTERM
    logging_config.py  # structured JSON logging setup
  Dockerfile
  deploy.yaml          # k8s Deployment + Service, GPU resource requests
  requirements.txt
Then review deploy.yaml for the GPU count and model-volume mount, and fill in
the actual inference logic in main.py / worker.py where marked # TODO:. The
spine is already correct.
What the scaffold guarantees
- /healthz (liveness) and /readyz (readiness) are distinct. /healthz
returns 200 as soon as the process is up. /readyz returns 200 only after the
model/dependencies have loaded: it reads a ready flag that the model-load
routine sets. The orchestrator routes traffic based on /readyz and restarts the
pod based on /healthz. Collapsing them sends traffic to a service still loading
weights.
- Auth middleware wraps every route except the probes. The probes must stay
unauthenticated so the orchestrator (which has no token) can scrape them; the
template allowlists exactly /healthz and /readyz and rejects any other request
that lacks a valid token.
- Logging is structured JSON from line one, with service, trace_id, and level
fields, emitted to stdout, the only stream the log pipeline ingests.
- The Temporal worker drains in-flight activities on SIGTERM before exiting,
so a rolling deploy doesn't kill an activity that's halfway through.
- deploy.yaml requests GPUs explicitly (nvidia.com/gpu) and pins probe
paths to the ones the app actually serves.
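The probe and auth guarantees can be pictured with a framework-agnostic sketch. It is not the template's code: `ServiceState`, `load_model`, `probe_status`, and `is_allowed` are hypothetical names, and returning 503 while the model loads is an assumption (the guarantee only says /readyz returns 200 after load).

```python
from dataclasses import dataclass
from typing import AbstractSet, Optional

# The only two paths that skip auth, per the allowlist described above.
PROBE_PATHS = frozenset({"/healthz", "/readyz"})


@dataclass
class ServiceState:
    """Hypothetical holder for the readiness flag."""
    ready: bool = False  # starts False; only the model-load path flips it


def load_model(state: ServiceState) -> None:
    """Stand-in for real weight loading; marks the service ready on success."""
    # ... load weights and warm dependencies here ...
    state.ready = True


def probe_status(path: str, state: ServiceState) -> int:
    """Return the HTTP status a probe path would serve."""
    if path == "/healthz":
        return 200  # liveness: the process is up, nothing more
    if path == "/readyz":
        # 503 while loading is an assumption made for this sketch.
        return 200 if state.ready else 503
    raise ValueError(f"not a probe path: {path}")


def is_allowed(path: str, token: Optional[str], valid_tokens: AbstractSet[str]) -> bool:
    """Deny by default: every non-probe path needs a valid token."""
    if path in PROBE_PATHS:
        return True
    return token is not None and token in valid_tokens
```

The design point is that `ready` is set in exactly one place, the load path, so /readyz cannot report 200 before weights are in memory, and a new business route is protected without anyone remembering to register it.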
See assets/ for the templates the scaffold instantiates; each is real code,
not a stub, with # TODO: markers only where your model logic goes.
Gotchas
ALWAYS treat these as real deploy-time failures — each has taken a service down
or caused a silent outage before.
- /healthz and /readyz are NOT the same endpoint. If /readyz returns
200 before the model is loaded, the orchestrator marks the pod ready and routes
real traffic into a service that 500s every request until weights finish
loading. The readiness flag must be set by the model-load code path, not at
process start. The scaffold wires this; if you simplify it to "one health
endpoint," you reintroduce the cold-traffic outage.
- GPU resource requests must be in the deploy manifest, or you land on a CPU
node. Without resources.limits['nvidia.com/gpu'], the scheduler happily places
the pod on a GPU-less node and inference silently runs 50x slower (or CUDA init
crashes the pod in a loop). The request AND the limit must be set: GPUs are
non-overcommittable, so request must equal limit.
- Auth middleware must allowlist the probes, not the other way around. The
safe default is deny: every route needs a token EXCEPT /healthz and /readyz. If
you instead enumerate the business routes that require a token and forget one,
that route is unauthenticated and exposed. The template denies by default and
names the two probe exceptions explicitly; keep it that way.
- Graceful shutdown must drain Temporal activities, not just close the HTTP
server. On SIGTERM, FastAPI stopping is not enough — the Temporal worker is a
separate loop. If it exits immediately, any activity it was running is
abandoned and Temporal will retry it (or time it out) elsewhere, which at best
duplicates work and at worst double-bills. The worker template catches SIGTERM,
stops polling for new activities, and waits for in-flight ones to finish within
the grace period.
- The structured logger must be configured before the first log call. If any
module logs at import time before logging_config.setup() runs, that line goes
out as unstructured text and the log pipeline drops it. The scaffold calls
setup() at the very top of main.py; don't move it below other imports that
log.
- --name is load-bearing in five places. It's the app title, the log
service field, the Temporal task queue, the k8s labels, and the image name.
Renaming the directory later without re-running the scaffold leaves the task
queue and labels pointing at the old name, so workers won't pick up tasks. Pick
the name once, up front.
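The drain sequence from the graceful-shutdown gotcha (catch SIGTERM, stop taking new work, wait for in-flight work within the grace period) can be sketched with plain asyncio. This is a toy stand-in, not the Temporal SDK and not the worker template: `DrainingWorker`, `start_activity`, `drain`, and `run_until_sigterm` are hypothetical names.

```python
import asyncio
import signal
from typing import List, Set


class DrainingWorker:
    """Toy stand-in for a polling worker; this is NOT the Temporal SDK."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self.stopping = False
        self.in_flight: Set[asyncio.Task] = set()
        self.completed: List[str] = []

    async def _activity(self, name: str, seconds: float) -> None:
        await asyncio.sleep(seconds)  # placeholder for real work
        self.completed.append(name)

    def start_activity(self, name: str, seconds: float) -> bool:
        """Accept new work unless a stop was requested. Call from a running loop."""
        if self.stopping:
            return False
        task = asyncio.create_task(self._activity(name, seconds))
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)
        return True

    async def drain(self) -> bool:
        """Stop accepting work, then wait up to grace_seconds for in-flight tasks.

        Returns True if everything finished inside the grace period.
        """
        self.stopping = True
        if not self.in_flight:
            return True
        _, pending = await asyncio.wait(set(self.in_flight), timeout=self.grace_seconds)
        for task in pending:
            task.cancel()  # out of time: abandon whatever is left
        return not pending


async def run_until_sigterm(worker: DrainingWorker) -> bool:
    """Block until SIGTERM arrives, then drain. Unix-only (add_signal_handler)."""
    stop = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)
    await stop.wait()
    return await worker.drain()
```

The ordering is the part that matters: intake stops first, then the wait begins, so nothing new can start during the grace period. A False return from `drain` corresponds to the abandoned-activity case the gotcha warns about.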
Files
- scripts/scaffold.py: CLI that validates --name (lowercase-hyphenated),
creates the destination tree, and instantiates every template in assets/
with the name substituted. Referenced by Running it above.
- assets/service_main.py.tmpl: FastAPI app template: structured-logging setup
first, auth middleware that denies by default and allowlists /healthz +
/readyz, distinct liveness/readiness probes with a model-load readiness flag.
- assets/temporal_worker.py.tmpl: Temporal worker template with SIGTERM
handling that stops polling and drains in-flight activities before exit.
- assets/logging_config.py.tmpl: structured JSON logging setup (setup()),
called first in both main.py and worker.py.
- assets/Dockerfile.tmpl: CUDA-base image, non-root user, single stdout log
stream, runs the app and worker.
- assets/deploy.yaml.tmpl: k8s Deployment + Service: explicit GPU
request==limit, probe paths matching the app, env pinned via the deploy
pipeline.
- assets/requirements.txt.tmpl: pinned FastAPI, uvicorn, Temporal SDK, and the
internal logging/auth libs.
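If you hand-edit deploy.yaml after scaffolding, a quick check like the following can confirm that the manifest still honors the GPU and probe conventions. It is a sketch that takes one container spec already parsed into a dict (for example by a YAML loader); `check_container` and its messages are hypothetical and not part of the scaffold.

```python
from typing import Any, Dict, List

GPU_KEY = "nvidia.com/gpu"
# Probe fields in a k8s container spec mapped to the paths the app serves.
EXPECTED_PROBES = {"livenessProbe": "/healthz", "readinessProbe": "/readyz"}


def check_container(container: Dict[str, Any]) -> List[str]:
    """Return a list of convention violations for one container spec."""
    problems: List[str] = []
    resources = container.get("resources", {})
    request = resources.get("requests", {}).get(GPU_KEY)
    limit = resources.get("limits", {}).get(GPU_KEY)

    if request is None:
        problems.append("missing GPU request")
    if limit is None:
        problems.append("missing GPU limit")
    # Compare as strings so 1 and "1" are treated as the same quantity.
    if request is not None and limit is not None and str(request) != str(limit):
        problems.append("GPU request must equal GPU limit")

    for probe, expected_path in EXPECTED_PROBES.items():
        path = container.get(probe, {}).get("httpGet", {}).get("path")
        if path != expected_path:
            problems.append(f"{probe} should point at {expected_path}")
    return problems
```

An empty list means the container spec matches the conventions; anything else names what drifted.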