new-inference-service

Last updated: 24 June 2026 at 03:26

Use when standing up a brand-new inference microservice — a model-serving endpoint, a GPU-backed scoring API, a new FastAPI + Temporal worker — and it must match the org's production conventions out of the gate. Triggers on "new inference service", "spin up a serving endpoint", "scaffold a model API", or any greenfield service that will run on the GPU fleet and needs health probes, auth, structured logging, and graceful Temporal shutdown wired correctly before the first commit.

Source

az9713

az9713/skill-best-practices

SKILL.md

name

new-inference-service

description

New Inference Service

Overview

Every inference service at the org shares a non-negotiable spine: liveness and readiness probes the orchestrator scrapes, auth middleware on every non-probe route, structured JSON logging the log pipeline can parse, a Temporal worker for long-running activities, and a deploy manifest that requests GPUs correctly. Hand-rolling the spine tends to get one of these subtly wrong: a missing /readyz means traffic hits a cold model, and a worker that doesn't drain on SIGTERM drops in-flight activities mid-deploy.

This skill scaffolds the whole spine from templates with one command, so the service starts life production-shaped. The scaffold encodes the requirements that code review would otherwise have to catch by hand.

When to Use

Reach for this when:

You're creating a NEW service that serves model inference or scores requests on the GPU fleet, and nothing exists yet.
An existing service needs to be split and one half becomes its own deployable.
You want the FastAPI app, Temporal worker, Dockerfile, and deploy manifest to agree on names, ports, and probe paths without you reconciling them by hand.

Do NOT use this for:

Adding a route to a service that already exists — just write the route; the spine is already there.
A pure batch job with no HTTP surface and no Temporal worker — this scaffold carries weight you don't need.
Non-GPU stateless CRUD apps — use create-app instead; it's the right template for internal apps without model-serving concerns.

Running it

python .claude/skills/new-inference-service/scripts/scaffold.py \
  --name embeddings-router \
  --dest services/embeddings-router

--name becomes the service identifier everywhere (app title, log service field, Temporal task queue, k8s labels, image name). It must be lowercase-hyphenated; the script rejects anything else so the name stays valid as both a Python module hint and a DNS label.
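
The naming rule can be illustrated with a small check. This is a sketch of one plausible rule, not the validation in scaffold.py: the pattern, the 63-character cap (the DNS label length limit), and the function name is_valid_service_name are assumptions.

```python
import re

# Hypothetical pattern: starts with a lowercase letter, then lowercase
# alphanumeric groups separated by single hyphens.
NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")

# A DNS label may be at most 63 characters long.
MAX_DNS_LABEL_LENGTH = 63


def is_valid_service_name(name: str) -> bool:
    """Return True if `name` is lowercase-hyphenated and fits in a DNS label."""
    if len(name) > MAX_DNS_LABEL_LENGTH:
        return False
    return NAME_RE.fullmatch(name) is not None
```

Under this rule, embeddings-router passes, while Embeddings_Router, a name with a leading hyphen, or a name with a doubled hyphen would be rejected.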

This creates:

services/embeddings-router/
  app/
    main.py            # FastAPI app: /healthz, /readyz, auth middleware, logging
    worker.py          # Temporal worker with graceful drain on SIGTERM
    logging_config.py  # structured JSON logging setup
  Dockerfile
  deploy.yaml          # k8s Deployment + Service, GPU resource requests
  requirements.txt

Next, review deploy.yaml for the GPU count and model-volume mount, then fill in the actual inference logic in main.py and worker.py where marked # TODO:. The spine itself is already correct and needs no changes.
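
Part of the deploy.yaml review can be scripted. The sketch below assumes the container spec has already been parsed into a plain dict (for example by a YAML loader, not shown) and checks the GPU convention this skill states under Gotchas: both a request and a limit for nvidia.com/gpu, with equal values. The function name gpu_resource_problems is hypothetical and is not part of the scaffold.

```python
GPU_KEY = "nvidia.com/gpu"


def gpu_resource_problems(container: dict) -> list:
    """Return human-readable problems with a container's GPU resources.

    `container` is assumed to be one entry of spec.template.spec.containers
    from a parsed Deployment manifest.
    """
    resources = container.get("resources") or {}
    limit = (resources.get("limits") or {}).get(GPU_KEY)
    request = (resources.get("requests") or {}).get(GPU_KEY)

    problems = []
    if limit is None:
        problems.append(f"resources.limits[{GPU_KEY!r}] is not set")
    if request is None:
        problems.append(f"resources.requests[{GPU_KEY!r}] is not set")
    # Compare as strings so that 1 and "1" (both valid in YAML) count as equal.
    if limit is not None and request is not None and str(limit) != str(request):
        problems.append(f"GPU request ({request}) differs from limit ({limit})")
    return problems
```

An empty list means the container passes this one check; it says nothing about the model-volume mount, which still needs a manual look.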

What the scaffold guarantees

/healthz (liveness) and /readyz (readiness) are distinct. /healthz returns 200 as soon as the process is up. /readyz returns 200 only after the model and its dependencies have loaded: it checks a ready flag that the model-load routine sets. The orchestrator routes traffic on /readyz and restarts on /healthz. Collapsing them sends traffic to a service still loading weights.
Auth middleware wraps every route except the probes. The probes must stay unauthenticated so the orchestrator (which has no token) can scrape them; the template allowlists exactly /healthz and /readyz and rejects everything else without a valid token.
Logging is structured JSON from line one, with service, trace_id, and level fields, emitted to stdout — the only thing the log pipeline ingests.
The Temporal worker drains in-flight activities on SIGTERM before exiting, so a rolling deploy doesn't kill an activity that's halfway through.
deploy.yaml requests GPUs explicitly (nvidia.com/gpu) and pins probe paths to the ones the app actually serves.
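
The liveness/readiness split in the first item can be sketched without any web framework. The Readiness class and load_model helper below are hypothetical names, not the template's code, and the 503 returned while loading is an assumption (the text above only says /readyz returns 200 once loading has finished).

```python
import threading


class Readiness:
    """Tracks whether the model has finished loading."""

    def __init__(self) -> None:
        # threading.Event is safe to set from a background loader thread.
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    def healthz(self) -> tuple:
        # Liveness: the process is up, regardless of model state.
        return 200, {"status": "alive"}

    def readyz(self) -> tuple:
        # Readiness: only report 200 once the loader has set the flag.
        if self._ready.is_set():
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}


def load_model(state: Readiness, loader):
    """Run `loader`, and flip the ready flag only after it returns."""
    model = loader()
    state.mark_ready()
    return model
```

The point of the design is that mark_ready() is reachable only through the model-load path, so nothing at process start can report ready by accident.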

See assets/ for the templates the scaffold instantiates; each is real code, not a stub, with # TODO: markers only where your model logic goes.
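
The deny-by-default auth rule in the list above can be reduced to a small decision function. This is a sketch under assumptions, not the template's middleware: is_allowed, PROBE_PATHS, and the single shared-token comparison are hypothetical, and a real service would validate tokens through its own auth library.

```python
import hmac
from typing import Optional

# The only paths served without a token, matched exactly.
PROBE_PATHS = frozenset({"/healthz", "/readyz"})


def is_allowed(path: str, token: Optional[str], expected_token: str) -> bool:
    """Deny by default: allow probes, otherwise require a matching token."""
    if path in PROBE_PATHS:
        return True
    if token is None:
        return False
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

Because the allowlist names the probes rather than the business routes, a newly added route is protected without anyone having to remember to register it.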

Gotchas

ALWAYS treat these as real deploy-time failures — each has taken a service down or caused a silent outage before.

/healthz and /readyz are NOT the same endpoint. If /readyz returns 200 before the model is loaded, the orchestrator marks the pod ready and routes real traffic into a service that 500s every request until weights finish loading. The readiness flag must be set by the model-load code path, not at process start. The scaffold wires this; if you simplify it to "one health endpoint," you reintroduce the cold-traffic outage.
GPU resource requests must be in the deploy manifest, or you land on a CPU node. Without resources.limits['nvidia.com/gpu'], the scheduler happily places the pod on a GPU-less node and inference silently runs 50x slower (or CUDA init crashes the pod in a loop). Set both the request and the limit, with equal values: GPUs cannot be overcommitted, and Kubernetes rejects a GPU request that differs from its limit.
Auth middleware must allowlist the probes, not the other way around. The safe default is deny: every route needs a token EXCEPT /healthz and /readyz. If you instead allowlist your business routes and forget one, that route is unauthenticated and exposed. The template denies by default and names the two probe exceptions explicitly — keep it that way.
Graceful shutdown must drain Temporal activities, not just close the HTTP server. On SIGTERM, FastAPI stopping is not enough — the Temporal worker is a separate loop. If it exits immediately, any activity it was running is abandoned and Temporal will retry it (or time it out) elsewhere, which at best duplicates work and at worst double-bills. The worker template catches SIGTERM, stops polling for new activities, and waits for in-flight ones to finish within the grace period.
The structured logger must be configured before the first log call. If any module logs at import time before logging_config.setup() runs, that line goes out as unstructured text and the log pipeline drops it. The scaffold calls setup() at the very top of main.py; don't move it below other imports that log.
--name is load-bearing in five places. It's the app title, the log service field, the Temporal task queue, the k8s labels, and the image name. Renaming the directory later without re-running the scaffold leaves the task queue and labels pointing at the old name — workers won't pick up tasks. Pick the name once, up front.
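
The shutdown sequence described in the Temporal item above (stop taking new work, then wait for in-flight work within a grace period) can be illustrated with plain asyncio. DrainingWorker is a toy stand-in and does not use the Temporal SDK; its names and the grace_seconds parameter are assumptions made for the sketch.

```python
import asyncio
import signal


class DrainingWorker:
    """Toy worker that refuses new work once shutdown begins."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self.stopping = False
        self.in_flight = set()
        self.completed = []

    def submit(self, name: str, seconds: float) -> None:
        if self.stopping:
            raise RuntimeError("worker is draining; not accepting new work")
        task = asyncio.create_task(self._run(name, seconds))
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)

    async def _run(self, name: str, seconds: float) -> None:
        await asyncio.sleep(seconds)  # stands in for real activity work
        self.completed.append(name)

    async def shutdown(self) -> int:
        """Stop accepting work, wait up to the grace period, return abandoned count."""
        self.stopping = True
        tasks = list(self.in_flight)
        if not tasks:
            return 0
        _done, pending = await asyncio.wait(tasks, timeout=self.grace_seconds)
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)


async def run_until_sigterm(worker: DrainingWorker) -> int:
    """Block until SIGTERM arrives, then drain. Unix-only (add_signal_handler)."""
    stop = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)
    await stop.wait()
    return await worker.shutdown()
```

A nonzero return value from shutdown() is the case the gotcha warns about: work that did not finish inside the grace period and would be retried elsewhere.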

Files

scripts/scaffold.py — CLI that validates --name (lowercase-hyphenated), creates the destination tree, and instantiates every template in assets/ with the name substituted. Referenced by Running it above.
assets/service_main.py.tmpl — FastAPI app template: structured-logging setup first, auth middleware that denies by default and allowlists /healthz + /readyz, distinct liveness/readiness probes with a model-load readiness flag.
assets/temporal_worker.py.tmpl — Temporal worker template with SIGTERM handling that stops polling and drains in-flight activities before exit.
assets/logging_config.py.tmpl — structured JSON logging setup (setup()), called first in both main.py and worker.py.
assets/Dockerfile.tmpl — CUDA-base image, non-root user, single stdout log stream, runs the app and worker.
assets/deploy.yaml.tmpl — k8s Deployment + Service: explicit GPU request==limit, probe paths matching the app, env pinned via the deploy pipeline.
assets/requirements.txt.tmpl — pinned FastAPI, uvicorn, Temporal SDK, and the internal logging/auth libs.
