ai-infrastructure-modal
Serverless GPU compute platform for AI model deployment — web endpoints, GPU functions, model serving, and TypeScript client patterns
| name | ai-infrastructure-modal |
| description | Serverless GPU compute platform for AI model deployment — web endpoints, GPU functions, model serving, and TypeScript client patterns |
Quick Guide: Modal is a serverless GPU compute platform where you define Python functions with decorators and Modal handles containers, scaling, and GPU provisioning. TypeScript apps interact with Modal via HTTP endpoints (calling @modal.fastapi_endpoint or @modal.asgi_app functions) or the modal npm SDK (calling functions directly via gRPC). Define container images, secrets, and volumes as code -- no YAML config files. Use modal deploy for production, modal serve for dev.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use @modal.fastapi_endpoint (not the old @modal.web_endpoint) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use modal.Volume for model weight caching -- @modal.build is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use modal.Secret.from_name() and access via os.environ)
(You MUST bind to 0.0.0.0 (not 127.0.0.1) when using @modal.web_server)
</critical_requirements>
Auto-detection: Modal, modal, modal.App, modal.Image, modal.Volume, modal.Secret, modal.gpu, modal.fastapi_endpoint, modal.asgi_app, modal.web_server, modal.Cron, modal.Period, modal deploy, modal serve, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, ModalClient
When to use:
- [...] modal npm SDK
Key patterns covered:
- [...] @modal.fastapi_endpoint, @modal.asgi_app, @modal.web_server) for HTTP access
- [...] modal npm SDK)
When NOT to use:
Modal eliminates infrastructure management for GPU workloads. Everything is code -- container images, GPU allocation, secrets, volumes, scaling rules. There are no YAML configs, Dockerfiles, or Kubernetes manifests.
Core principles:
- [...] modal npm SDK for direct function invocation without HTTP overhead.
- modal deploy creates a named, persistent deployment with stable URLs. modal serve creates ephemeral dev endpoints.
The most common pattern: define a Python endpoint on Modal, call it from TypeScript via fetch.
# inference.py
import modal

app = modal.App("my-inference-api")
image = modal.Image.debian_slim().uv_pip_install(["fastapi[standard]", "transformers", "torch"])

@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST")
def predict(payload: dict):
    # GPU-accelerated inference
    text = payload["text"]
    result = run_model(text)
    return {"prediction": result}
const response = await fetch(MODAL_ENDPOINT, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(input),
  signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS), // Essential for cold starts
});
Key requirements: Named constant for URL (not hardcoded at call sites), Content-Type: application/json header (FastAPI rejects without it), AbortSignal.timeout() to handle cold start delays, typed request/response interfaces.
See examples/core.md for a complete TypeScript client with error handling and typed interfaces.
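As a rough sketch of the key requirements above, the helper below assembles the fetch options in one place and wraps the call in a typed function. The endpoint URL, the 120-second timeout, and the PredictRequest/PredictResponse shapes are assumptions for illustration (the response shape mirrors the {"prediction": ...} dict returned by the Python function); substitute the values from your own deployment.

```typescript
// Hypothetical endpoint URL; substitute the URL printed by `modal deploy`.
const MODAL_ENDPOINT = "https://my-workspace--my-inference-api-predict.modal.run";
// Assumed timeout budget, chosen to leave room for a cold start.
const REQUEST_TIMEOUT_MS = 120_000;

interface PredictRequest {
  text: string;
}

interface PredictResponse {
  prediction: unknown; // Actual shape depends on what run_model returns.
}

// Build the fetch options separately so they can be inspected or tested.
function buildPredictInit(input: PredictRequest): RequestInit {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(input),
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  };
}

// Send the request and surface non-2xx responses as errors.
async function predict(input: PredictRequest): Promise<PredictResponse> {
  const response = await fetch(MODAL_ENDPOINT, buildPredictInit(input));
  if (!response.ok) {
    throw new Error(`Modal endpoint returned HTTP ${response.status}`);
  }
  return (await response.json()) as PredictResponse;
}
```

Keeping the option builder separate from the network call means the headers, body, and timeout can be checked without reaching Modal.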
Modal supports proxy auth tokens that protect endpoints without spinning up containers for unauthorized requests.
@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST", requires_proxy_auth=True)
def predict_secure(payload: dict):
    return {"prediction": run_model(payload["text"])}

headers: {
  "Content-Type": "application/json",
  "Modal-Key": process.env.MODAL_PROXY_KEY, // Proxy auth token
  "Modal-Secret": process.env.MODAL_PROXY_SECRET,
},
Why good: Auth handled at Modal's proxy layer (no container spin-up for bad requests), env vars for credentials. Add explicit 401 handling in your error logic.
See examples/core.md for a complete authenticated TypeScript client with error handling.
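One possible way to handle the proxy credentials and the 401 case is sketched below. The function names and the three-way status classification are hypothetical, not part of any Modal API; only the Modal-Key and Modal-Secret header names come from the snippet above.

```typescript
const HTTP_UNAUTHORIZED = 401;

type ResponseKind = "ok" | "unauthorized" | "failed";

// Build headers for a proxy-auth-protected endpoint.
// Throws early when a credential is missing instead of sending a doomed request.
function buildProxyAuthHeaders(
  key: string | undefined,
  secret: string | undefined,
): Record<string, string> {
  if (!key || !secret) {
    throw new Error("Missing Modal proxy auth credentials");
  }
  return {
    "Content-Type": "application/json",
    "Modal-Key": key,
    "Modal-Secret": secret,
  };
}

// Classify an HTTP status so callers can treat auth failures separately.
function classifyStatus(status: number): ResponseKind {
  if (status === HTTP_UNAUTHORIZED) return "unauthorized";
  if (status >= 200 && status < 300) return "ok";
  return "failed";
}
```

In a Node backend the arguments would typically be process.env.MODAL_PROXY_KEY and process.env.MODAL_PROXY_SECRET, and an "unauthorized" result would usually be reported as a configuration problem rather than retried.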
For TypeScript apps that need to call Modal functions without HTTP overhead. Requires Node 22+.
import { ModalClient } from "modal";
const modal = new ModalClient(); // Create once, reuse
const fn = await modal.functions.fromName("my-inference-api", "predict");
const result = await fn.remote([text]); // sync call
const call = await fn.spawn([text]); // async (fire-and-forget)
const later = await call.get(); // retrieve result later
Why good: No HTTP serialization overhead, typed SDK, supports async spawn for long-running jobs
See examples/core.md for complete TypeScript SDK patterns including error handling and fire-and-forget job IDs.
When to use: Backend-to-Modal calls where you control the Node.js runtime (Node 22+). Not for browser or edge runtimes.
Modal functions define their compute environment inline.
import modal

app = modal.App("gpu-inference")

# Container image with ML dependencies
inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install(["torch==2.5.0", "transformers==4.47.0", "accelerate"])
    .apt_install(["libgl1"])
)

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

@app.function(
    image=inference_image,
    gpu="A100",  # GPU type: "T4", "A10G", "A100", "H100", etc.
    secrets=[modal.Secret.from_name("huggingface-secret")],
    volumes={"/models": modal.Volume.from_name("model-cache", create_if_missing=True)},
    min_containers=1,  # Keep warm to avoid cold starts
    scaledown_window=300,  # Seconds before scaling to zero
)
def generate(prompt: str) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load from volume cache
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/models")
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, cache_dir="/models")
    # ... generate and return
Why good: Pinned dependency versions, volume-based model caching (avoids re-download), min_containers for warm starts, secrets for HF token
# BAD: No version pinning, no volume cache, model re-downloads every cold start
@app.function(gpu="A100")
def generate(prompt: str):
    from transformers import pipeline

    pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
    return pipe(prompt)[0]["generated_text"]
Why bad: Unpinned deps break reproducibility, no volume means multi-GB model download on every cold start (30-60s+ delay), no secret for gated models
For endpoints that need routing, middleware, or multiple routes.
import modal
from fastapi import FastAPI

app = modal.App("my-api")
web_app = FastAPI()

@web_app.post("/predict")
async def predict(payload: dict):
    return {"result": "prediction"}

@web_app.get("/health")
async def health():
    return {"status": "ok"}

@app.function(image=modal.Image.debian_slim().uv_pip_install(["fastapi"]))
@modal.asgi_app()
def serve():
    return web_app
Why good: Full FastAPI capabilities (routing, middleware, validation), multiple endpoints under one function
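Because both routes live under one deployed function, a TypeScript client only needs one base URL. A minimal sketch, assuming a hypothetical base URL and a small route table that mirrors the two FastAPI routes above:

```typescript
// Hypothetical base URL for the ASGI app; every route shares it.
const MODAL_API_BASE_URL = "https://my-workspace--my-api-serve.modal.run";

// Paths registered on the FastAPI app in the Python example.
const ROUTES = {
  predict: "/predict",
  health: "/health",
} as const;

type RouteName = keyof typeof ROUTES;

// Resolve a route name to an absolute URL on the deployed app.
function routeUrl(route: RouteName, baseUrl: string = MODAL_API_BASE_URL): string {
  return new URL(ROUTES[route], baseUrl).toString();
}
```

The RouteName type limits callers to routes that exist in the table, so a mistyped path fails at compile time rather than as a 404.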
# Creating secrets via CLI
# modal secret create my-api-keys API_KEY=sk-xxx DB_URL=postgres://...

@app.function(
    secrets=[
        modal.Secret.from_name("my-api-keys"),
        modal.Secret.from_name("huggingface-secret"),
    ]
)
def my_function():
    import os

    api_key = os.environ["API_KEY"]  # Injected by Modal
    hf_token = os.environ["HF_TOKEN"]
Why good: Secrets created via dashboard or CLI, referenced by name in code, accessed as standard env vars, multiple secrets composable
@app.function(
    schedule=modal.Cron("0 2 * * *"),  # 2 AM daily
    image=inference_image,
    gpu="A10G",
    volumes={"/data": modal.Volume.from_name("training-data")},
)
def nightly_batch_inference():
    # Process accumulated data
    # Write results to volume
    pass

@app.function(schedule=modal.Period(hours=6))
def periodic_health_check():
    # Check model freshness, data quality, etc.
    pass
Why good: modal.Cron for precise scheduling, modal.Period for intervals. Scheduled functions cannot accept arguments -- use volumes or secrets for input data.
<decision_framework>
Does your TypeScript app need to call Modal?
+-- Via HTTP (most common)
|   +-- Single endpoint? -> @modal.fastapi_endpoint
|   +-- Multiple routes? -> @modal.asgi_app with FastAPI
|   +-- Non-Python server (vLLM, TGI)? -> @modal.web_server(port=8000)
|   +-- Need auth? -> Add requires_proxy_auth=True
+-- Via SDK (direct gRPC)
|   +-- Node 22+ backend? -> npm install modal, use ModalClient
|   +-- Browser/edge? -> Use HTTP endpoints instead
+-- Async job?
    +-- Fire-and-forget? -> SDK spawn() + later get()
    +-- Webhook callback? -> Modal calls your endpoint on completion

What are you serving?
+-- Simple function -> @modal.fastapi_endpoint (auto-wraps in FastAPI)
+-- Full web app -> @modal.asgi_app (FastAPI, Starlette, FastHTML)
+-- Legacy sync app -> @modal.wsgi_app (Flask, Django)
+-- Custom server binary -> @modal.web_server(port=8000) (vLLM, TGI, Ollama)

How should TypeScript call Modal?
+-- Browser/edge runtime? -> HTTP (fetch)
+-- Server-side Node 22+? -> Either works
|   +-- Need simplicity? -> HTTP
|   +-- Need speed (no serialization overhead)? -> SDK
|   +-- Need async spawn? -> SDK
+-- Multiple providers? -> HTTP (vendor-agnostic)
</decision_framework>
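The "How should TypeScript call Modal?" tree can be read as a small function. The sketch below is one interpretation, not a rule from Modal: the field names are hypothetical, and the precedence when several conditions hold at once (runtime limits first, then vendor neutrality, then async spawn) is an assumption the tree itself does not spell out.

```typescript
type Runtime = "browser" | "edge" | "node";
type Transport = "http" | "sdk";

interface CallContext {
  runtime: Runtime;
  nodeMajorVersion?: number; // Only meaningful when runtime is "node".
  needsAsyncSpawn?: boolean;
  multipleProviders?: boolean;
  prefersSimplicity?: boolean; // Treated as true when omitted.
}

// Minimum Node major version for the modal npm SDK, per the text above.
const MIN_SDK_NODE_MAJOR = 22;

function chooseTransport(ctx: CallContext): Transport {
  // Browser and edge runtimes are limited to HTTP.
  if (ctx.runtime !== "node") return "http";
  // Older Node versions cannot load the SDK.
  if ((ctx.nodeMajorVersion ?? 0) < MIN_SDK_NODE_MAJOR) return "http";
  // Vendor-agnostic code stays on plain HTTP.
  if (ctx.multipleProviders) return "http";
  // spawn() is only reachable through the SDK.
  if (ctx.needsAsyncSpawn) return "sdk";
  // Either works here; fall back to HTTP unless simplicity is explicitly declined.
  return ctx.prefersSimplicity === false ? "sdk" : "http";
}
```

For example, a Node 22 backend that needs fire-and-forget jobs resolves to "sdk", while the same code running on Node 20 falls back to "http".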
<red_flags>
High Priority Issues:
- Using @modal.web_endpoint instead of @modal.fastapi_endpoint (renamed in Modal 1.0)
- Using @modal.build for downloading model weights (deprecated -- use modal.Volume instead)
- Using .lookup() for object references (deprecated -- use .from_name())
- Hardcoding secrets in Modal code (use modal.Secret.from_name() + os.environ)
- Binding @modal.web_server to 127.0.0.1 instead of 0.0.0.0 (endpoint unreachable)
Medium Priority Issues:
- Unpinned package versions in uv_pip_install() (breaks reproducibility)
- No min_containers for latency-sensitive endpoints (2-4s cold starts)
- No signal: AbortSignal.timeout() on TypeScript fetch calls (hangs on cold starts)
- Using modal.Period when you need exact times (use modal.Cron -- Period resets on redeploy)
Common Mistakes:
- Confusing modal serve (ephemeral dev) with modal deploy (persistent production)
- [...] messages parameter with @modal.fastapi_endpoint (it is not OpenAI -- it is a plain HTTP endpoint)
- Omitting the Content-Type: application/json header from TypeScript (FastAPI endpoints may reject the request)
- Using the modal npm SDK in browser or edge runtimes (requires Node 22+, native modules)
Gotchas & Edge Cases:
- modal serve URLs get a -dev suffix to avoid production conflicts
- [...] https://<workspace>--<app-name>-<function-name>.modal.run
- [...] version=2)
- modal.Cron maintains schedule across redeploys; modal.Period resets
- [...] str, int, bool, bytes
- [...] Image.add_local_python_source() (automounting removed in 1.0)
</red_flags>
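The URL pattern and the -dev suffix from the gotchas can be captured in a small helper, so endpoint URLs are derived in one place rather than hand-typed. This is a sketch under assumptions: the function and field names are hypothetical, the suffix is assumed to sit at the end of the first hostname label, and the inputs are assumed to already be in the form Modal uses in URLs.

```typescript
interface EndpointParts {
  workspace: string;
  appName: string;
  functionName: string;
  dev?: boolean; // true for URLs created by `modal serve`
}

// Compose an endpoint URL following
// https://<workspace>--<app-name>-<function-name>.modal.run
function modalEndpointUrl(parts: EndpointParts): string {
  const suffix = parts.dev ? "-dev" : "";
  return `https://${parts.workspace}--${parts.appName}-${parts.functionName}${suffix}.modal.run`;
}
```

Treat the CLI output of modal deploy or modal serve as the source of truth and use a helper like this only as a convenience.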
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use @modal.fastapi_endpoint (not the old @modal.web_endpoint) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use modal.Volume for model weight caching -- @modal.build is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use modal.Secret.from_name() and access via os.environ)
(You MUST bind to 0.0.0.0 (not 127.0.0.1) when using @modal.web_server)
Failure to follow these rules will produce broken deployments, security vulnerabilities, or unreachable endpoints.
</critical_reminders>