ai-infrastructure-replicate
Replicate SDK patterns for TypeScript/Node.js -- client setup, predictions, streaming, webhooks, file handling, model versioning, deployments, and training
| name | ai-infrastructure-replicate |
| description | Replicate SDK patterns for TypeScript/Node.js -- client setup, predictions, streaming, webhooks, file handling, model versioning, deployments, and training |
Quick Guide: Use the replicate npm package to run open-source ML models on serverless GPUs. Use replicate.run() for synchronous execution that returns output directly, replicate.stream() for SSE-based streaming, or replicate.predictions.create() for async background jobs with webhook notifications. Models are referenced as owner/model (uses latest version) or owner/model:version (pinned). File outputs are FileOutput objects implementing ReadableStream. Cold starts are expected for infrequently-used models -- use deployments with min_instances to keep models warm.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN)
(You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs)
(You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads)
(You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications)
(You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest, which can change)
</critical_requirements>
Auto-detection: Replicate, replicate, replicate.run, replicate.stream, replicate.predictions, replicate.deployments, replicate.trainings, replicate.models, FileOutput, validateWebhook, REPLICATE_API_TOKEN, serverless GPU, cold start, webhook_events_filter
When to use:
Key patterns covered:
- Predictions (replicate.run(), replicate.predictions.create(), replicate.wait())
- Streaming (replicate.stream() with SSE events)
- Model versioning (owner/model vs owner/model:version)
- File handling (FileOutput, file uploads, Buffer inputs)
When NOT to use:
Replicate provides serverless GPU infrastructure for running open-source ML models. You send inputs, Replicate allocates GPU hardware, runs the model, and returns outputs. No Docker, no CUDA drivers, no GPU provisioning.
Core principles:
- Browse models at replicate.com/explore. Run any public model with just its identifier.
- Pin versions in production (owner/model:abc123...) to guarantee identical behavior across deploys.
- Choose the execution mode: replicate.run() for synchronous wait, replicate.stream() for real-time SSE output, replicate.predictions.create() for fire-and-forget with webhooks.
- Expect FileOutput objects for file outputs.
Initialize the Replicate client. It auto-reads REPLICATE_API_TOKEN from the environment.
// lib/replicate.ts -- basic setup
import Replicate from "replicate";
const replicate = new Replicate();
export { replicate };
// lib/replicate.ts -- explicit auth + custom user agent
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN, // Auto-reads from env if omitted
  userAgent: "my-app/1.0.0",
});

export { replicate };
Why good: Minimal setup, env var auto-detected, explicit auth optional but useful for clarity
// BAD: Hardcoded token
const replicate = new Replicate({
  auth: "r8_abc123...",
});
Why bad: Hardcoded API token is a security risk, will leak in version control
See: examples/core.md for full constructor options, error handling patterns
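Because the client reads the token from the environment, a missing variable may only surface on the first API call. The following is a minimal sketch of a fail-fast check, assuming you want to validate configuration at startup. `requireEnv` and the `Env` type are hypothetical names for illustration and are not part of the replicate package.

```typescript
// Hypothetical helper (not part of the replicate package): fail fast when a
// required environment variable is missing or blank.
type Env = Record<string, string | undefined>;

function requireEnv(name: string, env: Env): string {
  // Treat whitespace-only values the same as missing ones.
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

In a Node.js app this could be called as requireEnv("REPLICATE_API_TOKEN", process.env) before constructing the client, so a misconfigured deploy fails at boot rather than mid-request.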
Use replicate.run() for synchronous execution. Returns the model output directly.
// Run an image generation model
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: {
    prompt: "a serene mountain landscape at sunset",
  },
});
// output is a FileOutput object for image models
console.log(output.url()); // URL of generated image
// Run an LLM -- language models typically return an array of string chunks
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Explain TypeScript generics in 3 sentences.",
    max_tokens: 512,
  },
});
// Join the chunks to get the full text response
console.log(Array.isArray(output) ? output.join("") : output);
Why good: Simple API, returns output directly, destructuring works for array outputs (images)
// BAD: Not pinning version in production
const output = await replicate.run("community-user/experimental-model", {
  input: { prompt: "hello" },
});
Why bad: Community models without version pinning can change behavior unexpectedly when authors push updates
See: examples/core.md for version pinning, predictions.create() + wait(), and progress callbacks
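The shape of what replicate.run() returns depends on the model, so text-handling code benefits from normalizing the result before use. The sketch below assumes a text model returns either a single string or an array of string chunks; `toText` is a hypothetical helper name, not an SDK function.

```typescript
// Hypothetical helper: normalize the result of a text model to one string.
// Assumption: the output is either a string or an array of string chunks.
function toText(output: unknown): string {
  if (typeof output === "string") {
    return output;
  }
  if (
    Array.isArray(output) &&
    output.every((chunk) => typeof chunk === "string")
  ) {
    return output.join("");
  }
  // Anything else (for example a file object) is not text output.
  throw new TypeError("Expected a string or an array of strings");
}
```

Throwing on unexpected shapes keeps a file-producing model from being silently stringified.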
Use replicate.stream() for real-time SSE output from language models.
const stream = replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Write a short poem about TypeScript.",
    max_tokens: 512,
  },
});

for await (const event of stream) {
  if (event.event === "output") {
    process.stdout.write(event.data);
  }
}
Why good: Progressive output for better UX, event-based with typed event and data fields
// BAD: Using replicate.run() for user-facing LLM output
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Write a long essay..." },
});
// User waits for entire generation to complete before seeing anything
Why bad: No progressive feedback, user sees a blank screen for seconds
See: examples/streaming-webhooks.md for event types, error handling, cancellation
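The loop above only handles "output" events. As a rough sketch of handling the other event types this document mentions ("error" and "done"), the helper below consumes any async iterable of events, forwards chunks as they arrive, and returns the full text. `StreamEvent` is a structural stand-in for the SDK's event objects and `collectStream` is a hypothetical name. Both are assumptions for illustration.

```typescript
// Structural stand-in for the SDK's event objects (assumed shape: event + data).
interface StreamEvent {
  event: string;
  data: string;
}

// Hypothetical helper: forward each output chunk and return the full text.
async function collectStream(
  stream: AsyncIterable<StreamEvent>,
  onChunk: (chunk: string) => void = () => {},
): Promise<string> {
  let text = "";
  for await (const { event, data } of stream) {
    if (event === "output") {
      text += data;
      onChunk(data);
    } else if (event === "error") {
      // Surface stream errors instead of silently returning partial text.
      throw new Error(`Stream failed: ${data}`);
    } else if (event === "done") {
      break;
    }
  }
  return text;
}
```

With the SDK, the result of replicate.stream(...) could be passed as the first argument and a function that writes to the response as the second.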
Models are referenced as owner/model (latest version) or owner/model:sha256hash (pinned version).
// Development: use latest version for convenience
const output = await replicate.run("stability-ai/sdxl", {
  input: { prompt: "a cat" },
});

// Production: pin to a specific version for reproducibility
const VERSION_HASH =
  "39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b";
const output = await replicate.run(`stability-ai/sdxl:${VERSION_HASH}`, {
  input: { prompt: "a cat" },
});
Why good: Pinned version guarantees identical behavior, hash is immutable
See: examples/core.md for listing model versions, getting version details
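To enforce the pinning rule in code rather than by convention, a small guard can reject unpinned references before they reach the client. This is a sketch only: `parseModelRef` and `assertPinned` are hypothetical helpers, and the regular expression is an assumption about what a reasonable owner/model or owner/model:version string looks like, not Replicate's own validation.

```typescript
interface ModelRef {
  owner: string;
  name: string;
  version?: string; // undefined means "latest"
}

// Hypothetical helper: split "owner/model" or "owner/model:version".
function parseModelRef(ref: string): ModelRef {
  const match = /^([^/:\s]+)\/([^/:\s]+)(?::([^/:\s]+))?$/.exec(ref);
  if (!match) {
    throw new Error(`Invalid model reference: ${ref}`);
  }
  const [, owner, name, version] = match;
  return version ? { owner, name, version } : { owner, name };
}

// Hypothetical guard for production code paths: reject unpinned references.
function assertPinned(ref: string): string {
  if (!parseModelRef(ref).version) {
    throw new Error(`Model reference is not pinned to a version: ${ref}`);
  }
  return ref;
}
```

A production code path could wrap its identifier as assertPinned(`stability-ai/sdxl:${VERSION_HASH}`) so that an accidental unpinned string fails loudly.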
Models that output files return FileOutput objects implementing ReadableStream.
import { writeFile } from "node:fs/promises";

const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "a sunset over mountains" },
});

// FileOutput has .url() and .blob() methods
console.log(output.url()); // Underlying URL

// Save to disk
const blob = await output.blob();
const buffer = Buffer.from(await blob.arrayBuffer());
await writeFile("./output.png", buffer);
// File inputs: pass URLs, Buffers, or ReadStreams
import { readFile } from "node:fs/promises";

const imageBuffer = await readFile("./input.png");
const output = await replicate.run("some-user/image-model", {
  input: {
    image: imageBuffer, // Auto-uploaded (max 100 MiB)
  },
});
Why good: FileOutput is a ReadableStream, works with Node.js stream APIs, .url() for the underlying URL
// BAD: Treating file output as a plain URL string
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "hello" },
});
const url = output; // WRONG: output is a FileOutput object, not a string
Why bad: FileOutput is an object, not a string -- use .url() to get the URL
See: examples/core.md for file uploads, large file handling, encoding strategies
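Code that consumes several models often has to cope with a single file, an array of files, or plain URL strings. One way to handle that uniformly is sketched below. `UrlBearing` is a structural stand-in for FileOutput (anything with a url() method), and `toUrlStrings` is a hypothetical helper, so neither name comes from the SDK.

```typescript
// Structural stand-in for a FileOutput-like object (assumed: has a url() method).
interface UrlBearing {
  url(): { toString(): string };
}

function isUrlBearing(value: unknown): value is UrlBearing {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { url?: unknown }).url === "function"
  );
}

// Hypothetical helper: flatten a model output into URL strings.
function toUrlStrings(output: unknown): string[] {
  const items: unknown[] = Array.isArray(output) ? output : [output];
  return items.map((item) => {
    if (typeof item === "string") {
      return item;
    }
    if (isUrlBearing(item)) {
      return String(item.url());
    }
    throw new TypeError("Output item is neither a string nor a file object");
  });
}
```

Since the URLs are described in this document as temporary, the returned strings are best treated as download locations rather than values to store long term.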
Use replicate.predictions.create() for background jobs with webhook notifications.
const prediction = await replicate.predictions.create({
  model: "owner/model", // OR version: "sha256hash" for pinned version
  input: { prompt: "a painting of a cat" },
  webhook: "https://my.app/webhooks/replicate",
  webhook_events_filter: ["completed"],
});

console.log(prediction.id); // Use to track status
console.log(prediction.status); // "starting"
// Webhook signature validation (CRITICAL for security)
import { validateWebhook } from "replicate";

async function handleWebhook(request: Request): Promise<Response> {
  const secret = process.env.REPLICATE_WEBHOOK_SIGNING_SECRET;
  if (!secret) {
    return new Response("Webhook secret not configured", { status: 500 });
  }
  // Validate a clone: validation reads the body, and it can only be read once
  const isValid = await validateWebhook(request.clone(), secret);
  if (!isValid) {
    return new Response("Invalid signature", { status: 401 });
  }
  const prediction = await request.json();
  // Process prediction.output safely
  return new Response("OK", { status: 200 });
}
Why good: Decoupled processing, secure signature validation, filtered events reduce noise
See: examples/streaming-webhooks.md for webhook event types, polling alternative
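Once a webhook payload has passed signature validation, the handler still has to branch on the prediction's status. The sketch below classifies a payload into a small discriminated union. `PredictionPayload` lists only the fields this example reads, and `classifyPrediction` and `WebhookResult` are hypothetical names, so treat the shapes as assumptions rather than the SDK's types.

```typescript
// Minimal shape of the fields this sketch reads from a webhook payload.
interface PredictionPayload {
  id: string;
  status: string;
  output?: unknown;
  error?: unknown;
}

type WebhookResult =
  | { kind: "succeeded"; id: string; output: unknown }
  | { kind: "failed"; id: string; reason: string }
  | { kind: "canceled"; id: string }
  | { kind: "pending"; id: string };

// Hypothetical helper: classify an already-verified payload by its status.
function classifyPrediction(payload: PredictionPayload): WebhookResult {
  switch (payload.status) {
    case "succeeded":
      return { kind: "succeeded", id: payload.id, output: payload.output };
    case "failed":
      return {
        kind: "failed",
        id: payload.id,
        reason: String(payload.error ?? "unknown error"),
      };
    case "canceled":
      return { kind: "canceled", id: payload.id };
    default:
      // "starting", "processing", or any status this sketch does not know about
      return { kind: "pending", id: payload.id };
  }
}
```

Keeping this step pure makes it easy to unit test separately from the HTTP handler and the signature check.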
Deployments give you a private, fixed endpoint with custom hardware and scaling.
// Create a prediction on a deployment (no cold start if min_instances > 0)
const prediction = await replicate.deployments.predictions.create(
  "my-org", // deployment owner
  "my-deployment", // deployment name
  {
    input: { prompt: "hello world" },
  },
);
const result = await replicate.wait(prediction);
console.log(result.output);
Why good: Predictable latency with min_instances, private endpoint, custom hardware selection
See: examples/deployments-training.md for creating/managing deployments, training API
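The replicate.wait() call above blocks until the prediction finishes. Purely to illustrate the idea of polling until a terminal status, and not as the SDK's actual implementation, here is a sketch with an injected fetch function. `pollUntilTerminal`, `PredictionLike`, and the default interval and attempt limits are all hypothetical choices for this example.

```typescript
interface PredictionLike {
  id: string;
  status: string;
}

const TERMINAL_STATUSES = new Set(["succeeded", "failed", "canceled"]);
const DEFAULT_INTERVAL_MS = 500;
const DEFAULT_MAX_ATTEMPTS = 120;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical polling loop; `fetchPrediction` is injected so the sketch
// does not depend on any particular client.
async function pollUntilTerminal<T extends PredictionLike>(
  fetchPrediction: (id: string) => Promise<T>,
  id: string,
  intervalMs: number = DEFAULT_INTERVAL_MS,
  maxAttempts: number = DEFAULT_MAX_ATTEMPTS,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const prediction = await fetchPrediction(id);
    if (TERMINAL_STATUSES.has(prediction.status)) {
      return prediction;
    }
    // Skip the final sleep so the timeout error is raised promptly.
    if (attempt < maxAttempts) {
      await sleep(intervalMs);
    }
  }
  throw new Error(`Prediction ${id} did not finish after ${maxAttempts} attempts`);
}
```

With the SDK, the injected function could be (id) => replicate.predictions.get(id). For production traffic the webhook approach described earlier avoids the polling overhead entirely.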
Catch API errors with status codes. The SDK auto-retries on 429 and 5xx errors (5 retries by default with exponential backoff).
try {
  const output = await replicate.run("owner/model", {
    input: { prompt: "hello" },
  });
} catch (error) {
  if (error instanceof Error) {
    console.error(`Replicate error: ${error.message}`);
    // API errors carry the HTTP response on `response`; read the status from it
    const status = (error as { response?: { status?: number } }).response
      ?.status;
    if (status === 401) {
      throw new Error("Invalid API token. Check REPLICATE_API_TOKEN.");
    }
    if (status === 422) {
      console.error("Invalid input parameters");
    }
    if (status === 429) {
      console.error("Rate limited -- SDK auto-retries (5 attempts) exhausted");
    }
  }
  throw error;
}
Why good: Checks error type, handles specific status codes, re-throws unexpected errors
See: examples/core.md for full error handling example with status code handling
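The status checks above can be pulled into a reusable helper so that every call site maps errors the same way. This sketch assumes the thrown error exposes the HTTP status either as `response.status` or as `status`. That is an assumption about the error shape, which is why the helper checks both. `getHttpStatus`, `describeError`, and `STATUS_HINTS` are hypothetical names.

```typescript
// Hypothetical helper: pull an HTTP status code out of an unknown error.
// Assumption: the error exposes either `response.status` or `status`.
function getHttpStatus(error: unknown): number | undefined {
  if (typeof error !== "object" || error === null) {
    return undefined;
  }
  const candidate = error as {
    status?: unknown;
    response?: { status?: unknown } | null;
  };
  const status = candidate.response?.status ?? candidate.status;
  return typeof status === "number" ? status : undefined;
}

const STATUS_HINTS: Record<number, string> = {
  401: "Invalid API token. Check REPLICATE_API_TOKEN.",
  422: "Invalid input parameters.",
  429: "Rate limited.",
};

// Hypothetical helper: turn an error into a short, actionable message.
function describeError(error: unknown): string {
  const status = getHttpStatus(error);
  if (status !== undefined && STATUS_HINTS[status]) {
    return STATUS_HINTS[status];
  }
  if (status !== undefined && status >= 500) {
    return "Server error; a later retry may succeed.";
  }
  return error instanceof Error ? error.message : String(error);
}
```

A catch block could then log describeError(error) and re-throw, keeping the status-to-message mapping in one place.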
Frequent model with varying load -> Use deployments with min_instances >= 1
One-off batch jobs -> Use predictions.create() with webhooks (no waiting)
Popular public models -> Usually warm, replicate.run() is fine
Custom/niche models -> Expect 30s-5min cold start on first run
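Given cold starts in the range listed above, a request handler may want to bound how long it waits before falling back to an asynchronous path. The following is a minimal sketch of such a bound. `withTimeout` is a hypothetical helper, and it only stops waiting; it does not cancel the underlying prediction.

```typescript
// Hypothetical helper: reject if `work` has not settled within `timeoutMs`.
// Note: this stops waiting; it does not cancel the underlying prediction.
async function withTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  label = "operation",
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

If the timeout fires, the caller would still need to cancel or track the prediction separately, since billing continues while it runs.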
- Use deployments with min_instances: 1 to eliminate cold starts
- Use replicate.stream() for LLMs -- progressive output feels faster than waiting for full completion
- Cancel predictions you no longer need with replicate.predictions.cancel() -- stops billing immediately
<decision_framework>
Is this a user-facing LLM response?
+-- YES -> Use replicate.stream() for real-time SSE output
+-- NO -> Do you need the result immediately?
    +-- YES -> Use replicate.run() (blocks until complete)
    +-- NO -> Use replicate.predictions.create() + webhook
        +-- Need to poll instead? -> Use replicate.wait(prediction)

Are you in development/prototyping?
+-- YES -> Use owner/model (latest version, convenient)
+-- NO -> Are you in production?
    +-- YES -> Use owner/model:version_hash (pinned, reproducible)
    +-- Does the model change frequently?
        +-- YES -> Pin version, test updates explicitly
        +-- NO -> Either format works, prefer pinned

Do you need consistent low latency?
+-- YES -> Create a deployment with min_instances >= 1
+-- NO -> Do you need custom hardware (A100, H100)?
    +-- YES -> Create a deployment with specific hardware
    +-- NO -> Use replicate.run() / replicate.stream() directly
              (Replicate auto-allocates hardware)

Are you running open-source models on serverless GPUs?
+-- YES -> Use Replicate SDK
+-- NO -> Are you calling proprietary APIs (OpenAI, Anthropic)?
    +-- YES -> Not this skill's scope -- use provider-specific SDKs
    +-- NO -> Do you need to switch between multiple providers?
        +-- YES -> Not this skill's scope -- use a unified provider SDK
        +-- NO -> Do you want to self-host models?
            +-- YES -> Not this skill's scope -- consider Cog or vLLM
            +-- NO -> Replicate SDK is appropriate
</decision_framework>
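The execution-mode and latency trees in the decision framework can also be read as two small pure functions. The sketch below encodes them that way. The type and function names (`Workload`, `chooseExecutionMode`, `LatencyNeeds`, `needsDeployment`) are hypothetical and exist only for this illustration.

```typescript
type ExecutionMode = "stream" | "run" | "predictions.create + webhook";

interface Workload {
  userFacingLlm: boolean;
  needsImmediateResult: boolean;
}

// First tree: user-facing LLM output streams; otherwise block or go async.
function chooseExecutionMode(workload: Workload): ExecutionMode {
  if (workload.userFacingLlm) {
    return "stream";
  }
  return workload.needsImmediateResult ? "run" : "predictions.create + webhook";
}

interface LatencyNeeds {
  consistentLowLatency: boolean;
  customHardware: boolean;
}

// Third tree: either requirement points to a deployment.
function needsDeployment(needs: LatencyNeeds): boolean {
  return needs.consistentLowLatency || needs.customHardware;
}
```

Encoding the choice this way makes it straightforward to log or test why a given code path picked streaming, a blocking run, or a webhook-backed prediction.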
<red_flags>
High Priority Issues:
- Hardcoding REPLICATE_API_TOKEN in source code (security breach risk)
- Treating FileOutput as a string (it is a ReadableStream object -- use .url() or .blob())
- Skipping validateWebhook() (allows forged webhook payloads)
- Using replicate.run() for long-running models in request handlers (blocks the response, can time out)
Medium Priority Issues:
- Not pinning model versions in production (owner/model uses latest, which can change without notice)
- Passing large files as a Buffer instead of hosting them at a URL (100 MiB limit on uploads)
Common Mistakes:
- Confusing replicate.run() (returns output directly) with replicate.predictions.create() (returns a prediction object with status/id)
- Writing const output = await replicate.run(...) instead of const [output] = await replicate.run(...) (image models return arrays)
- Using replicate.stream() with models that do not support streaming (only language models with SSE support)
- replicate.predictions.create() accepts either a version hash or a model string (owner/model) -- use version for pinned reproducibility, model for latest-version convenience
- replicate.stream() (events are lost)
Gotchas & Edge Cases:
- replicate.stream() returns ServerSentEvent objects with .event ("output", "error", "done") and .data (string) properties
- webhook_events_filter accepts ["start", "output", "logs", "completed"] -- use ["completed"] unless you need intermediate status updates
- The Prefer: wait header enables sync mode on the HTTP API (up to 60s), but replicate.run() already handles this automatically
- replicate.wait() polls the API until the prediction completes -- use webhooks for production to avoid polling overhead
- FileOutput.url() returns the underlying URL, but these URLs are temporary -- download or persist the file before it expires
</red_flags>
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN)
(You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs)
(You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads)
(You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications)
(You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest, which can change)
Failure to follow these rules will produce insecure, unreliable, or unpredictable AI integrations.
</critical_reminders>