| name | ai-infrastructure-huggingface-inference |
| description | Hugging Face Inference SDK patterns for TypeScript/Node.js — InferenceClient setup, chat completion, text generation, streaming, embeddings, image generation, audio transcription, translation, summarization, and Inference Endpoints |
Hugging Face Inference Patterns
Quick Guide: Use @huggingface/inference (v4+) to access 200k+ ML models on the Hugging Face Hub. Use InferenceClient with chatCompletion() for OpenAI-compatible chat, textGeneration() for raw text completion, chatCompletionStream() for streaming, featureExtraction() for embeddings, textToImage() for image generation, and automaticSpeechRecognition() for audio transcription. Set provider to route through inference providers (Cerebras, Together, Groq, etc.) or use endpointUrl for dedicated Inference Endpoints.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST always pass an access token to InferenceClient -- never deploy without authentication)
(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)
(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)
(You MUST specify a model parameter for every inference call -- there is no default model)
(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)
</critical_requirements>
Auto-detection: Hugging Face, huggingface, @huggingface/inference, InferenceClient, HfInference, hf.chatCompletion, hf.textGeneration, hf.featureExtraction, hf.textToImage, hf.automaticSpeechRecognition, hf.translation, hf.summarization, hf.textToSpeech, chatCompletionStream, textGenerationStream, HF_TOKEN, inference provider, Inference Endpoints
When to use:
- Accessing any of the 200k+ models hosted on the Hugging Face Hub
- Running chat completion with open-source LLMs (Qwen, Mistral, Llama, etc.)
- Generating embeddings with sentence-transformer models for semantic search
- Generating images from text prompts (FLUX, Stable Diffusion)
- Transcribing audio with automatic speech recognition models
- Running translation, summarization, text classification, or NER tasks
- Deploying models on dedicated Inference Endpoints for production use
- Using third-party inference providers (Cerebras, Together, Groq, Replicate, etc.) through a unified API
Key patterns covered:
- InferenceClient initialization and configuration
- Chat Completion API (OpenAI-compatible messages format, streaming)
- Text generation (raw completion, streaming)
- Embeddings via feature extraction
- Image generation (text-to-image)
- Audio transcription (automatic speech recognition)
- Translation, summarization, and text classification
- Inference Endpoints (dedicated deployments)
- Inference Providers (routing through third-party services)
- Error handling with typed error classes
When NOT to use:
- If you only use OpenAI models -- use the OpenAI SDK directly
- If you need a provider-agnostic unified SDK with structured outputs and tool calling -- use a higher-level AI SDK
- If you need to fine-tune or train models -- use the
@huggingface/hub package or Python transformers
Examples Index
- Core: Setup, Chat & Text Generation -- Client init, chat completion, text generation, streaming, error handling
- Tasks: Embeddings, Vision, Audio & NLP -- Feature extraction, image generation, speech recognition, translation, summarization, classification
- Quick API Reference -- Method signatures, error types, provider list, model recommendations
Philosophy
The @huggingface/inference SDK provides a unified TypeScript client for accessing hundreds of thousands of ML models through multiple backends: serverless Inference Providers, dedicated Inference Endpoints, and local servers.
Core principles:
- Model-agnostic access -- One client, any model on the Hub. Swap models by changing the
model parameter without code changes.
- Provider flexibility -- Route inference through 20+ providers (Cerebras, Together, Groq, Replicate, etc.) with a single
provider parameter, or deploy your own Inference Endpoints.
- Task-oriented API -- Methods map to ML tasks (
chatCompletion, textToImage, automaticSpeechRecognition), not raw HTTP endpoints.
- OpenAI-compatible chat --
chatCompletion() uses the OpenAI message format (role + content), making migration between providers easy.
- Streaming as async generators --
chatCompletionStream() and textGenerationStream() return AsyncGenerator, consumed with for await...of.
Core Patterns
Pattern 1: Client Setup
Initialize with your Hugging Face access token. The token is required for authenticated access.
import { InferenceClient } from "@huggingface/inference";
const client = new InferenceClient(process.env.HF_TOKEN);
export { client };
const ENDPOINT_URL =
"https://your-endpoint.us-east-1.aws.endpoints.huggingface.cloud/v1/";
const client = new InferenceClient(process.env.HF_TOKEN, {
endpointUrl: ENDPOINT_URL,
});
export { client };
Why good: Token from env var, named constant for endpoint URL, named export
const hf = new InferenceClient("hf_abc123xyz");
export default hf;
Why bad: Hardcoded token is a security risk, default export violates conventions
See: examples/core.md for provider routing, local endpoints, and endpoint helper
Pattern 2: Chat Completion (OpenAI-Compatible)
Use chatCompletion() for conversational LLM tasks. Follows the OpenAI message format.
const MAX_TOKENS = 512;
const TEMPERATURE = 0.1;
const response = await client.chatCompletion({
model: "Qwen/Qwen3-32B",
provider: "cerebras",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain TypeScript generics." },
],
max_tokens: MAX_TOKENS,
temperature: TEMPERATURE,
});
console.log(response.choices[0].message.content);
Why good: Named constants for parameters, explicit model and provider, system message for behavior
const response = await client.chatCompletion({
messages: [{ role: "user", content: "do something" }],
max_tokens: 512,
temperature: 0.1,
});
Why bad: Missing required model, magic numbers, vague prompt, no system instruction
See: examples/core.md for multi-turn conversations and provider selection
Pattern 3: Streaming Chat Completion
Use chatCompletionStream() for streaming responses. Returns an AsyncGenerator.
const MAX_TOKENS = 512;
let fullResponse = "";
for await (const chunk of client.chatCompletionStream({
model: "Qwen/Qwen3-32B",
provider: "cerebras",
messages: [{ role: "user", content: "Explain async/await in TypeScript." }],
max_tokens: MAX_TOKENS,
})) {
if (chunk.choices && chunk.choices.length > 0) {
const content = chunk.choices[0].delta.content;
if (content) {
process.stdout.write(content);
fullResponse += content;
}
}
}
console.log();
Why good: Async generator consumed with for await, progressive output, null checks on chunk data
for await (const chunk of client.chatCompletionStream({
model: "...",
messages: [],
})) {
process.stdout.write(chunk.choices[0].delta.content);
}
Why bad: No null check -- choices may be empty, content may be null between chunks
See: examples/core.md for text generation streaming
Pattern 4: Text Generation (Raw Completion)
Use textGeneration() for prompt continuation without the chat message format.
const MAX_NEW_TOKENS = 250;
const result = await client.textGeneration({
model: "mistralai/Mixtral-8x7B-v0.1",
provider: "together",
inputs: "The key benefits of TypeScript are",
parameters: { max_new_tokens: MAX_NEW_TOKENS },
});
console.log(result.generated_text);
Why good: Named constant, clear prompt, explicit provider, direct access to generated_text
See: examples/core.md for streaming text generation
Pattern 5: Embeddings (Feature Extraction)
Use featureExtraction() for generating vector embeddings for semantic search and RAG.
const embeddings = await client.featureExtraction({
model: "sentence-transformers/all-MiniLM-L6-v2",
inputs: "That is a happy person",
});
Why good: Purpose-built embedding model, simple input/output
See: examples/tasks.md for batch embeddings and cosine similarity
Pattern 6: Image Generation (Text-to-Image)
Use textToImage() to generate images from text prompts. Returns a Blob.
const imageBlob = await client.textToImage({
model: "black-forest-labs/FLUX.1-dev",
inputs: "a serene mountain landscape at sunset",
provider: "replicate",
});
Why good: Explicit model and provider, descriptive prompt
See: examples/tasks.md for saving images, image-to-image, and output formats
Pattern 7: Audio Transcription
Use automaticSpeechRecognition() for speech-to-text.
import { readFileSync } from "node:fs";
const result = await client.automaticSpeechRecognition({
model: "facebook/wav2vec2-large-960h-lv60-self",
data: readFileSync("audio/recording.flac"),
});
console.log(result.text);
Why good: Uses data parameter with file buffer, outputs .text
See: examples/tasks.md for Whisper models, audio classification, and text-to-speech
Pattern 8: Error Handling
Always catch InferenceClientError and its subclasses. Re-throw unexpected errors.
import {
InferenceClientError,
InferenceClientInputError,
InferenceClientProviderApiError,
InferenceClientProviderOutputError,
InferenceClientHubApiError,
} from "@huggingface/inference";
try {
const result = await client.chatCompletion({
model: "Qwen/Qwen3-32B",
messages: [{ role: "user", content: "Hello" }],
});
} catch (error) {
if (error instanceof InferenceClientProviderApiError) {
console.error("Provider API error:", error.message);
console.error("Request:", error.request);
console.error("Response:", error.response);
} else if (error instanceof InferenceClientHubApiError) {
console.error("Hub API error:", error.message);
} else if (error instanceof InferenceClientProviderOutputError) {
console.error("Malformed provider response:", error.message);
} else if (error instanceof InferenceClientInputError) {
console.error("Invalid input:", error.message);
} else if (error instanceof InferenceClientError) {
console.error("Inference error:", error.message);
} else {
throw error;
}
}
Why good: Specific error types for each failure mode, request/response details for debugging, re-throws unexpected errors
See: examples/core.md for full error handling patterns
<decision_framework>
Decision Framework
Which Method to Use
What is your task?
+-- Conversational LLM (messages) -> chatCompletion() / chatCompletionStream()
+-- Raw text continuation -> textGeneration() / textGenerationStream()
+-- Embeddings for search/RAG -> featureExtraction()
+-- Image from text prompt -> textToImage()
+-- Speech to text -> automaticSpeechRecognition()
+-- Text to speech -> textToSpeech()
+-- Language translation -> translation()
+-- Summarize long text -> summarization()
+-- Classify text -> textClassification()
+-- Named entity recognition -> tokenClassification()
+-- Classify image -> imageClassification()
+-- Detect objects -> objectDetection()
+-- Caption an image -> imageToText()
+-- Answer questions from context -> questionAnswering()
Chat Completion vs Text Generation
Do you have a conversation with roles (system/user/assistant)?
+-- YES -> chatCompletion() / chatCompletionStream()
| Uses OpenAI-compatible message format
| Supports system messages, multi-turn
+-- NO -> Do you want to continue/complete a text prompt?
+-- YES -> textGeneration() / textGenerationStream()
| Takes raw text input via 'inputs'
+-- NO -> Use a task-specific method instead
Serverless vs Dedicated
What are your deployment needs?
+-- Prototyping / low volume -> Serverless Inference Providers (provider: "auto")
| Free tier available, shared infrastructure, may have cold starts
+-- Production / high volume -> Inference Endpoints (endpointUrl)
| Dedicated GPU, autoscaling, scale-to-zero, private infrastructure
+-- Local development -> Local endpoint (endpointUrl: "http://localhost:8080")
Works with llama.cpp, Ollama, vLLM, TGI, LiteLLM
When to Use This SDK vs Others
Do you need access to 200k+ open-source models?
+-- YES -> Use @huggingface/inference
+-- NO -> Do you only use OpenAI models?
+-- YES -> Not this skill's scope -- use the OpenAI SDK directly
+-- NO -> Do you need structured outputs / tool calling?
+-- YES -> Not this skill's scope -- use a higher-level AI SDK
+-- NO -> @huggingface/inference works for most ML tasks
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Hardcoding access tokens instead of using environment variables (security breach risk)
- Using bare
catch blocks without checking InferenceClientError types (hides API errors, loses debug info)
- Omitting the
model parameter -- always specify the model explicitly for predictable behavior (the SDK can pick a recommended model if omitted, but this is unreliable for production)
- Not consuming
chatCompletionStream() / textGenerationStream() generators (tokens are silently lost)
- Using
textGeneration() for conversational tasks instead of chatCompletion() (wrong API shape)
Medium Priority Issues:
- Not specifying
provider when a specific provider is needed (default "auto" picks based on your HF settings, which may not be optimal)
- Not checking
chunk.choices length before accessing chunk.choices[0] in streaming (may throw on empty chunks)
- Using
request() / streamingRequest() directly -- these are deprecated, use task-specific methods
- Ignoring
max_tokens / max_new_tokens limits (output may be truncated or excessively long)
- Not handling model loading time for serverless inference (cold models return 503, then load)
Common Mistakes:
- Confusing
chatCompletion() parameters with textGeneration() parameters -- chat uses messages + max_tokens, text generation uses inputs + parameters.max_new_tokens
- Using
inputs parameter with chatCompletion() -- it uses messages, not inputs
- Using
messages parameter with textGeneration() -- it uses inputs, not messages
- Forgetting that
textToImage() returns a Blob, not a URL or Buffer
- Treating
featureExtraction() output as always a flat array -- shape depends on the model (can be nested arrays for batch inputs)
- Not passing
data (binary) for audio/image tasks, or passing a string path instead of the actual file buffer
Gotchas & Edge Cases:
- The
provider: "auto" default selects providers based on your HF account settings at hf.co/settings/inference-providers -- not by availability or speed. Set an explicit provider for predictable routing.
- Serverless models may need time to load (cold start). First requests to a cold model may return 503 errors while the model warms up. The SDK handles retries, but initial requests can be slow.
- When using Inference Endpoints with
endpointUrl, the model parameter is often ignored because the endpoint serves a specific model.
chatCompletion() is OpenAI-API compatible -- it works with any OpenAI-compatible endpoint, not just Hugging Face.
HfInference is still exported for backward compatibility but InferenceClient is the current class name.
- Third-party provider API keys can be passed as the
accessToken -- when authenticated with a non-HF key, requests go directly to the provider instead of through HF's routing layer.
- Tree-shakeable imports (
import { textGeneration } from "@huggingface/inference") require passing accessToken as a parameter instead of constructor.
textToImage() supports multiple output types: blob (default), url, dataUrl, or json via the options outputType parameter.
- Translation requires
parameters.src_lang and parameters.tgt_lang for many-to-many models like mbart-large-50-many-to-many-mmt.
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST always pass an access token to InferenceClient -- never deploy without authentication)
(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)
(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)
(You MUST specify a model parameter for every inference call -- there is no default model)
(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)
Failure to follow these rules will produce insecure, unreliable, or silently failing AI integrations.
</critical_reminders>