ai-infrastructure-ollama
Local LLM inference with the Ollama JavaScript client -- chat, streaming, tool calling, vision, embeddings, structured output, model management, and OpenAI-compatible endpoint
| name | ai-infrastructure-ollama |
| description | Local LLM inference with the Ollama JavaScript client -- chat, streaming, tool calling, vision, embeddings, structured output, model management, and OpenAI-compatible endpoint |
Quick Guide: Use the ollama npm package to run LLMs locally. Use ollama.chat() for conversations and ollama.generate() for single prompts. Enable streaming with stream: true and iterate with for await. Use format with a JSON schema (via zodToJsonSchema) for structured outputs. Use the tools array for function calling. Use ollama.embed() for embeddings. Models run on your machine -- no API keys required for local use, but be aware of model loading time and memory usage.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST use ollama.chat() for conversations and ollama.generate() for single-prompt completions -- they have different parameter shapes)
(You MUST handle model loading delays -- the first request to a model that is not yet in memory takes significantly longer because the model has to be loaded first)
(You MUST use zodToJsonSchema() from zod-to-json-schema for structured outputs -- do NOT manually construct JSON schemas)
(You MUST accumulate streamed thinking, content, and tool_calls fields to maintain conversation history in multi-turn interactions)
(You MUST never assume a model is already pulled -- check with ollama.list() or handle errors from missing models gracefully)
</critical_requirements>
Auto-detection: Ollama, ollama, ollama.chat, ollama.generate, ollama.embed, ollama.pull, ollama.list, ollama.show, ollama.delete, ollama.ps, ollama.abort, ollama.create, keep_alive, zodToJsonSchema, OLLAMA_HOST, llama3, mistral, qwen, gemma, phi, deepseek, local LLM
When to use:
Key patterns covered:
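- Chat completions (ollama.chat) and text generation (ollama.generate)
- Streaming with for await and accumulated state
- Structured output with format + zodToJsonSchema
- Tool calling with the tools array and multi-turn tool loops
- Vision input with the images parameter
- Embeddings with ollama.embed()

When NOT to use: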
The Ollama JavaScript library is a thin client over Ollama's local REST API (default http://127.0.0.1:11434). It provides direct access to locally-running open-source LLMs with zero cloud dependencies.
Core principles:
- ollama.chat() and ollama.generate() are the two primary methods. The default import is a pre-configured singleton client; create custom instances with new Ollama() for non-default hosts.
- Set stream: true to get an AsyncGenerator.
- Ollama exposes /v1/chat/completions and /v1/embeddings endpoints, allowing the OpenAI SDK to connect with baseURL: 'http://localhost:11434/v1'.

The default import is a pre-configured singleton pointing to http://127.0.0.1:11434.
// lib/ollama.ts -- default client (most common)
import ollama from "ollama";
// Use directly -- connects to localhost:11434
const response = await ollama.chat({
model: "llama3.1",
messages: [{ role: "user", content: "Hello" }],
});
// lib/ollama.ts -- custom client for non-default host
import { Ollama } from "ollama";
const ollama = new Ollama({
host: "http://192.168.1.100:11434",
});
export { ollama };
Why good: Minimal setup, default client requires zero configuration, custom client for remote servers
// BAD: Hardcoding host inline everywhere
import { Ollama } from "ollama";
const response = await new Ollama({ host: "http://192.168.1.100:11434" }).chat({
model: "llama3.1",
messages: [{ role: "user", content: "Hello" }],
});
Why bad: Creates a new client instance per request, no reuse, host scattered across codebase
See: examples/core.md for cloud API setup, custom headers, browser usage
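One way to keep the host out of individual call sites is to resolve it once from configuration. The sketch below assumes the host is supplied through an OLLAMA_HOST environment variable and reads it explicitly rather than relying on the client to do so. The names resolveOllamaHost and DEFAULT_OLLAMA_HOST are hypothetical and not part of the ollama package.

```typescript
// Default local endpoint, as stated above.
const DEFAULT_OLLAMA_HOST = "http://127.0.0.1:11434";

// Hypothetical helper (not part of the ollama package): returns the configured
// host, or the local default when the variable is unset or blank.
const resolveOllamaHost = (
  env: Record<string, string | undefined>,
): string => {
  const configured = env.OLLAMA_HOST?.trim();
  return configured ? configured : DEFAULT_OLLAMA_HOST;
};
```

The result could then be passed once to new Ollama({ host: resolveOllamaHost(process.env) }) in the shared client module.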
Multi-turn conversations with message history. You manage the messages array.
import ollama from "ollama";
const response = await ollama.chat({
model: "llama3.1",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain TypeScript generics." },
],
});
console.log(response.message.content);
Why good: Clear message roles, system message for behavior control, direct content access
// BAD: Not checking response, no system message
const res = await ollama.chat({
model: "llama3.1",
messages: [{ role: "user", content: "do something" }],
});
Why bad: No system instruction means unpredictable behavior, vague prompt
See: examples/core.md for multi-turn conversations, model options
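Because the caller owns the messages array, each exchange has to be appended before the next request is sent. A minimal sketch of that bookkeeping follows. ChatMessage is a local stand-in for the role/content shape used above, and appendExchange is a hypothetical helper name.

```typescript
// Local stand-in for the message shape used in the examples above.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: returns a new history with the user turn and the
// assistant reply appended, leaving the input array untouched.
const appendExchange = (
  previous: readonly ChatMessage[],
  userContent: string,
  assistantContent: string,
): ChatMessage[] => [
  ...previous,
  { role: "user", content: userContent },
  { role: "assistant", content: assistantContent },
];

const initialHistory: ChatMessage[] = [
  { role: "system", content: "You are a helpful coding assistant." },
];

// In real use the third argument would be response.message.content.
const afterFirstTurn = appendExchange(
  initialHistory,
  "Explain TypeScript generics.",
  "(assistant reply)",
);
console.log(afterFirstTurn.length); // 3
```

Passing afterFirstTurn plus the next user message as messages gives the model the full conversation on the following call.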
Single-prompt completions without message history. Simpler than chat for one-shot tasks.
import ollama from "ollama";
const response = await ollama.generate({
model: "llama3.1",
prompt: "Write a haiku about TypeScript.",
system: "You are a creative writer.",
});
console.log(response.response);
Why good: Simpler API for one-shot tasks, system parameter instead of message array
See: examples/core.md for generate with images, suffix, raw mode
Set stream: true to get an AsyncGenerator. Iterate with for await.
import ollama from "ollama";
const stream = await ollama.chat({
model: "llama3.1",
messages: [{ role: "user", content: "Explain async/await." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.message.content);
}
console.log(); // newline
Why good: Progressive output for better UX, memory-efficient for long responses
// BAD: Not consuming the stream
const stream = await ollama.chat({
model: "llama3.1",
messages: [{ role: "user", content: "Hello" }],
stream: true,
});
// Stream never consumed -- response is lost
Why bad: Stream must be consumed via iteration, otherwise the response is silently lost
See: examples/core.md for generate streaming, abort, thinking mode streaming
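For multi-turn use, the streamed pieces have to be folded back into a single assistant message. The reducer below is a sketch of that accumulation. The interfaces are local stand-ins for the chunk fields named in this document (content, thinking, tool_calls), and accumulateChunk is a hypothetical name. The chunks are simulated so the snippet runs without a server.

```typescript
// Local stand-ins for the streamed chunk fields that matter here.
interface StreamedToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}
interface StreamedChunk {
  message: {
    content: string;
    thinking?: string;
    tool_calls?: StreamedToolCall[];
  };
}
interface AccumulatedMessage {
  role: "assistant";
  content: string;
  thinking: string;
  tool_calls: StreamedToolCall[];
}

const EMPTY_MESSAGE: AccumulatedMessage = {
  role: "assistant",
  content: "",
  thinking: "",
  tool_calls: [],
};

// Hypothetical reducer: folds one chunk into the running assistant message.
const accumulateChunk = (
  acc: AccumulatedMessage,
  chunk: StreamedChunk,
): AccumulatedMessage => ({
  role: "assistant",
  content: acc.content + chunk.message.content,
  thinking: acc.thinking + (chunk.message.thinking ?? ""),
  tool_calls: [...acc.tool_calls, ...(chunk.message.tool_calls ?? [])],
});

// Simulated chunks standing in for what a stream might yield.
const sampleChunks: StreamedChunk[] = [
  { message: { content: "", thinking: "Considering. " } },
  { message: { content: "Hello" } },
  { message: { content: " world" } },
  {
    message: {
      content: "",
      tool_calls: [
        { function: { name: "get_weather", arguments: { city: "Tokyo" } } },
      ],
    },
  },
];

const finalMessage = sampleChunks.reduce(accumulateChunk, EMPTY_MESSAGE);
console.log(finalMessage.content); // Hello world
```

Inside a real for await loop the same reducer would be applied per chunk, and the final message pushed onto the messages array before the next request.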
Use format with a JSON schema to constrain model output. Use zodToJsonSchema() from zod-to-json-schema to convert Zod schemas.
import ollama from "ollama";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";
const Country = z.object({
name: z.string(),
capital: z.string(),
languages: z.array(z.string()),
});
type Country = z.infer<typeof Country>;
const response = await ollama.chat({
model: "llama3.1",
messages: [
{
role: "user",
content: "Tell me about France. Respond in JSON.",
},
],
format: zodToJsonSchema(Country),
});
const country: Country = Country.parse(JSON.parse(response.message.content));
console.log(country.capital); // "Paris"
Why good: Type-safe output via Zod, JSON schema constrains model output, parse validates response
// BAD: Using format: 'json' without a schema
const response = await ollama.chat({
model: "llama3.1",
messages: [{ role: "user", content: "Tell me about France as JSON" }],
format: "json",
});
// No schema enforcement -- model can return any JSON shape
Why bad: format: 'json' only ensures valid JSON syntax, not structure -- use a JSON schema for reliable extraction
See: examples/structured-output.md for vision extraction, complex schemas
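Even with a schema, the content can occasionally fail to parse or validate (for example, when output is truncated), so it may help to return a result instead of throwing. The sketch below stays dependency-free by taking a validator function. In the example above that role would be played by Country.safeParse. The names parseStructuredContent, CountryShape, and isCountry are hypothetical, and the hand-written guard is only a stand-in for the Zod check.

```typescript
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Hypothetical helper: parses message.content and checks its shape
// without throwing.
const parseStructuredContent = <T>(
  content: string,
  isValid: (value: unknown) => value is T,
): ParseResult<T> => {
  let parsed: unknown;
  try {
    parsed = JSON.parse(content);
  } catch {
    return { ok: false, error: "content is not valid JSON" };
  }
  return isValid(parsed)
    ? { ok: true, value: parsed }
    : { ok: false, error: "JSON does not match the expected shape" };
};

interface CountryShape {
  name: string;
  capital: string;
  languages: string[];
}

// Hand-written guard standing in for a Zod schema check.
const isCountry = (value: unknown): value is CountryShape => {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.name === "string" &&
    typeof candidate.capital === "string" &&
    Array.isArray(candidate.languages) &&
    candidate.languages.every((item) => typeof item === "string")
  );
};
```

A caller can then branch on result.ok and retry or report the error rather than letting an exception escape.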
Define tools the model can request. Handle tool_calls in responses and feed results back.
import ollama from "ollama";
const tools = [
{
type: "function" as const,
function: {
name: "get_weather",
description: "Get the current weather for a city",
parameters: {
type: "object",
required: ["city"],
properties: {
city: { type: "string", description: "City name" },
},
},
},
},
];
const response = await ollama.chat({
model: "llama3.1",
messages: [{ role: "user", content: "What is the weather in Tokyo?" }],
tools,
});
if (response.message.tool_calls?.length) {
for (const toolCall of response.message.tool_calls) {
console.log(`Call: ${toolCall.function.name}`);
console.log(`Args:`, toolCall.function.arguments);
}
}
Why good: Standard tool schema format, checks for tool_calls before processing, arguments already parsed (not stringified JSON)
See: examples/tools.md for multi-turn tool loops, streaming tool calls, parallel tools
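Feeding results back means turning each requested call into a tool message. The sketch below shows one possible dispatch step. The handler registry and the stubbed weather lookup are hypothetical, and the role: "tool" / tool_name message shape is an assumption that should be checked against the client version in use.

```typescript
// Local stand-ins; arguments arrive as an already-parsed object.
interface ToolCallLike {
  function: { name: string; arguments: Record<string, unknown> };
}
// Assumed shape for a tool result message.
interface ToolResultMessage {
  role: "tool";
  content: string;
  tool_name: string;
}
type ToolHandler = (args: Record<string, unknown>) => string;

// Hypothetical registry; the weather lookup is a stub, not a real API call.
const toolHandlers = new Map<string, ToolHandler>([
  [
    "get_weather",
    (args: Record<string, unknown>) => `Sunny in ${String(args.city)}`,
  ],
]);

// Runs each requested call and returns one tool message per call.
// Unknown tool names produce an explanatory message instead of throwing.
const runToolCalls = (
  calls: readonly ToolCallLike[],
  handlers: ReadonlyMap<string, ToolHandler>,
): ToolResultMessage[] =>
  calls.map((call): ToolResultMessage => {
    const handler = handlers.get(call.function.name);
    const content = handler
      ? handler(call.function.arguments)
      : `Unknown tool: ${call.function.name}`;
    return { role: "tool", content, tool_name: call.function.name };
  });
```

The returned messages would be appended after the assistant message that contained tool_calls, followed by another ollama.chat() call so the model can use the results.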
Use ollama.embed() for text embeddings. Supports single or batch inputs.
import ollama from "ollama";
const EMBEDDING_MODEL = "nomic-embed-text";
const response = await ollama.embed({
model: EMBEDDING_MODEL,
input: [
"TypeScript is a typed superset of JavaScript.",
"Rust is a systems programming language.",
],
});
console.log(`Vectors: ${response.embeddings.length}`);
console.log(`Dimensions: ${response.embeddings[0].length}`);
Why good: Batch multiple inputs in one call, named constant for model, returns array of number arrays
See: examples/embeddings-vision.md for semantic search, cosine similarity
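Embedding vectors are usually compared with cosine similarity. A small, dependency-free sketch follows. The function name cosineSimilarity is illustrative, and the inputs would be rows of response.embeddings.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
const cosineSimilarity = (
  a: readonly number[],
  b: readonly number[],
): number => {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat similarity as 0 in that case.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};
```

Values near 1 indicate similar direction, values near 0 indicate unrelated vectors.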
Pull, list, show, and delete models programmatically.
import ollama from "ollama";
// List available models
const models = await ollama.list();
for (const model of models.models) {
console.log(`${model.name} (${model.size} bytes)`);
}
// Pull a model with progress streaming
const stream = await ollama.pull({ model: "llama3.1", stream: true });
for await (const progress of stream) {
console.log(
`${progress.status}: ${progress.completed ?? 0}/${progress.total ?? 0}`,
);
}
// Show model details
const info = await ollama.show({ model: "llama3.1" });
console.log(`Parameters: ${info.details.parameter_size}`);
console.log(`Quantization: ${info.details.quantization_level}`);
// Delete a model
await ollama.delete({ model: "old-model" });
Why good: Streaming progress for large downloads, programmatic model lifecycle, detailed model metadata
See: examples/model-management.md for copy, create, running models (ps)
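Since a model may not be pulled yet, a check against the names returned by ollama.list() can run before the first request. The sketch below assumes that an untagged name resolves to the latest tag. The names isModelAvailable and withDefaultTag are hypothetical helpers.

```typescript
// Assumption: a model name without a tag refers to the "latest" tag.
const DEFAULT_TAG = "latest";

// Adds the default tag when the last path segment has none
// ("llama3.1" -> "llama3.1:latest").
const withDefaultTag = (model: string): string => {
  const lastSegment = model.slice(model.lastIndexOf("/") + 1);
  return lastSegment.includes(":") ? model : `${model}:${DEFAULT_TAG}`;
};

// Hypothetical helper: compares a requested model against installed names.
const isModelAvailable = (
  installedNames: readonly string[],
  requested: string,
): boolean => {
  const target = withDefaultTag(requested);
  return installedNames.some(
    (installed) => withDefaultTag(installed) === target,
  );
};
```

With the listing shown above, the installed names would be models.models.map((model) => model.name); when the check returns false, the caller can pull the model or report a clear error.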
Ollama exposes /v1/chat/completions and /v1/embeddings that work with the OpenAI SDK.
import OpenAI from "openai";
const OLLAMA_BASE_URL = "http://localhost:11434/v1";
const client = new OpenAI({
baseURL: OLLAMA_BASE_URL,
apiKey: "ollama", // Required by SDK but unused by Ollama
});
const completion = await client.chat.completions.create({
model: "llama3.1",
messages: [{ role: "user", content: "Why is the sky blue?" }],
});
console.log(completion.choices[0].message.content);
Why good: Drop-in replacement for OpenAI SDK code, named constant for URL, easy to switch between local and cloud
When to use: When migrating existing OpenAI SDK code to local models, or when you want to use OpenAI SDK tooling (structured outputs, streaming helpers) with local models
When not to use: For new Ollama-native code, prefer the ollama package directly -- it exposes Ollama-specific features (model management, keep_alive, thinking mode) that the OpenAI compat layer does not
See: reference.md for supported and unsupported OpenAI features
<decision_framework>
ollama.chat() vs ollama.generate()
Need multi-turn conversation history?
+-- YES -> ollama.chat() (messages array with roles)
+-- NO -> Is it a single prompt completion?
    +-- YES -> ollama.generate() (simpler API)
    +-- NO -> ollama.chat() (default choice for most use cases)
Do you have existing OpenAI SDK code to migrate?
+-- YES -> Use OpenAI SDK with baseURL: 'http://localhost:11434/v1'
+-- NO -> Do you need Ollama-specific features?
    +-- YES -> Use ollama package (model management, keep_alive, thinking, abort)
    +-- NO -> Either works, prefer ollama package for new code
Do you need structured data from the model?
+-- YES -> Use format parameter with zodToJsonSchema()
| +-- Always include "respond in JSON" in the prompt
| +-- Always parse and validate with Zod after receiving response
+-- NO -> Omit format parameter (plain text response)
What is your task?
+-- General chat / coding -> llama3.1 (8B for speed, 70B for quality)
+-- Fast + small -> phi4-mini, gemma3 (smaller memory footprint)
+-- Code generation -> qwen2.5-coder, deepseek-coder-v2
+-- Vision/multimodal -> llama3.2-vision, gemma3
+-- Embeddings -> nomic-embed-text, all-minilm
+-- Tool calling -> llama3.1, qwen3, mistral
+-- Reasoning/thinking -> qwen3 (with think: true), deepseek-r1
</decision_framework>
<red_flags>
High Priority Issues:
- Not planning for model loading delays -- use keep_alive to keep models warm
- Using format: 'json' instead of a JSON schema -- only ensures valid JSON syntax, not structure; always use zodToJsonSchema() for reliable extraction
- Not accumulating thinking, content, and tool_calls from all chunks to maintain history

Medium Priority Issues:
- Not setting keep_alive for latency-sensitive applications -- models unload after 5 minutes of inactivity by default; set keep_alive: '30m' or keep_alive: -1 (indefinite) for persistent sessions
- Creating new Ollama() instances per request instead of reusing a singleton client
- Not handling AbortError when using ollama.abort() -- any loop consuming a stream throws when that stream is cancelled

Common Mistakes:
- Confusing ollama.chat() and ollama.generate() parameters -- chat uses messages[], generate uses prompt and system
- Using ollama.embeddings() (deprecated) instead of ollama.embed() -- the newer embed() method supports batch inputs
- Passing remote URLs in the images parameter -- Ollama accepts file paths, Uint8Array, or base64-encoded strings, but not remote URLs
- Forgetting to JSON.parse() the response content when using structured output -- Ollama returns JSON as a string in message.content, not a parsed object

Gotchas & Edge Cases:
- toolCall.function.arguments is an object, not a string
- The keep_alive parameter accepts both duration strings ('5m', '1h') and numbers (seconds) -- 0 unloads immediately, -1 keeps loaded indefinitely
- ollama.abort() cancels ALL active streams for that client instance -- for individual stream cancellation, use stream.abort() on the returned stream object
- Model names include tags (llama3.1:8b, llama3.1:70b) -- omitting the tag uses the default (usually smallest)
- The think parameter enables extended reasoning but only works with models that support it (Qwen 3, DeepSeek R1) -- it adds a thinking field to the response alongside content
- In browsers, import from ollama/browser instead of ollama -- the default import uses Node.js-specific APIs
- The cloud API requires an Authorization: Bearer header and setting host: 'https://ollama.com'
- Responses include total_duration, eval_count, and eval_duration (durations are in nanoseconds) -- calculate tokens/second with eval_count / eval_duration * 1e9
</red_flags>
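The tokens-per-second calculation mentioned above can be wrapped in a small helper. This is a sketch: GenerationMetrics is a local stand-in for the two response fields involved, and tokensPerSecond is a hypothetical name. It divides by the duration in seconds, which is algebraically the same as eval_count / eval_duration * 1e9.

```typescript
const NANOSECONDS_PER_SECOND = 1e9;

// Local stand-in for the two response fields used here.
interface GenerationMetrics {
  eval_count: number; // number of generated tokens
  eval_duration: number; // generation time in nanoseconds
}

// Returns generated tokens per second, or 0 when no duration was reported.
const tokensPerSecond = ({
  eval_count,
  eval_duration,
}: GenerationMetrics): number =>
  eval_duration > 0
    ? eval_count / (eval_duration / NANOSECONDS_PER_SECOND)
    : 0;

// 120 tokens generated in 4 seconds (4e9 ns).
console.log(tokensPerSecond({ eval_count: 120, eval_duration: 4e9 })); // 30
```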
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST use ollama.chat() for conversations and ollama.generate() for single-prompt completions -- they have different parameter shapes)
(You MUST handle model loading delays -- the first request to a model that is not yet in memory takes significantly longer because the model has to be loaded first)
(You MUST use zodToJsonSchema() from zod-to-json-schema for structured outputs -- do NOT manually construct JSON schemas)
(You MUST accumulate streamed thinking, content, and tool_calls fields to maintain conversation history in multi-turn interactions)
(You MUST never assume a model is already pulled -- check with ollama.list() or handle errors from missing models gracefully)
Failure to follow these rules will produce unreliable, poorly-typed, or broken local LLM integrations.
</critical_reminders>