| description | REST API server and MCP protocol integration |
| name | api-server-mcp |
| priority | critical |
Axum server design for document extraction endpoints, middleware, async processing, and Model Context Protocol integration for AI agents
Location: crates/kreuzberg/src/api/, crates/kreuzberg-cli/
Kreuzberg provides a dual REST API + MCP server built with Axum + Tokio.
Request Flow:
HTTP Client / AI Agent (Claude)
|
[Transport Layer]
├── REST API (Axum HTTP)
└── MCP Protocol (HTTP or Stdio)
|
[Middleware Layer]
├── CORS, Request Logging (TraceLayer)
├── Request/Response size limits
└── Rate limiting (optional)
|
[Router]
├── REST Endpoints
│   ├── POST /extract - File upload extraction
│   ├── POST /extract-url - URL-based extraction
│   ├── GET /formats - List supported formats
│   ├── GET /health - Server health check
│   ├── POST /batch - Batch document processing
│   ├── GET /cache/stats - Cache statistics
│   └── DELETE /cache - Clear extraction cache
└── MCP Endpoints
    ├── POST /mcp/tools - List available tools
    ├── POST /mcp/tools/call - Call a tool
    ├── GET /mcp/resources - List resources
    ├── GET /mcp/resources/:uri - Read resource
    ├── GET /mcp/prompts - List prompts
    └── GET /mcp/prompts/:name - Get prompt
|
[Handler / Tool Layer]
├── extract_handler / extract_file tool
├── batch_handler / batch_extract tool
├── health_handler / get_capabilities tool
└── format_handler
|
[Extraction Core]
├── Format detection
├── Extraction pipeline
├── Post-processing (chunking, embeddings)
└── Result formatting
|
JSON Response / MCP ToolResult
Location: crates/kreuzberg/src/api/server.rs
Server initialization pattern: Create ApiState (holds ExtractionConfig + ExtractionCache), build Axum Router with all REST + MCP routes, apply middleware layers (body limits, CORS, tracing), serve via tokio::net::TcpListener.
Key middleware layers applied in order:
1. DefaultBodyLimit::max(100MB) + RequestBodyLimitLayer -- configurable via env vars
2. CorsLayer::permissive() -- restrict in production via CORS_ALLOWED_ORIGINS
3. TraceLayer::new_for_http() -- request/response logging

Location: crates/kreuzberg/src/api/handlers.rs
| Handler | Route | Description |
|---|---|---|
| extract_handler | POST /extract | Multipart upload: parse file + optional config JSON, check cache, call extract_bytes(), cache result |
| extract_url_handler | POST /extract-url | Fetch URL via reqwest, extract bytes |
| batch_handler | POST /batch | Parallel extraction with Semaphore-limited concurrency (default: CPU count) |
| health_handler | GET /health | Report status, version, uptime, feature availability (OCR, embeddings), cache stats |
| formats_handler | GET /formats | Return supported format categories (office, pdf, images, web, email, archives, academic) |
| cache_stats_handler | GET /cache/stats | Hit/miss counts and hit rate |
| cache_clear_handler | DELETE /cache | Clear LRU cache |
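
The semaphore-limited concurrency used by batch_handler can be illustrated with a std-only sketch. This is not the Kreuzberg implementation, which runs on Tokio: the `Semaphore` type, `batch_extract`, and `default_concurrency` below are hypothetical names, and OS threads stand in for async tasks. The idea shown is the same: every document gets its own task, but a task must hold a permit before it calls the extractor.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from std primitives (hypothetical helper).
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiting thread.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Default limit: the number of available CPUs, falling back to 1.
pub fn default_concurrency() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Run `extract` over every document with at most `limit` calls in flight.
/// Returns the results in input order plus the peak concurrency observed.
pub fn batch_extract<F>(
    docs: &[Vec<u8>],
    limit: usize,
    extract: F,
) -> (Vec<Result<String, String>>, usize)
where
    F: Fn(&[u8]) -> Result<String, String> + Sync,
{
    let semaphore = Semaphore::new(limit.max(1));
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    let results = thread::scope(|scope| {
        let mut handles = Vec::with_capacity(docs.len());
        for doc in docs {
            let (semaphore, in_flight, peak, extract) =
                (&semaphore, &in_flight, &peak, &extract);
            handles.push(scope.spawn(move || {
                semaphore.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                let result = extract(doc.as_slice());
                in_flight.fetch_sub(1, Ordering::SeqCst);
                semaphore.release();
                result
            }));
        }
        // Joining in spawn order keeps results aligned with the input.
        handles.into_iter().map(|h| h.join().unwrap()).collect::<Vec<_>>()
    });

    (results, peak.load(Ordering::SeqCst))
}
```

Calling `batch_extract(&docs, default_concurrency(), extractor)` mirrors the documented default of one permit per CPU. The peak counter exists only to make the limit observable in this sketch.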
Location: crates/kreuzberg/src/cache/mod.rs
LRU cache keyed by SHA256(file_content), stores Arc<ExtractionResult>. Default 1000 entries. Thread-safe via RwLock. Tracks hit/miss counters with AtomicU64 for stats endpoint.
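
A minimal std-only sketch of the cache design described above, assuming the caller has already computed the SHA-256 digest of the file content and passes it as a hex string (std has no SHA-256, so hashing is left out). The `ExtractionResult` fields, method names, and the `VecDeque` recency list are illustrative assumptions, not the actual contents of cache/mod.rs.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Stand-in for the real result type (hypothetical field).
#[derive(Debug, PartialEq)]
pub struct ExtractionResult {
    pub content: String,
}

struct Inner {
    entries: HashMap<String, Arc<ExtractionResult>>,
    /// Keys ordered from least to most recently used.
    order: VecDeque<String>,
}

pub struct ExtractionCache {
    capacity: usize,
    inner: RwLock<Inner>,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl ExtractionCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1),
            inner: RwLock::new(Inner { entries: HashMap::new(), order: VecDeque::new() }),
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Look up a result by content digest and record a hit or a miss.
    pub fn get(&self, key: &str) -> Option<Arc<ExtractionResult>> {
        // A write lock is needed even for reads because a hit updates recency.
        let mut inner = self.inner.write().unwrap();
        let found = inner.entries.get(key).cloned();
        match found {
            Some(result) => {
                inner.order.retain(|k| k.as_str() != key);
                inner.order.push_back(key.to_string());
                self.hits.fetch_add(1, Ordering::Relaxed);
                Some(result)
            }
            None => {
                self.misses.fetch_add(1, Ordering::Relaxed);
                None
            }
        }
    }

    /// Insert a result, evicting the least recently used entry when full.
    pub fn insert(&self, key: String, result: Arc<ExtractionResult>) {
        let mut inner = self.inner.write().unwrap();
        if inner.entries.insert(key.clone(), result).is_some() {
            // Existing key: drop its old position in the recency list.
            inner.order.retain(|k| k != &key);
        } else if inner.entries.len() > self.capacity {
            if let Some(oldest) = inner.order.pop_front() {
                inner.entries.remove(&oldest);
            }
        }
        inner.order.push_back(key);
    }

    /// Returns (hits, misses, hit rate) for a stats endpoint.
    pub fn stats(&self) -> (u64, u64, f64) {
        let hits = self.hits.load(Ordering::Relaxed);
        let misses = self.misses.load(Ordering::Relaxed);
        let total = hits + misses;
        let rate = if total == 0 { 0.0 } else { hits as f64 / total as f64 };
        (hits, misses, rate)
    }
}
```

Storing `Arc<ExtractionResult>` means a cache hit hands out a cheap pointer clone rather than copying the extracted document. The linear `retain` keeps the sketch short; a production LRU would typically use a linked structure or a dedicated crate.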
Location: crates/kreuzberg/src/api/error.rs
ApiError enum maps to HTTP status codes:
- MissingFile -> 400, FileNotFound -> 404
- OnnxRuntimeMissing / TesseractMissing -> 503 (with remediation message)
- PayloadTooLarge -> 413
- ExtractionFailed / InvalidConfig / UnsupportedFormat -> 500

Location: crates/kreuzberg/src/mcp/server.rs
The MCP server allows Claude and other AI agents to call Kreuzberg extraction functions through the Model Context Protocol.
Three tools are registered:
| Tool | Purpose | Required Params |
|---|---|---|
| extract_file | Extract text/tables/metadata from documents (75+ formats) | file_path |
| batch_extract | Extract from multiple documents in parallel | file_paths[] |
| get_capabilities | List supported formats, features, backends | (none) |
Tool registration pattern (example: extract_file):
// Define Tool with name, description, JSON Schema inputSchema
// Register with server.register_tool(tool, handler_fn)
// Handler: parse params -> build ExtractionConfig -> call extract_file() -> return ToolResult as JSON
extract_file optional params: format, extract_tables, extract_images, ocr_enabled, extract_metadata, chunking_preset, generate_embeddings.
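
The registration pattern outlined above (define a tool, register it with a handler, validate parameters, return a JSON result) might look roughly like the following std-only sketch. Everything here is hypothetical: `ToolRegistry`, `Tool`, `Params`, and `build_registry` are illustrative names, parameters are simplified to string pairs instead of JSON, and the handler returns a hand-built JSON string rather than calling a real extractor.

```rust
use std::collections::HashMap;

/// Tool parameters, simplified to string key/value pairs (the real server uses JSON).
pub type Params = HashMap<String, String>;
type Handler = Box<dyn Fn(&Params) -> Result<String, String> + Send + Sync>;

/// Tool metadata; `required` stands in for the JSON Schema inputSchema.
pub struct Tool {
    pub name: &'static str,
    pub description: &'static str,
    pub required: &'static [&'static str],
}

#[derive(Default)]
pub struct ToolRegistry {
    tools: HashMap<&'static str, (Tool, Handler)>,
}

impl ToolRegistry {
    /// Store the tool definition together with its handler.
    pub fn register_tool<F>(&mut self, tool: Tool, handler: F)
    where
        F: Fn(&Params) -> Result<String, String> + Send + Sync + 'static,
    {
        self.tools.insert(tool.name, (tool, Box::new(handler)));
    }

    /// Look up a tool, check its required parameters, then run the handler.
    pub fn call(&self, name: &str, params: &Params) -> Result<String, String> {
        let (tool, handler) = self
            .tools
            .get(name)
            .ok_or_else(|| format!("unknown tool: {name}"))?;
        for key in tool.required {
            if !params.contains_key(*key) {
                return Err(format!("missing required parameter: {key}"));
            }
        }
        handler(params)
    }
}

/// Registers a placeholder extract_file tool that echoes its configuration.
pub fn build_registry() -> ToolRegistry {
    let mut registry = ToolRegistry::default();
    registry.register_tool(
        Tool {
            name: "extract_file",
            description: "Extract text/tables/metadata from a document",
            required: &["file_path"],
        },
        |params| {
            let path = &params["file_path"];
            // Optional parameter with a default, as for ocr_enabled above.
            let ocr = params.get("ocr_enabled").map(|v| v == "true").unwrap_or(false);
            // No JSON escaping here; a real handler would serialize properly.
            Ok(format!("{{\"file_path\":\"{path}\",\"ocr_enabled\":{ocr}}}"))
        },
    );
    registry
}
```

Keeping validation in `call` rather than in each handler means every tool gets the same missing-parameter error without repeating the check.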
Three resources provide static information to agents:
- kreuzberg://formats -- Supported format list as JSON
- kreuzberg://features -- Cross-binding feature matrix (from FEATURE_MATRIX.md)
- kreuzberg://api-reference -- Generated API documentation

Two prompts guide agent extraction workflows:
- extract_for_rag -- Document type-specific RAG extraction guidance (research paper, contract, report). Recommends chunking preset and embedding config.
- batch_document_processing -- Optimal concurrency, grouping, and error handling for batch workflows.

Over HTTP, the MCP endpoints are served under the /mcp/ prefix. Example MCP client configuration (stdio transport):
{
"mcpServers": {
"kreuzberg": {
"command": "kreuzberg-mcp",
"env": {
"KREUZBERG_API_BASE": "http://localhost:8000",
"KREUZBERG_MCP_TRANSPORT": "stdio"
}
}
}
}
ToolError variants: FileNotFound, UnsupportedFormat, ExtractionFailed, OnnxRuntimeMissing, TesseractMissing, Timeout. Each maps to an MCP ToolResultError with descriptive code and message.
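
As a rough illustration of that mapping, the sketch below pairs each ToolError variant with a machine-readable code and a human-readable message. The variant payloads, the code strings, the message wording, and the `ToolResultError` fields are all assumptions made for this example; the source only names the variants.

```rust
use std::fmt;

/// Variants from the list above; the payloads are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum ToolError {
    FileNotFound(String),
    UnsupportedFormat(String),
    ExtractionFailed(String),
    OnnxRuntimeMissing,
    TesseractMissing,
    Timeout { seconds: u64 },
}

impl ToolError {
    /// Stable identifier an agent can branch on (hypothetical code strings).
    pub fn code(&self) -> &'static str {
        match self {
            ToolError::FileNotFound(_) => "file_not_found",
            ToolError::UnsupportedFormat(_) => "unsupported_format",
            ToolError::ExtractionFailed(_) => "extraction_failed",
            ToolError::OnnxRuntimeMissing => "onnx_runtime_missing",
            ToolError::TesseractMissing => "tesseract_missing",
            ToolError::Timeout { .. } => "timeout",
        }
    }
}

impl fmt::Display for ToolError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToolError::FileNotFound(path) => write!(f, "file not found: {path}"),
            ToolError::UnsupportedFormat(kind) => write!(f, "unsupported format: {kind}"),
            ToolError::ExtractionFailed(why) => write!(f, "extraction failed: {why}"),
            ToolError::OnnxRuntimeMissing => write!(f, "ONNX Runtime is not installed"),
            ToolError::TesseractMissing => write!(f, "Tesseract is not installed"),
            ToolError::Timeout { seconds } => write!(f, "timed out after {seconds}s"),
        }
    }
}

/// Simplified stand-in for the MCP error payload.
#[derive(Debug, PartialEq)]
pub struct ToolResultError {
    pub code: &'static str,
    pub message: String,
}

impl From<ToolError> for ToolResultError {
    fn from(err: ToolError) -> Self {
        Self { code: err.code(), message: err.to_string() }
    }
}
```

With a `From` impl in place, a handler can convert at the boundary with `.into()` or `?` and keep the typed error everywhere else.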
See .env.example for all configurable variables. Key categories:
- KREUZBERG_HOST, KREUZBERG_PORT
- KREUZBERG_MAX_REQUEST_BODY_BYTES (default 100MB), KREUZBERG_MAX_MULTIPART_FIELD_BYTES
- KREUZBERG_ENABLE_OCR, KREUZBERG_ENABLE_EMBEDDINGS, KREUZBERG_ENABLE_KEYWORDS
- KREUZBERG_CACHE_ENABLED, KREUZBERG_CACHE_SIZE
- CORS_ALLOWED_ORIGINS (comma-separated)
- KREUZBERG_MCP_HOST, KREUZBERG_MCP_PORT, KREUZBERG_MCP_TRANSPORT (stdio/http)
- RUST_LOG=kreuzberg=info,tower_http=debug
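
A small sketch of how a few of these variables could be read into a typed config, using only std. The `ServerConfig` struct and its field names are hypothetical. The defaults for host (127.0.0.1), port (8000, taken from the example base URL above), and cache enablement are assumptions, as is reading "100MB" as 100 * 1024 * 1024 bytes; only the variable names, the 100MB body limit, and the 1000-entry cache default come from the text. The lookup is passed in as a closure so the parsing can be exercised without touching the process environment.

```rust
#[derive(Debug, PartialEq)]
pub struct ServerConfig {
    pub host: String,
    pub port: u16,
    pub max_request_body_bytes: usize,
    pub cache_enabled: bool,
    pub cache_size: usize,
    pub cors_allowed_origins: Vec<String>,
}

/// Parse an optional raw value, falling back to `default` when unset.
fn parse_or<T: std::str::FromStr>(
    name: &str,
    raw: Option<String>,
    default: T,
) -> Result<T, String> {
    match raw {
        Some(value) => value
            .trim()
            .parse()
            .map_err(|_| format!("invalid value for {name}: {value}")),
        None => Ok(default),
    }
}

impl ServerConfig {
    /// Build a config from any variable lookup function.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        // CORS_ALLOWED_ORIGINS is comma-separated; empty entries are dropped.
        let cors_allowed_origins = lookup("CORS_ALLOWED_ORIGINS")
            .map(|raw| {
                raw.split(',')
                    .map(|origin| origin.trim().to_string())
                    .filter(|origin| !origin.is_empty())
                    .collect()
            })
            .unwrap_or_default();

        Ok(Self {
            // Assumed default host.
            host: lookup("KREUZBERG_HOST").unwrap_or_else(|| "127.0.0.1".to_string()),
            // Assumed default port.
            port: parse_or("KREUZBERG_PORT", lookup("KREUZBERG_PORT"), 8000)?,
            max_request_body_bytes: parse_or(
                "KREUZBERG_MAX_REQUEST_BODY_BYTES",
                lookup("KREUZBERG_MAX_REQUEST_BODY_BYTES"),
                100 * 1024 * 1024,
            )?,
            // Assumed default: cache on. Accepts "true" or "false".
            cache_enabled: parse_or(
                "KREUZBERG_CACHE_ENABLED",
                lookup("KREUZBERG_CACHE_ENABLED"),
                true,
            )?,
            cache_size: parse_or("KREUZBERG_CACHE_SIZE", lookup("KREUZBERG_CACHE_SIZE"), 1000)?,
            cors_allowed_origins,
        })
    }

    /// Convenience wrapper that reads the real process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|name| std::env::var(name).ok())
    }
}
```

An empty `cors_allowed_origins` list would correspond to the permissive default described earlier, while a non-empty list is what a production deployment would use to restrict origins.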