Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
REST API server and MCP protocol integration
Chunking, embeddings, and RAG pipeline integration
Document extraction pipeline architecture and patterns
Plugin architecture, registration, and trait patterns
| name | format-specific-extraction |
| description | Format-specific document extraction workflows |
| priority | high |
Office documents (DOCX, PPTX, ODT): ZIP archive → Security validation → XML parsing → Text + tables + metadata
- Validate first: `ZipBombValidator::new(limits).validate(&mut archive)?`
- Main XML parts: `word/document.xml`, `ppt/slides/*.xml`, `content.xml`
- Parse: quick-xml `Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
- Metadata: `crate::extraction::office_metadata::extract_metadata()`
- See: extractors/docx.rs, extractors/pptx.rs, extractors/odt.rs

PDF: Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
- Load: `pdf_oxide::PdfDocument::from_bytes(content)?`
- OCR fallback when `config.force_ocr || !has_searchable_text()`
- Per-page text when `config.pages` is enabled
- Feature-gated: `#[cfg(feature = "pdf")]`
- See: extractors/pdf/mod.rs

Archives: Validate → Extract metadata → Extract plaintext files only
- Run `ZipBombValidator` BEFORE any extraction
- Build results with the `build_archive_result()` helper
- See: extractors/archive.rs, extraction/archive/*.rs

Structured data: Detect format from MIME → Parse → Pretty-print → Metadata
A single `StructuredExtractor` handles multiple MIME types: it parses the input with a format-specific library, then pretty-prints the result to text.
See: extractors/structured.rs
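
The "detect format from MIME" step can be sketched as a small dispatch function. This is an illustration only: `StructuredFormat`, `detect_format`, and the specific MIME strings below are assumptions made for the example, not names or values taken from `extractors/structured.rs`.

```rust
/// Hypothetical format tag for the sketch; not a Kreuzberg type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuredFormat {
    Json,
    Yaml,
    Toml,
}

/// Maps a MIME type to a format tag so that one extractor can serve
/// several MIME types. The MIME strings listed here are assumptions.
pub fn detect_format(mime: &str) -> Option<StructuredFormat> {
    // Drop parameters such as "; charset=utf-8" and normalise case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match essence.as_str() {
        "application/json" | "text/json" => Some(StructuredFormat::Json),
        "application/yaml" | "application/x-yaml" | "text/yaml" => Some(StructuredFormat::Yaml),
        "application/toml" | "text/toml" => Some(StructuredFormat::Toml),
        // Unknown types are left for other extractors.
        _ => None,
    }
}
```

After dispatch, each variant would hand the bytes to its own parser and pretty-printer; that part depends on the libraries the real extractor uses and is left out here.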
Email: Parse headers → Extract body (text/html) → Process attachments
See: extraction/email.rs, extractors/email.rs
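
As a rough illustration of the first two stages (headers, then body), the sketch below splits a raw message at the first blank line and unfolds continuation lines. `split_message` is a hypothetical helper written for this example; it does not handle MIME multipart bodies, encoded words, or attachments, and it is not how `extraction/email.rs` is implemented.

```rust
/// Splits a raw message into (headers, body).
/// Hypothetical helper for illustration; standard library only.
pub fn split_message(raw: &str) -> (Vec<(String, String)>, &str) {
    // Headers end at the first empty line; accept CRLF or bare LF.
    let (head, body) = match raw.find("\r\n\r\n") {
        Some(i) => (&raw[..i], &raw[i + 4..]),
        None => match raw.find("\n\n") {
            Some(i) => (&raw[..i], &raw[i + 2..]),
            None => (raw, ""),
        },
    };

    let mut headers: Vec<(String, String)> = Vec::new();
    for line in head.lines() {
        if line.starts_with(' ') || line.starts_with('\t') {
            // Folded continuation line: append to the previous header value.
            if let Some(last) = headers.last_mut() {
                last.1.push(' ');
                last.1.push_str(line.trim());
            }
        } else if let Some((name, value)) = line.split_once(':') {
            headers.push((name.trim().to_string(), value.trim().to_string()));
        }
        // Lines that are neither "Name: value" nor continuations are skipped.
    }
    (headers, body)
}
```

Calling `split_message(raw)` on a message returns the header pairs in order plus a borrowed slice of the body, which a later stage could inspect for text, HTML, and attachment parts.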
| Helper | Location | Purpose |
|---|---|---|
| `office_metadata::extract_metadata()` | extraction/office.rs | Office XML metadata |
| `cells_to_markdown()` | extraction/mod.rs | Convert cell grid to GFM table |
| `build_archive_result()` | extraction/archive/mod.rs | Standard archive result |
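
To make the "cell grid to GFM table" conversion concrete, here is a minimal sketch. The function name `grid_to_gfm`, its signature, and its escaping rules are assumptions for illustration; the actual `cells_to_markdown()` may take different arguments and format cells differently.

```rust
/// Renders a grid of cells as a GitHub Flavored Markdown table,
/// treating the first row as the header. Hypothetical helper.
pub fn grid_to_gfm(cells: &[Vec<String>]) -> String {
    let Some(header) = cells.first() else {
        return String::new();
    };
    // Use the widest row so that ragged rows are padded with empty cells.
    let cols = cells.iter().map(|r| r.len()).max().unwrap_or(0);
    if cols == 0 {
        return String::new();
    }

    let render = |row: &[String]| -> String {
        let mut line = String::from("|");
        for i in 0..cols {
            let raw = row.get(i).map(String::as_str).unwrap_or("");
            // Escape pipes and flatten newlines so a cell stays on one row.
            let cell = raw.replace('|', "\\|").replace('\n', " ");
            line.push(' ');
            line.push_str(cell.trim());
            line.push_str(" |");
        }
        line
    };

    let mut out = render(header.as_slice());
    out.push('\n');
    // Delimiter row required by GFM between header and body.
    out.push('|');
    for _ in 0..cols {
        out.push_str(" --- |");
    }
    for row in &cells[1..] {
        out.push('\n');
        out.push_str(&render(row.as_slice()));
    }
    out
}
```

Padding short rows keeps every line at the same column count, which GFM renderers need in order to treat the block as one table.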
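Adding a new format:
- Add the extension to `EXT_TO_MIME` in core/mime.rs
- Implement the `DocumentExtractor` trait
- Provide `supported_mime_types()` and `priority()` (default: 50)
- Register in extractors/mod.rs → `register_default_extractors()`
- Feature-gate with `#[cfg(feature = "my-format")]`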
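
The trait-plus-registry pattern above can be sketched with stand-in types. Everything in this block is hypothetical: `ExtractorSketch` only mirrors the method names mentioned above (its signatures are guesses), `Registry` is not Kreuzberg's registry, and the rule that the highest priority wins is an assumption made for the example.

```rust
/// Stand-in for an extractor trait; signatures are assumed, not copied.
pub trait ExtractorSketch {
    fn name(&self) -> &'static str;
    fn supported_mime_types(&self) -> &'static [&'static str];
    /// Default priority, matching the "default: 50" noted above.
    fn priority(&self) -> i32 {
        50
    }
}

// In a real crate this type would sit behind
// #[cfg(feature = "my-format")]; the attribute is omitted here so the
// sketch compiles without a Cargo feature.
pub struct MyFormatExtractor;

impl ExtractorSketch for MyFormatExtractor {
    fn name(&self) -> &'static str {
        "my-format"
    }
    fn supported_mime_types(&self) -> &'static [&'static str] {
        &["application/x-my-format"]
    }
    // priority() is inherited, so this extractor reports 50.
}

/// A lower-priority extractor claiming the same MIME type.
pub struct FallbackExtractor;

impl ExtractorSketch for FallbackExtractor {
    fn name(&self) -> &'static str {
        "fallback"
    }
    fn supported_mime_types(&self) -> &'static [&'static str] {
        &["application/x-my-format"]
    }
    fn priority(&self) -> i32 {
        10
    }
}

/// Minimal registry: stores extractors and selects one per MIME type.
#[derive(Default)]
pub struct Registry {
    extractors: Vec<Box<dyn ExtractorSketch>>,
}

impl Registry {
    pub fn register(&mut self, extractor: Box<dyn ExtractorSketch>) {
        self.extractors.push(extractor);
    }

    /// Returns the name of the highest-priority extractor for `mime`.
    /// "Highest priority wins" is an assumption of this sketch.
    pub fn select(&self, mime: &str) -> Option<&'static str> {
        self.extractors
            .iter()
            .filter(|e| e.supported_mime_types().iter().any(|m| *m == mime))
            .max_by_key(|e| e.priority())
            .map(|e| e.name())
    }
}
```

Registering both extractors and calling `select("application/x-my-format")` picks `MyFormatExtractor`, because its inherited priority of 50 outranks the fallback's 10.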