extraction-pipeline-patterns
Document extraction pipeline architecture and patterns
| description | Document extraction pipeline architecture and patterns |
| name | extraction-pipeline-patterns |
| priority | critical |
Xberg's format detection -> extraction -> fallback orchestration for 75+ file formats
The extraction pipeline (crates/xberg/src/core/pipeline.rs, crates/xberg/src/extraction/) orchestrates format detection, extraction, fallback handling, and post-processing (core/pipeline.rs).
Location: crates/xberg/src/core/mime.rs, crates/xberg/src/core/formats.rs
Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for same format -> choose highest confidence/specificity.
```rust
// Pseudocode: core/mime.rs
match (magic_bytes(content), extension) {
    (Some(fmt), Some(ext)) if aligned -> Ok(fmt),
    (Some(fmt), Some(ext)) if misaligned -> Err(FormatMismatch),
    (Some(fmt), None) -> Ok(fmt), // magic bytes only
    (None, Some(ext)) -> Ok(from_extension(ext)),
    _ -> Err(UnknownFormat),
}
```
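The detection rule above can be sketched in runnable form. This is a minimal illustration, not the contents of core/mime.rs: the `Format` and `DetectError` types, the three byte signatures, and the strict equality test for alignment are all assumptions. A real alignment check is presumably looser, since container formats such as DOCX carry ZIP magic bytes.

```rust
// Sketch only: `Format`, `DetectError`, `magic_bytes`, and `from_extension`
// are illustrative stand-ins, not the definitions in core/mime.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Png,
    Zip,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DetectError {
    /// Content and extension disagree (possible spoofing).
    FormatMismatch { detected: Format, claimed: Format },
    UnknownFormat,
}

/// Identify a format from well-known leading byte signatures.
fn magic_bytes(content: &[u8]) -> Option<Format> {
    if content.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if content.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(Format::Png)
    } else if content.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else {
        None
    }
}

/// Map a file extension (without the dot) to a format, case-insensitively.
fn from_extension(ext: &str) -> Option<Format> {
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(Format::Pdf),
        "png" => Some(Format::Png),
        "zip" => Some(Format::Zip),
        _ => None,
    }
}

/// Combine both signals: agreement wins, disagreement is rejected,
/// and a single available signal is trusted on its own.
pub fn detect_format(content: &[u8], extension: Option<&str>) -> Result<Format, DetectError> {
    let claimed = extension.and_then(from_extension);
    match (magic_bytes(content), claimed) {
        (Some(detected), Some(claimed)) if detected == claimed => Ok(detected),
        (Some(detected), Some(claimed)) => Err(DetectError::FormatMismatch { detected, claimed }),
        (Some(detected), None) => Ok(detected), // magic bytes only
        (None, Some(claimed)) => Ok(claimed),   // extension only
        (None, None) => Err(DetectError::UnknownFormat),
    }
}
```

Note one simplification: this sketch treats an unrecognized extension the same as a missing one and falls back to the magic bytes, which may or may not match how Xberg handles that case.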
| Category | Extractors | Key Modules |
|---|---|---|
| Office | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS | extraction/{docx,excel,pptx}.rs |
| PDF | Standard + encrypted, password attempts | pdf/ subdirectory (13 files) |
| Images | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled) | extraction/image.rs + ocr/ |
| Web | HTML, XHTML, XML, SVG (DOM parsing) | extraction/html.rs (67KB - complex table handling) |
| Email | EML, MSG (headers, body, attachments, threading) | extraction/email.rs |
| Archives | ZIP, TAR, GZ, 7Z (recursive extraction) | extraction/archive.rs (31KB) |
| Markdown | MD, TXT, RST, Org Mode, RTF | extraction/markdown.rs |
| Academic | LaTeX, BibTeX, JATS, Jupyter, DocBook | extraction/{structured,xml}.rs |
```rust
// Pseudocode: extraction/mod.rs
let format = detect_format(source.bytes, source.extension);
let result = match format {
    Pdf -> extract_pdf(source, config),
    Docx -> extract_docx(source, config),
    Image -> extract_image_with_ocr_fallback(source, config),
    Archive -> extract_archive_recursive(source, config),
    _ -> extract_with_plugin(format, source, config),
};
run_pipeline(result, config) // post-processing always runs
```
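The dispatch shape above, one match on the detected format followed by a shared post-processing step, might look like this in compilable form. Every type and function body here is a hypothetical stand-in (the "extractors" simply decode UTF-8), and the sketch assumes that a failed extraction skips post-processing, which the pseudocode does not specify.

```rust
// Sketch only: every type and function body here is a hypothetical stand-in
// for the real items in crates/xberg/src/extraction/.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Docx,
    Image,
    Other,
}

pub struct Source<'a> {
    pub bytes: &'a [u8],
    pub format: Format,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Extraction {
    pub text: String,
    pub extractor: &'static str,
    pub post_processed: bool,
}

/// Placeholder extractor: decodes the bytes as UTF-8 and tags the result.
fn extract_with(extractor: &'static str, source: &Source<'_>) -> Result<Extraction, String> {
    let text = String::from_utf8(source.bytes.to_vec()).map_err(|e| e.to_string())?;
    Ok(Extraction { text, extractor, post_processed: false })
}

/// Post-processing stage: here it only trims whitespace and records that it ran.
fn run_pipeline(mut extraction: Extraction) -> Extraction {
    extraction.text = extraction.text.trim().to_string();
    extraction.post_processed = true;
    extraction
}

pub fn extract(source: &Source<'_>) -> Result<Extraction, String> {
    let raw = match source.format {
        Format::Pdf => extract_with("pdf", source),
        Format::Docx => extract_with("docx", source),
        Format::Image => extract_with("image+ocr", source),
        // Anything without a built-in extractor is handed to the plugin path.
        Format::Other => extract_with("plugin", source),
    }?;
    // Every successful extraction goes through the same post-processing stage.
    Ok(run_pipeline(raw))
}
```

Keeping `run_pipeline` outside the match means a newly added format cannot accidentally bypass post-processing.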
If password attempts on an encrypted document fail, is_encrypted=true is set in the metadata.
Location: crates/xberg/src/core/config.rs, crates/xberg/src/core/config_validation.rs
ExtractionConfig holds format-specific configs (pdf, image, html, office), fallback orchestration (fallback), and post-processing (postprocessor, chunking, keywords). See struct definition in config.rs.
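As a rough picture of how such a config could be laid out, the sketch below groups optional per-area settings under one struct and validates them up front. Only the names `ExtractionConfig`, `pdf`, and `chunking` come from the description above; the nested types, their fields, and the `validate` method are invented for illustration and should be checked against config.rs and config_validation.rs.

```rust
// Sketch only: the nested types and their fields are invented for illustration.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PdfConfig {
    /// Passwords to try on encrypted PDFs (hypothetical field).
    pub passwords: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChunkingConfig {
    pub max_chars: usize,
    pub overlap: usize,
}

#[derive(Debug, Clone, Default, PartialEq)]
pub struct ExtractionConfig {
    pub pdf: Option<PdfConfig>,
    pub chunking: Option<ChunkingConfig>,
    // image, html, office, fallback, postprocessor, and keywords are omitted here.
}

impl ExtractionConfig {
    /// Reject settings that cannot work, before any extraction starts.
    pub fn validate(&self) -> Result<(), String> {
        if let Some(c) = &self.chunking {
            if c.max_chars == 0 {
                return Err("chunking.max_chars must be greater than 0".into());
            }
            if c.overlap >= c.max_chars {
                return Err("chunking.overlap must be smaller than max_chars".into());
            }
        }
        Ok(())
    }
}
```

Using `Option` for each group lets an absent section mean "use defaults or skip this stage" without a separate enabled flag.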
Location: crates/xberg/src/plugins/
Plugin registry loaded at startup, cached for zero-cost lookup.
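One way to get a registry that is built once and then only read is a `std::sync::OnceLock` around a lookup table. The sketch below assumes a hypothetical `Extractor` trait keyed by MIME type; the actual trait and registration API in crates/xberg/src/plugins/ may look quite different. It also initializes lazily on first use rather than at startup.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical plugin trait; the real trait in crates/xberg/src/plugins/ may differ.
pub trait Extractor: Send + Sync {
    fn name(&self) -> &'static str;
    fn mime_types(&self) -> &'static [&'static str];
    fn extract(&self, bytes: &[u8]) -> Result<String, String>;
}

/// Example plugin used only to populate the sketch registry.
struct PlainText;

impl Extractor for PlainText {
    fn name(&self) -> &'static str {
        "plain-text"
    }
    fn mime_types(&self) -> &'static [&'static str] {
        &["text/plain", "text/markdown"]
    }
    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
    }
}

pub struct Registry {
    plugins: Vec<Box<dyn Extractor>>,
    /// MIME type -> index into `plugins`, so one plugin can serve several types.
    by_mime: HashMap<&'static str, usize>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { plugins: Vec::new(), by_mime: HashMap::new() }
    }

    pub fn register(&mut self, plugin: Box<dyn Extractor>) {
        let index = self.plugins.len();
        for mime in plugin.mime_types() {
            self.by_mime.insert(*mime, index);
        }
        self.plugins.push(plugin);
    }

    pub fn lookup(&self, mime: &str) -> Option<&dyn Extractor> {
        let index = *self.by_mime.get(mime)?;
        Some(&*self.plugins[index])
    }
}

/// Built once on first use, then shared read-only for the life of the process.
pub fn registry() -> &'static Registry {
    static REGISTRY: OnceLock<Registry> = OnceLock::new();
    REGISTRY.get_or_init(|| {
        let mut r = Registry::new();
        r.register(Box::new(PlainText));
        r
    })
}
```

After initialization, each lookup is a hash probe plus a vector index with no locking on the read path; it is cheap, though not literally free.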
Location: Cargo.toml (workspace), crates/xberg/Cargo.toml, FEATURE_MATRIX.md
20+ features across 9 language bindings. Key feature groups:
| Group | Features | Notes |
|---|---|---|
| OCR | tesseract (default), tesseract-static, ocr-minimal | Mutually exclusive recommendation |
| Formats | pdf, pdf-minimal, office, office-minimal | |
| AI/ML | embeddings (requires ONNX), keywords-yake, keywords-rake, language-detection | |
| Server | api (Axum), mcp, tokio-runtime, lite-runtime | |
| Bindings | python-bindings, ruby-bindings, php-bindings, node-bindings, wasm | |
Conditional compilation: modules gated with #[cfg(feature = "...")]. Runtime validate_config() warns if requested feature not compiled in.
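A runtime check of this kind can be built on the `cfg!` macro, which evaluates to a compile-time `true` or `false`. The sketch below is an assumption about the general approach, not the actual `validate_config()` in config_validation.rs: the `Capability` enum, the mapping to feature names, and the return type are hypothetical.

```rust
/// Capabilities a caller may request at runtime (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    Ocr,
    Embeddings,
    Pdf,
}

/// Whether the matching Cargo feature was enabled when this crate was built.
pub fn is_compiled_in(cap: Capability) -> bool {
    match cap {
        Capability::Ocr => cfg!(feature = "tesseract"),
        Capability::Embeddings => cfg!(feature = "embeddings"),
        Capability::Pdf => cfg!(feature = "pdf"),
    }
}

/// Collect one warning per requested capability that is missing from the build.
pub fn validate_config(requested: &[Capability]) -> Vec<String> {
    requested
        .iter()
        .copied()
        .filter(|c| !is_compiled_in(*c))
        .map(|c| format!("{c:?} requested but not compiled in"))
        .collect()
}
```

Returning warnings instead of an error matches the "warns" wording above and leaves it to the caller to decide whether a missing feature is fatal.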
ocr-minimal + tesseract should error at compile time.
Post-processing goes through run_pipeline() for validators/hooks.
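The compile-time rule for conflicting OCR features can be expressed with `compile_error!` behind a `cfg(all(...))` guard. This is a sketch of the standard Rust idiom, not a confirmed excerpt from the crate; the `ocr_backend` helper and the placement in the crate root are assumptions.

```rust
// Assumed placement: near the top of the crate root (for example lib.rs).
// The build fails with this message only when both features are enabled.
#[cfg(all(feature = "ocr-minimal", feature = "tesseract"))]
compile_error!("features `ocr-minimal` and `tesseract` are mutually exclusive; enable only one");

/// Name of the OCR backend selected at build time, if any (hypothetical helper).
pub fn ocr_backend() -> Option<&'static str> {
    if cfg!(feature = "tesseract") {
        Some("tesseract")
    } else if cfg!(feature = "ocr-minimal") {
        Some("ocr-minimal")
    } else {
        None
    }
}
```

Because Cargo features are additive across a dependency graph, a guard like this surfaces the conflict at build time instead of letting one backend silently win.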