extraction-pipeline-patterns
Document extraction pipeline architecture and patterns
| description | Document extraction pipeline architecture and patterns |
| name | extraction-pipeline-patterns |
| priority | critical |
Xberg's format detection -> extraction -> fallback orchestration for 75+ file formats
The extraction pipeline (crates/xberg/src/core/pipeline.rs, crates/xberg/src/extraction/) orchestrates format detection, extraction with fallback, and post-processing (core/pipeline.rs).
Location: crates/xberg/src/core/mime.rs, crates/xberg/src/core/formats.rs
Pattern: detect the format from magic bytes, check that the file extension agrees with it (to prevent spoofing), then route to the matching extractor. When several extractors handle the same format, choose the one with the highest confidence/specificity.
```rust
// Pseudocode: core/mime.rs
// `aligned` stands for the extension-vs-magic-bytes check.
match (magic_bytes(content), extension) {
    (Some(fmt), Some(ext)) if aligned(fmt, ext) => Ok(fmt),
    (Some(_), Some(_)) => Err(FormatMismatch), // extension disagrees with content
    (Some(fmt), None) => Ok(fmt), // magic bytes only
    (None, Some(ext)) => Ok(from_extension(ext)),
    _ => Err(UnknownFormat),
}
```
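To make the detection pattern concrete, here is a minimal, self-contained sketch. It is not the code in core/mime.rs: the `Format` and `DetectError` enums, the three signatures, and the rule that "aligned" means "the extension maps to exactly the sniffed format" are all assumptions made for illustration.

```rust
/// Formats known to this sketch (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Png,
    Zip,
    Text,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DetectError {
    FormatMismatch,
    UnknownFormat,
}

/// Sniff a format from the leading bytes of the content.
fn magic_bytes(content: &[u8]) -> Option<Format> {
    if content.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if content.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(Format::Png)
    } else if content.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else {
        None
    }
}

/// Map a file extension (without the dot) to a format.
fn from_extension(ext: &str) -> Option<Format> {
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(Format::Pdf),
        "png" => Some(Format::Png),
        "zip" => Some(Format::Zip),
        "txt" => Some(Format::Text),
        _ => None,
    }
}

/// Combine both signals; a disagreement is treated as possible spoofing.
pub fn detect_format(content: &[u8], extension: Option<&str>) -> Result<Format, DetectError> {
    match (magic_bytes(content), extension) {
        (Some(fmt), Some(ext)) => {
            if from_extension(ext) == Some(fmt) {
                Ok(fmt)
            } else {
                Err(DetectError::FormatMismatch)
            }
        }
        (Some(fmt), None) => Ok(fmt), // magic bytes only
        (None, Some(ext)) => from_extension(ext).ok_or(DetectError::UnknownFormat),
        (None, None) => Err(DetectError::UnknownFormat),
    }
}
```

Strict equality is a simplification: container formats such as DOCX share the ZIP signature, so a real alignment check would need a table of extensions that are compatible with each signature.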
| Category | Extractors | Key Modules |
|---|---|---|
| Office | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS | extraction/{docx,excel,pptx}.rs |
| PDF | Standard + encrypted, password attempts | pdf/ subdirectory (13 files) |
| Images | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled) | extraction/image.rs + ocr/ |
| Web | HTML, XHTML, XML, SVG (DOM parsing) | extraction/html.rs (67KB - complex table handling) |
| Email | EML, MSG (headers, body, attachments, threading) | extraction/email.rs |
| Archives | ZIP, TAR, GZ, 7Z (recursive extraction) | extraction/archive.rs (31KB) |
| Markdown | MD, TXT, RST, Org Mode, RTF | extraction/markdown.rs |
| Academic | LaTeX, BibTeX, JATS, Jupyter, DocBook | extraction/{structured,xml}.rs |
```rust
// Pseudocode: extraction/mod.rs
let format = detect_format(source.bytes, source.extension);
let result = match format {
    Pdf => extract_pdf(source, config),
    Docx => extract_docx(source, config),
    Image => extract_image_with_ocr_fallback(source, config),
    Archive => extract_archive_recursive(source, config),
    _ => extract_with_plugin(format, source, config),
};
run_pipeline(result, config) // post-processing always runs
```
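The fallback-then-post-process flow can be sketched in isolation as follows. This is an illustrative assumption, not the crate's API: `Extracted`, `ExtractError`, `extract_text_layer`, and `extract_with_ocr` are hypothetical, the OCR step is a stub, and `run_pipeline` here only normalizes whitespace.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Extracted {
    pub text: String,
    pub used_fallback: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    Empty,
}

/// Hypothetical primary extractor: fails when no text layer is present.
fn extract_text_layer(bytes: &[u8]) -> Result<String, ExtractError> {
    let text = String::from_utf8_lossy(bytes).trim().to_string();
    if text.is_empty() {
        Err(ExtractError::Empty)
    } else {
        Ok(text)
    }
}

/// Hypothetical fallback standing in for an OCR pass (stubbed output).
fn extract_with_ocr(_bytes: &[u8]) -> Result<String, ExtractError> {
    Ok("[ocr output]".to_string())
}

/// Stand-in post-processing step: collapse runs of whitespace.
fn run_pipeline(mut result: Extracted) -> Extracted {
    result.text = result.text.split_whitespace().collect::<Vec<_>>().join(" ");
    result
}

/// Try the primary extractor, fall back if allowed, then post-process.
pub fn extract(bytes: &[u8], allow_fallback: bool) -> Result<Extracted, ExtractError> {
    let result = match extract_text_layer(bytes) {
        Ok(text) => Extracted { text, used_fallback: false },
        Err(_) if allow_fallback => Extracted {
            text: extract_with_ocr(bytes)?,
            used_fallback: true,
        },
        Err(err) => return Err(err),
    };
    // Every successful path goes through post-processing.
    Ok(run_pipeline(result))
}
```

Recording `used_fallback` on the result is a design choice of this sketch; it lets callers tell which path produced the text without inspecting logs.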
On failure: is_encrypted=true in metadata.
Location: crates/xberg/src/core/config.rs, crates/xberg/src/core/config_validation.rs
ExtractionConfig holds format-specific configs (pdf, image, html, office), fallback orchestration (fallback), and post-processing (postprocessor, chunking, keywords). See struct definition in config.rs.
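As a rough picture of how such a config and its validation might fit together, the sketch below models only three of the fields named above (`pdf`, `fallback`, `chunking`). The field types, the sub-struct fields (`passwords`, `max_chars`, `overlap`), and the function `validate_extraction_config` are assumptions; the real struct definition lives in config.rs.

```rust
/// Hypothetical PDF options; the field is illustrative only.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PdfConfig {
    pub passwords: Vec<String>,
}

/// Hypothetical chunking options.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkingConfig {
    pub max_chars: usize,
    pub overlap: usize,
}

/// Reduced stand-in for ExtractionConfig (subset of the fields listed above).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ExtractionConfig {
    pub pdf: Option<PdfConfig>,
    pub fallback: bool,
    pub chunking: Option<ChunkingConfig>,
}

/// Reject chunking settings that could never produce forward progress.
pub fn validate_extraction_config(config: &ExtractionConfig) -> Result<(), String> {
    if let Some(chunking) = &config.chunking {
        if chunking.max_chars == 0 {
            return Err("chunking.max_chars must be greater than zero".to_string());
        }
        if chunking.overlap >= chunking.max_chars {
            return Err(format!(
                "chunking.overlap ({}) must be smaller than chunking.max_chars ({})",
                chunking.overlap, chunking.max_chars
            ));
        }
    }
    Ok(())
}
```

Keeping each format-specific section as an `Option` means an absent section can fall back to defaults, while validation only inspects the sections that were actually supplied.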
Location: crates/xberg/src/plugins/
Plugin registry loaded at startup, cached for zero-cost lookup.
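One common way to get a load-once, cached registry in Rust is a `OnceLock` holding a map of trait objects. The sketch below assumes that approach; the `Extractor` trait, the `PlainText` plugin, and the MIME-keyed map are hypothetical and are not taken from crates/xberg/src/plugins/.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical plugin trait; `Send + Sync` lets the registry live in a static.
pub trait Extractor: Send + Sync {
    fn name(&self) -> &'static str;
    fn extract(&self, bytes: &[u8]) -> String;
}

/// Hypothetical built-in plugin.
struct PlainText;

impl Extractor for PlainText {
    fn name(&self) -> &'static str {
        "plain-text"
    }
    fn extract(&self, bytes: &[u8]) -> String {
        String::from_utf8_lossy(bytes).into_owned()
    }
}

type Registry = HashMap<&'static str, Box<dyn Extractor>>;

static REGISTRY: OnceLock<Registry> = OnceLock::new();

/// Build the registry on first use; later calls return the cached map.
fn registry() -> &'static Registry {
    REGISTRY.get_or_init(|| {
        let mut map: Registry = HashMap::new();
        map.insert("text/plain", Box::new(PlainText));
        map
    })
}

/// Look up a plugin by MIME type without rebuilding or locking the map.
pub fn lookup(mime: &str) -> Option<&'static dyn Extractor> {
    registry().get(mime).map(|boxed| &**boxed)
}
```

After initialization each lookup is a single hash-map read with no allocation or mutex, which is the sense in which the cost is close to zero in this sketch.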
Location: Cargo.toml (workspace), crates/xberg/Cargo.toml, FEATURE_MATRIX.md
20+ features across 9 language bindings. Key feature groups:
| Group | Features | Notes |
|---|---|---|
| OCR | tesseract (default), tesseract-static, ocr-minimal | Mutually exclusive recommendation |
| Formats | pdf, pdf-minimal, office, office-minimal | |
| AI/ML | embeddings (requires ONNX), keywords-yake, keywords-rake, language-detection | |
| Server | api (Axum), mcp, tokio-runtime, lite-runtime | |
| Bindings | python-bindings, ruby-bindings, php-bindings, node-bindings, wasm | |
Conditional compilation: modules are gated with #[cfg(feature = "...")]. At runtime, validate_config() warns if a requested feature was not compiled in.
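A hedged sketch of how feature gating and the runtime warning could interact is shown below. The signature of `validate_config` here is an assumption (the real one is not shown in this document), `compiled_features` is a hypothetical helper, and only three feature names from the table are probed. The final item shows one standard way a conflicting pair such as ocr-minimal and tesseract could be rejected at build time, using the `compile_error!` macro.

```rust
/// Features compiled into this build, probed with the `cfg!` macro.
/// Hypothetical helper covering three of the feature names from the table.
pub fn compiled_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(feature = "tesseract") {
        features.push("tesseract");
    }
    if cfg!(feature = "pdf") {
        features.push("pdf");
    }
    if cfg!(feature = "embeddings") {
        features.push("embeddings");
    }
    features
}

/// Return one warning per requested feature that is missing from `compiled`.
/// Assumed signature, chosen so the check is easy to test.
pub fn validate_config(requested: &[&str], compiled: &[&str]) -> Vec<String> {
    requested
        .iter()
        .filter(|name| !compiled.contains(*name))
        .map(|name| format!("feature `{name}` was requested but is not compiled in"))
        .collect()
}

// Compile-time guard: only triggers when both features are enabled together.
#[cfg(all(feature = "ocr-minimal", feature = "tesseract"))]
compile_error!("features `ocr-minimal` and `tesseract` cannot be enabled together");
```

A caller would typically pass `&compiled_features()` as the second argument and log the returned warnings rather than fail, matching the "warns" behavior described above.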
ocr-minimal + tesseract should error at compile time
run_pipeline() for validators/hooks
REST API server and MCP protocol integration
Chunking, embeddings, and RAG pipeline integration
Plugin architecture, registration, and trait patterns
Format-specific document extraction workflows