| name | format-specific-extraction |
| description | Format-specific document extraction workflows |
| priority | high |
Format-Specific Extraction Workflows
Office XML (DOCX/PPTX/ODT)
ZIP archive → Security validation → XML parsing → Text + tables + metadata
- `ZipBombValidator::new(limits).validate(&mut archive)?`
- Extract XML files from the archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
- Parse with `quick_xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
- Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
- See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`
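
The streaming parse step pairs the XML reader with guards against hostile documents. As a rough illustration of what a nesting-depth guard might look like, here is a minimal sketch. It is not the project's `DepthValidator`, whose API is not shown in this document: the `DepthGuard` type, its `enter`/`leave` methods, and the `max_depth` parameter are all hypothetical, and the caller is assumed to invoke them on each start and end tag.

```rust
/// Hypothetical depth guard; the real `DepthValidator` API may differ.
#[derive(Debug)]
pub struct DepthGuard {
    depth: usize,
    max_depth: usize,
}

#[derive(Debug, PartialEq)]
pub enum DepthError {
    /// Opening another element would exceed the configured limit.
    TooDeep { limit: usize },
    /// An end tag arrived with no matching start tag.
    UnbalancedClose,
}

impl DepthGuard {
    pub fn new(max_depth: usize) -> Self {
        Self { depth: 0, max_depth }
    }

    /// Call on every start tag.
    pub fn enter(&mut self) -> Result<(), DepthError> {
        if self.depth >= self.max_depth {
            return Err(DepthError::TooDeep { limit: self.max_depth });
        }
        self.depth += 1;
        Ok(())
    }

    /// Call on every end tag.
    pub fn leave(&mut self) -> Result<(), DepthError> {
        self.depth = self
            .depth
            .checked_sub(1)
            .ok_or(DepthError::UnbalancedClose)?;
        Ok(())
    }

    pub fn depth(&self) -> usize {
        self.depth
    }
}
```

Rejecting the element before incrementing keeps the counter bounded by `max_depth`, so a document with unbounded nesting fails fast instead of growing parser state.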
PDF
Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
- `pdf_oxide::PdfDocument::from_bytes(content)?`
- Check whether OCR is needed: `config.force_ocr || !has_searchable_text()`
- Extract text per page, plus tables if `config.pages` is enabled
- Feature-gated: `#[cfg(feature = "pdf")]`
- See: `extractors/pdf/mod.rs`
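
The OCR decision above is a simple boolean, and a small sketch may make the fallback behaviour concrete. Everything here is an assumption for illustration: `PdfConfig` is a hypothetical stand-in for the real config type, and the "any page has non-whitespace text" heuristic is a guess at what `has_searchable_text()` checks, since the source does not define it.

```rust
/// Hypothetical subset of the extraction config; only `force_ocr` is modelled.
#[derive(Debug, Default, Clone)]
pub struct PdfConfig {
    pub force_ocr: bool,
}

/// Assumed heuristic: the document counts as searchable if any page
/// yielded non-whitespace text from the native text layer.
pub fn has_searchable_text(page_texts: &[String]) -> bool {
    page_texts.iter().any(|t| !t.trim().is_empty())
}

/// Mirrors `config.force_ocr || !has_searchable_text()`.
pub fn needs_ocr(config: &PdfConfig, page_texts: &[String]) -> bool {
    config.force_ocr || !has_searchable_text(page_texts)
}
```

With this shape, a scanned PDF whose pages extract as empty strings falls through to OCR automatically, while `force_ocr` overrides the check for documents with an unreliable text layer.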
Archives (ZIP/TAR/7z/GZIP)
Validate → Extract metadata → Extract plaintext files only
- Run `ZipBombValidator` BEFORE any extraction
- Extract metadata (file list, sizes)
- Extract text content from plaintext files
- Use the `build_archive_result()` helper
- See: `extractors/archive.rs`, `extraction/archive/*.rs`
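
To show why validation has to run before extraction, here is a minimal sketch of the kind of checks a zip-bomb validator typically performs using only the archive's declared entry sizes. It is not the project's `ZipBombValidator`: `ArchiveLimits`, `EntryInfo`, `BombError`, and `validate_entries` are hypothetical names, and the three limits (entry count, total size, compression ratio) are assumptions about what such a validator would enforce.

```rust
/// Hypothetical limits; the real validator's fields are not shown in this document.
#[derive(Debug, Clone, Copy)]
pub struct ArchiveLimits {
    pub max_entries: usize,
    pub max_total_uncompressed: u64,
    /// Maximum allowed uncompressed/compressed ratio per entry.
    pub max_ratio: u64,
}

/// Sizes as declared in the archive directory, read without decompressing.
#[derive(Debug, Clone, Copy)]
pub struct EntryInfo {
    pub compressed: u64,
    pub uncompressed: u64,
}

#[derive(Debug, PartialEq)]
pub enum BombError {
    TooManyEntries,
    TotalTooLarge,
    RatioTooHigh { index: usize },
}

pub fn validate_entries(entries: &[EntryInfo], limits: &ArchiveLimits) -> Result<(), BombError> {
    if entries.len() > limits.max_entries {
        return Err(BombError::TooManyEntries);
    }
    let mut total: u64 = 0;
    for (index, e) in entries.iter().enumerate() {
        // Multiply instead of dividing so a zero compressed size cannot panic.
        let ratio_cap = e.compressed.max(1).saturating_mul(limits.max_ratio);
        if e.uncompressed > ratio_cap {
            return Err(BombError::RatioTooHigh { index });
        }
        // Saturating add keeps hostile sizes from overflowing the running total.
        total = total.saturating_add(e.uncompressed);
        if total > limits.max_total_uncompressed {
            return Err(BombError::TotalTooLarge);
        }
    }
    Ok(())
}
```

Declared sizes can be forged, so a check like this is usually paired with a hard cap on bytes actually read during extraction.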
Structured Text (JSON/YAML/TOML/XML)
Detect format from MIME → Parse → Pretty-print → Metadata
A single `StructuredExtractor` handles multiple MIME types: parse with the format-specific library, then pretty-print to text.
See: `extractors/structured.rs`
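
The "detect format from MIME" step could be sketched as a small dispatch function. This is an illustration only: the `StructuredFormat` enum and `detect_format` function are hypothetical, and the MIME strings listed are a plausible but assumed set, not the extractor's actual `supported_mime_types()`.

```rust
/// Hypothetical format tag used to pick a parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuredFormat {
    Json,
    Yaml,
    Toml,
    Xml,
}

/// Map a MIME type to a structured format, ignoring parameters and case.
/// The accepted MIME strings are illustrative assumptions.
pub fn detect_format(mime: &str) -> Option<StructuredFormat> {
    // Drop parameters such as "; charset=utf-8" and normalise case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match essence.as_str() {
        "application/json" => Some(StructuredFormat::Json),
        "application/yaml" | "application/x-yaml" | "text/yaml" | "text/x-yaml" => {
            Some(StructuredFormat::Yaml)
        }
        "application/toml" | "text/toml" => Some(StructuredFormat::Toml),
        "application/xml" | "text/xml" => Some(StructuredFormat::Xml),
        _ => None,
    }
}
```

Returning `Option` lets the caller fall back to another extractor when the MIME type is not a structured format.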
Email (EML/MSG)
Parse headers → Extract body (text/html) → Process attachments
See: `extraction/email.rs`, `extractors/email.rs`
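
For plain EML input, the first two stages of that pipeline can be sketched with the standard library alone. This is a simplified illustration, not the project's parser: `ParsedEmail` and `split_email` are hypothetical names, and the sketch handles only the header/body split and folded header lines. It does not decode MIME multipart bodies, encoded words, or attachments, and it does not apply to the binary MSG format.

```rust
/// Hypothetical result type: raw headers in order, plus the undecoded body.
#[derive(Debug)]
pub struct ParsedEmail {
    pub headers: Vec<(String, String)>,
    pub body: String,
}

/// Split an EML message into headers and body at the first blank line.
pub fn split_email(raw: &str) -> ParsedEmail {
    // Normalise CRLF so the blank-line separator is always "\n\n".
    let normalized = raw.replace("\r\n", "\n");
    let (head, body) = match normalized.split_once("\n\n") {
        Some((h, b)) => (h, b),
        None => (normalized.as_str(), ""),
    };

    let mut headers: Vec<(String, String)> = Vec::new();
    for line in head.lines() {
        if line.starts_with(' ') || line.starts_with('\t') {
            // A line starting with whitespace continues the previous header.
            if let Some((_, value)) = headers.last_mut() {
                value.push(' ');
                value.push_str(line.trim());
            }
        } else if let Some((name, value)) = line.split_once(':') {
            headers.push((name.trim().to_string(), value.trim().to_string()));
        }
    }

    ParsedEmail {
        headers,
        body: body.to_string(),
    }
}
```

Headers are kept as an ordered list rather than a map because fields such as `Received` may legitimately repeat.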
Common Helpers
| Helper | Location | Purpose |
|---|---|---|
| `office_metadata::extract_metadata()` | `extraction/office.rs` | Office XML metadata |
| `cells_to_markdown()` | `extraction/mod.rs` | Convert cell grid to GFM table |
| `build_archive_result()` | `extraction/archive/mod.rs` | Standard archive result |
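
To make the cell-grid-to-GFM conversion concrete, here is a minimal sketch of what such a helper might do. The real `cells_to_markdown()` signature is not shown in this document, so the function name `cells_to_markdown_sketch`, the `&[Vec<String>]` input type, and the choice to treat the first row as the header are all assumptions.

```rust
/// Render one row, padding short rows with empty cells up to `cols`.
fn render_row(row: &[String], cols: usize) -> String {
    let mut line = String::from("|");
    for i in 0..cols {
        let cell = row.get(i).map(String::as_str).unwrap_or("");
        // Escape pipes and flatten newlines so a cell cannot break the row.
        let cell = cell.replace('|', "\\|").replace('\n', " ");
        line.push(' ');
        line.push_str(cell.trim());
        line.push_str(" |");
    }
    line
}

/// Hypothetical sketch: convert a cell grid to a GFM table,
/// treating the first row as the header.
pub fn cells_to_markdown_sketch(cells: &[Vec<String>]) -> String {
    let cols = cells.iter().map(|r| r.len()).max().unwrap_or(0);
    if cols == 0 {
        return String::new();
    }
    let mut out = Vec::with_capacity(cells.len() + 1);
    out.push(render_row(&cells[0], cols));
    // GFM requires a delimiter row directly under the header.
    out.push(format!("|{}", " --- |".repeat(cols)));
    for row in &cells[1..] {
        out.push(render_row(row, cols));
    }
    out.join("\n")
}
```

Padding ragged rows to the widest row keeps every line at the same column count, which GFM renderers need in order to recognise the block as a table.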
Adding a New Format
- Add MIME type to `EXT_TO_MIME` in `core/mime.rs`
- Create extractor implementing the `DocumentExtractor` trait
- Set `supported_mime_types()` and `priority()` (default: 50)
- Register in `extractors/mod.rs` → `register_default_extractors()`
- Feature-gate if optional: `#[cfg(feature = "my-format")]`
- Apply security validators for user content
- Add tests with fixture files
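
The checklist above might translate into code roughly as follows. This is a sketch under stated assumptions, not the crate's actual API: the method signatures on `DocumentExtractor`, the `extract` method, the `Registry` type, the `application/x-my-format` MIME type, and the rule that a higher `priority()` wins are all hypothetical. Only the method names `supported_mime_types()` and `priority()` and the default of 50 come from the text.

```rust
/// Assumed trait shape; the real `DocumentExtractor` signatures may differ.
pub trait DocumentExtractor {
    fn supported_mime_types(&self) -> &[&'static str];

    /// Default priority of 50, as stated in the checklist.
    fn priority(&self) -> i32 {
        50
    }

    /// Hypothetical extraction entry point returning plain text.
    fn extract(&self, content: &[u8]) -> Result<String, String>;
}

// In the real crate this type would sit behind `#[cfg(feature = "my-format")]`.
pub struct MyFormatExtractor;

impl DocumentExtractor for MyFormatExtractor {
    fn supported_mime_types(&self) -> &[&'static str] {
        &["application/x-my-format"]
    }

    fn extract(&self, content: &[u8]) -> Result<String, String> {
        // Placeholder logic: treat the payload as UTF-8 text.
        String::from_utf8(content.to_vec()).map_err(|e| e.to_string())
    }
}

/// Hypothetical stand-in for whatever `register_default_extractors()` fills.
#[derive(Default)]
pub struct Registry {
    extractors: Vec<Box<dyn DocumentExtractor>>,
}

impl Registry {
    pub fn register(&mut self, extractor: Box<dyn DocumentExtractor>) {
        self.extractors.push(extractor);
    }

    /// Pick the matching extractor with the highest priority (assumed ordering).
    pub fn find(&self, mime: &str) -> Option<&dyn DocumentExtractor> {
        self.extractors
            .iter()
            .filter(|e| e.supported_mime_types().iter().any(|m| *m == mime))
            .max_by_key(|e| e.priority())
            .map(|e| &**e)
    }
}
```

A fixture-based test would then register the extractor, look it up by MIME type, and compare the output of `extract` on a sample file against the expected text.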