| name | office-to-md |
| description | Convert Office documents (Word, Excel, PowerPoint, PDF) to Markdown format. ONLY use this skill when the user explicitly requests to CONVERT, TRANSFORM or PARSE a specific office file into Markdown. Do NOT trigger for general questions, documentation reading, or discussions about files. |
Office Document to Markdown Converter
Convert various Office document formats to structured Markdown with text, table, and image extraction.
File Description
enhanced_parser.py - Core document parser
doc_converter.py - DOC to DOCX converter (requires LibreOffice)
requirements.txt - Python dependencies
Install Dependencies
pip install -r requirements.txt
Additional Dependencies for DOC Format
.doc format requires LibreOffice:
sudo apt install libreoffice
brew install --cask libreoffice
Quick Start
Python Code
from enhanced_parser import EnhancedDocumentParser
parser = EnhancedDocumentParser(
image_base_url="http://localhost:5000",
image_save_dir="./static/images",
filter_headers_footers=True
)
result = parser.parse_document("document.docx")
if result["success"]:
print(result["markdown"])
print(f"Extracted {result['images_count']} images")
Start API Service
python app.py
Supported Formats
| Format | Extensions | Notes |
|---|
| Word | .docx, .doc | .doc requires LibreOffice |
| Excel | .xlsx, .xls | Supports multiple worksheets and date formats |
| PowerPoint | .pptx | Extracts slide text and images |
| PDF | .pdf | Auto-detects tables and images |
Features
Word Documents
- Automatic heading level detection
- Convert tables to Markdown tables
- Extract inline images
- Filter headers and footers
- Preserve list formatting
Excel Workbooks
- Support for multiple worksheets
- Automatic date format detection (prevents display as numbers)
- Convert to Markdown tables
- Extract embedded images
PowerPoint Presentations
- Extract content by slide
- Extract images and text boxes
- Preserve slide order
PDF Documents
- Auto-detect tables (line detection + text position detection)
- Extract page images
- Intelligently identify headings and lists
- Output content in original order
Advanced Options
DOC Conversion
python doc_converter.py
PDF Table Strategy
parser = EnhancedDocumentParser(
pdf_table_strategy="lines_strict"
)
Image Processing
parser = EnhancedDocumentParser(
image_base_url="https://your-domain.com",
image_save_dir="./static/images"
)
Return Format
{
"success": true,
"markdown": "# Document Title\n\nContent...",
"images_count": 2,
"images": [
{
"filename": "uuid.png",
"url": "http://localhost:5000/static/images/uuid.png",
"size": 12345
}
],
"file_type": "docx",
"file_info": {
"name": "document.docx",
"size": 45678,
"paragraphs": 50,
"tables": 3
}
}
Common Issues
DOC Conversion Failed
- Ensure LibreOffice is installed
- Run
python doc_converter.py to test configuration
Dates Display as Numbers
- Excel parsing automatically handles date formats
- Ensure you're using the latest version of enhanced_parser.py
PDF Table Recognition Inaccurate
- Try different pdf_table_strategy parameters
- Use "lines_strict" for standard tables
- Use "text" for complex tables
File Limitations
- Maximum file size: 160MB
- Supported extensions: docx, doc, pdf, xlsx, xls, pptx
- Automatic cleanup of temporary files