| name | doc-reader |
| description | PDF/DOCX/XLSX/image document intelligence — text extraction, table parsing, OCR, financial statement analysis, contract clause detection, document classification, and format conversion. Use when reading, analyzing, extracting data from, or converting documents. |
| triggers | ["read pdf","extract text from document","parse tables from pdf","ocr image","analyze contract","financial statement","convert document","extract tables","read docx","read excel","document analysis"] |
| category | research |
| maturity | stable |
| tags | ["pdf-extraction","ocr","table-parsing","docx","document-intelligence"] |
Doc Reader — Document Intelligence
Extract text, tables, and insights from common document formats: native-text PDFs, scanned documents (via OCR), Word files, spreadsheets, images, and HTML.
Quick Reference
SCRIPTS="${CLAUDE_SKILLS_DIR:-$HOME/.claude-agent/.claude/skills}/doc-reader/scripts"
python3 "$SCRIPTS/extract.py" document.pdf
python3 "$SCRIPTS/extract.py" document.pdf --mode tables
python3 "$SCRIPTS/extract.py" document.pdf --mode summary
python3 "$SCRIPTS/extract.py" scanned.pdf --mode ocr --ocr-lang eng+nor
python3 "$SCRIPTS/analyze.py" document.pdf --extract-figures
python3 "$SCRIPTS/convert.py" report.pdf report.md
python3 "$SCRIPTS/convert.py" data.xlsx data.csv
Scripts
1. extract.py — Universal Text & Table Extraction
The core extraction engine. Supports PDF, DOCX, XLSX, CSV, images, HTML, and plain text.
Extraction Modes
| Mode | Description |
|---|---|
| text | Full text extraction (default). Auto-falls back to OCR for scanned pages |
| tables | Table extraction only (as markdown, CSV, or JSON) |
| meta | Document metadata (author, dates, page count, file size) |
| summary | Text + tables + metadata combined |
| ocr | Force OCR on all pages (for fully scanned documents) |
| layout | Layout-preserving text extraction via pdftotext (columns, spacing) |
| images | Extract embedded images from PDF |
Usage
python3 extract.py report.pdf
python3 extract.py contract.docx
python3 extract.py financials.xlsx
python3 extract.py annual-report.pdf --pages 1-5,8,12-15
python3 extract.py financials.pdf --mode tables --table-format json
python3 extract.py scan.pdf --mode ocr --ocr-lang eng+nor
python3 extract.py report.pdf --mode summary --output /tmp/report-summary.txt
python3 extract.py report.pdf --mode summary --json --output /tmp/report.json
python3 extract.py data.xlsx --sheet "Revenue" --mode tables
python3 extract.py newspaper.pdf --mode layout
python3 extract.py huge-doc.pdf --max-pages 20
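The `--pages 1-5,8,12-15` syntax above combines ranges and single pages. As a rough illustration of how such a spec might be expanded, here is a minimal sketch. The function name `parse_page_spec` is hypothetical, and the assumptions that pages are 1-based and that ranges are inclusive are not confirmed by the source; the actual parsing in extract.py may differ.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec like '1-5,8,12-15' into sorted, de-duplicated page numbers.

    Assumption (not confirmed by extract.py): pages are 1-based and ranges
    are inclusive on both ends.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1-3,,5"
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

For example, `parse_page_spec("1-5,8,12-15")` yields pages 1 through 5, then 8, then 12 through 15.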
Supported Formats
| Format | Extensions | Text | Tables | OCR | Metadata |
|---|---|---|---|---|---|
| PDF | .pdf | ✅ PyMuPDF + pdftotext | ✅ pdfplumber + tabula | ✅ Tesseract | ✅ |
| Word | .docx, .doc | ✅ python-docx | ✅ | ❌ | ✅ |
| Excel | .xlsx, .xls | ✅ openpyxl | ✅ | ❌ | ✅ |
| CSV | .csv | ✅ | ✅ | ❌ | ⚠️ |
| Images | .png, .jpg, .tiff, .bmp | ✅ OCR | ❌ | ✅ | ✅ Pillow |
| HTML | .html, .htm | ✅ BeautifulSoup | ❌ | ❌ | ⚠️ |
| Text | .txt, .md, .json, .xml | ✅ | ❌ | ❌ | ⚠️ |
PDF Extraction Strategy
The extractor uses a multi-strategy approach:
- PyMuPDF (primary) — fastest, best for native text PDFs
- pdftotext (layout mode) — preserves column layouts, spacing
- Tesseract OCR (fallback) — auto-triggered when a page has <20 chars of text
- pdfplumber (tables) — best general-purpose table extractor
- tabula-java (tables alt) — better for bordered/ruled tables
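The OCR fallback rule above (a page with fewer than 20 characters of text is treated as scanned) can be sketched as a simple filter over per-page text. This is only an illustration: the name `pages_needing_ocr` is hypothetical, and whether extract.py strips whitespace before counting is an assumption.

```python
OCR_CHAR_THRESHOLD = 20  # from the strategy list above: fewer than 20 chars triggers OCR


def pages_needing_ocr(page_texts: list[str],
                      threshold: int = OCR_CHAR_THRESHOLD) -> list[int]:
    """Return 1-based page numbers whose extracted text is too short to trust.

    Assumption: surrounding whitespace is ignored when counting characters.
    """
    return [
        page_no
        for page_no, text in enumerate(page_texts, start=1)
        if len(text.strip()) < threshold
    ]
```

In a real pipeline the strings would come from the primary extractor (PyMuPDF in this skill), and only the returned pages would be rendered and passed to Tesseract.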
2. analyze.py — Document Intelligence
Higher-level analysis on extracted content.
python3 analyze.py document.pdf
python3 analyze.py statement.pdf --type financial
python3 analyze.py agreement.pdf --type contract
python3 analyze.py document.pdf --extract-figures
python3 analyze.py document.pdf --json --output /tmp/analysis.json
Document Classification
Automatically detects:
- Financial statements — balance sheets, income statements, cash flow
- Contracts — agreements, NDAs, terms of service
- Research papers — academic papers with abstract/methodology/results
- Invoices — bills with line items and totals
- Reports — business reports with executive summaries
- Resumes — CVs with experience/education sections
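The source does not say how analyze.py performs this detection. One simple way to approximate it is keyword scoring, sketched below. The keyword lists, the `classify` function, and the `"unknown"` fallback are all illustrative assumptions, not the script's actual logic.

```python
# Hypothetical keyword lists, loosely based on the categories listed above.
KEYWORDS = {
    "financial": ["balance sheet", "income statement", "cash flow"],
    "contract": ["agreement", "confidentiality", "governing law"],
    "research": ["abstract", "methodology", "results"],
    "invoice": ["invoice", "line item", "total due"],
    "report": ["executive summary"],
    "resume": ["experience", "education"],
}


def classify(text: str) -> str:
    """Return the label whose keywords occur most often, or 'unknown' if none match."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

A scorer like this is easy to inspect but brittle, so a production classifier would likely also weigh document structure such as headings and tables.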
Financial Analysis
For financial documents, extracts:
- Key metrics (revenue, net income, EBITDA, EPS, total assets)
- Financial table detection and summarization
- Period identification (quarterly, annual)
Contract Analysis
For legal documents, extracts:
- Parties involved
- Key dates
- Clause inventory (confidentiality, termination, governing law, IP, etc.)
- Governing law jurisdiction
- Obligation detection
Figure Extraction
Extracts structured data from any document:
- Monetary values — $1,234.56, €500K, USD 5,000, NOK 10 000
- Percentages — 15.5%, -2.3%
- Dates — 2024-01-15, January 15 2024, 15/01/2024
- Emails, URLs, phone numbers
- Numbers with units — 500 MW, 1.2M barrels, 50 kg
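A few of the patterns above can be approximated with regular expressions. The sketch below covers only percentages, ISO dates, and currency amounts with a symbol or ISO code. The patterns and the `extract_figures` name are assumptions for illustration; the expressions used by analyze.py are not shown in the source and are probably more thorough.

```python
import re

PERCENT_RE = re.compile(r"-?\d+(?:\.\d+)?%")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
MONEY_RE = re.compile(
    r"(?:[$€£]\s?|\b(?:USD|EUR|GBP|NOK)\s)"  # currency symbol or ISO code
    r"\d+(?:[, ]\d{3})*(?:\.\d+)?"            # digits; ',' or ' ' as thousands separator
    r"[KMB]?"                                  # optional magnitude suffix
)


def extract_figures(text: str) -> dict[str, list[str]]:
    """Collect matches for each pattern, in order of appearance."""
    return {
        "money": MONEY_RE.findall(text),
        "percentages": PERCENT_RE.findall(text),
        "dates": ISO_DATE_RE.findall(text),
    }
```

Written-out dates such as "January 15 2024", day-first dates, emails, and unit quantities are left out to keep the sketch short.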
3. convert.py — Format Conversion
Convert between document formats.
python3 convert.py report.pdf report.md
python3 convert.py report.pdf report.txt
python3 convert.py report.pdf report.html
python3 convert.py report.pdf page.png
python3 convert.py report.pdf report.json
python3 convert.py doc.docx doc.md
python3 convert.py doc.docx doc.pdf
python3 convert.py doc.docx doc.txt
python3 convert.py data.xlsx data.csv
python3 convert.py data.xlsx data.json
python3 convert.py data.xlsx data.md
python3 convert.py data.csv data.json
python3 convert.py data.csv data.xlsx
python3 convert.py scan.png scan.txt
python3 convert.py photo.jpg photo.pdf
python3 convert.py doc.md doc.pdf
python3 convert.py doc.md doc.html
python3 convert.py doc.md doc.docx
python3 convert.py scan.pdf text.md --pages 1-10 --ocr-lang eng+nor
python3 convert.py data.xlsx out.csv --sheet "Q4 Revenue"
python3 convert.py report.pdf images.png --dpi 300
Conversion Matrix
| From ↓ / To → | .txt | .md | .pdf | .html | .csv | .json | .png | .docx |
|---|---|---|---|---|---|---|---|---|
| .pdf | ✅ | ✅ | — | ✅ | ❌ | ✅ | ✅ | ❌ |
| .docx | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | — |
| .xlsx | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| .csv | ✅ | ✅ | ❌ | ❌ | — | ✅ | ❌ | ❌ |
| .png/.jpg | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | — | ❌ |
| .html | ✅ | ✅ | ✅ | — | ❌ | ❌ | ❌ | ❌ |
| .md | ❌ | — | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
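When scripting conversions, it can help to check a source/target pair against the matrix before invoking convert.py. The sketch below transcribes the ✅ cells of the table into a lookup. The `SUPPORTED` mapping and `can_convert` helper are hypothetical names, and pairs that appear only in the usage examples (such as .csv to .xlsx, for which the matrix has no column) are deliberately left out.

```python
from pathlib import Path

# Transcribed from the conversion matrix above (✅ cells only).
SUPPORTED = {
    ".pdf": {".txt", ".md", ".html", ".json", ".png"},
    ".docx": {".txt", ".md", ".pdf", ".html"},
    ".xlsx": {".txt", ".md", ".csv", ".json"},
    ".csv": {".txt", ".md", ".json"},
    ".png": {".txt", ".pdf"},
    ".html": {".txt", ".md", ".pdf"},
    ".md": {".pdf", ".html", ".docx"},
}
SUPPORTED[".jpg"] = SUPPORTED[".png"]  # the matrix lists .png/.jpg as one row


def can_convert(src: str, dst: str) -> bool:
    """Return True if the matrix marks src's extension as convertible to dst's."""
    src_ext = Path(src).suffix.lower()
    dst_ext = Path(dst).suffix.lower()
    return dst_ext in SUPPORTED.get(src_ext, set())
```

A caller could use `can_convert("report.pdf", "report.md")` as a guard and report an unsupported pair before starting a subprocess.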
Dependencies
System packages (installed):
tesseract-ocr + tesseract-ocr-eng + tesseract-ocr-nor — OCR engine
poppler-utils — pdftotext, pdftohtml
pandoc — universal document converter
ghostscript — PDF processing
Python packages (installed):
pymupdf — PDF rendering, text extraction, image extraction
pdfplumber — PDF table extraction
python-docx — Word document parsing
openpyxl — Excel file parsing
tabula-py — Python wrapper around tabula-java for PDF table extraction (requires a Java runtime)
Pillow — Image processing
beautifulsoup4 — HTML parsing
OCR languages installed: English (eng), Norwegian (nor)
To add more OCR languages:
sudo apt-get install tesseract-ocr-<lang>
Agent Workflow
When to use which script:
- "Read this PDF" → extract.py <file> (default text mode)
- "Get the tables from this document" → extract.py <file> --mode tables
- "What kind of document is this?" → analyze.py <file>
- "Extract all dollar amounts" → analyze.py <file> --extract-figures
- "Is this contract safe to sign?" → analyze.py <file> --type contract
- "Parse this financial statement" → analyze.py <file> --type financial
- "Convert this PDF to markdown" → convert.py <file>.pdf <file>.md
- "OCR this scanned document" → extract.py <file> --mode ocr
- "What's on pages 5-10?" → extract.py <file> --pages 5-10
For large documents:
- Use --pages to extract specific sections
- Use --max-pages 20 to limit processing
- Use --mode meta first to check page count before full extraction
- Use --json output for programmatic post-processing
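For agents that assemble these invocations programmatically, a small helper can keep the flags consistent. The sketch below only builds an argument list from flags documented above (--mode, --pages, --max-pages, --json). The function name `build_extract_command` is hypothetical, and whether extract.py accepts every combination of these flags is an assumption.

```python
from typing import Optional


def build_extract_command(path: str,
                          mode: str = "text",
                          pages: Optional[str] = None,
                          max_pages: Optional[int] = None,
                          as_json: bool = False,
                          script: str = "extract.py") -> list[str]:
    """Build an argv list for extract.py using only the flags documented above."""
    cmd = ["python3", script, path]
    if mode != "text":  # text is the documented default, so omit it
        cmd += ["--mode", mode]
    if pages:
        cmd += ["--pages", pages]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if as_json:
        cmd.append("--json")
    return cmd
```

A typical large-document flow would first run `build_extract_command("huge-doc.pdf", mode="meta")` through `subprocess.run` to read the page count, then issue a second command with a narrower `pages` value.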
For scanned/image-heavy PDFs:
- Try --mode text first (auto-OCR fallback)
- If quality is poor, use --mode ocr to force full OCR
- Add --ocr-lang eng+nor for multi-language documents
- Use --mode layout for column-heavy layouts