| name | edgeparse |
| description | Use this skill whenever the user wants to extract structured content from PDF files — reading or extracting text, tables, headings, images, and bounding boxes from PDFs, or converting PDFs to Markdown, JSON, HTML, or plain text. EdgeParse is a zero-dependency Rust-native tool (no JVM, no GPU, no OCR models required). Use this skill when the user has a .pdf file and wants to parse it, extract its text or tables, convert it to Markdown or JSON, get bounding boxes for content elements, process multiple PDFs in batch, or pipe PDF content to an LLM pipeline. |
| compatibility | Requires edgeparse CLI installed (via Homebrew, cargo, or binary download). See Step 0 for installation. |
| license | Apache-2.0 |
| metadata | {"author":"raphaelmansuy","version":"0.1.0","homepage":"https://github.com/raphaelmansuy/edgeparse"} |
EdgeParse Skill
Parse PDFs into structured content (Markdown, JSON with bounding boxes, HTML, plain text) using EdgeParse — a fast, zero-dependency Rust-native PDF extraction engine.
Initial Setup
When this skill is invoked, respond with:
I'm ready to use EdgeParse to extract content from your PDF(s).
Before we begin, please confirm that `edgeparse` is installed:
edgeparse --version # should print "edgeparse 0.1.0"
If it's not installed, I'll help you install it first.
Please provide:
1. One or more PDF files to process
2. The desired output format: markdown (default), json, html, or text
3. Any options: specific pages, table detection method, image extraction, etc.
4. What you'd like to do with the extracted content (read it, chunk it for RAG, analyse tables, etc.)
Then wait for the user's input.
Step 0 — Install EdgeParse (if needed)
macOS (Homebrew — recommended)
brew tap raphaelmansuy/tap
brew install edgeparse
Any platform (Rust toolchain)
cargo install edgeparse-cli
Python wrapper
pip install edgeparse
python -c "import edgeparse; print(edgeparse.version())"
Node.js wrapper
npm install edgeparse
node -e "const {version} = require('edgeparse'); console.log(version())"
Verify installation:
edgeparse --version
Step 1 — Produce the CLI Command or Script
Convert a Single PDF
edgeparse document.pdf
edgeparse document.pdf -f markdown -o output/
edgeparse document.pdf -f json -o output/
edgeparse document.pdf -f html -o output/
edgeparse document.pdf -f text -o output/
Specific Pages
edgeparse document.pdf --pages "1" -f markdown
edgeparse document.pdf --pages "1-5" -f json
edgeparse document.pdf --pages "1,3,5-7,10" -f markdown
Batch Processing
edgeparse *.pdf -f markdown -o output/
edgeparse report.pdf paper.pdf -f markdown,json -o output/
Extract Tables
edgeparse document.pdf -f json
edgeparse document.pdf -f json --table-method cluster
Extract Images
edgeparse document.pdf -f markdown --image-output external --image-dir output/images/
edgeparse document.pdf -f markdown-with-images --image-output embedded
Password-Protected PDF
edgeparse document.pdf -f markdown --password "mypassword"
Reading Order
edgeparse document.pdf -f markdown --reading-order xycut
edgeparse document.pdf -f markdown --reading-order off
Multiple Output Formats in One Pass
edgeparse document.pdf -f markdown,json -o output/
Step 2 — Use the Python SDK (when scripting)
import edgeparse
md = edgeparse.convert("document.pdf", format="markdown")
import json
doc = json.loads(edgeparse.convert("document.pdf", format="json"))
headings = [e for e in doc["kids"] if e["type"] == "heading"]
for h in headings:
print(f'H{h["heading level"]} {h["content"]}')
for e in doc["kids"]:
if e["type"] == "table":
for row in e["rows"]:
print(" | ".join(row))
out_path = edgeparse.convert_file(
"document.pdf",
output_dir="output/",
format="markdown",
pages="1-5",
)
from concurrent.futures import ThreadPoolExecutor
import glob
def process(path):
return (path, edgeparse.convert(path, format="markdown"))
with ThreadPoolExecutor(max_workers=4) as pool:
results = list(pool.map(process, glob.glob("*.pdf")))
Step 3 — Use the Node.js SDK (when scripting)
const { convert, version } = require('edgeparse');
const md = convert('document.pdf', { format: 'markdown' });
const doc = JSON.parse(convert('document.pdf', { format: 'json' }));
const headings = doc.kids.filter(e => e.type === 'heading');
headings.forEach(h => console.log(`H${h['heading level']} ${h.content}`));
const { Worker } = require('worker_threads');
const paths = ['a.pdf', 'b.pdf', 'c.pdf'];
const results = paths.map(p => convert(p, { format: 'markdown' }));
Step 4 — Key Options Reference
Output Formats
| Format | Description |
|---|
markdown | Standard Markdown with GFM tables (default) |
markdown-with-html | Markdown with HTML table fallback for complex tables |
markdown-with-images | Markdown with embedded or linked image references |
json | Structured JSON with bounding boxes and element types |
html | Full HTML5 document with semantic elements |
text | Plain UTF-8 text, reading order preserved |
Layout Options
| Option | Default | Description |
|---|
--reading-order | xycut | xycut (XY-Cut++ algorithm) or off |
--table-method | default | default (ruling lines) or cluster (borderless) |
--keep-line-breaks | false | Preserve original line breaks in paragraphs |
--use-struct-tree | false | Use tagged PDF structure tree when available |
--include-header-footer | false | Include page headers and footers |
Image Options
| Option | Default | Description |
|---|
--image-output | off | off, embedded (base64), or external (files) |
--image-format | png | png or jpeg |
--image-dir | — | Output directory for extracted images |
Content Safety
| Option | Description |
|---|
--content-safety-off all | Disable all AI safety filters |
--content-safety-off hidden-text | Allow hidden text extraction |
--content-safety-off tiny | Allow tiny-text extraction |
Step 5 — Understanding JSON Output
The JSON format gives agents the richest representation for reasoning over document structure.
Document Envelope
{
"file name": "document.pdf",
"number of pages": 15,
"author": "...",
"title": "...",
"creation date": "D:20250101T000000Z",
"modification date": "D:20250101T000000Z",
"kids": [ ]
}
Element Types
type | Key fields |
|---|
paragraph | content, font, font size, text color |
heading | content, level (Title/H1-H6), heading level (int) |
table | rows (2D string array, first row is header) |
list | items (string array) |
image | image path (when --image-output external) |
caption | content |
formula | content |
Bounding Box
All elements have "bounding box": [x0, y0, x1, y1]:
- Origin: bottom-left of page (PDF coordinate system)
- Y axis: increases upward
- Units: PDF points (72 pt = 1 inch)
- A4 page: 595 × 842 pt · US Letter: 612 × 792 pt
x0, y0_pdf, x1, y1_pdf = element["bounding box"]
page_height = 842
y_top = page_height - y1_pdf
y_bottom = page_height - y0_pdf
Recipes for Common Agent Tasks
Recipe A — Chunk PDF for RAG
import edgeparse, json
doc = json.loads(edgeparse.convert("paper.pdf", format="json"))
chunks = [
{"text": e["content"], "page": e["page number"], "type": e["type"]}
for e in doc["kids"]
if e["type"] in ("paragraph", "heading", "caption")
]
Recipe B — Extract All Tables to CSV
import edgeparse, json, csv
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
for i, e in enumerate(doc["kids"]):
if e["type"] == "table":
with open(f"table_{i}.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(e["rows"])
Recipe C — Convert Folder of PDFs to Markdown
edgeparse *.pdf -f markdown -o markdown_output/ --quiet
import edgeparse, glob, os
os.makedirs("out", exist_ok=True)
for path in glob.glob("*.pdf"):
try:
md = edgeparse.convert(path, format="markdown")
with open(f"out/{os.path.splitext(os.path.basename(path))[0]}.md", "w") as f:
f.write(md)
print(f"✓ {path}")
except RuntimeError as e:
print(f"✗ {path}: {e}")
Recipe D — Heading Outline of a Document
import edgeparse, json
doc = json.loads(edgeparse.convert("paper.pdf", format="json"))
for e in doc["kids"]:
if e["type"] == "heading":
indent = " " * (e.get("heading level", 1) - 1)
print(f'{indent}{"#" * e.get("heading level", 1)} {e["content"]}')
Recipe E — First Page Only (quick preview)
edgeparse document.pdf --pages "1" -f markdown
Supported Input
EdgeParse supports born-digital PDFs only (embedded text). It does not include built-in OCR. For scanned PDFs or image-only PDFs, add --hybrid docling-fast to delegate to a Docling backend:
edgeparse scanned.pdf -f markdown \
--hybrid docling-fast \
--hybrid-url http://localhost:8080 \
--hybrid-fallback
Troubleshooting
| Problem | Solution |
|---|
command not found: edgeparse | Install via Homebrew: brew tap raphaelmansuy/tap && brew install edgeparse |
Python: ModuleNotFoundError: edgeparse | pip install edgeparse |
| Node.js: addon not found | npm install edgeparse |
| Empty output | PDF may be scanned/image-only — use --hybrid docling-fast |
| Garbled text | CJK font — ensure latest edgeparse version |
| Password error | Add --password "yourpassword" |
| Tables not detected | Try --table-method cluster for borderless tables |