
edgeparse

Name: Edgeparse
Author: raphaelmansuy

// Use this skill whenever the user wants to extract structured content from PDF files — reading or extracting text, tables, headings, images, and bounding boxes from PDFs, or converting PDFs to Markdown, JSON, HTML, or plain text. EdgeParse is a zero-dependency Rust-native tool (no JVM, no GPU, no OCR models required). Use this skill when the user has a .pdf file and wants to parse it, extract its text or tables, convert it to Markdown or JSON, get bounding boxes for content elements, process multiple PDFs in batch, or pipe PDF content to an LLM pipeline.


updated:March 23, 2026 at 07:23

SKILL.md


package.json

"author": "raphaelmansuy"

"repository": "raphaelmansuy/run-edgeparse"


name	edgeparse
description	Use this skill whenever the user wants to extract structured content from PDF files — reading or extracting text, tables, headings, images, and bounding boxes from PDFs, or converting PDFs to Markdown, JSON, HTML, or plain text. EdgeParse is a zero-dependency Rust-native tool (no JVM, no GPU, no OCR models required). Use this skill when the user has a .pdf file and wants to parse it, extract its text or tables, convert it to Markdown or JSON, get bounding boxes for content elements, process multiple PDFs in batch, or pipe PDF content to an LLM pipeline.
compatibility	Requires edgeparse CLI installed (via Homebrew, cargo, or binary download). See Step 0 for installation.
license	Apache-2.0
metadata	{"author":"raphaelmansuy","version":"0.1.0","homepage":"https://github.com/raphaelmansuy/edgeparse"}

EdgeParse Skill

Parse PDFs into structured content (Markdown, JSON with bounding boxes, HTML, plain text) using EdgeParse — a fast, zero-dependency Rust-native PDF extraction engine.

Initial Setup

When this skill is invoked, respond with:

I'm ready to use EdgeParse to extract content from your PDF(s).

Before we begin, please confirm that `edgeparse` is installed:

  edgeparse --version     # should print "edgeparse 0.1.0"

If it's not installed, I'll help you install it first.

Please provide:
1. One or more PDF files to process
2. The desired output format: markdown (default), json, html, or text
3. Any options: specific pages, table detection method, image extraction, etc.
4. What you'd like to do with the extracted content (read it, chunk it for RAG, analyse tables, etc.)

Then wait for the user's input.

Step 0 — Install EdgeParse (if needed)

macOS (Homebrew — recommended)

brew tap raphaelmansuy/tap
brew install edgeparse

Any platform (Rust toolchain)

cargo install edgeparse-cli

Python wrapper

pip install edgeparse
python -c "import edgeparse; print(edgeparse.version())"

Node.js wrapper

npm install edgeparse
node -e "const {version} = require('edgeparse'); console.log(version())"

Verify installation:

edgeparse --version   # edgeparse 0.1.0

Step 1 — Produce the CLI Command or Script

Convert a Single PDF

# Markdown output (default)
edgeparse document.pdf

# Explicit format, write to output/
edgeparse document.pdf -f markdown -o output/

# JSON with bounding boxes (ideal for RAG)
edgeparse document.pdf -f json -o output/

# HTML
edgeparse document.pdf -f html -o output/

# Plain text
edgeparse document.pdf -f text -o output/

Specific Pages

# Single page
edgeparse document.pdf --pages "1" -f markdown

# Range
edgeparse document.pdf --pages "1-5" -f json

# Non-contiguous pages
edgeparse document.pdf --pages "1,3,5-7,10" -f markdown
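
To make the page-spec syntax concrete, here is a minimal Python sketch that expands a spec such as "1,3,5-7,10" into individual page numbers. The helper name `expand_pages` is hypothetical and this is not EdgeParse's own parser; it assumes pages are 1-based and that ranges are inclusive at both ends.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1,3,5-7,10" into sorted, unique page numbers.

    Assumption: pages are 1-based and ranges include both endpoints.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)


print(expand_pages("1,3,5-7,10"))  # [1, 3, 5, 6, 7, 10]
```

A helper like this can be useful for validating user input before passing the original string to `--pages`.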

Batch Processing

# All PDFs in current directory
edgeparse *.pdf -f markdown -o output/

# Multiple explicit files, multiple formats
edgeparse report.pdf paper.pdf -f markdown,json -o output/

Extract Tables

# Default: ruling-line detection (best for tables with visible borders)
edgeparse document.pdf -f json

# Cluster method (best for borderless/whitespace-separated tables)
edgeparse document.pdf -f json --table-method cluster

Extract Images

# Extract images as external files
edgeparse document.pdf -f markdown --image-output external --image-dir output/images/

# Embed images as base64 in Markdown (self-contained)
edgeparse document.pdf -f markdown-with-images --image-output embedded

Password-Protected PDF

edgeparse document.pdf -f markdown --password "mypassword"

Reading Order

# XY-Cut++ (default, best for multi-column PDFs)
edgeparse document.pdf -f markdown --reading-order xycut

# Disable (use raw PDF element order)
edgeparse document.pdf -f markdown --reading-order off

Multiple Output Formats in One Pass

# Produce both Markdown and JSON in one run
edgeparse document.pdf -f markdown,json -o output/
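
When an agent assembles these commands programmatically, building an argument list avoids shell-quoting problems with file names. The sketch below uses only flags that appear in this document (`-f`, `-o`, `--pages`); the function name `build_command` is hypothetical, and whether every flag combination is accepted by the CLI is an assumption to verify against `edgeparse --help`.

```python
import shlex


def build_command(pdfs, formats=("markdown",), output_dir=None, pages=None):
    """Assemble an edgeparse argv list from flags shown in this document."""
    if not pdfs:
        raise ValueError("at least one PDF path is required")
    # Multiple formats are joined with commas, as in "-f markdown,json".
    cmd = ["edgeparse", *pdfs, "-f", ",".join(formats)]
    if pages:
        cmd += ["--pages", pages]
    if output_dir:
        cmd += ["-o", output_dir]
    return cmd


cmd = build_command(
    ["report.pdf", "paper.pdf"],
    formats=("markdown", "json"),
    output_dir="output/",
)
print(shlex.join(cmd))  # edgeparse report.pdf paper.pdf -f markdown,json -o output/
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` keeps paths with spaces intact without any manual escaping.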

Step 2 — Use the Python SDK (when scripting)

import edgeparse

# Convert to Markdown
md = edgeparse.convert("document.pdf", format="markdown")

# Convert to JSON with bounding boxes
import json
doc = json.loads(edgeparse.convert("document.pdf", format="json"))

# Extract headings
headings = [e for e in doc["kids"] if e["type"] == "heading"]
for h in headings:
    print(f'H{h["heading level"]} {h["content"]}')

# Extract tables
for e in doc["kids"]:
    if e["type"] == "table":
        for row in e["rows"]:
            print(" | ".join(row))

# Write output file
out_path = edgeparse.convert_file(
    "document.pdf",
    output_dir="output/",
    format="markdown",
    pages="1-5",
)

# Batch processing with threading
from concurrent.futures import ThreadPoolExecutor
import glob

def process(path):
    return (path, edgeparse.convert(path, format="markdown"))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, glob.glob("*.pdf")))
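
One caveat with `pool.map` is that the first file that raises aborts the whole `list(...)` call. A hedged variant that records failures per file is sketched below. The converter is passed in as a function so the pattern can be tried without EdgeParse installed; `convert_all` and `fake_convert` are hypothetical names, and in practice the converter would be something like `lambda p: edgeparse.convert(p, format="markdown")`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_all(paths, convert_fn, max_workers=4):
    """Run convert_fn over paths in threads; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_fn, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = str(exc)
    return results, errors


# Stand-in converter used only for this demonstration.
def fake_convert(path):
    if path.endswith("bad.pdf"):
        raise RuntimeError("cannot parse")
    return f"# {path}"


results, errors = convert_all(["a.pdf", "bad.pdf"], fake_convert)
print(results)  # {'a.pdf': '# a.pdf'}
print(errors)   # {'bad.pdf': 'cannot parse'}
```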

Step 3 — Use the Node.js SDK (when scripting)

const { convert, version } = require('edgeparse');

// Convert to Markdown
const md = convert('document.pdf', { format: 'markdown' });

// Convert to JSON and parse
const doc = JSON.parse(convert('document.pdf', { format: 'json' }));

// Extract headings
const headings = doc.kids.filter(e => e.type === 'heading');
headings.forEach(h => console.log(`H${h['heading level']} ${h.content}`));

// Batch: convert() returns its result synchronously in the calls above, so a plain map is enough
const paths = ['a.pdf', 'b.pdf', 'c.pdf'];
const results = paths.map(p => convert(p, { format: 'markdown' }));

Step 4 — Key Options Reference

Output Formats

Format	Description
`markdown`	Standard Markdown with GFM tables (default)
`markdown-with-html`	Markdown with HTML table fallback for complex tables
`markdown-with-images`	Markdown with embedded or linked image references
`json`	Structured JSON with bounding boxes and element types
`html`	Full HTML5 document with semantic elements
`text`	Plain UTF-8 text, reading order preserved

Layout Options

Option	Default	Description
`--reading-order`	`xycut`	`xycut` (XY-Cut++ algorithm) or `off`
`--table-method`	`default`	`default` (ruling lines) or `cluster` (borderless)
`--keep-line-breaks`	false	Preserve original line breaks in paragraphs
`--use-struct-tree`	false	Use tagged PDF structure tree when available
`--include-header-footer`	false	Include page headers and footers

Image Options

Option	Default	Description
`--image-output`	`off`	`off`, `embedded` (base64), or `external` (files)
`--image-format`	`png`	`png` or `jpeg`
`--image-dir`	—	Output directory for extracted images

Content Safety

Option	Description
`--content-safety-off all`	Disable all AI safety filters
`--content-safety-off hidden-text`	Allow hidden text extraction
`--content-safety-off tiny`	Allow tiny-text extraction

Step 5 — Understanding JSON Output

The JSON format gives agents the richest representation for reasoning over document structure.

Document Envelope

{
  "file name": "document.pdf",
  "number of pages": 15,
  "author": "...",
  "title": "...",
  "creation date": "D:20250101T000000Z",
  "modification date": "D:20250101T000000Z",
  "kids": [ /* elements */ ]
}

Element Types

`type`	Key fields
`paragraph`	`content`, `font`, `font size`, `text color`
`heading`	`content`, `level` (Title/H1-H6), `heading level` (int)
`table`	`rows` (2D string array, first row is header)
`list`	`items` (string array)
`image`	`image path` (when `--image-output external`)
`caption`	`content`
`formula`	`content`

Bounding Box

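Every element carries a "bounding box" field of the form [x0, y0, x1, y1], where (x0, y0) is the lower-left corner and (x1, y1) is the upper-right corner: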

Origin: bottom-left of page (PDF coordinate system)
Y axis: increases upward
Units: PDF points (72 pt = 1 inch)
A4 page: 595 × 842 pt · US Letter: 612 × 792 pt

# Convert to top-left origin (screen coordinates)
x0, y0_pdf, x1, y1_pdf = element["bounding box"]
page_height = 842  # A4, or use doc["page height"] if available
y_top    = page_height - y1_pdf   # distance from top of page
y_bottom = page_height - y0_pdf
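
The same conversion can be wrapped in small reusable functions, for example when drawing boxes over a rendered page image. This is a sketch under the coordinate conventions stated above; the function names and the DPI default are hypothetical choices, and the page height must be supplied by the caller.

```python
def to_top_left(bbox, page_height):
    """Convert a PDF bbox [x0, y0, x1, y1] to (left, top, width, height), top-left origin."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1 - x0, y1 - y0)


def points_to_pixels(value_pt, dpi=150):
    """Scale PDF points (72 per inch) to pixels at the given rendering DPI."""
    return value_pt * dpi / 72


box = to_top_left([72, 700, 300, 770], page_height=842)
print(box)                            # (72, 72, 228, 70)
print(points_to_pixels(72, dpi=144))  # 144.0
```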

Recipes for Common Agent Tasks

Recipe A — Chunk PDF for RAG

import edgeparse, json

doc = json.loads(edgeparse.convert("paper.pdf", format="json"))
chunks = [
    {"text": e["content"], "page": e["page number"], "type": e["type"]}
    for e in doc["kids"]
    if e["type"] in ("paragraph", "heading", "caption")
]
# Feed chunks to your embedding model
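
A possible refinement, sketched here under assumptions, is to group paragraphs under the most recent heading so each chunk carries its section title. The function name `chunk_by_section`, the `max_chars` limit, and the sample document are hypothetical; only the `kids`, `type`, `content`, and `page number` fields come from the examples above.

```python
def chunk_by_section(doc, max_chars=1000):
    """Group paragraph and caption text under the most recent heading."""
    chunks = []
    section = None
    buffer, pages = [], []

    def flush():
        # Emit the buffered text as one chunk, then reset the buffers.
        if buffer:
            chunks.append({
                "section": section,
                "text": "\n\n".join(buffer),
                "pages": sorted(set(pages)),
            })
            buffer.clear()
            pages.clear()

    for e in doc.get("kids", []):
        kind = e.get("type")
        if kind == "heading":
            flush()  # close the previous section before switching
            section = e.get("content", "")
        elif kind in ("paragraph", "caption"):
            text = e.get("content", "")
            if buffer and sum(len(t) for t in buffer) + len(text) > max_chars:
                flush()  # keep chunks under the size limit
            buffer.append(text)
            if "page number" in e:
                pages.append(e["page number"])
    flush()
    return chunks


# Hand-written sample shaped like the documented JSON (not real tool output).
sample_doc = {"kids": [
    {"type": "heading", "content": "Intro", "page number": 1},
    {"type": "paragraph", "content": "First.", "page number": 1},
    {"type": "paragraph", "content": "Second.", "page number": 2},
    {"type": "heading", "content": "Methods", "page number": 2},
    {"type": "paragraph", "content": "Third.", "page number": 2},
]}
section_chunks = chunk_by_section(sample_doc)
for c in section_chunks:
    print(c["section"], c["pages"], repr(c["text"]))
```

Keeping the section title with each chunk tends to help retrieval, since a paragraph on its own often lacks the context its heading provides.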

Recipe B — Extract All Tables to CSV

import edgeparse, json, csv

doc = json.loads(edgeparse.convert("report.pdf", format="json"))
for i, e in enumerate(doc["kids"]):
    if e["type"] == "table":
        with open(f"table_{i}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerows(e["rows"])
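
Because the first row of `rows` is the header (see the Element Types table), a table can also be turned into a list of dictionaries for easier reasoning. The helper below is a minimal sketch with a hypothetical name; it pads short rows with empty strings, and it assumes header names are unique (duplicate names would overwrite each other).

```python
def table_to_records(rows):
    """Turn a 2D string array (first row = header) into a list of dicts."""
    if not rows:
        return []
    header, *body = rows
    records = []
    for row in body:
        # Pad short rows so every header key is present; extra cells are dropped by zip.
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


rows = [["Name", "Qty"], ["Bolt", "4"], ["Nut"]]
print(table_to_records(rows))
# [{'Name': 'Bolt', 'Qty': '4'}, {'Name': 'Nut', 'Qty': ''}]
```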

Recipe C — Convert Folder of PDFs to Markdown

# CLI: fastest approach
edgeparse *.pdf -f markdown -o markdown_output/ --quiet

# Python: with per-file error handling
import edgeparse, glob, os

os.makedirs("out", exist_ok=True)
for path in glob.glob("*.pdf"):
    try:
        md = edgeparse.convert(path, format="markdown")
        with open(f"out/{os.path.splitext(os.path.basename(path))[0]}.md", "w", encoding="utf-8") as f:
            f.write(md)
        print(f"✓ {path}")
    except RuntimeError as e:
        print(f"✗ {path}: {e}")

Recipe D — Heading Outline of a Document

import edgeparse, json

doc = json.loads(edgeparse.convert("paper.pdf", format="json"))
for e in doc["kids"]:
    if e["type"] == "heading":
        indent = "  " * (e.get("heading level", 1) - 1)
        print(f'{indent}{"#" * e.get("heading level", 1)} {e["content"]}')

Recipe E — First Page Only (quick preview)

edgeparse document.pdf --pages "1" -f markdown

Supported Input

EdgeParse supports born-digital PDFs only (embedded text). It does not include built-in OCR. For scanned PDFs or image-only PDFs, add --hybrid docling-fast to delegate to a Docling backend:

# Requires a Docling server running separately
edgeparse scanned.pdf -f markdown \
  --hybrid docling-fast \
  --hybrid-url http://localhost:8080 \
  --hybrid-fallback

Troubleshooting

Problem	Solution
`command not found: edgeparse`	Install via Homebrew: `brew tap raphaelmansuy/tap && brew install edgeparse`
Python: `ModuleNotFoundError: edgeparse`	`pip install edgeparse`
Node.js: addon not found	`npm install edgeparse`
Empty output	PDF may be scanned/image-only — use `--hybrid docling-fast`
Garbled text	CJK font — ensure latest `edgeparse` version
Password error	Add `--password "yourpassword"`
Tables not detected	Try `--table-method cluster` for borderless tables
