ワンクリックでManusで任意のスキルを実行

$pwd:

edgeparse

Name: Edgeparse
Author: raphaelmansuy

// Extract structured content from any PDF for AI agents, RAG pipelines, and Copilot Skills. Use this skill whenever the user wants to read, analyze, or reason about a PDF document; needs to feed document content to an LLM; mentions PDF extraction, parsing, or conversion; wants tables, headings, or bounding boxes from a PDF; is building a RAG pipeline; or asks an agent to process a document. Install with: pip install edgeparse

Manusで実行

$ git log --oneline --stat

stars:115

forks:15

updated:2026年3月23日 13:22

ファイルエクスプローラー

3 ファイル

SKILL.md

readonly

package.json

"author": "raphaelmansuy"

"repository": "raphaelmansuy/edgeparse"

GitHub リポジトリを開く Creator のリポジトリを見る

$ install --global

$ download --local

Manusで実行

$ useful --forSOC

その他コンピュータ職コンピュータ・数学職15-1299L4

name

edgeparse

description

Extract structured content from any PDF for AI agents, RAG pipelines, and Copilot Skills. Use this skill whenever the user wants to read, analyze, or reason about a PDF document; needs to feed document content to an LLM; mentions PDF extraction, parsing, or conversion; wants tables, headings, or bounding boxes from a PDF; is building a RAG pipeline; or asks an agent to process a document. Install with: pip install edgeparse

EdgeParse Skill

Enables AI agents to extract clean, structured content from any PDF — headings, tables, paragraphs, lists, bounding boxes — deterministically, without ML dependencies or GPU requirements.

Install: pip install edgeparse · Node.js: npm install edgeparse
Speed: ~0.023 s/doc (Apple M4 Max, 200-doc benchmark)

When to reach for this skill

Activate when the workflow involves:

Reading or analyzing a PDF document on behalf of a user
Building a RAG pipeline that ingests PDFs
Feeding PDF content to an LLM for summarization, Q&A, or synthesis
Extracting tables from financial reports, research papers, or invoices
Processing a batch of documents for indexing or search
An agent tool that must "open" a PDF and return its contents

Quick start

import edgeparse

# Convert any PDF to Markdown — best for LLM context windows
text = edgeparse.convert("report.pdf", format="markdown")

# Convert to JSON with bounding boxes and full structure
import json
doc = json.loads(edgeparse.convert("report.pdf", format="json"))

# Plain text (fast, minimal)
plain = edgeparse.convert("report.pdf", format="text")

The format parameter controls output:

Value	Best for
`"markdown"`	LLM context — headings, tables, lists in Markdown
`"json"`	Bounding boxes, citations, structured element metadata
`"html"`	Web rendering, semantic HTML5
`"text"`	Simple full-text search, minimal output

Core API

`edgeparse.convert()`

result: str = edgeparse.convert(
    input_path,             # str or Path — required
    format="markdown",      # output format (see table above)
    pages=None,             # e.g. "1-5" or "1,3,7-10" — specific pages only
    password=None,          # for password-protected PDFs
    reading_order="xycut",  # "xycut" (spatial sort, default) or "off"
    table_method="default", # "default" (ruling-line) or "cluster" (borderless)
    image_output="off",     # "off", "embedded" (base64), "external" (files)
)

Returns the extracted content as a string. Raises FileNotFoundError for missing files and ValueError for corrupt PDFs or bad options.

`edgeparse.convert_file()`

out_path: str = edgeparse.convert_file(
    input_path,
    output_dir="output",    # write output file to this directory
    format="markdown",
    pages=None,
    password=None,
)

Writes the output file and returns its path.

Common patterns

Feed a PDF to an LLM

import edgeparse
import anthropic

doc = edgeparse.convert("report.pdf", format="markdown")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Analyze this document and summarize the key findings:\n\n{doc}"
    }]
)
print(response.content[0].text)

RAG pipeline — chunk with metadata

import edgeparse, json

raw = edgeparse.convert("paper.pdf", format="json")
doc = json.loads(raw)

chunks = []
for el in doc["elements"]:
    if el["type"] in ("paragraph", "heading", "table"):
        chunks.append({
            "text": el["text"],
            "metadata": {
                "page":    el["page_number"],
                "type":    el["type"],
                "bbox":    el["bounding_box"],   # for citation highlights
                "order":   el["reading_order"],
            }
        })

# Now embed chunks["text"] and store chunks["metadata"] in your vector store

Batch processing

import edgeparse
from pathlib import Path

results = {}
for pdf in Path("documents/").glob("*.pdf"):
    try:
        results[pdf.name] = edgeparse.convert(str(pdf), format="markdown")
    except Exception as e:
        results[pdf.name] = f"ERROR: {e}"

Extract specific pages only

# Pages 1–5
text = edgeparse.convert("report.pdf", format="markdown", pages="1-5")

# Non-contiguous pages
text = edgeparse.convert("report.pdf", format="markdown", pages="1,3,7-10")

Borderless table extraction

Many financial reports and invoices use tables without ruling lines. Use table_method="cluster" to handle them:

text = edgeparse.convert(
    "earnings.pdf",
    format="markdown",
    table_method="cluster"   # spatial clustering for borderless tables
)

Password-protected PDF

text = edgeparse.convert("secure.pdf", format="markdown", password="mypassword")

Node.js usage

import { convert } from 'edgeparse';

const markdown = convert('report.pdf', { format: 'markdown' });
const json     = convert('report.pdf', { format: 'json' });

// With options
const result = convert('report.pdf', {
    format:       'markdown',
    pages:        '1-5',
    readingOrder: 'xycut',
    tableMethod:  'cluster',
});

JSON output schema

When format="json", the output is a JSON string with shape:

{
  "page_count": 10,
  "title": "Document Title",
  "elements": [
    {
      "type": "heading",
      "level": 1,
      "text": "Introduction",
      "page_number": 1,
      "reading_order": 0,
      "bounding_box": { "x0": 72, "y0": 144, "x1": 540, "y1": 180 }
    },
    {
      "type": "table",
      "text": "| Col A | Col B |\n|-------|-------|\n| val1  | val2  |",
      "page_number": 2,
      "bounding_box": { "x0": 72, "y0": 200, "x1": 540, "y1": 350 }
    },
    {
      "type": "paragraph",
      "text": "This is body text...",
      "page_number": 1,
      "reading_order": 2,
      "bounding_box": { "x0": 72, "y0": 190, "x1": 540, "y1": 220 }
    }
  ]
}

Element type values: heading, paragraph, table, list, list_item, figure, caption, header, footer.

Error handling

import edgeparse

try:
    text = edgeparse.convert("report.pdf", format="markdown")
except FileNotFoundError:
    # PDF file not found — check the path
    pass
except ValueError as e:
    # Invalid format, corrupt PDF, wrong password, or bad page range
    print(f"Extraction failed: {e}")

For more detail

Read these reference files when the SKILL.md body isn't enough:

references/api.md — complete Python + Node.js API with all parameters and types
references/patterns.md — LangChain, LlamaIndex, MCP tool, CrewAI, and async batch patterns