
doc-parse

Convert PDFs, images, and scanned documents into clean, structured markdown. Extracts text, tables, headings, metadata, and document hierarchy. Handles large documents by chunking intelligently. Use when you need to turn any document into a readable, searchable, version-controllable markdown file.


Install command

npx skills add https://github.com/doculent/community --skill doc-parse

Copy this command and paste it into Claude Code to install the skill.

Source

doculent/community

Stars: 1

Forks: 0

Updated: 2026-03-31 19:26

SKILL.md


name	doc-parse
version	1.0.0
description	Convert PDFs, images, and scanned documents into clean, structured markdown. Extracts text, tables, headings, metadata, and document hierarchy. Handles large documents by chunking intelligently. Use when you need to turn any document into a readable, searchable, version-controllable markdown file.
allowed-tools	["Read","Write","Edit","Bash","Glob","Grep"]
metadata	{"tags":"pdf, ocr, markdown, document-parsing, table-extraction","author":"Doculent","license":"MIT"}

doc-parse: Structural Document Parser

You are a document parsing specialist. Your job is to convert documents (PDFs, images, scanned pages) into clean, well-structured markdown that preserves the original document's hierarchy, tables, and metadata.

Input

The user will provide one or more file paths. Supported formats:

PDF files (.pdf)
Images (.png, .jpg, .jpeg, .tiff, .bmp, .webp)
Directories (process all supported files within)

Parse the user's message to identify:

Target files — file paths, glob patterns, or directories
Output location — if specified, write there; otherwise write alongside the source file as <filename>.md
Options — any specific requests (e.g., "just the tables", "skip metadata", "include page numbers")

Process

Step 1: Validate Environment

Check that required tools are installed:

for tool in pdftotext pdftoppm pdfinfo tesseract; do command -v "$tool" >/dev/null || echo "missing: $tool"; done

If pdftotext, pdftoppm, or pdfinfo is missing, tell the user: brew install poppler (macOS) or apt install poppler-utils (Linux). If tesseract is missing and the input is an image or scanned PDF, tell the user: brew install tesseract (macOS) or apt install tesseract-ocr (Linux).

Step 2: Identify Document Type

For each input file, determine the processing strategy:

Text-based PDF — Use pdftotext for fast, accurate extraction
Scanned PDF / Image — Use pdftoppm to convert to images, then tesseract for OCR
Mixed PDF — Some pages text, some scanned — handle per-page

Test if a PDF has extractable text:

pdftotext -l 1 "input.pdf" - | head -20

If output is empty or garbled, treat as scanned. This test reads only page 1, so repeat it per page (-f N -l N) when the PDF may be mixed.
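The emptiness check above could be automated roughly as follows. This is a minimal sketch: the function name `classify_extraction` and the 50-character threshold are assumptions, not part of the skill, and it only catches empty or near-empty output, not garbled text.

```shell
# Classify extracted text read from stdin as "text" or "scanned".
# $1 = minimum number of letters/digits to count as real text (assumed default: 50).
classify_extraction() {
  min_chars=${1:-50}
  # Count letters and digits only, so whitespace and stray symbols do not count.
  # The arithmetic expansion strips the padding some wc implementations add.
  count=$(($(tr -cd '[:alnum:]' | wc -c)))
  if [ "$count" -ge "$min_chars" ]; then
    echo text
  else
    echo scanned
  fi
}

# Per-page usage for a possibly mixed PDF (requires poppler; not run here):
# pdftotext -f "$p" -l "$p" "input.pdf" - | classify_extraction
```

Counting alphanumeric characters rather than bytes keeps a page of form feeds and whitespace from being mistaken for text.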

Step 3: Extract Raw Text

For text-based PDFs:

pdftotext -layout "input.pdf" "output.txt"

For scanned PDFs:

tmpdir=$(mktemp -d)   # private scratch directory, avoids clashes between runs
pdftoppm -png -r 300 "input.pdf" "$tmpdir/page"
for img in "$tmpdir"/page-*.png; do
  tesseract "$img" "${img%.png}" --psm 6   # writes a .txt file next to each image
done
cat "$tmpdir"/page-*.txt > output.txt   # concatenate the per-page OCR text
rm -rf "$tmpdir"

For images:

tesseract "input.png" output --psm 6

For large PDFs (50+ pages): Process in batches of 20 pages to manage memory:

# Get page count
pages=$(pdfinfo "input.pdf" | awk '/^Pages:/ {print $2}')
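The batching itself is not spelled out above. One way to sketch it, assuming a hypothetical helper named `batch_ranges` that turns the page count into first/last page pairs for pdftotext's -f and -l options:

```shell
# Print "first last" page ranges covering 1..total in batches of `size` pages.
# $1 = total page count, $2 = batch size (default 20, as in the text above).
batch_ranges() {
  total=$1
  size=${2:-20}
  first=1
  while [ "$first" -le "$total" ]; do
    last=$((first + size - 1))
    # Clamp the final batch to the last page of the document.
    if [ "$last" -gt "$total" ]; then
      last=$total
    fi
    echo "$first $last"
    first=$((last + 1))
  done
}

# Usage with the page count from pdfinfo (requires poppler; not run here):
# batch_ranges "$pages" 20 | while read -r first last; do
#   pdftotext -layout -f "$first" -l "$last" "input.pdf" - >> output.txt
# done
```

Keeping the range arithmetic separate from the extraction call makes the clamping of the last batch easy to check on its own.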

Step 4: Extract Metadata

For PDFs, extract document metadata:

pdfinfo "input.pdf"

Capture: title, author, creation date, page count, file size.

Step 5: Structure the Output

Read the extracted raw text and transform it into clean markdown. Apply these rules:

Document Hierarchy

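Detect section headings from cues that survive text extraction: ALL CAPS lines, numbering patterns (1., 1.1, 1.1.1), and short standalone lines set apart by blank lines. Font size and bold/underline markers are not present in plain pdftotext output, so use them only when the extraction method preserves them.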
Map to markdown heading levels: # for title, ## for major sections, ### for subsections, etc.
Generate a table of contents if the document has 5+ sections
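As an illustration of the numbering-pattern rule only, here is a rough sketch that maps numbered lines to heading levels. The function name `number_headings_to_markdown` is hypothetical, and the heuristic is deliberately naive: a numbered list item such as "1. First item" matches too, so a real pass would also need the surrounding context.

```shell
# Convert numbered headings such as "1.2 Scope" into markdown headings.
# Depth 1 ("1.") maps to ##, depth 2 ("1.1") to ###, depth 3 ("1.1.1") to ####.
number_headings_to_markdown() {
  awk '
    {
      line = $0
      sub(/^[ \t]+/, "", line)            # ignore layout indentation
    }
    line ~ /^[0-9]+\.([0-9]+\.?)*[ \t]+[A-Z]/ {
      num = line
      sub(/[ \t].*$/, "", num)            # keep only the leading number
      sub(/\.$/, "", num)                 # "1." -> "1"
      depth = split(num, parts, ".")      # "1.1.1" -> 3
      hashes = "#"
      for (i = 0; i < depth; i++) hashes = hashes "#"
      print hashes " " line
      next
    }
    { print }'
}
```

The single # level is left free for the document title, matching the mapping described above.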

Tables

Detect tabular data by looking for aligned columns, repeated delimiters, or grid patterns
Convert to proper markdown tables with headers and alignment
If a table is too wide for markdown, output as a fenced CSV block
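For the aligned-columns case, a minimal sketch might look like the following. It assumes columns are separated by two or more spaces and that the first line is the header row; the function name `columns_to_markdown_table` is hypothetical. Empty cells and cells containing a pipe character are not handled.

```shell
# Convert lines whose columns are separated by 2+ spaces into a markdown table.
# The first input line is treated as the header row.
columns_to_markdown_table() {
  awk '
    {
      gsub(/^ +| +$/, "")                 # trim leading and trailing spaces
      n = split($0, cells, /  +/)         # split on runs of two or more spaces
      row = "|"
      for (i = 1; i <= n; i++) row = row " " cells[i] " |"
      print row
      if (NR == 1) {                      # separator row right after the header
        sep = "|"
        for (i = 1; i <= n; i++) sep = sep " --- |"
        print sep
      }
    }'
}
```

Splitting on runs of two or more spaces keeps single-space values such as "Gear box" in one cell.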

Lists

Detect bullet points, numbered lists, lettered lists
Convert to proper markdown list syntax
Preserve nesting levels
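A small sketch of the bullet and numbering conversion, under the assumption that the extracted text uses the glyphs •, ◦, or ▪ for bullets and "1)" style numbering. The function name `normalize_lists` is hypothetical. Lettered lists are left untouched here, and nesting is preserved only by keeping the original indentation.

```shell
# Rewrite common bullet glyphs and "N)" numbering as markdown list markers.
# Leading whitespace is kept so nested items stay indented.
normalize_lists() {
  # First expression: bullet glyph -> "- ". Second: "2) text" -> "2. text".
  # Alternation is used instead of a bracket expression so the multi-byte
  # glyphs match correctly regardless of locale.
  sed -E \
    -e 's/^([[:space:]]*)(•|◦|▪)[[:space:]]+/\1- /' \
    -e 's/^([[:space:]]*)([0-9]+)\)[[:space:]]+/\1\2. /'
}
```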

Metadata Block

Add a YAML frontmatter block at the top:

---
source: original-filename.pdf
pages: 47
author: John Smith
created: 2026-01-15
parsed: 2026-03-31
---
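The frontmatter above could be generated from pdfinfo output along these lines. This is a sketch under assumptions: the function name `pdfinfo_to_frontmatter` is hypothetical, dates are expected in ISO form (poppler's pdfinfo has an -isodates option for this), and values are written unquoted as in the example, so an author name containing a colon would need quoting.

```shell
# Turn pdfinfo output (stdin) into a YAML frontmatter block.
# $1 = source filename, $2 = parse date (YYYY-MM-DD).
pdfinfo_to_frontmatter() {
  SRC=$1 PARSED=$2 awk '
    # Return the text after the first colon, without leading whitespace.
    function value(line) {
      sub(/^[^:]*:[ \t]*/, "", line)
      return line
    }
    /^Author:/       { author = value($0) }
    /^Pages:/        { pages = value($0) }
    /^CreationDate:/ { created = substr(value($0), 1, 10) }   # keep YYYY-MM-DD
    END {
      print "---"
      print "source: " ENVIRON["SRC"]
      if (pages != "")   print "pages: " pages
      if (author != "")  print "author: " author
      if (created != "") print "created: " created
      print "parsed: " ENVIRON["PARSED"]
      print "---"
    }'
}

# Usage (requires poppler; not run here):
# pdfinfo -isodates "input.pdf" | pdfinfo_to_frontmatter "input.pdf" "$(date +%Y-%m-%d)"
```

Fields that pdfinfo does not report are simply omitted rather than written as empty keys.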

Page Breaks

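Insert horizontal rules (---) between pages with a page number comment. Leave a blank line before each rule, because a line of dashes directly under text renders as a heading in markdown: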

<!-- Page 12 -->
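One way to place these markers, sketched under the assumption that the raw text comes from pdftotext, which ends each page with a form feed character unless -nopgbrk is given. The function name `add_page_breaks` is hypothetical.

```shell
# Replace form feed page separators (stdin) with a rule and a page comment.
# Page 1 gets no marker; each later page is preceded by "---" and its number.
add_page_breaks() {
  awk '
    BEGIN { RS = "\f" }                   # one record per page
    NR > 1 { printf "\n---\n\n<!-- Page %d -->\n\n", NR }
    {
      sub(/\n+$/, "")                     # drop trailing blank lines of the page
      print
    }'
}

# Usage (requires poppler; not run here):
# pdftotext -layout "input.pdf" - | add_page_breaks > pages.md
```

The blank line printed before each rule keeps markdown from reading the last text line of the previous page as a heading.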

Step 6: Write Output

Write the structured markdown to the output path. Report a summary:

✓ Parsed: vendor-agreement.pdf
  Pages: 47
  Sections: 12
  Tables: 8
  Output: vendor-agreement.md
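The summary could be produced from the finished markdown file as sketched below. The function name `report_summary` is hypothetical, and the counting rules are assumptions: a section is any line starting with "## ", and a table is counted once per separator row.

```shell
# Print a summary for one parsed file.
# $1 = source file, $2 = page count, $3 = markdown output path.
report_summary() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the "0".
  sections=$(grep -c '^## ' "$3" || true)
  # Separator rows consist only of pipes, dashes, colons, and spaces.
  tables=$(grep -cE '^\|[ :|-]*-[ :|-]*\|$' "$3" || true)
  printf '✓ Parsed: %s\n  Pages: %s\n  Sections: %s\n  Tables: %s\n  Output: %s\n' \
    "$1" "$2" "$sections" "$tables" "$3"
}
```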

Quality Standards

Preserve meaning — never omit content. If text is illegible, mark it: [illegible]
Clean whitespace — no excessive blank lines, no trailing spaces
Valid markdown — output must render correctly in any markdown viewer
Consistent heading levels — no jumps from ## to ####
Table accuracy — verify column alignment; prefer simple tables over complex merged-cell attempts
UTF-8 — handle special characters, accented text, and symbols correctly

Batch Mode

When given multiple files or a directory:

Process each file independently
Report progress: Processing 3/15: contract-2024.pdf...
Write a summary at the end with all files and their stats
If any files fail, report errors but continue processing the rest
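The batch rules above can be sketched as a loop that keeps going after a failure. Both `process_batch` and the per-file parser command passed as its first argument are hypothetical names; the parser is assumed to exit non-zero when a file cannot be processed.

```shell
# Run a per-file parser command on each file, continuing past failures.
# $1 = parser command name, remaining arguments = files to process.
process_batch() {
  parser=$1
  shift
  total=$#
  i=0
  failed=0
  for f in "$@"; do
    i=$((i + 1))
    echo "Processing $i/$total: $f..."
    if ! "$parser" "$f"; then
      echo "Error: failed to parse $f" >&2   # report, then move on
      failed=$((failed + 1))
    fi
  done
  echo "Done: $((total - failed)) succeeded, $failed failed"
}
```

Errors go to stderr so that the progress lines and the final summary stay clean on stdout.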

Error Handling

If a file doesn't exist, report it and continue
If a PDF is password-protected, report it and skip
If OCR produces very low-confidence output, warn the user
If the output file already exists, ask before overwriting (unless --force is specified)
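The text does not say how low OCR confidence is measured. One possible approach, offered as an assumption rather than the skill's method, is to average the per-word confidence column of tesseract's TSV output. The function name `mean_ocr_confidence` is hypothetical, and any warning threshold would have to be chosen by the user.

```shell
# Average word confidence (0-100) from tesseract TSV output on stdin.
# Column 11 is the confidence and column 12 the recognized text; rows that
# describe pages, blocks, or lines carry a confidence of -1 and are skipped.
mean_ocr_confidence() {
  awk -F '\t' '
    NR > 1 && $11 >= 0 && $12 != "" { sum += $11; n++ }
    END {
      if (n > 0) printf "%.1f\n", sum / n
      else print "0.0"
    }'
}

# Usage (requires tesseract; not run here):
# tesseract "$img" stdout --psm 6 tsv | mean_ocr_confidence
```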

