pdf2markdown

Updated 26 May 2026 at 19:14

Convert academic PDF papers (especially Chinese academic papers) to well-structured Markdown. Use when the user needs to extract the full content of a PDF academic paper (theses, dissertations, journal articles) and convert it to Markdown. Handles text extraction, OCR fallback for font-encoded/custom-encoding PDFs, layout-aware structuring (title/sections/figures/tables/formulas/references), table extraction, math formula preservation, figure caption formatting, and reference list formatting. Not for general PDF manipulation, form filling, or PDF creation tasks.

Source

lazyst

lazyst/my-skills


PDF → Markdown Conversion

Dependencies

pip install rapidocr onnxruntime-directml rapid-layout pymupdf pdfplumber

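GPU acceleration: DirectML (the native Windows GPU API) is detected automatically. If onnxruntime.get_available_providers() includes DmlExecutionProvider, the GPU is enabled automatically; otherwise processing falls back to the CPU.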
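The provider check described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: `pick_providers` is a hypothetical helper, and the fallback to a CPU-only list when `onnxruntime` is missing is an assumption.

```python
def pick_providers(available):
    """Prefer DirectML when present, always keeping the CPU as a fallback."""
    if "DmlExecutionProvider" in available:
        return ["DmlExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


try:
    import onnxruntime

    available = onnxruntime.get_available_providers()
except ImportError:
    # Assumption: without onnxruntime, treat the environment as CPU-only.
    available = ["CPUExecutionProvider"]

providers = pick_providers(available)
print("Execution providers:", providers)
```

Listing the CPU provider last keeps inference working on machines where DirectML is present but fails to initialize.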

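Package	Purpose	Required
`rapidocr` + `onnxruntime-directml`	Primary OCR engine	Yes
`rapid-layout`	Layout analysis (classifying titles, body text, figures, and tables)	Optional
`pymupdf`	Rendering PDF pages to images	Yes
`pdfplumber`	PDF text and table extraction (for PDFs whose text is not garbled)	Yes
`tesseract-ocr`	Fallback OCR (when RapidOCR is unavailable)	Optional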

Workflow

1. Detect extraction method

Try extracting text with pdfplumber first:

import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    texts = []
    for page in pdf.pages:
        t = page.extract_text()
        texts.append(t or "")
    raw = "\n".join(texts)

Check if output is garbled:

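import re

replacement_count = raw.count("\ufffd")  # U+FFFD replacement characters
chinese_chars = len(re.findall(r"[\u4e00-\u9fff]", raw))  # CJK Unified Ideographs
total_chars = len(raw.strip())

if total_chars == 0:
    garbled = True  # no text layer at all (e.g. scanned pages)
elif replacement_count > total_chars * 0.1:
    garbled = True  # more than 10% of the characters could not be decoded
elif chinese_chars < total_chars * 0.02 and total_chars > 200:
    # Assumes a Chinese-language paper: an English PDF also trips this check.
    garbled = True
else:
    garbled = False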

garbled = True → use OCR path (step 2a)
garbled = False → use text extraction path (step 2b)
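The same heuristic can be wrapped in a function so it is easy to test and reuse. This is a sketch under stated assumptions: the name `is_garbled` and the `expect_chinese` switch are hypothetical additions that let English-language PDFs skip the Chinese-ratio check; the thresholds are the ones given above.

```python
import re


def is_garbled(raw, expect_chinese=True):
    """Return True when extracted text looks unusable and OCR should be used."""
    text = raw.strip()
    total = len(text)
    if total == 0:
        return True  # no text layer at all
    if text.count("\ufffd") > total * 0.1:
        return True  # too many replacement characters
    if expect_chinese and total > 200:
        chinese = len(re.findall(r"[\u4e00-\u9fff]", text))
        if chinese < total * 0.02:
            return True  # long text with almost no Chinese characters
    return False


print(is_garbled("深度学习" * 100))
print(is_garbled("a" * 300, expect_chinese=False))
```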

2a. OCR path (RapidOCR + rapid-layout)

import fitz  # pymupdf
from rapidocr import RapidOCR

try:
    from rapid_layout import RapidLayout
    HAS_LAYOUT = True
except ImportError:
    HAS_LAYOUT = False

ocr_engine = RapidOCR()
if HAS_LAYOUT:
    layout_engine = RapidLayout()

doc = fitz.open(pdf_path)
ocr_texts = []

for page_num in range(len(doc)):
    page = doc[page_num]
    pix = page.get_pixmap(dpi=150)
    img = pix.pil_tobytes("png")

    if HAS_LAYOUT:
        # Get layout regions
        layout_result = layout_engine(img)
        boxes = getattr(layout_result, "boxes", None)
        class_names = getattr(layout_result, "class_names", None)

        # Get full-page OCR
        ocr_result = ocr_engine(img)
        text_lines = []
        if ocr_result.boxes is not None:
            for box, txt, score in zip(ocr_result.boxes, ocr_result.txts, ocr_result.scores):
                text_lines.append({"text": txt, "confidence": score})

        # Assign text to layout regions (simplified)
        page_text = "\n".join([t["text"] for t in text_lines])
    else:
        ocr_result = ocr_engine(img)
        if ocr_result.boxes is not None:
            page_text = "\n".join(ocr_result.txts)
        else:
            page_text = ""

    ocr_texts.append(page_text)

Layout analysis is optional. When rapid-layout is installed, the OCR path automatically uses it to identify document structure (titles, sections, body text, figures, tables, formulas, references) and assign OCR text accordingly.
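The snippet in this step runs layout analysis but joins all OCR lines without using the regions. One possible way to assign text to regions is a center-point containment test, sketched below. The data shapes are assumptions for illustration (OCR boxes as four (x, y) points, layout boxes as (x0, y0, x1, y1) with a class label), and `assign_to_regions` is a hypothetical helper, not part of RapidOCR or rapid-layout.

```python
def box_center(quad):
    """Center of a quadrilateral given as four (x, y) points."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def assign_to_regions(ocr_lines, regions):
    """Label each OCR line with the first layout region containing its center.

    ocr_lines: list of (quad, text); regions: list of ((x0, y0, x1, y1), label).
    Lines that fall outside every region are labeled "unassigned".
    """
    assigned = []
    for quad, text in ocr_lines:
        cx, cy = box_center(quad)
        label = "unassigned"
        for (x0, y0, x1, y1), name in regions:
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                label = name
                break
        assigned.append((label, text))
    return assigned


# Synthetic example data, not real OCR output.
regions = [((0, 0, 600, 80), "title"), ((0, 100, 600, 800), "text")]
ocr_lines = [
    ([(50, 20), (550, 20), (550, 60), (50, 60)], "基于深度学习的图像识别研究"),
    ([(50, 120), (550, 120), (550, 150), (50, 150)], "本文提出了一种新的方法。"),
]
result = assign_to_regions(ocr_lines, regions)
print(result)
```

With labels attached, a line tagged "title" can be emitted as a heading and a line tagged "text" as body text.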

2b. Text extraction path

Use pdfplumber for clean PDFs. For tables:

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            pass  # Convert to Markdown table

3. Post-process extracted text

When using ocr_extract.py, all post-processing is applied automatically. When doing manual extraction (2b text path or manual OCR), apply these steps:

3a. Remove repeated headers

KNOWN_HEADERS = [
    "硕士学位论文", "博士学位论文", "本科毕业论文",
    "中国知网", "CNKI", "万方数据",
    "University", "Dissertation", "Thesis",
]
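The list above names the header strings but not how they are removed. A minimal sketch, assuming a running header is a line that contains one of the known strings and occurs at least twice in the document; `remove_repeated_headers` and `min_repeats` are hypothetical names.

```python
from collections import Counter

KNOWN_HEADERS = [
    "硕士学位论文", "博士学位论文", "本科毕业论文",
    "中国知网", "CNKI", "万方数据",
    "University", "Dissertation", "Thesis",
]


def remove_repeated_headers(lines, min_repeats=2):
    """Drop lines that contain a known header string and repeat in the text."""
    counts = Counter(line.strip() for line in lines)
    kept = []
    for line in lines:
        stripped = line.strip()
        is_known = any(header in stripped for header in KNOWN_HEADERS)
        if is_known and counts[stripped] >= min_repeats:
            continue  # running header repeated across pages
        kept.append(line)
    return kept


lines = ["某大学硕士学位论文", "第一章 绪论", "某大学硕士学位论文", "正文内容"]
print(remove_repeated_headers(lines))
```

Requiring a repeat keeps a single legitimate mention, such as a university name on the title page, from being deleted.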

3b. Detect and skip TOC pages

Heuristic: if a page has >=15 lines matching ^\d+\.\d or ^\d+\s*$, it's likely a TOC page. Remove it.

Also remove standalone roman numeral lines (^[ivxlcdm]+$, case-insensitive, length <=8).
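Both heuristics can be sketched with the regular expressions given above. The function names are hypothetical, and pages are assumed to arrive as a list of strings, one per page. Note that the Roman-numeral pattern also matches short words made only of those letters (for example "mix"), so it should only be applied to standalone lines.

```python
import re

TOC_LINE = re.compile(r"^\d+\.\d|^\d+\s*$")
ROMAN_LINE = re.compile(r"^[ivxlcdm]+$", re.IGNORECASE)


def is_toc_page(page_text, threshold=15):
    """A page counts as a TOC page when enough lines look like TOC entries."""
    hits = sum(1 for line in page_text.splitlines() if TOC_LINE.search(line.strip()))
    return hits >= threshold


def is_roman_page_number(line):
    """Standalone Roman numeral of at most 8 characters."""
    stripped = line.strip()
    return len(stripped) <= 8 and bool(ROMAN_LINE.match(stripped))


def drop_toc_and_roman(pages):
    """Skip TOC pages entirely and strip Roman-numeral lines from the rest."""
    cleaned = []
    for page_text in pages:
        if is_toc_page(page_text):
            continue
        lines = [l for l in page_text.splitlines() if not is_roman_page_number(l)]
        cleaned.append("\n".join(lines))
    return cleaned


toc_page = "\n".join(f"{i}.1 Section {i}" for i in range(1, 16))
print(drop_toc_and_roman([toc_page, "Intro\nxii\nSome text"]))
```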

3c. Format figure captions

**图X-X：标题内容**

Normalize caption prefixes such as 图 1., Figure 1., and 图1-1 to **图X：...** / **图X-X：...**.
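A possible regex-based sketch of this caption rule follows. The pattern and the name `format_figure_caption` are assumptions; the sketch also treats any line starting with 图 plus a number as a caption, so body sentences that begin that way would need an extra check.

```python
import re

# "图" or "Figure", an optional space, a number such as 1, 1-1 or 1.2,
# then optional separators before the caption text.
CAPTION = re.compile(r"^(?:图|Figure)\s*(\d+(?:[-.]\d+)?)[\s.:：]*(.*)$")


def format_figure_caption(line):
    """Rewrite a caption line as **图X：title**; return other lines unchanged."""
    match = CAPTION.match(line.strip())
    if not match:
        return line
    number, title = match.groups()
    return f"**图{number}：{title.strip()}**"


for sample in ["图 1. 系统架构", "Figure 1. Overview", "图1-1 实验流程", "正文段落"]:
    print(format_figure_caption(sample))
```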

3d. Format tables

Convert pdfplumber table output to Markdown:

| Col1 | Col2 |
|------|------|
| val1 | val2 |
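The conversion can be sketched as below, assuming the table is a list of rows with `None` for empty cells (the shape returned by `page.extract_tables()`) and that the first row is the header. `table_to_markdown` is a hypothetical helper name.

```python
def table_to_markdown(table):
    """Convert a list-of-rows table to a Markdown table string."""
    if not table:
        return ""

    def clean(cell):
        # Empty cells become "", line breaks become spaces, pipes are escaped.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in table)
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["------"] * width) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(table_to_markdown([["Col1", "Col2"], ["val1", "val2"]]))
```

Short rows are padded to the widest row so every line has the same number of columns.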

3e. Preserve math formulas

Keep inline $...$ and block $$...$$ LaTeX as-is.

3f. Format references

At the end of the document, convert reference entries to a numbered list:

1. Author, "Title", Journal, vol. X, no. Y, pp. Z, year.
2. ...
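A minimal sketch of this step, assuming each extracted entry starts with a bracketed number such as `[1]` and that any following unmarked line continues the previous entry. Both assumptions, and the name `format_references`, are illustrative only.

```python
import re

REF_MARKER = re.compile(r"^\s*\[(\d+)\]\s*")


def format_references(lines):
    """Turn "[n] entry" lines into "n. entry", merging wrapped lines."""
    formatted = []
    for line in lines:
        match = REF_MARKER.match(line)
        if match:
            number = int(match.group(1))
            formatted.append(f"{number}. {line[match.end():].strip()}")
        elif formatted and line.strip():
            # Continuation of the previous entry (wrapped across lines).
            formatted[-1] += " " + line.strip()
    return formatted


refs = [
    "[1] Zhang S. Deep learning[J].",
    "Journal of X, 2020.",
    "[2] Li M. OCR methods[D]. 2021.",
]
print("\n".join(format_references(refs)))
```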

4. Write output

Write to <pdf_stem>.md in the same directory as the source PDF:

import pathlib

path = pathlib.Path(pdf_path)
md_path = path.with_suffix(".md")
md_path.write_text(processed_text, encoding="utf-8")

Output Formatting Rules

Element	Format
Title	`# Title`
Sections	`## Section`, `### Subsection`
Abstract	`## Abstract` + paragraph
Keywords	`Keywords: term1; term2`
Figure caption	`**图X-X：描述**`
Table	Markdown table with header separator
Inline math	`$x^2 + y^2 = z^2$`
Block math	`$$E = mc^2$$`
In-text citation	`[1]`, `[2,3]` (keep as-is)
References	Numbered list `1. ...`
Bold	`**text**`
Italic	`*text*`
Superscript	`text^sup`
Subscript	`text~sub`

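Architecture

Extraction and formatting are split into two independent scripts so that an AI agent can choose how to process a document:

ocr_extract.py            format_md.py
  │                         │
  │  raw plain text (.txt)  │  structured Markdown
  └────────────────────────→┘

ocr_extract.py: performs OCR plain-text extraction only and writes a .txt file (one OCR fragment per line)
format_md.py: reads any line-oriented text and applies the Markdown formatting rules

An AI agent can either:

Call format_md.py directly for one-step formatting
Format line by line manually with the Read/Write tools (suited to non-standard PDFs)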

Scripts

`scripts/ocr_extract.py`

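OCR plain-text extraction. Takes a PDF and outputs raw text (one OCR fragment per line, with pages separated by blank lines).

The engine is detected automatically (RapidOCR+layout → RapidOCR → tesseract CLI) and can also be forced with --engine. --lang mainly applies to Tesseract; RapidOCR ignores it.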

# Auto-detect the engine
python scripts/ocr_extract.py document.pdf --dpi 150 --max-pages 10 --lang ch,en

# Force Tesseract
python scripts/ocr_extract.py document.pdf --engine tesseract
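The detection order (RapidOCR+layout → RapidOCR → tesseract CLI) could be implemented along these lines. This is a sketch, not the script's actual logic: `detect_engine`, its return labels, and the injectable `has_module` / `has_cli` checks are hypothetical.

```python
import importlib.util
import shutil


def detect_engine(has_module=None, has_cli=None):
    """Return the first available engine in the documented order, or None."""
    if has_module is None:
        has_module = lambda name: importlib.util.find_spec(name) is not None
    if has_cli is None:
        has_cli = lambda name: shutil.which(name) is not None

    if has_module("rapidocr"):
        return "rapidocr+layout" if has_module("rapid_layout") else "rapidocr"
    if has_cli("tesseract"):
        return "tesseract"
    return None


print("Detected engine:", detect_engine())
```

Passing the checks in as parameters makes the fallback order testable without installing any of the engines.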

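Parameters:

--dpi: PDF rendering resolution (default 150)
--max-pages: maximum number of pages (0 = all)
--lang: OCR languages, comma-separated (default ch,en; only takes effect with Tesseract)
--engine {auto,rapidocr,tesseract}: force a specific OCR engine (default auto, which detects one automatically)

Output is written to document.txt, and the script prints the formatting command to run next.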
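For reference, the command-line interface described above maps onto `argparse` roughly as follows. This is a sketch assuming the script uses `argparse`; the default of 0 for `--max-pages` is also an assumption, since the source only states that 0 means all pages.

```python
import argparse


def build_parser():
    """Parser mirroring the documented flags of scripts/ocr_extract.py."""
    parser = argparse.ArgumentParser(description="OCR plain-text extraction from a PDF")
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--dpi", type=int, default=150, help="PDF rendering resolution")
    parser.add_argument("--max-pages", type=int, default=0, help="maximum pages (0 = all)")
    parser.add_argument("--lang", default="ch,en", help="comma-separated OCR languages")
    parser.add_argument(
        "--engine",
        choices=["auto", "rapidocr", "tesseract"],
        default="auto",
        help="force a specific OCR engine",
    )
    return parser


args = build_parser().parse_args(["document.pdf", "--dpi", "150", "--max-pages", "10"])
langs = args.lang.split(",")
print(args.pdf, args.dpi, args.max_pages, langs, args.engine)
```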

`scripts/format_md.py`

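Markdown formatting. Reads line-oriented text and automatically applies heading, bold, keyword, and similar formatting based on content patterns.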

# Overwrite (replace in place)
python scripts/format_md.py document.txt --overwrite

# Write to a new file
python scripts/format_md.py document.txt document.md

# Print to stdout only
python scripts/format_md.py document.txt

Formatting rules:

Input pattern	Output
`第X章 xxx`	`## 第X章 xxx`
`X.X xxx` / `X.Xxxx`	`### X.X xxx`
`X.X.X xxx`	`#### X.X.X xxx`
`X xxx` (digit + space + text)	`## X xxx`
`一、xxxx` (at least 4 characters, with no second `、`)	`### 一、xxxx`
`摘要` / `ABSTRACT`	`## 摘要` / `## ABSTRACT`
`关键词：...` / `Key words: ...`	`关键词： ...`
`图X / Figure X` at line start	`图X...`
`表X / Table X` at line start	`表X...`
Plain-text paragraph	Left unchanged
Page numbers / Roman numerals	Filtered out automatically
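A simplified sketch of how a few of these rules could be applied per line. It covers only the chapter, numbered-section, abstract, and page-number rows, does not insert the missing space in the `X.Xxxx` case, and its regular expressions are assumptions rather than the ones used by format_md.py.

```python
import re

# Order matters: the three-level pattern must be tried before the two-level one.
RULES = [
    (re.compile(r"^第.+?章\s*\S"), "## "),
    (re.compile(r"^\d+\.\d+\.\d+\s*\S"), "#### "),
    (re.compile(r"^\d+\.\d+\s*\S"), "### "),
    (re.compile(r"^\d+\s+\D"), "## "),
    (re.compile(r"^(摘要|ABSTRACT)$"), "## "),
]
PAGE_NUMBER = re.compile(r"^(\d+|[ivxlcdm]{1,8})$", re.IGNORECASE)


def format_line(line):
    """Return the formatted line, or None when the line should be filtered out."""
    text = line.strip()
    if PAGE_NUMBER.match(text):
        return None  # bare page number or Roman numeral
    for pattern, prefix in RULES:
        if pattern.match(text):
            return prefix + text
    return text  # plain paragraph, left unchanged


for sample in ["第一章 绪论", "1.2 研究现状", "1.2.3 方法", "iv", "本文研究了图像识别。"]:
    print(format_line(sample))
```

Patterns this loose will also promote lines such as a sentence starting with a year, so a real implementation would likely need length limits or extra context checks.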