| name | pdf2markdown |
| description | Convert academic PDF papers (especially Chinese academic papers) to well-structured Markdown. Use when the user needs to extract full content from PDF academic papers (thesis, dissertations, journal articles) and convert to Markdown format. Handles text extraction, OCR fallback for font-encoded/custom-encoding PDFs, layout-aware structuring (title/sections/figures/tables/formulas/references), table extraction, math formula preservation, figure caption formatting, and reference list formatting. Not for general PDF manipulation, form filling, or PDF creation tasks. |
PDF → Markdown Conversion
Dependencies
pip install rapidocr onnxruntime-directml rapid-layout pymupdf pdfplumber
GPU acceleration: DirectML (the native Windows GPU API) is detected automatically. If onnxruntime.get_available_providers() includes DmlExecutionProvider, the GPU is enabled; otherwise processing falls back to the CPU.
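The provider check described above might look like the following minimal sketch. The helper names pick_providers and detect_providers are hypothetical and not part of the skill's scripts; only onnxruntime.get_available_providers() and the provider names come from ONNX Runtime itself.

```python
def pick_providers(available):
    """Return an ONNX Runtime provider list, preferring DirectML when present."""
    if "DmlExecutionProvider" in available:
        # Keep the CPU provider as a fallback behind DirectML.
        return ["DmlExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


def detect_providers():
    """Query onnxruntime if it is installed; otherwise assume CPU only."""
    try:
        import onnxruntime as ort
    except ImportError:
        return ["CPUExecutionProvider"]
    return pick_providers(ort.get_available_providers())
```

Calling detect_providers() on a machine without a DirectML-capable build simply yields the CPU provider, which matches the fallback behavior described above.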
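| Package | Purpose | Required |
|---|---|---|
| rapidocr + onnxruntime-directml | Primary OCR engine | Yes |
| rapid-layout | Layout analysis (classifies titles, body text, figures, and tables) | Optional |
| pymupdf | Renders PDF pages to images | Yes |
| pdfplumber | Text and table extraction from PDFs whose text layer is not garbled | Yes |
| tesseract-ocr | Fallback OCR when RapidOCR is unavailable | Optional |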
Workflow
1. Detect extraction method
Try extracting text with pdfplumber first:
import pdfplumber
with pdfplumber.open(pdf_path) as pdf:
    texts = []
    for page in pdf.pages:
        # Pages without a text layer may yield None or an empty string.
        t = page.extract_text()
        texts.append(t or "")
raw = "\n".join(texts)
Check if output is garbled:
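import re
# U+FFFD replacement characters signal a broken font encoding.
replacement_count = raw.count("\ufffd")
chinese_chars = len(re.findall(r"[\u4e00-\u9fff]", raw))
total_chars = len(raw.strip())
if total_chars == 0:
    garbled = True  # no text layer at all
elif replacement_count > total_chars * 0.1:
    garbled = True  # more than 10% replacement characters
elif chinese_chars < total_chars * 0.02 and total_chars > 200:
    garbled = True  # almost no CJK in a long text; assumes a Chinese-language paper
else:
    garbled = False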
garbled = True → use OCR path (step 2a)
garbled = False → use text extraction path (step 2b)
2a. OCR path (RapidOCR + rapid-layout)
import fitz  # PyMuPDF
from rapidocr import RapidOCR
try:
    from rapid_layout import RapidLayout
    HAS_LAYOUT = True
except ImportError:
    HAS_LAYOUT = False
ocr_engine = RapidOCR()
layout_engine = RapidLayout() if HAS_LAYOUT else None
doc = fitz.open(pdf_path)
ocr_texts = []
for page in doc:
    # Render the page to PNG bytes at 150 DPI.
    pix = page.get_pixmap(dpi=150)
    img = pix.tobytes("png")
    if layout_engine is not None:
        # Region boxes and class names; this snippet collects them but does not use them further.
        layout_result = layout_engine(img)
        boxes = getattr(layout_result, "boxes", None)
        class_names = getattr(layout_result, "class_names", None)
    # OCR runs the same way with or without layout analysis.
    ocr_result = ocr_engine(img)
    if ocr_result.boxes is not None:
        page_text = "\n".join(ocr_result.txts)
    else:
        page_text = ""
    ocr_texts.append(page_text)
doc.close()
Layout analysis is optional. When rapid-layout is installed, the OCR path
automatically uses it to identify document structure (titles, sections, body
text, figures, tables, formulas, references) and assign OCR text accordingly.
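One way to assign OCR lines to layout regions is to test whether the center of each OCR box falls inside a region box. The sketch below works on plain lists; it assumes layout regions arrive as [x0, y0, x1, y1] and OCR boxes as four corner points, and the function names are hypothetical rather than part of rapid-layout or RapidOCR.

```python
def box_center(quad):
    """Center of an OCR quadrilateral given as four (x, y) points."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def label_ocr_lines(ocr_quads, ocr_texts, layout_boxes, layout_classes, default="text"):
    """Tag each OCR line with the class of the first region containing its center."""
    labeled = []
    for quad, text in zip(ocr_quads, ocr_texts):
        cx, cy = box_center(quad)
        label = default  # used when no region contains the line
        for (x0, y0, x1, y1), cls in zip(layout_boxes, layout_classes):
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                label = cls
                break
        labeled.append((label, text))
    return labeled
```

Center containment is cheap and tolerant of slightly misaligned boxes; an overlap-ratio test would be a stricter alternative when regions are densely packed.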
2b. Text extraction path
Use pdfplumber for clean PDFs. For tables:
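import pdfplumber
all_tables = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Each table is a list of rows; each row is a list of cell strings (None for empty cells).
        for table in page.extract_tables():
            all_tables.append(table)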
3. Post-process extracted text
When using ocr_extract.py, all post-processing is applied automatically.
When doing manual extraction (2b text path or manual OCR), apply these steps:
3a. Remove repeated headers
KNOWN_HEADERS = [
    "硕士学位论文", "博士学位论文", "本科毕业论文",
    "中国知网", "CNKI", "万方数据",
    "University", "Dissertation", "Thesis",
]
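How the list is applied is not spelled out here, so the following is only a sketch under one assumption: a line counts as a running header when it contains a known phrase and also recurs on several pages. The function name and the min_pages threshold are hypothetical.

```python
from collections import Counter

KNOWN_HEADERS = [
    "硕士学位论文", "博士学位论文", "本科毕业论文",
    "中国知网", "CNKI", "万方数据",
    "University", "Dissertation", "Thesis",
]


def remove_repeated_headers(pages, min_pages=2):
    """Drop lines that contain a known header phrase and recur across pages.

    pages is a list of page texts; min_pages is an assumed threshold.
    """
    page_lines = [[ln.strip() for ln in page.splitlines()] for page in pages]
    # Count on how many pages each distinct line appears.
    seen_on = Counter()
    for lines in page_lines:
        seen_on.update(set(lines))

    def is_header(line):
        return seen_on[line] >= min_pages and any(h in line for h in KNOWN_HEADERS)

    return ["\n".join(ln for ln in lines if not is_header(ln)) for lines in page_lines]
```

Requiring recurrence keeps one-off lines such as a university name on the title page, which a plain substring match would delete.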
3b. Detect and skip TOC pages
Heuristic: if a page has >=15 lines matching ^\d+\.\d or ^\d+\s*$, it's
likely a TOC page. Remove it.
Also remove standalone roman numeral lines (^[ivxlcdm]+$, case-insensitive, length <=8).
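The two heuristics above can be sketched as small helpers. The patterns and thresholds are taken from the text; the function names are hypothetical.

```python
import re

TOC_LINE = re.compile(r"^\d+\.\d|^\d+\s*$")
ROMAN_LINE = re.compile(r"^[ivxlcdm]+$", re.IGNORECASE)


def is_toc_page(page_text, threshold=15):
    """True when at least `threshold` lines look like TOC entries or bare page numbers."""
    lines = [ln.strip() for ln in page_text.splitlines()]
    hits = sum(1 for ln in lines if TOC_LINE.match(ln))
    return hits >= threshold


def drop_roman_numeral_lines(page_text, max_len=8):
    """Remove standalone roman-numeral lines of at most `max_len` characters."""
    kept = []
    for ln in page_text.splitlines():
        s = ln.strip()
        if len(s) <= max_len and ROMAN_LINE.match(s):
            continue
        kept.append(ln)
    return "\n".join(kept)
```

Note that the roman-numeral pattern also matches short English words made only of those letters (for example "mix"), so it is safest on text where such standalone lines are unlikely.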
3c. Format figure captions
**图X-X:标题内容**
Normalize captions such as 图 1., Figure 1., or 图1-1 to **图X:...** / **图X-X:...**.
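A minimal sketch of this caption rule with a single regular expression follows. The function name and the max_len guard (which skips long body sentences that merely start with 图) are assumptions, not part of the skill's scripts.

```python
import re

# Prefix, figure number (optionally X-X or X.X), optional punctuation, then the title.
CAPTION = re.compile(r"^(?:图|Figure)\s*(\d+(?:[-.]\d+)?)\s*[.::]?\s*(.*)$")


def format_figure_caption(line, max_len=60):
    """Return the line as a bold 图X:title caption, or unchanged if it is not a caption."""
    stripped = line.strip()
    m = CAPTION.match(stripped)
    if m is None or len(stripped) > max_len:
        return line
    number, title = m.group(1), m.group(2).strip()
    return f"**图{number}:{title}**" if title else f"**图{number}**"
```

For example, format_figure_caption("图 1. 系统架构") gives **图1:系统架构**, and lines that do not start with 图 or Figure pass through untouched.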
3d. Format tables
Convert pdfplumber table output to Markdown:
| Col1 | Col2 |
|------|------|
| val1 | val2 |
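A conversion along these lines could be sketched as follows, treating the first row as the header. The function name is hypothetical; the input shape (a list of rows, with None for empty cells) is what pdfplumber's extract_tables() produces.

```python
def table_to_markdown(table):
    """Render a pdfplumber-style table (list of rows) as a Markdown table."""
    if not table:
        return ""
    width = max(len(row) for row in table)

    def clean(cell):
        # Newlines and pipes inside a cell would break the Markdown row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of columns.
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```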
3e. Preserve math formulas
Keep inline $...$ and block $$...$$ LaTeX as-is.
3f. Format references
At the end of document, convert reference entries to numbered list:
1. Author, "Title", Journal, vol. X, no. Y, pp. Z, year.
2. ...
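As a rough sketch, reference entries that start with markers such as [1], (1), or 1. can be renumbered into a plain numbered list, with wrapped lines joined to the entry before them. The function name and the set of accepted markers are assumptions, and the sample entries in the usage note are placeholders.

```python
import re

# An entry starts with [n], (n), or n. followed by optional whitespace.
REF_START = re.compile(r"^\s*(?:\[\d+\]|\(\d+\)|\d+\.)\s*")


def format_references(lines):
    """Turn raw reference lines into a sequentially numbered list of entries."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = REF_START.match(line)
        if m:
            entries.append(line[m.end():])
        elif entries:
            # Wrapped line: join it to the previous entry.
            entries[-1] += " " + line
        else:
            entries.append(line)
    return [f"{i}. {text}" for i, text in enumerate(entries, start=1)]
```

Given ["[1] Author A. Title one.", "Journal X, 2020.", "[2] Author B. Title two."], this returns two entries numbered 1. and 2., with the wrapped journal line appended to the first. A wrapped line that itself begins with a number and a period would be mistaken for a new entry, so the result is worth a quick manual check.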
4. Write output
Write to <pdf_stem>.md in the same directory as the source PDF:
import pathlib
path = pathlib.Path(pdf_path)
md_path = path.with_suffix(".md")
md_path.write_text(processed_text, encoding="utf-8")
Output Formatting Rules
| Element | Format |
|---|---|
| Title | # Title |
| Sections | ## Section, ### Subsection |
| Abstract | ## Abstract + paragraph |
| Keywords | **Keywords:** term1; term2 |
| Figure caption | **图X-X:描述** |
| Table | Markdown table with header separator |
| Inline math | $x^2 + y^2 = z^2$ |
| Block math | $$E = mc^2$$ |
| In-text citation | [1], [2,3] (keep as-is) |
| References | Numbered list 1. ... |
| Bold | **text** |
| Italic | *text* |
| Superscript | text^sup |
| Subscript | text~sub |
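Architecture
Extraction and formatting are split into two independent scripts so that an AI agent can choose how to process a document:
ocr_extract.py          format_md.py
│                       │
│ raw plain text (.txt) │ structured Markdown
└──────────────────────→┘
ocr_extract.py: performs OCR plain-text extraction only and writes a .txt file (one OCR fragment per line)
format_md.py: reads any line-oriented text and applies the Markdown formatting rules
The AI agent can either:
- call format_md.py directly for one-step formatting
- format the text line by line with the Read/Write tools (suited to non-standard PDFs)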
Scripts
scripts/ocr_extract.py
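OCR plain-text extraction. Takes a PDF and writes raw text (one OCR fragment per line, with pages separated by blank lines).
The engine is auto-detected (RapidOCR + layout → RapidOCR → tesseract CLI)
and can be forced with --engine. --lang mainly applies to Tesseract; RapidOCR ignores it.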
python scripts/ocr_extract.py document.pdf --dpi 150 --max-pages 10 --lang ch,en
python scripts/ocr_extract.py document.pdf --engine tesseract
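Arguments:
--dpi: PDF rendering resolution (default 150)
--max-pages: maximum number of pages (0 = all)
--lang: OCR languages, comma-separated (default ch,en; only takes effect with Tesseract)
--engine {auto,rapidocr,tesseract}: force a specific OCR engine (default auto, which auto-detects)
Output is written to document.txt, followed by a hint showing the next formatting command.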
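For orientation, the documented flags could be declared with argparse roughly as below. This is a reconstruction from the argument list, not the script's actual source; the positional name pdf and the default of 0 for --max-pages are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented ocr_extract.py options."""
    parser = argparse.ArgumentParser(
        prog="ocr_extract.py",
        description="Extract raw OCR text from a PDF.",
    )
    parser.add_argument("pdf", help="input PDF path")
    parser.add_argument("--dpi", type=int, default=150, help="render resolution")
    # Default of 0 (all pages) is assumed; the docs only state that 0 means all.
    parser.add_argument("--max-pages", type=int, default=0, help="0 = all pages")
    parser.add_argument("--lang", default="ch,en", help="comma-separated OCR languages")
    parser.add_argument(
        "--engine",
        choices=["auto", "rapidocr", "tesseract"],
        default="auto",
        help="force a specific OCR engine",
    )
    return parser
```

Parsing ["document.pdf", "--engine", "tesseract"] with this parser leaves dpi at 150 and lang at ch,en, matching the defaults listed above.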
scripts/format_md.py
Markdown formatting. Reads line-oriented text and applies heading, bold, keyword, and similar formatting based on content patterns.
python scripts/format_md.py document.txt --overwrite
python scripts/format_md.py document.txt document.md
python scripts/format_md.py document.txt
Formatting rules:
| Input pattern | Output |
|---|---|
| 第X章 xxx | ## 第X章 xxx |
| X.X xxx / X.Xxxx | ### X.X xxx |
| X.X.X xxx | #### X.X.X xxx |
| X xxx (digit + space + text) | ## X xxx |
| 一、xxxx (at least 4 characters, no second 、) | ### 一、xxxx |
| 摘要 / ABSTRACT | ## 摘要 / ## ABSTRACT |
| 关键词:... / Key words: ... | **关键词:** ... |
| 图X / Figure X at line start | **图X...** |
| 表X / Table X at line start | **表X...** |
| Plain text paragraph | Left unchanged |
| Page number / roman numeral | Filtered automatically |
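The heading, abstract, keyword, and page-number rules can be approximated with ordered regular expressions, as in the sketch below. It is not the format_md.py source: the function name is hypothetical, the figure and table rules are left out, and limiting the "X xxx" rule to one or two digits is an added assumption to avoid treating lines that start with a year as headings.

```python
import re

CN_DIGITS = "一二三四五六七八九十"


def format_line(line):
    """Return the Markdown form of one line, or None if the line should be dropped."""
    s = line.strip()
    # Bare page numbers and short roman numerals are filtered out.
    if re.fullmatch(r"\d+", s) or re.fullmatch(r"[ivxlcdm]{1,8}", s, re.IGNORECASE):
        return None
    if re.match(rf"^第[{CN_DIGITS}\d]+章", s):
        return f"## {s}"
    # Check X.X.X before X.X so the deeper level wins.
    m = re.match(r"^(\d+\.\d+\.\d+)\s*([^\d.\s].*)$", s)
    if m:
        return f"#### {m.group(1)} {m.group(2)}"
    m = re.match(r"^(\d+\.\d+)\s*([^\d.\s].*)$", s)
    if m:
        return f"### {m.group(1)} {m.group(2)}"
    m = re.match(r"^(\d{1,2})\s+([^\d\s].*)$", s)
    if m:
        return f"## {m.group(1)} {m.group(2)}"
    # 一、xxxx: at least 4 characters after the marker and no second 、.
    if re.match(rf"^[{CN_DIGITS}]+、[^、]{{4,}}$", s):
        return f"### {s}"
    if s in ("摘要", "ABSTRACT"):
        return f"## {s}"
    m = re.match(r"^(关键词|Key\s?words)\s*([::])\s*(.*)$", s)
    if m:
        return f"**{m.group(1)}{m.group(2)}** {m.group(3)}"
    return line
```

Rule order matters here: the page-number filter runs first, and the more specific numeric patterns are tried before the general ones so that a line such as 1.1.1 研究背景 is not captured by the X.X rule.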