ocr-document

Updated June 13, 2026, 14:21

Extract text from PDFs, images, and scanned documents. Uses pymupdf (local) or optional cloud OCR APIs.

Source: fuyuxiang/echo-agent

OCR & Document Processing

Extract text from PDFs, scanned images, and documents.

PDF Text Extraction (PyMuPDF)

Best choice for text-based PDFs:

pip install pymupdf

import pymupdf

doc = pymupdf.open("file.pdf")
for page in doc:
    text = page.get_text()
    print(text)

# All pages at once
full_text = "\n".join(page.get_text() for page in doc)

PDF → Markdown (marker-pdf)

High-quality conversion preserving structure:

pip install marker-pdf
marker_single file.pdf --output_dir output_dir/ --output_format markdown

Image OCR

Surya OCR (Modern ML-based, best for Chinese)

pip install surya-ocr
surya_ocr image.png --langs zh,en

Pytesseract (Traditional, widely available)

# Install Tesseract engine first
brew install tesseract tesseract-lang  # macOS
apt install tesseract-ocr tesseract-ocr-chi-sim  # Linux
pip install pytesseract Pillow

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(
    Image.open("scan.png"),
    lang="chi_sim+eng"
)

Script

python3 scripts/extract_document.py document.pdf
python3 scripts/extract_document.py scan.png
python3 scripts/extract_document.py report.pdf --output extracted.txt

The script picks a backend from the file extension: PDF → pymupdf, DOCX → python-docx, image → pytesseract. The OCR language follows the system Tesseract configuration (e.g., chi_sim+eng by default).
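The extension-based dispatch described above can be sketched roughly as follows. This is an illustration only, not the actual contents of scripts/extract_document.py: the `BACKENDS` table, the list of image extensions, and the `pick_backend` name are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical mapping from file extension to extraction backend.
# The real script may recognise a different set of extensions.
BACKENDS = {
    ".pdf": "pymupdf",
    ".docx": "python-docx",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
    ".jpeg": "pytesseract",
    ".tif": "pytesseract",
    ".tiff": "pytesseract",
    ".bmp": "pytesseract",
}


def pick_backend(path: str) -> str:
    """Return the backend name for a file, based only on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive: "A.PDF" -> ".pdf"
    if suffix not in BACKENDS:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
    return BACKENDS[suffix]
```

Dispatching on the extension keeps the script simple, but it trusts the file name: a scanned PDF still goes to pymupdf and may come back empty, which is where the OCR tips below apply.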

Tips

For scanned PDFs, extract the page images first, then OCR each page.
Preprocessing (deskew, contrast adjustment) improves OCR accuracy.
For Chinese text, surya-ocr is generally more accurate than pytesseract.
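One way to combine the first two tips is sketched below: read each page's text layer with PyMuPDF, and only when it is nearly empty render the page to an image, apply a light contrast cleanup, and pass it to Tesseract. The function names, the 20-character threshold, and the 300 DPI setting are assumptions chosen for illustration. Deskewing is not shown.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is (nearly) empty.

    The 20-character threshold is an arbitrary assumption, not a tuned value.
    """
    return len(page_text.strip()) < min_chars


def extract_pdf_with_ocr_fallback(
    pdf_path: str, lang: str = "chi_sim+eng", dpi: int = 300
) -> str:
    """Extract text page by page, falling back to OCR for scanned pages."""
    # Imported lazily so needs_ocr() stays usable without the OCR stack installed.
    import pymupdf
    import pytesseract
    from PIL import Image, ImageOps

    pages = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to an RGB bitmap at the requested resolution.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                # Light preprocessing: grayscale, then stretch the contrast.
                image = ImageOps.autocontrast(ImageOps.grayscale(image))
                text = pytesseract.image_to_string(image, lang=lang)
            pages.append(text)
    return "\n".join(pages)
```

Calling `extract_pdf_with_ocr_fallback("scan.pdf")` requires pymupdf, pytesseract, Pillow, and the Tesseract engine with the matching language packs installed. Pages that already have a text layer skip OCR entirely, which keeps mixed documents fast.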

name: ocr-document
description: Extract text from PDFs, images, and scanned documents. Uses pymupdf (local) or optional cloud OCR APIs.
version: 1.0.0
metadata: {"echo":{"tags":["OCR","PDF","Document","Extract","Text"]}}
