ocr-document

Stars: 526

Forks: 19

Updated: June 13, 2026, 14:21

Extract text from PDFs, images, and scanned documents. Uses pymupdf (local) or optional cloud OCR APIs.


Source: fuyuxiang/echo-agent (GitHub)


OCR & Document Processing

Extract text from PDFs, scanned images, and documents.

PDF Text Extraction (PyMuPDF)

Best choice for text-based PDFs:

pip install pymupdf

import pymupdf

doc = pymupdf.open("file.pdf")
for page in doc:
    text = page.get_text()
    print(text)

# All pages at once
full_text = "\n".join(page.get_text() for page in doc)
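Whether a PDF is "text-based" can be checked from what `get_text()` returns: pages without a text layer come back empty or nearly empty. The following is a minimal sketch of that check, not part of the skill itself. The helper names `pages_needing_ocr` and `read_page_texts` and the `min_chars` threshold are illustrative assumptions.

```python
from typing import Iterable, List


def pages_needing_ocr(page_texts: Iterable[str], min_chars: int = 20) -> List[int]:
    """Return 0-based indices of pages whose extracted text is too short.

    min_chars is an illustrative threshold (an assumption), not a PyMuPDF setting.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def read_page_texts(pdf_path: str) -> List[str]:
    """Extract the text layer of every page with PyMuPDF."""
    # Imported inside the function so the pure helper above works without PyMuPDF.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Calling `pages_needing_ocr(read_page_texts("file.pdf"))` would then list the pages that probably need one of the OCR routes described under Image OCR.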

PDF → Markdown (marker-pdf)

High-quality conversion preserving structure:

pip install marker-pdf
marker_single file.pdf --output_dir output_dir/ --output_format markdown

Image OCR

Surya OCR (Modern ML-based, best for Chinese)

pip install surya-ocr
surya_ocr image.png --langs zh,en

Pytesseract (Traditional, widely available)

# Install Tesseract engine first
brew install tesseract tesseract-lang  # macOS
apt install tesseract-ocr tesseract-ocr-chi-sim  # Linux
pip install pytesseract Pillow

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(
    Image.open("scan.png"),
    lang="chi_sim+eng"
)

Script

python3 scripts/extract_document.py document.pdf
python3 scripts/extract_document.py scan.png
python3 scripts/extract_document.py report.pdf --output extracted.txt

The script picks a backend from the file extension: PDF → pymupdf, DOCX → python-docx, image → pytesseract. The OCR language comes from the system Tesseract configuration (for example, chi_sim+eng as the default).
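The extension-based dispatch described above might look roughly like the sketch below. This is not the actual contents of `scripts/extract_document.py`: the function name `pick_backend` and the set of image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed list of image extensions; the real script may accept a different set.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def pick_backend(path: str) -> str:
    """Map a file path to the name of the library that would handle it."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    if suffix == ".pdf":
        return "pymupdf"
    if suffix == ".docx":
        return "python-docx"
    if suffix in IMAGE_EXTENSIONS:
        return "pytesseract"
    raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
```

Raising on unknown extensions, rather than guessing a backend, keeps an unsupported file from silently producing empty output.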

Tips

For scanned PDFs, extract the page images first, then run OCR on each page.
Preprocessing (deskew, contrast adjustment) improves OCR accuracy.
For Chinese OCR, surya-ocr is more accurate than pytesseract.
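The first two tips can be combined into one pipeline. The sketch below renders each PDF page to an image with PyMuPDF, applies grayscale and auto-contrast with Pillow, and passes the result to pytesseract. It is an illustration under assumptions: the function names, the 300 dpi default, and the page-marker format are not taken from the skill, and deskewing is left out.

```python
from typing import List


def label_pages(page_texts: List[str]) -> str:
    """Join per-page OCR text, marking where each page starts."""
    return "\n".join(
        f"--- page {number} ---\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    )


def ocr_scanned_pdf(pdf_path: str, lang: str = "chi_sim+eng", dpi: int = 300) -> str:
    """Render every page to an image, boost contrast, and OCR it."""
    # Imported inside the function so label_pages works without the OCR stack.
    import pymupdf
    import pytesseract
    from PIL import Image, ImageOps

    page_texts = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # Rasterize the page; the default pixmap is RGB without alpha.
            pix = page.get_pixmap(dpi=dpi)
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            # Simple preprocessing: grayscale, then stretch the contrast.
            image = ImageOps.autocontrast(ImageOps.grayscale(image))
            page_texts.append(pytesseract.image_to_string(image, lang=lang))
    return label_pages(page_texts)
```

A call such as `ocr_scanned_pdf("scan.pdf")` requires the Tesseract engine and the `chi_sim` language data installed as shown in the Pytesseract section.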


name	ocr-document
description	Extract text from PDFs, images, and scanned documents. Uses pymupdf (local) or optional cloud OCR APIs.
version	1.0.0
metadata	{"echo":{"tags":["OCR","PDF","Document","Extract","Text"]}}
