ocr-document
Extract text from PDFs, images, and scanned documents. Uses pymupdf (local) or optional cloud OCR APIs.
| name | ocr-document |
| description | Extract text from PDFs, images, and scanned documents. Uses pymupdf (local) or optional cloud OCR APIs. |
| version | 1.0.0 |
| metadata | {"echo":{"tags":["OCR","PDF","Document","Extract","Text"]}} |
Extract text from PDFs, scanned images, and documents.
pymupdf is the best choice for text-based PDFs:
pip install pymupdf
import pymupdf

doc = pymupdf.open("file.pdf")
for page in doc:
    text = page.get_text()
    print(text)

# All pages at once
full_text = "\n".join(page.get_text() for page in doc)
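The extraction above only helps when the PDF carries a text layer; scanned pages typically come back empty or nearly empty from `get_text()`. A minimal sketch of flagging such pages for OCR follows. The helper names `pages_needing_ocr` and `extract_page_texts` and the `min_chars=20` threshold are assumptions made for illustration, not part of pymupdf or of this skill.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indices of pages whose extracted text is too short to trust.

    min_chars is a hypothetical threshold; tune it for your documents.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def extract_page_texts(path):
    """Collect page.get_text() for every page of a PDF."""
    # Imported lazily so pages_needing_ocr can be used without pymupdf installed.
    import pymupdf

    doc = pymupdf.open(path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()
```

Calling `pages_needing_ocr(extract_page_texts("file.pdf"))` would list the pages that probably need one of the OCR tools instead of plain text extraction.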
marker-pdf gives a high-quality conversion to Markdown that preserves document structure:
pip install marker-pdf
marker_single file.pdf output_dir/ --output_format markdown
OCR for images with surya-ocr:
pip install surya-ocr
surya_ocr image.png --langs zh,en
# Install Tesseract engine first
brew install tesseract tesseract-lang # macOS
apt install tesseract-ocr tesseract-ocr-chi-sim # Linux
pip install pytesseract Pillow
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(
    Image.open("scan.png"),
    lang="chi_sim+eng",
)
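The `lang` argument above joins several Tesseract language codes with `+`. A small sketch of building that string from a list of codes, assuming a hypothetical helper named `build_lang` (it is not part of pytesseract):

```python
def build_lang(*codes):
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    unique = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)
```

The result can be passed straight through, for example `lang=build_lang("chi_sim", "eng")`. Each code still needs its language data installed alongside the Tesseract engine.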
Run the extraction script:
python3 scripts/extract_document.py document.pdf
python3 scripts/extract_document.py scan.png
python3 scripts/extract_document.py report.pdf --output extracted.txt
The script auto-detects the format from the file extension: PDF → pymupdf, DOCX → python-docx, image → pytesseract.
The OCR language is controlled by the system Tesseract configuration (e.g., a default of chi_sim+eng).
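Extension-based detection like this can be sketched as a lookup table. This is an illustration only, not the actual contents of `scripts/extract_document.py`: the `BACKENDS` table, the set of image extensions, and the `pick_backend` name are all assumptions.

```python
from pathlib import Path

# Hypothetical extension table; the real script's list may differ.
BACKENDS = {
    ".pdf": "pymupdf",
    ".docx": "python-docx",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
    ".jpeg": "pytesseract",
    ".tif": "pytesseract",
    ".tiff": "pytesseract",
}


def pick_backend(path):
    """Return the name of the library that should handle this file."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    try:
        return BACKENDS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
```

Keeping the mapping in one dictionary makes it easy to add a format without touching the dispatch logic, and an unknown extension fails loudly rather than falling through to the wrong extractor.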