
ocr

Updated May 26, 2026, 19:22

Extract text and layout structure from images using OCR (Optical Character Recognition). Use this skill whenever the user provides or mentions an image file (screenshot, photo, scanned document, receipt, whiteboard photo, PDF page, etc.) and wants to extract text from it, understand its layout, or get structured information about what's on the image. This skill handles both Chinese and English text, and can identify layout regions (navigation bars, headers, footers, main content, tables, figures, etc.). Also use when the user says "OCR", "提取文字", "图片里的文字", "识别截图", "text from image", "scan this", "read this image" or similar phrases. Do NOT use for pure image generation, image editing, or image classification tasks that don't involve text extraction.

Source: lazyst/my-skills

OCR Skill — Text Extraction with Layout Analysis

This skill extracts text and layout information from images using a local OCR pipeline. It calls scripts/ocr_pipeline.py to do the actual work.

Workflow

When the user provides an image file or asks you to OCR an image, follow these steps:

Step 1: Locate the image

Use the path the user provides. Work in the user's current working directory. Do NOT copy or move the file.

Step 2: Choose OCR engine (ask the user)

Do NOT auto-install or auto-fallback. The user decides which engine to use. Your job is to present the options clearly and let them choose.

First, detect what's already available on their system:

rapidocr_ok() { python -c "from rapidocr import RapidOCR" 2>/dev/null && echo "yes" || echo "no"; }
tesseract_ok() { command -v tesseract >/dev/null 2>&1 && echo "yes" || echo "no"; }
echo "rapidocr:$(rapidocr_ok) tesseract:$(tesseract_ok)"
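
The same check can be sketched in Python, which avoids depending on a POSIX shell. This is not part of the skill: detect_engines and format_status are hypothetical helper names, and importlib.util.find_spec only confirms that a top-level rapidocr package can be found, not that RapidOCR itself loads.

```python
import importlib.util
import shutil


def detect_engines():
    """Report which OCR engines appear to be installed, without running them."""
    return {
        # True when a top-level "rapidocr" package is importable.
        "rapidocr": importlib.util.find_spec("rapidocr") is not None,
        # True when a "tesseract" executable is on PATH.
        "tesseract": shutil.which("tesseract") is not None,
    }


def format_status(status):
    """Render the same 'rapidocr:yes tesseract:no' line as the shell check."""
    return " ".join(f"{name}:{'yes' if ok else 'no'}" for name, ok in status.items())


if __name__ == "__main__":
    print(format_status(detect_engines()))
```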

Then present a comparison table via the question tool with exactly these three options. Use the descriptions verbatim:

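| Plan | Engine | Recognition quality | Layout analysis | Install time | Requirements |
| --- | --- | --- | --- | --- | --- |
| A Fast | Use an already-installed engine (auto-detected) | Depends on the installed engine | Depends on the installed engine | 0 s | Installs nothing; uses whichever engine is available |
| B Quality | RapidOCR + rapid-layout | ★★★ Highest | ✅ Yes (15 region types such as header, title, table, list) | ~1-2 min | `pip install rapidocr onnxruntime-directml`; optional `rapid-layout` |
| C Compatible | Tesseract CLI | ★★☆ Medium | ❌ No (everything is labeled text) | ~10 s | System `tesseract-ocr` package (installable on almost any platform) |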

After the user chooses, execute accordingly:

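Plan A: run the script directly without the --engine flag (the script auto-detects an available engine).
Plan B: if RapidOCR is not installed, run the install command first, then use --engine rapidocr.
Plan C: if Tesseract is not installed, run the install command first, then use --engine tesseract.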
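
The plan-to-flag mapping above is small enough to write down as a lookup. The sketch below is illustrative only; PLAN_ENGINE and engine_args are hypothetical names, not part of the skill's script.

```python
# Plan letter -> value for --engine. None means "omit the flag and let the
# script auto-detect", which is what Plan A asks for.
PLAN_ENGINE = {"A": None, "B": "rapidocr", "C": "tesseract"}


def engine_args(plan):
    """Return the extra command-line arguments for the plan the user chose."""
    key = plan.strip().upper()
    if key not in PLAN_ENGINE:
        raise ValueError(f"unknown plan: {plan!r}")
    engine = PLAN_ENGINE[key]
    return [] if engine is None else ["--engine", engine]
```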

Install commands (pick the ones for the user's platform):

# RapidOCR (Plan B)
pip install rapidocr onnxruntime-directml
pip install rapid-layout  # optional, enables layout analysis

# Tesseract CLI (Plan C, Linux)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-eng
# Windows
winget install -e --id UB-Mannheim.TesseractOCR
# macOS
brew install tesseract
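
Picking the right command for the user's platform can be sketched with platform.system(), which returns "Linux", "Windows", or "Darwin" on the three systems listed. The helper name and the dictionaries are hypothetical. The command strings are copied from the list above, and the Linux entry assumes an apt-based distribution, as the list does.

```python
import platform

# Commands copied from the install list; only the lookup is new.
RAPIDOCR_INSTALL = [
    "pip install rapidocr onnxruntime-directml",
    "pip install rapid-layout",  # optional, enables layout analysis
]

TESSERACT_INSTALL = {
    "Linux": "sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-eng",
    "Windows": "winget install -e --id UB-Mannheim.TesseractOCR",
    "Darwin": "brew install tesseract",  # platform.system() reports macOS as "Darwin"
}


def tesseract_install_command(system=None):
    """Return the Plan C install command for a platform, or None if unlisted."""
    return TESSERACT_INSTALL.get(system or platform.system())
```

Returning None for an unlisted platform leaves the decision with the user, in line with the no-auto-fallback rule.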

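If installation fails, tell the user immediately and suggest another plan. Do not fall back automatically.

Important: install only the packages listed in the table. Do not additionally install pytesseract or Pillow. Plan C needs only the tesseract CLI binary, not the Python bindings.

The "note" field of the output JSON records which engine was actually used and why.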

Step 3: Run the OCR pipeline

python <skill_dir>/scripts/ocr_pipeline.py <image_path> [--lang ch,en] [--output result.json] [--engine rapidocr] [--no-layout]

<skill_dir> is the directory containing this SKILL.md file (the skill root).
--lang specifies languages, comma-separated. Default is ch,en.
--output writes results to a JSON file. If omitted, writes <image_stem>.json.
--engine {auto,rapidocr,tesseract} forces a specific engine. Default auto uses the first available engine. Use this when the user chose a specific plan.
--no-layout disables layout region classification (RapidOCR only). Tesseract always returns all text as type "text" regardless of this flag.
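
When the command is assembled programmatically, building it as an argument list avoids shell-quoting problems with image paths that contain spaces. A minimal sketch, assuming only the flags documented above; build_command and its parameter names are hypothetical.

```python
import sys
from pathlib import Path


def build_command(skill_dir, image_path, lang=None, output=None,
                  engine=None, no_layout=False):
    """Assemble the ocr_pipeline.py invocation as a list of arguments."""
    script = Path(skill_dir) / "scripts" / "ocr_pipeline.py"
    cmd = [sys.executable, str(script), str(image_path)]
    if lang:
        cmd += ["--lang", ",".join(lang)]  # comma-separated, e.g. "ch,en"
    if output:
        cmd += ["--output", str(output)]
    if engine:
        cmd += ["--engine", engine]  # omit for auto-detection (Plan A)
    if no_layout:
        cmd.append("--no-layout")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True).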

Step 4: Present the results

The JSON output contains text grouped by layout region with bounding boxes. Use this to answer the user's question:

Summarize what text was found
Point out specific layout regions (header, text, table, footer, etc.)
Answer questions about specific text in the image
If the user asks for a specific output format, transform the JSON accordingly
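
One way to prepare such a summary is to group the recognized lines by region type. The sketch below assumes the JSON structure described under Output Format; text_by_region_type is a hypothetical helper, not part of the skill.

```python
from collections import defaultdict


def text_by_region_type(result):
    """Group recognized line texts by the layout type of their region."""
    grouped = defaultdict(list)
    for region in result.get("regions", []):
        region_type = region.get("type", "unknown")
        for line in region.get("lines", []):
            grouped[region_type].append(line.get("text", ""))
    return dict(grouped)
```

The input is the dictionary obtained with json.load from the script's output file.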

Step 5: If something goes wrong

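Plan A with no engine available: tell the user "No OCR engine is available in the current environment" and suggest switching to Plan B (install RapidOCR) or Plan C (install Tesseract).
"Requested engine 'X' is not installed": the engine for the plan the user chose is missing. Run the install command; if installation fails, tell the user and suggest another plan.
"No OCR engine found": none of the three plans can run. Suggest installing Tesseract (Plan C, the fastest to deploy).
"File not found": verify that the image path exists and is readable.
Empty regions / no text detected: the image may be low quality, or the language may be wrong. Retry with --lang set to the correct language code.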
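
The message-to-action mapping above can be sketched as a substring lookup. This assumes the script reports failures with the quoted messages verbatim in its error output, which the source does not confirm; ADVICE and advice_for are hypothetical names.

```python
# Substrings taken from the error list above. Matching them in the script's
# error output is an assumption about how the script reports failures.
ADVICE = [
    ("is not installed",
     "Run the install command for the chosen plan; if it fails, suggest another plan."),
    ("No OCR engine found",
     "Suggest installing Tesseract (Plan C)."),
    ("File not found",
     "Verify that the image path exists and is readable."),
]


def advice_for(error_text):
    """Return the suggested next step for an error message, or None if unknown."""
    for needle, advice in ADVICE:
        if needle in error_text:
            return advice
    return None
```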

Output Format

{
  "engine": "rapidocr | tesseract",
  "image": "filename.jpg",
  "width": 1920,
  "height": 1080,
  "note": "rapid-layout not available (pip install rapid-layout). No layout classification.",
  "regions": [
    {
      "type": "header|title|text|figure|figure_caption|table|table_caption|footer|reference|formula|list|page_number|footnote|abstract|unknown",
      "bbox": [x1, y1, x2, y2],
      "lines": [
        {
          "text": "actual text content",
          "bbox": [x1, y1, x2, y2],
          "bbox_poly": [x1, y1, x2, y2, x3, y3, x4, y4],
          "confidence": 0.98
        }
      ]
    }
  ]
}

note: records which engine was actually used and why, for example when a fallback engine or no-layout mode applies.
type: layout category. RapidOCR with rapid-layout provides detailed classification; Tesseract returns "text" for every region.
bbox: axis-aligned bounding box in [x1, y1, x2, y2] format, used for both regions and lines.
bbox_poly: precise 4-point polygon [x1,y1,x2,y2,x3,y3,x4,y4] (clockwise from top-left); only available with the RapidOCR engine.
confidence: recognition confidence (0 to 1, higher is better).
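
Two small helpers illustrate how these fields relate. The first derives an axis-aligned rectangle from a bbox_poly by taking the extremes of its coordinates; the second keeps only lines at or above a confidence threshold. Both names and the 0.8 default are assumptions for illustration, not values used by the skill.

```python
def poly_to_bbox(bbox_poly):
    """Convert [x1,y1,x2,y2,x3,y3,x4,y4] into an enclosing [x1, y1, x2, y2]."""
    xs = bbox_poly[0::2]  # every even index is an x coordinate
    ys = bbox_poly[1::2]  # every odd index is a y coordinate
    return [min(xs), min(ys), max(xs), max(ys)]


def confident_lines(result, threshold=0.8):
    """Return the text of every line whose confidence meets the threshold."""
    return [
        line["text"]
        for region in result.get("regions", [])
        for line in region.get("lines", [])
        if line.get("confidence", 0.0) >= threshold
    ]
```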

Multiple images

If the user provides multiple images, pass them all to the script:

python <skill_dir>/scripts/ocr_pipeline.py img1.jpg img2.jpg --output-dir ./ocr_results

Each image gets its own <stem>.json file in --output-dir.
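
After a batch run, the per-image files can be collected into one dictionary keyed by image stem. A minimal sketch, assuming the files are UTF-8 encoded JSON; load_results is a hypothetical helper.

```python
import json
from pathlib import Path


def load_results(output_dir):
    """Load every <stem>.json in output_dir into a {stem: result} dictionary."""
    results = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        # UTF-8 is assumed here; the source does not state the file encoding.
        with path.open(encoding="utf-8") as fh:
            results[path.stem] = json.load(fh)
    return results
```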

Examples

Example 1: Web screenshot

User: Help me analyze this web page screenshot and extract all the text
You:  (detect the environment → present the three plans → user picks Plan A, Fast)
      python <skill_dir>/scripts/ocr_pipeline.py screenshot.png

Output shows regions like header, text, figure, footer with their content.

Example 2: Receipt photo

User: OCR this receipt and tell me the total amount
You:  (present the three plans → user picks Plan B, Quality → install, then run)
      pip install rapidocr onnxruntime-directml
      python <skill_dir>/scripts/ocr_pipeline.py receipt.jpg --engine rapidocr

Find the line containing the total amount (often in a text or table region).
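
A rough way to shortlist candidate lines is to look for a total-like keyword next to a number. The keyword list and the regular expression below are assumptions chosen for illustration, so the matches are candidates to check rather than a reliable answer. For example, "Subtotal" also contains "total".

```python
import re

# Hypothetical keyword list covering English and Chinese receipts.
KEYWORDS = ("total", "合计", "总计", "总金额")
AMOUNT = re.compile(r"\d+(?:[.,]\d+)?")


def find_total_lines(result):
    """Return (region_type, text) for lines that mention a total and a number."""
    hits = []
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            text = line.get("text", "")
            has_keyword = any(k in text.lower() for k in KEYWORDS)
            if has_keyword and AMOUNT.search(text):
                hits.append((region.get("type", "unknown"), text))
    return hits
```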

Example 3: Chinese document

User: Extract the Chinese text from this image for me
You:  (present the three plans → user picks Plan C, Compatible → run directly)
      python <skill_dir>/scripts/ocr_pipeline.py document.png --lang ch --engine tesseract

Notes

RapidOCR is the primary engine (based on ONNX Runtime, no PyTorch needed).
rapid-layout is optional but required for layout region classification.
Tesseract is the compatibility option (Plan C). It provides bounding boxes but NO layout classification.
The script runs locally (no API calls needed).
CPU mode is typically fast (1-5 seconds per image on a typical laptop). GPU acceleration via DirectML is auto-detected when onnxruntime-directml is installed instead of onnxruntime. DirectML is a Windows technology, so this applies to Windows only.