| name | ocr |
| description | Extract text and layout structure from images using OCR (Optical Character Recognition). Use this skill whenever the user provides or mentions an image file (screenshot, photo, scanned document, receipt, whiteboard photo, PDF page, etc.) and wants to extract text from it, understand its layout, or get structured information about what's on the image. This skill handles both Chinese and English text, and can identify layout regions (navigation bars, headers, footers, main content, tables, figures, etc.). Also use when the user says "OCR", "提取文字", "图片里的文字", "识别截图", "text from image", "scan this", "read this image" or similar phrases. Do NOT use for pure image generation, image editing, or image classification tasks that don't involve text extraction. |
OCR Skill — Text Extraction with Layout Analysis
This skill extracts text and layout information from images using a local OCR pipeline.
It calls scripts/ocr_pipeline.py to do the actual work.
Workflow
When the user provides an image file or asks you to OCR an image, follow these steps:
Step 1: Locate the image
Use the path the user provides. Work in the user's current working directory. Do NOT copy or move the file.
Step 2: Choose OCR engine (ask the user)
Do NOT auto-install or auto-fallback. The user decides which engine to
use. Your job is to present the options clearly and let them choose.
First, detect what's already available on their system:
rapidocr_ok() { python -c "from rapidocr import RapidOCR" 2>/dev/null && echo "yes" || echo "no"; }
tesseract_ok() { command -v tesseract >/dev/null 2>&1 && echo "yes" || echo "no"; }
echo "rapidocr:$(rapidocr_ok) tesseract:$(tesseract_ok)"
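Where a shell is inconvenient, the same availability check can be sketched in Python. This is a minimal sketch, not part of the skill's scripts: the function name `detect_engines` is hypothetical, and `find_spec` only confirms that the `rapidocr` package can be located, which is a slightly weaker test than the import performed by the shell snippet above.

```python
import importlib.util
import shutil


def detect_engines() -> dict:
    """Report which OCR engines appear usable, without installing anything."""
    return {
        # find_spec locates the package on the import path but does not import it.
        "rapidocr": importlib.util.find_spec("rapidocr") is not None,
        # Plan C only needs the tesseract binary to be on PATH.
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    status = detect_engines()
    print(" ".join(f"{name}:{'yes' if ok else 'no'}" for name, ok in status.items()))
```

The result is only used to inform the comparison table; the choice of engine still belongs to the user.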
Then present a comparison table via the question tool with exactly these
three options. Use the descriptions verbatim:
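| Plan | Engine | Recognition quality | Layout analysis | Install time | Requirements |
|---|---|---|---|---|---|
| A Quick | Use an existing engine (auto-detected) | Depends on the installed engine | Depends on the installed engine | 0 s | Installs nothing; uses whatever is available |
| B Quality | RapidOCR + rapid-layout | ★★★ Highest | ✅ Yes (15 region types: header, title, table, list, etc.) | ~1-2 min | pip install rapidocr onnxruntime-directml; optionally rapid-layout |
| C Compatible | Tesseract CLI | ★★☆ Medium | ❌ None (everything labeled text) | ~10 s | System tesseract-ocr package (installable on almost any platform) |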
After the user chooses, execute accordingly:
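- Plan A: run the script directly without --engine (the script auto-detects an available engine).
- Plan B: if RapidOCR is not installed, run the install command first, then use --engine rapidocr.
- Plan C: if Tesseract is not installed, run the install command first, then use --engine tesseract.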
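Install commands (choose according to the user's platform):
pip install rapidocr onnxruntime-directml  # Plan B
pip install rapid-layout  # Plan B, optional: enables layout classification
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-eng  # Plan C, Debian/Ubuntu
winget install -e --id UB-Mannheim.TesseractOCR  # Plan C, Windows
brew install tesseract  # Plan C, macOS (Homebrew)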
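If installation fails, tell the user immediately and suggest another plan; do not fall back automatically.
Important: install only the packages listed in the table. Do not additionally install pytesseract or Pillow.
Plan C needs only the tesseract CLI binary, not the Python bindings.
The "note" field of the output JSON records which engine was actually used and why.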
Step 3: Run the OCR pipeline
python <skill_dir>/scripts/ocr_pipeline.py <image_path> [--lang en,ch] [--output result.json] [--engine rapidocr]
<skill_dir> is the directory containing this SKILL.md file (the skill root).
--lang specifies languages, comma-separated. Default is ch,en.
--output writes results to a JSON file. If omitted, writes <image_stem>.json.
--engine {auto,rapidocr,tesseract} forces a specific engine. Default auto
uses the first available engine. Use this when the user chose a specific plan.
--no-layout disables layout region classification (RapidOCR only). Tesseract
always returns all text as type "text" regardless of this flag.
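The mapping from the user's plan to the command line can be sketched as follows. This is an illustrative helper, not part of the skill: `build_command` and `PLAN_ENGINE` are hypothetical names, and only the flags documented above (`--lang`, `--output`, `--engine`, `--no-layout`) are used.

```python
import sys
from pathlib import Path

# Plan letter -> value for --engine; None means "omit the flag" (Plan A).
PLAN_ENGINE = {"A": None, "B": "rapidocr", "C": "tesseract"}


def build_command(skill_dir, image, plan="A", lang=None, output=None, no_layout=False):
    """Assemble the argv list for scripts/ocr_pipeline.py from the chosen plan."""
    engine = PLAN_ENGINE[plan.upper()]  # raises KeyError for an unknown plan letter
    script = Path(skill_dir) / "scripts" / "ocr_pipeline.py"
    cmd = [sys.executable, str(script), str(image)]
    if lang:
        cmd += ["--lang", ",".join(lang)]
    if output:
        cmd += ["--output", str(output)]
    if engine:
        cmd += ["--engine", engine]
    if no_layout:
        cmd.append("--no-layout")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a string avoids quoting problems with image paths that contain spaces.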
Step 4: Present the results
The JSON output contains text grouped by layout region with bounding boxes.
Use this to answer the user's question:
- Summarize what text was found
- Point out specific layout regions (header, table, footer, etc.)
- Answer questions about specific text in the image
- If the user asks for a specific output format, transform the JSON accordingly
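One way to prepare such a summary is to group the recognized lines by region type. The sketch below assumes only the `regions`, `type`, `lines`, and `text` keys described under Output Format; the function name `text_by_region` and the sample data are made up for illustration.

```python
from collections import defaultdict


def text_by_region(result: dict) -> dict:
    """Group recognized line text by region type, preserving document order."""
    grouped = defaultdict(list)
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            grouped[region.get("type", "unknown")].append(line["text"])
    return dict(grouped)


# Illustrative input only; real results come from the pipeline's JSON file.
sample = {
    "regions": [
        {"type": "header", "lines": [{"text": "ACME Store"}]},
        {"type": "text", "lines": [{"text": "Thank you"}, {"text": "Come again"}]},
    ]
}

for region_type, lines in text_by_region(sample).items():
    print(f"[{region_type}] " + " / ".join(lines))
```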
Step 5: If something goes wrong
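- Plan A but no engine available: tell the user "no OCR engine is available in the current environment" and suggest switching to Plan B (install RapidOCR) or Plan C (install Tesseract).
- "Requested engine 'X' is not installed": the engine for the chosen plan is missing. Run the install command; if installation fails, tell the user and suggest another plan.
- "No OCR engine found": none of the three plans can run. Suggest installing Tesseract (Plan C, the fastest to deploy).
- "File not found": verify that the image path exists and is readable.
- Empty regions / no text detected: the image may be low quality, or the language might be wrong. Try --lang with the correct language code.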
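The error cases above can be turned into a small lookup. This is a sketch under the assumption that the script's messages contain the quoted phrases as substrings; the names `ADVICE` and `advise` are hypothetical.

```python
# (substring of the error message, suggested next step), checked in order.
ADVICE = [
    ("Requested engine", "Run the install command for the chosen plan; if it fails, suggest another plan."),
    ("No OCR engine found", "No engine is available; suggest installing Tesseract (Plan C)."),
    ("File not found", "Check that the image path exists and is readable."),
]


def advise(message: str) -> str:
    """Return the suggested next step for a pipeline error message."""
    for needle, advice in ADVICE:
        if needle in message:
            return advice
    return "Unrecognized error; show the message to the user."
```

For example, `advise("Requested engine 'rapidocr' is not installed")` returns the install-then-retry suggestion. Nothing here installs or switches engines by itself, which keeps the decision with the user.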
Output Format
{
"engine": "rapidocr | tesseract",
"image": "filename.jpg",
"width": 1920,
"height": 1080,
"note": "rapid-layout not available (pip install rapid-layout). No layout classification.",
"regions": [
{
"type": "header|title|text|figure|figure_caption|table|table_caption|footer|reference|formula|list|page_number|footnote|abstract|unknown",
"bbox": [x1, y1, x2, y2],
"lines": [
{
"text": "actual text content",
"bbox": [x1, y1, x2, y2],
"bbox_poly": [x1, y1, x2, y2, x3, y3, x4, y4],
"confidence": 0.98
}
]
}
]
}
note: explains why a fallback engine or no-layout mode was used.
type: layout category. RapidOCR with rapid-layout provides detailed classification; Tesseract returns "text" for every region.
bbox: axis-aligned bounding rectangle in [x1, y1, x2, y2] format, reported for each region and for each line.
bbox_poly: precise 4-point polygon [x1, y1, x2, y2, x3, y3, x4, y4], clockwise from top-left; present only when the RapidOCR engine is used.
confidence: recognition confidence from 0 to 1; higher is better.
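Two small helpers show how these fields can be consumed. They are a sketch based only on the field descriptions above; the function names and the 0.8 threshold are arbitrary choices for illustration.

```python
def poly_to_bbox(poly):
    """Collapse an 8-number polygon [x1, y1, ..., x4, y4] into [x_min, y_min, x_max, y_max]."""
    xs, ys = poly[0::2], poly[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def low_confidence_lines(result, threshold=0.8):
    """Return (text, confidence) pairs whose confidence is below the threshold."""
    return [
        (line["text"], line["confidence"])
        for region in result.get("regions", [])
        for line in region.get("lines", [])
        # Lines without a confidence value are treated as fully confident.
        if line.get("confidence", 1.0) < threshold
    ]
```

`poly_to_bbox` is useful for slightly rotated text, where the polygon is tighter than the rectangle; `low_confidence_lines` gives a quick list of text worth double-checking before quoting it to the user.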
Multiple images
If the user provides multiple images, pass them all to the script:
python <skill_dir>/scripts/ocr_pipeline.py img1.jpg img2.jpg --output-dir ./ocr_results
Each image gets its own <stem>.json file in --output-dir.
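Because each output file is named after the image stem, two inputs with the same stem (for example a.jpg and a.png) would map to the same file name. The sketch below predicts the output paths and flags such collisions before running the script; how the script itself handles a collision is not documented here, and both function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def expected_outputs(images, output_dir):
    """Map each input image to the <stem>.json path expected in output_dir."""
    out = Path(output_dir)
    return {str(img): out / f"{Path(img).stem}.json" for img in images}


def stem_collisions(images):
    """Return the stems shared by more than one input image, sorted."""
    counts = Counter(Path(img).stem for img in images)
    return sorted(stem for stem, n in counts.items() if n > 1)
```

If `stem_collisions` returns anything, running the colliding images in separate invocations with different --output-dir values sidesteps the question.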
Examples
Example 1: Web screenshot
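User: "帮我分析这个网页截图,提取所有文字" (Analyze this web page screenshot and extract all the text)
You: (detect the environment → present the three plans → user picks Plan A, Quick)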
python <skill_dir>/scripts/ocr_pipeline.py screenshot.png
Output shows regions like header, text, figure, footer with their content.
Example 2: Receipt photo
User: "OCR这张收据,告诉我总金额是多少" (OCR this receipt and tell me the total amount)
You: (present the three plans → user picks Plan B, Quality → install, then run)
pip install rapidocr onnxruntime-directml
python <skill_dir>/scripts/ocr_pipeline.py receipt.jpg --engine rapidocr
Find the line containing the total amount (often in a text or table region).
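Locating the total can be sketched as a keyword search over the recognized lines. This is a heuristic, not the skill's own logic: the keyword lists and the function name `find_total` are assumptions, and real receipts may need more keywords or currency handling.

```python
import re

TOTAL_KEYWORDS = ("total", "合计", "总计", "总金额")
SKIP_KEYWORDS = ("subtotal", "小计")


def find_total(result):
    """Return the last number on the first line that mentions a total, else None."""
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            lowered = line["text"].lower()
            if any(k in lowered for k in SKIP_KEYWORDS):
                continue  # "subtotal" contains "total", so rule it out first
            if any(k in lowered for k in TOTAL_KEYWORDS):
                # Drop ASCII thousands separators, then take the last number on the line.
                numbers = re.findall(r"\d+(?:\.\d+)?", lowered.replace(",", ""))
                if numbers:
                    return float(numbers[-1])
    return None
```

When the function returns None, or the matching line has a low confidence value, it is safer to show the user the candidate lines than to state an amount.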
Example 3: Chinese document
User: "这张图片里的中文文字帮我提取出来" (Extract the Chinese text from this image)
You: (present the three plans → user picks Plan C, Compatible → run directly)
python <skill_dir>/scripts/ocr_pipeline.py document.png --lang ch --engine tesseract
Notes
- RapidOCR is the primary engine (based on ONNX Runtime, no PyTorch needed).
rapid-layout is optional but required for layout region classification.
- Tesseract is the compatibility option (Plan C).
It provides bounding boxes but NO layout classification.
- The script runs locally (no API calls needed).
- CPU mode typically takes 1-5 seconds per image on a typical laptop.
GPU acceleration via DirectML is auto-detected when onnxruntime-directml is
installed in place of onnxruntime; DirectML is a Windows technology, so this applies to Windows only.