ocr

Updated May 26, 2026 at 19:22

Extract text and layout structure from images using OCR (Optical Character Recognition). Use this skill whenever the user provides or mentions an image file (screenshot, photo, scanned document, receipt, whiteboard photo, PDF page, etc.) and wants to extract text from it, understand its layout, or get structured information about what's on the image. This skill handles both Chinese and English text, and can identify layout regions (navigation bars, headers, footers, main content, tables, figures, etc.). Also use when the user says "OCR", "提取文字", "图片里的文字", "识别截图", "text from image", "scan this", "read this image" or similar phrases. Do NOT use for pure image generation, image editing, or image classification tasks that don't involve text extraction.

Source

lazyst

lazyst/my-skills

OCR Skill — Text Extraction with Layout Analysis

This skill extracts text and layout information from images using a local OCR pipeline. It calls scripts/ocr_pipeline.py to do the actual work.

Workflow

When the user provides an image file or asks you to OCR an image, follow these steps:

Step 1: Locate the image

Use the path the user provides. Work in the user's current working directory. Do NOT copy or move the file.

Step 2: Choose OCR engine (ask the user)

Do NOT auto-install or auto-fallback. The user decides which engine to use. Your job is to present the options clearly and let them choose.

First, detect what's already available on their system:

# Discard stdout as well: `command -v` prints the binary path, which would otherwise leak into the result
rapidocr_ok() { python -c "from rapidocr import RapidOCR" >/dev/null 2>&1 && echo "yes" || echo "no"; }
tesseract_ok() { command -v tesseract >/dev/null 2>&1 && echo "yes" || echo "no"; }
echo "rapidocr:$(rapidocr_ok) tesseract:$(tesseract_ok)"
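Where a shell is not convenient, the same check can be sketched in Python. This is a minimal sketch, not part of the skill: `detect_engines` is a hypothetical helper name, and it only checks that the `rapidocr` package can be located, which is slightly weaker than actually importing `RapidOCR` as the shell version does.

```python
import importlib.util
import shutil


def detect_engines() -> dict[str, bool]:
    """Report which OCR engines appear to be installed (hypothetical helper)."""
    return {
        # find_spec locates the package without importing it
        "rapidocr": importlib.util.find_spec("rapidocr") is not None,
        # shutil.which plays the role of `command -v` for the tesseract binary
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    found = detect_engines()
    print(" ".join(f"{name}:{'yes' if ok else 'no'}" for name, ok in found.items()))
```

The printed line has the same `rapidocr:yes tesseract:no` shape as the shell output, so either form can feed the comparison step.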

Then present a comparison table via the question tool with exactly these three options. Use the descriptions verbatim:

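Plan	Engine	Recognition quality	Layout analysis	Install time	Requirements
A Quick	Use an already-installed engine (auto-detected)	Depends on the installed engine	Depends on the installed engine	0 seconds	Installs nothing; uses whichever engine is present
B Quality	RapidOCR + rapid-layout	★★★ Highest	✅ Yes (15 region types: header, title, table, list, etc.)	~1-2 minutes	`pip install rapidocr onnxruntime-directml`; optional `rapid-layout`
C Compatibility	Tesseract CLI	★★☆ Medium	❌ No (everything labeled text)	~10 seconds	System `tesseract-ocr` package (installable on almost any platform)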

After the user chooses, execute accordingly:

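Plan A: run directly without the --engine flag (the script auto-detects an available engine).
Plan B: if RapidOCR is not installed, run the install command first, then use --engine rapidocr.
Plan C: if Tesseract is not installed, run the install command first, then use --engine tesseract.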

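Install commands (pick the ones that match the user's platform):

# RapidOCR (Plan B)
# onnxruntime-directml is the Windows DirectML build; on Linux/macOS use plain onnxruntime instead
pip install rapidocr onnxruntime-directml
pip install rapid-layout  # optional, enables layout analysis

# Tesseract CLI (Plan C), Linux (Debian/Ubuntu)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-eng
# Windows
winget install -e --id UB-Mannheim.TesseractOCR
# macOS
brew install tesseract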

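If installation fails, tell the user immediately and suggest a different plan. Do not fall back automatically.

Important: install only the packages listed in the table. Do not additionally install pytesseract or Pillow. Plan C needs only the tesseract CLI binary, not the Python bindings.

The "note" field of the output JSON records which engine was actually used and why.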

Step 3: Run the OCR pipeline

python <skill_dir>/scripts/ocr_pipeline.py <image_path> [--lang en,ch] [--output result.json] [--engine rapidocr]

<skill_dir> is the directory containing this SKILL.md file (the skill root).
--lang specifies languages, comma-separated. Default is ch,en.
--output writes results to a JSON file. If omitted, writes <image_stem>.json.
--engine {auto,rapidocr,tesseract} forces a specific engine. Default auto uses the first available engine. Use this when the user chose a specific plan.
--no-layout disables layout region classification (RapidOCR only). Tesseract always returns all text as type "text" regardless of this flag.
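The options above can be assembled into a command line programmatically. The sketch below is illustrative only: `build_ocr_command` is a hypothetical helper, and it uses just the flags documented in this step.

```python
import sys
from pathlib import Path

VALID_ENGINES = ("auto", "rapidocr", "tesseract")


def build_ocr_command(skill_dir, image_path, engine="auto", lang=None,
                      output=None, no_layout=False):
    """Build the argv list for scripts/ocr_pipeline.py (hypothetical helper)."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    script = Path(skill_dir) / "scripts" / "ocr_pipeline.py"
    cmd = [sys.executable, str(script), str(image_path)]
    if engine != "auto":
        # Plan A omits --engine so the script auto-detects
        cmd += ["--engine", engine]
    if lang:
        # --lang takes a comma-separated list, e.g. ch,en
        cmd += ["--lang", ",".join(lang)]
    if output:
        cmd += ["--output", str(output)]
    if no_layout:
        cmd.append("--no-layout")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with image paths that contain spaces.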

Step 4: Present the results

The JSON output contains text grouped by layout region with bounding boxes. Use this to answer the user's question:

Summarize what text was found
Point out specific layout regions (navigation, content, footer, etc.)
Answer questions about specific text in the image
If the user asks for a specific output format, transform the JSON accordingly
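As one way to prepare such a summary, the recognized text can be grouped by layout region type. This is a minimal sketch that assumes the JSON structure described in the Output Format section; `text_by_region_type` and the sample data are made up for illustration.

```python
from collections import defaultdict


def text_by_region_type(result: dict) -> dict[str, list[str]]:
    """Collect line texts under their region type, in reading order of the JSON."""
    grouped = defaultdict(list)
    for region in result.get("regions", []):
        region_type = region.get("type", "unknown")
        for line in region.get("lines", []):
            grouped[region_type].append(line["text"])
    return dict(grouped)


# Made-up sample in the documented output shape
sample = {
    "regions": [
        {"type": "header", "lines": [{"text": "Home", "confidence": 0.99}]},
        {"type": "text", "lines": [{"text": "Hello", "confidence": 0.97},
                                   {"text": "World", "confidence": 0.95}]},
    ]
}

summary = text_by_region_type(sample)
for region_type, lines in summary.items():
    print(f"{region_type}: {' '.join(lines)}")
```

With the Tesseract engine every region is typed "text", so the grouping collapses to a single key.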

Step 5: If something goes wrong

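Plan A but no engine available: tell the user "No OCR engine is available in the current environment" and suggest switching to Plan B (install RapidOCR) or Plan C (install Tesseract).
"Requested engine 'X' is not installed": the engine for the plan the user chose is missing. Run the install command; if installation fails, tell the user and suggest another plan.
"No OCR engine found": none of the three plans can run. Suggest installing Tesseract (Plan C, the fastest to deploy).
"File not found": verify the image path exists and is readable.
Empty regions / no text detected: the image may be low quality, or the language might be wrong. Try --lang with the correct language code.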
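The message-to-action mapping in this step can be sketched as a small lookup. This assumes the script emits the quoted messages literally in its error output, which the text above suggests but does not confirm; `suggest_next_step` is a hypothetical helper.

```python
def suggest_next_step(error_text: str) -> str:
    """Map a pipeline error message to the follow-up described in this step."""
    if "No OCR engine found" in error_text:
        return "Suggest installing Tesseract (Plan C)."
    if "Requested engine" in error_text and "is not installed" in error_text:
        return "Run the install command for the chosen plan; if it fails, suggest another plan."
    if "File not found" in error_text:
        return "Verify that the image path exists and is readable."
    # Anything unrecognized is passed on rather than guessed at
    return "Report the error to the user as-is."
```

Note that none of the branches falls back to another engine on its own, in keeping with Step 2.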

Output Format

{
  "engine": "rapidocr | tesseract",
  "image": "filename.jpg",
  "width": 1920,
  "height": 1080,
  "note": "rapid-layout not available (pip install rapid-layout). No layout classification.",
  "regions": [
    {
      "type": "header|title|text|figure|figure_caption|table|table_caption|footer|reference|formula|list|page_number|footnote|abstract|unknown",
      "bbox": [x1, y1, x2, y2],
      "lines": [
        {
          "text": "actual text content",
          "bbox": [x1, y1, x2, y2],
          "bbox_poly": [x1, y1, x2, y2, x3, y3, x4, y4],
          "confidence": 0.98
        }
      ]
    }
  ]
}

note: explains why a fallback engine or no-layout mode was used.
type: layout category. RapidOCR with rapid-layout provides detailed classification; the Tesseract engine returns "text" for all regions.
bbox: axis-aligned bounding box in [x1, y1, x2, y2] format. With RapidOCR, a line's bbox is also given as an axis-aligned rectangle; the exact quadrilateral is in bbox_poly.
bbox_poly: precise 4-point polygon [x1, y1, x2, y2, x3, y3, x4, y4], clockwise from top-left; only available with the RapidOCR engine.
confidence: recognition confidence (0 to 1, higher is better).
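To make the relationship between the fields concrete, the sketch below collapses a bbox_poly into an axis-aligned bbox and lists lines with low confidence. Both helper names and the 0.8 threshold are assumptions for illustration, not values used by the script.

```python
def poly_to_bbox(poly):
    """Collapse [x1, y1, ..., x4, y4] into [x_min, y_min, x_max, y_max]."""
    xs, ys = poly[0::2], poly[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def low_confidence_lines(result, threshold=0.8):
    """Return the text of every line whose confidence is below the threshold."""
    return [
        line["text"]
        for region in result.get("regions", [])
        for line in region.get("lines", [])
        # A missing confidence is treated as 0.0, i.e. always flagged
        if line.get("confidence", 0.0) < threshold
    ]


# A slightly rotated line: the polygon is tighter than its enclosing rectangle
print(poly_to_bbox([10, 20, 110, 22, 109, 52, 9, 50]))
```

Flagged lines are good candidates to double-check against the image before quoting them to the user.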

Multiple images

If the user provides multiple images, pass them all to the script:

python <skill_dir>/scripts/ocr_pipeline.py img1.jpg img2.jpg --output-dir ./ocr_results

Each image gets its own <stem>.json file in --output-dir.
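After a batch run, the per-image files can be loaded back in one pass. A minimal sketch, assuming the files are UTF-8 encoded JSON and that nothing else with a .json suffix lives in the output directory; `load_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_results(output_dir):
    """Load every <stem>.json in output_dir into a dict keyed by image stem."""
    results = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            results[path.stem] = json.load(fh)
    return results
```

For the command above, `load_results("./ocr_results")` would return a dict with the keys `img1` and `img2`.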

Examples

Example 1: Web screenshot

User: Help me analyze this web page screenshot and extract all the text
You:  (detect the environment → present the three plans → user picks Plan A, Quick)
      python <skill_dir>/scripts/ocr_pipeline.py screenshot.png

Output shows regions like header, text, figure, footer with their content.

Example 2: Receipt photo

User: OCR this receipt and tell me the total amount
You:  (present the three plans → user picks Plan B, Quality → install, then run)
      pip install rapidocr onnxruntime-directml
      python <skill_dir>/scripts/ocr_pipeline.py receipt.jpg --engine rapidocr

Find the line containing the total amount (often in a text or table region).
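One simple way to locate that line is a keyword search over the recognized text. The sketch below is a heuristic under stated assumptions: the keyword list, the amount pattern, and the `find_total` name are all illustrative, and real receipts will often need a manual check.

```python
import re

# \btotal\b skips "Subtotal"; the Chinese keywords are common labels for a total
TOTAL_KEYWORD = re.compile(r"\btotal\b|合计|总计|总金额", re.IGNORECASE)
AMOUNT = re.compile(r"\d[\d,]*(?:\.\d+)?")


def find_total(result):
    """Return the first amount that follows a 'total' keyword, or None."""
    for region in result.get("regions", []):
        for line in region.get("lines", []):
            keyword = TOTAL_KEYWORD.search(line["text"])
            if keyword is None:
                continue
            # Only look at the text after the keyword, where the amount usually sits
            amount = AMOUNT.search(line["text"][keyword.end():])
            if amount:
                return amount.group()
    return None


receipt = {"regions": [{"type": "text", "lines": [
    {"text": "Subtotal 40.00"},
    {"text": "Total: 42.80"},
]}]}
print(find_total(receipt))
```

If the label and the amount are recognized as separate lines, this returns None, and comparing the bbox rows of nearby lines would be the next thing to try.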

Example 3: Chinese document

User: Extract the Chinese text from this image for me
You:  (present the three plans → user picks Plan C, Compatibility → run directly)
      python <skill_dir>/scripts/ocr_pipeline.py document.png --lang ch --engine tesseract

Notes

RapidOCR is the primary engine (based on ONNX Runtime, no PyTorch needed).
rapid-layout is optional but required for layout region classification.
Tesseract is the compatibility option (Plan C). It provides bounding boxes but NO layout classification.
The script runs locally (no API calls needed).
CPU inference typically takes 1-5 seconds per image on a typical laptop. To enable GPU acceleration through DirectML (Windows), install onnxruntime-directml instead of onnxruntime; it is then detected automatically.
