extracting-with-ocr

Stars: 5

Forks: 0

Updated: June 8, 2026 at 18:00

Use when extracting text from scanned PDFs, photographed pages, or images that have no embedded text layer. Covers OCR backends, language packs, force-OCR, and performance tuning.


Source

xberg-io

xberg-io/plugins


SKILL.md


name	extracting-with-ocr
description	Use when extracting text from scanned PDFs, photographed pages, or images that have no embedded text layer. Covers OCR backends, language packs, force-OCR, and performance tuning.

Extracting with OCR

Use this when a document is image-based: scanned PDFs, photographed pages, screenshots, or JPEG/PNG/TIFF files containing text. Kreuzberg auto-OCRs raster images and auto-detects PDFs that lack a text layer. Force OCR on when extraction returns empty or garbled text from a PDF that "looks" textual.

When to force OCR

Extraction returned an empty content field, but the file opens visually.
The PDF text layer is junk (copy-paste from a viewer produces gibberish).
You want consistent output across mixed scanned + digital PDFs.

kreuzberg extract scan.pdf --force-ocr=true
kreuzberg extract scan.pdf --ocr=true --ocr-language eng

If a page has an unreliable text layer, --force-ocr=true re-rasterizes and runs OCR on every page.
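
The "empty content" case above can be handled with a retry. The following is a minimal sketch, assuming that `kreuzberg extract` prints the extracted text to stdout (the source does not state this); `extract_with_fallback` is a hypothetical helper name, not part of the CLI.

```shell
# Sketch only: assumes `kreuzberg extract` writes extracted text to stdout.
# extract_with_fallback is a hypothetical helper, not part of the CLI.
extract_with_fallback() {
  local file=$1 text
  text=$(kreuzberg extract "$file") || return 1
  # Treat whitespace-only output as "no usable text layer".
  if [ -z "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
    text=$(kreuzberg extract "$file" --force-ocr=true) || return 1
  fi
  printf '%s\n' "$text"
}
```

This only catches empty output. A junk text layer that yields gibberish still needs a manual `--force-ocr=true`.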

Backends

Tesseract is the default and ships with the CLI — no extra install. Other backends are opt-in:

Backend	Flag	Install	Notes
Tesseract	`--ocr-backend tesseract` (default)	bundled	Best general-purpose, 100+ languages via tessdata.
PaddleOCR	`--ocr-backend paddle-ocr`	bundled (ONNX Runtime)	Strong on Asian scripts. Not available on WASM or Windows.
EasyOCR	`--ocr-backend easyocr`	Python binding (`pip install kreuzberg[easyocr]`)	Heavier model. CUDA accel via `easyocr_kwargs={"gpu": True}`.
VLM (vision)	layout + a multimodal LLM via config	configured per backend	Use when OCR fails on dense or handwritten layouts.

Pick Tesseract first. Switch only when accuracy is unacceptable.
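
To judge whether switching is worthwhile, it can help to run the same file through more than one backend and compare the results. A minimal sketch, again assuming that `kreuzberg extract` writes text to stdout; `compare_backends` and the output file naming are hypothetical.

```shell
# Sketch only: assumes `kreuzberg extract` writes extracted text to stdout.
# compare_backends is a hypothetical helper, not part of the CLI.
compare_backends() {
  local file=$1 outdir=$2 backend base
  shift 2
  base=$(basename "${file%.*}")
  mkdir -p "$outdir"
  for backend in "$@"; do
    # One output file per backend, e.g. out/scan.tesseract.txt
    kreuzberg extract "$file" --ocr=true --ocr-backend "$backend" \
      > "$outdir/$base.$backend.txt"
  done
}

# Usage:
# compare_backends scan.pdf out tesseract paddle-ocr
# diff out/scan.tesseract.txt out/scan.paddle-ocr.txt
```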

Language packs

Tesseract uses three-letter ISO 639-2 codes. The default is eng. Combine languages with +:

kreuzberg extract menu.jpg --ocr=true --ocr-language "eng+deu"
kreuzberg extract bilingual.pdf --ocr-language "eng+jpn"
kreuzberg extract any.pdf --ocr-language all   # all installed packs
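
When the language list comes from a script rather than being typed by hand, the +-joined value can be built with a small helper. This is a sketch; `join_langs` is a hypothetical name.

```shell
# join_langs is a hypothetical helper: joins its arguments with "+".
join_langs() {
  local IFS=+
  # "$*" joins the arguments using the first character of IFS.
  printf '%s\n' "$*"
}

# Usage (same flags as the commands above):
# kreuzberg extract menu.jpg --ocr=true --ocr-language "$(join_langs eng deu)"
```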

Install missing packs at the OS level:

# macOS
brew install tesseract-lang

# Debian/Ubuntu
sudo apt install tesseract-ocr-deu tesseract-ocr-jpn tesseract-ocr-fra

# Specific lang only
sudo apt install tesseract-ocr-<iso639-2>

Kreuzberg fails fast with a helpful error if you request a language pack that is not installed. Read the error — it names the missing file.
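
A preflight check can report missing packs before a long batch starts. The sketch below relies on `tesseract --list-langs` and assumes that a system `tesseract` binary is on the PATH and reads the same tessdata directory that Kreuzberg uses, which may not hold for every install; `missing_langs` is a hypothetical helper.

```shell
# Sketch only: assumes the system `tesseract` shares its tessdata with Kreuzberg.
# missing_langs is a hypothetical helper; it prints each requested code
# that does not appear in `tesseract --list-langs`.
missing_langs() {
  local installed lang
  # The first output line is a header, so skip it.
  installed=$(tesseract --list-langs 2>/dev/null | tail -n +2)
  for lang in "$@"; do
    printf '%s\n' "$installed" | grep -Fqx -- "$lang" || printf '%s\n' "$lang"
  done
}

# Usage:
# missing_langs eng deu jpn
```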

Useful flags

--ocr=true — enable OCR (auto-enabled for images and scanned PDFs).
--force-ocr=true — OCR every page even if a text layer exists.
--disable-ocr=true — never OCR (extract embedded text only or fail).
--ocr-language <lang> — single code or +-joined list, or all.
--ocr-backend <tesseract|paddle-ocr|easyocr> — pick backend.
--ocr-auto-rotate=true — pre-rotate via the auto-rotate model.
--acceleration <cpu|coreml|cuda|tensorrt|auto> — ONNX accelerator for paddle-ocr / auto-rotate / layout models.

Performance tips

Cache is on by default. Repeated extraction of the same file + config is instant. Do not pass --no-cache=true unless you have a reason.
For batch OCR, use kreuzberg batch *.pdf --ocr=true; the internal worker pool parallelizes across CPU cores. Cap it with --max-concurrent N if memory is tight.
Raise --target-dpi above its default of 300 only for low-resolution scans. Higher DPI is slower, and 200 is usually enough for printed text.
Enable --ocr-auto-rotate=true only when pages may be rotated; the classifier adds latency.
On Apple Silicon, --acceleration coreml typically beats CPU for paddle-ocr and layout detection.
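
For the --max-concurrent cap, one way to pick N is to derive it from the core count. This is a sketch under an assumption: halving the cores is an arbitrary starting point for memory-constrained hosts, not a rule from Kreuzberg, and `pick_concurrency` is a hypothetical helper.

```shell
# pick_concurrency is a hypothetical helper. Using half the cores is an
# arbitrary starting point for memory-constrained hosts, not a Kreuzberg rule.
pick_concurrency() {
  local cores n
  # Use the first argument if given, otherwise ask the OS for the core count.
  cores=${1:-$(getconf _NPROCESSORS_ONLN)}
  n=$((cores / 2))
  [ "$n" -ge 1 ] || n=1
  printf '%s\n' "$n"
}

# Usage (flags taken from the tips above):
# kreuzberg batch *.pdf --ocr=true --max-concurrent "$(pick_concurrency)"
```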

Config file alternative

Long flag chains belong in kreuzberg.toml — auto-discovered from cwd upward.

force_ocr = true
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng+deu"
auto_rotate = true

Then just run:

kreuzberg extract document.pdf

Common failure modes

"missing tessdata" — install the language pack at OS level (see above).
Empty content on a scanned PDF without --force-ocr — the file has a bogus zero-width text layer. Re-run with --force-ocr=true.
OCR on a rotated page — add --ocr-auto-rotate=true or pre-rotate.
Garbled CJK output — ensure the right language pack is installed and passed via --ocr-language; consider paddle-ocr for Chinese/Japanese.

See references/cli-reference.md and references/configuration.md in the sibling kreuzberg skill for the full flag and config schema.
