ocr-document

Extracts text and tables from PDFs and images using PyMuPDF, pdfplumber, and Tesseract OCR (Norwegian/English), with auto/text/ocr/hybrid methods and page selection. Use when reading, OCR-ing, or extracting text/tables from a PDF, scanned document, or image.

Install command

npx skills add https://github.com/JansenAnalytics/claudex --skill ocr-document

Copy and paste this command into Claude Code to install the skill

Source

JansenAnalytics/claudex

Stars: 4

Forks: 1

Updated: June 3, 2026 at 20:50

SKILL.md


description	Extracts text and tables from PDFs and images using PyMuPDF, pdfplumber, and Tesseract OCR (Norwegian/English), with auto/text/ocr/hybrid methods and page selection. Use when reading, OCR-ing, or extracting text/tables from a PDF, scanned document, or image.
name	ocr-document
triggers	["PDF","OCR","document","extract text","scan","read document"]
category	media
maturity	stable
tags	["ocr","tesseract","pdf-extraction","pymupdf","norwegian"]

OCR Document Skill

Extract text from PDF/image

SKILL="${CLAUDE_SKILLS_DIR:-$HOME/.claude-agent/.claude/skills}/ocr-document/scripts"

# Auto-detect best method (text vs OCR)
python3 "$SKILL/extract.py" document.pdf

# Force OCR (for scanned documents)
python3 "$SKILL/extract.py" document.pdf --method ocr --lang nor+eng

# Extract tables (pdfplumber)
python3 "$SKILL/extract.py" document.pdf --method pdfplumber

# Specific pages
python3 "$SKILL/extract.py" document.pdf --pages 1-5

# Save to file
python3 "$SKILL/extract.py" document.pdf --output extracted.md

# OCR an image
python3 "$SKILL/extract.py" photo.jpg --lang nor+eng
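
For readers wondering how a page spec such as the 1-5 passed to --pages above could be interpreted, here is a minimal sketch. It is not taken from extract.py: the function name parse_pages, the support for comma-separated parts, and the clamping to the document length are all assumptions made for illustration.

```python
from __future__ import annotations


def parse_pages(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based page spec such as "1-5" or "2,4-6" into 0-based indices.

    Hypothetical helper: the real script may accept a different syntax.
    """
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length so an oversized range stops at the last page.
        for page in range(start, min(end, page_count) + 1):
            if page - 1 not in indices:
                indices.append(page - 1)
    return indices
```

Converting to 0-based indices up front is a design choice that suits libraries such as PyMuPDF, which index pages from zero, while the command line stays 1-based for the user.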

Download Telegram files

# Download by file_id
python3 "$SKILL/telegram_file.py" --file-id <id> --output file.pdf

# Get most recent document from a chat
python3 "$SKILL/telegram_file.py" --chat-id <your-telegram-user-id> --recent --output file.pdf
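
As background for the download-by-file_id command above, the Telegram Bot API resolves a file_id in two steps: a getFile call returns a file_path, and the file is then fetched from the /file/bot<token>/ endpoint. The sketch below illustrates that flow with the standard library only. It is an assumption about how such a download could work, not the contents of telegram_file.py; the function names and the token argument are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org"


def get_file_url(token: str, file_id: str) -> str:
    """Build the Bot API getFile URL for a given file_id."""
    query = urllib.parse.urlencode({"file_id": file_id})
    return f"{API_ROOT}/bot{token}/getFile?{query}"


def download_url(token: str, file_path: str) -> str:
    """Build the download URL from the file_path returned by getFile."""
    return f"{API_ROOT}/file/bot{token}/{urllib.parse.quote(file_path)}"


def download_by_file_id(token: str, file_id: str, output: str) -> None:
    """Resolve file_id to a file_path, then save the file to `output`."""
    with urllib.request.urlopen(get_file_url(token, file_id)) as response:
        payload = json.load(response)
    if not payload.get("ok"):
        raise RuntimeError(f"getFile failed: {payload.get('description')}")
    file_path = payload["result"]["file_path"]
    with urllib.request.urlopen(download_url(token, file_path)) as response:
        with open(output, "wb") as out:
            out.write(response.read())
```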

Methods

auto: Try direct text extraction first; fall back to OCR when little text is found
text: Direct extraction with PyMuPDF (fastest)
pdfplumber: Better for tables and structured data
ocr: Tesseract OCR on pages rendered via pdf2image (for scanned documents)
hybrid: Direct text extraction first, with OCR only on pages that need it
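
The per-page decision behind the hybrid method can be sketched as follows. This is an illustration under stated assumptions, not the logic in extract.py: the 50-character threshold, the names needs_ocr and hybrid_extract, and the ocr_page callback (standing in for a real Tesseract call) are all hypothetical.

```python
from __future__ import annotations

from typing import Callable, Sequence

MIN_CHARS = 50  # assumed threshold; the real script may use a different heuristic


def needs_ocr(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Treat a page as scanned when direct extraction yields little text."""
    return len(text.strip()) < min_chars


def hybrid_extract(
    page_texts: Sequence[str],
    ocr_page: Callable[[int], str],
    min_chars: int = MIN_CHARS,
) -> list[str]:
    """Keep directly extracted text and call OCR only for sparse pages.

    `page_texts` holds the direct-extraction result for each page, and
    `ocr_page(i)` is expected to return the OCR text for page index i.
    """
    return [
        ocr_page(index) if needs_ocr(text, min_chars) else text
        for index, text in enumerate(page_texts)
    ]
```

Passing the OCR step in as a callback keeps the slow path lazy: pages with enough embedded text never trigger rendering or recognition.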

Languages

Default: nor+eng (Norwegian + English). Change with --lang. To list the installed languages, run tesseract --list-langs.
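
Before passing a --lang value, it can help to confirm that every requested language pack is installed. The sketch below parses the output of tesseract --list-langs and reports which codes in a spec such as nor+eng are missing. The helper names are hypothetical, and the sketch assumes the language list is printed to stdout with a "List of available languages" header line, as recent Tesseract releases do.

```python
from __future__ import annotations

import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs: set[str] = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line that precedes the codes.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_langs(spec: str, available: set[str]) -> list[str]:
    """Return the codes in a spec like "nor+eng" that are not installed."""
    return [code for code in spec.split("+") if code and code not in available]


def installed_langs() -> set[str]:
    """Run tesseract and parse its language list (requires tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_list_langs(result.stdout)
```

For example, missing_langs("nor+eng", installed_langs()) returns an empty list when both packs are present.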

