تشغيل أي مهارة في Manus بنقرة واحدة

$pwd:

pdfvision

Name: Pdfvision
Author: yamadashy

// Extract text, metadata, per-page density signals, structural layout, image bounding boxes, optional OCR, and rendered page PNGs from a PDF using the pdfvision CLI. Use when the input is a `.pdf` URL, a local PDF path, or another agent skill produced a PDF that still needs structured extraction. Triggers on: 'read this pdf', 'extract from <file>.pdf', '.pdf', 'scan / slide / paper / form contents'.

تشغيل في Manus

$ git log --oneline --stat

stars:١

forks:٠

updated:١٩ مايو ٢٠٢٦ في ١٥:١٤

مستكشف الملفات

3 ملفات

SKILL.md

readonly

package.json

"author": "yamadashy"

"repository": "yamadashy/pdfvision"

فتح مستودع GitHub عرض مستودعات المنشئ

$ install --global

$ download --local

تشغيل في Manus

تشغيل أي مهارة بنقرة واحدة

name

pdfvision

description

Extract text, metadata, per-page density signals, structural layout, image bounding boxes, optional OCR, and rendered page PNGs from a PDF using the pdfvision CLI. Use when the input is a `.pdf` URL, a local PDF path, or another agent skill produced a PDF that still needs structured extraction. Triggers on: 'read this pdf', 'extract from <file>.pdf', '.pdf', 'scan / slide / paper / form contents'.

pdfvision

pdfvision extracts text + metadata + per-page density signals from any PDF, with opt-in OCR (--ocr), layout reconstruction (--layout), image bounding boxes (--image-boxes), per-text-item geometry (--geometry), and PNG rendering (--render). Cached by content hash, so the second read of the same PDF returns in ~30 ms.

Prerequisite

npx pdfvision --version

Requires Node.js >= 22.13. Install globally with npm install -g pdfvision if used repeatedly.

Always run npx pdfvision --help once before reaching for non-obvious flags — the flag set evolves (OCR, remote URLs, layout, geometry, image-boxes, render output) and the help text is the source of truth for what the installed version supports.

Quick reference

# Local PDF, markdown to stdout (per-page sections + density Overview table)
npx pdfvision /path/to/doc.pdf

# Fetch a PDF over http(s) — downloads to cache, then extracts
npx pdfvision --remote https://example.org/paper.pdf

# Page subset
npx pdfvision doc.pdf -p 1-5
npx pdfvision doc.pdf -p 1,3,5

# Programmatic / structured consumers
npx pdfvision doc.pdf -f json
npx pdfvision doc.pdf -f xml          # tag-shaped, some LLMs locate <page> faster than JSON keys

# Image-flattened / scanned page — two options:
npx pdfvision scan.pdf --ocr -f json                     # tesseract.js OCR
npx pdfvision scan.pdf --render --render-output ./images # PNG for vision LLM

# Wipe the on-disk cache
npx pdfvision --clear-cache

Format choice (markdown / json / xml) does not change the cache slot — the structured payload is shared and only re-formatted on output.

Picking the right flags

The default extraction is enough for most native-text PDFs (papers, exports from Word / Pages / Markdown tooling). Reach for opt-ins only when the default isn't enough.

Goal	Flag	When to reach for it
Reconstruct reading order, find headings	`--layout`	Multi-column papers, slides where the agent must process blocks in order
Know where images sit on the page	`--image-boxes`	Bbox overlay on rendered PNG, figure detection
Per-glyph bbox + fontSize	`--geometry`	Heading detection by font-size, custom layout heuristics
Page is an image — get text from raster	`--ocr` + `--ocr-lang`	`coverage: 0%` in the Overview, or `nonPrintableRatio >= 0.05` (text exists but is glyph-index garbage; see below). For non-English text, language order matters — primary language goes first (`jpn+eng` for Japanese-dominant, `eng+jpn` for English-dominant). Full lang combinations and confidence semantics in `references/ocr.md`.
Hand the page to a vision model	`--render` + `--render-output <dir>`	Multimodal flows. Density Overview already flagged the page as low-text
Skip the on-disk cache	`--no-cache`	Forced re-extraction. Default behaviour is cache-on

Detecting silent failures with the density Overview

When result.pages.length > 1, the markdown output starts with an Overview table that reports Chars / Images / Coverage / Size per page (plus NonPrint when any page has non-zero non-printable ratio, and Blocks when --layout was on). The JSON / XML output carries the same data in overview[] with field names charCount / imageCount / textCoverage / nonPrintableRatio / nonPrintableCount / width / height / quality — use the field names directly when grepping or filtering in code. Use the Overview before scrolling the body.

One-shot dispatch: `pages[].quality`

Each page (and each overview row) carries a derived quality field that classifies the page from the raw signals so agents don't have to reimplement the threshold logic:

quality.nativeTextStatus:
- ok — usable native text.
- unusable_glyph_indices — nonPrintableRatio >= 0.05. Text is binary garbage even though charCount looks healthy. Fall back to --render or --ocr.
- empty_but_visual_content — no native text, but the page carries images or non-blank pixels. Re-run with --ocr (or read the rendered PNG via --render).
- empty — no text, no detected visual content. Likely a genuinely blank page (or a render failure — combine with visualStatus below).
quality.visualStatus (present only when --render or --ocr ran):
- ok — renderer drew real content.
- blank — page came out effectively blank against its own dominant background. Render-pipeline failure or genuinely blank page.

pdfvision deliberately stops at observation: it does not recommend an action. The action is the agent's call based on the two statuses + the raw signals below.

Raw signals (the inputs to `quality`)

textCoverage: 0 (rendered as coverage: 0% in markdown) + imageCount > 0 → the page body is a rasterised image. The text stream is empty. Re-run with --ocr or --render.
nonPrintableRatio >= 0.05 → pdf.js fell back to raw glyph indices because the PDF's fonts lack a ToUnicode CMap (common with Hebrew, older CJK, custom symbol fonts). text reads as full coverage but is binary garbage. Values >= 0.3 are pathological; < 0.01 is normal. The raw count is in nonPrintableCount — when the 3dp ratio rounds to 0 the count still tells you whether any non-printable code points slipped through (useful for "is there ANY garbage in this page?" filters).
charCount: 0 but imageCount: 0 → genuinely blank page (separator, end matter).
Sudden drop in textCoverage on a single page in an otherwise text-dense doc → that page is likely a figure / scan / chart. Inspect with --render.
renderContentRatio <= 0.001 (when --render or --ocr was on) → the rasterised page came out blank against its own dominant background. Likely a render-pipeline failure (pdf.js + @napi-rs/canvas can't decode JPEG2000 image streams, or the font has no resolvable glyphs). The ratio is background-aware — dark book covers and beige scan paper don't false-trip it. OCR on this page returns confidence: 0 not because OCR failed but because the input was a near-uniform image.

The density signal is the reason to prefer pdfvision over reading a PDF directly — silent failures (empty text that looks fine to a downstream consumer, or full text that is actually NUL bytes) become visible up front.

Caching

Cache root: <os-tmp>/pdfvision/<content-sha>/ — macOS /var/folders/.../T/pdfvision/, Linux /tmp/pdfvision/. Override with PDFVISION_CACHE_DIR=/path.
Keyed by PDF content hash + flag combination. Same PDF + same flags → ~30 ms on the second call. Different flags (e.g. add --layout later) → different slot, fresh extraction.
Wipe everything (cached extractions, rendered PNGs, downloaded remote PDFs, OCR traineddata) with npx pdfvision --clear-cache.

Typical agent flow

Inherit the user's scope first. If the user already named a specific page or range ("page 2", "chapter 3", "the last few pages"), pass -p from step 1 — the density Overview works per page, so there's no need to scan a 100-page doc when the user pointed at page 2. Only run unscoped when the user genuinely asked about the whole document. Sections with conventional locations also help: "abstract" → -p 1, "conclusion" → -p <last-few>, "TOC" → -p 1-3.

Pick a format that matches the consumer. If the consumer is the LLM itself reading text inline (the typical "user asks me to read this PDF" case), the markdown default is already optimal — no flag needed. Switch to -f json only when a downstream programmatic step needs structured field access (overview[], pages[].layout, pages[].ocr, etc.). XML when the LLM downstream parses tags more reliably than nested JSON.

Run npx pdfvision doc.pdf (add -p <range> per the scope note, and -f json only when you'll consume structured fields) — gets text + density Overview for the selected pages.
Read the density signals (the markdown Overview table, or overview[] / pages[].textCoverage / imageCount / charCount in JSON) to find low-coverage pages.
For low-coverage pages: re-run with --ocr if text is needed, or --render if a vision model will look at the rasterised page.
For structured / multi-column docs: re-run with --layout (and --image-boxes when figure positions matter).
Cache means steps 3–4 only re-pay the cost of the new flag combination on the affected page subset, not the whole extract.

When to read `references/`

The base of this file already covers daily extraction. Open a reference file only in one of these specific cases — they are not always-on context, do not load speculatively.

Each entry is tagged as mandatory (read before producing the deliverable; this file doesn't carry enough on its own for the case) or escalation (read only if the basic guidance above isn't enough for the situation).

Read this file	Gate	When
`references/structured-output.md`	mandatory when you're consuming `--layout`, `--image-boxes`, `--geometry`, `--ocr`, or any other structured JSON / XML field whose schema isn't fully described in this file. SKILL.md only names the flags — the field-by-field shape lives in the reference.	Programmatic consumers of `-f json` / `-f xml`. Covers `DocumentResult` / `PageResult` / `LayoutBlock` / `ImageBox` / `TextSpan` / `PageOcr` schemas and coordinate-system semantics.
`references/ocr.md`	escalation for the easy cases (English-only, expected confidence). Mandatory when the user's text is non-English (lang ordering affects results), confidence is unexpectedly low, or the `tesseract.js` install / stderr is misbehaving.	Lang code combinations, primary-language ordering, traineddata cache, install diagnostics, troubleshooting (low confidence, blank PNG, stderr noise).

name

pdfvision

description

pdfvision

Prerequisite

npx pdfvision --version

Requires Node.js >= 22.13. Install globally with npm install -g pdfvision if used repeatedly.

Quick reference

# Local PDF, markdown to stdout (per-page sections + density Overview table)
npx pdfvision /path/to/doc.pdf

# Fetch a PDF over http(s) — downloads to cache, then extracts
npx pdfvision --remote https://example.org/paper.pdf

# Page subset
npx pdfvision doc.pdf -p 1-5
npx pdfvision doc.pdf -p 1,3,5

# Programmatic / structured consumers
npx pdfvision doc.pdf -f json
npx pdfvision doc.pdf -f xml          # tag-shaped, some LLMs locate <page> faster than JSON keys

# Image-flattened / scanned page — two options:
npx pdfvision scan.pdf --ocr -f json                     # tesseract.js OCR
npx pdfvision scan.pdf --render --render-output ./images # PNG for vision LLM

# Wipe the on-disk cache
npx pdfvision --clear-cache

Format choice (markdown / json / xml) does not change the cache slot — the structured payload is shared and only re-formatted on output.

Picking the right flags

The default extraction is enough for most native-text PDFs (papers, exports from Word / Pages / Markdown tooling). Reach for opt-ins only when the default isn't enough.

Goal	Flag	When to reach for it
Reconstruct reading order, find headings	`--layout`	Multi-column papers, slides where the agent must process blocks in order
Know where images sit on the page	`--image-boxes`	Bbox overlay on rendered PNG, figure detection
Per-glyph bbox + fontSize	`--geometry`	Heading detection by font-size, custom layout heuristics
Page is an image — get text from raster	`--ocr` + `--ocr-lang`	`coverage: 0%` in the Overview, or `nonPrintableRatio >= 0.05` (text exists but is glyph-index garbage; see below). For non-English text, language order matters — primary language goes first (`jpn+eng` for Japanese-dominant, `eng+jpn` for English-dominant). Full lang combinations and confidence semantics in `references/ocr.md`.
Hand the page to a vision model	`--render` + `--render-output <dir>`	Multimodal flows. Density Overview already flagged the page as low-text
Skip the on-disk cache	`--no-cache`	Forced re-extraction. Default behaviour is cache-on

Detecting silent failures with the density Overview

One-shot dispatch: `pages[].quality`

Each page (and each overview row) carries a derived quality field that classifies the page from the raw signals so agents don't have to reimplement the threshold logic:

quality.nativeTextStatus:
- ok — usable native text.
- unusable_glyph_indices — nonPrintableRatio >= 0.05. Text is binary garbage even though charCount looks healthy. Fall back to --render or --ocr.
- empty_but_visual_content — no native text, but the page carries images or non-blank pixels. Re-run with --ocr (or read the rendered PNG via --render).
- empty — no text, no detected visual content. Likely a genuinely blank page (or a render failure — combine with visualStatus below).
quality.visualStatus (present only when --render or --ocr ran):
- ok — renderer drew real content.
- blank — page came out effectively blank against its own dominant background. Render-pipeline failure or genuinely blank page.

pdfvision deliberately stops at observation: it does not recommend an action. The action is the agent's call based on the two statuses + the raw signals below.

Raw signals (the inputs to `quality`)

textCoverage: 0 (rendered as coverage: 0% in markdown) + imageCount > 0 → the page body is a rasterised image. The text stream is empty. Re-run with --ocr or --render.
nonPrintableRatio >= 0.05 → pdf.js fell back to raw glyph indices because the PDF's fonts lack a ToUnicode CMap (common with Hebrew, older CJK, custom symbol fonts). text reads as full coverage but is binary garbage. Values >= 0.3 are pathological; < 0.01 is normal. The raw count is in nonPrintableCount — when the 3dp ratio rounds to 0 the count still tells you whether any non-printable code points slipped through (useful for "is there ANY garbage in this page?" filters).
charCount: 0 but imageCount: 0 → genuinely blank page (separator, end matter).
Sudden drop in textCoverage on a single page in an otherwise text-dense doc → that page is likely a figure / scan / chart. Inspect with --render.
renderContentRatio <= 0.001 (when --render or --ocr was on) → the rasterised page came out blank against its own dominant background. Likely a render-pipeline failure (pdf.js + @napi-rs/canvas can't decode JPEG2000 image streams, or the font has no resolvable glyphs). The ratio is background-aware — dark book covers and beige scan paper don't false-trip it. OCR on this page returns confidence: 0 not because OCR failed but because the input was a near-uniform image.

Caching

Cache root: <os-tmp>/pdfvision/<content-sha>/ — macOS /var/folders/.../T/pdfvision/, Linux /tmp/pdfvision/. Override with PDFVISION_CACHE_DIR=/path.
Keyed by PDF content hash + flag combination. Same PDF + same flags → ~30 ms on the second call. Different flags (e.g. add --layout later) → different slot, fresh extraction.
Wipe everything (cached extractions, rendered PNGs, downloaded remote PDFs, OCR traineddata) with npx pdfvision --clear-cache.

Typical agent flow

Run npx pdfvision doc.pdf (add -p <range> per the scope note, and -f json only when you'll consume structured fields) — gets text + density Overview for the selected pages.
Read the density signals (the markdown Overview table, or overview[] / pages[].textCoverage / imageCount / charCount in JSON) to find low-coverage pages.
For low-coverage pages: re-run with --ocr if text is needed, or --render if a vision model will look at the rasterised page.
For structured / multi-column docs: re-run with --layout (and --image-boxes when figure positions matter).
Cache means steps 3–4 only re-pay the cost of the new flag combination on the affected page subset, not the whole extract.

When to read `references/`

The base of this file already covers daily extraction. Open a reference file only in one of these specific cases — they are not always-on context, do not load speculatively.

Read this file	Gate	When
`references/structured-output.md`	mandatory when you're consuming `--layout`, `--image-boxes`, `--geometry`, `--ocr`, or any other structured JSON / XML field whose schema isn't fully described in this file. SKILL.md only names the flags — the field-by-field shape lives in the reference.	Programmatic consumers of `-f json` / `-f xml`. Covers `DocumentResult` / `PageResult` / `LayoutBlock` / `ImageBox` / `TextSpan` / `PageOcr` schemas and coordinate-system semantics.
`references/ocr.md`	escalation for the easy cases (English-only, expected confidence). Mandatory when the user's text is non-English (lang ordering affects results), confidence is unexpectedly low, or the `tesseract.js` install / stderr is misbehaving.	Lang code combinations, primary-language ordering, traineddata cache, install diagnostics, troubleshooting (low confidence, blank PNG, stderr noise).

pdfvision

pdfvision

Prerequisite

Quick reference

Picking the right flags

Detecting silent failures with the density Overview

One-shot dispatch: pages[].quality

Raw signals (the inputs to quality)

Caching

Typical agent flow

When to read references/

pdfvision

Prerequisite

Quick reference

Picking the right flags

Detecting silent failures with the density Overview

One-shot dispatch: pages[].quality

Raw signals (the inputs to quality)

Caching

Typical agent flow

When to read references/

One-shot dispatch: `pages[].quality`

Raw signals (the inputs to `quality`)

When to read `references/`

One-shot dispatch: `pages[].quality`

Raw signals (the inputs to `quality`)

When to read `references/`