| name | pdf-to-html |
| description | Converts a PDF into one self-contained, readable HTML file that preserves images, tables, charts and reading order — optionally translating it into another language while keeping every figure. Uses structured extraction (PyMuPDF), font-size-driven layout, compressed base64-inlined images (a single portable file), and mandatory headless-Chrome visual verification. Use whenever someone wants to READ a PDF as a web page or clean document, turn a PDF into HTML, or translate a PDF into another language while keeping its images/tables/charts intact — e.g. "PDF 转 HTML", "把这个 PDF 转成中文网页版", "make this report readable", "translate this PDF but don't lose the charts", "I just want to read this PDF on my phone". Distinct from doc-to-markdown (plain Markdown text) and pdf-creator (Markdown→PDF) — this one produces a styled, image-faithful HTML reading experience. |
PDF to HTML
Turn a PDF into a single, self-contained, readable HTML file — images, tables,
charts and reading order preserved — and optionally translate it, keeping every
figure in place.
The pipeline is extract → look → (translate) → build → verify. The middle
"look" and final "verify" steps are where faithfulness actually comes from: a PDF
is a layout, not just a text stream, so you read the rendered pages before
building and the rendered HTML before delivering.
This skill runs inline (no context: fork): translation orchestrates a
Dynamic Workflow, and a subagent cannot spawn one.
When to use / not use
- Use when the goal is to read a PDF as HTML/web page, to convert a PDF to
a styled HTML document, or to translate a PDF into another language while
keeping its figures and tables.
- doc-to-markdown instead if they want plain Markdown text (no styling, figures optional).
- pdf-creator instead for the reverse direction (Markdown → PDF).
What it does NOT do
- Scanned/image-only PDFs (no text layer): OCR first (e.g.
ocrmypdf), then use this.
- Complex multi-column tables: cell text is preserved and readable, but column
alignment can flatten into a text flow — PyMuPDF reads a table as text blocks, not a
grid, so the grid lines are gone. Tables that are images in the PDF survive as
images. If the table's grid structure is essential, use doc-to-markdown (pandoc
rebuilds real tables) or convert that page separately.
- Pixel-perfect facsimile: output is a clean re-flow that keeps images and
reading order, not a 1:1 copy of the original page layout.
- Rewriting: it translates and re-lays-out; it does not summarize, add a TL;DR,
or editorialize. Faithfulness is the point (see Fidelity below).
Dependencies
uv (runs Python with inline deps), Google Chrome or Chromium (visual
verification). Python packages come via uv run --with: PyMuPDF, Pillow, numpy.
Nothing to pre-install beyond Chrome and uv.
Workflow
Copy this checklist and tick as you go:
- [ ] 1. Extract structure + render pages (extract_pdf.py)
- [ ] 2. Read pages/*.png — SEE the layout, find content vs decorative images
- [ ] 3. (only if translating) run the translation workflow
- [ ] 4. Build the single-file HTML (build_html.py)
- [ ] 5. Verify visually (verify_render.py → Read every segment)
- [ ] 6. Deliver the .html
1. Extract
uv run --with pymupdf python scripts/extract_pdf.py input.pdf
Writes input-build/ with structure.json (text blocks with font sizes + image
blocks flagged decorative), images/, and pages/ (one PNG per page).
2. Look before you build
Read input-build/pages/*.png. This is not optional: you need to see the real
layout, confirm which images are content vs decoration, and spot tables/charts.
For a long PDF, read every page; for a short one it's quick. This is also where
you understand the document well enough to translate it well.
3. Translate (optional)
Only if the user asked for another language. Read
references/translation_workflow.md and
follow it: a Dynamic Workflow translates pages in parallel, captions data charts,
and reconciles terminology. It produces two overlay files (units.json,
caps.json) that step 4 consumes. Do not hand-translate inline for anything
longer than a page — the workflow keeps terminology consistent and is far faster.
4. Build
uv run --with Pillow python scripts/build_html.py input-build/structure.json --out output.html
uv run --with Pillow python scripts/build_html.py input-build/structure.json --out output.html \
--translation input-build/units.json --captions input-build/caps.json --lang zh-CN
build_html.py is data-driven: it infers heading levels from font size (most
common size = body; larger steps up to h3/h2/h1), drops decorative images, and
inlines content images as compressed base64 → one portable file. It is not
hand-tuned to any document. If a particular PDF has an unusual structure (e.g.
multi-column, sidebars, a figure the size heuristic misreads), read the script and
adjust — it's short and meant to be edited per document.
5. Verify visually (mandatory)
uv run --with Pillow --with numpy python scripts/verify_render.py output.html
Then Read every seg-*.png and check: fonts render (no tofu boxes), no
clipped tables/figures, headings/lists look right, all expected images present.
Text being correct does not mean the render is correct (failure_cases #7). Fix and
re-verify until it's clean.
A quick structural cross-check is fine too, but count occurrences correctly:
grep -o '<figure>' output.html | wc -l — not grep -c (failure_cases #1).
6. Deliver
Hand over the single .html. It's self-contained (images inlined), so it opens
with a double-click and nothing can go missing.
Scripts
| Script | Run with | Purpose |
|---|
scripts/extract_pdf.py | uv run --with pymupdf | PDF → structure.json + images/ + page renders |
scripts/build_html.py | uv run --with Pillow | structure.json (+ optional translation/captions) → single-file HTML |
scripts/verify_render.py | uv run --with Pillow --with numpy | headless-Chrome render → readable PNG segments |
Fidelity (read before translating)
The deliverable looks authoritative, so wrong content is worse than ugly content.
The non-negotiable rules — and the specific ways this has gone wrong before — are
in references/failure_cases.md. The one that bites
hardest: never give a real person an inferred translated name, and copy every
number/proper-noun verbatim (failure_cases #6). Read that file before any
translation run; skim it before any run.
Next Step
After producing the HTML, suggest the natural follow-up:
Conversion complete: output.html (single self-contained file).
Options:
A) Make a PDF of it — run /daymade-docs:pdf-creator if you want a print/share copy (Recommended if they need to send it)
B) Extract the text as Markdown instead — run /daymade-docs:doc-to-markdown (if they wanted editable text, not a reading page)
C) No thanks — the HTML is what I wanted