| name | standardize-ebook |
| description | Turn a parsed book (EPUB or PDF) into the "reader-ready" format used by /ebook/[id] — markdown chunks with simplified→traditional Chinese, publisher boilerplate stripped, multi-volume hierarchy, smart chapter titles. Branches by file_type — EPUB uses ebooklib + TOC anchors; PDF uses font-size heuristics. Use when wiring a new book into the reader, fixing a book whose TOC looks ugly, batch-processing a category, or extending the pipeline to support a new file type. |
Standardize Ebook Skill
Turn a parsed book into the reader-ready format. The pipeline branches on file_type because EPUB and PDF expose totally different signals — EPUB has semantic HTML + a TOC tree; PDF only has text-on-page coordinates and font sizes. Each branch produces the same output JSONL contract.
Branch decision (read this first)
| ebooks.file_type | Source signals | Standardize script | Status |
|---|---|---|---|
| epub | <h1>-<h4>, <b>/<em>, <p>, TOC tree with anchors | scripts/standardize_ebook.py | ✅ implemented |
| pdf (text-extractable) | text + font size + bbox via PyMuPDF | scripts/standardize_pdf.py | ⚠ NOT YET BUILT — design below |
| pdf (scanned) | image pages → Gemini Vision OCR JSON | scripts/ocr_with_gemini.py emits chunks; then run the PDF pipeline on those | ✅ OCR scheduled, PDF pipeline pending |
Both branches write the same JSONL shape to G:/我的雲端硬碟/資料/電子書/_chunks/{ebook_id}.jsonl so the reader doesn't care which path produced it.
Shared rules — title / boilerplate / merging (both pipelines reuse)
These rules run AFTER the file-specific extraction (HTML walk for EPUB, font heuristics for PDF) and BEFORE persistence. The PDF pipeline must reuse them — they're already implemented in standardize_ebook.py and should be factored into a shared module when standardize_pdf.py lands.
Chapter title derivation
Pulled out of derive_chapter_title() and normalize_chapter_title() in standardize_ebook.py. The TOC sidebar's chapter_path should always end up looking like one of:
| Cosmetic rename | Triggers |
|---|---|
| 版權頁 | content starts with 圖書在版編目 / 图书在版编目, OR title is literally CIP數據 / 版權信息 / 版权信息 / 版權頁 / 版权页 |
| 目錄 (or 目 錄, original spacing kept) | chunk content's first short line is 「目錄」 — picked up automatically by the first-line fallback |
| 後記, 序言, 致謝, 導論, 索引, etc. | first short line as-is, no special handling needed |
Selection priority for chapter title (first match wins):
- Markdown heading — ## title / ### title … in content (after publisher-rename normalization)
- CIP detected anywhere in the first 3 lines → 版權頁
- Earliest short non-banner line (≤30 chars, not matching 叢書|丛书|名著|系列|文集|文庫|出版社) — natural document order, so 「目錄」 beats a later 「第一章」 inside a TOC chunk
- Long chapter heading anywhere — lines matching ^第N(章|卷|編|册|冊|部|集|篇|節|节|回|课|課) are accepted even when longer than 30 chars (君主論's chapter titles can run past 30 chars)
- First candidate ≤60 chars as a last resort
- Filename fallback (e.g. text/part0001.html) — should never be reached for well-formed books
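A condensed sketch of that cascade (the regexes are abbreviated and the helper name is illustrative; the real implementation is derive_chapter_title() in standardize_ebook.py):

```python
import re

BANNER_RE = re.compile(r"叢書|丛书|名著|系列|文集|文庫|出版社")
CHAPTER_RE = re.compile(r"^第[一二三四五六七八九十百千\d]+(章|卷|編|册|冊|部|集|篇|節|节|回|课|課)")
CIP_RE = re.compile(r"圖書在版編目|图书在版编目")

def derive_title_sketch(content: str, filename: str) -> str:
    lines = [l.strip() for l in content.splitlines() if l.strip()]
    for l in lines:                                  # 1. markdown heading
        m = re.match(r"^#{2,4}\s+(.+)", l)
        if m:
            return m.group(1)
    if any(CIP_RE.search(l) for l in lines[:3]):     # 2. CIP in first 3 lines
        return "版權頁"
    for l in lines:                                  # 3. earliest short non-banner line
        if len(l) <= 30 and not BANNER_RE.search(l):
            return l
    for l in lines:                                  # 4. long 第N章-style heading
        if CHAPTER_RE.match(l):
            return l
    for l in lines:                                  # 5. first candidate ≤60 chars
        if len(l) <= 60:
            return l
    return filename                                  # 6. filename fallback
```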
Drop & dedupe (publisher noise)
- Hard drop (chunk is a publisher-only ad with no value): currently matches 上海譯文出版社 Digital Lab pages. The pattern set lives in HARD_DROP_PATTERNS. Add new publishers as they show up — keep patterns narrow to avoid eating real content.
- Dedupe (keep first occurrence, drop the rest): for sections that publishers repeat per sub-volume:
  - ^版权信息 / ^版權信息
  - 圖書在版編目 / 图书在版编目 (CIP data — multi-volume sets repeat this per volume)
- Empty-doc handling: if a doc has < 5 chars of plain text:
  - First cover-image-only page (titlepage.xhtml or filename contains cover) → emit a ## 封面 placeholder once
  - Later empty pages → drop silently
Continuation merge — 後記/索引 split fix
EPUBs (and OCR'd PDFs) sometimes split one logical section across multiple
files whose TOC titles are just 一 / 二 / 三 (續篇) or A / B / C (索引
字母分頁). Merge those into the previous chunk:
Before: [後記] + [二] + [索引] + [A] + [B] + … + [Z] ← 27 chunks
After: [後記] + [索引] ← 2 chunks, content concatenated
Detection regex (in is_continuation_title()):
- Single Chinese numeral: [一二三四五六七八九十百千]+
- Single Latin letter: [A-Za-z]
- 1-3 digits: \d{1,3}
- Empty after stripping
Merge condition: same volume as the previous chunk (don't merge across
volume boundaries).
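The detection itself is small. A sketch matching the rules above (treat it as illustrative of is_continuation_title, not its exact source):

```python
import re

# single Chinese numeral, single Latin letter, or 1-3 digits
_CONT_RE = re.compile(r"^(?:[一二三四五六七八九十百千]+|[A-Za-z]|\d{1,3})$")

def is_continuation_title(title: str) -> bool:
    t = title.strip()
    return not t or bool(_CONT_RE.match(t))  # empty after stripping also merges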
Heading rewrite for cosmetic renames
When derive_chapter_title() applied a rename (e.g. CIP→版權頁), also
rewrite the first markdown heading line in the chunk content so the page's
own h2 matches the sidebar label. Without this the user sees 版權頁 in
the TOC but ## 圖書在版編目(CIP)數據 rendered as the page title — the
reader caught this immediately.
Simplified → traditional + post-fix
Use to_traditional() (in standardize_ebook.py):
- opencc.OpenCC("s2tw").convert() for the heavy lifting
- Apply the TRAD_FIXES table (24 entries — shared with parse_drive_inventory.py) to fix s2tw over-conversions: 历史→曆史 (should be 歷史), 託爾斯泰 (should be 托爾斯泰), 慄田 (should be 栗田), etc.
When you find a new mis-conversion, add it to parse_drive_inventory.py:TRAD_FIXES (single source of truth — used by both filename parsing and chunk content) and re-run the affected books.
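A minimal sketch of the conversion step, assuming the opencc Python package; the TRAD_FIXES entries shown are just the examples above, not the full 24-entry table:

```python
import opencc

_S2TW = opencc.OpenCC("s2tw")

# s2tw over-conversion -> intended form (full table lives in parse_drive_inventory.py)
TRAD_FIXES = {
    "曆史": "歷史",
    "託爾斯泰": "托爾斯泰",
    "慄田": "栗田",
}

def to_traditional(text: str) -> str:
    out = _S2TW.convert(text)
    for wrong, right in TRAD_FIXES.items():
        out = out.replace(wrong, right)
    return out
```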
Volume markers
Multi-volume books are detected by their EPUB top-level TOC entries
containing one of 卷 / 冊 / 部 / 集 / 篇. The PDF pipeline can build the
equivalent from doc.get_toc() (PDF bookmarks; get_toc() is a Document-level method in PyMuPDF) when present.
If anchored TOC entries (href="part.html#K4") point to anchors that don't
actually exist in the HTML (broken EPUB, like 中國儒學史), fall back to
treating the doc as a doc-level volume start.
Hierarchical TOC support — parse_toc_hierarchical (preferred over flat)
When book.toc exposes top-level Sections with ≥2 child entries each, the
pipeline switches from the flat single-level path into a richer 2-level
splitter that exposes both 章 AND 節 in the sidebar.
Role detection. A top-level Section can mean either a volume or a
chapter, decided by title shape:
- multi_volume — top Section titles are volume names (羅馬帝國衰亡史: 「全譯羅馬帝國衰亡史:1」). Split at child (chapter) anchors AND grandchild (節) anchors. volume = top_title, chapter_path = chap_or_section_title. Heading levels: chapters get ### (sidebar pl-7), 節 get #### (pl-11).
- single_chapter — top Section titles look like printed-book chapters (現代世界史: 「第1章 歐洲的興起」 — matches _CHAPTER_TITLE_RE). Split at top (chapter) anchors AND child (節) anchors. volume = None, chapter_path = chap_or_section_title. Heading levels: chapters get ## (sidebar pl-3), 節 get ### (pl-7).
The decision uses _is_chapter_title() vs looks_like_volume() counts: if
≥50% of top Sections match the chapter pattern AND chapter > volume count,
it's single_chapter; else multi_volume.
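In code, the decision is roughly the following sketch (the regexes are simplified stand-ins for _is_chapter_title() / looks_like_volume() in standardize_ebook.py):

```python
import re

_CHAPTER_TITLE_RE = re.compile(r"^第[一二三四五六七八九十百千\d]+章")
_VOLUME_RE = re.compile(r"[卷冊部集篇]")

def detect_role(top_titles: list[str]) -> str:
    chapters = sum(1 for t in top_titles if _CHAPTER_TITLE_RE.match(t))
    volumes = sum(1 for t in top_titles if _VOLUME_RE.search(t))
    # >=50% chapter-shaped titles AND more chapters than volumes -> single_chapter
    if chapters >= len(top_titles) / 2 and chapters > volumes:
        return "single_chapter"
    return "multi_volume"
```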
Payload contract. Both roles emit 3-tuple anchor payloads
(vol_or_None, chap_title, level_str) so the standardize loop can normalize
heading depth uniformly via the target_level override (see "Heading
normalization" below).
Why this matters. Without hierarchical support, books like 現代世界史
(27 chapters × ~5 節 each) collapsed into 21 flat chunks; books like 羅馬帝國
衰亡史 (13 vols × 88 chapters) collapsed into 15 flat per-spine-doc chunks
with volume=None everywhere. After: 201 and 103 chunks respectively, with
correct nesting. Survey: 283/308 standardized EPUBs (92%) qualified for the
hierarchical path; 25 fall back to legacy parse_volume_toc.
Anchor splitting — deep walk + per-anchor dedup
split_body_at_anchors previously iterated only body.children (top-level direct children). Many publishers wrap the entire chapter list inside one <div> directly under body, so the loop only matched the first anchor inside it — the rest got swallowed (現代世界史: alternating chapters 第1, 第3, 第5… with even-numbered chapters lost).
The current implementation:
- Walks all body descendants in document order, deduping anchor matches by their id value (publishers often emit the same id on both an <a> nav target AND a <h2> heading — without dedup the same anchor fires twice and creates a phantom chunk).
- String-splits the body at each match's tag-start position, preserving the open/close <body…> tags around each segment.
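A minimal sketch of the deep walk with per-anchor dedup, assuming BeautifulSoup (the string-split step is omitted, and names are illustrative rather than the exact ones in split_body_at_anchors):

```python
from bs4 import BeautifulSoup, Tag

def match_anchors(body_html: str, anchor_ids: set[str]) -> list[Tag]:
    soup = BeautifulSoup(body_html, "html.parser")
    seen: set[str] = set()
    matches: list[Tag] = []
    # .descendants walks the whole tree in document order, so anchors buried
    # inside a wrapper <div> under <body> are still found
    for el in soup.body.descendants:
        if not isinstance(el, Tag):
            continue
        el_id = el.get("id")
        # dedup: publishers emit the same id on an <a> target and a heading
        if el_id in anchor_ids and el_id not in seen:
            seen.add(el_id)
            matches.append(el)
    return matches
```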
Heading normalization (hierarchical mode only)
The reader's loadToc derives sidebar nesting from each chunk's first
heading depth (## → level 2 → pl-3, ### → level 3 → pl-7, etc.). EPUBs
use whatever <h1>/<h2>/<h3> the publisher chose for chapter titles, which
varies — making some chapters render as level-2 entries and others as
level-3+ children of preceding chunks.
In hierarchical mode, the standardize loop forces a uniform heading level:
| Role | Chapter heading | 節 heading |
|---|---|---|
| single_chapter | ## 第N章 … (level 2, pl-3) | ### N. 節名 (level 3, pl-7) |
| multi_volume | ### 第N章 … (level 3, pl-7) — ## reserved for the volume | #### 節名 (level 4, pl-11) |
If the chunk has no # heading at all in content, one is prepended at the
target level. This is what makes the 章/節 indentation visible in the
reader sidebar.
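A sketch of the override, assuming markdown chunk content (the real logic lives in the standardize loop; the helper name is illustrative):

```python
import re

def force_heading_level(content: str, title: str, target_level: int) -> str:
    hashes = "#" * target_level
    if re.match(r"^#{1,6}\s", content):
        # rewrite the publisher's arbitrary depth to the role's uniform depth
        return re.sub(r"^#{1,6}", hashes, content, count=1)
    # no heading at all -> prepend one at the target level
    return f"{hashes} {title}\n\n{content}"
```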
Same-chapter cross-spine merge
When the previous chunk has the EXACT same volume + chapter_path as the
current one, it's a continuation — typically cross-spine-doc spillover
(current_volume state carries from doc N's last anchor into doc N+1's
segment-0, which has no new transition). Strip the duplicate heading and
append to the previous chunk. Without this rule each chapter's title-image
spine doc becomes a phantom standalone chunk.
ebook_chunks DB previews
After writing JSONL + R2, refresh ebook_chunks rows for this book —
DELETE existing then INSERT first 200 chars of each chunk's content so
full-text search via /api/ebooks/search?mode=fulltext finds them.
Adaptive batch size (100 → 50 → 20 → 5 → 1) to handle Supabase IO budget
57014 timeouts on multi-volume books with 800+ chunks.
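A sketch of the adaptive batching, where insert_rows is a hypothetical persistence helper and TimeoutError stands in for the 57014 statement-timeout failure:

```python
def insert_previews(rows: list[dict]) -> None:
    sizes = [100, 50, 20, 5, 1]
    size_idx, i = 0, 0
    while i < len(rows):
        batch = rows[i : i + sizes[size_idx]]
        try:
            insert_rows("ebook_chunks", batch)  # hypothetical DB helper
            i += len(batch)
        except TimeoutError:  # stand-in for Postgres 57014 under IO budget
            if size_idx + 1 == len(sizes):
                raise
            size_idx += 1     # shrink the batch and retry the same rows
```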
Output contract — the JSONL shape consumers depend on
Each line is one chunk:
{
"chunk_index": 0,
"chunk_type": "chapter",
"page_number": null,
"chapter_path": "第一卷 時 間",
"volume": "文明的歷史:發現者(上冊)",
"format": "markdown",
"content": "## 第一卷 時 間\n\n時間是最偉大的改革者。\n\n**——弗朗西斯·培根:《論革新》(1625)**"
}
| Field | Required | Notes |
|---|---|---|
| chunk_index | yes | 0-based, contiguous |
| chunk_type | yes | "chapter" for EPUB; "page" for PDF |
| page_number | optional | null for EPUB, set for PDF |
| chapter_path | yes | first heading text, used for the sidebar TOC entry |
| volume | optional | only present in multi-volume books; null = front matter or single-volume |
| format | yes | "markdown" for standardized output |
| content | yes | markdown — supports ## ### ####, **bold**, *em*, > blockquote |
The reader's server/utils/ebook-chunks.ts loadToc() reads this format and produces the sidebar; the reader page renders markdown via its own limited renderer (h1-h4 / bold / em / blockquote / paragraphs).
EPUB pipeline (implemented)
What standardize_ebook.py does for EPUB input, in order:
- Read the EPUB with ebooklib — iterate spine docs in reading order.
- Parse the top-level TOC (book.toc):
  - Strip entries titled 版权信息 / Digital Lab
  - Filter to volume markers only: keep only entries whose title contains 卷 / 冊 / 部 / 集 / 篇. Without this filter, single-volume books promote 目錄 / 插頁 / 出版說明 into fake volume groups (a real bug found while running the 哲學 batch).
  - Remaining entries with href (no #) → mark as a volume start at this doc
  - Entries with href#anchor → mark as a volume start at this anchor inside the doc (split point)
  - If fewer than 2 volume entries remain after filtering → flat (single-volume) layout
- For each spine doc:
  - If TOC anchors point inside it, split the body at those anchor elements into segments
  - Each segment becomes a candidate chunk
- HTML → markdown (el_to_md):
  - <h1> → ##, <h2> → ###, <h3>/<h4> → ####
  - <b>/<strong> → **…**, <em>/<i> → *…*
  - <p> → paragraph, <blockquote> → > …, <hr> → ---
  - Images, footnote <sup>, and decorative <svg> are stripped
- Drop / dedupe:
  - Hard drop if the plain text matches HARD_DROP_PATTERNS (currently tuned for 上海譯文 Digital Lab)
  - Dedupe first-occurrence patterns (版权信息 / 版權信息) — keep one, drop later copies
  - Empty docs that are cover-image-only become a single 「封面」 placeholder chunk
- Convert simplified Chinese → traditional via opencc s2tw (also normalizes 卷/編 names if needed).
- Pick chapter_path from the first markdown heading; fall back to the filename.
- Persist (a hedged upload sketch follows this list):
  - Write JSONL to G:/我的雲端硬碟/資料/電子書/_chunks/{ebook_id}.jsonl
  - Gzip + PUT to R2 at ebook-chunks/{ebook_id}.jsonl.gz
  - Refresh ebook_chunks in the DB with 200-char previews (replace, don't append) so full-text search picks them up
  - Update the ebooks row: chunk_count, total_chars, total_pages, parsed_at
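The R2 step could look like this sketch, assuming R2's S3-compatible API via boto3; the endpoint and bucket are placeholders, and credentials come from boto3's usual chain:

```python
import gzip
import boto3

def push_to_r2(jsonl_path: str, ebook_id: str) -> None:
    with open(jsonl_path, "rb") as f:
        body = gzip.compress(f.read())
    s3 = boto3.client("s3", endpoint_url="https://<account>.r2.cloudflarestorage.com")
    s3.put_object(
        Bucket="<bucket>",
        Key=f"ebook-chunks/{ebook_id}.jsonl.gz",
        Body=body,
        ContentType="application/json",
        ContentEncoding="gzip",
    )
```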
How to run
Single book
python scripts/standardize_ebook.py <ebook_id>
python scripts/standardize_ebook.py <ebook_id> --dry-run
python scripts/standardize_ebook.py <ebook_id> --no-r2
Batch by category
python scripts/standardize_ebook.py --category 哲學
python scripts/standardize_ebook.py --category 哲學 --subcategory 近代哲學
python scripts/standardize_ebook.py --category 哲學 --limit 5 --dry-run
Auto-skips PDFs (the script can't parse them — they need a different path, see "Limitations").
Verify a result
python -c "
import json
from pathlib import Path
p = Path('G:/我的雲端硬碟/資料/電子書/_chunks/<ebook_id>.jsonl')
lines = p.read_text(encoding='utf-8').splitlines()
print(f'chunks: {len(lines)}')
for i in [0, 1, 2, len(lines)//2, len(lines)-1]:
c = json.loads(lines[i])
print(f'[{i}] vol={c.get(\"volume\")} title={c[\"chapter_path\"][:40]}')
print(c['content'][:200])
print()
"
Or open /ebook/<ebook_id> in the running dev server (restart first to clear LRU cache) and check:
- TOC sidebar groups by volume if the book is multi-volume
- Headings render bold + sized (h2 centered with rule, h3 left-aligned)
- Chinese is traditional throughout
- No Digital Lab ad pages
PDF pipeline (TO BE BUILT — design)
PDFs lack the semantic markup EPUBs give us, so we infer structure from
typography. This is less reliable than EPUB but tractable for well-typeset
print books. Scanned PDFs go through OCR first (which produces text without
typography signals — see "OCR fallback" below).
Suggested script: scripts/standardize_pdf.py — uses PyMuPDF (fitz), which is already a dependency of parse_worker.py.
Pipeline order
- Open the PDF, iterate pages.
- Per-page font analysis — for each page collect text spans with (text, font_name, font_size, bbox, flags):

  ```python
  doc = fitz.open(path)
  for page in doc:
      for block in page.get_text("dict")["blocks"]:
          for line in block.get("lines", []):
              for span in line["spans"]:
                  ...
  ```

- Build a global font histogram — across the whole book, count chars per font_size bucket. The most common size is the body text size.
- Classify spans by size relative to body (see the sketch after this list):
  - ≥ body_size + 6pt and short (≤ 30 chars) → h2 (chapter title)
  - ≥ body_size + 3pt and short → h3 (section title)
  - ≥ body_size + 1pt → h4 (subsection)
  - flags & 16 (bold) AND short → h3 (some publishers signal headings only by bold, not size)
  - else → body paragraph
- Build markdown content by emitting each classified span:
  - Heading spans → ## title / ### title / #### title
  - Body spans → join with newlines; merge same-paragraph spans (same y-bbox proximity, no heading between)
  - Bold/italic body → **text** / *text*
- Chunking strategy:
  - Per-chapter (preferred): start a new chunk at every h2. Keeps chunks meaningfully sized and matches EPUB chunk granularity.
  - Fallback per-page: if no headings are detected anywhere (very plain PDF), fall back to one chunk per page (chunk_type="page"). That's already what parse_worker.py does today, so re-running this script just upgrades the metadata.
- Reuse the EPUB pipeline's downstream pieces:
  - to_traditional() for s2tw + TRAD_FIXES (already in standardize_ebook.py)
  - derive_chapter_title() for CIP→版權頁 / banner skip / long-heading recognition
  - Continuation-merge for 後記/二, 索引 A-Z
  - Volume detection — PDFs rarely have a programmatic volume marker, so usually flat. Could later add a TOC-bookmark reader (doc.get_toc(); it's a Document-level method) for paginated multi-volume sets like 《文明的歷史(全五卷)》.
  - The same write_jsonl / push_to_r2 / update_db from standardize_ebook.py — refactor those into a shared module if the PDF script grows.
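A sketch of the histogram and classification steps above, weighting the histogram by character count as the calibration tips recommend; the thresholds are the starting values above, not tuned ones:

```python
from collections import Counter
import fitz  # PyMuPDF

def classify_spans(path: str):
    doc = fitz.open(path)
    spans, hist = [], Counter()
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):
                for span in line["spans"]:
                    spans.append(span)
                    # weight by char count so footnote-sized text can't win
                    hist[round(span["size"])] += len(span["text"])
    body = hist.most_common(1)[0][0]
    for s in spans:
        text = s["text"].strip()
        size, bold, short = s["size"], bool(s["flags"] & 16), len(text) <= 30
        if size >= body + 6 and short:
            yield "h2", text
        elif size >= body + 3 and short:
            yield "h3", text
        elif size >= body + 1:
            yield "h4", text
        elif bold and short:
            yield "h3", text  # bold-only heading signal
        else:
            yield "body", text
```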
Calibration tips when implementing
- Body size detection is the linchpin. Test on 5-10 books from different publishers first. If a book makes heavy use of footnotes (small font), they shouldn't drown out the body. Filter by total character count per bucket, not just frequency.
- Bold-only signaling is publisher-specific. Some books (商務印書館 漢譯名著) put chapter titles in bold at the same size as body. Only enable bold→heading promotion when the font-size signal is weak.
- Drop running headers/footers — text spans appearing at the same y-bbox on most pages (page numbers, book title repeated). Detect by frequency; see the sketch after this list.
- Page-spanning paragraphs — a paragraph that ends on page N and continues on page N+1 should NOT be split. Track "did the previous page end mid-sentence" using not text.endswith(('。', '!', '?', '」', ')')).
- Footnotes — usually appear at the bottom of the page in a smaller font. Either drop them or move them to the end of the chunk as > [註] …. User preference.
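For the running-header/footer tip, one frequency-based sketch; the edge margin and the 60% threshold are guesses to tune per book:

```python
from collections import Counter

def chrome_bands(pages, page_height: float, edge: float = 50.0, min_ratio: float = 0.6):
    """pages: one list of spans per page; span['bbox'][1] is the top y coordinate."""
    freq = Counter()
    for spans in pages:
        bands = set()
        for s in spans:
            y = s["bbox"][1]
            if y < edge or y > page_height - edge:  # header/footer zone only
                bands.add(round(y / 5) * 5)         # 5pt quantization
        for band in bands:
            freq[band] += 1
    cutoff = len(pages) * min_ratio
    # y-bands occupied on most pages are layout chrome, not content
    return {band for band, n in freq.items() if n >= cutoff}
```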
Known PDFs to use as test cases
- 《尼采到底說了什麼?》by 羅伯特·所羅門 (53625079-…) — straightforward 哲學 PDF
- 《從封閉世界到無限宇宙》by 柯瓦雷 — typical philosophy-press PDF
- 《當代數學》— may have heavy formula content; PyMuPDF text extraction will be ugly
Current scope of the 哲學 PDF backlog
| Subcategory | PDF count |
|---|---|
| 哲學 (overall) | 58 PDFs (vs 51 EPUBs already standardized) |
| Across all categories | 422 text-extractable PDFs + 391 scanned, pending OCR |
OCR fallback for scanned PDFs
Already wired: scripts/ocr_with_gemini.py (run daily 16:00 by Windows Task
Scheduler) writes JSONL with chunk_type="page" and format="text". After OCR
finishes:
- Re-running standardize_pdf.py on those JSONLs would mostly be a no-op (no font signals to work with) — it'd just apply s2tw + TRAD_FIXES + derive_chapter_title() to upgrade them
- Each OCRed page might already include heading detection via the Gemini prompt — the prompt could be enhanced to mark heading-line vs body-line if needed (lower priority — the current state is "readable but flat")
When NOT to bother
- One-off book that already reads OK in the current per-page format
- PDFs with heavy formula / table layout (PyMuPDF mangles those — Gemini
Vision is more robust but expensive)
Current state (snapshot 2026-05-08)
All 481 standardized EPUBs have been re-run with the post-processing
pipeline below. The 5 books that fail consistently have invalid Windows
file paths (Errno 22) — pre-existing data issue, not a script bug.
| Category | EPUBs | Status |
|---|---|---|
| 哲學 / 宗教學 / 世界宗教 / 心理學 / 人類生物學 / 自然科學 / 文學 / 社會政治學 / 歷史學 | 481 total | ✅ all standardized; 2-3 transient persist errors per batch are auto-handled |
| Hierarchical re-run | 283/308 EPUBs eligible (92%) | 🔄 batch in progress (scripts/batch_hier_standardize.py in tmp); 3 books with annotations excluded |
Reference books that exercise the hierarchical path:
- 羅馬帝國衰亡史 (吉本) — multi_volume role: 13 vols × 88 chapters; was 15 flat → now 103 chunks with proper volume groups
- 現代世界史 (帕爾默·克萊默) — single_chapter role: 27 chapters × ~5 節 each; was 21 flat → now 201 chunks with 章 (##) > 節 (###) sidebar nesting
- 卡夫卡著作集 (套裝10冊) — depth-5 TOC; previously fully flat
Reference books to spot-check (after each re-run):
- 《君主論》 — single-vol; NO volume groups in sidebar
- 《中國儒學史》 — 10-vol; groups by 「先秦卷/漢代卷/…」
- 《希臘羅馬神話》 (Hamilton) — fake-flat 第一部 promoted via promote_implicit_volumes; cover has full subtitle/原書名/作者中英
- 《A State of Mixture》 (Payne) — chapters separated correctly (no longer eaten by 「Introduction」 super-chunk); 「In honor of beloved Virgil—」 / 「Publisher.xhtml」 / title-page repeats / epigraph / series banner all consolidated into 「出版資訊」
Post-processing pipeline (runs after the EPUB walk, before persist)
Order matters — see end of standardize():
- promote_implicit_volumes — when an EPUB TOC has an unnamed top-level group (e.g. the publisher omitted 「第一部」 but properly named 「第二部」+) the chapters end up flat. Scan vol=None chunks for 第N部/卷 dividers in their chapter_path and synthesize the missing volume.
- apply_cover_enrichment — replace the placeholder ## 封面 (書本封面) chunk with structured markdown built from DB title/author + 版權頁 extraction (subtitle / original_title / author_en). Or insert one at index 0 if no cover chunk exists.
- consolidate_frontmatter_into_publisher — when a CONTENTS-style chunk (目錄 / Contents) exists in the first ~12 entries AND no volume starts between cover and CONTENTS, fold all chunks [1..contents-1] into one synthesized 「出版資訊」 chunk. Substantive named entries (序 / 致謝 / 譯者序 / 推薦序 / Note on / Acknowledgments) stay separate because they appear AFTER CONTENTS.
- derive_chapter_title smart fallback — already runs per-chunk during the main walk. Skips numeric/single-letter headings (academic EPUBs use <h2>1</h2><h1>Real Title</h1>); when only numeric headings exist AND a short content line follows (<h2>01</h2><p>王權的誕生</p>), combines them into 「01 王權的誕生」.
- Continuation-merge size cap — is_continuation_title matches still merge tiny 「二」 / 「A」 chunks into the previous chunk, but ONLY if the chunk's plain text is ≤ 800 chars (prevents a 130KB chapter file titled just 「1」 from being eaten).
Rich publisher metadata extraction (_extract_publisher_metadata)
Scans every chunk's content for 版權頁-style key-value lines and writes
the result to ebooks columns during update_db():
| Field | Regex source | ebooks column |
|---|---|---|
| full_title (used for subtitle split) | 書名: … / Title: … | subtitle (only the post-: part) |
| original_title | 原文書名: … / 原書名: … / Original Title: … | original_title |
| author_en | 作者: 中文(English) parens capture | author_en |
| translator | 譯者: … (stops at the _FIELD_STOP chars below) | translator |
| publisher | 出版: … / 出版社: … / Published by: … (rejects 出版日期 / 出版年 / 出版地) | publisher |
| publish_year | 初版: …YYYY / 初版首刷: YYYY / 電子書: …YYYY | publication_year |
| original_publish_year + original_author | Copyright © YYYY by AUTHOR | both |
Field-stop char class _FIELD_STOP = "\n│|,,;;//((、" keeps regexes
from greedy-eating siblings on packed lines like
作者:X│譯者:Y│出版者:Z│出版日期:YYYY年.
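One field shown end-to-end with the stop class (illustrative only; the real pattern set lives in _extract_publisher_metadata):

```python
import re

_FIELD_STOP = "\n│|,,;;//((、"
# full/half-width colon after the label; capture stops at any field-stop char
_TRANSLATOR_RE = re.compile(rf"譯者[::]\s*([^{re.escape(_FIELD_STOP)}]+)")

line = "作者:X│譯者:Y│出版者:Z│出版日期:2003年"
m = _TRANSLATOR_RE.search(line)
print(m.group(1) if m else None)  # -> Y, stopping at the full-width │
```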
Auto-copy to books on excerpt creation
server/api/annotations/index.post.ts (POST handler with
save_as_excerpt: true) reads the rich columns from ebooks and copies
them into the auto-created books row, so books that get created by the
reader's 「+ 書摘」 button come out matching the richness of manually-
entered ones — translator / publisher / publish_year / original_title /
original_author / original_publish_year all populated when extractable.
When you tweak the extraction regexes here, re-run --all so existing
ebooks pick up the new fields, then any future excerpt save will produce
a rich books row.
Idempotency
Re-running on the same book is safe — it overwrites:
- The local JSONL (write replaces)
- The R2 object (PUT replaces)
- ebook_chunks rows (DELETE-then-INSERT)
- The ebooks row's chunk_count / total_chars / parsed_at
Annotations (annotations table) reference chunk_index which generally stays stable for the same book — but if you tweak HARD_DROP_PATTERNS and re-run, indices shift and existing annotations may land on wrong text. Avoid re-running once users have annotations.
Volume detection — known limits
The looks_like_volume() heuristic is conservative: a TOC entry must contain one of 卷 / 冊 / 部 / 集 / 篇 to be considered a volume. This works for:
- ✅ 文明的歷史:發現者(上冊) — has 「冊」
- ✅ 中國儒學史:先秦卷 — has 「卷」
- ✅ 五燈會元第N部 — has 「部」
It will miss books that group by other markers like:
- ❌ Series titled 「上」「中」「下」 alone (since 「上」 alone is too common in Chinese to safely match)
- ❌ Latin numerals like 「Volume I / II」 in mixed-language editions
- ❌ Publishers using 「輯」 or other terms not in the marker set
If a multi-volume book gets flattened by mistake, add the missing marker to VOLUME_MARKERS and re-run. Don't try to detect "this looks like a multi-volume set" structurally — the marker check is the only reliable signal in the EPUB TOC.
Anchor fallback (broken EPUB anchors)
Some EPUBs put #anchor fragments in their TOC hrefs but the HTML body never actually emits a matching id="..." — 中國儒學史 does this. The script handles it:
- After parse_volume_toc(), every (file, anchor, title) entry is validated by reading the doc's HTML and running find(attrs={"id": anchor}).
- If at least one anchor in a doc lands → keep the split-at-anchor behavior.
- If no anchors in a doc resolve → promote the first declared title to a doc-level volume start. The volume transition still fires, just at the doc's beginning instead of mid-doc.
Logged as (N anchored volume(s) had no resolvable id — promoted to doc-level starts) per book during a batch.
Tuning per publisher
The default boilerplate patterns target 上海譯文出版社. Adjust HARD_DROP_PATTERNS and DEDUPE_PATTERNS when you find a publisher whose noise pages slip through:
```python
HARD_DROP_PATTERNS = [
    r"Digital\s*Lab是上海译文出版社",
    r"我们致力于将优质的资源送到读者手中",
    r"上海译文出版社\|Digital\s*Lab",
]
```
Heuristic for what to add:
- Patterns must be narrow enough not to match real content (a copyright page is content the user wants kept; a publisher self-promo page is not)
- Test on 1 book with --dry-run before running a batch
Limitations
- PDFs not supported. standardize_ebook.py raises SystemExit for non-EPUB input. PDFs lack reliable HTML structure for headings, and the OCR pipeline produces flat per-page text — see the PDF pipeline design above for the planned standardizer.
- Boilerplate detection is publisher-specific. Books from publishers not in HARD_DROP_PATTERNS will keep their copyright/dedication/ad pages as visible chunks. Cosmetic, not breaking.
- TOC anchor accuracy varies. EPUBs that link volume2 → file#anchor mid-document get split at the anchor. If the publisher placed the anchor at the wrong paragraph (e.g. 文明的歷史's K4 anchor is mid-上冊 but tagged as the 下冊 start), the split is also wrong. A manual fix would need per-book overrides.
- No semantic dedupe. Two chunks with very similar but not identical 版權頁 content are both kept. The current dedupe is regex-based prefix matching only.
- EPUB images are dropped entirely. The cover page becomes the placeholder text 「封面」. Inline figures (chart_img etc.) are stripped — content-only.
Recovery / re-runs
If a batch run partially fails (e.g. Supabase IO budget hits 57014), the same command re-run picks up where it left off — but only because each book is overwriting independently. There's no checkpoint file. The persisted state in DB / R2 / local is your progress marker.
To find books that were standardized vs not, after a batch:
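One option is to query for books whose persist step never ran. A sketch assuming SUPABASE_URL / SUPABASE_KEY env vars (placeholder names) and that parsed_at is only set on a successful persist, per the Persist step above:

```python
import os
import requests

URL = os.environ["SUPABASE_URL"]
H = {"apikey": os.environ["SUPABASE_KEY"],
     "Authorization": f"Bearer {os.environ['SUPABASE_KEY']}"}

rows = requests.get(
    URL + "/rest/v1/ebooks?select=id,title&file_type=eq.epub&parsed_at=is.null",
    headers=H,
).json()
print(f"{len(rows)} books not yet standardized")
```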
Or check mtimes on G:/.../電子書/_chunks/*.jsonl — recently modified files were just standardized.
When to NOT use this skill
- The book is a PDF (use OCR pipeline instead)
- The book has annotations from users (re-running can shift chunk_index, breaking saved highlights)
- The book is single-language English with a normal structure (s2tw conversion would be a no-op, and the boilerplate patterns mostly target Chinese — the script still works, just with less cleaning)
- You want to re-parse without re-uploading to R2 (use --no-r2)
Recommended order for "I'm a new agent picking this up"
- Read this file end-to-end first.
- Read the ebook-pipeline SKILL for context on where this fits in the broader system.
- Look at the Current state table above to see which categories are done vs not.
- Re-run the 哲學 batch first to apply the recent volume-marker fix:
  python scripts/standardize_ebook.py --category 哲學
- Pick another category and dry-run to estimate scope:
  python scripts/standardize_ebook.py --category <名> --dry-run
- Look at 3-5 sample chunks before committing to a full batch — different categories have different publisher mixes.
- After running, sample chunks visually (snippet under "Verify a result"). If you see boilerplate that should be dropped, extend HARD_DROP_PATTERNS and re-run — idempotent.
- Don't re-run standardize_ebook.py on books whose IDs have entries in the annotations table (re-running can shift chunk_index). Check first:
  python -c "import requests, ... ; print(requests.get(URL+'/rest/v1/annotations?select=ebook_id', headers=H).json())"
Related