| name | ebook-pipeline |
| description | Operate the Know-Graph-Lab ebook pipeline end-to-end. Use when working on parsing books from Drive into Supabase, OCR'ing scanned PDFs (scheduled Gemini runner), back-filling DB previews from local JSONL, standardizing EPUBs into reader-ready markdown, or wiring the reader to chunks. The hub for everything book-content-related. |
Ebook Pipeline Skill
End-to-end pipeline that takes books from a local Drive folder all the way to the reader at /ebook/[id]. This file is the operational hub: what runs, in what order, how to monitor it, and how to recover when it breaks. For the standardization step specifically (turning EPUBs into reader-ready markdown), see the standardize-ebook skill, the detail-level companion.
Current state (snapshot 2026-05-07)
| Stage | Status | Numbers |
|---|---|---|
| Drive scan + author/title parse | ✅ done | 1,309 books (initial sweep) |
| Daily z-lib drop ingest | ✅ wired into daily scheduler | ingest_new_books.py; see Workflow D |
| File rename in Drive | ✅ done | all renamed to 作者／書名.ext |
| DB import (ebooks table) | ✅ rolling | 1,326 rows (492 EPUB + 834 PDF) |
| First-pass parse (text-extractable) | ✅ done | 921 parsed (484 EPUB + 437 text PDF); 3 timeout-failed EPUBs retried 2026-05-07 |
| mobi/azw3 → epub conversion | ✅ done | 0 remaining |
| OCR scanned PDFs | 🔄 every-6h scheduled | 322 queued; auto-runs every 6 hours (13 */6 * * *). Default engine: Gemini (ocr_with_gemini.py, 4 rotating API keys, gemini-2.5-pro @ 4 RPM). Claude Haiku Vision only when the user explicitly orders it, strictly one book at a time |
| Local JSONL written | ✅ done for parsed books | G:/我的雲端硬碟/資料/電子書/_chunks/*.jsonl |
| R2 mirror of JSONL | ✅ done | mirrored on parse / OCR / standardize |
| ebook_chunks previews in DB | ✅ caught up | back-fill via repopulate_chunk_previews.py retry-failed |
| Frontend reads chunks | ✅ wired | loadChunk() with local → R2 fallback |
| Reader v2 UI (light theme + TOC + highlights + notes panel + bookshelf + tags + bookmarks) | ✅ done | with auto-save, error toasts |
| Search by title / author / fulltext | ✅ wired | /api/ebooks/search?mode=… |
| EPUB standardize → markdown reader format | ✅ done | 484/492 EPUBs incl. all enrichment passes (8 remain: corrupted / bad zip / no text) |
| PDF standardize Plan A (lite) | ✅ done | 437/437 text-extractable PDFs polished: s2tw, spacing collapse, publisher metadata, page_number preserved |
| PDF standardize Plan B v0 (TOC-driven) | ✅ done | standardize_pdf.py --all → 152/437 chapter-chunked (OK 152, Skipped 285, Failed 0); Plan B v1 (font-driven, no-TOC subset) deferred |
| Online metadata enrichment | ✅ done | enrich_book_metadata.py filled 8 publishers + 11 publish_years on the 28-book backlog (coverage now 89% / 87%) |
| books / excerpts library | ✅ done | /excerpts/library, tags, Markdown export, daily bookmark |
Pipeline scripts overview (all in scripts/)
| Script | Phase | Purpose |
|---|---|---|
| local_drive_pipeline.py | 1 – ingest (initial sweep) | Scan Drive, parse 作者／書名.ext, rename in place |
| parse_drive_inventory.py | 1 – ingest | Library: parse_filename(), to_traditional(), TITLE_AUTHOR_OVERRIDES |
| import_local_to_supabase.py | 1 – ingest (initial sweep) | data/local_inventory.json → ebooks rows |
| ingest_new_books.py | 1 – ingest (daily) | Watches z-lib/ at the project root. Parses filename, classifies via Gemini (with keyword fallback), inserts an ebooks row, moves the file to G:/.../電子書/{category}/. See Workflow D |
| parse_worker.py | 2 – parse | Main parser (PyMuPDF + ebooklib). init / run [--limit N] [--retry-errors] / status |
| convert_mobi_to_epub.py | 2 – parse | Calibre wrapper for mobi/azw3 → epub. Already done; keep for new files |
| ocr_with_gemini.py | 3 – OCR (primary) | Gemini Vision OCR for scanned PDFs. Pushes JSONL to R2 inline. Exits with code 2 when the daily quota hits (signals fallback) |
| ocr_with_qwen.py | 3 – OCR (fallback, disabled) | Local Qwen2.5-VL via Ollama. Code intact; bat trigger commented out because the vision compute graph (6.7 GiB) won't fit on a 4050 Mobile (6 GiB). Re-enable on better GPU. See Workflow A-2 |
| run_ocr_daily.bat + Task Scheduler | (orchestrator) | Windows scheduled runner: ingest → parse → OCR (Gemini only) in sequence |
| standardize_ebook.py | 4 – standardize | Re-parse EPUB → markdown chunks, s2tw, drop boilerplate (see standardize-ebook skill) |
| standardize_pdf_lite.py | 4 – standardize | Plan A polish over per-page JSONL: s2tw + collapse spacing + extract publisher metadata. page_number preserved exactly. See standardize-pdf skill |
| standardize_pdf.py | 4 – standardize | Plan B TOC-driven re-chunking. Reads existing JSONL + PDF TOC → emits chapter-level chunks with page_range. Falls back to Plan A on books without a usable TOC. See standardize-pdf skill |
| enrich_book_metadata.py | 4b – backfill | Online lookup (Google Books → Open Library) to fill missing publisher / publish_year on books rows. Idempotent; respects metadata_locked. status / run [--limit N] [--dry-run] [--book <id>] / probe --book <id> |
| repopulate_chunk_previews.py | 5 – DB | Back-fill ebook_chunks previews from local JSONL. run / retry-failed / status |
| upload_chunks_to_r2.py | 5 – R2 | One-shot bulk uploader for any books whose JSONL isn't on R2 yet |
| offload_chunks.py | (history) | Did the original DB truncate after the JSONL offload. Don't run again |
DB schema
```
ebooks (
  id uuid PK, title, author, file_type,
  file_path,
  category, subcategory,
  total_pages, total_chars,
  chunk_count,
  parsed_at,
  parse_error,
  book_id,
  cleaned_at
)

ebook_chunks (
  id uuid PK, ebook_id FK,
  chunk_index INT, chunk_type,
  page_number, chapter_path,
  content TEXT,
  char_count
)
-- GIN index on to_tsvector('simple', content)

annotations (
  id uuid PK, ebook_id FK, chunk_index,
  selected_text, context_before, context_after,
  note, color,
  excerpt_id FK?,
  created_at
)
```
Storage layout
- Local JSONL (source of truth): G:/我的雲端硬碟/資料/電子書/_chunks/{ebook_id}.jsonl
  - One JSON object per line: {chunk_index, chunk_type, page_number, chapter_path, volume?, format?, content}
  - Auto-syncs to the Google Drive cloud as backup
  - Configured via EBOOK_CHUNKS_DIR in .env (consumed by nuxt.config.ts → runtimeConfig.ebookChunksDir)
- R2 mirror: r2://{R2_BUCKET}/ebook-chunks/{ebook_id}.jsonl.gz (gzipped)
  - Read at runtime by server/utils/ebook-chunks.ts loadLines() when the local file is unreachable (production, Zeabur, etc.)
- DB previews: ebook_chunks.content holds the first 200 chars only, for fast SQL ilike full-text search
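For orientation, a single chunk line might look like this (illustrative values only; the field set is the one listed above):

```json
{"chunk_index": 42, "chunk_type": "paragraph", "page_number": 37, "chapter_path": "第二章 > 第一節", "content": "……段落全文……"}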
Critical constraints
- Supabase free tier has a 500 MB limit: never reload full chunk text into the DB. Always JSONL on disk + 200-char previews in the DB.
- Supabase free-tier IO budget: bulk inserts (>1K rows/s) hit 57014 "canceling statement". Both parse_worker.py and repopulate_chunk_previews.py use adaptive batch sizes (100 → 50 → 20 → 5 → 1) to ride out spikes; see the sketch after this list.
- No Supabase Storage bucket: the user explicitly forbade it. Local files only (the G: drive auto-syncs to Drive).
- Service-role key lives in .env; never hardcode it. Old hardcoded keys were scrubbed in commit 7a35d07; if you spot more, scrub them too.
- PostgREST's 1000-row default cap: server endpoints that list ebooks already use .range(0, 1999) (server/api/ebooks/index.get.ts). Any new bulk read must do the same.
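A minimal sketch of the adaptive-batch idea, assuming a generic insert callback (the real implementations live in parse_worker.py and repopulate_chunk_previews.py):

```python
import time

BATCH_SIZES = [100, 50, 20, 5, 1]  # the documented step-down ladder

def insert_adaptively(rows, insert_batch):
    """Insert rows, shrinking the batch whenever Supabase cancels a
    statement (Postgres error 57014) under free-tier IO pressure."""
    level = 0
    i = 0
    while i < len(rows):
        batch = rows[i:i + BATCH_SIZES[level]]
        try:
            insert_batch(batch)   # e.g. supabase.table(...).insert(batch).execute()
            i += len(batch)
        except Exception as e:
            if "57014" in str(e) and level < len(BATCH_SIZES) - 1:
                level += 1        # smaller batches ride out the IO spike
                time.sleep(2)
            else:
                raise
```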
Workflow A – OCR scanned PDFs (391 books)
scripts/ocr_with_gemini.py sends each scanned PDF to Gemini Vision via the Files API and gets back structured JSON {pages: [{page, text}]}. After OCR, it:
- Writes JSONL to the local _chunks/ folder
- Pushes gzipped JSONL to R2 (inline; see push_to_r2() in the script)
- Inserts 200-char preview rows into ebook_chunks
- Marks ebooks.parsed_at and clears parse_error

Rate limits to respect: the Gemini 2.5 Flash free tier allows 10 RPM, 250 RPD, 250K TPM; the default --rpm 8 leaves headroom. (The current default engine, gemini-2.5-pro, runs at 4 RPM across 4 rotating keys; see the OCR engine policy below.) The daily quota resets at midnight Pacific Time (≈ 16:00 Taipei).

Quota-exhaustion fallback: when Gemini returns 429 / RESOURCE_EXHAUSTED, ocr_with_gemini.py prints "Quota/rate-limit hit. Stopping." and exits with code 2. The daily bat catches that exit code and runs ocr_with_qwen.py for --limit 5 books before giving up; see Workflow A-2 below. The next scheduled run starts with Gemini again on the fresh quota.
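A minimal sketch of the run loop's control flow, showing the RPM throttle and the exit-code-2 contract (ocr_one_book() and QuotaExhausted are stand-ins, not the script's real names):

```python
import sys
import time

class QuotaExhausted(Exception):
    """Stand-in for a 429 / RESOURCE_EXHAUSTED response from Gemini."""

def ocr_one_book(book):
    raise NotImplementedError  # real work: Files API upload + Gemini Vision call

def run(books, rpm=8):
    min_gap = 60.0 / rpm                       # --rpm 8 => one request every 7.5 s
    for book in books:
        started = time.monotonic()
        try:
            ocr_one_book(book)
        except QuotaExhausted:
            print("Quota/rate-limit hit. Stopping.")
            sys.exit(2)                        # exit code 2 tells the bat to fall back (Workflow A-2)
        # throttle to stay under the RPM cap
        time.sleep(max(0.0, min_gap - (time.monotonic() - started)))
```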
Scheduler (set up + running)
The bat is now a 3-stage runner: ingest_new_books → parse_worker → ocr_with_gemini. Despite the historical name KGLab-OCR-Daily, it does more than OCR; see scripts/run_ocr_daily.bat.
| Component | Path |
|---|---|
| Bat runner | scripts/run_ocr_daily.bat: runs ingest → parse → OCR in sequence; logs to scripts/logs/ocr_YYYY-MM-DD.log |
| Toast helper | scripts/notify.ps1: Windows toast wrapper. The bat fires it twice: at run start, and again if Gemini hits 429 (so the user knows when to expect tomorrow's resumption) |
| Windows Task | KGLab-OCR-Daily, registered via Register-ScheduledTask |
| Trigger | Every 6 hours: cron 13 */6 * * * (fires at 00:13, 06:13, 12:13, 18:13 Taipei time). Changed from daily 16:00 on 2026-05-07 for more throughput per day |
| Behavior | WakeToRun + StartWhenAvailable: wakes from sleep, catches up if missed |
| Run as | Current user, interactive logon (no password stored; only fires while logged in) |
| Cap | 12-hour ExecutionTimeLimit |
OCR engine policy (2026-05-07)
Default: Gemini (ocr_with_gemini.py) with 4 rotating Google API keys. 503 = transient server overload (does NOT overwrite parse_error; the next run retries); 429 = daily quota exhausted (the script exits with code 2).
Claude Haiku Vision: ONLY when the user explicitly orders it, and always one book at a time: launching multiple parallel Haiku agents simultaneously exhausted the entire Max subscription token quota with zero books completed (2026-05-07). When Haiku is ordered:
- Use Agent(..., model="haiku", ...) for exactly one book
- Pass the PDF path via @file syntax to avoid Windows-shell Chinese-encoding issues
- Wait for completion before starting the next book
```
# Inspect / control
schtasks /query /tn "KGLab-OCR-Daily" /v /fo list
Start-ScheduledTask -TaskName "KGLab-OCR-Daily"   # manual fire (won't help mid-day if quota is exhausted)
schtasks /delete /tn "KGLab-OCR-Daily" /f         # tear it down
```
Manual operations
```
python scripts/ocr_with_gemini.py status
python scripts/ocr_with_gemini.py run --limit 1
python scripts/ocr_with_gemini.py run --rpm 8
python scripts/ocr_with_gemini.py run --model gemini-2.5-flash-lite --rpm 12
```
When OCR breaks
- parse_error = 'no extractable text' → still in the queue (initial failure from parse_worker; picked up by ocr_with_gemini)
- parse_error starts with "OCR ok but R2 push failed:" → OCR succeeded but the R2 write failed; the book is NOT marked parsed, and the next OCR run retries (cheap, since the JSONL was kept locally)
- parse_error starts with "OCR:" → permanent Gemini failure (model returned 0 usable pages, file too big, etc.); won't auto-retry, investigate manually
- parse_error starts with "Qwen-OCR:" → permanent Qwen failure (similar: Qwen returned no text, or too many per-page failures)
- Quota stop → the script prints "Quota/rate-limit hit. Stopping." and exits with code 2; the bat falls back to ocr_with_qwen.py --limit 5 (Workflow A-2)
- parse_error = 'file not found: ...' → the DB row references a Drive path that no longer exists (Drive sync disconnected, file moved/deleted, or rename divergence). The book is removed from the OCR queue automatically; check whether G:\ is mounted
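To see which failure bucket dominates before digging in, a quick triage sketch with supabase-py (env var names are assumptions; adjust to however the scripts build their client):

```python
import os
from collections import Counter

from supabase import create_client

sb = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

rows = (
    sb.table("ebooks")
    .select("parse_error")
    .not_.is_("parse_error", "null")   # only failed / queued books
    .range(0, 1999)                    # explicit range: PostgREST caps at 1000 by default
    .execute()
    .data
)
# Bucket by message prefix: 'no extractable text', 'OCR', 'OCR ok but R2 push failed',
# 'Qwen-OCR', 'file not found'
print(Counter(r["parse_error"].split(":")[0] for r in rows).most_common())
```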
Workflow A-2 – Local Qwen2.5-VL OCR fallback (DISABLED 2026-05-06)
Smoke-tested on an RTX 4050 Mobile (6 GiB VRAM): qwen2.5vl:3b's vision compute graph alone needs 6.7 GiB, so Ollama scaled GPU layers from 20 → 0 trying to fit, then loaded the whole model on CPU. The measured CPU-mode rate was ~1 token/min (3 tokens in 292 s), i.e. roughly 8-15 hours per OCR page. Not viable on this hardware. The bat trigger is commented out; ocr_with_qwen.py and the Gemini exit-code-2 plumbing stay intact for future hardware (≥ 8 GiB VRAM, or after switching to a smaller VLM like moondream2).
When the fallback IS enabled, scripts/ocr_with_qwen.py takes over for the rest of the daily run when Gemini returns 429. Architecture (a sketch follows the list):
- PyMuPDF renders each PDF page at DPI 150 → JPEG bytes
- POST the image + a Chinese-aware OCR prompt to Ollama /api/generate (model qwen2.5vl:3b by default)
- Aggregate non-empty pages → same JSONL / R2 / DB-preview path as the Gemini script (helpers imported from ocr_with_gemini)
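A sketch of the per-page loop under those assumptions (requires PyMuPDF ≥ 1.22 for JPEG output; the prompt text is illustrative, not the script's actual prompt):

```python
import base64

import fitz      # PyMuPDF
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
PROMPT = "請輸出這一頁掃描書籍上的全部文字，保留段落與標點。"  # illustrative Chinese-aware OCR prompt

def ocr_page(pdf_path: str, page_no: int, model: str = "qwen2.5vl:3b", dpi: int = 150) -> str:
    doc = fitz.open(pdf_path)
    pix = doc[page_no].get_pixmap(dpi=dpi)            # render the page to a raster image
    jpeg = pix.tobytes("jpg")                         # JPEG bytes for the vision model
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": PROMPT,
        "images": [base64.b64encode(jpeg).decode()],  # Ollama takes base64-encoded images
        "stream": False,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["response"].strip()
```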
Why limit 5 per run
Per-page Qwen latency is ~10-30 s when the model actually fits on a GPU (the original estimate for the dev machine's RTX 4050 Mobile, 6 GB VRAM). A 200-page book is therefore 30-100 minutes, so --limit 5 in the bat caps a daily fallback session at a few hours and keeps the laptop usable. The 391-book backlog is not meant to be cleared by Qwen: Gemini handles 250 books/day on the free tier when quota holds, so the backlog converges in a couple of normal days. Qwen exists so progress doesn't completely stall when Gemini is down.
Model choice
| Model | VRAM (q4) | Chinese OCR quality | Notes |
|---|---|---|---|
| qwen2.5vl:3b (default) | ~3 GB weights + ~6.7 GB compute graph | Decent: handles modern simplified/traditional well | Doesn't fit in the 4050 Mobile's 6 GB VRAM (the compute graph alone exceeds it) |
| qwen2.5vl:7b | ~6 GB weights + larger graph | Notably better on dense traditional text | Even worse fit on the 4050 Mobile |
| Llama 3.2 Vision | – | Avoid: English-heavy training, weak on 繁體 | |
Override with --model qwen2.5vl:7b if VRAM permits (close other GPU consumers first).
Manual operations
```
ollama list | grep qwen2.5vl
python scripts/ocr_with_qwen.py status
python scripts/ocr_with_qwen.py run
python scripts/ocr_with_qwen.py run --model qwen2.5vl:7b --dpi 200 --limit 3
```
Failure modes
- Ollama daemon not running → the script exits 1 with "Ollama not reachable. Launch the Ollama desktop app or run ollama serve."
- Model not pulled → ollama pull qwen2.5vl:3b
- Per-page errors on more than 25% of pages → the book is marked parse_error: 'Qwen-OCR: too many page failures' (won't auto-retry; investigate the PDF)
- VRAM OOM mid-run → fall back to --model qwen2.5vl:3b or close GPU-using apps
Workflow D – Daily z-lib drop ingest
The user drops freshly acquired ebooks into z-lib/ at the project root (a local folder, not on Drive). The source's filename suffix ((z-library.sk), (1lib.sk), (z-lib.sk)) is preserved on disk and stripped during parse. scripts/ingest_new_books.py processes that folder once per scheduled run as part of run_ocr_daily.bat. For each ebook file (.pdf / .epub / .mobi / .azw3):
- Parse filename → (author, title, ext). Reuses parse_drive_inventory.parse_filename() after pre-stripping z-library suffixes like (z-library.sk), (1lib.sk), (z-lib.sk): the parent parser only knows the older (z-lib.org) form, and the inner commas trip its 全形/半形 comma split.
- Classify into one of the 9 main categories. Two-tier (sketched below, after this workflow's description):
  - Keyword fallback first (free): hits on christ|church|bonhoeffer|syriac|nestorius|cyril|monophysite|chalcedon|ephrem|babai|homilies|patristic|apostolic|gospel|biblical|theology → 宗教學; hits on zoroastr|avesta|islam|buddhis → 世界宗教.
  - Gemini 2.5 Flash otherwise: strict JSON output, with a prompt that explains the 9 categories. LLM mistakes like "基督教"/"神學" are auto-mapped to 宗教學.
- Insert an ebooks row with category set, parsed_at = NULL, and file_path pointing to the future Drive location (the move below puts it there).
- Move the local file → G:/我的雲端硬碟/資料/電子書/{category}/{author}／{title}.{ext}. Because G: is the Drive sync mount, the move IS the upload (the Drive client uploads in the background) AND the local delete, all in one filesystem rename. No OAuth / Drive API setup needed.
After ingest, the new rows appear in ebooks with parsed_at = NULL. The next parse_worker.py run (step 2 of the daily bat, or a manual run) extracts text where possible. If extraction fails (parse_error LIKE '%no extractable text%'), ocr_with_gemini.py picks the book up in step 3.
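Roughly the shape of that first classification tier, built from the keyword lists above (the real fallback_category() in scripts/ingest_new_books.py is authoritative):

```python
import re

RELIGION_RX = re.compile(
    r"christ|church|bonhoeffer|syriac|nestorius|cyril|monophysite|chalcedon"
    r"|ephrem|babai|homilies|patristic|apostolic|gospel|biblical|theology",
    re.IGNORECASE,
)
WORLD_RELIGION_RX = re.compile(r"zoroastr|avesta|islam|buddhis", re.IGNORECASE)

def fallback_category(title: str, author: str) -> str | None:
    """First-tier (free) classifier; None means 'escalate to Gemini'."""
    text = f"{title} {author}"
    if RELIGION_RX.search(text):
        return "宗教學"
    if WORLD_RELIGION_RX.search(text):
        return "世界宗教"
    return None
```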
Manual operations
```
python scripts/ingest_new_books.py status
python scripts/ingest_new_books.py run --dry-run
python scripts/ingest_new_books.py run --limit 3
python scripts/ingest_new_books.py run
```
Failure modes
- DB insert fails → file kept in z-lib/; safe to re-run, no orphan row.
- Move fails after the DB insert → file kept in z-lib/; the DB row exists but the file isn't on Drive. The script prints both paths so you can either move manually or delete the row. Rare on Windows (a cross-drive move = copy then delete).
- Gemini quota / 429 → that single book is skipped (file kept); other books continue with the keyword fallback. Tomorrow's run picks up the skipped book.
- Filename can't be parsed (no usable title) → logged as "SKIP: could not parse title from filename"; file kept. Add a manual override to parse_drive_inventory.TITLE_AUTHOR_OVERRIDES or rename the file.
- Target file already exists on Drive (duplicate book) → skipped, file kept in z-lib/. Manual cleanup needed.
Tuning notes
- The keyword fallback skews heavily toward Christian-studies content (the current user backlog). If a different research area dominates a future drop, extend fallback_category() in scripts/ingest_new_books.py.
- Gemini shares its daily quota with the OCR runner. The bat runs ingest first (small, ~1-5 calls/day) so OCR's heavy usage can't starve it. RPM is gentle (time.sleep(0.5) between books).
- Junk files in z-lib/ (e.g., Z-Library-latest.exe) are silently ignored; only EBOOK_EXTS are touched.
Workflow B – Standardize EPUB into reader-ready format
This is the second-pass transformation that makes a book look polished in /ebook/[id]: simplified → traditional Chinese, publisher boilerplate stripped, multi-volume hierarchy preserved, <h1>-<h4>/<b>/<em> mapped to markdown.
See the standardize-ebook skill for the full contract (output format, drop/dedupe rules, idempotency, tuning per publisher).
Quick commands
```
python scripts/standardize_ebook.py <ebook_id>
python scripts/standardize_ebook.py <ebook_id> --dry-run
python scripts/standardize_ebook.py --category 哲學
python scripts/standardize_ebook.py --category 哲學 --subcategory 近代哲學
python scripts/standardize_ebook.py --category 哲學 --limit 5 --dry-run
```
Current standardization state
| Book(s) | Done? | Notes |
|---|---|---|
| ○○的歷史 (id 181798a6-…) | ✅ done | Reference example. A K4 anchor in the EPUB is misplaced: one of the 發現者 volumes only has 3 chunks (publisher data quirk, not a script bug) |
| 哲學 category (51 EPUBs) | ✅ done with old volume logic | Needs a re-run with the new looks_like_volume() heuristic: single-volume books were picking up 「目錄/扉頁/出版說明」 as fake volumes. Re-run is safe (idempotent) |
| Other categories | ❌ not started | 9 categories; can be batched per category |
Recommended re-run after recent fix
```
python scripts/standardize_ebook.py --category 哲學
```
⚠ Caveat: re-running can shift chunk_index if drop/dedupe rules change, which would break existing annotations. Only 2 annotations exist book-wide right now, both on ○○的歷史 (which won't be re-run unless the fix is targeted). Safe for the 哲學 batch.
Workflow C – Back-fill ebook_chunks previews
Needed when full-text search must cover a book whose chunks exist on disk/R2 but not as DB previews. This happens to:
- Books that were parsed before the R2 offload, whose DB rows were truncated
- Books that hit the 57014 IO timeout during repopulate_chunk_previews.py's first run
```
python scripts/repopulate_chunk_previews.py status
python scripts/repopulate_chunk_previews.py run
python scripts/repopulate_chunk_previews.py retry-failed
python scripts/repopulate_chunk_previews.py run --book <ebook_id> --force
```
retry-failed is the safe re-run mode: it finds books whose ebook_chunks count is below the expected ebooks.chunk_count and retries only those, with adaptive batches that survive 57014.
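For reference, what a preview row is built from; a sketch over one book's JSONL (the real script wraps this in the adaptive batching described under Critical constraints):

```python
import json
from pathlib import Path

CHUNKS_DIR = Path("G:/我的雲端硬碟/資料/電子書/_chunks")

def preview_rows(ebook_id: str, preview_chars: int = 200):
    """Yield ebook_chunks rows: full text stays on disk, only a preview goes to DB."""
    with open(CHUNKS_DIR / f"{ebook_id}.jsonl", encoding="utf-8") as f:
        for line in f:
            chunk = json.loads(line)
            yield {
                "ebook_id": ebook_id,
                "chunk_index": chunk["chunk_index"],
                "chunk_type": chunk.get("chunk_type"),
                "page_number": chunk.get("page_number"),
                "chapter_path": chunk.get("chapter_path"),
                "content": chunk["content"][:preview_chars],  # 200-char preview only
                "char_count": len(chunk["content"]),          # length of the full text
            }
```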
Decision tree for "this book looks broken in the reader"
Book opens but no content?
→ Check the ebook_chunks count for this book. If 0, run repopulate_chunk_previews.py run --book <id>
→ If still missing, check that the local JSONL exists. If not, re-parse (parse_worker.py or ocr_with_gemini.py)

Reader sidebar shows 「目錄/扉頁」 as fake volumes?
→ standardize_ebook.py was run before the volume-marker fix. Re-run for this book.

Reader shows a "Digital Lab" page or other publisher noise?
→ Add the publisher's specific phrase to HARD_DROP_PATTERNS in standardize_ebook.py, then re-run.

Search returns no fulltext hits but title/author work?
→ ebook_chunks doesn't have this book's previews. Run repopulate_chunk_previews.py.

A scanned PDF still shows 「此頁無內容」 12+ hours after OCR was scheduled?
→ Check scripts/logs/ocr_YYYY-MM-DD.log for errors.
→ If quota was hit and the bat fell back to Qwen, look for the "--- gemini quota hit, falling back to ocr_with_qwen ---" line.
→ If quota was hit BUT Qwen also failed, ensure the Ollama daemon is running and qwen2.5vl:3b is pulled.
→ Otherwise: just wait; tomorrow's run picks it up.

Many books in the OCR queue showing "file not found"?
→ Probably the G: drive (Drive sync) is disconnected. Check: `Get-PSDrive -PSProvider FileSystem` should list G:.
→ Re-launch the Google Drive client. The "file not found" parse_errors stay until you manually flip them back to 'no extractable text' once Drive is back.

A new book dropped into z-lib/ never showed up in the reader?
→ Check the "--- ingest_new_books ---" section of scripts/logs/ocr_YYYY-MM-DD.log.
→ 'CLASSIFY FAILED' = Gemini quota; tomorrow retries automatically.
→ 'could not parse title' = unsupported filename pattern; rename or extend TITLE_AUTHOR_OVERRIDES.
→ 'target already exists on Drive' = duplicate book; manual cleanup.
Common pitfalls
- opencc s2tw over-converts: 历史 → 曆史 (should be 歷史), and some single-character surnames get mangled the same way. The post-fix table lives in scripts/parse_drive_inventory.py:TRAD_FIXES; see the sketch after this list.
- Chinese characters in shell output on Windows trigger cp950 codec errors. Always use sys.stdout.reconfigure(encoding='utf-8') or write to a UTF-8 file.
- PostgREST Range: */0 is ambiguous: it can mean 0 rows OR unknown count. Don't trust it unless Prefer: count=exact is set; use the explicit count format.
- The server/utils/ebook-chunks.ts LRU cache (10-min TTL) means hot edits to JSONL aren't visible immediately. Restart the dev server after a batch standardize / re-parse to clear the cache.
- Filename collisions in the same folder: when multiple files would get the same stripped title, parse_filename() keeps the subtitle. Already handled.
- REST API row limit of 1000: any new bulk read must use .range()-based pagination. The existing index endpoint has .range(0, 1999) baked in.
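A minimal sketch of the convert-then-post-fix pattern (the entries shown are illustrative; the authoritative table is scripts/parse_drive_inventory.py:TRAD_FIXES):

```python
from opencc import OpenCC   # e.g. pip install opencc-python-reimplemented

cc = OpenCC("s2tw")

# Known opencc over-conversions worth reverting (illustrative subset)
TRAD_FIXES = {
    "曆史": "歷史",   # s2tw picks the calendar 曆 for 历史
}

def to_traditional(text: str) -> str:
    out = cc.convert(text)
    for wrong, right in TRAD_FIXES.items():
        out = out.replace(wrong, right)
    return out
```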
Files NOT to touch unless user requests
- data/local_inventory.json: frozen Drive scan snapshot
- data/parse_progress.txt: auto-managed by parse_worker
- G:/我的雲端硬碟/資料/電子書/_chunks/*.jsonl: source of truth for full text (R2 mirrors these; if both are lost, the only recovery is a re-parse)
Recommended order for "I'm a new agent picking this up"
- Run all three status checks to see what's queued:

```
python scripts/ingest_new_books.py status
python scripts/repopulate_chunk_previews.py status
python scripts/ocr_with_gemini.py status
```

- If z-lib/ has files waiting → run python scripts/ingest_new_books.py run (Workflow D).
- Read the standardize-ebook skill before touching that pipeline.
- Don't re-run standardize_ebook.py on books that have annotations (annotations table). Currently only ○○的歷史 has any.
- For category-batched standardize (哲學 → others), use --dry-run first, then the real run.
- Check the scheduler's runs (every 6 hours): read the day's log file in scripts/logs/. Logs are still named ocr_YYYY-MM-DD.log but now also contain ingest + parse output.
Reader-side features tied to the pipeline
Bookshelf + reading bookmarks
Per-user reading state. Schema in database/bookshelf-and-bookmarks.sql. RLS is on; service-role endpoints filter by user_id.
- user_reading_status (user_id, ebook_id, status, updated_at): status ∈ 'reading' | 'read'; the PK enforces one row per user-book pair.
- reading_bookmarks (id, user_id, ebook_id, chunk_index, created_at): date-stamped 「今日讀到這裡」 markers; multiple per book.
Endpoints:
- PUT /api/ebooks/:id/reading-status with body {status}; a null status removes the row
- GET /api/ebooks/:id/reading-status
- POST/GET /api/ebooks/:id/bookmarks, DELETE /api/bookmarks/:id
- GET /api/me/bookshelf: merged listing for the /bookshelf page and the /ebook sidebar counts
Reader UX (pages/ebook/[id].vue):
- The toolbar status pill cycles 📕 → 📖 (reading) → ✅ (read) → 📕
- The 「📅 今日讀到這裡」 button shows only when status is reading
- The TOC sidebar shows date badges next to chunks with bookmarks (hover-× to delete)
- Auto-jump to the latest bookmark on book open when status = reading AND no ?page= is in the URL; skipped when status = read

The /ebook sidebar has 📖 閱讀中 (n) / ✅ 已讀 (n) entries below the categories, using the same URL-driven model as categories (?shelf=reading|read). app/router.options.ts adds scrollBehavior(savedPosition) so browser-back from /ebook/:id → /ebook returns to the previous scroll position.
Excerpts auto-flow (save_as_excerpt: true on POST /api/annotations)
The reader's 「+ 書摘」 button creates an annotation and:
- If the ebook's book_id is null → auto-creates a books row using ALL the rich metadata extracted by standardize (title / author / translator / publisher / publication_year / original_title / original_author / original_publish_year), then writes book_id back to ebooks.
- Inserts an excerpts row with content + title (required, prompted via modal) + chapter (= pageChapter from the reader) + page_number (= 第 N 段) + the new book_id.
- Links the annotation to the excerpt via excerpt_id.
Result: the book appears in /excerpts/library with rich metadata, excerpts are grouped by chapter on /excerpts/library/[bookId], and search results stay fresh because that page refetches allBooks on visibilitychange.
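An illustrative request body for that flow (field names per the annotations schema above; the endpoint's actual contract is authoritative):

```json
{
  "ebook_id": "181798a6-…",
  "chunk_index": 42,
  "selected_text": "……",
  "context_before": "……",
  "context_after": "……",
  "color": "yellow",
  "save_as_excerpt": true,
  "title": "書摘標題"
}
```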
Tags (tags + book_tags + excerpt_tags)
Cross-cutting, network-style organization that complements the existing tree of categories. Schema in database/tags.sql, applied via the Management API (psycopg2 direct DB access is blocked by IPv6-only DNS; see the auto-memory reference_supabase_management_api.md).
- tags (id, name, color, created_at): global; name is unique (case-insensitive lookup keeps the picker idempotent)
- book_tags (book_id, tag_id) and excerpt_tags (excerpt_id, tag_id): junction tables with CASCADE on entity delete
Endpoints:
- GET/POST /api/tags, DELETE /api/tags/:id
- GET/PUT /api/books/:id/tags, GET/PUT /api/excerpts/:id/tags
- tagId query param on /api/books and /api/ebooks (resolves book_tags to a set of book_ids first, then narrows the listing query)
The reusable components/TagPicker.vue shows selected tags as chips, typeahead-filters existing tags, and creates new ones on Enter (the server returns the existing row when the name collides). Wired into:
- the /excerpts/library/[bookId] book metadata block (book-level tags)
- the per-excerpt card on the same page (excerpt-level tags)

URL-driven filter ?tag=<id>:
- /excerpts/library shows a tag chip strip below the search bar
- /ebook shows a 「標籤」 sidebar section below 「我的書櫃」 (only rendered when at least one tag exists)
Markdown citation export
The 「匯出 Markdown」 button on the /excerpts/library/[bookId] toolbar builds a self-contained markdown document:
- Bibliographic header (title / author / translator / publisher / year, plus an original-publication line for translations)
- Chapter-grouped excerpt blocks with > blockquote content and ——《book》, chapter, page citation tails
One click copies it to the clipboard AND triggers a <book-title>.md download. No backend involved: it's built client-side from the already-loaded book.value + chapterGroups.
Pending TODOs
Online metadata enrichment: ✅ shipped 2026-05-06
scripts/enrich_book_metadata.py reads books rows where publisher IS NULL OR publish_year IS NULL and tries Google Books (primary for CJK titles), then Open Library. Each candidate must pass a title match and an author match; subtitle-stripped variants are tried when the precise query returns nothing. Only null fields get written, and metadata_locked = true blocks all writes; see the metadata_locked and metadata_source columns added in the same session.
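A sketch of the precise-first lookup against Google Books (the endpoint and volumeInfo fields are the public API; the match-and-write rules above still gate any DB update):

```python
import requests

def probe_google_books(title: str, author: str | None):
    """Try the precise query first, then a subtitle-stripped title variant."""
    queries = [f'intitle:"{title}"' + (f' inauthor:"{author}"' if author else "")]
    if "：" in title:                                  # CJK title：subtitle separator
        queries.append(f'intitle:"{title.split("：")[0]}"')
    for q in queries:
        r = requests.get(
            "https://www.googleapis.com/books/v1/volumes",
            params={"q": q, "maxResults": 5},
            timeout=30,
        )
        r.raise_for_status()
        for item in r.json().get("items", []):
            info = item["volumeInfo"]
            # each candidate must still pass title-match + author-match
            # before any null publisher / publish_year field is written
            yield info.get("publisher"), info.get("publishedDate")
```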
First-pass results on the 28-book backlog:
| Field | Filled / total books rows | Δ vs extraction-only baseline |
|---|---|---|
| publisher | 109 / 123 (89%) | +8 |
| publish_year | 107 / 123 (87%) | +11 |
| translator | 58 / 123 | untouched (譯者 isn't online-enrichable) |
| original_title / original_author / original_publish_year | 58 / 123 | untouched (same reason) |
The remaining 17 no-hits cluster as: article fragments (titled 「…」), Chinese Buddhist works thinly indexed online, and translated Western books whose Chinese edition is absent from Google Books while only the English original is present (correctly rejected: wrong publisher data would be worse than null). These need manual fill via the /excerpts/library/[bookId] UI.
Re-run is safe:

```
python scripts/enrich_book_metadata.py status
python scripts/enrich_book_metadata.py run
python scripts/enrich_book_metadata.py run --book <uuid>
python scripts/enrich_book_metadata.py probe --book <uuid>
```

Lock a manually edited row so this script never touches its nulls again:

```sql
UPDATE books SET metadata_locked = true WHERE id = '...';
```
Optional improvement: set GOOGLE_BOOKS_API_KEY in .env for higher quota (the anonymous tier is fine for dozens of rows; only useful when re-running over hundreds).
See also