| name | batch-translate |
| description | Batch process books through the complete pipeline - generate cropped images for split pages, OCR all pages, then translate with context. Use when asked to process, OCR, translate, or batch process one or more books. |
Batch Book Translation Workflow
Process books through the complete pipeline: Crop → OCR → Translate
Roadmap Reference
See .claude/ROADMAP.md for the translation priority list.
Priority 1 = UNTRANSLATED - These are highest priority for processing:
- Kircher encyclopedias (Oedipus, Musurgia, Ars Magna Lucis)
- Fludd: Utriusque Cosmi Historia
- Theatrum Chemicum, Musaeum Hermeticum
- Cardano: De Subtilitate
- Della Porta: Magia Naturalis
- Lomazzo, Poliziano, Landino
curl -s "https://sourcelibrary.org/api/books/roadmap" | jq '.books[] | select(.priority == 1) | {title, notes}'
Roadmap source: src/app/api/books/roadmap/route.ts
Overview
This workflow handles the full processing pipeline for historical book scans:
- Generate Cropped Images - For split two-page spreads, extract individual pages
- OCR - Extract text from page images using Gemini vision
- Translate - Translate OCR'd text with prior page context for continuity
API Endpoints
| Endpoint | Purpose |
|---|---|
| GET /api/books | List all books |
| GET /api/books/BOOK_ID | Get book with all pages |
| POST /api/jobs/queue-books | Queue pages for Lambda worker processing (primary path) |
| GET /api/jobs | List processing jobs |
| POST /api/jobs/JOB_ID/retry | Retry failed pages in a job |
| POST /api/jobs/JOB_ID/cancel | Cancel a running job |
| POST /api/books/BOOK_ID/batch-ocr-async | Submit Gemini Batch API OCR job (50% cheaper, ~24h) |
| POST /api/books/BOOK_ID/batch-translate-async | Submit Gemini Batch API translation job |
Processing Options
Option 1: Lambda Workers via Job System (Primary Path)
The primary processing path uses AWS Lambda workers via SQS queues. Each page is processed independently with automatic job tracking.
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d '{"bookIds": ["BOOK_ID"], "action": "ocr"}'
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d '{"bookIds": ["BOOK_ID"], "action": "translation"}'
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d '{"bookIds": ["BOOK_ID"], "action": "image_extraction"}'
Model selection: Don't hardcode a model — prefer getModelForBook(book) from src/lib/types/ai-models.ts, which routes BPH books and non-Latin-script languages to gemini-3-flash-preview (full quality) and everything else to gemini-3.1-flash-lite-preview (50% cheaper, comparable quality on Latin script). If you must specify a model in a job payload, pick based on this rule. Never use Gemini below v3 (gemini-2.x is deprecated).
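As a rough illustration, the routing rule can be written as a small shell helper. This is only a sketch: the real logic lives in getModelForBook, the name pick_model and its two arguments are hypothetical, and how a book gets flagged as BPH or non-Latin-script is not specified here.

```shell
# Hypothetical helper mirroring the routing rule described above.
# $1: "true" if the book is a BPH book, otherwise "false"
# $2: "latin" for Latin-script languages, anything else otherwise
pick_model() {
  local is_bph="$1" script="$2"
  if [ "$is_bph" = "true" ] || [ "$script" != "latin" ]; then
    echo "gemini-3-flash-preview"          # full quality
  else
    echo "gemini-3.1-flash-lite-preview"   # cheaper model for Latin script
  fi
}

pick_model false latin   # gemini-3.1-flash-lite-preview
pick_model true latin    # gemini-3-flash-preview
```

The output can be substituted into the "model" field of a job payload when a model has to be given explicitly.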
Option 2: Gemini Batch API (50% Cheaper, Automated Pipeline)
The post-import-pipeline cron uses Gemini Batch API for automated processing of newly imported books. Results arrive in ~24 hours at 50% cost.
| Job Type | API | Model | Cost |
|---|---|---|---|
| Single page | Realtime (Lambda) | per getModelForBook(book) | Full price |
| batch_ocr | Batch API | per getModelForBook(book) | 50% off |
| batch_translate | Batch API | per getModelForBook(book) | 50% off |
Stacking discounts: lite + Batch API on a Latin-script non-BPH book is ~75% off full flash realtime.
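The ~75% figure follows from multiplying the two discounts, assuming each is exactly 50% and that they compound: 0.5 × 0.5 = 0.25 of the full price. A quick check (stacked_cost is a hypothetical helper name):

```shell
# Relative cost after stacking two discounts. Each argument is the fraction
# of full price that remains (0.5 means 50% off). Assumes discounts multiply.
stacked_cost() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a * b }'
}

stacked_cost 0.5 0.5   # lite model (0.5) x Batch API (0.5) -> 0.25
```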
OCR Output Format
OCR uses Markdown output with semantic tags:
Markdown Formatting
- # ## ### for headings (bigger text = bigger heading)
- **bold**, *italic* for emphasis
- ->centered text<- for centered lines (NOT for headings)
- > blockquotes for quotes/prayers
- --- for dividers
- Tables only for actual tabular data
Metadata Tags (hidden from readers)
| Tag | Purpose |
|---|---|
| <lang>X</lang> | Detected language |
| <page-num>N</page-num> | Page/folio number |
| <header>X</header> | Running headers |
| <sig>X</sig> | Printer's marks (A2, B1) |
| <meta>X</meta> | Hidden metadata |
| <warning>X</warning> | Quality issues |
| <vocab>X</vocab> | Key terms for indexing |
Inline Annotations (visible to readers)
| Tag | Purpose |
|---|---|
| <margin>X</margin> | Marginal notes (before paragraph) |
| <gloss>X</gloss> | Interlinear annotations |
| <insert>X</insert> | Boxed text, additions |
| <unclear>X</unclear> | Illegible readings |
| <note>X</note> | Interpretive notes |
| <term>X</term> | Technical vocabulary |
| <image-desc>X</image-desc> | Describe illustrations |
Critical OCR Rules
- Preserve original spelling, capitalization, punctuation
- Page numbers/headers/signatures go in metadata tags only
- IGNORE partial text at edges (from facing page in spread)
- Describe images/diagrams with <image-desc>, never tables
- End with <vocab>key terms, names, concepts</vocab>
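To make the format concrete, here is a hypothetical page that follows these rules, together with a small helper for pulling a tag's contents back out. The page text is placeholder filler, not real OCR output, and sample_page and extract_tag are illustrative names rather than part of the pipeline.

```shell
# Hypothetical OCR output for a single page (placeholder text, not real data).
sample_page() {
  cat <<'EOF'
<lang>Latin</lang>
<page-num>12</page-num>
<header>LOREM IPSUM</header>

## Lorem ipsum

<margin>Nota</margin>
Lorem ipsum dolor sit amet, <unclear>consectetur</unclear> adipiscing elit.

<image-desc>Woodcut of a furnace with three vessels</image-desc>

<vocab>lorem, ipsum</vocab>
EOF
}

# Print the contents of one tag, assuming the tag opens and closes on one line.
extract_tag() {
  sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p"
}

sample_page | extract_tag vocab      # lorem, ipsum
sample_page | extract_tag page-num   # 12
```

Note that the page number and running header appear only inside metadata tags, the marginal note precedes its paragraph, and the vocab tag is the last line.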
Step 1: Analyze Book Status
First, check what work is needed for a book:
curl -s "https://sourcelibrary.org/api/books/BOOK_ID" > /tmp/book.json
jq '{
title: .title,
total_pages: (.pages | length),
split_pages: [.pages[] | select(.crop)] | length,
needs_crop: [.pages[] | select(.crop) | select(.cropped_photo | not)] | length,
has_ocr: [.pages[] | select((.ocr.data // "") | length > 0)] | length,
needs_ocr: [.pages[] | select((.ocr.data // "") | length == 0)] | length,
has_translation: [.pages[] | select((.translation.data // "") | length > 0)] | length,
needs_translation: [.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length
}' /tmp/book.json
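The counts from that query map onto the Crop → OCR → Translate order: cropping comes first because OCR run on an uncropped spread picks up both pages. A minimal sketch of that decision, where next_step is a hypothetical helper and its arguments are the needs_crop, needs_ocr, and needs_translation values:

```shell
# Hypothetical helper: pick the next pipeline stage from the status counts.
# $1 = needs_crop, $2 = needs_ocr, $3 = needs_translation
next_step() {
  if [ "$1" -gt 0 ]; then
    echo "crop"
  elif [ "$2" -gt 0 ]; then
    echo "ocr"
  elif [ "$3" -gt 0 ]; then
    echo "translate"
  else
    echo "done"
  fi
}

next_step 4 120 0    # crop
next_step 0 120 0    # ocr
next_step 0 0 35     # translate
```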
Detecting Bad OCR
Pages that were OCR'd before their cropped images were generated have incorrect OCR: the text covers both pages of the spread. Detect them with:
jq '[.pages[] | select(.crop) | select(.ocr.data) |
select(.ocr.data | test("two-page|spread"; "i"))] | length' /tmp/book.json
Step 2: Generate Cropped Images
For books with split two-page spreads, generate individual page images:
CROP_IDS=$(jq '[.pages[] | select(.crop) | select(.cropped_photo | not) | .id]' /tmp/book.json)
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
-H "Content-Type: application/json" \
-d "{
\"type\": \"generate_cropped_images\",
\"book_id\": \"BOOK_ID\",
\"book_title\": \"BOOK_TITLE\",
\"page_ids\": $CROP_IDS
}"
Process the job:
curl -s -X POST "https://sourcelibrary.org/api/jobs/JOB_ID/process"
Step 3: OCR Pages
Option A: Using Job System (for large batches)
OCR_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length == 0) | .id]' /tmp/book.json)
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
-H "Content-Type: application/json" \
-d "{
\"type\": \"batch_ocr\",
\"book_id\": \"BOOK_ID\",
\"book_title\": \"BOOK_TITLE\",
\"model\": \"gemini-3.1-flash-lite-preview\",
\"language\": \"Latin\",
\"page_ids\": $OCR_IDS
}"
Option B: Using Lambda Workers with Page IDs
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d '{
"bookIds": ["BOOK_ID"],
"action": "ocr",
"pageIds": ["PAGE_ID_1", "PAGE_ID_2"],
"overwrite": true
}'
Lambda workers automatically use cropped_photo when available.
Step 4: Translate Pages
Option A: Using Job System
TRANS_IDS=$(jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0) | .id]' /tmp/book.json)
curl -s -X POST "https://sourcelibrary.org/api/jobs" \
-H "Content-Type: application/json" \
-d "{
\"type\": \"batch_translate\",
\"book_id\": \"BOOK_ID\",
\"book_title\": \"BOOK_TITLE\",
\"model\": \"gemini-3.1-flash-lite-preview\",
\"language\": \"Latin\",
\"page_ids\": $TRANS_IDS
}"
Option B: Using Lambda Workers (Recommended)
The Lambda FIFO queue automatically provides previous-page context for translation continuity:
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d '{"bookIds": ["BOOK_ID"], "action": "translation"}'
The translation Lambda worker processes pages sequentially via FIFO queue and fetches the previous page's translation for context.
Complete Book Processing Script
Process a single book through the full pipeline using Lambda workers:
#!/bin/bash
BOOK_ID="YOUR_BOOK_ID"
BASE_URL="https://sourcelibrary.org"
echo "Fetching book..."
BOOK=$(curl -s "$BASE_URL/api/books/$BOOK_ID")
TITLE=$(echo "$BOOK" | jq -r '.title[0:40]')
echo "Processing: $TITLE"
NEEDS_OCR=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length == 0)] | length')
if [ "$NEEDS_OCR" != "0" ]; then
echo "Queueing OCR for $NEEDS_OCR pages..."
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"ocr\"}"
echo "OCR job queued!"
fi
NEEDS_TRANS=$(echo "$BOOK" | jq '[.pages[] | select((.ocr.data // "") | length > 0) | select((.translation.data // "") | length == 0)] | length')
if [ "$NEEDS_TRANS" != "0" ]; then
echo "Queueing translation for $NEEDS_TRANS pages..."
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d "{\"bookIds\": [\"$BOOK_ID\"], \"action\": \"translation\"}"
echo "Translation job queued!"
fi
echo "Jobs queued! Monitor progress at $BASE_URL/jobs"
Fixing Bad OCR
When pages were OCR'd before cropped images existed, they contain text from both pages. Fix with:
BAD_OCR_IDS=$(jq '[.pages[] | select(.crop) | select(.ocr.data) |
select(.ocr.data | test("two-page|spread"; "i")) | .id]' /tmp/book.json)
curl -s -X POST "https://sourcelibrary.org/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d "{\"bookIds\": [\"BOOK_ID\"], \"action\": \"ocr\", \"pageIds\": $BAD_OCR_IDS, \"overwrite\": true}"
Processing All Books
Use the Lambda worker job system for bulk processing:
#!/bin/bash
BASE_URL="https://sourcelibrary.org"
# Collect every book ID as a compact JSON array (-c keeps it on one line)
BOOK_IDS=$(curl -s "$BASE_URL/api/books" | jq -c '[.[].id]')
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d "{\"bookIds\": $BOOK_IDS, \"action\": \"ocr\"}"
curl -s -X POST "$BASE_URL/api/jobs/queue-books" \
-H "Content-Type: application/json" \
-d "{\"bookIds\": $BOOK_IDS, \"action\": \"translation\"}"
Monitor progress at https://sourcelibrary.org/jobs
Monitoring Progress
Check overall library status:
curl -s "https://sourcelibrary.org/api/books" | jq '[.[] | {
title: .title[0:30],
pages: .pages_count,
ocr: .ocr_count,
translated: .translation_count
}] | sort_by(-.pages)'
Troubleshooting
Empty Strings vs Null (CRITICAL)
In jq, empty strings "" are truthy (only null and false are falsy). This means:
- select(.ocr.data) matches pages with "" (WRONG)
- select(.ocr.data | not) does NOT match pages with "" (WRONG)
- Use select((.ocr.data // "") | length == 0) to find missing/empty OCR
- Use select((.ocr.data // "") | length > 0) to find pages WITH OCR content
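The difference is easy to see on a tiny hand-made input. The three pages below are hypothetical: one with real text, one with an empty string, and one with no ocr field at all. Requires jq on the PATH.

```shell
# Three hypothetical pages: real OCR text, an empty string, and no ocr field.
pages='[{"id":"a","ocr":{"data":"text"}},{"id":"b","ocr":{"data":""}},{"id":"c"}]'

# Truthiness check: page b slips through because "" is truthy.
# Prints ["a","b"]
naive_has_ocr() { echo "$pages" | jq -c '[.[] | select(.ocr.data) | .id]'; }

# Length checks: null and "" are treated the same way.
# Prints ["a"]
has_ocr() { echo "$pages" | jq -c '[.[] | select((.ocr.data // "") | length > 0) | .id]'; }

# Prints ["b","c"]
needs_ocr() { echo "$pages" | jq -c '[.[] | select((.ocr.data // "") | length == 0) | .id]'; }
```

Calling naive_has_ocr, has_ocr, and needs_ocr in turn shows page b counted as having OCR by the naive filter and correctly flagged as missing by the length-based one.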
Rate Limits (429 errors)
Gemini API Tiers
| Tier | RPM | How to Qualify |
|---|---|---|
| Free | 15 | Default |
| Tier 1 | 300 | Enable billing + $50 spend |
| Tier 2 | 1000 | $250 spend |
| Tier 3 | 2000 | $1000 spend |
Optimal Sleep Times by Tier
| Tier | Max RPM | Safe Sleep Time | Effective Rate |
|---|---|---|---|
| Free | 15 | 4.0s | ~15/min |
| Tier 1 | 300 | 0.4s | ~150/min |
| Tier 2 | 1000 | 0.12s | ~500/min |
| Tier 3 | 2000 | 0.06s | ~1000/min |
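Note: The paid-tier sleep times target ~50% of the max rate to leave headroom for bursts. The Free row's 4.0s corresponds to the full 15 RPM; sleep ~8s to keep the same 50% margin on the free tier.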
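The sleep times in the table follow from 60 / (RPM × fraction of the limit used). A small helper reproduces them; safe_sleep is a hypothetical name, and the 0.5 default is an assumption taken from the 50% headroom figure.

```shell
# Seconds to sleep between requests for a given RPM limit.
# $1 = max RPM, $2 = fraction of the limit to use (defaults to 0.5)
safe_sleep() {
  awk -v rpm="$1" -v frac="${2:-0.5}" 'BEGIN { printf "%.2f\n", 60 / (rpm * frac) }'
}

safe_sleep 300       # 0.40  (Tier 1)
safe_sleep 1000      # 0.12  (Tier 2)
safe_sleep 2000      # 0.06  (Tier 3)
safe_sleep 15 1.0    # 4.00  (Free row, full limit)
```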
API Key Rotation
The system supports multiple API keys for higher throughput:
- Set GEMINI_API_KEY (primary)
- Set GEMINI_API_KEY_2, GEMINI_API_KEY_3, ... up to GEMINI_API_KEY_10
- Keys rotate automatically with a 60s cooldown after a rate limit
With N keys at Tier 1, the ceiling is N × 300 RPM, or roughly N × 150 safe requests/min at 50% headroom.
Function Timeouts
- Jobs have maxDuration=300s for Vercel Pro
- If hitting timeouts, reduce CROP_CHUNK_SIZE in job processing
Missing Cropped Photos
- Check if crop job completed successfully
- Verify the page has crop data with xStart and xEnd
- Re-run crop generation for specific pages
Bad OCR Detection
Look for these patterns in OCR text, which indicate the wrong image was used:
- "two-page spread"
- "left page" / "right page" descriptions
- Duplicate text blocks
- References to facing pages
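For a quick local check on a single page's text, the phrase-based patterns above can be folded into one case-insensitive grep. This is a sketch under assumptions: looks_like_spread is a hypothetical helper, the exact phrase list is a guess drawn from the bullets, and it does not attempt to detect duplicate text blocks.

```shell
# Exit 0 if OCR text read from stdin looks like it came from an uncropped
# spread. The phrase list is an assumption based on the patterns listed above.
looks_like_spread() {
  grep -Eiq 'two-page spread|left page|right page|facing page'
}

echo "The left page shows a diagram." | looks_like_spread && echo "suspect"
echo "Caput primum." | looks_like_spread || echo "ok"
```

Pages flagged this way are candidates for re-OCR with overwrite enabled, after confirming that their cropped images exist.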