| name | progress |
| description | Check pipeline processing progress — all phases including OCR, translation, image extraction, enrichment, and more. Shows real page-level verification, not just job counters. Use when asked "how's it going?", "progress?", "status?", "now?", or any progress check. |
Pipeline Progress Check
Report on processing progress with verified page-level data across ALL pipeline phases. Job counters lie — always cross-reference against actual page records.
Also remind the user: full pipeline dashboard is live at https://sourcelibrary.org/analytics (Pipeline tab).
Critical Diagnostic Rules
- Never trust job counters alone. A job showing
status: 'processing' with progress.completed: 0 for hours is STUCK, not "in progress".
- Never attribute global throughput to a specific operation. Translation throughput from
pages.translation.updated_at includes ALL sources (pipeline cron, Lambda workers, backfill jobs). Filter by book IDs to attribute correctly.
- Spot-check actual pages. For any "processing" job, check if its pages actually have
translation.data or ocr.data. This is ground truth.
- Flag stale jobs explicitly. Any job in
processing for >2 hours with 0 progress is stuck. Say so clearly.
Snapshot-First Approach (PREFERRED)
A pre-computed snapshot exists at system_config._id: 'enrichment_snapshot' (updated every 2h by Hetzner cron). Always try this first — it's a single document read vs 15+ slow aggregations.
const snapshot = await db.collection('system_config').findOne({ _id: 'enrichment_snapshot' });
If the snapshot exists and computed_at is within the last 3 hours, use it directly. Display the computed_at timestamp so the user knows how fresh the data is.
The snapshot contains: funnel, enrichment, milestones, gallery, active_jobs, stuck_jobs, throughput, paused, paused_phases, and computation_ms.
Only fall back to live queries if:
- The snapshot doesn't exist
- The snapshot is older than 3 hours
- The user explicitly asks for "live" or "fresh" data
For live queries, prefer running the enrichment-snapshot script directly (node scripts/workers/enrichment-snapshot.mjs) rather than inline queries — it writes the snapshot and you can read it back.
What to Check (Live Fallback)
Run a single MongoDB script that reports ALL of the following. Use set -a; source .env.production.local; set +a; node -e "..." to run.
1. Pipeline Funnel (bird's eye)
const funnel = await db.collection('books').aggregate([
{ $match: { 'pipeline_auto.status': { $exists: true } } },
{ $group: { _id: '$pipeline_auto.status', count: { $sum: 1 } } },
]).toArray();
IMPORTANT: Display in lifecycle order, not sorted by count. Use this fixed order (skip stages with 0 books):
queued → archiving → archive_complete → ocr_submitted → ocr_complete →
metadata_enriched → ft_verifying → ft_verified → translate_submitted →
translate_complete → enriching → enriched → chapters → chapters_complete →
images_submitted → images_complete → complete
Then separately: needs_attention, failed, plus any other statuses (e.g. empty_shell, paused, replaced_by_*).
This matches the Pipeline Funnel chart in the /analytics dashboard.
2. Enrichment Coverage (all phases)
const enrichment = await db.collection('books').aggregate([
{ $match: { status: { $ne: 'deleted' }, pages_count: { $gt: 0 } } },
{ $group: {
_id: null,
total: { $sum: 1 },
has_ocr: { $sum: { $cond: [{ $gt: ['$pages_ocr', 0] }, 1, 0] } },
has_translation: { $sum: { $cond: [{ $gt: ['$pages_translated', 0] }, 1, 0] } },
has_metadata: { $sum: { $cond: [{ $ifNull: ['$ai_metadata.enriched_at', false] }, 1, 0] } },
has_ft_verification: { $sum: { $cond: [{ $ifNull: ['$translation_verification.verified_at', false] }, 1, 0] } },
has_summary: { $sum: { $cond: [{ $ifNull: ['$index.generatedAt', false] }, 1, 0] } },
has_chapters: { $sum: { $cond: [{ $ifNull: ['$chapters', false] }, 1, 0] } },
has_collections: { $sum: { $cond: [{ $ifNull: ['$collection_scores', false] }, 1, 0] } },
has_quality_score: { $sum: { $cond: [{ $ifNull: ['$quality_score', false] }, 1, 0] } },
has_faceted_tags: { $sum: { $cond: [{ $ifNull: ['$faceted_tags', false] }, 1, 0] } },
has_author_entity: { $sum: { $cond: [{ $ifNull: ['$author_entity_id', false] }, 1, 0] } },
fully_translated: { $sum: { $cond: [{ $and: [{ $gt: ['$pages_translated', 0] }, { $gte: ['$pages_translated', { $subtract: [{ $ifNull: ['$pages_ocr', 0] }, { $ifNull: ['$pages_blank', 0] }] }] }] }, 1, 0] } },
pipeline_complete: { $sum: { $cond: [{ $eq: ['$pipeline_auto.status', 'complete'] }, 1, 0] } },
}}
]).toArray();
2b. First Translation Milestones
const firstTranslations = await db.collection('books').countDocuments({
is_first_translation: true
});
console.log(`First translations into English: ${firstTranslations}`);
const near90 = await db.collection('books').countDocuments({
status: { $ne: 'deleted' },
pages_count: { $gt: 0 },
pages_translated: { $gt: 0 },
$expr: { $gte: ['$pages_translated', { $multiply: [{ $subtract: [{ $ifNull: ['$pages_ocr', 0] }, { $ifNull: ['$pages_blank', 0] }] }, 0.9] }] }
});
console.log(`>90% translated: ${near90}`);
IMPORTANT: The first-translation flag is is_first_translation at the book root level (NOT translation_verification.is_first_translation). It's a boolean derived from the translation_verification.disposition field — true for confirmed_first, first_complete_translation, and first_modern_translation.
3. Image Extraction & Gallery
const booksWithImages = await db.collection('gallery_images').distinct('book_id');
console.log(`Books with extracted images: ${booksWithImages.length}`);
const galleryCount = await db.collection('gallery_images').estimatedDocumentCount();
console.log(`Gallery images: ${galleryCount}`);
const imageJobs = await db.collection('jobs').aggregate([
{ $match: { type: 'image_extraction', status: { $in: ['pending', 'processing'] } } },
{ $group: { _id: '$status', count: { $sum: 1 }, pages: { $sum: '$progress.total' }, done: { $sum: '$progress.completed' } } }
]).toArray();
4. Active Jobs by Type (all types)
const activeJobs = await db.collection('jobs').aggregate([
{ $match: { status: { $in: ['pending', 'processing'] } } },
{ $group: {
_id: { type: '$type', note: '$note' },
count: { $sum: 1 },
totalPages: { $sum: '$progress.total' },
completedPages: { $sum: '$progress.completed' },
failedPages: { $sum: '$progress.failed' },
oldest: { $min: '$created_at' }
}}
]).toArray();
5. Stuck Job Detection (CRITICAL)
const twoHoursAgo = new Date(Date.now() - 2 * 3600000);
const stuckJobs = await db.collection('jobs').countDocuments({
status: 'processing',
'progress.completed': 0,
created_at: { $lt: twoHoursAgo }
});
6. Real Throughput (page-level verification)
for (const hours of [1, 3, 6]) {
const since = new Date(Date.now() - hours * 3600000);
const count = await db.collection('pages').countDocuments({
'translation.updated_at': { $gte: since }
});
console.log(`Translations last ${hours}h: ${count} (${Math.round(count/hours)}/hr)`);
}
for (const hours of [1, 3, 6]) {
const since = new Date(Date.now() - hours * 3600000);
const count = await db.collection('pages').countDocuments({
'ocr.updated_at': { $gte: since }
});
console.log(`OCR last ${hours}h: ${count} (${Math.round(count/hours)}/hr)`);
}
7. Backfill-Specific Progress (when applicable)
For any named backfill operation (identified by note field on jobs), verify progress against actual pages:
const sampleJobs = await db.collection('jobs').find({
note: 'near-complete backfill',
status: 'processing'
}).limit(5).toArray();
for (const job of sampleJobs) {
const bookPages = await db.collection('pages').countDocuments({
book_id: job.book_id,
'translation.data': { $exists: true, $ne: '' }
});
const totalPages = await db.collection('pages').countDocuments({ book_id: job.book_id });
console.log(`${job.book_title?.substring(0,40)} — ${bookPages}/${totalPages} pages actually translated, job says ${job.progress.completed}/${job.progress.total}`);
}
8. ETA Calculation
const oneHourAgo = new Date(Date.now() - 3600000);
const hourlyRate = await db.collection('pages').countDocuments({
'translation.updated_at': { $gte: oneHourAgo }
});
const remaining = activeJobs
.filter(j => j._id.type === 'translation')
.reduce((sum, j) => sum + (j.totalPages - j.completedPages - j.failedPages), 0);
const etaHours = hourlyRate > 0 ? remaining / hourlyRate : Infinity;
9. Pause Status
const control = await db.collection('system_config').findOne({ _id: 'processing_control' });
if (control?.paused) console.log('PIPELINE PAUSED');
if (control?.paused_phases?.length) console.log('Paused phases: ' + control.paused_phases.join(', '));
10. Batch API Health (IMPORTANT)
The batch-collector writes system_config._id: 'batch_health' every 10 minutes with real Gemini-side state. Always read this.
const batchHealth = await db.collection('system_config').findOne({ _id: 'batch_health' });
if (batchHealth) {
const age = ((Date.now() - new Date(batchHealth.updated_at).getTime()) / 60000).toFixed(0);
console.log(`Batch health (${age}m old):`);
console.log(` Gemini active: ${batchHealth.geminiActive} (by key: ${batchHealth.geminiActiveByKey?.join(', ')})`);
console.log(` DB active: ${batchHealth.dbActive} | Zombies: ${batchHealth.dbZombies}`);
console.log(` Completions: ${batchHealth.recentCompletions1h}/1h, ${batchHealth.recentCompletions6h}/6h`);
console.log(` OCR pages saved: ${batchHealth.recentPagesSaved1h}/1h, ${batchHealth.recentPagesSaved6h}/6h`);
console.log(` Healthy: ${batchHealth.healthy}`);
if (batchHealth.issues?.length) console.log(` Issues: ${batchHealth.issues.join('; ')}`);
}
Key signals:
healthy: false = something needs attention. Check issues array.
geminiActive > 0 but recentCompletions1h === 0 = pipeline may be frozen
dbZombies > 0 = jobs created but never submitted (auto-cleaned if >1h old)
orphansCancelled > 0 = drift detected and auto-fixed
Output Format
Present results as a concise status report:
Pipeline Status — [timestamp]
Dashboard: https://sourcelibrary.org/analytics (Pipeline tab)
Pipeline Funnel (in lifecycle order, skip 0s):
archiving: 13,869
archive_complete: 12,014
ocr_submitted: 118
metadata_enriched: 5,914
translate_submitted: 7
translate_complete: 19
complete: 3,713
---
needs_attention: 2,624
failed: 57
other: 50 (empty_shell, paused, replaced_by_*)
Milestones:
First translations (EN): X books
>90% translated: X books
Fully translated: X books
Enrichment coverage (of Y total books):
OCR: X (Z%)
Translation: X (Z%)
Metadata: X (Z%)
FT Verification: X (Z%)
Summary & Index: X (Z%)
Chapters: X (Z%)
Collections: X (Z%)
Quality Score: X (Z%)
Image Extraction: X books | Gallery: N images
Faceted Tags: X (Z%)
Author Entities: X (Z%)
Pipeline Complete: X (Z%)
Active jobs:
Translation (backfill): X jobs, Y/Z pages done [rate/hr, ETA]
Translation (pipeline): X jobs, Y/Z pages done
OCR: X jobs
Image Extraction: X jobs
...
STUCK JOBS: X jobs processing with 0 progress for >2h ← only if > 0
Throughput (verified from pages):
Translation: X/hr (1h), Y/hr (3h), Z/hr (6h)
OCR: X/hr (1h), Y/hr (3h)
Spot check (5 random processing jobs):
"Book Title" — 45/50 pages translated (job says 43/50)
"Book Title" — 0/30 pages translated (job says 0/30) STUCK
Batch API Health (updated Xm ago):
Gemini active: X (key0: X, key1: X, key2: X) ← real Gemini state
DB active: X | Zombies: X
Completions: X/1h, X/6h | OCR pages: X/1h, X/6h
Status: HEALTHY (or: ISSUES — list problems)
Pauses: none (or list)
Common Issues
- Jobs stuck at 0 progress: SQS messages consumed but Lambda never processed them. Cancel stuck jobs, clear
book.job locks, re-submit.
- Throughput looks good but backfill isn't moving: You're seeing pipeline cron throughput, not backfill. Filter by backfill book IDs.
pages_translated stale: sync-page-counts cron runs every 6h. For real-time count, query pages directly.
- Book.job lock prevents re-submission: Clear with
$unset: { job: '' } on affected books.
- Image extraction stuck: Check if
images phase is in paused_phases. Also check Lambda CloudWatch for image-extraction-processor errors.
- Enrichment coverage low for summaries/chapters: These run via Hetzner
enrich-worker.mjs (every 5 min, flock /tmp/sl-enrich.lock). Check tail /var/log/sourcelibrary/enrich.log on Hetzner, NOT the old Vercel enrich-books cron (deprecated since ~March 2026).