
vlm-ocr-evaluation

Stars: 22

Forks: 0

Updated: June 26, 2026, 09:52

Compare OCR systems before a bulk run: candidate set, stratified ground truth, CER/WER, normalization, per-language and per-stratum accuracy.


Source

scdenney

scdenney/open-science-skills


SKILL.md



name	vlm-ocr-evaluation
description	Compare OCR systems before a bulk run: candidate set, stratified ground truth, CER/WER, normalization, per-language and per-stratum accuracy.
argument-hint	[describe your corpus, the candidate OCR systems, and the languages/scripts to evaluate]

VLM-OCR Evaluation: Comparing OCR Systems Before You Commit

Instructions

Before running any OCR model across a whole corpus, run a controlled comparison on a small, human-transcribed sample and let the measured error rates pick the model. This skill is the selection gate that precedes the vlm-ocr-pipeline skill: use this to choose a model and document why, then use vlm-ocr-pipeline to run the chosen model at scale and post-ocr-cleanup to clean its output. For the hardest pages, where no single model is reliable, the multi-model voting logic in model-council-voting can be applied to OCR transcriptions as well.

1. Run a Comparison Before You Commit

Treat model choice as an empirical question, not a default — test, do not guess. A model that tops a general vision-language leaderboard, or that read one language well, can still fail on your script, era, or page layout. The only trustworthy signal is its error rate on pages like yours.
The comparison is cheap insurance. Evaluating a 50–100 page sample once prevents discovering, after a 50,000-page bulk run, that the chosen model silently dropped every table or collapsed on degraded scans.
Scope the comparison to the decision. The user's ocr_pipeline/comparison/ setup runs nine systems on 64 pages spanning two languages and seven decades before committing to a bulk pipeline — enough to rank the candidates on the strata that matter, not a full corpus.
The output is a go/no-go gate: keep the model whose accuracy clears your threshold on the strata you care about, or conclude that no single model does and split the corpus by script or era (§6).

2. Assemble the Candidate Set

Include three kinds of system: several open-weight VLMs, one or two proprietary APIs, and a traditional OCR baseline (Tesseract). The gei comparison's MODELS registry in run_comparison.py holds six open-weight VLMs (Qwen3.5-35B, Qwen3-VL-32B, Qwen3.5-9B, Gemma, MiniCPM-V, DeepSeek-OCR) plus Tesseract, with two proprietary APIs (GPT-4.1, Claude) run by separate scripts — nine systems total.
Pick the VLM candidates from OCR benchmarks (e.g., OCRBench), not general multimodal leaderboards. General vision-language ability does not predict transcription fidelity on dense historical print; the vlm-ocr-pipeline skill covers benchmark-grounded model selection in detail.
Always keep a traditional baseline. Even when it loses, Tesseract anchors what "hard" means for your corpus and shows where a VLM actually earns its extra cost and latency.
Record an exact registry entry per system — name, HuggingFace id or dated API identifier, quantization, and loader — as the gei registry does (name/hf_id/type/loader per model). Family-name-only reporting ("we used Qwen and Gemma") is not reproducible.
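A registry of this kind can be sketched as a small typed record per system. This is an illustrative sketch only: the field names follow the name/hf_id/type/loader pattern described above, but the quantization field, the validate helper, and every identifier in REGISTRY are hypothetical placeholders, not entries from the gei registry.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str          # short label used in result tables
    hf_id: str         # HuggingFace id, or a dated API identifier for hosted models
    type: str          # e.g. "open_vlm", "api", "traditional"
    loader: str        # how the system is invoked
    quantization: str  # e.g. "BF16", "NF4", or "n/a"


# Placeholder entries: none of these identifiers refer to a real model.
REGISTRY = [
    ModelEntry("example-vlm", "example-org/example-vlm-7b", "open_vlm", "transformers", "BF16"),
    ModelEntry("example-api", "example-api-2025-01-01", "api", "http", "n/a"),
    ModelEntry("baseline-ocr", "baseline-ocr-placeholder-version", "traditional", "cli", "n/a"),
]


def validate(registry):
    """Reject duplicate names and empty fields so every system is pinned exactly."""
    names = [entry.name for entry in registry]
    if len(names) != len(set(names)):
        raise ValueError("duplicate system name in registry")
    for entry in registry:
        for field, value in asdict(entry).items():
            if not value.strip():
                raise ValueError(f"{entry.name or '<unnamed>'}: empty field {field!r}")


validate(REGISTRY)
# Serialize the registry so it can be published next to the results.
registry_json = json.dumps([asdict(entry) for entry in REGISTRY], indent=2)
print(registry_json)
```

Writing the registry out as JSON alongside the scores keeps the exact identifiers attached to the numbers they produced.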

3. Build a Stratified Ground-Truth Set

Human-transcribe a stratified sample, not a convenience sample. Stratify on the dimensions that actually drive OCR difficulty: language/script, era (a decade bracket), and content type — running body text, multi-column tables, illustrated or captioned pages, degraded or water-damaged print, and front matter / title pages. The gei manifest tags each page with language, year, decade, subject, and content_type and samples across all of them.
Size: roughly 50–100 pages is a workable default for a handful of candidate systems (a house default that balances transcription effort against per-stratum cell sizes, not a cited figure). More strata require more pages so that each cell holds enough pages to mean something.
Store ground truth as one UTF-8 .txt per page, keyed by a stable page id (ground_truth/<page_id>.txt in compute_accuracy.py). Transcribe faithfully — preserve the characters actually on the page (hanja alongside hangul, diacritics) — and decide up front how to render non-text regions (tables, figures) so the reference and the OCR output are scored on the same basis.
Transcribe before looking at any model output, so the reference is not anchored to a model's guesses. Two independent transcribers on a subset, with disagreements reconciled, guard against a single transcriber's systematic errors (the inter-coder logic in the text-classification skill applies here too).
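The sampling step above might look roughly like the following. It assumes a manifest held as a list of dicts with language, decade, and content_type tags plus a page_id; the stratified_sample function, the per_cell parameter, and the synthetic manifest are all hypothetical and stand in for whatever the real manifest provides.

```python
import random
from collections import Counter, defaultdict

STRATA_KEYS = ("language", "decade", "content_type")


def stratified_sample(manifest, keys=STRATA_KEYS, per_cell=2, seed=0):
    """Draw up to per_cell pages from every combination of the stratum keys."""
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    cells = defaultdict(list)
    for page in manifest:
        cells[tuple(page[k] for k in keys)].append(page)
    sample = []
    for cell in sorted(cells):
        # Sort by page id first so the draw does not depend on manifest order.
        pages = sorted(cells[cell], key=lambda p: p["page_id"])
        sample.extend(rng.sample(pages, min(per_cell, len(pages))))
    return sample


# Synthetic manifest for illustration: 2 languages x 2 decades x 2 content types x 5 pages.
manifest = [
    {"page_id": f"{lang}_{decade}_{ctype}_{i}", "language": lang,
     "decade": decade, "content_type": ctype}
    for lang in ("ko", "pl")
    for decade in ("1930s", "1960s")
    for ctype in ("body_text", "table")
    for i in range(5)
]

sample = stratified_sample(manifest)
cell_sizes = Counter((p["language"], p["decade"], p["content_type"]) for p in sample)
print(len(sample), "pages across", len(cell_sizes), "cells")
```

Sampling a fixed number of pages per cell, rather than proportionally, keeps rare strata such as tables or the oldest decade from vanishing from the reference set.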

4. Character and Word Error Rate

CER and WER are edit-distance metrics. CER = Levenshtein(reference, hypothesis) ÷ length(reference); WER is the same at the word level (a word-level dynamic-programming edit distance ÷ reference word count). compute_accuracy.py implements both — a character Levenshtein distance for CER and a word-level DP for WER (Levenshtein 1966).
Declare normalization before scoring; it changes the numbers. The gei normalize_text applies Unicode NFC, strips markdown artifacts (headers, bold/italic, links — VLMs routinely emit markdown), and collapses whitespace, but is deliberately case-sensitive (no lowercasing) to preserve OCR fidelity. State each choice (NFC vs NFKC, case sensitivity, punctuation and markdown handling) and apply it identically to every system. Comparing a markdown-emitting VLM against a plain-text baseline without stripping markup unfairly penalizes the VLM.
Report both CER and WER, and report the distribution, not a single mean. Give mean and median with the page count (n) per cell. The median resists the blank-page and repetition-loop outliers that wreck a mean, while the mean exposes how bad the tail gets.
Interpretation bands (house defaults, consistent with the vlm-ocr-pipeline skill): CER below ~5% is excellent, below ~10% is usable with cleanup, and above that the text needs heavy correction or a different model. These are planning guides, not cited cutoffs — set the operative threshold from what your downstream analysis tolerates.
Score every stratum, not just the overall mean. The gei evaluation aggregates CER/WER by model, by language × model, by decade bracket × model, and by content type × model — because a model can win overall and still fail on tables, on one script, or on the oldest decade.
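The metrics and normalization described above can be sketched in plain Python. This is a minimal illustration, not the code in compute_accuracy.py: the regular expressions are one plausible way to strip markdown, and the empty-reference convention is an assumption made here.

```python
import re
import unicodedata


def normalize_text(text):
    """NFC, strip common markdown, collapse whitespace; deliberately case-sensitive."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)  # markdown headers
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)        # [label](url) -> label
    text = re.sub(r"(\*\*|\*)(.+?)\1", r"\2", text)             # **bold** and *italic*
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref, hyp):
    """Levenshtein distance over any two sequences (characters or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _rate(ref_units, hyp_units):
    # Assumed convention: an empty reference scores 0.0 against an empty
    # hypothesis and 1.0 against anything else.
    if not ref_units:
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    return _rate(normalize_text(reference), normalize_text(hypothesis))


def wer(reference, hypothesis):
    return _rate(normalize_text(reference).split(), normalize_text(hypothesis).split())


print(cer("abcd", "abcf"), wer("the cat sat", "the cat sat down"))
```

Because the denominator is the reference length, both rates can exceed 1.0 when a model emits far more text than the page contains, which is one way repetition loops show up in the scores.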

5. Run the Comparison Efficiently

Load and unload models sequentially to fit one GPU. The gei runner imports torch lazily and releases each model (a gc pass) before loading the next, so several large VLMs are scored on a single card without holding them all in memory at once. State the exact quantization per model (GPTQ-Int4, NF4, BF16) — it affects both fit and accuracy.
Serve via vLLM or Ollama for batched throughput where the model supports it (run_qwen35_vllm.py runs Qwen3.5 through a vLLM OpenAI-compatible server); run proprietary APIs through their own rate-limited scripts.
Measure speed (seconds per page) alongside accuracy, but do not compare a traditional baseline's speed to a VLM's as if the two were doing the same work. Tesseract is fast because it does far less, and its accuracy can collapse on non-Latin script. Report speed as context for cost, never as a quality signal.
Make the run resumable and idempotent: write one output file per page per model and skip pages already done, so a crash partway through the comparison does not restart everything.
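A resumable, one-model-at-a-time runner along these lines could be sketched as below. The run_comparison function here is a hypothetical stand-in (not the gei run_comparison.py): it assumes each model is supplied as a zero-argument loader that returns a transcribe(page_path) callable, and the demo uses a fake loader so the sketch runs without a GPU.

```python
import gc
import tempfile
import time
from pathlib import Path


def run_comparison(models, pages, out_dir):
    """Run each model in turn, writing one text file per page and skipping finished pages."""
    out_dir = Path(out_dir)
    timings = {}  # model name -> seconds per page for pages processed in this call
    for name, load in models.items():
        model_dir = out_dir / name
        todo = [p for p in pages if not (model_dir / f"{p.stem}.txt").exists()]
        if not todo:
            continue  # nothing left to do, so never pay the load cost
        transcribe = load()
        model_dir.mkdir(parents=True, exist_ok=True)
        for page in todo:
            start = time.perf_counter()
            text = transcribe(page)
            timings.setdefault(name, []).append(time.perf_counter() - start)
            target = model_dir / f"{page.stem}.txt"
            tmp = target.with_suffix(".tmp")
            tmp.write_text(text, encoding="utf-8")
            tmp.replace(target)  # rename last, so a crash never leaves a partial .txt
        # Release the model before the next one loads. With PyTorch on a GPU one
        # would typically also call torch.cuda.empty_cache() at this point.
        del transcribe
        gc.collect()
    return timings


calls = []


def load_fake():
    """Fake loader for the demo; a real loader would build and return a model wrapper."""
    def transcribe(page):
        calls.append(page.name)
        return f"text of {page.stem}"
    return transcribe


with tempfile.TemporaryDirectory() as tmp:
    pages = [Path(tmp) / "scans" / f"page_{i:03d}.png" for i in range(3)]
    first = run_comparison({"fake_model": load_fake}, pages, Path(tmp) / "out")
    second = run_comparison({"fake_model": load_fake}, pages, Path(tmp) / "out")
    written = sorted(p.name for p in (Path(tmp) / "out" / "fake_model").iterdir())

print(len(first["fake_model"]), "pages on first run;", len(second), "models re-run on second")
```

The second call does no work because every output file already exists, which is the property that lets a crashed comparison restart where it stopped.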

6. Interpret: There Is No Single Best Model

Expect no universal winner. The lesson of the gei comparison is fit-to-script-and-corpus: on Latin-script Polish most models cluster within a couple of percent CER, while on Korean the field splits by roughly 25% even among the best. When that happens, pick per script or per era rather than forcing one model.
Accuracy does not transfer across scripts. A model that reads Polish beautifully can collapse on Korean pages that mix hanja and hangul. Never generalize a single-language result to a script the model was not measured on.
The "OCR specialist" is not guaranteed to win. In the gei run a document-only OCR model produced the most blank pages and repetition loops, and general-purpose VLMs beat it. Judge on measured CER for your pages, not on a system's category label.
Open-weight models can rival proprietary APIs and are also reproducible. The best open-weight models matched the proprietary APIs while remaining pinnable and re-runnable; the proprietary APIs are fast and capable but change underneath you between versions (the reproducibility argument the vlm-ocr-pipeline and model-council-voting skills also make). Weigh accuracy against reproducibility, not accuracy alone.
Decide and document: the chosen model(s), the threshold they cleared, the strata where they win and lose, and any split (e.g., model A for Polish, model B for Korean).
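The per-script decision can be made mechanical with a few lines. This sketch assumes per-stratum CER values are already computed; the pick_per_stratum function, the model names, and every number below are invented for illustration and are not results from the gei comparison.

```python
def pick_per_stratum(cer_by_stratum, threshold):
    """For each stratum, return (model, cer) for the lowest-CER model, or None if none clears."""
    decision = {}
    for stratum, by_model in cer_by_stratum.items():
        # Ties on CER are broken by model name so the choice is deterministic.
        model, best = min(by_model.items(), key=lambda kv: (kv[1], kv[0]))
        decision[stratum] = (model, best) if best <= threshold else None
    return decision


# Made-up median CER values (fractions), keyed stratum -> model.
MEDIAN_CER = {
    "polish": {"model_a": 0.02, "model_b": 0.03},
    "korean": {"model_a": 0.30, "model_b": 0.08},
    "korean_tables": {"model_a": 0.45, "model_b": 0.22},
}

decision = pick_per_stratum(MEDIAN_CER, threshold=0.10)
for stratum, choice in decision.items():
    print(stratum, "->", choice if choice else "no model clears the threshold")
```

A None entry is itself a finding to document: it marks a stratum where no candidate is usable and the pages need a different approach.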

7. Reproducibility and Reporting

Publish the model registry — exact ids, quantization, loaders or dated API identifiers, decoding settings, and seeds — and the ground-truth set (or a precise description of how it was built and transcribed).
Report the full results table: CER and WER, mean and median, n, broken down by model and by stratum (language, era, content type), with the normalization recipe stated alongside.
Report speed per system for cost context, carrying the traditional-baseline caveat.
Report per-model failure modes — blank pages, repetition loops, garbled or hallucinated script. These are decision-relevant and invisible in an averaged CER.
For the bulk run that follows, compose with vlm-ocr-pipeline (production OCR with the chosen model) and post-ocr-cleanup (cleaning its output); for the methods-section disclosure of the comparison, compose with methods-reporting.
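The results table described above reduces to grouping per-page scores by cell and reporting n, mean, and median together. This is a minimal sketch using only the standard library; the summarize function, the row layout, and the scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(rows, by):
    """Group per-page rows by the given keys and report n, mean, and median CER per cell."""
    cells = defaultdict(list)
    for row in rows:
        cells[tuple(row[k] for k in by)].append(row["cer"])
    return {
        cell: {"n": len(values), "mean": mean(values), "median": median(values)}
        for cell, values in sorted(cells.items())
    }


# Invented per-page scores; the 0.90 page mimics a repetition-loop outlier.
rows = [
    {"model": "model_a", "language": "ko", "cer": 0.05},
    {"model": "model_a", "language": "ko", "cer": 0.07},
    {"model": "model_a", "language": "ko", "cer": 0.90},
    {"model": "model_a", "language": "pl", "cer": 0.02},
    {"model": "model_a", "language": "pl", "cer": 0.04},
]

summary = summarize(rows, by=("model", "language"))
for cell, stats in summary.items():
    print(cell, stats)
```

In the Korean cell the single outlier pulls the mean far above the median, which is why the two are reported side by side with the page count.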

