vlm-ocr-evaluation
Compare OCR systems before a bulk run: candidate set, stratified ground truth, CER/WER, normalization, per-language and per-stratum accuracy.
| name | vlm-ocr-evaluation |
| description | Compare OCR systems before a bulk run: candidate set, stratified ground truth, CER/WER, normalization, per-language and per-stratum accuracy. |
| argument-hint | [describe your corpus, the candidate OCR systems, and the languages/scripts to evaluate] |
Before running any OCR model across a whole corpus, run a controlled comparison on a small, human-transcribed sample and let the measured error rates pick the model. This skill is the selection gate that precedes the vlm-ocr-pipeline skill: use this to choose a model and document why, then use vlm-ocr-pipeline to run the chosen model at scale and post-ocr-cleanup to clean its output. For the hardest pages, where no single model is reliable, the multi-model voting logic in model-council-voting can be applied to OCR transcriptions as well.
The ocr_pipeline/comparison/ setup runs nine systems on 64 pages spanning two languages and seven decades before committing to a bulk pipeline: enough to rank the candidates on the strata that matter, not a full corpus.

The MODELS registry in run_comparison.py holds six open-weight VLMs (Qwen3.5-35B, Qwen3-VL-32B, Qwen3.5-9B, Gemma, MiniCPM-V, DeepSeek-OCR) plus Tesseract, with two proprietary APIs (GPT-4.1, Claude) run by separate scripts, for nine systems total.

The vlm-ocr-pipeline skill covers benchmark-grounded model selection in detail.

[...] name/hf_id/type/loader per model). Family-name-only reporting ("we used Qwen and Gemma") is not reproducible.

[...] language, year, decade, subject, and content_type and samples across all of them.

[...] .txt per page, keyed by a stable page id (ground_truth/<page_id>.txt in compute_accuracy.py). Transcribe faithfully: preserve the characters actually on the page (hanja alongside hangul, diacritics), and decide up front how to render non-text regions (tables, figures) so the reference and the OCR output are scored on the same basis.

[...] text-classification skill applies here too).

compute_accuracy.py implements both: a character Levenshtein distance for CER and a word-level DP for WER (Levenshtein 1966).

normalize_text applies Unicode NFC, strips markdown artifacts (headers, bold/italic, links; VLMs routinely emit markdown), and collapses whitespace, but is deliberately case-sensitive (no lowercasing) to preserve OCR fidelity. State each choice (NFC vs NFKC, case sensitivity, punctuation and markdown handling) and apply it identically to every system. Comparing a markdown-emitting VLM against a plain-text baseline without stripping markup unfairly penalizes the VLM.

[...] n) per cell. The median resists the blank-page and repetition-loop outliers that wreck a mean, while the mean exposes how bad the tail gets.

[...] vlm-ocr-pipeline skill): CER below ~5% is excellent, below ~10% is usable with cleanup, and above that the text needs heavy correction or a different model.
These are planning guides, not cited cutoffs; set the operative threshold from what your downstream analysis tolerates.

[...] gc pass) before loading the next, so several large VLMs are scored on a single card without holding them all in memory at once. State the exact quantization per model (GPTQ-Int4, NF4, BF16), since it affects both fit and accuracy.

[...] run_qwen35_vllm.py runs Qwen3.5 through a vLLM OpenAI-compatible server); run proprietary APIs through their own rate-limited scripts.

[...] vlm-ocr-pipeline and model-council-voting skills also make). Weigh accuracy against reproducibility, not accuracy alone.

[...] n, broken down by model and by stratum (language, era, content type), with the normalization recipe stated alongside.

[...] vlm-ocr-pipeline (production OCR with the chosen model) and post-ocr-cleanup (cleaning its output); for the methods-section disclosure of the comparison, compose with methods-reporting.

[...] .txt per page id

[...] n

[...] vlm-ocr-pipeline, post-ocr-cleanup, methods-reporting)
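The stratified sampling described above (drawing pages across language, decade, and the other metadata fields rather than at random from the whole corpus) might be sketched as follows. This is a minimal illustration, not the skill's own sampler: the function name `stratified_sample`, the `per_stratum` parameter, the `page_id` field, and the toy page list are all hypothetical; only the `language` and `decade` field names come from the text.

```python
import random
from collections import defaultdict


def stratified_sample(pages, keys, per_stratum, seed=0):
    """Pick up to `per_stratum` pages from every combination of `keys`.

    Hypothetical helper: `pages` is a list of dicts carrying a stable
    `page_id` plus the metadata fields named in `keys`.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    strata = defaultdict(list)
    for page in pages:
        strata[tuple(page[k] for k in keys)].append(page)

    sample = []
    for stratum in sorted(strata):
        # Sort first so the shuffle does not depend on input order.
        members = sorted(strata[stratum], key=lambda p: p["page_id"])
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample


# Toy corpus, invented for illustration: 2 languages x 2 decades x 5 pages.
pages = [
    {"page_id": f"{lang}-{decade}-{i}", "language": lang, "decade": decade}
    for lang in ("lang1", "lang2")
    for decade in (1950, 1960)
    for i in range(5)
]
picked = stratified_sample(pages, keys=("language", "decade"), per_stratum=2)
```

Sampling a fixed number per stratum, rather than proportionally, keeps small strata from ending up with too few pages to rank the candidates on; whether that trade-off suits a given corpus is a judgment call.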
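As a rough sketch of the scoring described above (NFC normalization, markdown stripping, whitespace collapsing, then character-level and word-level Levenshtein distance), something like the following could serve as a starting point. It is an assumption-laden illustration and not the contents of compute_accuracy.py: the regular expressions, the `edit_distance`, `cer`, and `wer` names, and the handling of an empty reference are all choices made here for the example.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Illustrative normalization: NFC, strip common markdown, collapse whitespace.

    Case is left untouched, matching the case-sensitive choice in the text.
    """
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)  # headers
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)        # links -> label
    # Asterisk emphasis only; underscores are skipped so snake_case survives.
    text = re.sub(r"(\*\*|\*)(.+?)\1", r"\2", text)
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance over two sequences (a str or a list of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; an empty reference is divided by 1 by convention."""
    ref, hyp = normalize_text(reference), normalize_text(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = normalize_text(reference).split()
    hyp = normalize_text(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Because both metrics call the same `normalize_text`, a markdown-emitting system and a plain-text baseline are scored on the same basis. Whitespace tokenization for WER is a simplification that may not suit every script.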
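The per-cell reporting described above (median, mean, and n for each model and stratum) can be illustrated with a small aggregation. This is only a sketch under stated assumptions: the `summarize` function, the row layout, the model and stratum labels, and every CER value below are invented for the example and are not measured results.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(rows):
    """Group per-page CER by (model, stratum) and report median, mean, and n."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["model"], row["stratum"])].append(row["cer"])
    return {
        key: {"median": median(vals), "mean": mean(vals), "n": len(vals)}
        for key, vals in sorted(cells.items())
    }


# Hypothetical per-page scores, invented for illustration only.
rows = [
    {"model": "model-a", "stratum": "lang1-1950s", "cer": 0.04},
    {"model": "model-a", "stratum": "lang1-1950s", "cer": 0.06},
    {"model": "model-a", "stratum": "lang1-1950s", "cer": 1.00},  # one runaway page
    {"model": "model-b", "stratum": "lang1-1950s", "cer": 0.08},
    {"model": "model-b", "stratum": "lang1-1950s", "cer": 0.10},
    {"model": "model-b", "stratum": "lang1-1950s", "cer": 0.12},
]
summary = summarize(rows)
```

In this toy data model-a has the lower median (0.06 against 0.10) but the higher mean (about 0.37 against 0.10), because a single runaway page dominates its average. Reporting both numbers with n makes that tail visible instead of hiding it.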