Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

paper-references-scanning

Estrellas5

Forks1

Actualizado22 de junio de 2026, 19:01

Scan paper library for citation health: D8 (bib entry count), D10a (cite-to-bib match rate), orphans, zombies. Class of tasks: LaTeX reference integrity auditing.

Instalación

Instalar con Codex o Claude Copia este prompt, pégalo en Codex, Claude u otro asistente, y deja que revise la página de la skill y la instale por ti.

Ejecutar en Manus

Fuente

yakeworld

yakeworld/Synthos

Abrir repositorio de GitHub Ver repositorios del creador

Descarga

Ejecutar en Manus

Explorador de archivos

25 archivos

SKILL.md

readonly

Más de este repositorio

mismo repositorio

citation-verification

yakeworld/Synthos

引用三验 — 参考文献是否存在(L1) + 引用是否得当(L2) + 引用是否全面(L3)。三位一体验证管线，从DOI验真到语义审查到遗漏检测。

2026-06-235

automation-skills

yakeworld/Synthos

**触发条件**: 对一批论文（10-34 篇）批量处理 `step_quality_check.md` 中的 quality_score 并写入 `state.json`。

2026-06-235

notebooklm-cli

yakeworld/Synthos

子skill | NotebookLM CLI全功能指南 — Q&A知识提取、内容生成(报告/视频/音频/信息图/幻灯片)、文献检索。响应paper-pipeline的P1阶段调用。

2026-06-225

productivity

yakeworld/Synthos

生产力工具 — Airtable、Google Workspace、Linear、Notion、Jupyter等。

2026-06-225

paper-pipeline

yakeworld/Synthos

Complete paper pipeline: retrieval, extraction, quality review, analysis, and publication.

2026-06-225

skill-absorption

yakeworld/Synthos

双循环进化：内部反思(P0) + 外部吸收(P1)。Cross-project absorption methodology — multi-round cross-project comparison, active project tracking, self-expanding keyword discovery. 动灵驱动吸收(Entelechy-Driven Absorption v4.3).

2026-06-225

name	paper-references-scanning
description	"Scan paper library for citation health: D8 (bib entry count), D10a (cite-to-bib match rate), orphans, zombies. Class of tasks: LaTeX reference integrity auditing."

IO_CONTRACT

input: paper_dir: str, scan_type: str — 用户请求描述、上下文信息
output: result: dict — D8, D10a, orphans, zombies

对应原则：P2（机械原子暴露输入输出规范）

Paper References Scanning (D8/D10a)

Scope

Scanning LaTeX paper directories for reference integrity metrics.

Pitfalls

DOI regex is case-insensitive and handles multiple formats

Bib DOI fields appear as doi = {value}, DOI = {value}, or doi="value".

WRONG: r'[Dd][Oo][Ii]\s*=\s*"\\{?([^}"\s]+)\}?"?' — the "? around { is too strict; it requires the quote to match the brace style, causing zero matches.
CORRECT: r'(?:doi|DOI)\s*=\s*(?:["{]?([^}"\s]+?)["}]?)' — matches all common DOI formats regardless of quote/brace nesting.
Even simpler: r'doi\s*=\s*\{([^}]+)\}' for standard BibTeX format.

Bib files exist in multiple locations beyond the canonical search path

The canonical skill searches 06-references/references.bib → {name}.bib → references.bib → enhanced_refs/enhanced.bib. In practice, bib files are also found:

At paper root level (e.g., <paper>/references.bib)
In 09-manuscript/<paper>/references.bib subdirectories
In any .bib file directly under the paper directory
In deeply nested directories like 01-manuscript/<paper>/references.bib

When scanning, expand to: first check 06-references/references.bib, then <paper>/references.bib, then any .bib in <paper>/, then fall back to 01-manuscript/<paper>/ and 09-manuscript/<paper>/.

Papers with \bibitem{} in thebibliography but zero \cite{} calls

Some papers list references via raw \bibitem{} in the thebibliography environment but never call them with \cite{} in the body. This results in D8 counting the bibitems but D10a=0% (no citations to match). These are real papers with references — the \cite{} may be missing, or the references may be uncited. Detect this pattern: cited_count == 0 and bibitem_count > 0.

Repair: Previously labeled "Cannot auto-repair" — but a systematic mapping from bibitem keys to prose mentions works when the prose already names the referenced authors/findings. See references/zero-citation-auto-repair.md for the full 5-step methodology (proven on 182-accommodation-ciliary-muscle-ODE: 0% → 100%, 13/13).

`\\cite` regex uses single-escape in raw strings

re.compile(r'\\\\cite') — two backslashes in source, regex engine interprets as one literal backslash.

r'\\\\\\\\cite' → matches double-backslash → WRONG for standard tex files
r'\\cite' → \c is invalid regex escape → SyntaxError
Correct: r'\\\\cite'

Comment detection must distinguish `%` from `\%`

%% and % at line start = comment. But $13.9\%$ or ~\% are NOT comments. Check line.lstrip().startswith('%') only.

⚠️ Scan Script Discrepancy: v3 vs v7 Produce Different Metrics (2026-06-20)

Two scan scripts exist with different algorithms, producing non-comparable results:

Script	D8 Definition	D10a Definition	When no-cites
`paper-d8-d10a-scan/scripts/scan.py` (v3)	`1 - orphans/cites` (match rate)	`1 - zombies/bib` (consistency)	Not applicable (separate status)
`paper-references-scanning/scripts/d8d10a-scan.py` (v7)	Raw bib entry count	`matched/total_cites`	Auto=100%

Key difference: v7 with no-cites auto-sets D10a=100% (self-referenced), producing 91/93 papers at D10a=100%. v3 produces only 12/94 PASS (D8=1.0 AND D10a=1.0). These are measuring different things.

Rule: G1-G7 pipeline gates use v3 metrics (paper-d8-d10a-scan). Use v7 only for raw inventory and structural audit. When reporting scan results, always state which script was used and what its D8/D10a definitions are.

Reference directory must be normalized before scanning (2026-06-18 实战)

pima-crispdm管线中06-references/根目录包含7个旧管线遗留PDF（不属于当前Bib），pdfs子目录有44个PDF。这种目录结构错乱会导致D10a扫描结果不可靠——PDF数量与Bib条目数完全不匹配。

修复流程（在扫描前执行）：

读取当前Bib的bib keys
删除根目录中不属于当前Bib的所有文件（包括PDF、元数据文件）
从子目录（pdfs/）中提取与当前Bib匹配的PDF到根目录
对名称不匹配的Bib key创建符号链接
确认根目录保持扁平：PDF + .bib（空子目录可选）
然后再运行D8/D10a扫描

参考: paper-pipeline Skill.md Trap #35-#36（参考文献目录标准化模式）。

Bib files exist in multiple locations beyond the canonical search path

Bib keys can contain hyphens

Use [^\s,;]+? instead of \w+ for bib key capture (Norgeot2020MI-CLAIM has -).

Bib key prefix stripping

bib_key_re captures @EntryType{key, including the prefix. The script strips it by finding {, taking after it, then finding , and taking before it. Without this, @Article{Ahmad2020, won't match cite key Ahmad2020, causing false orphans and zombies.

Commented `\\\\bibitem{}` must be excluded from D8

Only count uncommented \\bibitem{} for D8.

External .bib papers: bibitems live in .bbl, NOT in .tex (2026-06-23)

Papers using \bibliography{references.bib} compile via bibtex, which generates a .bbl file containing the real \bibitem{} entries. The .tex file may contain only template placeholder bibitems like \bibitem{label} and \bibitem{lamport94} — these are NOT the actual references.

Trap: Grepping the .tex file for \bibitem{} on an external-bib paper finds only 2 template entries while the paper has 30+ real references. D10a appears to be 0% when it's actually 100%.

Fix: Always check for .bbl file first. If .bbl exists, extract bibitems from it. Only fall back to inline thebibliography in .tex when no .bbl is found. The scripts/d8d10a-scan.py handles this automatically — do NOT use ad-hoc shell grep.

Detection heuristic: If grep -c '\\bibitem' paper.tex returns ≤3 but grep -c '\\cite' paper.tex returns >10, the paper almost certainly uses external .bib → look for .bbl.

Stale .bbl from older tex revision (2026-06-22)

When a paper undergoes multiple revisions, the .bbl filename may not match the .tex filename (e.g., revision20241117.bbl but tex is revision20241118v3.tex). The scan script finds the stale .bbl (alphabetically first or same basename guess) and reports D10a < 100% even though all orphan keys exist in the .bib file.

Symptom: D10a 65-85% with 5-20 orphans, all of which ARE found in the .bib file. The .bbl filename differs from the .tex filename.

Fix: Delete the stale .bbl. Recompile. The new .bbl will have the correct basename and all entries.

Empty .bbl (0 bytes) from tex without \bibliography command (2026-06-22)

A degenerate sub-case of stale .bbl: papers using inline \begin{thebibliography} + \bibitem{} (no external .bib) can still have a .bbl file if someone ran bibtex on them. Since the .tex has no \bibliography{} command, bibtex produces an empty .bbl (0 bytes, "I found no \bibdata command" in .blg). The scan script prioritizes .bbl over tex thebibliography → finds 0 bibitems → D10a=0% despite perfectly matched thebibliography entries.

Symptom: D10a=0.0%, source=bbl, .blg says "I found no \bibdata command" + "I found no \bibstyle command", .bbl is 0 bytes. All orphan keys ARE found as \bibitem{} entries in the tex's thebibliography.

Fix: Delete the empty .bbl. The scan script falls back to extracting bibitems from the thebibliography in the tex. No recompilation needed (thebibliography papers don't use bibtex).

Detection: ls -la paper.bbl → 0 bytes. grep 'bibdata' paper.blg → "I found no \bibdata command". Cites extracted from tex all match thebibliography bibitems when checked with manual Python regex.

Real case: 093-saccade-target-shift-PINN (14 cites, 14 bibitems, 0% D10a → 100% after deleting 0-byte .bbl). endolymph-hydropressure-ode (15/15, same pattern).

Wrong .tex file selected in multi-tex directories (2026-06-22)

When a paper directory contains multiple .tex files (e.g., a LaTeX template + the real manuscript), get_main_tex() may pick the wrong one. Templates like Sage_LaTeX_Guidelines.tex or elsarticle-template-num.tex may have placeholder cites (R1, R2, R3 or lamport94) and no substantive content, causing D10a=0% (3 cites R1/R2/R3 vs 30 bibitems).

Detection: D10a=0% or nonsensical results (tiny cite count with huge bibitem count). Cite keys are template placeholders.

Fix: Prefer the .tex with the most \cite{} calls AND \begin{document}. If multiple candidates exist, check for realistic citation keys (author-year format, not R1/R2/R3). The scripts/d8d10a-scan.py uses a scoring heuristic — if it consistently picks the wrong tex, force the correct one by naming it paper.tex or moving templates to a subdirectory.

⚠️ Multi-tex version scoring bias: older heavy tex beats newer lean tex (2026-06-23)

A sub-trap of multi-tex selection: when a paper has versioned tex files (e.g., articlev1.tex and articlev2.tex), the cite-count + has-document heuristic can pick the older, template-heavier version. The older draft may have more \cite{} calls (33 vs 18) simply because it bundles more background citations or legacy content — not because it's the canonical version.

Symptom: Scan picks articlev1.tex (33 cites, older) over articlev2.tex (18 cites, newer by 2 months). D10a reports 93.8% because bbl is articlev2.bbl (24 zombie bibitems from v2's bib, but v1's cites don't all match). However articlev2.tex + articlev2.bbl = 100% D10a.

Detection: When multiple tex files have valid \begin{document}, check file modification times. If a newer file exists whose basename matches the .bbl filename, prefer it regardless of cite count.

Fix: Add a version bonus to the scoring: +500 for any tex whose basename contains a higher version number suffix (v2, v3, etc. over v1). Or simply prefer the tex with the matching bbl basename (if articlev2.bbl exists, prefer articlev2.tex over articlev1.tex). The scripts/article-todo-d10a-scan.py implements both heuristics.

Real case: ~/桌面/article_todo/SCC 3D Reconstruction — articlev1.tex (33 cites, Mar 2025) outscored articlev2.tex (18 cites, May 2025). Fix: deleted stale articlev1.bbl, scan picked articlev2.tex (100%).

⚠️ Queue-vs-scan D10a divergence: paper-queue.json claims 100% but scan finds 0% (2026-06-23)

A subtle trap: the paper-queue.json notes may claim D10a was fixed to 100% on a prior date, but a fresh scan finds D10a=0% with bib_source=none. This is NOT a scan script bug — it means the filesystem state has diverged from the metadata.

Root causes (in order of likelihood):

Bib source moved/deleted after fix: The .bib file that was present during the fix was later removed or renamed. The scan script can't find any bib source → bib_source=none → D10a=0%.
Fix applied to wrong tex: The prior repair edited a tex file that the scan script doesn't select (e.g., a versioned copy), leaving the canonical tex with its original 0% state.
state.json manually set without verification: state.json was updated to claim D10a=100% but the actual bib/tex were never fixed.
Queue notes describe an intent, not an outcome: The note "D10a 0%→100%" records what was attempted, not what was verified post-repair.

Detection: Queue notes say D10a fixed, but scan shows bib_source=none or D10a=0%. Check: does the paper directory contain ANY .bib file? Does the tex have \bibliography{...} or \begin{thebibliography}?

Fix workflow:

Search for .bib files: find <paper_dir> -name '*.bib'
Check the tex for bibliography command: grep 'bibliography\|thebibliography' <paper>.tex
If bib_source=none but queue claims fix → the fix was ephemeral (bib file lost). Treat as fresh repair.
After any fix, run the scan AGAIN to verify — don't trust queue metadata alone.

Real case: orthokeratology-corneal-remodeling-ODE-paper-117 — queue note dated 2026-06-15 says "D10a 0%→100%. qs → 60. gate FAIL→CONDITIONAL", but 2026-06-23 scan shows D10a=0%, bib_source=none. The fix evaporated — likely the bib file was deleted or the fix targeted a non-canonical tex.

Directory count ≠ Paper count (ephemeral snapshot)

The /media/yakeworld/sda2/Synthos/outputs/papers/ directory accumulates non-paper subdirectories over time (ML projects, old papers, scripts, templates, _archive_*, _docs, etc.). The get_main_tex() function silently filters these out by looking for a valid .tex file. Never assume total_directories == papers_scanned — always use the __metadata__ entry from the scan JSON:

meta = [x for x in data if x.get('__metadata__')][0]
true_paper_count = meta['papers_scanned']  # NOT len(data) - 1
non_paper_count = meta['non_paper_count']

The absolute numbers change with each scan run. For historical context, see references/scan-results-2026-06-20.md. Consumers of the JSON must skip the __metadata__ entry and use papers_scanned for the true paper count.

Steps

Discover paper dirs with 01-manuscript/paper.tex (excludes _* and lit-reviews)
Extract cite keys from uncommented lines: \cite{}, \citep{}, \cite[opt]{}
Find ONE best bib file: 06-references/references.bib → {name}.bib → references.bib → enhanced_refs/enhanced.bib
Also extracts \bibitem{} from thebibliography in tex
Computes D8, D10a, orphans, zombies, bbl status

Canonical Skill

This IS the canonical scan skill. The scan script scripts/d8d10a-scan.py handles all resolution: inline theinlinebibliography, external .bib files, single-bib-per-paper priority, comment filtering. Run it directly.

Support Files

scripts/d8d10a-scan.py — Canonical pipeline D8/D10a scan (132 papers, /media/yakeworld/sda2/Synthos/outputs/papers/).
scripts/article-todo-d10a-scan.py — 2026-06-23: Targeted D10a scan for ~/桌面/article_todo/ writing workspace. Handles multi-tex version scoring bias (v2 bonus), template filtering, stale .bbl detection. Proven on 11 papers (2 fixes: 64.7%→100%, 93.8%→100%).
references/zero-citation-auto-repair.md — 2026-06-20: Proven 5-step method to repair papers with complete thebibliography but zero \cite{} commands. Maps bibitem keys to prose mentions systematically. Tested on 182-accommodation-ciliary-muscle-ODE (13/13, 0%→100%).
references/article-todo-d10a-check.md — 2026-06-22: Targeted D10a check methodology for ~/桌面/article_todo/ workspace papers. Covers multi-tex selection, template artifact filtering (<label>, lamport94), stale .bbl detection, and wrong-tex-selection pitfalls specific to the writing workspace.
references/ocular-blood-flow-paper-116.md — Paper 116 ocular blood flow 2-ODE reference (R2=0.993, Cc=0.45 bifurcation)
references/scan-script-discrepancy-v3-vs-v7.md — CRITICAL: Two scan scripts produce incompatible D8/D10a metrics. Always state which script is used. Includes metric cross-mapping table.
references/scan-results-2026-06-21.md — Latest cross-validated scan results (v3+v7), updated 2026-06-21
references/scan-results-2026-06-20.md — Latest scan results (two-script cross-validation)
references/paper-repair-cron-workflow.md — 2026-06-23: Standard paper-repair cron workflow: two-scan approach (pipeline + article_todo), direction constraint filtering, common fix patterns, cron-specific pitfalls.

paper-references-scanning

Más de este repositorio

Más de este repositorio

IO_CONTRACT

Paper References Scanning (D8/D10a)

Scope

Pitfalls

DOI regex is case-insensitive and handles multiple formats

Bib files exist in multiple locations beyond the canonical search path

Papers with \bibitem{} in thebibliography but zero \cite{} calls

\\cite regex uses single-escape in raw strings

Comment detection must distinguish % from \%

⚠️ Scan Script Discrepancy: v3 vs v7 Produce Different Metrics (2026-06-20)

Reference directory must be normalized before scanning (2026-06-18 实战)

Bib files exist in multiple locations beyond the canonical search path

Bib keys can contain hyphens

Bib key prefix stripping

Commented \\\\bibitem{} must be excluded from D8

External .bib papers: bibitems live in .bbl, NOT in .tex (2026-06-23)

Stale .bbl from older tex revision (2026-06-22)

Empty .bbl (0 bytes) from tex without \bibliography command (2026-06-22)

Wrong .tex file selected in multi-tex directories (2026-06-22)

⚠️ Multi-tex version scoring bias: older heavy tex beats newer lean tex (2026-06-23)

⚠️ Queue-vs-scan D10a divergence: paper-queue.json claims 100% but scan finds 0% (2026-06-23)

Directory count ≠ Paper count (ephemeral snapshot)

Steps

Canonical Skill

Support Files

IO_CONTRACT

Paper References Scanning (D8/D10a)

Scope

Pitfalls

DOI regex is case-insensitive and handles multiple formats

Bib files exist in multiple locations beyond the canonical search path

Papers with \bibitem{} in thebibliography but zero \cite{} calls

\\cite regex uses single-escape in raw strings

Comment detection must distinguish % from \%

⚠️ Scan Script Discrepancy: v3 vs v7 Produce Different Metrics (2026-06-20)

Reference directory must be normalized before scanning (2026-06-18 实战)

Bib files exist in multiple locations beyond the canonical search path

Bib keys can contain hyphens

Bib key prefix stripping

Commented \\\\bibitem{} must be excluded from D8

External .bib papers: bibitems live in .bbl, NOT in .tex (2026-06-23)

Stale .bbl from older tex revision (2026-06-22)

Empty .bbl (0 bytes) from tex without \bibliography command (2026-06-22)

Wrong .tex file selected in multi-tex directories (2026-06-22)

⚠️ Multi-tex version scoring bias: older heavy tex beats newer lean tex (2026-06-23)

⚠️ Queue-vs-scan D10a divergence: paper-queue.json claims 100% but scan finds 0% (2026-06-23)

Directory count ≠ Paper count (ephemeral snapshot)

Steps

Canonical Skill

Support Files

`\\cite` regex uses single-escape in raw strings

Comment detection must distinguish `%` from `\%`

Commented `\\\\bibitem{}` must be excluded from D8

`\\cite` regex uses single-escape in raw strings

Comment detection must distinguish `%` from `\%`

Commented `\\\\bibitem{}` must be excluded from D8