Dual-loop evolution: internal reflection (P0) + external absorption (P1). Cross-project absorption methodology: multi-round cross-project comparison, active project tracking, self-expanding keyword discovery. Entelechy-Driven Absorption v4.3.

2026-06-225

name	bib-integrity-audit
description	Audit `.bib` reference files across a paper library for:
version	1.0.0
license	MIT
author	Synthos
metadata	{"synthos":{"signature":"task_desc: str, params: dict -> result: dict","atom_type":"skill","priority":"P2","related_skills":[]}}

IO_CONTRACT

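input: bib_file: str (user request description and context information)
output: audit_report: dict (bib integrity audit)

Corresponding principle: P2 (mechanical atoms expose their input/output specification)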

Bib Integrity Audit

Scope

Audit .bib reference files across a paper library for:

DOI completeness: percentage of entries with DOI fields
Suspicious entries: malformed @misc, Kaggle publishers, URL-as-year, arXiv preprints without IDs
Cross-file dedup: same entry key appearing in multiple bib files with inconsistent metadata
Known-DOI completion: entries with complete metadata (title, author, year, journal) but missing DOI

Step-by-Step Workflow

1. Discover all `.bib` files

find <paper-root> -name '*.bib' -type f

Include ALL .bib files, not just 06-references/references.bib
The Synthos papers use varied locations: reference4.bib, referencefinal.bib, reference3.bib, ref.bib, ref_orig.bib

2. For each bib file, compute metrics

Entry count: grep -c '^@[A-Za-z]' file (this also counts @Comment, @String, and @Preamble blocks, so subtract those)
DOI count: count entries with a non-empty doi = {...} field
DOI coverage: DOI count / entry count × 100%
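The three metrics can be computed in one pass over the file text. The following is a minimal sketch, not the shipped audit script: the function name `bib_metrics` and the sample entries are hypothetical, and it assumes empty `doi = {}` fields should not count toward coverage.

```python
import re

# Entry headers such as "@article{" at the start of a line.
ENTRY_RE = re.compile(r"^@([A-Za-z]+)\s*\{", re.MULTILINE)
# A doi field with a non-empty value, tolerant of spacing and letter case.
DOI_RE = re.compile(r"\bdoi\s*=\s*[{\"]\s*[^\s}\"]", re.IGNORECASE)
# Blocks that are not bibliography entries.
NON_ENTRY_TYPES = {"comment", "string", "preamble"}


def bib_metrics(text):
    """Return entry count, DOI count, and DOI coverage (percent) for one file."""
    types = [m.group(1).lower() for m in ENTRY_RE.finditer(text)]
    entries = sum(1 for t in types if t not in NON_ENTRY_TYPES)
    dois = len(DOI_RE.findall(text))
    coverage = round(100.0 * dois / entries, 1) if entries else 0.0
    return {"entries": entries, "dois": dois, "coverage": coverage}


# Made-up sample: one entry with a DOI, one with an empty DOI, one JabRef comment.
sample = """@article{a2020,
  title = {First},
  doi   = {10.1234/example}
}
@misc{b2021,
  title = {Second},
  doi = {}
}
@Comment{jabref-meta: databaseType:bibtex;}
"""
print(bib_metrics(sample))
```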

3. Detect suspicious entries (mark, do NOT delete)

Check each entry against these signals:

Signal	Pattern	Example
`@misc` with auto-generated key	`auto\d{4}` in key	`@misc{auto2024...}`
Kaggle publisher	`publisher = {...kaggle...}`	`publisher={Kaggle}`
URL as year field	`year = {https?://...}`	`year={http://biometrics.idealtest.org/}`
arXiv preprint without arXiv ID	`journal = {arXiv preprint` but no `arXiv:XXXX`	`journal={arXiv preprint}` without ID
Incomplete `@misc`	missing `author` or `title`	dataset citations without proper fields
No author field	no `author = {` in entry	orphan entries
No title/booktitle field	no `title = {` or `booktitle = {` in entry
Empty DOI	`doi = {}`
Duplicate key across files	same key in multiple .bib files	`swirski2013fully` in 3+ files
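The per-entry signals in the table can be sketched as a single function over an already-parsed entry. This is an illustration under assumptions: the function name and the signal labels are hypothetical, the entry is assumed to arrive as a type, a key, and a dict of fields, and the cross-file duplicate signal is left out because it needs more than one file.

```python
import re


def suspicious_signals(entry_type, key, fields):
    """Return the (hypothetical) signal labels that one parsed entry triggers."""
    f = {name.lower(): value.strip() for name, value in fields.items()}
    kind = entry_type.lower()
    signals = []
    if kind == "misc" and re.search(r"auto\d{4}", key):
        signals.append("misc-auto-key")
    if kind == "misc" and ("author" not in f or "title" not in f):
        signals.append("incomplete-misc")
    if "kaggle" in f.get("publisher", "").lower():
        signals.append("kaggle-publisher")
    if re.match(r"https?://", f.get("year", "")):
        signals.append("url-as-year")
    # An arXiv ID may sit in eprint, in the journal text, or in an arXiv DOI.
    has_arxiv_id = "eprint" in f or any(
        re.search(r"arxiv[:.]\s*\d", value, re.IGNORECASE) for value in f.values()
    )
    if "arxiv preprint" in f.get("journal", "").lower() and not has_arxiv_id:
        signals.append("arxiv-without-id")
    if "author" not in f:
        signals.append("no-author")
    if "title" not in f and "booktitle" not in f:
        signals.append("no-title")
    if f.get("doi") == "":
        signals.append("empty-doi")
    return signals


print(suspicious_signals(
    "misc", "auto2024abc",
    {"publisher": "Kaggle", "year": "http://biometrics.idealtest.org/", "doi": ""},
))
```

The function only reports labels; it never modifies or drops an entry, which matches the mark-only rule of this step.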

4. Cross-file deduplication

For each unique entry key across ALL bib files:

Check if the same key appears in multiple files
For duplicates: check consistency of title, author, year, journal, DOI
Flag entries with inconsistent metadata across files
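A key-based consistency check along these lines could look as follows. It is a sketch under assumptions: the input shape (`{path: {key: {field: value}}}`), the function name `find_key_conflicts`, and the sample titles are all hypothetical, and a field that is present in one file but missing in another is treated as a conflict.

```python
from collections import defaultdict

COMPARE_FIELDS = ("title", "author", "year", "journal", "doi")


def _norm(value):
    """Ignore braces, letter case, and spacing when comparing field values."""
    return " ".join(value.replace("{", "").replace("}", "").lower().split())


def find_key_conflicts(files):
    """Map each key found in 2+ files to its locations and conflicting fields."""
    locations = defaultdict(list)
    for path, entries in files.items():
        for key in entries:
            locations[key].append(path)
    report = {}
    for key, paths in locations.items():
        if len(paths) < 2:
            continue
        conflicts = {}
        for field in COMPARE_FIELDS:
            values = {p: _norm(files[p][key].get(field, "")) for p in paths}
            if len(set(values.values())) > 1:
                conflicts[field] = values
        report[key] = {"files": sorted(paths), "conflicts": conflicts}
    return report


# Placeholder metadata: same key in two files, differing only in year.
files = {
    "paper-a/ref.bib": {"swirski2013fully": {"title": "A {Sample} Title", "year": "2013"}},
    "paper-b/ref.bib": {
        "swirski2013fully": {"title": "a sample title", "year": "2012"},
        "other2020": {"title": "Unrelated", "year": "2020"},
    },
}
report = find_key_conflicts(files)
print(report)
```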

5. Known-DOI completion

For entries with complete metadata but missing DOI, use known DOI database:

Key	Known DOI	Source Type
`daugman2009iris`	`10.1016/b978-0-12-374457-9.00025-1`	Elsevier book chapter
`proencca2009ubiris`	`10.1109/TPAMI.2009.66`	IEEE TPAMI
`lu2022neural`	`10.1109/ISMAR55827.2022.00053`	IEEE ISMAR
`dierkes2018novel`	`10.1145/3281417.3281423`	ACM ETRA
`tsukada2011illumination`	`10.1109/ICCVW.2011.6139507`	IEEE ICCVW

For entries with complete metadata but unknown DOI, use OpenAlex API for DOI lookup:

https://api.openalex.org/works?search={title}&select=id,title,doi,publication_year,authorships
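Building the OpenAlex request and reading a DOI out of the response might be sketched as below. The helper names are hypothetical, the `select` list is one plausible choice of fields, and the response dict is hand-written for illustration rather than captured from the API. The sketch assumes OpenAlex reports `doi` as a full `https://doi.org/...` URL and only accepts a hit whose title matches after normalization.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def openalex_search_url(title, per_page=3):
    """Build the search URL; network access is left to the caller."""
    params = {
        "search": title,
        "select": "id,title,doi,publication_year",
        "per-page": per_page,
    }
    return OPENALEX_WORKS + "?" + urlencode(params, safe=",")


def _norm(text):
    """Lowercase and collapse everything that is not a letter or digit."""
    return " ".join("".join(c.lower() if c.isalnum() else " " for c in text).split())


def pick_doi(response, title):
    """Return the bare DOI of the first result whose title matches, else None."""
    for work in response.get("results", []):
        if work.get("doi") and _norm(work.get("title") or "") == _norm(title):
            return work["doi"].replace("https://doi.org/", "", 1)
    return None


# Hand-written stand-in for a parsed JSON response.
fake_response = {
    "results": [
        {"title": "Something Else", "doi": "https://doi.org/10.1234/other"},
        {"title": "A Sample Title", "doi": "https://doi.org/10.1234/example"},
    ]
}
print(openalex_search_url("Iris Recognition"))
print(pick_doi(fake_response, "a sample  title"))
```

Requiring a title match before accepting a DOI keeps a loosely ranked search hit from being written into the .bib file.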

6. Output report

Generate a markdown report with:

Summary table: paper | entries | DOIs | DOI coverage | suspicious count
Suspicious items detail: file | key | type | issue
Known DOI completions: paper | key | title | suggested DOI
DOI gaps requiring API lookup: entries with complete metadata
Cross-file duplicates: key | locations | inconsistency
Priority recommendations: P0/P1/P2 action items
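The summary table is straightforward to render once per-paper counts exist. A minimal sketch, assuming each row is a dict with hypothetical keys `paper`, `entries`, `dois`, and `suspicious`; the DOI count of 31 is illustrative only.

```python
def render_summary(rows):
    """Render the summary table as markdown, one row per paper."""
    lines = [
        "| paper | entries | DOIs | DOI coverage | suspicious count |",
        "|:------|:-------:|:----:|:------------:|:----------------:|",
    ]
    for row in rows:
        # Guard against empty files so coverage never divides by zero.
        coverage = 100.0 * row["dois"] / row["entries"] if row["entries"] else 0.0
        lines.append(
            f"| {row['paper']} | {row['entries']} | {row['dois']} "
            f"| {coverage:.0f}% | {row['suspicious']} |"
        )
    return "\n".join(lines)


table = render_summary(
    [{"paper": "pima-crispdm", "entries": 33, "dois": 31, "suspicious": 0}]
)
print(table)
```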

Pitfalls

Paper-level reference file duplication

Papers often carry multiple versions of the same reference list (e.g. ref.bib vs ref_orig.bib, referencefinal.bib vs reference4.bib). These share the same entry keys — the audit will flag them as "cross-file duplicates" but they are not inconsistent, they are redundant copies. Treat the version with fewer entries as the likely cleaned copy; the longer one is usually the raw export. Recommend consolidating to a single source of truth per paper.

`find -name '*.bib'` may miss files in non-standard paths

Synthos papers store .bib files in varied subdirectories (e.g., 投稿文件final/, latexnew/)
Use explicit find with path walk, don't assume 06-references/ structure
Unicode path trap: Directories with non-ASCII characters (Chinese, curly apostrophes ’) can make cd, glob expansion (*.bib), and ls *.bib fail, sometimes silently, depending on the shell and locale. When shell commands fail on a known directory, switch to Python os.listdir() or os.walk() for traversal. Avoid cd into unicode-named dirs entirely; always use absolute paths from Python, or a null-delimited loop in bash such as find ... -print0 | while IFS= read -r -d '' f.
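A Python traversal that sidesteps shell globbing could be sketched like this. The function name `find_bib_files` is hypothetical, and the demo builds a throwaway directory tree, so no real paper library is touched.

```python
import os
import tempfile


def find_bib_files(root):
    """Walk root and return absolute paths of every .bib file, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".bib"):
                found.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(found)


# Demo on a temporary tree with a non-ASCII directory name.
with tempfile.TemporaryDirectory() as root:
    sub = os.path.join(root, "投稿文件final", "latexnew")
    os.makedirs(sub)
    for name in ("ref.bib", "ref_orig.BIB", "notes.txt"):
        open(os.path.join(sub, name), "w", encoding="utf-8").close()
    names = [os.path.basename(p) for p in find_bib_files(root)]
print(names)
```

Matching the extension case-insensitively is a design choice of this sketch; drop the `.lower()` call if only lowercase `.bib` files should be audited.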

arXiv preprint format varies

Some use journal = {arXiv preprint arXiv:XXXX} (has ID, no DOI)
Some use eprint = {XXXX} + archivePrefix = {arXiv} (proper BibTeX format)
Some use doi = {10.48550/arXiv.XXXXX} (has DOI already)
The "arXiv preprint without ID" check should only flag entries where journal says "arXiv preprint" but NO arXiv:ID appears anywhere in the entry body
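The three formats above can be folded into one ID extractor, so the "without ID" check reduces to testing for `None`. This is a sketch with a hypothetical function name; it assumes new-style arXiv IDs only (`YYMM.NNNNN`, optionally with a version suffix) and would miss old-style IDs that appear outside the `eprint` field.

```python
import re

NEW_STYLE_ID = r"(\d{4}\.\d{4,5}(?:v\d+)?)"


def extract_arxiv_id(fields):
    """Return the arXiv ID found in an entry's fields, or None."""
    f = {name.lower(): value.strip() for name, value in fields.items()}
    # eprint + archivePrefix = {arXiv}
    if f.get("archiveprefix", "").lower() == "arxiv" and f.get("eprint"):
        return f["eprint"]
    # doi = {10.48550/arXiv.<id>}
    m = re.search(r"10\.48550/arxiv\." + NEW_STYLE_ID, f.get("doi", ""), re.IGNORECASE)
    if m:
        return m.group(1)
    # "arXiv:<id>" anywhere in the entry body
    for value in f.values():
        m = re.search(r"arxiv:\s*" + NEW_STYLE_ID, value, re.IGNORECASE)
        if m:
            return m.group(1)
    return None


print(extract_arxiv_id({"journal": "arXiv preprint arXiv:2408.17231"}))
print(extract_arxiv_id({"journal": "arXiv preprint"}))
```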

Dataset citations are often malformed

CASIA, MMU, Kaggle dataset references frequently use @misc with URL-as-year or missing author/title
These are COMMON and should be flagged but not auto-removed
Recommended fix: use @dataset entry type or properly formatted @misc with URL and accessed date

`@Comment{jabref-meta:...}` entries

JabRef adds a trailing comment entry — must be skipped during analysis

Entry type classification for DOI sourcing

Entry type	DOI expected	Verification source
Journal articles (IEEE, Elsevier, Springer)	Yes — always	Crossref API (high success rate)
Conference proceedings (ACM ETRA, IEEE, Springer LNCS)	Yes — always	Semantic Scholar → Crossref fallback
arXiv preprints	No publisher DOI until published (an arXiv-issued 10.48550/arXiv.XXXX DOI may exist)	Keep arXiv ID in `eprint` or `journal` field
Datasets (CASIA, UBIRIS, OpenEDS, Kaggle)	No DOI	Keep as `@misc` with URL, no DOI expected
Book chapters	Yes	Crossref API (may require full book ISBN for lookup)
Technical reports	Sometimes	Crossref, but not always indexed

API-based DOI verification workflow

For entries with complete metadata but missing DOI, use a tiered approach:

Semantic Scholar first (most reliable for modern papers):
```
GET https://api.semanticscholar.org/graph/v1/paper/search
?query={title}&limit=3&fields=title,authors,year,externalIds
Header: User-Agent: synthos-audit/1.0 (yakeworld@wmu.edu.cn)
```
- Returns externalIds.DOI (when one exists) and the paper's Semantic Scholar paperId
- The paperId (S2 ID) can be used as a fallback identifier when the DOI is absent
- Best for: arXiv papers, recent conference papers, preprints
Crossref API (for journal articles and book chapters):
```
GET https://api.crossref.org/works
?query.title={title}&rows=3&mailto=yakeworld@wmu.edu.cn
```
- High success rate for IEEE, Elsevier, Springer journal articles
- DOI format: doi = {10.xxxx/yyyy}; clean \_ to _ before lookup
- Rate limit: pass mailto as a query parameter (as above) or in the User-Agent header to use the polite pool, 5 req/sec max
PubMed (for medical/clinical papers):
```
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=pubmed&term={title}&retmax=5
```
- Use when Semantic Scholar returns no results for medical papers
- Then efetch for full details including DOI
Fallback: If no API source returns a match, the entry may be:
- Too old (pre-DOI era, before 2000)
- Very niche conference without online indexing
- Non-English venue not in major databases
- In this case, flag as "DOI not found — manual verification required"
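The tiered order can be expressed as a loop over request URLs with the first hit winning. The sketch below makes several assumptions: both function names are hypothetical, the HTTP call and response parsing are delegated to a caller-supplied `fetch(source, url)` callable (for PubMed that callable would also have to run the efetch step), and `user@example.org` is a placeholder address.

```python
from urllib.parse import urlencode


def lookup_requests(title, mailto):
    """Return (source, url) pairs in the order they should be tried."""
    return [
        ("semantic_scholar",
         "https://api.semanticscholar.org/graph/v1/paper/search?" + urlencode(
             {"query": title, "limit": 3,
              "fields": "title,authors,year,externalIds"}, safe=",")),
        ("crossref",
         "https://api.crossref.org/works?" + urlencode(
             {"query.title": title, "rows": 3, "mailto": mailto}, safe="@")),
        ("pubmed",
         "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(
             {"db": "pubmed", "term": title, "retmax": 5})),
    ]


def tiered_lookup(title, mailto, fetch):
    """Try each source in order; fetch(source, url) returns a DOI or None."""
    for source, url in lookup_requests(title, mailto):
        doi = fetch(source, url)
        if doi:
            return {"doi": doi, "source": source}
    return {"doi": None, "source": None, "status": "manual-verification-required"}


# Stand-in fetcher: pretends that only Crossref knows this title.
def fake_fetch(source, url):
    return "10.1234/example" if source == "crossref" else None


urls = dict(lookup_requests("Iris Recognition", "user@example.org"))
print(tiered_lookup("Iris Recognition", "user@example.org", fake_fetch))
```

Keeping the network call behind a callable makes the ordering testable offline and leaves room for rate limiting in one place.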

Known DOI pitfalls

arXiv DOIs (e.g., 10.48550/arXiv.2408.17231) are VALID DOIs, but they are registered with DataCite rather than Crossref, so the Crossref API returns 404 for them. Keep them as-is in the .bib file.
Springer LNCS DOIs (e.g., 10.1007/978-3-031-37660-3_8) may return 404 if the book/chapter is not yet indexed or if there's a digit error. Verify by searching Semantic Scholar or Google Scholar first.
ACM ETRA DOIs follow pattern 10.1145/XXXXX.XXXXXX — always start with 10.1145
IEEE DOIs follow pattern 10.1109/XXXXX.YYYYYYY or 10.1109/ACCESS.XXXXXXX
Elsevier DOIs follow pattern 10.1016/j.xxxx.yyyy.zz
Crossref 404 is not always wrong — may be due to: new publication not yet indexed, wrong last digit in DOI, publisher moved/rebranded, or the paper truly doesn't have a DOI in Crossref
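The prefix patterns above lend themselves to a small classifier that decides where a DOI should be verified. A sketch with hypothetical function names; the prefix table covers only the publishers named in this section, and the Springer-style DOI in the demo is made up.

```python
import re

# Registrant prefixes mentioned in this section.
PREFIX_TO_REGISTRANT = {
    "10.1145": "ACM",
    "10.1109": "IEEE",
    "10.1016": "Elsevier",
    "10.1007": "Springer",
    "10.48550": "arXiv",
}


def clean_doi(raw):
    """Strip resolver prefixes and undo LaTeX-escaped underscores."""
    doi = raw.strip().replace("\\_", "_")
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi, flags=re.IGNORECASE)


def classify_doi(raw):
    """Return the registrant name, 'other', or 'invalid'."""
    m = re.match(r"(10\.\d{4,9})/\S+$", clean_doi(raw))
    if not m:
        return "invalid"
    return PREFIX_TO_REGISTRANT.get(m.group(1), "other")


print(classify_doi("10.48550/arXiv.2408.17231"))
print(clean_doi("10.1007/example\\_8"))
```

An entry classified as `arXiv` can then skip the Crossref check instead of being reported as a 404.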

DOI field whitespace varies

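Entries use variable whitespace: doi    = {value} (multiple spaces) vs doi = {value} (single space). A naive doi = { pattern misses entries with two or more spaces, or with none. Match with doi\s*=\s*\{ instead. grep -c 'doi' works for a rough count only, because it also counts lines such as url = {https://doi.org/...}.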
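The difference between the three counting approaches shows up on a small sample. This sketch uses made-up entries; the pattern names are hypothetical.

```python
import re

NAIVE = re.compile(r"doi = \{")
TOLERANT = re.compile(r"^\s*doi\s*=\s*[{\"]", re.IGNORECASE | re.MULTILINE)

sample = """@article{a,
  doi     = {10.1234/aaa},
  url = {https://doi.org/10.1234/aaa}
}
@article{b,
  DOI={10.1234/bbb}
}
@article{c,
  doi = {10.1234/ccc}
}
"""

naive_count = len(NAIVE.findall(sample))        # only the single-space form
tolerant_count = len(TOLERANT.findall(sample))  # every doi field
# Substring counting also picks up the doi.org URL line.
substring_count = sum("doi" in line.lower() for line in sample.splitlines())
print(naive_count, tolerant_count, substring_count)
```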

Cross-file references may use different keys for the same paper

e.g., lu2022neural vs Lu2022Neural3G for the same ISMAR paper
Check by title match, not just key match
Unicode title trap: When parsing entries from bib files with unicode characters (Chinese directories, LaTeX escapes like {\c{c}}), entry titles may parse as N/A or come out garbled in simple grep-based parsers. Use Python for robust multi-file dedup comparison, because shell awk/grep can silently drop or corrupt unicode content in titles.
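Title-based matching needs a normalization step that survives LaTeX escapes and accented characters. A sketch under assumptions: the function names are hypothetical, the titles are placeholders, and exact equality of normalized titles is used rather than fuzzy matching.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Reduce a title to lowercase words, ignoring LaTeX markup and accents."""
    t = re.sub(r"\\[A-Za-z]+|\\.", "", title)       # drop commands such as \c or \'
    t = t.replace("{", "").replace("}", "")          # drop grouping braces
    t = unicodedata.normalize("NFKD", t)             # split accents from letters
    t = "".join(c for c in t if not unicodedata.combining(c))
    t = "".join(c.lower() if c.isalnum() else " " for c in t)
    return " ".join(t.split())


def group_by_title(entries):
    """entries: (file, key, title) tuples -> groups sharing a title under different keys."""
    groups = defaultdict(list)
    for path, key, title in entries:
        norm = normalize_title(title)
        if norm:
            groups[norm].append((path, key))
    return [g for g in groups.values() if len({key for _, key in g}) > 1]


entries = [
    ("a.bib", "lu2022neural", "Sample {T}itle for {ISMAR}"),
    ("b.bib", "Lu2022Neural3G", "sample title for ISMAR"),
    ("b.bib", "other2020", "Something Else"),
]
print(group_by_title(entries))
```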

BibTeX entry-block regex trap: non-greedy `(.*?)` truncates the entry body

The broken pattern: r'^@([A-Za-z]+)\s*\{\s*([^,]+?),\s*(.*?)(?=\n@[A-Za-z]+\s*\{|$)'
- The \s* after the comma consumes the newline, so (.*?) starts at the first field.
- .*? is non-greedy and . does not match a newline by default, while $ under re.MULTILINE (needed for ^@) matches at the end of every line. The lookahead is therefore satisfied at the end of the first field's line, and the body is truncated after the first field.
- Symptom: DOI coverage reads as 0%, many false "no title" errors. All entries appear malformed.
The fix: r'^@([A-Za-z]+)\s*\{\s*([^,]+?),\s*([\s\S]*?)(?=\n\s*\}\s*\n|\n\s*\}\s*$)'
- [\s\S]*? matches ANY character including newlines.
- The lookahead matches a } on its own line (the usual BibTeX entry terminator) instead of the next @.
- Correctly captures the full body (multi-line author fields, DOIs, bare-value fields), provided the closing } sits on its own line.
Field parsing caveat: The field regex r'(\w+)\s*=\s*(?:\{([^}]*)\}|"[^"]*")' stops at the first }, so values with nested braces such as title={A {Nested} Title} are truncated, and bare values such as year = 2020 are not matched at all. Add a bare-value third alternative, or parse values by counting braces.
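An alternative to tuning the regex is to count braces, which handles nested braces, bare values, and a closing brace on any line. The following is a minimal sketch, not the parser used by the audit scripts: all names are hypothetical, the sample entry is made up, and backslash-escaped braces inside values are assumed not to occur. `@Comment`, `@String`, and `@Preamble` blocks are skipped, which also covers the JabRef trailer.

```python
import re

ENTRY_START = re.compile(r"@([A-Za-z]+)\s*\{")
FIELD_NAME = re.compile(r"\s*,?\s*([A-Za-z][\w-]*)\s*=\s*")
BARE_VALUE = re.compile(r"[^,\s]+")
SKIP_TYPES = {"comment", "string", "preamble"}


def _read_braced(text, start):
    """text[start] is '{'; return (inner text, index after the matching '}')."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
    return text[start + 1:], len(text)  # unbalanced: take the rest


def _parse_fields(body):
    fields, i, n = {}, 0, len(body)
    while i < n:
        m = FIELD_NAME.match(body, i)
        if not m:
            break
        name, i = m.group(1).lower(), m.end()
        if i < n and body[i] == "{":
            value, i = _read_braced(body, i)
        elif i < n and body[i] == '"':
            end = body.find('"', i + 1)
            end = n if end == -1 else end
            value, i = body[i + 1:end], end + 1
        else:  # bare value such as year = 2020
            bare = BARE_VALUE.match(body, i)
            value, i = (bare.group(0), bare.end()) if bare else ("", i)
        fields[name] = " ".join(value.split())
    return fields


def parse_bib(text):
    """Return a list of {'type', 'key', 'fields'} dicts for real entries."""
    entries, pos = [], 0
    while True:
        m = ENTRY_START.search(text, pos)
        if not m:
            return entries
        inner, pos = _read_braced(text, m.end() - 1)
        if m.group(1).lower() in SKIP_TYPES:
            continue
        key, _, body = inner.partition(",")
        entries.append({"type": m.group(1).lower(), "key": key.strip(),
                        "fields": _parse_fields(body)})


sample = """@Article{sample2020,
  title  = {A {Nested} Title},
  author = {First Author and
            Second Author},
  year   = 2020,
  doi    = {10.1234/example}}
@Comment{jabref-meta: databaseType:bibtex;}
"""
parsed = parse_bib(sample)
print(parsed)
```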

Output Format

🧹 Bib Normalization Report (YYYY-MM-DD)

| Paper | Entries | DOI coverage | Suspicious entries | DOIs added |
|:-----|:------:|:--------:|:--------:|:-------:|
| pima-crispdm | 33 | 94% | 0 | 0 |

Suspicious entry details:
- paper-xyz: Key2024 (journal: arXiv preprint without ID)

Added DOI details:
- paper-abc: Key2020 → 10.XXXX/...

Anti-patterns (what NOT to do)

DO NOT auto-delete suspicious entries — always report and let the user confirm
DO NOT assume all papers use 06-references/references.bib — audit finds what exists
DO NOT skip entries just because they're conference proceedings or tech reports — these still need DOI tracking
DO NOT treat arXiv preprints as having DOIs unless 10.48550/arXiv DOI is explicitly present

Support Files

references/bib-suspicious-patterns.md — Detailed catalog of suspicious entry patterns with examples
references/session-report-2026-06-06.md — Session report for cross-file dedup workflow
references/session-report-2026-06-12.md — Session report for DOI completion workflow
references/synthos-known-dois.md — Pre-verified DOI mappings for known Synthos paper references (updated with new DOIs and entry type classification)
references/api-lookup-workflow.md — Complete API workflow for DOI lookup: Semantic Scholar, Crossref, PubMed endpoints, parameters, error handling, DOI patterns
references/doi-patterns.md — DOI pattern reference guide: publisher patterns, common issues (escaped underscores, arXiv in Crossref, Springer digit errors), classification logic
scripts/bib-audit-v2.py — Automated audit script: scans .bib files, computes DOI coverage, detects suspicious entries, cross-file deduplicates, OpenAlex DOI lookup, markdown report output
scripts/bib-audit.py — Original audit script (legacy)
scripts/bib-verify.py — DOI verification script: verifies existing DOIs via Crossref, classifies entries by type (journal/conference/dataset/preprint), generates verification report

When to Use

Before paper submission (LaTeX compilation ready check)
After adding new references to a paper
Periodic integrity audit of a growing paper library
When merging references from multiple papers (collaborative work)
Prior to running quality-gate on a paper's references section
