| name | paper-wiki |
| description | Academic literature management and survey system. Supports journal organization, tag management, literature survey report generation, and submission recommendation. Trigger: user mentions "literature", "paper", "survey", "journal", "submission", "paper-wiki", or requests scan/report/recommendation on an initialized vault. |
paper-wiki — Academic Literature Survey Skill
Manage a local Markdown literature vault. Scan, organize, tag, survey, and recommend — all from one skill.
What This Skill Does
paper-wiki turns a folder of paper Markdown files into a structured, indexed literature vault with:
- Journal organization: auto-sort papers into paper/{direction}/{journal_abbr}/
- Tag management: multi-dimensional tagging (task, method, dataset, domain, signal, etc.)
- Survey reports: journal reports, direction reports, method/dataset stats, idea novelty surveys
- Submission guidance: journal scoring, revision suggestions based on local literature evidence
Quick Start
- "Initialize vault" → init
- "Scan Battery papers" → scan-organize
- "Generate RESS journal report" → journal-report
- "Recommend submission target for my paper" → submission-recommend
Configuration
All config is in config.json at project root. Key fields:
{
"output_lang": "zh",
"templates": {
"regeneration_threshold": 0.2,
"registry": {}
},
"web_search": {
"default_top": 10,
"min_citations": 5,
"openalex_email": "",
"openalex_api_key": "",
"semantic_scholar_api_key": "",
"clipper_inbox": "workspace/web-inbox",
"output_root": "paper/web_search",
"arxiv_fulltext_default": true,
"arxiv_output_root": "paper/web_search",
"arxiv_fulltext_priority": ["html", "tex", "pdf", "api"],
"domain_profiles": {
"Battery": {"strict": true, "required_groups": [], "negative_keywords": [], "preferred_venues": []},
"TimeSeries": {"strict": false, "keywords": []}
},
"sources": ["openalex", "semanticscholar", "arxiv"]
}
}
output_lang: "zh" (Chinese, default) or "en" (English) — controls all user-facing output
templates.registry: tracks domain-specific template generation status
templates.regeneration_threshold: ratio of new papers that triggers template refresh
web_search.default_top: default number of online results to save
web_search.min_citations: OpenAlex citation threshold for default web-find
web_search.openalex_api_key: optional OpenAlex API key for normal quota-based access
web_search.openalex_email: optional OpenAlex contact email / mailto metadata
web_search.semantic_scholar_api_key: optional Semantic Scholar key
web_search.clipper_inbox: Obsidian Web Clipper Markdown import folder
web_search.output_root: research material layer for web search metadata and incomplete arXiv results
web_search.arxiv_fulltext_default: default to arXiv full-text capture for arXiv results
web_search.arxiv_output_root: root folder for web-search-only arXiv Markdown
web_search.arxiv_fulltext_priority: arXiv capture priority, default html > tex > pdf > api
web_search.domain_profiles: direction-level keywords, required keyword groups, negative keywords, and preferred venues used by web-find and web-digest domain filtering
Output Language Rules
All internal content (this SKILL.md, schemas, scripts, template structure) is in English.
All user-facing output follows output_lang from config.json:
zh → Chinese reports, console messages, status summaries
en → English equivalents with identical structure
File paths, directory names, variable names, and frontmatter keys remain in English regardless of output_lang.
Report Citation Policy
All report-generation workflows must maintain a citation registry while writing the report.
- Add a numeric citation marker at every place where a paper is used as evidence in prose, tables, comparisons, trend claims, journal recommendations, or revision suggestions.
- Use [1], [2], ... numbered by first appearance in the report. Reuse the same number for the same paper throughout the report.
- The final References / Reference List section must include every paper cited in the report body and only papers cited in the report body.
- Do not list papers that were retrieved or matched but not used as evidence. If a paper is included in the analysis, either cite it where it supports a claim or omit it from the final references.
- Reference entries use: [N] Title. Journal, Year. DOI: xxx. URL: xxx. Source: local path or [arxiv-web].
- If DOI is missing, omit the DOI field. If URL is missing, use source_path. If journal is missing, use journal_abbr or arXiv. If year is missing, use published_date; if unavailable, use n.d.
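The reference-entry format and its fallbacks can be sketched as below. This is a minimal illustration, not the skill's actual code: the function name `format_reference` and the dictionary keys are hypothetical, and using `[arxiv-web]` whenever no local path exists is an assumption.

```python
def format_reference(n, paper):
    """Render one reference entry, applying the fallbacks from the citation policy."""
    # Journal fallback: journal -> journal_abbr -> "arXiv".
    journal = paper.get("journal") or paper.get("journal_abbr") or "arXiv"
    # Year fallback: year -> published_date -> "n.d.".
    year = paper.get("year") or paper.get("published_date") or "n.d."
    head = f"[{n}] {paper['title']}. {journal}, {year}"
    if not head.endswith("."):
        head += "."
    parts = [head]
    # A missing DOI drops the DOI field entirely.
    if paper.get("doi"):
        parts.append(f"DOI: {paper['doi']}.")
    # A missing URL falls back to source_path.
    url = paper.get("url") or paper.get("source_path")
    if url:
        parts.append(f"URL: {url}.")
    # Assumption: "[arxiv-web]" marks entries that have no local source file.
    parts.append(f"Source: {paper.get('source_path') or '[arxiv-web]'}.")
    return " ".join(parts)
```

Keeping the fallbacks in one function makes it easier to apply them identically across all report workflows.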
Before finalizing a report, verify that citation numbers are continuous, every in-text citation has one reference entry, and every reference entry is cited at least once.
Direction-review reference-count defaults:
- Standard direction-review: target 40-80 references unless the user explicitly asks for a shorter note or the corpus is too sparse to support that range.
- Deep direction-review: target 80-120 references when the user explicitly asks for a deep review, comprehensive review, full survey, or journal-style long review.
- Do not inflate the reference list artificially. Every reference must still be cited in the body and used as evidence.
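One way to keep the citation registry described in this policy is sketched below. The class and function names are hypothetical, and the check assumes that in-text markers are written one number per bracket (for example `[3]`, not `[3, 4]`).

```python
import re


class CitationRegistry:
    """Assigns citation numbers by first appearance and reuses them afterwards."""

    def __init__(self):
        self._numbers = {}  # paper_id -> citation number

    def cite(self, paper_id):
        # A paper gets the next free number the first time it is cited.
        if paper_id not in self._numbers:
            self._numbers[paper_id] = len(self._numbers) + 1
        return f"[{self._numbers[paper_id]}]"

    def ordered_ids(self):
        # Paper ids in reference-list order.
        return sorted(self._numbers, key=self._numbers.get)


def check_citations(body, reference_numbers):
    """Check continuity and the two-way match between body markers and references."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", body)}
    refs = set(reference_numbers)
    return {
        "continuous": cited == set(range(1, len(cited) + 1)),
        "uncited_references": sorted(refs - cited),
        "missing_references": sorted(cited - refs),
    }
```

A report passes this check when `continuous` is true and both lists are empty.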
Full-Coverage Report Policy
For journal-report and direction-report, the evidence pipeline is required before the report is considered complete. Prompt checkpoints guide the Agent, but report_family.py --complete is the authoritative completion gate.
Bundle preparation initializes all required evidence files under evidence_dir. The workflow then fills them in stages; verification.json starts with checks set to "pending" and must only be changed to "passed" after count equality is verified.
Stage 1: Evidence Pipeline (REQUIRED BEFORE ANY REPORT WRITING)
-
Create screening.jsonl under evidence_dir with one entry per record:
{"ref_id": "R001", "decision": "confirmed_included", "reason": "..."}
{"ref_id": "R032", "decision": "uncertain_needs_review", "reason": "DOI pattern anomaly"}
{"ref_id": "R050", "decision": "metadata_only_duplicate", "reason": "same DOI as R047"}
-
Create paper_notes.jsonl with one brief evidence note per confirmed paper.
-
Create coverage_ledger.json using the validator-compatible top-level schema:
{
"candidate_count": 51,
"confirmed_included": ["R001", "R002", "..."],
"metadata_only": ["R050"],
"excluded_wrong_scope": [],
"skipped_unreadable": [],
"uncertain_needs_review": ["R032"]
}
-
Output partition counts before report writing:
Evidence pipeline complete:
- candidate_count: 51
- confirmed_included: 49
- metadata_only: 1
- excluded_wrong_scope: 0
- skipped_unreadable: 0
- uncertain_needs_review: 1
Leave verification.json checks as "pending" during Stage 1.
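The Stage 1 bookkeeping can be sketched as follows: read the screening entries, build the ledger partitions, and print the partition counts. This is illustrative only. The function names are hypothetical, and apart from `metadata_only_duplicate` mapping to `metadata_only`, the one-to-one mapping between screening decisions and ledger partitions is an assumption.

```python
import json

# Screening decision -> ledger partition.
# Assumption: decisions other than metadata_only_duplicate share the partition name.
DECISION_TO_PARTITION = {
    "confirmed_included": "confirmed_included",
    "metadata_only_duplicate": "metadata_only",
    "excluded_wrong_scope": "excluded_wrong_scope",
    "skipped_unreadable": "skipped_unreadable",
    "uncertain_needs_review": "uncertain_needs_review",
}


def build_ledger(screening_lines):
    """Build a coverage ledger dict from the lines of screening.jsonl."""
    partitions = {name: [] for name in DECISION_TO_PARTITION.values()}
    count = 0
    for line in screening_lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        partitions[DECISION_TO_PARTITION[entry["decision"]]].append(entry["ref_id"])
        count += 1
    return {"candidate_count": count, **partitions}


def partition_summary(ledger):
    """Format the partition counts that must be output before report writing."""
    lines = ["Evidence pipeline complete:",
             f"- candidate_count: {ledger['candidate_count']}"]
    for name in DECISION_TO_PARTITION.values():
        lines.append(f"- {name}: {len(ledger[name])}")
    return "\n".join(lines)
```

Because every entry lands in exactly one partition, the partition sizes always sum to `candidate_count`, which is the property the later verification stage relies on.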
Stage 2: Report Writing
After Stage 1 completes:
- Create a Paper Coverage Matrix table listing every confirmed_included paper with its citation marker
- Write the report body, citing each confirmed paper at least once
- Fill synthesis_notes.md with key findings from the confirmed evidence set
Stage 3: Pre-Finalize Verification (REQUIRED BEFORE OUTPUT)
Verify the following equalities and OUTPUT the check result:
Verification check:
- coverage_ledger.confirmed_included count: 49
- unique_cited_paper_count (count distinct [N] markers in body): 49
- reference_entry_count (count Reference List entries): 49
- All 49 confirmed papers appear in Coverage Matrix: YES
Status: PASS / FAIL
After these checks pass, update verification.json with validator-compatible pass keys:
{
"citation_check": "passed",
"coverage_check": "passed",
"evidence_consistency_check": "passed",
"candidate_count": 51,
"confirmed_included_count": 49,
"metadata_only_count": 1,
"uncertain_needs_review_count": 1,
"unique_cited_paper_count": 49,
"coverage_matrix_entry_count": 49,
"reference_entry_count": 49
}
If any check fails, STOP and fix before finalizing.
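The Stage 3 equalities can be checked mechanically before editing verification.json. The sketch below is an assumption-laden illustration: the function name is hypothetical, the report body is assumed to be passed without its reference list, markers are assumed to be single-number brackets, and the assignment of each equality to a particular `*_check` key is a guess, since the validator's own logic is not shown here.

```python
import re


def verify_report(ledger, body, reference_list, matrix_ids):
    """Return a verification dict; a check stays "pending" until its equality holds."""
    confirmed = set(ledger["confirmed_included"])
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", body)}
    ref_entries = [ln for ln in reference_list.splitlines()
                   if re.match(r"\s*\[\d+\]", ln)]
    matrix = set(matrix_ids)

    def status(ok):
        return "passed" if ok else "pending"

    return {
        # Assumed mapping of equalities to check keys.
        "citation_check": status(len(cited) == len(ref_entries)),
        "coverage_check": status(confirmed <= matrix),
        "evidence_consistency_check": status(len(confirmed) == len(cited)),
        "candidate_count": ledger["candidate_count"],
        "confirmed_included_count": len(confirmed),
        "metadata_only_count": len(ledger["metadata_only"]),
        "uncertain_needs_review_count": len(ledger["uncertain_needs_review"]),
        "unique_cited_paper_count": len(cited),
        "coverage_matrix_entry_count": len(matrix),
        "reference_entry_count": len(ref_entries),
    }
```

This does not replace `report_family.py --complete`, which remains the authoritative gate; it only catches count mismatches earlier.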
Stage 4: Hard Completion Gate
After the report Markdown exists and all evidence files are filled, run:
python scripts/report_family.py --mode journal --journal Energy --complete
or the equivalent direction command. The report is not complete until this command passes.
Allowed Exception
If the user explicitly requests --metadata-only or a "brief/selective report", skip the full-coverage requirements and document the boundary in the report's Coverage / Source Set section.
Workflow Routing
Route user intent to the appropriate workflow:
| User Intent | Workflow | Standalone? |
|---|---|---|
| "initialize vault" / "初始化文献库" | → init | ✅ |
| "scan papers" / "扫描文献", "organize by journal" / "整理期刊" | → scan-organize | ✅ |
| "ingest" / "入库", "process paper" / "处理论文" | → ingest | ✅ |
| "tag" / "标签", "assign tags" / "打标" | → tag | ✅ |
| "journal report for XX" / "XX期刊报告" | → journal-report | ✅ |
| "direction report for XX" / "XX方向报告" | → direction-report | ✅ |
| "review paper" / "literature review" / "综述" / "direction review" | → direction-review | ✅ |
| "method stats" / "方法统计", "dataset stats" / "数据集统计" | → stat-report | ✅ |
| "idea survey" / "novelty survey" / "idea调研" | → idea-survey | ✅ |
| "read this paper" / "单篇文献精读" / "阅读这篇文献" | → paper-read | ✅ |
| "web find" / "联网检索" / "查找论文" / "search papers" | → web-find | ✅ |
| "daily digest" / "今日 arxiv" / "最新预印本" | → web-digest | ✅ |
| "import web clipper" / "导入 web clipper" | → web-import-clipper | ✅ |
| "submission recommendation" / "投稿推荐" | → submission-recommend | ✅ |
| "revision suggestions" / "修改建议" | → revision-suggest | ✅ |
| "vault status" / "文献库状态" | → status | ✅ |
| "health check" / "检查" / "lint" | → lint | ✅ |
| "full pipeline" / "完整流程" | → pipeline | ✅ |
Important: Every workflow can run independently. If preconditions are unmet, output a clear message (e.g., "No source manifest found. Run scan-organize first.") but do NOT auto-chain unless the user explicitly requests "full pipeline".
Routing note:
- Use direction-review when the user wants a review paper, literature review, or survey-style direction synthesis.
- Use idea-survey when the user wants novelty or similarity analysis for a specific idea rather than a direction-level review.
Precondition Matrix
| Workflow | Requires | Auto-generated by |
|---|---|---|
| init | — | — |
| scan-organize | init | — |
| ingest | init | — |
| tag | ingest (≥1 canonical page) | ingest |
| journal-report | ingest (canonical pages for target journal; optional direction/query filters) | ingest |
| direction-report | ingest (canonical pages) + query; optional direction filter | ingest |
| direction-review | ingest (canonical pages with readable source_path for the direction; optional focus) | ingest; web supplementation happens inside the workflow |
| stat-report | ingest + tag | ingest, tag |
| idea-survey | init + source papers; canonical/index preferred; web search optional | ingest (preferred), web-find (optional) |
| paper-read | init + one source or canonical paper | ingest (preferred) |
| web-find | init + query; CLI requires existing direction, Agent may bootstrap a missing one after confirmation | init |
| web-digest | init + query; CLI requires existing direction, Agent may bootstrap a missing one after confirmation | init |
| web-import-clipper | init + existing direction | init |
| submission-recommend | journal-report (for candidate journals) | journal-report |
| revision-suggest | journal-report (for target journal) | journal-report |
| status | init | — |
| lint | init | — |
| pipeline | — | — |
When a workflow's precondition is not met, output:
zh: "前置条件未满足:需要先运行 {workflow_name}。"
en: "Precondition not met: run {workflow_name} first."
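A precondition check that emits the localized message might look like the sketch below. The `REQUIRES` mapping copies only a few rows of the matrix for brevity, and both the function name and the idea of a `completed` set (how the skill actually detects finished workflows is not specified here) are assumptions.

```python
MESSAGES = {
    "zh": "前置条件未满足:需要先运行 {workflow_name}。",
    "en": "Precondition not met: run {workflow_name} first.",
}

# Illustrative subset of the precondition matrix.
REQUIRES = {
    "scan-organize": ["init"],
    "tag": ["ingest"],
    "stat-report": ["ingest", "tag"],
    "submission-recommend": ["journal-report"],
}


def precondition_message(workflow, completed, output_lang="zh"):
    """Return the message for the first unmet precondition, or None if all are met."""
    for required in REQUIRES.get(workflow, []):
        if required not in completed:
            return MESSAGES[output_lang].format(workflow_name=required)
    return None
```

Returning a message instead of running the missing workflow matches the rule that workflows never auto-chain unless the user asks for the full pipeline.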
Template System
Generic Templates
Located in templates/generic/. Domain-agnostic, ship with the skill.
Used as fallback when no domain-specific template exists.
Templates use {{variable}} placeholders as reusable report structure references.
For report_family.py, the default formal CLI path for journal-report and direction-report now prepares a full-text run bundle in workspace/cache/fulltext-report/.
Agent or LLM workflows must not read records[*].source_path directly for report writing. They must run records[*].source_read_command to create the Agent-safe temporary Markdown view, then read that temporary file. source_path remains the original Markdown path for traceability and human reading.
Use --metadata-only only when a deterministic canonical-metadata report is explicitly desired.
Available generic templates:
paper_canonical.md — standard literature page
journal_report.md — journal survey report
direction_report.md — research direction report
direction_review.md — direction-level literature review scaffold
idea_survey_report.md — idea novelty survey
paper_reading.md — single-paper reading note
stat_report.md — method/dataset/experiment statistics
submission_report.md — submission recommendation
revision_report.md — revision suggestions
Domain-Specific Templates
Located in templates/domains/{domain_name}/.
Current codebase status:
config.json can track template registry metadata
status_report.py and lint_vault.py report registry state and staleness signals
- the current formal CLI path does not auto-generate new domain templates
Practical rule:
- Prefer existing generic templates as the stable reference structure
- If a domain template already exists and the user explicitly wants it, treat it as a manual or Agent-selected reference
- Do not describe domain-template auto-generation as an implemented default behavior
Workflow 1: init
Purpose
Initialize the vault structure. Creates missing directories and seed files.
Steps
- Check if E:\paper has config.json; if not, create default config
- Create missing directories:
  - schema/
  - library/papers/, library/reports/journal/, library/reports/direction/, library/reports/review/, library/reports/idea/, library/reports/paper/, library/reports/submission/, library/reports/vault/
  - library/indexes/
  - workspace/cache/, workspace/cache/fulltext-review/, workspace/manifests/, workspace/logs/, workspace/legacy/
  - templates/generic/, templates/domains/
- Create schema/tag_taxonomy.json if missing (empty initial structure)
- Create schema/keyword_rules.json if missing (empty rules array)
- Create schema/paper_frontmatter.schema.md if missing
- Skip creating generic templates if they already exist
- Update paper-library.md with skeleton if needed
Output (zh)
文献库初始化完成!路径:E:\paper
已创建目录:{list}
已创建文件:{list}
接下来你可以:
- "扫描文献" — 扫描 paper/ 目录
- "整理期刊" — 按期刊缩写归类文件
- "入库" — 处理论文并生成 canonical 页
Workflow 2: scan-organize
Status: Implemented (main flow) | duplicate detection: Implemented
Purpose
Scan paper/ for all Markdown files and optionally organize them into journal folders.
Sub-triggers
- "scan papers" / "扫描文献" → steps 1–3 only (scan + plan)
- "organize by journal" / "整理期刊" → steps 1–6 (scan + move + index)
- "check duplicates" / "检查重复" → steps 1, 4 only
Steps
1. Run: python scripts/scan_sources.py
   - Output: workspace/manifests/source_manifest.json
2. Run: python scripts/organize_by_journal.py --all --dry-run
   - Output: workspace/manifests/journal_move_plan.json
3. Display plan summary to user:
   - Files to move (count by action: move/skip/warn/conflict)
   - Journal distribution
4. (If "check duplicates"): Run python scripts/detect_duplicates.py --all:
   - Compare file SHA256 checksums (exact) and normalized title+year (probable)
   - Generate workspace/manifests/duplicate_report.json and .md
5. (If "organize by journal"): Ask user to confirm, then:
   - Run: python scripts/organize_by_journal.py --all --apply
6. Run: python scripts/rebuild_indexes.py
   - Triggers domain profile update (see Template System)
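The two duplicate signals used in the duplicate check, exact SHA256 matches and probable matches on normalized title plus year, can be sketched as below. This does not reproduce detect_duplicates.py; the record layout, the normalization rule, and the output shape are all hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def normalize_title(title):
    # Assumed normalization: lowercase, collapse anything non-alphanumeric to one space.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_duplicates(records):
    """records: dicts with 'path', 'content' (bytes), 'title', and 'year'."""
    by_hash = defaultdict(list)
    by_title_year = defaultdict(list)
    for rec in records:
        digest = hashlib.sha256(rec["content"]).hexdigest()
        by_hash[digest].append(rec["path"])
        by_title_year[(normalize_title(rec["title"]), rec["year"])].append(rec["path"])

    exact = [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
    exact_groups = {frozenset(group) for group in exact}
    # Report a title+year group as probable only if it is not already an exact group.
    probable = [sorted(paths) for paths in by_title_year.values()
                if len(paths) > 1 and frozenset(paths) not in exact_groups]
    return {"exact": exact, "probable": probable}
```

Separating the two lists lets the report treat byte-identical files as safe to deduplicate while leaving title-level matches for human review.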
Output (zh)
扫描完成:{N} 个文件
按操作分类:
- 移动:{move_count}
- 跳过:{skip_count}
- 冲突:{conflict_count}
- 警告:{warn_count}
计划已保存:workspace/manifests/journal_move_plan.json
是否执行移动?(y/n)
Workflow 3: ingest
Purpose
Process paper Markdown files: extract metadata, generate canonical pages, convert HTML tables.
Input
- A specific file path, or "all new papers", or a journal/direction scope
Steps
-
Identify target files:
- If path given → process that file
- If "all" → scan
paper/ for files without corresponding canonical pages in library/papers/
- If journal/direction given → filter accordingly
-
For each target file:
a. Read file and parse frontmatter using the same logic as scripts/common.py
b. Extract/complete metadata:
title: from frontmatter or filename
journal, journal_abbr: from resolve_journal() logic
published_date: parse from frontmatter published, created, or body text
doi: extract from frontmatter source or body DOI patterns
url: from frontmatter source
c. Generate paper ID: {direction}-{year}-{journal_abbr}-{slug}
- slug = first 5 significant words of title, lowercase, hyphenated
d. Identify tag candidates from title, abstract, keywords, highlights:
- Match against
schema/keyword_rules.json patterns
- Agent or LLM review can optionally suggest extra tags during a manual path
- Rule-based tagging is the only built-in automatic write-back path
e. Convert HTML tables to Markdown:
- Run:
python scripts/html_table_to_md.py <file_path> (if HTML tables found)
- Or an Agent can convert simple tables inline during a manual path
f. Generate canonical page to library/papers/{direction}/{paper_id}.md:
- Use template:
templates/generic/paper_canonical.md
- Fill frontmatter fields, abstract, keywords
- Set
source_path to link back to original source file
- Preserve
## User Notes section if existing canonical page already exists
-
Domain profile update: Count papers by domain tag, update config.json template registry
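The paper ID rule from step c can be illustrated with the sketch below. The stopword list that decides which words count as "significant" is an assumption, as are the function names; the real ingest logic may use a different list and may handle non-ASCII titles differently.

```python
import re

# Assumed stopword list; the actual definition of "significant" is not specified.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "to",
             "with", "by", "from", "using", "via"}


def make_slug(title, max_words=5):
    """First five significant words of the title, lowercase and hyphenated."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    significant = [w for w in words if w not in STOPWORDS]
    return "-".join(significant[:max_words])


def make_paper_id(direction, year, journal_abbr, title):
    """Compose {direction}-{year}-{journal_abbr}-{slug}."""
    return f"{direction}-{year}-{journal_abbr}-{make_slug(title)}"
```

For example, `make_paper_id("Battery", 2024, "RESS", "A Transformer-Based Method for Battery RUL Prediction")` yields `Battery-2024-RESS-transformer-based-method-battery-rul` under these assumptions.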
Batch Command
python scripts/ingest_batch.py --direction Battery --journal Energy --apply-tags --rebuild-indexes
python scripts/ingest_batch.py --file paper/Battery/arxiv/example.md --apply-tags
python scripts/ingest_batch.py --direction Battery --dry-run
Output (zh)
入库完成:处理了 {N} 篇论文
新增 canonical 页:
- {paper_id_1}
- {paper_id_2}
标签候选(需确认):
- "transfer learning" → method [新标签]
- "CALCE" → dataset [已有标签]
确认添加新标签?(y/n)
Workflow 4: tag
Status: Implemented (read-only audit and write-back paths clarified)
Purpose
Manage the tag system: view, edit, batch-assign, and analyze tags.
Sub-triggers
- "view tags" / "查看标签" → display tag_taxonomy.json summary
- "batch tag" / "批量打标" → run keyword rules on all canonical pages
- "add tag" / "添加标签" → add a custom tag to taxonomy
- "tag stats" / "标签统计" → show tag frequency distribution
Scripts
| Script | Purpose | Command |
|---|---|---|
| scan_tags.py | Read-only audit: scan tag coverage, rule hits | python scripts/scan_tags.py --direction Battery --rules |
| ingest_batch.py --apply-tags | Write-back entry: batch tag canonical pages | python scripts/ingest_batch.py --direction Battery --apply-tags |
Note: scan_tags.py is read-only and never modifies canonical pages. Tag write-back must use ingest_batch.py --apply-tags.
Steps (batch tag)
-
Load schema/tag_taxonomy.json and schema/keyword_rules.json
-
For each canonical page in library/papers/:
a. Read frontmatter tags
b. Apply keyword rules to title, abstract, keywords sections
c. Optional Agent review may suggest additional tags for manual confirmation
d. Merge rule: preserve user tags; add keyword-rule hits without overwriting existing user edits
-
Apply tag updates through python scripts/ingest_batch.py --direction Battery --apply-tags
-
Rebuild indexes when batch tagging changes canonical frontmatter
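The merge rule in step d (preserve user tags, add keyword-rule hits without overwriting) can be sketched as a pure function. The tag layout, a mapping from dimension to a list of values, is an assumption about the frontmatter structure, and `merge_tags` is a hypothetical name.

```python
def merge_tags(existing, rule_hits):
    """Merge rule hits into existing tags without removing or reordering user tags.

    Both arguments map a tag dimension (e.g. "method") to a list of tag values.
    """
    # Copy so the caller's frontmatter dict is never mutated.
    merged = {dim: list(values) for dim, values in existing.items()}
    for dim, hits in rule_hits.items():
        current = merged.setdefault(dim, [])
        for tag in hits:
            if tag not in current:
                current.append(tag)  # append only new values
    return merged
```

Because the function only ever appends, running it repeatedly with the same rule hits leaves the tags unchanged.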
Commands
python scripts/ingest_batch.py --direction Battery --apply-tags --rebuild-indexes
python scripts/scan_tags.py --direction Battery
python scripts/scan_tags.py --direction Battery --rules --include-empty
Output (zh)
批量打标完成:更新了 {N} 篇论文的标签
标签分布:
- task: SOH estimation (45), RUL prediction (38), SOC estimation (12), ...
- method: LSTM (28), Transformer (22), GPR (15), PINN (12), ...
- dataset: NASA (35), CALCE (30), Oxford (18), ...
新增标签:{list}
Workflow 5: journal-report
Purpose
Prepare a full-text literature survey workflow for a specific journal.
Formal CLI
python scripts/report_family.py --mode journal --journal RESS
python scripts/report_family.py --mode journal --journal RESS --direction Battery --query "soh"
python scripts/report_family.py --mode journal --journal RESS --metadata-only
python scripts/report_family.py --mode journal --journal RESS --complete
Steps
- Load canonical records from
library/indexes/canonical_pages.json (or rebuild in memory if the index is missing)
- Filter by journal name or abbreviation
- If
--direction is set, further restrict to that exact direction
- If
--query is set, further restrict the journal subset using canonical query matching
- Partition selected records into:
- readable records with valid
source_path
- skipped records with missing/unreadable source files
- Audit journal identity before writing:
- Do not rely only on the parent folder or canonical
journal_abbr
- Confirm journal identity from DOI, URL, source-page journal title, or full-text journal heading
- For Elsevier journals, treat DOI prefix and ScienceDirect journal page evidence as stronger than folder name
- Partition records into:
confirmed_included: journal identity confirmed
metadata_only: duplicate or metadata-only records (use metadata_only_duplicate screening decision, map to metadata_only ledger partition)
excluded_wrong_scope: wrong journal or out of scope
skipped_unreadable: missing or unreadable source files
uncertain_needs_review: ambiguous journal identity
- Save a single disposable run bundle:
workspace/cache/fulltext-report/{run_key}.json
- Agent runs
records[*].source_read_command for each readable record to create the Agent-safe temporary Markdown view, reads that temporary file, applies the journal audit, and writes the final report to:
library/reports/journal/{journal_key}-report-{date}.md
- Before writing the final report, fill all required evidence files under
bundle.evidence_dir: screening.jsonl, paper_notes.jsonl, coverage_ledger.json, synthesis_notes.md, verification.json
- Include a
Paper Coverage Matrix in the final report:
- Every
confirmed_included paper must appear in the matrix with a numeric citation marker
- Excluded, skipped, and uncertain records must be summarized separately without reference-list entries
- After the final report exists, run:
python scripts/report_family.py --mode journal ... --complete
- Write a compact preparation/completion log entry to
workspace/logs/report_generation.md
Notes
--mode journal --journal RESS means journal-only selection: select all canonical papers from that journal and do not apply extra filtering
--direction and --query only narrow the already-selected journal subset when explicitly provided
- Missing source files are skipped silently; detailed reasons stay only in the run-bundle JSON
- Final journal-report conclusions must come from full-text evidence, not from canonical metadata alone
- Journal reports default to full coverage of
confirmed_included records; write a representative-only report only when explicitly requested
--complete is the formal close-out step after the final Markdown report has been written
--metadata-only keeps the old deterministic canonical-metadata report path
Workflow 6: direction-report
Purpose
Prepare a full-text direction or topic status report from local canonical pages.
Formal CLI
python scripts/report_family.py --mode direction --query "soh"
python scripts/report_family.py --mode direction --direction Battery --query "soh"
python scripts/report_family.py --mode direction --direction Battery --query "soh" --metadata-only
python scripts/report_family.py --mode direction --direction Battery --query "soh" --complete
Steps
- Load canonical records from the whole vault
- If
--direction is set, restrict to that exact direction
- Apply query matching using title, abstract, keywords, and tag fields
- Partition selected records into readable vs. skipped by
source_path
- Save a single disposable run bundle:
workspace/cache/fulltext-report/{run_key}.json
- Agent runs
records[*].source_read_command for each readable record to create the Agent-safe temporary Markdown view, reads that temporary file, and writes the final report to:
library/reports/direction/{topic_slug}-report-{date}.md
- Before writing the final report, fill all required evidence files under
bundle.evidence_dir: screening.jsonl, paper_notes.jsonl, coverage_ledger.json, synthesis_notes.md, verification.json
- After the final report exists, run:
python scripts/report_family.py --mode direction ... --complete
- Write a compact preparation/completion log entry to
workspace/logs/report_generation.md
Notes
--mode direction --query "soh" allows cross-journal local screening from the whole vault
--direction only narrows the query scope when explicitly provided
- Missing source files are skipped silently; detailed reasons stay only in the run-bundle JSON
- Final direction-report conclusions must come from full-text evidence, not from canonical metadata alone
--complete is the formal close-out step after the final Markdown report has been written
--metadata-only keeps the deterministic canonical-metadata report path
Workflow 7: stat-report
Purpose
Generate deterministic statistics reports for one tag dimension.
Formal CLI
python scripts/report_family.py --mode stat --direction Battery --dimension method
python scripts/report_family.py --mode stat --direction Battery --dimension dataset --cross-dimension method
Steps
- Load canonical records for the direction
- Aggregate by the requested dimension:
- count papers per tag value
- compute yearly trend
- compute cross-tabulation with another tag dimension
- Generate deterministic Markdown tables
- Save to
library/reports/direction/{dimension}-stats-{date}.md
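The aggregation steps above can be sketched with standard-library counters. The record shape (a `year` field and a `tags` mapping from dimension to values) and both function names are assumptions for illustration; sorting by count and then by name is one way to keep the tables deterministic.

```python
from collections import Counter


def tag_stats(records, dimension, cross_dimension=None):
    """Count papers per tag value, per (year, value), and per (value, other value)."""
    counts, yearly, cross = Counter(), Counter(), Counter()
    for rec in records:
        for value in rec["tags"].get(dimension, []):
            counts[value] += 1
            yearly[(rec["year"], value)] += 1
            if cross_dimension:
                for other in rec["tags"].get(cross_dimension, []):
                    cross[(value, other)] += 1
    return counts, yearly, cross


def to_markdown_table(counts, header):
    """Render counts as a Markdown table with a stable row order."""
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    lines = [f"| {header} | Papers |", "|---|---|"]
    lines += [f"| {value} | {n} |" for value, n in rows]
    return "\n".join(lines)
```

The same input therefore always produces byte-identical tables, which is what makes the report reproducible.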
Workflow 8: idea-survey
Purpose
Survey literature similarity to a user idea and assess novelty through LLM/full-text reading.
Current execution path
This workflow is LLM/Agent-driven, not implemented as a report_family.py screening mode.
Steps
- Use canonical pages only as an index to locate source files through
source_path
- Let the LLM/Agent read relevant source Markdown files under
paper/
- During reading, extract problem, method, dataset, experiment setup, baselines, metrics, results, and limitations
- Let the LLM/Agent decide which papers are truly similar or dissimilar after reading full evidence
- If web supplementation is needed, run
web_search.py find first and then let the LLM/Agent read the resulting source Markdown
- Generate the final report at
library/reports/idea/{idea_slug}-survey-{date}.md only after the full-text review pass
Notes
- Do not use keyword/tag similarity as the final idea candidate filter
- Do not treat metadata-only matches as novelty evidence
- Final novelty judgment must be grounded in source Markdown, not only canonical abstracts or tags
Workflow 9: web-find
Status: Implemented (CLI fail-fast; Agent may bootstrap missing direction)
Purpose
Search academic web sources and save results as Markdown files in the local vault.
Prerequisites
- CLI path:
paper/{direction}/ directory must already exist before running web_search.py
- Agent path: if direction is missing, the Agent may guide the user to choose one of two direction options and create it after confirmation
Data Layers
- Formal full-text library:
paper/{direction}/{journal_abbr}/ for clipped journal papers and arXiv papers with extracted full text
- Web-search research layer:
paper/web_search/{direction}/{source}/ for OpenAlex/Semantic Scholar metadata and arXiv non-full-text fallbacks
- Knowledge layer:
library/ for canonical pages, indexes, and reports
Input
- Required:
--direction {existing_direction}
- Required: query text, e.g.
web-find --direction Battery --query "battery RUL transformer" --top 10
- Optional:
--top N, --source mixed|openalex|semanticscholar|arxiv|venues, --fulltext, --no-fulltext, --arxiv-id, --no-domain-filter, --show-filtered, --dry-run
Steps
-
CLI behavior: validate that paper/{direction}/ already exists. If missing, web_search.py fails fast with guidance to create the direction before running web-find.
-
Agent recovery branch for a missing direction:
- inspect the user query, configured directions, and
web_search.domain_profiles
- offer two direction options:
- a best-match existing direction when one is plausible
- a suggested new direction name; if no existing direction fits, offer two suggested new names
- wait for user confirmation before making any filesystem or config changes
- after confirmation, append the chosen direction to
config.json -> directions, create paper/{direction}/ and paper/web_search/{direction}/, and seed a minimal web_search.domain_profiles.{direction} stub
- rerun the original
web-find command with the confirmed direction
-
Fetch results:
- Primary: OpenAlex (search, citation-filtered by web_search.min_citations; send openalex_api_key and optional openalex_email when configured)
- In mixed/openalex mode, query both high-citation classic papers and recent papers, then merge and deduplicate
- Secondary: Semantic Scholar, only when semantic_scholar_api_key is configured
- --source venues: keep only OpenAlex candidates whose venue matches web_search.domain_profiles.{direction}.preferred_venues
- Fallback: arXiv only when --source arxiv is set, or when --source mixed returns fewer than --top results
-
Domain-filter, rank, and deduplicate:
- Evaluate each candidate against
web_search.domain_profiles.{direction}
- Strict profiles require all configured
required_groups and reject strong negative keyword hits
- Save only accepted candidates unless
--no-domain-filter is set
- arXiv ID is the primary identity for arXiv results
- DOI is the primary identity for all other results
- normalized title + year is the fallback identity when neither is available
- never overwrite existing source Markdown files
-
Save OpenAlex / Semantic Scholar results to:
paper/web_search/{direction}/openalex/{year}-{first_author}-{title_slug}.md
paper/web_search/{direction}/semanticscholar/{year}-{first_author}-{title_slug}.md
- API results include metadata, abstract, DOI/URL, source ID, and a note that formal full text should be supplied via Obsidian Web Clipper when needed
-
Save arXiv results by full-text status:
full_text_extracted → paper/{direction}/arxiv/{year}-{first_author}-{title_slug}-{arxiv_id}.md
pdf_saved_only / abstract_only / failed → paper/web_search/{direction}/arxiv/{year}-{first_author}-{title_slug}-{arxiv_id}.md
- Try
html > tex > pdf > api unless --no-fulltext is set
-
Generate canonical pages and rebuild indexes only for formal source saves under paper/{direction}/
-
Generate a web-find report:
library/reports/web/{date}-{direction}-find-report.md
- Include local duplicates, arXiv full-text downloads, OA/SS metadata findings, skipped and failed records, and filtered-out candidates with reasons
-
Write manifest and log:
workspace/manifests/arxiv_fulltext_results.json for arXiv full text
workspace/manifests/web_search_results.json for OA/SS metadata saves
- Include
filtered_out so rejected candidates are auditable
workspace/logs/web_search.md
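The strict-profile part of the domain filter can be sketched as below. Several details are assumptions rather than documented behavior: each entry in `required_groups` is treated as a list of alternative keywords where any one hit satisfies the group, any substring hit on a negative keyword counts as "strong", and non-strict profiles are simply accepted because their matching rule is not described here.

```python
def passes_domain_filter(text, profile):
    """Return (accepted, reason) for one candidate's title/abstract text."""
    if not profile.get("strict"):
        # Assumption: non-strict filtering is out of scope for this sketch.
        return True, "accepted"
    lowered = text.lower()
    # Reject on any negative keyword hit (simple substring match).
    for keyword in profile.get("negative_keywords", []):
        if keyword.lower() in lowered:
            return False, f"negative keyword: {keyword}"
    # Every required group must be satisfied by at least one of its keywords.
    for group in profile.get("required_groups", []):
        if not any(k.lower() in lowered for k in group):
            return False, f"missing required group: {group}"
    return True, "accepted"
```

Returning the reason alongside the decision is what allows rejected candidates to be recorded under `filtered_out` and audited later. Substring matching is deliberately naive (for example, `rul` also matches `rule`), so a real filter would likely need word boundaries.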
Command
python scripts/web_search.py find --direction Battery --query "battery RUL transformer" --top 10
python scripts/web_search.py find --direction Battery --source arxiv --arxiv-id 2502.18807v7 --fulltext
Workflow 10: web-digest
Status: Implemented (CLI fail-fast; Agent may bootstrap missing direction)
Purpose
Fetch recent arXiv papers for a direction and save them as Markdown sources plus a digest report.
Input
- Required:
--direction {existing_direction}
- Required:
--query "topic"
- Optional:
--top N, --no-domain-filter, --show-filtered, --dry-run
Steps
- CLI behavior: validate that paper/{direction}/ already exists. If missing, web_search.py fails fast with guidance to create the direction before running web-digest.
- Agent recovery branch for a missing direction:
- reuse the same bootstrap flow as
web-find
- analyze the query and existing direction/profile context
- offer two direction options and wait for user confirmation
- after confirmation, create the direction folders and config/profile stub, then rerun
web-digest
- Build a profile-aware arXiv query from
web_search.domain_profiles.{direction}
- Query arXiv by submitted date
- Apply the same domain filter and ranking path used by
web-find
- Save arXiv full-text successes to
paper/{direction}/arxiv/
- Use the same
html > tex > pdf > api fallback as web-find
- Save PDF-only, abstract-only, and failed fallback records to
paper/web_search/{direction}/arxiv/
- Generate canonical pages and rebuild indexes only when a full-text arXiv paper entered the formal library
- Generate digest report at
library/reports/web/{date}-{direction}-digest.md, including filtered-out candidates
- Log the operation
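The html > tex > pdf > api fallback shared by web-find and web-digest can be pictured as a first-success loop. The sketch below assumes each source is exposed as a callable that returns text or nothing; fetch_with_fallback and the fetchers mapping are hypothetical and do not mirror the real arxiv_fulltext.py interface.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Priority order from the text: html > tex > pdf > api.
PRIORITY = ("html", "tex", "pdf", "api")


def fetch_with_fallback(
    arxiv_id: str,
    fetchers: Dict[str, Callable[[str], Optional[str]]],
    priority: Sequence[str] = PRIORITY,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each source in priority order; return (source, text) for the first success."""
    for source in priority:
        fetch = fetchers.get(source)
        if fetch is None:
            continue
        try:
            text = fetch(arxiv_id)
        except Exception:
            # A failing source should not abort the chain; try the next one.
            continue
        if text:
            return source, text
    # No source produced usable text.
    return None, None
```

The returned source name is what a caller could use to decide between the formal library and the web-search layer.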
Command
python scripts/web_search.py digest --direction Battery --query "battery health prognosis" --top 10
Workflow 11: web-import-clipper
Purpose
Import Obsidian Web Clipper Markdown into the vault as full-text source Markdown.
Input
- Required: --direction {existing_direction}
- Optional: --inbox path, default from web_search.clipper_inbox
- Optional: --dry-run
Steps
- Read .md files from workspace/web-inbox/ or the provided inbox path
- Extract title, authors, year, journal, DOI, URL, and abstract from frontmatter/body
- Deduplicate by DOI or normalized title + year
- Save normalized Markdown to paper/{direction}/{journal_abbr}/
- Preserve clipped body content and add missing vault metadata
- Generate canonical pages, rebuild indexes, and write:
  - workspace/manifests/web_clipper_import.json
  - workspace/logs/web_search.md
The importer archives successfully imported inbox files to workspace/web-inbox/imported/. Dry-run and skipped-existing files are not moved.
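The "DOI or normalized title + year" deduplication step could be implemented with a comparison key along these lines. This is a sketch under stated assumptions: dedup_key is a hypothetical helper, and the normalization (NFKC, lowercase, ASCII alphanumerics only) is one plausible choice, not necessarily what web_import_clipper.py does.

```python
import re
import unicodedata


def dedup_key(doi, title, year):
    """Build a duplicate-detection key: DOI when present, else normalized title + year."""
    if doi:
        # DOIs are case-insensitive; strip a resolver prefix if one was clipped.
        cleaned = re.sub(r"^https?://(dx\.)?doi\.org/", "", doi.strip(), flags=re.IGNORECASE)
        return ("doi", cleaned.lower())
    # Assumed normalization: NFKC, lowercase, collapse non-alphanumerics to one space.
    text = unicodedata.normalize("NFKC", title or "").lower()
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return ("title", text, str(year))
```

Two clips collide when their keys are equal. A record with a DOI and a record without one never share a key here, so a real importer would likely compare the title key as a second pass.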
Command
python scripts/web_import_clipper.py --direction Battery
Workflow 12: submission-recommend
Purpose
Recommend suitable journals for a user's paper based on local literature evidence.
Input
Path to the user's paper (Markdown file)
Steps
- Read and analyze the user's paper:
  - Extract: research topic, methods, datasets, key results, reference list
- Identify candidate journals from the vault (all journals with ≥ 5 papers)
- For each candidate journal:
  a. Read the journal report if it exists; otherwise prompt user to generate one first
  b. Score across 6 dimensions:
| Dimension | Weight | Description |
|---|---|---|
| topic_fit | 0.25 | Does the paper's topic match the journal's recent hotspots? |
| method_fit | 0.20 | Does the method type align with the journal's preferences? |
| novelty_fit | 0.20 | Is the paper differentiated from the journal's recent work? |
| experiment_fit | 0.15 | Do datasets, metrics, and experiment scale meet journal norms? |
| citation_fit | 0.10 | Does the reference list cover the journal's key papers? |
| risk | -0.10 | Scope mismatch, insufficient novelty, experimental gaps |
  c. Compute total score (0–100 scale)
- Rank top 5 journals by total score
- Generate report → library/reports/submission/{paper_slug}-recommend-{date}.md
  - Use template: templates/generic/submission_report.md
  - Apply the Report Citation Policy: cite journal reports and underlying papers where they support scoring evidence or recommendations, and list all cited evidence in References.
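One way to read the scoring table is as a weighted sum. The sketch below is an assumption-heavy illustration: the weights come from the table, but the per-dimension 0-100 scale, the treatment of risk as a score where higher means riskier, and the clamping are all assumptions, and total_score is a hypothetical name.

```python
# Weights copied from the scoring table above.
WEIGHTS = {
    "topic_fit": 0.25,
    "method_fit": 0.20,
    "novelty_fit": 0.20,
    "experiment_fit": 0.15,
    "citation_fit": 0.10,
    "risk": -0.10,  # penalty: a higher risk score lowers the total
}


def total_score(scores):
    """Weighted sum of per-dimension scores, each assumed to lie on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    raw = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Clamp so a heavily penalized paper never drops below zero.
    return round(max(0.0, min(100.0, raw)), 1)
```

Under these assumptions the highest reachable total is 90, because the positive weights sum to 0.90. The source does not say whether the result is rescaled to fill the 0-100 range.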
Output (zh)
投稿推荐报告
论文:{title}
推荐期刊 Top 5:
1. {journal_1}({score_1}/100)— {reason}
2. {journal_2}({score_2}/100)— {reason}
3. {journal_3}({score_3}/100)— {reason}
4. {journal_4}({score_4}/100)— {reason}
5. {journal_5}({score_5}/100)— {reason}
报告已保存:library/reports/submission/{paper_slug}-recommend-{date}.md
Workflow 13: revision-suggest
Purpose
Generate targeted revision suggestions for a paper aimed at a specific journal.
Input
- Path to the user's paper (Markdown file)
- Target journal name or abbreviation
Steps
- Read the user's paper
- Read the target journal's literature collection (canonical pages)
- Read the journal report if available
- Compare across 5 dimensions:
a. Formatting (排版):
- Section structure vs. typical papers in the journal
- Figure/table style conventions
- Length norms
b. Method Writing Style (方法写作风格):
- How methods are typically presented in this journal
- Level of mathematical formalism
- Pseudo-code vs. text description preferences
c. Research Method Sufficiency (研究方法充足性):
- Number of baselines typically compared
- Ablation study expectations
- Statistical significance testing norms
d. Introduction Alignment (引言适配):
- Research motivation framing typical of this journal
- Literature review scope and depth
- Problem statement style
e. Reference Coverage (参考文献覆盖):
- Key papers from this journal that should be cited
- Reference count norms
- Self-citation patterns
- Prioritize suggestions by impact: critical → important → optional
- Generate report → library/reports/submission/{paper_slug}-revision-for-{journal}-{date}.md
  - Use template: templates/generic/revision_report.md
  - Apply the Report Citation Policy: cite target-journal papers where they support formatting, method, experiment, introduction, reference-coverage, or action-list suggestions, and list all cited evidence in Evidence Basis.
Output (zh)
面向 {journal} 的修改建议
评估维度:
- 排版:{score}/5 — {summary}
- 方法写作:{score}/5 — {summary}
- 研究方法:{score}/5 — {summary}
- 引言适配:{score}/5 — {summary}
- 参考文献:{score}/5 — {summary}
关键修改建议(共 {N} 条):
1. [关键] {suggestion_1}
2. [关键] {suggestion_2}
3. [重要] {suggestion_3}
...
报告已保存:library/reports/submission/{paper_slug}-revision-for-{journal}-{date}.md
Workflow 14: status
Purpose
Display vault status summary.
Formal CLI
python scripts/status_report.py
python scripts/status_report.py --direction Battery
Steps
- Load source records from library/indexes/papers.json
- Load canonical records from library/indexes/canonical_pages.json
- Summarize:
  - source / canonical counts
  - direction and journal distribution
  - tag coverage across tags_*
  - recent web/report activity from workspace/logs/
  - template registry state from config.json
- Save:
  - library/reports/vault/status-{date}.md
  - workspace/manifests/status_report.json
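The counting part of the summary can be sketched with collections.Counter. The block assumes both index files load into lists of dicts carrying the direction, journal_abbr, and tags_* fields shown in the canonical page format; the real index layout may differ, and summarize is a hypothetical name.

```python
from collections import Counter


def summarize(sources, canonicals):
    """Summarize counts and tag coverage from already-loaded index records."""
    by_direction = Counter(r.get("direction", "unknown") for r in sources)
    by_journal = Counter(r.get("journal_abbr", "unknown") for r in sources)
    # A canonical page counts as tagged if any tags_* field is non-empty.
    tagged = sum(
        1
        for r in canonicals
        if any(v for k, v in r.items() if k.startswith("tags_"))
    )
    return {
        "total": len(sources),
        "canonical_count": len(canonicals),
        "by_direction": dict(by_direction),
        "top_journals": by_journal.most_common(5),
        "tagged_count": tagged,
    }
```

The returned values map onto the {total}, {canonical_count}, and {tagged_count} placeholders of the status output.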
Output (zh)
文献库状态
论文总数:{total}
按方向:
- Battery: {count}
- TimeSeries: {count}
按期刊(前 5):
- Energy: {count}
- JES: {count}
- RESS: {count}
- AppliedEnergy: {count}
- JPS: {count}
Canonical 页:{canonical_count} / {total}({pct}% 已入库)
标签覆盖率:{tagged_count} / {canonical_count}({pct}%)
领域模板:
- battery: ✅ 已生成({date},{paper_count} 篇时生成)
- timeseries: ❌ 未生成
最近操作:
{last_5_log_entries}
Workflow 15: lint
Purpose
Health check for the vault.
Formal CLI
python scripts/lint_vault.py
python scripts/lint_vault.py --direction Battery
Checks
- Orphan canonical pages: canonical page exists but source file is missing
- Missing canonical pages: source file exists but no canonical page
- Tag inconsistencies: tags used in canonical pages that are not registered in tag_taxonomy.json
- Stale indexes: index files older than the latest source/canonical files
- Missing frontmatter: canonical pages missing required fields
- Broken source_path: canonical points to a non-existent source file
- Template staleness: registry entries whose paper counts have outgrown their recorded baseline
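Two of these checks reduce to simple comparisons, sketched below. Both helpers are hypothetical: find_mismatches assumes the caller has already collected source file paths and the source_path values of canonical pages, and template_is_stale assumes staleness means relative growth above a threshold, with the 0.2 default mirroring templates.regeneration_threshold in config.json. Whether lint_vault.py applies exactly this rule is an assumption.

```python
def find_mismatches(source_paths, canonical_source_paths):
    """Compare source files on disk with the source_path values of canonical pages."""
    sources = set(source_paths)
    referenced = set(canonical_source_paths)
    return {
        # Source file exists but no canonical page points at it.
        "missing_canonical": sorted(sources - referenced),
        # Canonical page points at a source file that is not there.
        "orphan_canonical": sorted(referenced - sources),
    }


def template_is_stale(baseline_count, current_count, threshold=0.2):
    """Flag a template once the corpus has grown by more than `threshold` (assumed rule)."""
    if baseline_count <= 0:
        return current_count > 0
    growth = (current_count - baseline_count) / baseline_count
    return growth > threshold
```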
Output files
library/reports/vault/lint-{date}.md
workspace/manifests/lint_vault.json
Output (zh)
文献库健康检查报告
✅ 通过:
- 索引更新状态
- 标签一致性
⚠️ 警告:
- {N} 篇论文未入库(无 canonical 页)
- {N} 个标签未在 taxonomy 中注册
- 领域模板 battery 已过时(新增 {pct}% 论文)
❌ 错误:
- {N} 个 canonical 页找不到源文件
建议操作:
1. 运行 "入库" 处理未入库论文
2. 运行 "标签" 更新标签体系
3. 运行 "重建索引" 刷新索引
Workflow 16: pipeline
Purpose
Execute the full preprocessing pipeline in sequence.
Steps
Execute in order, stopping on errors:
- init
- scan-organize (scan only, no move unless user confirms)
- ingest (all unprocessed papers)
- tag (batch tag)
- rebuild indexes (via python scripts/rebuild_indexes.py)
- status (show final state)
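The "execute in order, stopping on errors" behavior can be sketched as a loop over named steps. run_pipeline is a hypothetical helper: the step callables stand in for the real workflows, and the returned dict is just one possible way to report where the run stopped.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Stop immediately so later steps never run on a broken vault state.
            return {"completed": completed, "failed": name, "error": str(exc)}
        completed.append(name)
    return {"completed": completed, "failed": None, "error": None}
```

A caller would pass the six steps in the listed order, for example [("init", ...), ("scan-organize", ...), ...], and render the result into the output template.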
Output (zh)
完整流程执行完成
1. ✅ 初始化
2. ✅ 扫描:{N} 个文件
3. ✅ 入库:{N} 篇新增
4. ✅ 打标:{N} 篇更新
5. ✅ 索引重建
6. 当前状态:{summary}
Workflow 17: paper-read
Purpose
Read one paper deeply and generate a structured reading note for single-paper understanding.
Answer these questions:
- What problem does this paper solve?
- Why is this problem important?
- What method or model does it use?
- Why can this method or model solve the problem?
- What are the core conclusions?
- What can be done next?
Input
- Required: one paper path, title, DOI, canonical id, or source path
- Preferred: canonical page under library/papers/{direction}/
- Allowed: source Markdown under paper/{direction}/...
- Optional: user research context, such as "focus on battery SOH" or "focus on method novelty"
Steps
- Locate the paper:
  - If input is a path, read it directly.
  - If input is a title, id, or DOI, search canonical pages first, then source files.
  - If both source and canonical exist, prefer canonical metadata and inspect source/full text when needed.
- Read metadata:
  - title, authors, journal, year, DOI, source path, tags.
- Read content:
  - Abstract
  - Introduction / motivation
  - Method / model
  - Experiments / datasets
  - Results
  - Limitations / discussion / conclusion
  - Full text via source_path if a canonical page is provided
- Generate the reading note using templates/generic/paper_reading.md.
- Evidence discipline:
  - Ground every answer in the paper text.
  - If a point is inferred rather than explicitly stated, label it as inference.
  - If information is missing, write "Not available in the provided paper text" rather than guessing.
  - Do not use external literature unless the user explicitly asks for comparison.
  - Because this workflow reads one paper, do not apply the multi-paper Report Citation Policy unless additional papers are used.
- Save output when the user asks for a file or when a durable note is useful: library/reports/paper/{date}-{paper_id}-reading.md
Output (zh)
单篇文献精读:{title}
1. 这篇文章解决了什么问题?
{answer}
2. 这个问题为什么重要?
{answer}
3. 本文使用了什么方法或模型?
{answer}
4. 为什么这个方法或模型能解决这个问题?
{answer}
5. 核心结论是什么?
{answer}
6. 下一步可以怎么做?
{answer}
阅读笔记已保存:library/reports/paper/{date}-{paper_id}-reading.md
Workflow 18: direction-review
Purpose
Prepare a review-writing bundle for one direction, combining local full-text evidence, related vault context, and default web supplementation.
Current execution path
This workflow is Agent-driven with a lightweight preparation script, not a report_family.py mode.
Formal preparation command
python scripts/prepare_direction_review.py --direction Battery
python scripts/prepare_direction_review.py --direction Battery --focus "battery SOH"
python scripts/prepare_direction_review.py --direction Battery --focus "battery SOH" --top 6 --dry-run
Input
- Required: --direction {existing_direction}
- Optional: --focus "topic"
- Optional: --top N for approximate total web supplementation volume
- Optional: --dry-run
Steps
- Validate that the direction exists and already has canonical pages with a usable source_path. If canonical pages are missing, prompt the user to run ingest first.
- Load canonical pages for the direction.
- If --focus is set, further restrict the local set using title, abstract, keyword, and tag matching.
- Partition the selected local set into:
  - readable records with valid source_path
  - skipped records with missing/unreadable source files
- Derive 1-3 web queries from:
  - the direction name
  - the optional focus text
  - top local task / method / application tags when available
- Run default web supplementation using the existing web_search.py logic:
  - reuse current ranking, domain filtering, and arXiv full-text behavior
  - allow both new formal full-text saves and web-layer metadata saves into the review bundle
- Gather related context paths from library/reports/journal/, library/reports/direction/, library/reports/idea/, and library/reports/web/.
- Build review hints from the local reading set:
  - candidate method categories
  - common datasets
  - common metrics
  - common applications
  - suggested comparison tables
- Save the preparation outputs:
  - workspace/cache/fulltext-review/{run_key}.json
  - workspace/manifests/direction_review_prepare.json
- Agent reads every readable record in the bundle and writes the final review to: library/reports/review/{direction-or-focus}-review-{date}.md
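The readable/skipped partition step can be sketched as follows. partition_records is a hypothetical helper, not the code in prepare_direction_review.py; it assumes each canonical record is a dict with an optional vault-relative source_path, and it takes an injectable is_readable check so the rule can be tried without touching the disk.

```python
from pathlib import Path


def partition_records(records, vault_root, is_readable=None):
    """Split canonical records into readable and skipped lists by source_path."""
    root = Path(vault_root)
    # Default check: the source file must exist on disk.
    check = is_readable or (lambda path: path.is_file())
    readable, skipped = [], []
    for record in records:
        source_path = record.get("source_path")
        if source_path and check(root / source_path):
            readable.append(record)
        else:
            skipped.append(record)
    return readable, skipped
```

The lengths of the two lists correspond to the {local_readable} and {local_skipped} counts in the preparation output.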
Writing rules
- This workflow borrows the review-writing mode from a review-oriented skill, but it must stay domain-agnostic.
- Do not hardcode field-specific method taxonomies or section names from any pre-existing domain template.
- Method categories, datasets, metrics, and application groupings must be inferred from the current direction corpus.
- Every major section should include at least one comparison table.
- Every major method category should end with a limitations paragraph.
- Unless the user explicitly asks for a deep review, a standard direction-review should target 40-80 cited references.
- When the user explicitly asks for a deep / comprehensive / full survey review, target 80-120 cited references.
- Related reports in library/reports/ are secondary context only; all substantive conclusions must still come from the paper text.
Notes
direction-review is not a replacement for direction-report.
direction-report remains the direction/topic status report workflow.
direction-review is the survey-style literature review workflow.
--dry-run still writes the preparation bundle and manifest, but it does not write the final review Markdown and may include preview-only web records that are not yet readable on disk.
Output (zh)
方向综述准备完成
方向:{direction}
聚焦主题:{focus_or_none}
本地可读论文:{local_readable}
本地跳过论文:{local_skipped}
网络补充记录:{web_count}
Bundle:workspace/cache/fulltext-review/{run_key}.json
Manifest:workspace/manifests/direction_review_prepare.json
最终综述目标:library/reports/review/{direction-or-focus}-review-{date}.md
Script Reference
Scripts are in scripts/ and use only the Python standard library.
| Script | Purpose | Usage |
|---|---|---|
| scan_sources.py | Scan paper sources | python scripts/scan_sources.py |
| organize_by_journal.py | Journal-based file organization | python scripts/organize_by_journal.py --all --dry-run |
| detect_duplicates.py | Exact/probable duplicate detection | python scripts/detect_duplicates.py --direction Battery |
| rebuild_indexes.py | Rebuild indexes | python scripts/rebuild_indexes.py |
| html_table_to_md.py | Convert HTML tables | python scripts/html_table_to_md.py <file> |
| resolve_journal.py | Inspect journal resolution for one source file | python scripts/resolve_journal.py paper/Battery/Energy/example.md |
| ingest_batch.py | Batch-generate canonical pages and optionally apply keyword-rule tags | python scripts/ingest_batch.py --direction Battery --apply-tags |
| scan_tags.py | Scan canonical tag coverage and keyword-rule hits | python scripts/scan_tags.py --direction Battery --rules |
| export_summaries.py | Export titles, metadata, abstracts, and keywords for review | python scripts/export_summaries.py --direction Battery --format json |
| report_family.py | Full-text journal/direction run-bundle preparation by default; --complete closes the run after the final report exists; deterministic reports with --metadata-only; stat reports unchanged | python scripts/report_family.py --mode journal --journal RESS |
| prepare_direction_review.py | Prepare a direction-level literature review bundle with default web supplementation for Agent writing | python scripts/prepare_direction_review.py --direction Battery --focus "battery SOH" |
| status_report.py | Vault status summary (Markdown + JSON) | python scripts/status_report.py |
| lint_vault.py | Vault health check (Markdown + JSON) | python scripts/lint_vault.py |
| web_search.py | Search OpenAlex/Semantic Scholar/arXiv and save Markdown papers | python scripts/web_search.py find --direction Battery --query "topic" |
| arxiv_fulltext.py | Fetch arXiv HTML/TeX/PDF/API fallback full text | Imported by web_search.py |
| web_import_clipper.py | Import Obsidian Web Clipper Markdown | python scripts/web_import_clipper.py --direction Battery |
| common.py | Shared utilities | Imported by other scripts |
All scripts use the vault root as project root (auto-detected from script location via Path(__file__).resolve().parents[1]).
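For readers unfamiliar with the parents indexing, the auto-detection expression can be unpacked like this. vault_root_for is a hypothetical wrapper added only to make the expression testable; the scripts themselves apply it to __file__.

```python
from pathlib import Path


def vault_root_for(script_path):
    """Return the vault root for a script stored in <vault>/scripts/."""
    # parents[0] is the scripts/ folder, parents[1] is the vault root above it.
    return Path(script_path).resolve().parents[1]
```

Inside a script this would read ROOT = vault_root_for(__file__), so the result does not depend on the current working directory.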
Historical draft scripts and one-off migration tools now live under workspace/legacy/ and are not part of the formal workflow surface.
Schema Reference
tag_taxonomy.json
Defines tag dimensions and known tags:
{
"dimensions": {
"task": { "label": "Task", "abbr_map": {} },
"method": { "label": "Method", "abbr_map": {} },
...
},
"tags": {
"task": ["SOH estimation", ...],
"method": ["LSTM", ...],
...
}
}
keyword_rules.json
Maps text patterns to tags:
{
"rules": [
{ "pattern": "state of health|SOH", "tag": "SOH estimation", "dimension": "task" },
...
]
}
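Applying these rules to a paper's text could look like the sketch below. apply_keyword_rules is a hypothetical helper; treating pattern as a regular expression is suggested by the | alternation in the example rule, and case-insensitive matching is an assumption that the schema does not state.

```python
import re


def apply_keyword_rules(text, rules):
    """Return {dimension: [tags]} for every rule whose pattern matches the text."""
    hits = {}
    for rule in rules:
        # Case-insensitive matching is an assumption; the schema does not say.
        if re.search(rule["pattern"], text, flags=re.IGNORECASE):
            tags = hits.setdefault(rule["dimension"], [])
            if rule["tag"] not in tags:
                tags.append(rule["tag"])
    return hits
```

With the example rule, a title mentioning "State of Health" would yield {"task": ["SOH estimation"]}. Short acronyms can match inside longer words, so a stricter rule set might wrap them in \b word boundaries.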
journal_aliases.json
Maps journal full names to abbreviations. The file already exists and contains 27+ entries.
paper_frontmatter.schema.md
Documents required and optional frontmatter fields for canonical pages.
Canonical Page Format
Canonical pages in library/papers/{direction}/{paper_id}.md serve as index anchors linking to source files. They contain metadata, tags, abstract, and user notes — not full text.
Full text reading and evidence retrieval should use source_path to access the original Markdown in paper/.
---
id: battery-2025-ress-bayesian-calibrated-pinn
title: "Paper title"
direction: Battery
source_path: "paper/Battery/RESS/paper.md"
source_checksum: sha256...
journal: "Reliability Engineering & System Safety"
journal_abbr: "RESS"
published_date: "2025-12"
published_year: 2025
doi: "10.1016/..."
url: "https://..."
tags_task: []
tags_method: []
tags_dataset: []
tags_domain: []
tags_signal: []
tags_application: []
tags_metric: []
tags_custom: []
status: "unread"
reading_priority: "medium"
updated_at: timestamp
---
<!-- User-maintained section. Scripts must never overwrite this. -->
Safety Rules
- Never delete files in paper/: these are the user's original paper Markdown files
- File moves require --dry-run first, then user confirmation before --apply
- Preserve ## User Notes: never overwrite content under this heading in any file
- Log everything: all file operations go to workspace/logs/
- Tag priority: user tags > rule tags > manually confirmed extra suggestions (never remove user tags)
- No cross-direction moves: files stay within their research direction unless user explicitly requests
- No overwrites: if target file exists during organize, log as conflict, do not overwrite
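The rule about preserving ## User Notes can be honored during regeneration with a merge step like the one below. This is a sketch under a stated assumption: the User Notes section is taken to run from its heading to the end of the file, and merge_preserving_user_notes is a hypothetical name rather than a function in common.py.

```python
USER_NOTES_HEADING = "## User Notes"


def merge_preserving_user_notes(existing_text, regenerated_text):
    """Regenerate a page while keeping the existing User Notes section verbatim."""
    marker = existing_text.find(USER_NOTES_HEADING)
    if marker == -1:
        # Nothing to preserve; the regenerated page is used as-is.
        return regenerated_text
    # Assumption: User Notes is the last section, so keep everything after the heading.
    user_notes = existing_text[marker:]
    new_marker = regenerated_text.find(USER_NOTES_HEADING)
    head = regenerated_text if new_marker == -1 else regenerated_text[:new_marker]
    return head.rstrip("\n") + "\n\n" + user_notes
```

Script-owned sections are refreshed from the regenerated text, while whatever the user wrote under the heading is carried over unchanged.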