| name | paper-reviewer |
| version | 0.1.0 |
| description | Generate publication-grade peer reviews of academic papers (PDF) via a multi-stage neuro-symbolic pipeline: PDF→Markdown ingestion, five orthogonal analyses (story, presentation, evaluations, correctness, significance) with accumulated context, synthesis, self-critique, revision, and final quality check. Each analytical stage runs with dual grounding (original PDF for visual context + Markdown for symbolic structure) and stage-appropriate tool augmentation (code interpreter for evaluations/correctness, web search for significance). Preserves the closed-loop pattern `decompose → analyze → synthesize → self-verify → revise` that distinguishes self-improving inference from one-shot generation.
Use this skill whenever the user asks to: review a paper, write a peer review, critique a manuscript, generate an AI review, evaluate a submission for a conference/journal, check a paper's correctness or significance, assess a thesis or preprint, or any variant of "review this PDF" where the input is an academic paper. Triggers on: "review this paper", "peer review", "critique this manuscript", "evaluate this submission", "AI review", "check this paper", "assess this preprint", "review my thesis", "这篇论文评审" (review this paper), "帮我审稿" (help me review this manuscript), "写个同行评审" (write a peer review), or upload of a paper PDF with review intent, even if the user does not explicitly say "review". Also use for iterative revision of an existing draft review, or for auditing a review for unsupported claims.
|
paper-reviewer
Multi-stage academic paper reviewer implementing a closed-loop
decomposition-and-error-correction pipeline. Designed as a neuro-symbolic
agent workflow with dual-representation input (PDF + Markdown), stage-wise
tool augmentation, and monadic context accumulation across stages.
When to use
Trigger on any request where the input is an academic paper PDF and the
deliverable is a substantive review, regardless of venue (conference,
journal, thesis committee, arXiv preprint). Also use for:
- Auditing an existing draft review for unsupported claims or citation errors
- Revising a draft review based on a critique
- Running just a subset of stages (e.g., only correctness check)
Pipeline
┌────────────────────────────────────────────────────────────────┐
│                       PAPER PDF (input)                        │
└───────────────────┬────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 0: PDF → Markdown                                       │
│  markitdown | pandoc | source-text-to-markdown | olmOCR        │
│  Produces: normalized .md + resampled .pdf (250 DPI)           │
└───────────────────┬────────────────────────────────────────────┘
                    │  context = {pdf, markdown, history:[]}
                    ▼
┌────────────────────────────────────────────────────────────────┐
│  Stages 1–5 (sequential, context accumulates)                  │
│                                                                │
│  1. story        - problem formulation & narrative validity    │
│  2. presentation - clarity, structure, readability             │
│  3. evaluations  - datasets, baselines, metrics, stats         │
│                    + tool: Python code interpreter             │
│  4. correctness  - equations, proofs, algorithms, tables       │
│                    + tool: Python code interpreter             │
│  5. significance - novelty, prior-work comparison              │
│                    + tool: web search                          │
│                                                                │
│  Prompt at each stage = base + stage + ALL prior outputs       │
└───────────────────┬────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 6: Synthesize initial review                            │
│  → Title, Summary, Overall, Strengths, Weaknesses, Refs (APA)  │
└───────────────────┬────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 7: Self-critique                                        │
│  Detect: unsupported claims, missing evidence,                 │
│          inconsistencies, hallucinated/incorrect citations     │
└───────────────────┬────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 8: Revise → final review                                │
└───────────────────┬────────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────────────────────┐
│  Stage 9: Critic quality check                                 │
│  Bias, identity leakage, structure, hallucinated citations     │
│  → optional human inspection gate                              │
└───────────────────┬────────────────────────────────────────────┘
                    │
                    ▼
        Final Review (.md + .docx)
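The context hand-off in the diagram (`context = {pdf, markdown, history:[]}` and "base + stage + ALL prior outputs") can be sketched as follows. This is a minimal illustration, not the code in scripts/stage_runner.py: the `Context` class, `build_prompt`, `run_stages`, and the `analyze` callable are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Stage names taken from the pipeline diagram (stages 1-5).
STAGES = ["story", "presentation", "evaluations", "correctness", "significance"]


@dataclass
class Context:
    """Accumulated state threaded through the analysis stages (hypothetical shape)."""
    pdf: str
    markdown: str
    history: list = field(default_factory=list)  # [(stage_name, output), ...]


def build_prompt(base: str, stage_instructions: str, ctx: Context) -> str:
    """Compose base prompt + stage instructions + ALL prior stage outputs."""
    prior = "\n\n".join(f"[{name}]\n{output}" for name, output in ctx.history)
    return "\n\n".join(part for part in (base, stage_instructions, prior) if part)


def run_stages(ctx: Context, base: str, instructions: dict, analyze) -> Context:
    """Run the five analyses in order, appending each output to the history.

    `analyze(stage, prompt, ctx)` is a stand-in for the reasoning step; it
    receives the full context so both the PDF and the Markdown stay available.
    """
    for stage in STAGES:
        prompt = build_prompt(base, instructions[stage], ctx)
        ctx.history.append((stage, analyze(stage, prompt, ctx)))
    return ctx
```

Because the history is only ever appended to, each later stage is conditioned on every earlier finding, which is the property the pipeline relies on.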
How to run
Quick path (one command)
python scripts/run_review.py \
--pdf path/to/paper.pdf \
--out /mnt/user-data/outputs/ \
--converter markitdown
This orchestrates all stages and saves:
review_final.md - the polished review
review_final.docx - Word version (via docx skill if requested)
trace/ - per-stage outputs for audit
context.json - accumulated history
Stage-by-stage (when Claude runs the LLM stages directly)
The LLM-reasoning stages (1–8) are executed by Claude itself, not by a
subprocess. The orchestrator script prepares inputs, loads the right prompt
from references/, and writes the output back. A typical run looks like:
- Stage 0: run scripts/pdf_to_markdown.py (deterministic, runs via
bash_tool).
- Stages 1–5: for each stage, read references/stage_prompts.md for that
stage, read the accumulated context.history, perform the reasoning, and
append the result to context.history. For stages with tool augmentation,
actually invoke the tool:
  - evaluations / correctness: use bash_tool to run Python
  (verify reported numbers, re-derive equations, sanity-check tables).
  - significance: use web_search to check claimed novelty against
  prior work.
- Stage 6: read references/synthesis_prompt.md and synthesize the initial
review using the full history.
- Stage 7: read references/critique_prompt.md and self-critique the
initial review against the paper.
- Stage 8: read references/revision_prompt.md and produce the final
review.
- Stage 9: read references/critic_check_prompt.md and run the quality
check; if issues are flagged, surface them to the user for human inspection.
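As an illustration of what "verify reported numbers" can mean in the evaluations and correctness stages, the sketch below recomputes a relative improvement from two table entries and compares it with the percentage a paper claims. The function name, the tolerance, and the numbers in the usage line are all hypothetical; they are not taken from any paper or from the skill's scripts.

```python
import math


def check_relative_improvement(baseline, proposed, claimed_pct, tol_pct=0.1):
    """Recompute (proposed - baseline) / baseline in percent and compare it
    with the claimed value. Returns (is_consistent, recomputed_pct)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    actual = (proposed - baseline) / baseline * 100.0
    return math.isclose(actual, claimed_pct, abs_tol=tol_pct), actual


# Hypothetical table entries: baseline 80.0, proposed 84.0, claimed "+7.5%".
consistent, recomputed = check_relative_improvement(80.0, 84.0, 7.5)
print(consistent, round(recomputed, 2))
```

A mismatch like this one is a candidate weakness to record in the stage output, together with the recomputed value, so that the synthesis stage can cite it.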
Skipping stages is fine for partial reviews (e.g., --stages correctness for
a correctness-only audit). Always run Stage 9.
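One way the stage selection could behave is sketched below, assuming that a partial run executes only the requested stages plus the always-on Stage 9. The names for stages 6 to 9 (`synthesis`, `critique`, `revision`, `critic_check`) and the `resolve_stages` helper are hypothetical; the actual `--stages` handling in scripts/run_review.py may differ.

```python
# Pipeline order; names for stages 6-9 are assumptions made for this sketch.
PIPELINE = [
    "story", "presentation", "evaluations", "correctness", "significance",
    "synthesis", "critique", "revision", "critic_check",
]
ALWAYS_RUN = "critic_check"  # Stage 9 is never skipped


def resolve_stages(requested=None):
    """Return the stages to run, in pipeline order, with Stage 9 always included."""
    if not requested:
        return list(PIPELINE)
    unknown = sorted(set(requested) - set(PIPELINE))
    if unknown:
        raise ValueError(f"unknown stage(s): {', '.join(unknown)}")
    wanted = set(requested) | {ALWAYS_RUN}
    return [stage for stage in PIPELINE if stage in wanted]
```

With this sketch, `resolve_stages(["correctness"])` yields the correctness stage followed by the critic check, regardless of the order in which stages were requested.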
Key design patterns (preserved from source paper)
- Monadic context accumulation. Every stage sees all prior stage outputs,
so later reasoning is conditioned on earlier findings. Do not drop history
between stages.
- Dual representation. PDF gives visual grounding (figures, layout,
tables as rendered); Markdown gives symbolic structure (tokens, headings,
equations as LaTeX). Both are passed at every stage.
- Stage-wise tool augmentation. Don't give all tools to all stages;
that dilutes reasoning. Tools attach where they sharpen a specific check
(numbers in evaluations, math in correctness, claims in significance).
- Decompose-then-synthesize. Never generate the review in one shot.
Always run the five orthogonal analyses first, then synthesize.
- Self-verify before finalizing. The self-critique step is not optional:
it is where most citation and support errors are caught.
- Retry with exponential backoff. All LLM/tool calls use the retry
helper in scripts/llm_client.py.
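The retry pattern named in the last item can be sketched as below. The helper in scripts/llm_client.py is not shown in this document, so the function name, parameters, defaults, and the 10% jitter here are assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(fn, *, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure.

    The delay before retry k (0-based) is min(max_delay, base_delay * 2**k)
    plus up to 10% random jitter. The last failure is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the original error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in normal use the default `time.sleep` applies.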
Stage 0 converter options
| Converter | When to use | Notes |
|---|---|---|
| markitdown | Fast, well-structured PDFs, no math | Microsoft's tool; pip install markitdown |
| pandoc | Generic fallback | Robust, widely available |
| source-text-to-markdown | Already in skill ecosystem | Wraps pandoc + metadata extraction + YAML front matter; preferred when available |
| olmocr | Scanned PDFs, heavy math, complex layout | Original paper's choice; slowest but highest fidelity |
Auto-selection: the runner picks source-text-to-markdown if installed,
falls back to markitdown if available, then pandoc, then errors out
with instructions.
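The fallback order above can be sketched as a simple preference scan. How the runner actually detects an installed converter is not stated here; this sketch assumes each converter is discoverable as an executable on PATH via `shutil.which`, and `pick_converter` is a hypothetical name.

```python
import shutil

# Preference order from the auto-selection rule above.
PREFERENCE = ["source-text-to-markdown", "markitdown", "pandoc"]


def pick_converter(is_available=lambda name: shutil.which(name) is not None):
    """Return the first available converter in preference order.

    `is_available` is injectable so that detection can be replaced (for
    example, a skill-registry lookup instead of a PATH lookup).
    """
    for name in PREFERENCE:
        if is_available(name):
            return name
    raise RuntimeError(
        "No PDF-to-Markdown converter found. Install one of: "
        + ", ".join(PREFERENCE)
        + " (for example: pip install markitdown)."
    )
```

Note that olmocr is not part of the automatic order in the rule above, so in this sketch it would only be used when requested explicitly through `--converter`.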
Output contract
Final review follows templates/review_template.md:
# <Paper Title>
## Summary
## Overall Assessment
## Strengths
## Weaknesses
## Detailed Comments
### Story / Framing
### Presentation
### Evaluations
### Correctness
### Significance
## References (APA)
## Metadata
- model: <llm-name>
- converter: <stage0-tool>
- stages_run: [...]
- critic_flags: [...]
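A final review can be checked against this contract mechanically. The sketch below is a hypothetical validator (it is not one of the listed scripts) that reports which required headings are missing from a review; it checks presence only, not order or content.

```python
import re

# Headings required by the template above.
REQUIRED_H2 = [
    "Summary", "Overall Assessment", "Strengths", "Weaknesses",
    "Detailed Comments", "References (APA)", "Metadata",
]
REQUIRED_H3 = [
    "Story / Framing", "Presentation", "Evaluations", "Correctness",
    "Significance",
]

# A level 1-3 ATX heading: hashes, spaces or tabs, then the title text.
HEADING_RE = re.compile(r"^(#{1,3})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(review_md: str) -> list:
    """Return the template headings that do not appear in the review."""
    found = {(len(hashes), title) for hashes, title in HEADING_RE.findall(review_md)}
    missing = []
    if not any(level == 1 for level, _ in found):
        missing.append("# <Paper Title>")
    missing += [f"## {t}" for t in REQUIRED_H2 if (2, t) not in found]
    missing += [f"### {t}" for t in REQUIRED_H3 if (3, t) not in found]
    return missing
```

An empty result means every heading in the template is present; anything else is a natural candidate for a structure flag in the Stage 9 check.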
KSTAR mapping
This pipeline is a direct KSTAR instance. See references/kstar_mapping.md
for the full morphism; the short version:
| Pipeline element | KSTAR component |
|---|---|
| Paper PDF + Markdown | Situation S |
| "Write a review" | Task T |
| Stage instructions + prior reviews | Knowledge K |
| Stage outputs | Action-plan decomposition Â |
| Initial review | Expected result R̂ |
| Self-critique | ΔR / ΔE detection |
| Revision | Learning/update loop |
| Final review | Actual result R |
| Critic check | Validation oracle |
The full morphism: F(K, S, T) → Â → R̂ → Critique → ΔR → Update → R.
Reference files
references/stage_prompts.md - stage-specific instructions (Table 3 from source paper)
references/synthesis_prompt.md - how to compose the initial review
references/critique_prompt.md - self-critique checklist
references/revision_prompt.md - revision instructions
references/critic_check_prompt.md - post-hoc quality gate
references/kstar_mapping.md - full KSTAR/Functorism morphism
templates/review_template.md - output format
Scripts
scripts/run_review.py - top-level orchestrator (CLI entry point)
scripts/pdf_to_markdown.py - Stage 0 converter dispatcher
scripts/stage_runner.py - per-stage context+prompt preparation
scripts/llm_client.py - LLM wrapper with retry/backoff
Notes
- The paper PDF should be resampled to ~250 DPI before stage 1 if using a
vision-capable model; pdf_to_markdown.py does this automatically.
- For very long papers (>30 pages), stage_runner.py will chunk the
Markdown into logical sections (by heading) and run a map-reduce pass per
stage. The chunker preserves equation, table, and figure integrity.
- If the critic detects ≥2 flags in Stage 9, the runner surfaces the review
for human inspection rather than auto-finalizing. This is the
human-in-the-loop gate.
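The heading-based chunking mentioned in the notes above can be sketched as follows. This is an illustrative approximation, not the chunker in scripts/stage_runner.py: it splits only at Markdown headings and refuses to split inside fenced code blocks or `$$ ... $$` display-math blocks. Table rows never start a new chunk because they do not begin with `#`; figure handling is not modeled.

```python
import re

HEADING = re.compile(r"^#{1,6}[ \t]+\S")   # ATX heading at line start
FENCE = re.compile(r"^(`{3}|~{3})")        # opening or closing code fence


def chunk_by_heading(markdown: str) -> list:
    """Split Markdown into chunks that each start at a heading.

    Lines that look like headings inside a code fence or a $$ math block
    are treated as ordinary content, so those blocks stay intact.
    """
    chunks, current = [], []
    in_fence = in_math = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if FENCE.match(stripped):
            in_fence = not in_fence
        elif not in_fence and stripped == "$$":
            in_math = not in_math
        starts_section = bool(HEADING.match(line)) and not (in_fence or in_math)
        if starts_section and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk can then be analyzed separately (the map step) and the per-chunk findings merged into one stage output (the reduce step).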