| name | paper-research |
| description | End-to-end paper research support for arXiv/literature surveys, reproducibility-focused paper shortlisting, and experiment design. Use when you need to (1) search arXiv with complex queries, (2) download PDFs, extract text/sections, and fetch BibTeX, (3) dedupe/cluster results into a structured report, and (4) turn findings into a lit-review plan, benchmark/evaluation suite, and representation/probing experiment checklist (e.g., implicit reasoning, hidden-CoT, multilingual reasoning, cross-lingual alignment). |
Paper Research
Overview
Run a fast, reproducible “survey → shortlist → synthesize” loop for research topics, backed by small scripts that fetch arXiv metadata/PDFs/BibTeX, extract text, and generate structured Markdown briefs.
Quick start (recommended workflow)
- Create a topic workspace directory (keep everything together):
- Example:
notes/implicit-reasoning-survey/
- Search arXiv and (optionally) download PDFs:
- Run:
python3 scripts/arxiv_survey.py --terms "implicit reasoning" "hidden chain-of-thought" "multilingual reasoning" --max-results 100 --download-pdfs --pdf-dir ./pdfs --out ./arxiv.jsonl
- Extract text (+ rough sections) from PDFs:
- Run:
python3 scripts/pdf_extract.py --pdf-dir ./pdfs --out-dir ./texts --sections
- Fetch BibTeX for the found arXiv IDs:
- Run:
python3 scripts/arxiv_bibtex.py --from-jsonl ./arxiv.jsonl --out ./refs.bib
- Generate a structured research brief (table + clusters + TODO slots for notes):
- Run:
python3 scripts/generate_report.py --jsonl ./arxiv.jsonl --out ./REPORT.md
Then ask Codex to synthesize (taxonomy/benchmarks/experiments) using REPORT.md + your notes.
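The four script steps above can also be chained from one small driver. This is a minimal sketch, assuming the scripts accept exactly the flags shown in the quick start and that the driver runs from the directory that contains scripts/. The helper names build_commands and run_pipeline are hypothetical and not part of this skill.

```python
import subprocess
from pathlib import Path


def build_commands(workdir, terms, max_results=100):
    """Return the four quick-start commands as argument lists.

    Flags mirror the quick-start commands; they are not verified
    against the scripts themselves.
    """
    w = Path(workdir)
    jsonl = str(w / "arxiv.jsonl")
    return [
        ["python3", "scripts/arxiv_survey.py", "--terms", *terms,
         "--max-results", str(max_results), "--download-pdfs",
         "--pdf-dir", str(w / "pdfs"), "--out", jsonl],
        ["python3", "scripts/pdf_extract.py", "--pdf-dir", str(w / "pdfs"),
         "--out-dir", str(w / "texts"), "--sections"],
        ["python3", "scripts/arxiv_bibtex.py", "--from-jsonl", jsonl,
         "--out", str(w / "refs.bib")],
        ["python3", "scripts/generate_report.py", "--jsonl", jsonl,
         "--out", str(w / "REPORT.md")],
    ]


def run_pipeline(workdir, terms, max_results=100):
    """Run each step in order; stop at the first failing step."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(workdir, terms, max_results):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Calling run_pipeline("notes/implicit-reasoning-survey", ["implicit reasoning"]) would keep every artifact inside the topic workspace. Passing argument lists rather than a shell string avoids quoting problems with multi-word terms.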
Workflow decision tree
A) “I need a lit review plan + paper outline”
Do this:
- Use the scripts to produce REPORT.md (table + clusters) and refs.bib.
- Build a survey plan as a set of falsifiable questions, each paired with “what evidence would change my mind”.
- Output deliverables (in this order):
- Lit review plan (subtopics → why → representative papers to read first)
- Benchmarks/metrics (existing + proposed) aligned to the hypothesis
- Validation experiments (including representation/probing/interventions)
- Paper outline + expected contributions
When relevant, include the “fastest path to reproduce” (datasets, eval harnesses, probing code).
B) “I need a reproducibility-first shortlist”
Prioritize:
- Open-source repos (training recipe, evaluation harness, probing code)
- Clear protocol (hyperparams, seeds, compute, preprocessing)
- Reusable artifacts (scripts, configs, checkpoints, datasets)
Do this:
- Run arxiv_survey.py with stricter terms and fewer results (e.g., 30–80).
- Ask Codex to rank papers in REPORT.md by reproducibility criteria:
- Code availability, license clarity, dataset accessibility, protocol completeness
- Produce:
- Ranked shortlist with repo links (if available)
- “Reusable parts” per paper (eval harness / probing / training recipe)
- Minimal reproduction plan (timeboxed: 2h / 1d / 1w)
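The ranking criteria above can be made explicit as a weighted score, so the shortlist order is auditable. This is a minimal sketch: the weights, the field names, and the 0-to-1 scoring scale are all hypothetical and would be filled in by whoever reviews each paper.

```python
# Hypothetical weights; tune them per project. They sum to 1.0.
WEIGHTS = {
    "code_available": 0.4,
    "license_clear": 0.1,
    "dataset_accessible": 0.25,
    "protocol_complete": 0.25,
}


def repro_score(paper):
    """Weighted sum of criteria, each scored in [0, 1] by the reviewer.

    Missing criteria count as 0.
    """
    return sum(w * float(paper.get(k, 0.0)) for k, w in WEIGHTS.items())


def rank_papers(papers):
    """Sort papers from most to least reproducible (ties keep input order)."""
    return sorted(papers, key=repro_score, reverse=True)
```

Keeping the per-criterion scores next to the total makes it easy to see why a paper ranked where it did, for example strong code but no accessible dataset.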
C) “I need an evaluation suite + detection experiments (multilingual latent reasoning)”
Use this structure:
- Hypothesis → operational definition (what counts as “English latent reasoning”)
- Tasks:
- Multi-step reasoning across languages (same semantics, different surface forms)
- Translation-free reasoning (language-neutral, symbol-heavy, or synthetic)
- Controlled prompts enforcing target-language output
- Metrics that separate reasoning vs fluency:
- Task accuracy, step-consistency proxies, calibration, controllability, latency
- Representation-level detection:
- Layer-wise language ID / probing on activations
- Activation patching/interventions (swap “language subspace” signals)
- Forced-language and mixed-language ablations
- Expected signatures + failure modes (confounds: translation, tokenization, data mixture)
Use assets/experiment_checklist.md as the backbone checklist.
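To make the layer-wise language ID probing step concrete, here is a toy sketch that uses synthetic activations and a nearest-centroid classifier in place of a real model and a trained linear probe. Everything in it is an assumption for illustration: the layer names, the "en"/"de" labels, the shift values, and the choice of classifier.

```python
import random


def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train, test):
    """Nearest-centroid probe: fit one centroid per label, score on test.

    train and test are lists of (activation_vector, language_label) pairs.
    """
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    centroids = {label: centroid(vs) for label, vs in by_label.items()}
    correct = 0
    for vec, label in test:
        pred = min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))
        correct += pred == label
    return correct / len(test)


def synthetic_layer(rng, shift, n=200, dim=8):
    """Fake activations: label 'en' is offset by `shift` on every dimension."""
    data = []
    for i in range(n):
        label = "en" if i % 2 == 0 else "de"
        offset = shift if label == "en" else 0.0
        data.append(([rng.gauss(offset, 1.0) for _ in range(dim)], label))
    return data


rng = random.Random(0)
# Hypothetical layers: shift 0.0 carries no language signal, 3.0 a strong one.
layers = {
    "layer_0": synthetic_layer(rng, 0.0),
    "layer_12": synthetic_layer(rng, 3.0),
}
accuracy = {}
for name, data in layers.items():
    train, test = data[:100], data[100:]
    accuracy[name] = probe_accuracy(train, test)
```

In a real experiment the vectors would be hidden states collected per layer, and probe accuracy would be compared against a control task so that high accuracy is not mistaken for evidence of latent English reasoning. The confounds listed above (translation, tokenization, data mixture) still apply.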
Templates (assets/)
Copy and fill these as working docs:
assets/research_brief.md → one-topic brief (taxonomy + top papers + open questions)
assets/paper_comparison_table.md → consistent per-paper extraction fields
assets/experiment_checklist.md → step-by-step experimental checklist
Scripts
All scripts are pure-Python (stdlib) where possible. pdf_extract.py supports optional extractors; if none are available, it prints a clear install hint.
scripts/arxiv_survey.py
Search arXiv via the official Atom API, write results to JSONL, and optionally download PDFs.
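For orientation, the request and parsing steps of an Atom API search might look like the following. This is a sketch, not the script's actual code: OR-ing the terms as quoted phrases over all fields and the output field names other than arxiv_id are assumptions.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(terms, max_results=100, start=0):
    """OR together quoted phrases, searched across all fields."""
    query = " OR ".join(f'all:"{t}"' for t in terms)
    params = {"search_query": query, "start": start, "max_results": max_results}
    return API_URL + "?" + urllib.parse.urlencode(params)


def clean(text):
    """Collapse the line breaks and runs of spaces that feeds contain."""
    return " ".join(text.split())


def parse_feed(xml_text):
    """Turn an Atom feed (str or bytes) into a list of dicts, one per entry."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall(ATOM + "entry"):
        abs_url = entry.findtext(ATOM + "id", default="").strip()
        records.append({
            # Entry ids look like http://arxiv.org/abs/<id>; keep the part after /abs/.
            "arxiv_id": abs_url.rsplit("/abs/", 1)[-1],
            "title": clean(entry.findtext(ATOM + "title", default="")),
            "summary": clean(entry.findtext(ATOM + "summary", default="")),
            "authors": [a.findtext(ATOM + "name", default="")
                        for a in entry.findall(ATOM + "author")],
            "published": entry.findtext(ATOM + "published", default=""),
        })
    return records
```

The response body from urllib.request.urlopen(build_query_url(...)).read() can be passed straight to parse_feed. Writing one json.dumps(record) per line then yields a JSONL file of the kind the other scripts consume.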
scripts/arxiv_bibtex.py
Fetch BibTeX from arxiv.org for a list of arXiv IDs or a JSONL produced by arxiv_survey.py.
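A rough sketch of the two steps involved, collecting IDs from the JSONL and fetching one entry per ID, is shown below. The URL pattern is an assumption about how arxiv.org exposes BibTeX and should be checked before use; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: arxiv.org serves a BibTeX entry at this path.
BIBTEX_URL = "https://arxiv.org/bibtex/{arxiv_id}"


def ids_from_jsonl_lines(lines):
    """Collect unique arXiv IDs in first-seen order, skipping blank lines
    and records without an arxiv_id field."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        arxiv_id = json.loads(line).get("arxiv_id")
        if arxiv_id:
            seen.setdefault(arxiv_id, None)
    return list(seen)


def fetch_bibtex(arxiv_id, timeout=30):
    """Download one BibTeX entry as text (network call; not run here)."""
    url = BIBTEX_URL.format(arxiv_id=arxiv_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

When fetching many entries, a short pause between requests is a reasonable courtesy to the server.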
scripts/pdf_extract.py
Extract text from PDFs into .txt and optionally produce rough section splits (heuristics).
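The kind of heuristic behind rough section splits can be sketched as follows, assuming headings appear on their own line as either a numbered title ("1 Introduction", "2.1 Setup") or one of a few bare keywords. The pattern and the function name are illustrative and may differ from what the script does.

```python
import re

# Heuristic: a heading is a short standalone line that is either numbered
# and capitalized, or one of a few well-known bare section names.
HEADING = re.compile(
    r"^(?:\d+(?:\.\d+)*\.?\s+[A-Z][^\n]{0,78}"
    r"|Abstract|Introduction|Conclusion|References)\s*$"
)


def split_sections(text):
    """Return a list of (heading, body) pairs.

    Text before the first detected heading is stored under 'preamble'.
    """
    sections = [("preamble", [])]
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            sections.append((stripped, []))
        else:
            sections[-1][1].append(line)
    return [(heading, "\n".join(body).strip()) for heading, body in sections]
```

A heuristic like this misfires on sentences that start with a number, which is why the splits are best treated as rough navigation aids rather than ground truth.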
scripts/dedupe_jsonl.py
Dedupe a JSONL file by arxiv_id and near-duplicate titles (useful when iterating queries).
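One way to implement both checks with the standard library is sketched below. Stripping the version suffix from arxiv_id, the title normalization, and the 0.92 similarity threshold are assumptions chosen for illustration.

```python
import difflib
import re


def normalize_title(title):
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def dedupe(records, threshold=0.92):
    """Keep the first record per arxiv_id (version suffix ignored) and drop
    later records whose normalized title nearly matches a kept one."""
    kept, seen_ids, kept_titles = [], set(), []
    for rec in records:
        base_id = re.sub(r"v\d+$", "", rec.get("arxiv_id", ""))
        title = normalize_title(rec.get("title", ""))
        if base_id and base_id in seen_ids:
            continue
        # Skip the title check for empty titles so they never match each other.
        if title and any(
            difflib.SequenceMatcher(None, title, t).ratio() >= threshold
            for t in kept_titles
        ):
            continue
        if base_id:
            seen_ids.add(base_id)
        kept.append(rec)
        kept_titles.append(title)
    return kept
```

The pairwise title comparison is quadratic in the number of kept records, which is fine for a few hundred results but would need blocking or hashing for much larger sets.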
scripts/generate_report.py
Generate a structured Markdown report (table + clusters + TODO note slots) from arxiv.jsonl.
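As an illustration of the table and TODO-slot parts of such a report, here is a minimal sketch. Clustering is left out, and the column choice, field names, and TODO wording are assumptions rather than the script's actual output format.

```python
def escape_cell(text):
    """Keep a value on one table row: collapse whitespace, escape pipes."""
    return " ".join(str(text).split()).replace("|", "\\|")


def render_report(records, topic="Untitled topic"):
    """Build a Markdown brief: a summary table plus one TODO slot per paper."""
    lines = [
        f"# {topic}",
        "",
        "| arXiv ID | Title | Published |",
        "| --- | --- | --- |",
    ]
    for rec in records:
        lines.append("| {} | {} | {} |".format(
            escape_cell(rec.get("arxiv_id", "")),
            escape_cell(rec.get("title", "")),
            escape_cell(rec.get("published", ""))[:10],  # keep the date part only
        ))
    lines += ["", "## Notes"]
    for rec in records:
        lines += [
            "",
            f"### {escape_cell(rec.get('title', ''))}",
            "- TODO: claim",
            "- TODO: evidence",
            "- TODO: test that could falsify it",
        ]
    return "\n".join(lines) + "\n"
```

The TODO slots follow the claim, evidence, and falsifying-test split from the output quality bar, so notes taken while reading land in a form that is ready for synthesis.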
References
Read these when you need query patterns or a report schema:
references/arxiv_query_guide.md
references/report_fields.md
Output quality bar (what “good” looks like)
- Prefer explicit assumptions + failure modes over broad claims.
- Prefer checklists and protocols over vague “future work”.
- Always separate: (1) claim, (2) evidence, (3) test that could falsify it.