pdf-text-extractor

Name: Pdf Text Extractor
Author: WILLOSCAR

Download PDFs (when available) and extract plain text to support full-text evidence, writing `papers/fulltext_index.jsonl` and `papers/fulltext/*.txt`. **Trigger**: PDF download, fulltext, extract text, papers/pdfs, 全文抽取, 下载PDF. **Use when**: `queries.md` sets `evidence_mode: fulltext` (or you explicitly need full-text evidence) and you want stronger evidence for paper notes/claims. **Skip if**: `evidence_mode: abstract` (the default), or you do not want to download/extract (cost, permissions, time). **Network**: full-text downloads usually need network access (unless you manually provide cached PDFs under `papers/pdfs/`). **Guardrail**: cache downloads under `papers/pdfs/`; by default, do not overwrite existing extracted text (unless re-extraction is explicitly requested).

stars:449

forks:31

updated: May 30, 2026, 12:16

PDF Text Extractor

Optionally collect full-text snippets to deepen evidence beyond abstracts.

This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.

Inputs

papers/core_set.csv (expects paper_id, title, and ideally pdf_url/arxiv_id/url)
Optional: outline/mapping.tsv (to prioritize mapped papers)
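
A minimal sketch of reading the input table, assuming papers/core_set.csv is a plain comma-separated file with a header row. The helper name `load_core_set` and the choice to skip rows lacking `paper_id` or `title` are illustrative assumptions, not the script's confirmed behavior.

```python
import csv
import io


def load_core_set(csv_text):
    """Parse core_set.csv text into row dicts that have paper_id and title."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Assumption: rows missing either required column are skipped.
        if not row.get("paper_id") or not row.get("title"):
            continue
        rows.append(row)
    return rows


# Made-up sample row for illustration only.
sample = "paper_id,title,pdf_url,arxiv_id,url\nP0001,Example Paper,,2301.00001,\n"
papers = load_core_set(sample)
```

The optional columns (`pdf_url`, `arxiv_id`, `url`) come back as empty strings when blank, so later steps can test them for truthiness.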

Outputs

papers/fulltext_index.jsonl (one record per attempted paper)
Side artifacts:
- papers/pdfs/<paper_id>.pdf (cached downloads)
- papers/fulltext/<paper_id>.txt (extracted text)

Decision: evidence mode

queries.md can set evidence_mode: "abstract" | "fulltext".
- abstract (default template): do not download; write an index that clearly records skipping.
- fulltext: download PDFs (when possible) and extract text to papers/fulltext/.
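
One way the mode decision could be read from queries.md is sketched below. The function name `read_evidence_mode` and the regular expression are assumptions; the pattern simply accepts the key with or without a leading list dash and with or without quotes, and falls back to `abstract` when the key is absent.

```python
import re

# Assumption: the setting appears on its own line, e.g. `- evidence_mode: "fulltext"`.
_MODE_PATTERN = re.compile(
    r'^\s*-?\s*evidence_mode\s*:\s*"?(abstract|fulltext)"?\s*$',
    re.MULTILINE,
)


def read_evidence_mode(queries_md_text, default="abstract"):
    """Return the evidence_mode found in queries.md text, else the default."""
    match = _MODE_PATTERN.search(queries_md_text)
    return match.group(1) if match else default
```

Defaulting to `abstract` mirrors the conservative stance described above: nothing is downloaded unless full-text mode is requested explicitly.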

Local PDFs Mode

When you cannot/should not download PDFs (restricted network, rate limits, no permission), provide PDFs manually and run in “local PDFs only” mode.

PDF naming convention: papers/pdfs/<paper_id>.pdf where <paper_id> matches papers/core_set.csv.
Set - evidence_mode: "fulltext" in queries.md.
Run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only

If PDFs are missing, the script writes a to-do list:

output/MISSING_PDFS.md (human-readable summary)
papers/missing_pdfs.csv (machine-readable list)
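
The missing-PDF check can be pictured roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the single `paper_id` column is a guess at the CSV layout, which the source does not specify.

```python
import csv
from pathlib import Path


def find_missing_pdfs(paper_ids, pdf_dir):
    """Return the paper_ids that have no <paper_id>.pdf in pdf_dir."""
    pdf_dir = Path(pdf_dir)
    return [pid for pid in paper_ids if not (pdf_dir / f"{pid}.pdf").is_file()]


def write_missing_csv(missing_ids, out_path):
    """Write a machine-readable list of missing papers."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["paper_id"])  # assumption: the real file may have more columns
        writer.writerows([pid] for pid in missing_ids)
```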

Workflow (heuristic)

Read papers/core_set.csv.
If outline/mapping.tsv exists, prioritize mapped papers first.
For each selected paper (fulltext mode):
- resolve pdf_url (use pdf_url, else derive from arxiv_id/url when possible)
- download to papers/pdfs/<paper_id>.pdf if missing
- extract a prefix of the text (bounded by --max-pages) to papers/fulltext/<paper_id>.txt
- append/update a JSONL record in papers/fulltext_index.jsonl with status + stats
Never overwrite existing extracted text unless explicitly requested (delete the .txt to re-extract).
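
The URL-resolution step above could look like the sketch below. It assumes arXiv's public `https://arxiv.org/pdf/<id>` and `https://arxiv.org/abs/<id>` URL forms; the function name `resolve_pdf_url` and the exact fallback order are illustrative, and the actual script may derive URLs differently.

```python
import re

# Matches the identifier in an arXiv abstract-page URL.
_ARXIV_ABS = re.compile(r"arxiv\.org/abs/([^\s?#]+)")


def resolve_pdf_url(row):
    """Pick a PDF URL for one core_set.csv row, or None if nothing is usable."""
    pdf_url = (row.get("pdf_url") or "").strip()
    if pdf_url:
        return pdf_url  # an explicit pdf_url wins
    arxiv_id = (row.get("arxiv_id") or "").strip()
    if arxiv_id:
        return f"https://arxiv.org/pdf/{arxiv_id}"
    match = _ARXIV_ABS.search(row.get("url") or "")
    if match:
        return f"https://arxiv.org/pdf/{match.group(1)}"
    return None  # caller records the paper as having no downloadable PDF
```

Returning `None` rather than raising keeps the per-paper loop going, so one unresolvable row does not stop the run.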

Quality checklist

papers/fulltext_index.jsonl exists and is non-empty.
If evidence_mode: "fulltext": at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
If evidence_mode: "abstract": the index records clearly reflect skip status (no downloads attempted).
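
A rough way to check the coverage item by hand is sketched here. The `status` field and its `"ok"` value are hypothetical: the source says only that each record carries "status + stats", so adjust the names to whatever the index actually contains.

```python
import json


def extraction_coverage(index_lines, ok_status="ok"):
    """Fraction of index records whose status equals ok_status (0.0 if empty)."""
    records = [json.loads(line) for line in index_lines if line.strip()]
    if not records:
        return 0.0
    ok_count = sum(1 for rec in records if rec.get("status") == ok_status)
    return ok_count / len(records)


# Made-up records for illustration only.
example_lines = [
    '{"paper_id": "P0001", "status": "ok"}',
    '{"paper_id": "P0002", "status": "skipped"}',
]
coverage = extraction_coverage(example_lines)
```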

Script

Quick Start

python .codex/skills/pdf-text-extractor/scripts/run.py --help
python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>

All Options

--max-papers <n>: cap number of papers processed (can be overridden by queries.md)
--max-pages <n>: extract at most N pages per PDF
--min-chars <n>: minimum extracted chars to count as OK
--sleep <sec>: delay between downloads
--local-pdfs-only: do not download; only use papers/pdfs/<paper_id>.pdf if present
queries.md supports: evidence_mode, fulltext_max_papers, fulltext_max_pages, fulltext_min_chars
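
For orientation, the flags listed above map onto an `argparse` definition along these lines. This is not the script's source: the argument types, the `None` defaults, and treating `--workspace` as required are all assumptions.

```python
import argparse


def build_parser():
    """Sketch of the documented CLI surface; types and defaults are assumed."""
    parser = argparse.ArgumentParser(description="pdf-text-extractor (sketch)")
    parser.add_argument("--workspace", required=True)
    parser.add_argument("--max-papers", type=int, default=None)
    parser.add_argument("--max-pages", type=int, default=None)
    parser.add_argument("--min-chars", type=int, default=None)
    parser.add_argument("--sleep", type=float, default=None)
    parser.add_argument("--local-pdfs-only", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--workspace", "ws", "--max-papers", "20", "--max-pages", "4", "--min-chars", "1200"]
)
```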

Examples

Abstract mode (no downloads):
- Set - evidence_mode: "abstract" in queries.md, then run the script (it will emit papers/fulltext_index.jsonl with skip statuses)
Fulltext mode with local PDFs only:
- Set - evidence_mode: "fulltext" in queries.md, put PDFs under papers/pdfs/, then run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only
Fulltext mode with smaller budget:
- python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200

Notes

Downloads are cached under papers/pdfs/; extracted text is cached under papers/fulltext/.
The script does not overwrite existing extracted text unless you delete the .txt file.
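
The no-overwrite rule amounts to a guard like the one below. The helper name `write_text_if_absent` is hypothetical, and the sketch shows only the caching behavior described in these notes.

```python
from pathlib import Path


def write_text_if_absent(txt_path, text):
    """Write extracted text only when no .txt exists yet; return True if written."""
    txt_path = Path(txt_path)
    if txt_path.exists():
        return False  # keep the cached extraction; delete the file to re-extract
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(text, encoding="utf-8")
    return True
```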

Troubleshooting

Issue: no PDFs are available to download

Fix:

Use evidence_mode: abstract (default) or provide local PDFs under papers/pdfs/ and rerun with --local-pdfs-only.

Issue: extracted text is empty/garbled

Fix:

Try a different extraction backend if supported; otherwise mark the paper as abstract evidence level and avoid strong fulltext claims.
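
If you want a quick sanity check before trusting an extraction, a heuristic such as the following may help. It is an assumption-laden sketch: the printable-character ratio and its 0.9 threshold are invented for illustration, and only the minimum-length idea corresponds to the documented --min-chars option.

```python
def looks_usable(text, min_chars=1200, min_printable_ratio=0.9):
    """Heuristic: long enough and mostly printable characters."""
    stripped = text.strip()
    if not stripped or len(stripped) < min_chars:
        return False
    readable = sum(1 for ch in stripped if ch.isprintable() or ch in "\n\t")
    return readable / len(stripped) >= min_printable_ratio
```

Text that fails the check is a candidate for falling back to abstract-level evidence, as the fix above suggests.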

related-skills.json

Same repository

agent-survey-corpus.md

from "WILLOSCAR/research-units-pipeline-skills"

Download a small corpus of open-access arXiv survey/review PDFs about agentic systems and extract text for style learning. **Trigger**: agent survey corpus, ref corpus, download surveys, 学习综述写法, 下载 survey. **Use when**: you want to study how real agent surveys structure sections (6–8 H2), size subsections, and write evidence-backed comparisons. **Skip if**: you cannot download PDFs (no network) or you don't want local PDF files. **Network**: required. **Guardrail**: only download arXiv PDFs; store under `ref/` and keep large files out of git.

2026-05-30, stars: 449

global-reviewer.md

from "WILLOSCAR/research-units-pipeline-skills"

Global consistency review for survey drafts: terminology, cross-section coherence, and scope/citation hygiene. Writes `output/GLOBAL_REVIEW.md` and (optionally) applies safe edits to `output/DRAFT.md`. **Trigger**: global review, consistency check, coherence audit, 术语一致性, 全局回看, 章节呼应, 拷打 writer. **Use when**: Draft exists and you want a final evidence-first coherence pass before LaTeX/PDF. **Skip if**: You are still changing the outline/mapping/notes (do those first), or prose writing is not approved. **Network**: none. **Guardrail**: Do not invent facts or citations; do not add new citation keys; treat missing evidence as a failure signal.

2026-05-30, stars: 449

literature-engineer.md

from "WILLOSCAR/research-units-pipeline-skills"

Multi-route literature expansion + metadata normalization for evidence-first surveys. Produces a large candidate pool (`papers/papers_raw.jsonl`, target ≥1200) with stable IDs and provenance, ready for dedupe/rank + citation generation. **Trigger**: evidence collector, literature engineer, 文献扩充, 多路召回, snowballing, cited by, references, 元信息增强, provenance. **Use when**: you need to expand the candidate pool to ≥1200 papers and fill in traceable metadata (Stage C1 of the survey pipeline; evidence groundwork before writing). **Skip if**: you already have a high-quality `papers/papers_raw.jsonl` (≥1200 records, each with a stable identifier and a provenance record). **Network**: can run offline (via imports); snowballing and online retrieval need network access. **Guardrail**: do not fabricate papers; every record must carry a stable identifier (arXiv id / DOI / trusted URL) and provenance; do not write output/ prose.

2026-05-30, stars: 449

prose-writer.md

from "WILLOSCAR/research-units-pipeline-skills"

Write `output/DRAFT.md` (or `output/SNAPSHOT.md`) from an approved outline and evidence packs, using only verified citation keys from `citations/ref.bib`. **Trigger**: write draft, prose writer, snapshot, survey writing, 写综述, 生成草稿, section-by-section drafting. **Use when**: structure is approved (`DECISIONS.md` has `Approve C2`) and evidence packs exist (`outline/subsection_briefs.jsonl`, `outline/evidence_drafts.jsonl`). **Skip if**: approvals are missing, or evidence packs are incomplete / scaffolded (missing-fields, TODO markers). **Network**: none. **Guardrail**: do not invent facts or citations; only cite keys present in `citations/ref.bib`; avoid pipeline-jargon leakage in final prose.

2026-05-30, stars: 449

schema-normalizer.md

from "WILLOSCAR/research-units-pipeline-skills"

Normalize cross-skill JSONL interfaces (ids + titles + citation key formats) so downstream skills do not rely on best-effort joins. **Trigger**: schema normalize, jsonl contract, interface drift, join drift, 字段不一致, schema 规范化. **Use when**: you have generated C2-C4 JSONL artifacts (outline/briefs/bindings/packs/anchors) and want deterministic, stable fields before self-loops/writing. **Skip if**: you are not using the survey pipelines, or the workspace already has a fresh PASS `output/SCHEMA_NORMALIZATION_REPORT.md` for the current artifacts. **Network**: none. **Guardrail**: NO PROSE; deterministic transforms only; do not invent evidence/claims; only fill missing ids/titles from `outline/outline.yml`.

2026-05-30, stars: 449

writer-selfloop.md

from "WILLOSCAR/research-units-pipeline-skills"

Writing self-loop for surveys: run the strict section-quality gate, then rewrite only the failing `sections/*.md` files until the report is PASS. **Trigger**: writer self-loop, writing loop, quality gate loop, rewrite failing sections, 自循环, 反复改到 PASS. **Use when**: per-section files exist but C5 is FAIL/BLOCKED (thin sections, missing leads/front matter, citation-scope violations, generator voice). **Skip if**: you are still pre-C2 (NO PROSE), or evidence packs are incomplete (fix C3/C4 first). **Network**: none. **Guardrail**: do not invent facts; only use citation keys present in `citations/ref.bib`; keep citations in-scope per `outline/evidence_bindings.jsonl`; do not add/remove citation keys during rewrites.

2026-05-30, stars: 449

package.json

"author": "WILLOSCAR"

"repository": "WILLOSCAR/research-units-pipeline-skills"
