dedupe-rank

Stars: 472

Forks: 36

Updated: April 14, 2026 at 11:50

Use when a broad paper candidate pool needs deterministic deduplication and a stable core set. **Trigger**: dedupe, rank, core set, 去重, 排序, 精选论文, 核心集合. **Use when**: after retrieval, a broad-coverage candidate set needs to be narrowed to a manageable core set (for taxonomy/outline/mapping). **Skip if**: a stable `papers/core_set.csv` has already been curated by hand (no need to churn it again). **Network**: none. **Guardrail**: favors deterministic behavior; output should be reproducible (stable paper_id, normalized fields).

Source

WILLOSCAR

WILLOSCAR/research-units-pipeline-skills

SKILL.md

name

dedupe-rank

description

Dedupe + Rank

Turns a raw candidate pool into a deduped pool and a stable core set.

Input

papers/papers_raw.jsonl

Outputs

papers/papers_dedup.jsonl
papers/core_set.csv

Script boundary

scripts/run.py should own only:

title/year deduplication
deterministic ranking
stable paper_id generation
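
The three responsibilities listed above can be sketched as follows. This is a minimal illustration, not the contents of `scripts/run.py`: the helper names, the normalization rules, the `P` + SHA-1 prefix ID scheme, and the newest-first ranking key are all assumptions, and `year` is assumed to be an integer or missing.

```python
import hashlib
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def stable_paper_id(title: str, year) -> str:
    """Derive the ID from content only, so reruns yield the same value.

    The "P" + 10 hex characters format is a hypothetical scheme.
    """
    key = f"{normalize_title(title)}|{year}"
    return "P" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:10]


def dedupe(records: list) -> list:
    """Keep the first record seen for each (normalized title, year) pair."""
    seen = {}
    for rec in records:
        title, year = rec.get("title", ""), rec.get("year")
        key = (normalize_title(title), year)
        if key not in seen:
            seen[key] = {**rec, "paper_id": stable_paper_id(title, year)}
    return list(seen.values())


def rank(records: list) -> list:
    """Sort newest first (assumed criterion); paper_id breaks ties.

    The tie-breaker makes the order total, so it does not depend on
    the order in which records arrived.
    """
    return sorted(records, key=lambda r: (-(r.get("year") or 0), r["paper_id"]))
```

Hashing the normalized title and year, rather than using a running counter, keeps each `paper_id` independent of input order, which is one way to meet the stability requirement.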

Use shared domain packs or pipeline contract metadata for topic-specific or product-specific behavior.

Contract-driven behavior

The script should prefer pipeline contract metadata over profile-name branching.

Current important field:

quality_contract.candidate_pool_policy.keep_full_deduped_pool

If true, the script keeps the full deduped pool in `papers/core_set.csv` unless the user explicitly overrides the core size.
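
As a rough sketch of this rule, the core-set size could be resolved as below. Only the field path `quality_contract.candidate_pool_policy.keep_full_deduped_pool` comes from the skill description; the function name, the `override` parameter, and the fallback default of 300 are hypothetical, and how the contract is loaded is not specified here.

```python
def resolve_core_size(contract: dict, pool_size: int,
                      override=None, default_core_size: int = 300) -> int:
    """Decide how many ranked papers go into papers/core_set.csv.

    `default_core_size` is a hypothetical fallback, not a documented value.
    """
    policy = (contract.get("quality_contract") or {}).get("candidate_pool_policy") or {}

    # An explicit user override always wins.
    if override is not None:
        return min(override, pool_size)

    # Contract metadata decides next; no branching on profile names.
    if policy.get("keep_full_deduped_pool") is True:
        return pool_size

    return min(default_core_size, pool_size)
```

For example, with `keep_full_deduped_pool` set to true and a deduped pool of 1200 papers, the function returns 1200 unless an override such as `override=50` is passed.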

Acceptance

deduped JSONL exists
core-set CSV exists
reruns are stable for the same inputs
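
One way to check these acceptance criteria is to hash both outputs after each run and compare the digests. The sketch below assumes byte-identical files are the intended meaning of "stable"; the function names and the `workspace` argument are hypothetical.

```python
import hashlib
from pathlib import Path

# Output paths as listed in the skill description.
OUTPUTS = ("papers/papers_dedup.jsonl", "papers/core_set.csv")


def snapshot(workspace: Path) -> dict:
    """Map each expected output to its SHA-256 digest; fail if one is missing."""
    digests = {}
    for rel in OUTPUTS:
        path = workspace / rel
        if not path.is_file():
            raise FileNotFoundError(f"missing output: {rel}")
        digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def is_stable(before: dict, after: dict) -> bool:
    """True when two runs produced byte-identical outputs."""
    return before == after
```

Taking a snapshot, rerunning the script on the same `papers/papers_raw.jsonl`, and taking a second snapshot gives a direct pass/fail signal for all three criteria.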

Non-goals

retrieval
screening
manual topic authoring inside the script

More from this repository

same repository

agent-survey-corpus

WILLOSCAR/research-units-pipeline-skills

Download a small corpus of open-access arXiv survey/review PDFs about agentic systems and extract text for style learning. **Trigger**: agent survey corpus, ref corpus, download surveys, 学习综述写法, 下载 survey. **Use when**: you want to study how real agent surveys structure sections (6–8 H2), size subsections, and write evidence-backed comparisons. **Skip if**: you cannot download PDFs (no network) or you don't want local PDF files. **Network**: required. **Guardrail**: only download arXiv PDFs; store under `ref/` and keep large files out of git.

2026-05-30 · 472

global-reviewer

WILLOSCAR/research-units-pipeline-skills

Global consistency review for survey drafts: terminology, cross-section coherence, and scope/citation hygiene. Writes `output/GLOBAL_REVIEW.md` and (optionally) applies safe edits to `output/DRAFT.md`. **Trigger**: global review, consistency check, coherence audit, 术语一致性, 全局回看, 章节呼应, 拷打 writer. **Use when**: Draft exists and you want a final evidence-first coherence pass before LaTeX/PDF. **Skip if**: You are still changing the outline/mapping/notes (do those first), or prose writing is not approved. **Network**: none. **Guardrail**: Do not invent facts or citations; do not add new citation keys; treat missing evidence as a failure signal.

2026-05-30 · 472

literature-engineer

WILLOSCAR/research-units-pipeline-skills

Multi-route literature expansion + metadata normalization for evidence-first surveys. Produces a large candidate pool (`papers/papers_raw.jsonl`, target ≥1200) with stable IDs and provenance, ready for dedupe/rank + citation generation. **Trigger**: evidence collector, literature engineer, 文献扩充, 多路召回, snowballing, cited by, references, 元信息增强, provenance. **Use when**: the candidate literature needs to be expanded to ≥1200 papers with traceable metadata filled in (Stage C1 of the survey pipeline, the evidence step that precedes writing). **Skip if**: a high-quality `papers/papers_raw.jsonl` already exists (≥1200 records, each with a stable identifier and a provenance record). **Network**: can run offline (via imports); snowballing and online retrieval need network access. **Guardrail**: never fabricate papers; every record must carry a stable identifier (arXiv id / DOI / trusted URL) and provenance; do not write output/ prose.

2026-05-30 · 472

pdf-text-extractor

WILLOSCAR/research-units-pipeline-skills

Download PDFs (when available) and extract plain text to support full-text evidence, writing `papers/fulltext_index.jsonl` and `papers/fulltext/*.txt`. **Trigger**: PDF download, fulltext, extract text, papers/pdfs, 全文抽取, 下载PDF. **Use when**: `queries.md` sets `evidence_mode: fulltext` (or you explicitly need full-text evidence) and you want stronger evidence for paper notes/claims. **Skip if**: `evidence_mode: abstract` (the default), or you do not want to download/extract (cost, permissions, time). **Network**: full-text download usually needs network access (unless you manually provide a PDF cache in `papers/pdfs/`). **Guardrail**: cache downloads in `papers/pdfs/`; by default, do not overwrite existing extracted text (unless re-extraction is explicitly requested).

2026-05-30 · 472

prose-writer

WILLOSCAR/research-units-pipeline-skills

Write `output/DRAFT.md` (or `output/SNAPSHOT.md`) from an approved outline and evidence packs, using only verified citation keys from `citations/ref.bib`. **Trigger**: write draft, prose writer, snapshot, survey writing, 写综述, 生成草稿, section-by-section drafting. **Use when**: structure is approved (`DECISIONS.md` has `Approve C2`) and evidence packs exist (`outline/subsection_briefs.jsonl`, `outline/evidence_drafts.jsonl`). **Skip if**: approvals are missing, or evidence packs are incomplete / scaffolded (missing-fields, TODO markers). **Network**: none. **Guardrail**: do not invent facts or citations; only cite keys present in `citations/ref.bib`; avoid pipeline-jargon leakage in final prose.

2026-05-30 · 472

schema-normalizer

WILLOSCAR/research-units-pipeline-skills

Normalize cross-skill JSONL interfaces (ids + titles + citation key formats) so downstream skills do not rely on best-effort joins. **Trigger**: schema normalize, jsonl contract, interface drift, join drift, 字段不一致, schema 规范化. **Use when**: you have generated C2-C4 JSONL artifacts (outline/briefs/bindings/packs/anchors) and want deterministic, stable fields before self-loops/writing. **Skip if**: you are not using the survey pipelines, or the workspace already has a fresh PASS `output/SCHEMA_NORMALIZATION_REPORT.md` for the current artifacts. **Network**: none. **Guardrail**: NO PROSE; deterministic transforms only; do not invent evidence/claims; only fill missing ids/titles from `outline/outline.yml`.

2026-05-30 · 472
