| name | research-intake |
| version | 0.2.0 |
| description | Source-governed research link intake workflow. Use when asked to find papers, repos, datasets, docs, standards, benchmarks, source surfaces, or trusted links; collect research links; search literature; update a research base; run research intake; produce links.md; or prepare auditable JSONL evidence. The skill asks for a topic ID when missing, creates or reviews topic configs with user approval, discovers trusted source roots, writes URL-bearing query files, runs fetch/dedupe/check/finalize, and produces links.md plus run-local accepted.jsonl evidence. It does not write reports or syntheses. |
| allowed-tools | ["Bash","Read","Write","WebSearch","WebFetch","AskUserQuestion"] |
| triggers | ["find papers for","search literature","collect links","research intake","intake research","update research base"] |
Research Intake Skill
Research Intake is not a report writer. It is the intake layer that creates a
small, auditable, topic-scoped set of accepted links. The CLI is the deterministic
I/O layer. The model is responsible for topic judgment, source trust, query
planning, semantic filtering, and concise user gates.
The final user-visible deliverable is normally:
topics/<topic-id>/captures/links.md
The structured evidence for a run is:
runs/<run-id>/accepted.jsonl
Runtime Bootstrap
Use this command:
RI_CMD="__RESEARCH_INTAKE_COMMAND__"
RI_ROOT="${RESEARCH_INTAKE_ROOT:-.research-intake}"
All CLI calls should include --root "$RI_ROOT".
If the command is not installed and the current working directory is the project
checkout, use:
RI_CMD="uv run research-intake"
Respect an explicit user-provided root. If the user gives no root, use
RESEARCH_INTAKE_ROOT when set, otherwise .research-intake in the current
working directory.
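The root precedence above can be sketched in Python as follows. This is a minimal illustration, not part of the CLI: the helper name `resolve_root` is hypothetical, and an empty RESEARCH_INTAKE_ROOT is treated as unset, mirroring the shell `${VAR:-default}` form used above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_root(explicit: Optional[str] = None,
                 env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the intake root: explicit value, then RESEARCH_INTAKE_ROOT, then the default."""
    env = os.environ if env is None else env
    if explicit:
        # An explicit user-provided root always wins.
        return Path(explicit)
    from_env = env.get("RESEARCH_INTAKE_ROOT")
    if from_env:
        return Path(from_env)
    # Fallback: .research-intake in the current working directory.
    return Path(".research-intake")
```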
Before a non-trivial run, check the command and root shape:
$RI_CMD doctor --root "$RI_ROOT"
If doctor fails because the root does not exist, initialize only when the user
has asked to start or update an intake base:
$RI_CMD init --root "$RI_ROOT"
Artifact Contract
Root layout:
<root>/
initial_sources.jsonl
topics/<topic-id>/
config.yaml
sources.jsonl
progress.md
captures/links.md
tasks/
runs/<run-id>/
expanded_queries.jsonl
raw_results.jsonl
candidates.jsonl
rejected.jsonl    # optional, model-written
categories.json   # optional, model-written
keep.json         # model-written before finalize
accepted.jsonl    # written by finalize
The following schema blocks are illustrative only. They show required shape and
field meaning, not fixed values. Always generate topic IDs, labels, ideas,
keywords, source IDs, URLs, resource titles, and dates from the user's actual
topic and verified sources.
Topic config schema example:
topic_id: <chosen-topic-id>
label: <Human Readable Topic Label>
description: <one-sentence topic scope>
include:
- idea: <included concept or source-governed research need>
keywords:
- <keyword>
- <related keyword>
description: <why this idea belongs in scope>
objects:
- <target artifact, method, dataset, benchmark, repo, paper, or docs type>
exclude:
- idea: <excluded concept>
description: <what to reject>
reason: <why it is outside the intake scope>
budget:
max_queries_per_source: 5
max_results_per_query: 10
Source record schema example, one JSON object per line:
{"source_id":"<source_slug>","url":"https://<trusted-root-or-collection>/"}
Optional inactive/candidate source example:
{"source_id":"<source_slug>","url":"https://<trusted-root-or-collection>/","status":"candidate"}
Rules for sources:
- source_id is a short stable slug, preferably lowercase with underscores.
- url is a trusted root, collection, docs home, org page, index, or source
surface.
- A specific result page is not a source. It belongs in expanded_queries.jsonl
as a URL to fetch.
- Sources with missing status are active. Sources with status other than
active are not fetched.
- Keep sources.jsonl compact: one current row per source_id. Replace a row
when updating it, keeping duplicate source events out of the file.
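The compaction and status rules above might be checked with a sketch like the one below. The helper names `compact_sources` and `is_active` are hypothetical, and the sketch assumes that a later line in the file is the newer row for a given source_id.

```python
import json
from typing import Dict, Iterable, List


def compact_sources(lines: Iterable[str]) -> List[dict]:
    """Keep one row per source_id; a later row replaces an earlier one (assumption)."""
    latest: Dict[str, dict] = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        row = json.loads(line)
        latest[row["source_id"]] = row
    return list(latest.values())


def is_active(row: dict) -> bool:
    """A missing status means active; any other status is not fetched."""
    return row.get("status", "active") == "active"
```

Because the result keeps the position of each source's first appearance, rewriting sources.jsonl from it leaves unrelated rows in their original order.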
Resource record schema used in raw_results.jsonl, candidates.jsonl, and
accepted.jsonl:
{
"resource_id": "url:<hash16>",
"source": "<source_slug>",
"title": "<verified resource title>",
"summary": "Short metadata description if available.",
"link": "https://<verified-result-url>",
"domain": "<verified-domain>",
"year": 0,
"query": "https://<verified-result-url>",
"fetched_at": "<UTC timestamp>"
}
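A small check can help keep resource records within the documented fields. The sketch below only flags keys outside the schema shown above; it does not decide which fields are required, since the schema example does not say. The names `RESOURCE_FIELDS` and `unexpected_fields` are hypothetical.

```python
# Field names copied from the resource record schema example above.
RESOURCE_FIELDS = {
    "resource_id", "source", "title", "summary", "link",
    "domain", "year", "query", "fetched_at",
}


def unexpected_fields(record: dict) -> list:
    """Return the keys in a record that are not part of the documented schema."""
    return sorted(set(record) - RESOURCE_FIELDS)
```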
Decision Gates
Ask the user only at real gates. Batch choices into one question.
Mandatory gates:
- Topic ID when missing.
- Scope for a new topic.
- Approval before changing config.yaml.
- Approval before adding or changing source records.
- Hard-cap recovery when more than 256 candidates remain.
- Candidate group keep/reject/inspect decision.
If an AskUserQuestion tool is unavailable, ask one concise chat question and
wait. Use group-level questions unless the user explicitly requests
candidate-by-candidate review.
After a gate is answered, continue the workflow. Source discovery and config
review are not final deliverables.
Phase 0 - Understand The Request
Classify the user's request:
- New intake: collect links for a topic that is not yet configured.
- Existing intake: update or rerun a known topic.
- Source maintenance: add, remove, or clean source roots.
- Review-only: inspect existing candidates or links without running fetch.
If the user asks for a literature review, synthesis, report, or summary, explain
that this skill only performs link intake. Offer to produce links.md first.
Choose a run ID when one is not provided:
YYYY-MM-DD-<topic-id>-v1
If that directory exists, increment the suffix: v2, v3, and so on. Preserve
existing run directories.
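The run ID rule can be sketched as below. The function name `next_run_id` is hypothetical; the sketch only inspects `runs/` under the root and never modifies or removes an existing run directory.

```python
from datetime import date
from pathlib import Path


def next_run_id(root: Path, topic_id: str, today: date) -> str:
    """Return the first YYYY-MM-DD-<topic-id>-vN whose run directory does not exist."""
    version = 1
    while True:
        run_id = f"{today.isoformat()}-{topic_id}-v{version}"
        if not (root / "runs" / run_id).exists():
            return run_id
        version += 1  # directory exists: try v2, v3, and so on
```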
Phase 1 - Topic Gate
If no topic ID is explicit, stop and ask:
Which topic ID should I use? Use a short stable slug, for example `<topic-id>`.
Ask for the topic ID instead of inferring it from nearby files or previous
context.
Check whether the topic exists:
test -f "$RI_ROOT/topics/<topic-id>/config.yaml"
New Topic
If the topic does not exist:
- Ask for the scope if the user has not supplied it.
- Create the skeleton:
$RI_CMD new-topic <topic-id> --root "$RI_ROOT" --description "<approved scope>"
- Read the generated config.yaml.
- Draft a richer config update with label, description, include, exclude, and
budget.
- Show the proposed config or a compact before/after diff.
- Ask one approval question:
Apply this topic config? Options: apply, edit first, leave default.
- Write config.yaml only after approval.
- Continue to source discovery.
Existing Topic
For an existing topic, read:
topics/<topic-id>/config.yaml
topics/<topic-id>/sources.jsonl
topics/<topic-id>/progress.md if present
topics/<topic-id>/captures/links.md if present
Run:
$RI_CMD plan --root "$RI_ROOT" --topic <topic-id>
Review the config. The authority order is:
description > include idea > include keywords > include objects > exclude keywords
If the config is vague, contradictory, too broad, or missing exclusions, propose
specific edits and wait for approval before writing them.
Good config review output is concrete:
I would change:
1. Add include idea "<missing in-scope concept>" because the description explicitly mentions it.
2. Add exclude idea "<out-of-scope result type>" because those pages are not reusable source-governed evidence.
3. Reduce max_results_per_query from <old value> to <new value> to keep review under the cap.
Then ask one approval question for the whole set.
Phase 2 - Source Discovery And Maintenance
This phase is mandatory before fetch unless the user explicitly says to use the
existing sources only.
Goal: identify trusted source roots and concrete result URLs. Keep those two
concepts separate.
Examples:
Good source root: https://<trusted-domain>/<collection-root>/
Good result URL: https://<trusted-domain>/<specific-result-path>
Bad source root: https://<trusted-domain>/<specific-result-path>
Use web search from the topic config. Build 3-8 probes from:
- Topic description.
- Include ideas.
- Include keywords.
- Known official organizations, libraries, standards, datasets, or benchmarks.
- Exclude ideas, to avoid ambiguous wording.
Probe patterns:
<core idea> official docs
<core idea> <code or artifact host>
<core idea> dataset
<core idea> benchmark
<core idea> standard
<core idea> <topic-specific artifact type>
site:<known-source-domain> <core idea>
site:<candidate-domain> <core idea>
For each promising result, record:
- Source root URL.
- Concrete result URLs worth fetching.
- Why the source is trusted.
- Whether it should be active now or only remembered as candidate.
Trust levels:
high: official project/org docs, canonical repository/org, standards body,
maintained dataset/benchmark index, conference/journal/index page.
medium: respected lab, project page, package index, curated list with clear
maintenance.
reject: mirrors, SEO pages, scraped copies, generic blog spam, unrelated
result pages.
Before editing sources.jsonl, show a compact table:
| # | source_id | root URL | status | trust | concrete URLs found |
| 1 | <source_slug> | https://<trusted-domain>/<collection-root>/ | active | high | <count> |
Ask one question:
Which source updates should I apply? Options: add all high-trust, add selected numbers, inspect selected numbers, skip source updates.
When approved, update topics/<topic-id>/sources.jsonl as compact JSONL:
{"source_id":"<source_slug>","url":"https://<trusted-domain>/<collection-root>/"}
{"source_id":"<another_source_slug>","url":"https://<another-trusted-domain>/<collection-root>/"}
{"source_id":"<candidate_source_slug>","url":"https://<candidate-domain>/<collection-root>/","status":"candidate"}
Keep one row per source ID. Preserve unrelated source rows. If changing a URL or
status, replace that source's row.
If no useful new sources are found, say so in one sentence and continue with the
existing active sources.
Phase 3 - Query Planning
The fetcher is generic URL metadata fetch. It does not perform general web
search. Therefore query lines should usually contain verified concrete URLs on
the source domain.
For each active source:
- Use the discovered result URLs from Phase 2.
- Use links already present in links.md only to avoid duplicates, not to
refetch them as new candidates.
- Use WebSearch/WebFetch to find additional concrete URLs if the source root is
too broad.
- Respect budget.max_queries_per_source.
- Set budget_max_results to budget.max_results_per_query.
Write runs/<run-id>/expanded_queries.jsonl with one JSON object per line:
{"source_id":"<source_slug>","query":"https://<trusted-domain>/<specific-result-path>","budget_max_results":<max_results_per_query>,"topic_id":"<topic-id>"}
{"source_id":"<another_source_slug>","query":"https://<another-trusted-domain>/<specific-result-path>","budget_max_results":<max_results_per_query>,"topic_id":"<topic-id>"}
Multiple URLs may be placed in a query string when they belong to the same
source domain, but keep lines readable. Prefer one URL per line when review
clarity matters.
Avoid empty or purely conceptual query lines such as:
{"source_id":"<source_slug>","query":"<purely conceptual search phrase>"}
With the generic fetcher, that line does not search the web or the source site.
It will not produce useful results unless the query includes concrete verified
URLs on the source domain.
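Before running fetch, each query line can be linted for the problem described above. The sketch below is an assumption-laden illustration: `query_problems` is a hypothetical helper, `source_domains` is a hypothetical mapping from source_id to the host of that source's root URL, and multiple URLs in one query string are assumed to be separated by whitespace.

```python
import json
from urllib.parse import urlparse


def query_problems(line: str, source_domains: dict) -> list:
    """Return human-readable problems for one expanded_queries.jsonl line."""
    row = json.loads(line)
    problems = []
    source_id = row.get("source_id")
    domain = source_domains.get(source_id)
    if domain is None:
        problems.append(f"unknown source_id: {source_id!r}")
    # Treat whitespace-separated http(s) tokens as the concrete URLs (assumption).
    urls = [tok for tok in str(row.get("query", "")).split()
            if urlparse(tok).scheme in ("http", "https") and urlparse(tok).netloc]
    if not urls:
        problems.append("query contains no concrete URL")
    for url in urls:
        host = urlparse(url).hostname or ""
        # Accept the source domain itself and its subdomains.
        if domain and host != domain and not host.endswith("." + domain):
            problems.append(f"{url} is not on {domain}")
    return problems
```

An empty list means the line at least carries a URL on the expected domain; it does not prove that the URL resolves.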
Phase 4 - Fetch, Dedupe, Check
Run fetch:
$RI_CMD fetch \
--root "$RI_ROOT" \
--topic <topic-id> \
--run <run-id> \
--queries "$RI_ROOT/runs/<run-id>/expanded_queries.jsonl"
Expected output file:
runs/<run-id>/raw_results.jsonl
Run dedupe:
$RI_CMD dedupe --root "$RI_ROOT" --run <run-id>
Expected output file:
runs/<run-id>/candidates.jsonl
Dedupe compares against earlier runs/*/accepted.jsonl files and against the
current run. It does not use reading-log.jsonl.
Run the cap check:
$RI_CMD check --root "$RI_ROOT" --run <run-id>
If the check exits with code 3, more than 256 candidates remain. Stop and ask:
More than 256 candidates remain. How should we narrow this? Options: add exclusions, lower budget, split topic, restrict years/sources.
Then apply the approved change and rerun fetch/dedupe/check or manually narrow
candidates.jsonl, depending on the user's choice.
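If the check is driven from a script, the exit code can be mapped to the next action as sketched below. Only exit code 3 is documented above; treating 0 as success and every other code as an error is an assumption, and `run_cap_check` and `classify_exit` are hypothetical helper names.

```python
import shlex
import subprocess


def classify_exit(code: int) -> str:
    """Map the check exit code to 'ok', 'over_cap', or 'error'."""
    if code == 0:
        return "ok"        # assumption: 0 means the cap check passed
    if code == 3:
        return "over_cap"  # documented: more than 256 candidates remain
    return "error"         # assumption: anything else is a failure to inspect


def run_cap_check(ri_cmd: str, root: str, run_id: str) -> str:
    """Run `<ri_cmd> check --root <root> --run <run_id>` and classify the result."""
    argv = shlex.split(ri_cmd) + ["check", "--root", root, "--run", run_id]
    result = subprocess.run(argv)
    return classify_exit(result.returncode)
```

On "over_cap", stop and ask the narrowing question rather than continuing to semantic filtering.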
If raw_results.jsonl is empty:
- Inspect expanded_queries.jsonl.
- Confirm each source_id is active in sources.jsonl.
- Confirm query URLs share the source domain.
- If needed, repair queries or sources and rerun fetch.
- Finalize an empty run only when the user explicitly asks.
Phase 5 - Semantic Filtering
Read every candidate's title, summary, link, domain, source, and
year. Apply the topic config semantically.
Reject a candidate when:
- It matches an exclude idea.
- It is a source root with no useful result-level content and the user wants
only result links.
- It is off-topic despite keyword overlap.
- It is a duplicate not caught by URL/title dedupe.
- The title is generic and the URL does not reveal a useful resource.
Keep a candidate when:
- It directly supports an include idea.
- It is a high-trust source result for the topic.
- It is a useful dataset, benchmark, docs page, repo, paper page, standard, or
artifact index relevant to the topic.
If you write rejected.jsonl, use this shape:
{"resource_id":"url:<hash16>","title":"<title>","link":"https://<verified-result-url>","reject_reason":"Excluded by <exclude idea>: <specific reason>."}
Keep raw_results.jsonl intact. If narrowing candidates before final review,
write the reduced list to candidates.jsonl.
Phase 6 - Candidate Group Review
Before asking the user, group candidates by meaningful review units. Prefer:
- Source plus domain.
- Semantic theme.
- Trust surface.
- Result purpose: docs, dataset, benchmark, repository, paper page, standard.
Group candidates from the available schema fields; kind is not part of the
schema.
Use a compact numbered digest in AskUserQuestion messages. Some clients flatten
Markdown tables into unreadable text.
Candidate groups:
1. <source or theme group> (<count> links)
Why keep: <short rationale>
Examples: <title>; <title>
2. <source or theme group> (<count> links)
Why keep: <short rationale>
Examples: <title>; <title>
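The mechanical part of this digest, grouping by source plus domain and listing example titles, can be sketched as below. The function name `group_digest` is hypothetical, and the "Why keep" rationale is deliberately left out because it is a model judgment, not something derivable from the record fields.

```python
from collections import defaultdict


def group_digest(candidates: list, max_examples: int = 2) -> str:
    """Render a numbered digest of candidates grouped by (source, domain)."""
    groups = defaultdict(list)
    for cand in candidates:
        groups[(cand["source"], cand["domain"])].append(cand)
    lines = ["Candidate groups:"]
    # Groups are numbered in order of first appearance in candidates.
    for number, ((source, domain), items) in enumerate(groups.items(), start=1):
        examples = "; ".join(item["title"] for item in items[:max_examples])
        lines.append(f"{number}. {source} / {domain} ({len(items)} links)")
        lines.append(f"   Examples: {examples}")
    return "\n".join(lines)
```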
Ask exactly one group-level question:
Which groups should I accept for finalization?
Options:
- Accept all listed groups
- Accept only the group numbers I enter
- Reject the group numbers I enter
- Show details for group numbers I enter
Use those exact option labels when AskUserQuestion supports selectable options.
For the three number-based options, tell the user to enter group numbers in the
freeform field, for example 1, 3, 5.
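A freeform answer such as 1, 3, 5 can be turned into group numbers with a sketch like this one. The helper name `parse_group_numbers` is hypothetical; rejecting out-of-range numbers instead of silently dropping them is a design choice of the sketch, not a documented behavior.

```python
import re


def parse_group_numbers(text: str, group_count: int) -> list:
    """Extract unique, sorted group numbers from freeform text such as '1, 3, 5'."""
    numbers = sorted({int(token) for token in re.findall(r"\d+", text)})
    invalid = [n for n in numbers if not 1 <= n <= group_count]
    if invalid:
        # Surface the problem so the user can be asked again.
        raise ValueError(f"group numbers out of range: {invalid}")
    return numbers
```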
If the user chooses Accept all listed groups, proceed to keep.json without
follow-up.
If the user chooses Show details for group numbers I enter, show only those
groups. For each inspected group, include:
- Resource number.
- Title.
- Year, or - when no year is available.
- Source.
- Link.
- One-line reason to keep or reject.
Then ask one follow-up for the inspected set. Keep review at group level unless
the user requests candidate-by-candidate review.
Phase 7 - Write keep.json
Write final accepted IDs to:
runs/<run-id>/keep.json
Shape:
{"keep":["url:<hash16>","title:<hash16>"]}
Use resource_id values from candidates.jsonl. URL strings are accepted by
the CLI as keep keys, but resource_id is preferred.
If all candidates are accepted, still write keep.json; it documents the review
decision and makes finalization explicit.
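Writing keep.json can be sketched as below. The function name `write_keep` is hypothetical; the sketch uses resource_id values only and refuses IDs that are not present in candidates.jsonl, which is a safety choice of the sketch and not a documented CLI requirement.

```python
import json
from pathlib import Path


def write_keep(run_dir: Path, accepted_ids: list) -> Path:
    """Write keep.json for a run after checking every ID against candidates.jsonl."""
    known = set()
    with (run_dir / "candidates.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                known.add(json.loads(line)["resource_id"])
    missing = [rid for rid in accepted_ids if rid not in known]
    if missing:
        raise ValueError(f"not in candidates.jsonl: {missing}")
    keep_path = run_dir / "keep.json"
    # Shape documented above: {"keep": [<resource_id>, ...]}
    keep_path.write_text(json.dumps({"keep": accepted_ids}) + "\n", encoding="utf-8")
    return keep_path
```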
Phase 8 - Finalize
Run:
$RI_CMD finalize --root "$RI_ROOT" --topic <topic-id> --run <run-id>
Expected outputs:
runs/<run-id>/accepted.jsonl
topics/<topic-id>/captures/links.md
topics/<topic-id>/progress.md
Finalization also removes stale topics/<topic-id>/captures/accepted.jsonl if
one exists.
After finalization:
- Read topics/<topic-id>/captures/links.md.
- Confirm the table has only Year, Title, Source, ID, and Link data plus the
row number.
- Confirm there is no topic-level captures/accepted.jsonl.
- Report the number of accepted links and the path to links.md.
Summarize the research content only when the user asks. The output is the link
set, not a literature review.
Phase 9 - Optional Dashboard
Only start the dashboard when the user asks to view or edit the intake in a
browser:
$RI_CMD serve --root "$RI_ROOT" --host 127.0.0.1 --port 8412 --no-open
If the port is occupied, choose another port. The dashboard reads links.md for
the main table and sources.jsonl for source names/URLs.
Source URL Rules
Use these examples when deciding whether a URL belongs in sources.jsonl or
expanded_queries.jsonl.
Good source roots:
https://<trusted-domain>/
https://<trusted-domain>/<collection-root>/
https://<trusted-domain>/<docs-root>/
https://<trusted-domain>/<dataset-or-benchmark-index>/
Good result URLs:
https://<trusted-domain>/<specific-paper-or-record>
https://<trusted-domain>/<specific-repository-or-artifact>
https://<trusted-domain>/<specific-doc-page>
Put result URLs in expanded_queries.jsonl so the fetcher can capture metadata;
reserve sources.jsonl for source roots.
File Writing Rules
When writing JSONL:
- One JSON object per line.
- No trailing commas.
- Preserve existing unrelated rows.
- Keep source files compact.
- Prefer ASCII unless the source title already contains non-ASCII text.
When writing YAML:
- Preserve topic_id.
- Keep include and exclude as lists.
- Keep budget.max_queries_per_source and budget.max_results_per_query as
positive integers.
When editing generated run files:
- Preserve run directories.
- Keep raw_results.jsonl as the fetch-produced evidence.
- It is acceptable to rewrite candidates.jsonl, rejected.jsonl,
categories.json, and keep.json as part of review.
Failure Recovery
Command not found:
Use `uv run research-intake` when inside the checkout, or ask the user to install the CLI.
Topic not found:
Ask whether to create it. Continue with the requested topic unless the user
chooses another topic.
Malformed config:
Show the parse error, propose a minimal YAML fix, ask before writing.
Duplicate sources:
Compact to one row per source_id, keeping the newest approved URL/status.
Specific page added as source:
Move the page URL to expanded_queries.jsonl and replace the source URL with the collection root.
Too many candidates:
Stop at the hard cap gate. Narrow before review.
No useful candidates:
Inspect source roots and query URLs. Use web search to find concrete URLs, then rerun.
Final Response Format
When the run completes, keep the final answer short:
Done. Final link page: topics/<topic-id>/captures/links.md
Accepted: <N> links
Evidence: runs/<run-id>/accepted.jsonl
Mention any unresolved issue only if it affects the link set.
Guardrails
- Ask for a topic ID when it is missing.
- Use only verified titles, URLs, resource IDs, summaries, and source endpoints.
- Keep source roots and result pages separate.
- Leave reading-log.jsonl out of this workflow.
- Keep accepted evidence run-scoped; finalization writes
runs/<run-id>/accepted.jsonl, not topic-level captures/accepted.jsonl.
- Keep resource records to the documented schema fields.
- Edit config.yaml only after approval.
- Add or update sources only after approval.
- After source discovery, continue to query planning unless blocked.
- Narrow candidate sets above 256 before review.
- Write reports or syntheses only as explicitly requested downstream artifacts.
- Preserve run evidence and existing run directories.
- Make links.md the primary artifact.