name: deep-research
description: Use when researching, investigating, or exploring a topic systematically with orthogonal multi-dimensional coverage and source-quality tiers. Trigger phrases include "research this deeply", "deep research on", "investigate this topic thoroughly", "explore this topic", "systematic research", "multi-dimensional research", "comprehensive research", "cover all angles of", "thorough research on", "deep dive into (research)", "exhaustive research". Spawns parallel agents across WHO/WHAT/HOW/WHERE/WHEN/WHY/LIMITS with risk-stratified spot-checking. Bounded by a user-controlled round budget with honest coverage reporting on what was and wasn't covered.
["Systematic multi-dimensional exploration of a topic","Gathering evidence across WHO/WHAT/HOW/WHERE/WHEN/WHY/LIMITS","Cold-start research with vocabulary bootstrap"]
not_for: ["Reviewing an existing artifact (use deep-qa)","Designing a system (use deep-design)","Planning implementation (use deep-plan)"]
[{"name":"deep-design","relation":"follow-up","note":"Design a system based on research findings"},{"name":"deep-qa","relation":"follow-up","note":"Audit research report for defects"},{"name":"deep-qa --mode proposal","relation":"alternative","note":"When evaluating a specific proposal rather than open exploration"}]
maturity: stable
Deep Research Skill
Systematically explore a topic using parallel agents across applicable orthogonal dimensions (WHO/WHAT/HOW/WHERE/WHEN/WHY/LIMITS). Unlike a quick research brief, this skill provides structured multi-dimensional coverage with source quality tiers and risk-stratified spot-checking. Coverage is bounded by a user-controlled round budget; the output honestly characterizes what was covered and what wasn't.
Execution Model
This skill inherits the four execution-model contracts (files-not-inline, state-before-agent-spawn, structured-output, independence-invariant) from _shared/execution-model-contracts.md. The shared file is authoritative; the elaborations below are the research-specific applications.
Subagent watchdog: every research-direction spawn (Scout, Researcher, Deep Dive) and every verification/summary spawn MUST be armed with a staleness monitor per _shared/subagent-watchdog.md. This is the skill most vulnerable to silent multi-hour hangs (the 18-hour-silent-death failure mode). Use Flavor A with the thresholds below — these are calibration defaults, override via state.json → watchdog_thresholds per tier. Defaults are tuned for WebSearch/WebFetch-bound work on median-latency networks:
| Tier | STALE (first warning) | HUNG (kill) | Notes |
|---|---|---|---|
| Scout | 10 min | 30 min | Short directions; should return quickly. |
| Researcher | 15 min | 45 min | Web fetches on slow sources legitimately sit quiet for a while. |
| Deep Dive | 15 min | 45 min | Same rationale as Researcher. |
Raise thresholds for topics with known-slow sources (e.g. heavily rate-limited APIs, archive.org lookups); lower for topics where every direction is a single-endpoint adapter call. Wrong thresholds produce false-positive kills or late hang detection — they do not change correctness, because watchdog-killed directions are reported as coverage-lost, not silently omitted. Never block on TaskOutput(block=true) without a watchdog armed against the spawn's output file. Contract inheritance: timed_out_heartbeat joins this skill's per-direction termination vocabulary; stalled_watchdog / hung_killed join directions.{id}.status.
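A minimal sketch of what a per-tier override might look like. The watchdog_thresholds key is named above; the per-tier field names (stale_min / hung_min) and the file-handling details are assumptions, not a fixed schema:

```python
import json

# Hypothetical per-tier override for state.json -> watchdog_thresholds.
thresholds = {
    "scout":      {"stale_min": 10, "hung_min": 30},
    "researcher": {"stale_min": 25, "hung_min": 75},  # raised for a slow-source topic
    "deep_dive":  {"stale_min": 25, "hung_min": 75},
}

with open("deep-research-state.json", "r+") as f:
    state = json.load(f)
    state["watchdog_thresholds"] = thresholds
    f.seek(0)
    json.dump(state, f, indent=2)
    f.truncate()
```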
Files not inline: seed, direction definitions, per-direction research outputs, verification inputs, and fact-check evidence all live under deep-research-{run_id}/. Seed and coordinator summary are short enough to fit inline but are still written to disk so resume can reconstruct state.
State before agent spawn: each research-direction spawn writes directions.{id}.status = "in_progress" and spawn_time_iso to state.json BEFORE the Agent tool call. Spawn failure records spawn_failed; resume re-reads state and replays.
Structured output: research directions emit a per-finding block (claim + source tier + URL + attribution) between STRUCTURED_OUTPUT_START/END markers. Unparseable output → the direction is treated as needs_retry (fail safe to the worst-case interpretation).
Independence invariant: the coordinator orchestrates expansion and gap detection; source-quality tiering and citation spot-checking are delegated to an independent fact-verification agent (see Phase 4). The coordinator never rates source quality itself.
Cross-dimensional coherence: after all research rounds complete, an independent integrator agent reads ALL findings simultaneously and annotates cross-dimensional relationships (contradictions, convergences, coverage gaps) per _shared/cross-finding-coherence.md (research variant). This is not synthesis — it is a dedicated audit pass that feeds into both fact verification and theme extraction.
Model Tier Strategy
Three tiers balance cost and quality. The coordinator (main session) always handles synthesis and gap detection — never delegated.
| Tier | Default model | Agent type | When used | Relative cost |
|---|---|---|---|---|
| Scout | fast tier (see glossary below) | general-purpose | all remaining directions (see selection rules below) | 1x (baseline) |
| Researcher | mid tier (see glossary below) | general-purpose | depth 0–1, priority=high/medium, all seed directions | ~6x Scout |
| Deep Dive | most capable (e.g. opus) | general-purpose + model override | ONLY for re-exploration (duplication=2) of directions with exhaustion_score ≤ 2 | ~60x Scout |
Model mapping is configurable — the tier names (Scout/Researcher/Deep Dive) are stable; the concrete models are not. Override via state.json → model_config.
Tier selection rules (applied at spawn time):
if re_exploration and exhaustion_score <= 2: → Deep Dive
elif direction.depth == 0: → Researcher
elif direction.depth == 1 and priority == "high": → Researcher
elif direction.depth == 1 and priority == "medium": → Scout
elif direction.depth >= 2: → Scout
else: → Scout
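The same rules as a runnable sketch, assuming a direction dict with depth, priority, and exhaustion_score keys (key names are illustrative):

```python
def select_tier(direction: dict, re_exploration: bool = False) -> str:
    """Apply the spawn-time tier selection rules above."""
    # Re-exploration of a barely-exhausted direction is the ONLY route to Deep Dive.
    if re_exploration and direction.get("exhaustion_score", 0) <= 2:
        return "deep_dive"
    if direction["depth"] == 0:
        return "researcher"
    if direction["depth"] == 1 and direction["priority"] == "high":
        return "researcher"
    if direction["depth"] == 1 and direction["priority"] == "medium":
        return "scout"
    return "scout"  # depth >= 2 and every remaining case falls to the cheapest tier
```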
Expected cost for a full run: ~300-500x Scout-tier cost (vs ~3400x with all-Deep-Dive). Actual dollar amounts depend on the model provider's current pricing.
Model-tier glossary
"Scout / Researcher / Deep Dive" are the abstract agent tiers this skill uses — they name a capability slot, not a specific Claude model. The tier-to-model binding is configurable per run and evolves as frontier models ship. Inline mentions of "Haiku agent" / "Sonnet agent" / "Opus agent" below refer to the default binding at time of writing and should be read as "whatever model is currently bound to that tier," not as fixed model IDs. If a future reader sees "Haiku" and the default fast-tier model is now a successor, mentally substitute — the algorithm doesn't care.
Default bindings (override via state.json → tier_bindings):
| Tier | Role | Default model (current) | Configurable via |
|---|---|---|---|
| Scout | Fast reconnaissance / cheap spawns / bootstrap helpers | fast tier (e.g. Haiku-class) | tier_bindings.scout |
| Researcher | Medium depth / per-direction investigation | mid tier (e.g. Sonnet-class) | tier_bindings.researcher |
| Deep Dive | Heavy lifting / complex synthesis / adversarial audit | deep tier (e.g. Opus-class) | tier_bindings.deep_dive |
Numeric budgets and thresholds below (search counts, watchdog minutes, per-round coverage gates) are defaults, not invariants — calibrate via state.json keys noted inline. They degrade gracefully: wrong numbers slow a run or produce wider caveats, they don't break correctness.
Workflow
Phase 0: Seed Validation (before anything else)
Step 0a — Safety check:
Screen seed for harmful/illegal research requests
If harmful: refuse immediately, do not expand
Step 0b — Ambiguity check:
Does this seed have multiple plausible interpretations?
If ambiguous: "I'm interpreting this as [X]. Alternatives: [Y, Z]. Proceed with X? [y/N/pick:Z]"
User must confirm before any directions are generated
Step 0c — Input validation:
If seed is not a researchable topic (single number, single proper noun without context): "Please provide more context: what aspect of [X] do you want researched?"
Step 0d — Batched questioning:
If both 0b (ambiguity) and 0c (under-specification) trigger — or any other clarifying question surfaces in Phase 0 — present ALL of them as a single numbered batch in one message. Never serially. The user answers once, then Phase 1 begins.
Step 0f — Language Locus Detection (v4 — before adapter selection):
Coordinator spawns 1 Haiku agent with this prompt:
Given the seed topic "{seed}", identify the 1–4 languages in which the AUTHORITATIVE primary literature on this topic is published. Consider:
- Academic: where does the leading research happen?
- Industry: where are the commercial leaders headquartered?
- Policy/legal: which jurisdictions write primary law on this?
- Community: where do practitioners discuss in their native language?
Output STRICT JSON:
{
"authoritative_languages": ["en", "zh", "ja"], // ISO 639-1
"rationale_per_language": {"zh": "AI chip industry dominated by Chinese fabs"},
"coverage_expectation": "en_dominant" | "bilingual" | "multilingual_required",
"confidence": "high" | "medium" | "low"
}
Never output only ["en"] without justification — an en-only answer is a red flag, not a safe default.
Coordinator writes result to state.json → language_locus. If coverage_expectation != "en_dominant" OR len(authoritative_languages) ≥ 2 OR confidence: low → cross-lingual retrieval fires in Phase 3.3. The Phase 1 pre-run scope declaration warns user: "This topic appears to span {N} languages ({list}); enable cross-lingual retrieval (adds ~1 Scout-tier agent per language)? [Y/n]".
--auto default: under --auto, cross-lingual retrieval is enabled automatically IF coverage_expectation == "multilingual_required" (explicit locus signal), skipped otherwise. This is the ONE Phase 0 gate --auto respects — skipping it silently would hide the language-gap caveat from autonomous runs. The decision + rationale is logged to state.json → auto_decisions.
Translation validation (v4.1): Every LLM-translated query is round-trip verified before firing. A Haiku agent translates query_en → query_xx, then independently back-translates query_xx → query_en_verify. The two English queries are compared with a coarse match: the translation passes only if the shared content-word count (stopwords excluded) reaches the configurable threshold of the original's content words (default 60%, lower for CJK/agglutinative languages; set via state.json → translation_match_threshold). A failing translation is marked translation_failed: true in xlang_queries/{lang}.json and that language's adapter round is skipped for this query. Three failed translations in the same language within a run auto-disable cross-lingual retrieval for that language and emit COVERAGE_CAVEAT_TRANSLATION_UNRELIABLE: {lang}. This prevents garbage queries from entering adapters and laundering translation errors into coordinator confidence.
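A minimal sketch of the coarse-match check, assuming a simple tokenizer and an illustrative stopword list; the threshold default mirrors translation_match_threshold above:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "for", "and", "or", "in", "on", "to", "is", "are", "with"}  # illustrative

def content_words(text: str) -> set:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def round_trip_ok(query_en: str, query_en_verify: str, threshold: float = 0.60) -> bool:
    """True when enough of the original content words survive the back-translation."""
    original = content_words(query_en)
    if not original:
        return False
    shared = original & content_words(query_en_verify)
    return len(shared) / len(original) >= threshold
```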
Step 0g — Topic Novelty Detection (v5): Index-based retrieval assumes agents have enough topic vocabulary to generate effective queries. For novel or cold-start topics, this assumption breaks: an agent that doesn't know the canonical terms searches for adjacent-but-wrong concepts and reports confidently-empty findings. v5 detects this up front.
Coordinator spawns 1 Haiku agent with this prompt:
Topic: "{seed}"
Task: list up to 5 specific sources you recognize from memory — paper titles + authors, repo names + orgs, book titles + authors, or canonical blog post titles + authors. Include approximate year. Do NOT WebSearch. Do NOT fabricate.
Output STRICT JSON:
{
"recalled_sources": [{"title": "...", "authors_or_org": "...", "year": 2022, "confidence": "high|medium|low"}, ...],
"topic_novelty": "familiar" | "emerging" | "novel" | "cold_start",
"rationale": "...",
"known_canonical_terms": ["term1", "term2", ...]
}
Classification:
- familiar = 5+ high/medium-confidence sources with specific years
- emerging = 2-4 sources, mostly recent, some uncertainty
- novel = only vague patterns, 0-1 specific sources
- cold_start = topic unrecognized → return empty recalled_sources: []
External source verification (v5.1 — NEW, the gate that makes v5 reliable):
LLM self-report of novelty is unreliable — models confabulate plausible-sounding sources and return familiar when they should return novel. Coordinator performs external verification BEFORE accepting the classification:
For each recalled_sources entry with confidence in {high, medium} (skip low), fire 1 WebSearch query: "{title}" {authors_or_org}. Limit: 5 queries total (budget: ~1 Scout-tier agent).
Count verification outcomes:
verified: WebSearch returns ≥1 result whose title is a clear match (shared-keyword overlap ≥ 60% AND author/org appears in result snippet OR URL)
unverified: no clear match (title drift, wrong author, no results)
Override rule (coordinator, not agent):
verified_count ≥ 3 of ≤5 recalled → accept self-report classification
(verified_count ≤ 1 OR recalled_sources was empty) AND self-report was familiar/emerging → force classify cold_start regardless of self-report; log classification_overridden: source_verification_failed, original_claim: {self_report}
Record both self_report_novelty and verified_novelty in state.json → topic_novelty. Subsequent phases use verified_novelty.
Emit per-source verification record to state.json → source_verification_log — required for audit trail.
This turns Phase 0g from a self-report coin-flip into an externally-grounded gate. A model that confabulates 5 plausible-sounding papers on a fabricated topic will see 0-1 verify, auto-reclassify cold_start, and correctly trigger Vocabulary Bootstrap. Cost: 5 search calls per run (~1 Scout-tier agent), negligible relative to the full-run cost.
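A sketch of the coordinator-side override rule, assuming each verification-log record carries a verified flag (field names are illustrative):

```python
def resolve_verified_novelty(self_report: str, recalled_sources: list, verification_log: list) -> str:
    """Apply the external-verification override to the agent's self-reported novelty."""
    verified_count = sum(1 for rec in verification_log if rec.get("verified"))
    if verified_count >= 3:
        return self_report  # externally corroborated: accept the self-report
    weak_recall = verified_count <= 1 or not recalled_sources
    if weak_recall and self_report in ("familiar", "emerging"):
        # classification_overridden: source_verification_failed
        return "cold_start"
    return self_report  # the middle ground (e.g. exactly 2 verified) is not specified above
```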
Coordinator writes verified_novelty to state.json → topic_novelty. If verified_novelty in {novel, cold_start} → Vocabulary Bootstrap mode activates (Phase 2.5 below). If cold_start → Phase 1 dimension expansion is deferred until after bootstrap — cross-cutting dimensions like PRIOR-FAILURE / ACTUAL-USAGE are meaningless without topic vocabulary.
Phase ordering (v5.1 fix): Phase 0e (pre-mortem) ALWAYS runs AFTER Phase 2.5 (vocabulary bootstrap) when bootstrap fires. Pre-mortem on a cold-start topic without vocabulary produces generic platitudes; running it post-bootstrap lets it reason with discovered canonical terms.
--auto default: novel topics auto-proceed to bootstrap. cold_start under --auto emits COVERAGE_CAVEAT_COLD_START_AUTONOMOUS, tagging the run for user review. The source-verification step fires unconditionally (including under --auto) — it is non-skippable because it prevents the silent-success failure mode.
Interaction with sparse-topic mode (v5.1 specified): cold_start classification short-circuits sparse_topic_mode Round-1 detection for this run — the coordinator knows the topic is sparse before Round 1 fires, skips the Round-1 yield check, applies sparse-topic budget adjustments from Round 1 onward, and tags COVERAGE_CAVEAT_COLD_START (not COVERAGE_CAVEAT_SPARSE_TOPIC) in the final report. This prevents the two modes from triggering the same budget-cut logic twice.
Vocabulary-grounding enforcement (v5.1 — made real): The vocabulary_grounding field on claims is validated externally, not self-reported. For any claim labeled bootstrapped or discovered_mid_run, coordinator runs a grep -i "{term}" check against vocabulary_bootstrap.json (for bootstrapped) or against round findings files (for discovered_mid_run). Mismatched labels → auto-relabeled fabricated and rejected at synthesis. The agent cannot self-launder confabulations.
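A sketch of the external grounding check; the claim-record field names and file locations are assumptions, and the match mirrors the case-insensitive grep described above:

```python
import json
from pathlib import Path

def grounding_holds(claim: dict, vocab_path: Path, findings_dir: Path) -> bool:
    """Externally validate a claim's vocabulary_grounding label; mismatches are relabeled fabricated."""
    term = claim["term"].lower()            # field names are illustrative
    label = claim["vocabulary_grounding"]
    if label == "bootstrapped":
        vocab = json.loads(vocab_path.read_text())
        pool = vocab.get("canonical_terms", []) + [a for alts in vocab.get("aliases", {}).values() for a in alts]
        return any(term in candidate.lower() for candidate in pool)
    if label == "discovered_mid_run":
        return any(term in p.read_text().lower() for p in findings_dir.glob("*.md"))
    return False  # "fabricated" (or any other label) is rejected at synthesis
```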
Phase 2.5 — Vocabulary Bootstrap (v5)
Activates when: state.topic_novelty in {novel, cold_start}. Runs between Phase 2 (Initialize State) and Phase 3 (Research Rounds).
Step A — Vocabulary Bootstrap (1 Scout-tier agent):
Topic: "{seed}"
Goal: build a domain vocabulary before research starts.
Strategy: use the BEST AVAILABLE retrieval tool — adapt to the environment:
- If WebSearch/WebFetch available: query Wikipedia/encyclopedic sources for the topic
- If internal search tools available (Sourcegraph, Confluence, Manuals, NECP, Slack search):
query those for docs, READMEs, and glossaries related to the topic
- If both available: use both; internal sources often have domain-insider vocabulary that
public sources lack
Steps (adapt to available tools):
1. Query 2-3 available search tools for the seed topic
2. From the top results, extract:
- Key terms, aliases, and variant spellings
- Section/category structure (subtopics)
- Related/adjacent concepts
- Seed URLs or document paths for follow-up
3. Deduplicate to `vocabulary_bootstrap.json`:
{
"canonical_terms": [...],
"aliases": {"canonical": ["variant1", ...]},
"subtopics": [...],
"adjacent_topics": [...],
"categories": [...],
"seed_urls_discovered": [...]
}
Budget: 4-6 tool calls max.
Vocabulary is handed to every subsequent research agent as {vocabulary_path} in their prompt: "Use canonical_terms + aliases in your queries — these are domain-insider names you didn't know."
Zero-match fallback: If no search tool returns useful vocabulary, flag topic_novelty: verified_cold_start_no_vocabulary — coordinator lowers expectations (max 2 rounds, COVERAGE_CAVEAT_NO_VOCABULARY_SEED) and proceeds with very limited breadth promises.
Step B — Browse-as-Retrieval primitive (v5 — new orthogonal retrieval path):
One dedicated agent per round while active: Researcher tier for novel topics, Scout tier for emerging topics. It does NOT query indexes; it navigates like a researcher following citations and links.
Start at {seed_url} (from seed_urls_discovered, Wikipedia article preferred).
Navigate outbound body-text links (skip nav/footer/sidebar):
- Max 3 hops deep (seed = hop 0)
- Per page: WebFetch, extract body-text outbound links, pick 2-3 highest-signal (links anchored to subject-matter phrases, skip generic nav terms)
- Record every URL with hop_distance + anchor_text
- Never revisit a URL
- Stop when: budget exhausted (15 WebFetch max), 3 hops reached on all branches, or only nav links remain
Output: browse-retrieval.md with Source Table — every row tagged retrieval_path: browse_follow, hop_distance: N, anchor_text: "...".
PRIMARY output is URL+metadata set. Interesting claims encountered may be reported but aren't mandatory.
Yield is added to the Adapter Pool. Browse-found URLs are deduped against already-cited. This is the first mechanism in the skill that emulates serendipitous discovery — it finds sources the agent couldn't have queried for because it didn't know they existed.
Step C — Vocabulary-grounding in Claims Register (v5):
For novel-topic runs, every claim row adds field:
vocabulary_grounding: bootstrapped (uses term from vocabulary_bootstrap.json) | discovered_mid_run (term surfaced in a finding) | fabricated (term not verified in any source). fabricated → claim rejected at synthesis.
Cost impact: Bootstrap adds 1 Scout-tier agent. Browse-retrieval adds 1 Researcher-tier agent/round when active. Only fires for novel/cold-start topics (typically <20% of runs). Expected additional cost: 1-10 Scout-tier equivalents depending on novelty.
Step 0e — Pre-mortem micro-round (blind-spot seeding):
Before dimension expansion, spawn 1 Haiku agent with this prompt:
Given the seed "{seed}", list 5 concrete ways this research could miss the important insight.
Cover these angles:
1. Wrong framing — the seed presupposes a conclusion that may be wrong
2. Adjacent-effort blindness — parallel work that would duplicate or invalidate
3. Stale assumption — something assumed true that has changed
4. Baseline blindness — no measurement of what's being "improved"
5. Strategic-timing blindness — planning window / roadmap / executive memo coincidence
Output to deep-research-premortem.md with one concrete claim per angle.
Coordinator reads deep-research-premortem.md; each flagged blind spot becomes an auto-seeded direction in Phase 1 with priority=critical.
Phase 1: Seed Expansion (see DFS.md)
Sub-goal extraction (required before dimension expansion): re-read the seed and identify distinct sub-goals (e.g. "iterate on existing X" + "spin up new Y" = 2 sub-goals; "compare A" + "decide B" + "document C" = 3 sub-goals). Each sub-goal gets minimum 2 seed-specific directions allocated during expansion. This prevents the common failure mode where a secondary sub-goal gets absorbed into general infrastructure research and never gets dedicated investigation. Record sub-goals in state.json under seed_subgoals; per-sub-goal direction coverage is tracked in the coordinator summary.
Assess which dimensions from WHO/WHAT/HOW/WHERE/WHEN/WHY/LIMITS are applicable using the multi-context table (see DFS.md)
Generate 2-4 directions per applicable dimension + cross-dimensional directions
REQUIRED cross-cutting dimensions (fire on every run, regardless of seed type — these exist to break anchoring bias and catch structural blind spots):
PRIOR-FAILURE — "What has been tried in this space and failed? Look for deprecated repos, GRAVEYARD markers, abandoned PRs, killed projects, lessons-learned post-mortems."
BASELINE — "What is the current state of the thing this would improve? Measure it concretely before solutioning. Cadence, cost, pain points today."
ADJACENT-EFFORTS — "What parallel/competing work is happening right now? Who else is planning or building in this space? Check active design threads and in-progress PRs."
STRATEGIC-TIMING — "What planning windows, published roadmaps, or executive memos bear on this? Is there a time-sensitive coordination opportunity?"
ACTUAL-USAGE — "For any tool/framework/pattern claimed as 'official,' 'standard,' or 'canonical,' verify via independent code search — don't accept docs-only claims about what teams actually do."
STRUCTURAL-DISCOVERY — "Before content-keyword searches, discover how the ecosystem organizes itself. Content keywords only find things you already know to look for; structural searches find things you didn't know existed. Four sub-passes, adapted to whatever ecosystem the seed targets:
(a) Manifest/config fingerprinting: Identify the ecosystem's organizational markers — the config files, registry manifests, and build descriptors that participation requires. Search for these file patterns across all indexed repos. Every hit = an entity participating in the ecosystem that content keywords might miss. (Examples: package registries, plugin manifests, CI configs, SDK setup files, project descriptors.)
(b) Naming-convention sampling: Sample entity names (repos, packages, modules) across the namespace to discover organizational patterns. High-frequency prefixes/suffixes = real team or project namespaces. Compare discovered namespaces against the seed's initial scope — any namespace with 3+ entities not already covered → flag as a coverage gap.
(c) Org-doc seeding: Search docs, wikis, and knowledge bases for portfolio docs, roadmaps, and organizational charts BEFORE generating content-keyword directions. These docs enumerate product surface areas that keyword searches would miss. Every entity mentioned in an org doc without a corresponding content-keyword direction → auto-generates one.
(d) Adoption fingerprinting: Search for import/instantiation patterns of known frameworks and SDKs to find who's USING them in production code, not just who has them in docs. Each distinct consumer discovered this way that isn't already in scope → new direction.
Coverage verification gate (mandatory): After structural discovery, count: how many discovered entities (teams/products/repos) have ≥1 content-keyword direction targeting them? If <80%, the expansion is incomplete — generate missing directions before proceeding to Round 1. This gate prevents the failure mode where structural discovery finds N entities but content searches only cover N/5."
OPERATIONAL-INVENTORY — "When the seed involves a specific domain, team, or organization's tooling, systematically enumerate ALL systems practitioners interact with by mining operational sources — not strategic docs. Strategic docs tell you ABOUT systems; operational sources tell you WHAT SYSTEMS EXIST. Five sub-passes:
(a) Onboarding-doc mining: Search for onboarding docs, getting-started guides, bootcamp materials, and 'new hire' documents for the target domain. These are curated 'everything you need to know' lists maintained by humans — they enumerate systems that keyword searches miss because the systems are too mundane to appear in strategic memos. Every named system/tool/service in an onboarding doc without a corresponding direction → auto-generate one at priority=high.
(b) Repository dependency crawl: For topics involving codebases, use Sourcegraph or repo inspection to discover actual imports, build dependencies, CLI tool invocations, config file references, and service client instantiations. Follow the dependency graph: what does the code import? What services does it call? What CLIs do the scripts invoke? Each distinct external system discovered this way → candidate direction. This catches systems like Cybertron, Pensive, DataAuditor, MELD DSL that exist in code but not in architecture docs.
(c) Console/dashboard enumeration: For org-scoped topics, check organizational dashboards, service catalogs, and console pages that list what the org owns and depends on. These are authoritative inventories of operational systems. Every listed system not already in scope → candidate direction.
(d) Slack channel mining: Read channel topics, pinned messages, and bookmarks of the domain's primary Slack channels. Also sample recent threads for system names mentioned in operational context ('I deployed to X', 'check the Y dashboard', 'run the Z CLI'). Each system mentioned in operational context without a direction → candidate.
(e) CLI/tool invocation sampling: Search for shell command patterns, Makefile targets, CI config steps, and README 'getting started' sections that reference tools by invocation name. CLI tools and developer utilities are invisible in strategic docs but appear in every 'how to run this' guide. Each distinct tool → candidate direction.
Entity inventory register (mandatory output): Produce a flat list of ALL distinct systems/tools/services discovered across sub-passes (a)-(e), with source attribution. This register is the denominator for the Entity Saturation gate at termination (Phase 6). Write to state.json → operational_inventory. Each entity is tagged: covered (has ≥1 explored direction), queued (direction exists but unexplored), or undiscovered (no direction yet — auto-generate one).
Activation rule: This dimension activates when the seed topic references a specific team, organization, domain, workflow, or role (e.g. 'AIMS algo developer tools', 'MLP platform readiness', 'data engineer workflow'). For pure-concept topics ('what is reinforcement learning?'), skip this dimension. When in doubt, activate — the sub-passes gracefully produce empty results for non-organizational topics."
Each cross-cutting dimension gets ≥1 direction at priority=high, in addition to seed-specific directions.
Maximum: 30 initial directions (cross-cutting dimensions count against this cap; raised from 25 to accommodate OPERATIONAL-INVENTORY entity generation)
Minimum dimension rule:
0 applicable → error; prompt user to clarify
1-2 applicable → warn user; ask if they want to continue
3+ applicable → proceed normally
Pre-run scope declaration (show before proceeding):
Deep research: "{seed}"
Interpretation: [one-sentence interpretation]
Applicable dimensions ({N}): {list only applicable ones}
Initial directions: {count}
Estimated rounds needed: {low}–{high}
Suggested max_rounds: {recommendation with rationale}
Wall-clock estimate: {time range}
Set max_rounds [default {recommendation}]: _
Continue? [y/N]
User sets max_rounds explicitly — no hardcoded default.
max_rounds recommendation formula (constants are calibration defaults — override via state.json → scheduling):
initial_directions = count of directions in Phase 1
max_agents_per_round = state.max_agents_per_round or 6 # default 6, tune to your quota
depth_multiplier = state.depth_multiplier or 2.5 # expected sub-direction yield
min_recommended = state.min_recommended_rounds or 10 # floor — adaptive formula should drive, not this
min_rounds_to_cover_seed = ceil(initial_directions / max_agents_per_round)
recommended = ceil(min_rounds_to_cover_seed * depth_multiplier)
recommended = max(recommended, min_recommended)
Example (defaults): 15 initial directions → ceil(15/6)=3 × 2.5=7.5 → recommend 10.
Example (defaults): 20 initial directions → ceil(20/6)=4 × 2.5=10 → recommend 10.
Example (defaults): 40 initial directions → ceil(40/6)=7 × 2.5=17.5 → recommend 18.
Wrong constants change recommended depth, not correctness — the soft gate still prompts at max_rounds.
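The recommendation formula as a runnable sketch with the stated defaults (override via state.json → scheduling):

```python
from math import ceil

def recommend_max_rounds(initial_directions: int,
                         max_agents_per_round: int = 6,
                         depth_multiplier: float = 2.5,
                         min_recommended: int = 10) -> int:
    min_rounds_to_cover_seed = ceil(initial_directions / max_agents_per_round)
    return max(ceil(min_rounds_to_cover_seed * depth_multiplier), min_recommended)

# With the defaults: 15 -> 10, 20 -> 10, 40 -> 18, matching the worked examples above.
```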
Note: max_rounds is a soft gate — the skill will prompt to extend when reached with non-empty frontier. Prefer saturation over budget limits. Set max_rounds high enough that frontier exhaustion or coverage plateau is the normal termination condition, not the round limit.
Phase 2: Initialize State
Create deep-research-state.json in CWD (see STATE.md for schema)
Create deep-research-findings/ directory
Write lock file: deep-research-{run_id}.lock
Print: "Starting deep research on: {seed} [run: {run_id}]"
Phase 3: Research Rounds (see DFS.md)
Before spawning each round — prospective gate:
About to run Round {N}: {frontier_size} directions queued
Estimated tokens this round: ~{estimate} ({cost_estimate})
Total spent so far: ~{running_total}
Continue? [y/N/redirect:<focus>]
This fires BEFORE agents are spawned. User can prevent spend, not just observe it.
(Skip with --auto flag for autonomous runs.)
Per round:
Pop up to max_agents_per_round (6) highest-priority directions from frontier
Select model tier for each direction
Spawn agents in parallel with 8-minute timeout (see agent prompt template below)
On timeout: mark timed_out, DO NOT re-queue, DO NOT increment dedup counter
On transient API errors (proxy ZlibError, 429, 503, ECONNRESET, timeout <30s): auto-retry ONCE with the same prompt before marking timed_out. Record retry_count: 1 in the direction entry. True infrastructure flakes cost one agent-slot to recover; persistent errors still fail to timed_out on the retry.
Collect ALL new directions from ALL completed agents BEFORE running dedup (see STATE.md)
Apply dedup against stable pre-round snapshot
Validate direction ID headers in completed findings files (see STATE.md)
Unconsumed Leads Recovery: scan every completed findings file for a ## Unconsumed Leads section (required output — see FORMAT.md). Each lead = entity/team/concept/tool mentioned but not independently researched. Apply dedup against explored + frontier. Net-new leads become directions with priority=high. This pass fires BEFORE coverage evaluation so recovered leads can drive dimension coverage.
Update coordinator summary (see SYNTHESIS.md) — including cross-cutting dimension coverage table
Run round-level dimension re-assessment (see DFS.md), including cross-cutting dimensions
Increment round
Phase 3.7 — Cross-Dimensional Coherence (after final round, before fact verification)
After all research rounds complete, spawn a coherence integrator per _shared/cross-finding-coherence.md (research variant). This is the dedicated cross-dimensional audit that synthesis alone does not perform — synthesis writes a narrative; the integrator audits claim consistency.
Why this exists: the coordinator's Pass 2 theme extraction reads mini-syntheses and writes a report. It does NOT systematically check whether the WHO dimension's "adoption is 30%" contradicts the WHAT dimension's "adoption is 45%", or whether three dimensions independently corroborate the same conclusion (a high-confidence signal). The integrator does exactly that.
Input: file paths to ALL completed findings files across all rounds (post-dedup, post-unconsumed-leads recovery). Also: the dimension taxonomy and the seed topic file.
Agent: Sonnet, independent (fresh context). Timeout: 180s (research findings are larger than critic output — more to read).
Annotation vocabulary (research variant):
CONTRADICTS — two claims from different dimensions assert incompatible facts. Synthesis must explicitly address the disagreement with source comparison, not silently pick one.
CONVERGES — multiple dimensions independently corroborate the same conclusion. High-confidence signal for synthesis to highlight.
SUPERSEDED_BY — one dimension's claim is a subset of another's more complete treatment.
SOURCE_CONFLICT — two claims cite the same source but extract different numbers/conclusions. Priority target for Phase 4 spot-check.
STANDALONE — no cross-dimensional relationship detected.
GAP — dimensional intersection not covered by any researcher. If rounds remain: feeds into frontier. If final: becomes "What this report does NOT cover" item.
Output: deep-research-{run_id}/coherence/cross-dimensional-coherence.md + structured annotations in a STRUCTURED_OUTPUT_START/END block.
Fail-safe: unparseable or timed-out → all claims proceed to Phase 4 without annotations. Log COHERENCE_PARSE_FAILED. Phase 5 report notes "Cross-dimensional coherence analysis unavailable."
Downstream effects:
Phase 4 spot-check prioritizes CONTRADICTS and SOURCE_CONFLICT claims
Phase 5 Pass 2 receives pre-identified CONVERGES clusters as candidate themes and CONTRADICTS pairs as explicit "genuine contradictions"
Phase 5 Pass 4 QA offer notes if coherence found zero issues across 10+ claims from 4+ dimensions ("QA may be less critical")
COHERENCE_HIGH_CONTRADICTION_RATE (>40% of claims) → synthesis organizes by schools of thought rather than forcing resolution
Phase 4: Fact Verification (after final research round, before synthesis)
A dedicated verification pass runs before the final synthesis. See SYNTHESIS.md for full details.
Step 4a — Claim extraction:
Extract the N most significant factual claims (N = min(20, total claims))
Step 4b — Risk-stratified citation spot-check:
Fetch each sampled URL; check (a) accessible, (b) attributed claim is stated in source text
For numerical claims: compare EXACT numbers. Do NOT accept semantic similarity.
Paywalled → "unverifiable — full text inaccessible"
Accessible but claim not found → "citation mismatch — flag for manual verification"
Step 4c — Corroboration independence check:
For claims cited by 3+ agents: verify sources are genuinely independent (different org, date, methodology)
Phase 5: Synthesize (see SYNTHESIS.md)
Pass 1 — Mini-syntheses are written by each agent in their findings file (required).
Pass 2 — Theme extraction:
Coordinator reads mini-syntheses AND Phase 3.7 coherence annotations
CONVERGES clusters are pre-identified cross-dimensional themes — use them as seed themes, not just raw mini-syntheses
CONTRADICTS pairs become explicit "genuine contradictions" in the report — addressed with source comparison, never silently resolved by narrative choice
GAP annotations become items in the "What this report does NOT cover" section
A theme is valid ONLY if it requires findings from 2+ distinct dimensions
Identifies meta-patterns, fundamental tradeoffs, consensus beyond what coherence pre-identified
If COHERENCE_HIGH_CONTRADICTION_RATE flagged (>40%): organize by schools of thought rather than forcing a consensus narrative
Pass 3 — Final report: Write deep-research-report.md (see FORMAT.md)
Pass 3.5 — HTML ship (default on, skip with --no-html):
Produce a styled HTML report alongside the markdown:
Dark theme (#0d1117 background), clickable table of contents, all URLs as links
Evidence Gaps / "What this report does NOT cover" section
Single-Source Claims callout section
Upload to S3 via the upload-presentation skill pattern
Skip with --no-html for markdown-only output.
Editorial synthesis standards (always apply during Pass 2-3):
People: Full name on first mention with role/context. No orphaned surnames.
Proper nouns: Disambiguate product names that could be confused with people or common words.
Counts: If you write "three documents" or "five teams," the enumeration must match exactly.
Transitions: Every sentence follows logically from the previous. No topic shifts without paragraph breaks.
Links: Every named system, tool, project, or platform MUST have a clickable hyperlink on first mention — look up the URL (manual page, GitHub repo, or reference doc) and add it. This means ADDING links, not just checking existing ones are well-formed.
Source terminology: Keep proper nouns exactly as the source uses them. Do not rename or "improve" terminology.
Implications, not facts: Each finding must state what the evidence means for the research question, not merely restate what the source says. "X is true" is incomplete; "X is true, which means Y for this question" earns its place.
Mandatory counterevidence: At least one research query per dimension must explicitly target problems, failures, or criticism of the topic. A direction with no risks surfaced is incomplete, not thorough.
No hallucinated citations: Only cite URLs whose content was actually retrieved by a tool call in this run. Never cite from training data or memory. If a claim needs a source and you don't have one, flag it as unsourced rather than fabricating a plausible URL.
Integration: When adding findings to an existing document, weave into the existing narrative — connect each new fact to established context rather than inserting standalone blocks.
Pass 3.9 — VERIFY (mechanical, before Pass 4):
After writing the report, scan for:
Capitalized proper nouns not inside [...]() — every named entity needs a link.
Broken links — all [text](url) have non-empty URLs and valid markdown syntax.
Count consistency — for every "N things" claim, count the enumerated items.
Orphaned names — every surname should have a full-name introduction earlier.
Fix any issues found. This is a mechanical gate, not a judgment call.
Pass 4 — QA pass (automatic offer):
After writing deep-research-report.md:
If Phase 3.7 coherence found zero contradictions and zero gaps across 10+ claims from 4+ dimensions:
QA pass available. Run deep-qa on this report? [y/N]
(Coherence integrator found no cross-dimensional issues — QA may be less critical for this run,
but still audits citation accuracy and counter-evidence gaps.)
Otherwise (or if coherence was unavailable/degraded):
QA pass available. Run deep-qa on this report? [y/N]
(Recommended: audits citation accuracy, logical consistency, coverage gaps,
and counter-evidence gaps that the research pass may not have self-checked.)
If y: invoke deep-qa with --type research on deep-research-report.md
QA run_id: {parent_run_id}-qa
QA report written alongside: deep-research-{run_id}-qa-report.md
Fact verification in deep-qa complements (does not replace) the spot-check in Phase 4
If n: skip. Final output remains deep-research-report.md alone.
Phase 6: Termination Check (see DFS.md)
Terminate when ANY of these is true (any-of-4, not all-of-4):
User chooses N at a round gate (explicit user decision)
Coverage plateau: No new dimensions for 3 consecutive rounds AND all frontier items have exhaustion ≥ 4 AND blind-spot check passes (all 7 cross-cutting dimensions have ≥1 explored direction AND unconsumed-leads count == 0) AND entity saturation check passes (see below)
Budget soft gate: max_rounds reached with non-empty frontier → prompt user to extend or stop (see DFS.md Step 5)
Frontier actually empties (possible because direction reporting is optional)
Blind-spot gate: condition 2 CANNOT fire if any of PRIOR-FAILURE / BASELINE / ADJACENT-EFFORTS / STRATEGIC-TIMING / ACTUAL-USAGE / OPERATIONAL-INVENTORY is uncovered, or if unconsumed-leads count > 0. Dimension coverage alone is necessary but not sufficient for "coverage plateau."
Entity Saturation gate (new — applies when OPERATIONAL-INVENTORY is active): condition 2 CANNOT fire if state.json → operational_inventory exists AND >20% of registered entities have status undiscovered. When the gate blocks, coordinator auto-generates directions for the top-priority undiscovered entities (up to max_agents_per_round) and injects one additional round. After the recovery round, re-check saturation; if still >20% undiscovered, terminate but tag the report COVERAGE_CAVEAT_ENTITY_INVENTORY_INCOMPLETE: {N} of {M} entities unexplored. This prevents the failure mode where structural/operational discovery finds 90 systems but content research only covers 60.
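A sketch of the saturation check over state.json → operational_inventory; the per-entity status values come from the register definition in Phase 1:

```python
def entity_saturation_blocks_termination(operational_inventory: list, threshold: float = 0.20) -> bool:
    """True when more than 20% of registered entities are still 'undiscovered'."""
    if not operational_inventory:
        return False  # gate only applies when OPERATIONAL-INVENTORY produced a register
    undiscovered = sum(1 for e in operational_inventory if e.get("status") == "undiscovered")
    return undiscovered / len(operational_inventory) > threshold
```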
max_rounds is a soft gate — it prompts the user, it does not auto-terminate. In --auto mode, the coordinator continues until saturation (coverage plateau or frontier exhaustion), not until max_rounds. max_rounds in --auto mode only triggers a progress log, not a stop. Absolute hard ceiling is max_rounds * 5.
Breadth Expansion Module
Inserts between Phase 3 (Research Rounds) and Phase 5 (Synthesis). Closes the retrieval-breadth gap relative to Claude-Research-style web products via observable, loudly-failing mechanisms — not infeasible ones.
Phase 3.5 — Per-round Breadth Observability (runs after each round's dedup + unconsumed-leads sweep)
URL-level dedup metric. Coordinator reads every findings file's Source Table, aggregates unique_urls_seen across agents vs total_searches_fired. Computes:
domain_entropy = Shannon entropy over domain counts (computed as coordinator-state-level single-pass counter; no embeddings required)
Writes these to state.json → breadth_metrics[round_N].
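A single-pass sketch of the round metrics, assuming the coordinator already holds one URL string per Source Table row:

```python
from collections import Counter
from math import log2
from urllib.parse import urlparse

def breadth_metrics(urls: list) -> dict:
    """Unique-URL count plus Shannon entropy over domain counts (no embeddings required)."""
    unique_urls = set(urls)
    domains = Counter(urlparse(u).netloc for u in unique_urls)
    total = sum(domains.values())
    entropy = -sum((n / total) * log2(n / total) for n in domains.values()) if total else 0.0
    return {
        "unique_urls_seen": len(unique_urls),
        "unique_domains": len(domains),
        "domain_entropy": round(entropy, 3),
    }
```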
Rolling domain blocklist maintenance (with narrow-domain protection). Coordinator tracks per-domain citation counts. Adds a domain to blocked_domains ONLY if:
It is in the top 5 by citation count this round, AND
It contributes < 30% of total citations across the full run-to-date (prevents blocklisting the only useful source in narrow-domain topics like arxiv-heavy academic, vendor-specific engineering, or single-standard-body policy), AND
It is not the sole source for any single-source claim
Cap the list at 10 entries. Agents pass it via WebSearch's blocked_domains parameter. If no domain qualifies for blocklisting this round, that's a valid signal — record blocklist_skipped: narrow_domain_regime in breadth_metrics and proceed without domain diversity forcing.
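A sketch of the three blocklisting conditions; the citation counters and single-source-claim domains are inputs the coordinator already tracks per the rules above:

```python
from collections import Counter

def update_blocked_domains(blocked: list,
                           round_counts: Counter,
                           run_counts: Counter,
                           single_source_domains: set,
                           cap: int = 10) -> list:
    """Add this round's dominant domains to the blocklist, subject to narrow-domain protections."""
    run_total = sum(run_counts.values()) or 1
    for domain, _ in round_counts.most_common(5):       # condition 1: top 5 by citation count this round
        if domain in blocked or len(blocked) >= cap:
            continue
        if run_counts[domain] / run_total >= 0.30:      # condition 2: narrow-domain protection
            continue
        if domain in single_source_domains:             # condition 3: sole source for a claim
            continue
        blocked.append(domain)
    return blocked
```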
Sparse-topic auto-mode (trigger: after Round 1 only). If Round 1 returns unique_domains < 10 OR mean(directory_yield) == 0 across all agents:
Set sparse_topic_mode: true in state.json
Disable directory-first step for subsequent rounds (already yielding nothing)
Halve the per-agent search budget allocation for counter-evidence (redirect to depth-first)
Tag the final report with COVERAGE_CAVEAT_SPARSE_TOPIC
Notify user at next round gate: "Sparse-web topic detected (N unique domains, D directories yielded 0). Continue with reduced breadth expectations? [y/N]"
Sparse-topic mode is NEVER silently activated.
Retrieval-path yield table (built per-round, surfaced in final report). Coordinator aggregates per-source retrieval_path tags into a yield-per-path table:
BREADTH_VERDICT_START
unique_domains: N
unique_publishing_entities: M (registrable-entity count, not domain — e.g. `*.blogspot.com` → single entity)
domain_entropy: E
temporal_range: {oldest_date} to {newest_date}
citation_cascade_risk: {none | weak | strong}
(strong = ≥30% of "independent" sources trace to ≤3 original studies/authors)
unexplored_retrieval_paths: [list of paths with 0 sources]
languages_represented: {"en": N_en, "zh": N_zh, "ja": N_ja, ...} (v4 — count per source_language)
missing_authoritative_languages: [list] (v4 — intersection of state.language_locus.authoritative_languages minus languages_represented; empty if fully covered)
coverage_caveat_recommended: {none | sparse_topic | english_only | single_engine_monoculture | language_gap}
BREADTH_VERDICT_END
Coordinator reads ONLY the structured block; unparseable → fail-safe "citation_cascade_risk: strong, coverage_caveat_recommended: verify_manually." On timeout (>3min) or spawn failure → record breadth_audit: unavailable in final report header + preserve all v0 termination labels unchanged. Never block final report on auditor failure — the auditor is a quality gate, not a critical path.
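A sketch of the coordinator-side extraction with the fail-safe fallback described above; the parsing details are an assumption:

```python
import re

FAIL_SAFE = {"citation_cascade_risk": "strong", "coverage_caveat_recommended": "verify_manually"}

def parse_breadth_verdict(report_text: str) -> dict:
    """Read ONLY the BREADTH_VERDICT block; a missing or malformed block falls back to the worst case."""
    m = re.search(r"BREADTH_VERDICT_START(.*?)BREADTH_VERDICT_END", report_text, re.DOTALL)
    if not m:
        return dict(FAIL_SAFE)
    verdict = {}
    for line in m.group(1).strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            verdict[key.strip()] = value.strip()
    return verdict or dict(FAIL_SAFE)
```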
Termination gate (auditor has teeth):
If citation_cascade_risk: strong AND unique_publishing_entities < 5 AND round budget remains → block termination. Inject ONE additional round targeted at cluster outliers: coordinator reads the dominant cluster's shared citations, generates 3–5 directions explicitly targeting outsiders/critics/alternative-authors of those sources, spawns them. This is the ONLY mechanism in the skill that reverses a termination decision. After the extra round, re-run the auditor; if cascade risk still strong, terminate anyway but tag the final report COVERAGE_CAVEAT_CITATION_CASCADE_UNRESOLVED.
v4 — Language gap gate: If missing_authoritative_languages is non-empty AND round budget remains AND state.language_locus.coverage_expectation != "en_dominant" → block termination. Inject ONE additional round in cross-lingual mode targeting the missing languages: coordinator translates the top 5 outstanding directions into each missing language via LLM (writing translations to xlang_queries/{lang}.json), invokes language-specific adapter variants (see Cross-Lingual Adapter section), spawns research agents constrained to source_language in {missing_languages}. After the extra round, re-run the auditor; if gap persists, terminate but tag the final report COVERAGE_CAVEAT_LANGUAGE_GAP: {list}.
Both gates may fire in sequence (cascade-resolution round first, then language-gap round). Each fires at most once per run. Both are suppressed after 1 firing under --auto to bound cost.
If remaining budget < 1 round → skip injection, tag COVERAGE_CAVEAT_CITATION_CASCADE_BUDGET_EXHAUSTED.
If --auto is set → injection fires automatically once per run; second injection suppressed.
The breadth auditor's verdict is appended verbatim to the final report as a new section: Breadth Audit. Gate actions and outcomes are recorded in state.json → breadth_gate_log.
Depth Presets
The --depth flag controls research budget (usable directly or via the investigate shim):
| Depth | Dimensions | Tool calls/direction | Max rounds | Cross-cutting dims required |
|---|---|---|---|---|
| quick | 2-3 | 3 | 3 | BASELINE + ACTUAL-USAGE |
| standard | 3-5 | 5-10 | 30 | 3 of 5 |
| deep | 5-7 | 10+ | unlimited | all 5 |
For quick mode: skip novelty detection, vocabulary bootstrap, breadth auditing, and contrarian pass. Runs up to 3 rounds, synthesize, ship. This is the "80% quality in 20% of the time" option.
For standard mode (default): run all phases, skip the contrarian pass. Runs until saturation (coverage plateau or frontier exhaustion), with max_rounds=30 as a soft gate — not a hard stop. In --auto mode, continues past max_rounds until saturation; absolute ceiling is max_rounds * 5 (150).
For deep mode: full deep-research spec with no shortcuts.
When --depth is not specified, max_rounds and max_directions control the budget directly (backward compatible).
Golden Rules
Never spawn an agent without checking the state file first. Dedup every direction.
Direction reporting is optional. Terminal node is valid output — do NOT force agents to invent directions.
Frontier is priority-ordered. Always pop highest-priority first. Children get +2 depth bonus.
Two explorations max per direction. Third+ is skipped.
Prospective gate fires before spend. Never spawn agents without showing the user cost estimate first (unless --auto).
Coordinator context is bounded. Never accumulate raw findings — use the structured coordinator summary.
Every finding needs a source. Web search URLs required. No training-data-only findings.
Always specify model tier explicitly. Never let agents default — cost spirals come from unintentional Opus usage.
Verify numerics manually. Flag all numerical claims in the spot-check; LLM number verification is unreliable.
Mentioned-but-unexplored is a bug. Entities/teams/tools named in a finding but not independently researched are unconsumed leads. Every round must drain them.
"Official" claims need code-level verification. Docs say what's policy; search/code says what's actual. For any paved-road / canonical / standard claim, verify adoption in production code before reporting.
Blind-spot dimensions are required, not optional. PRIOR-FAILURE, BASELINE, ADJACENT-EFFORTS, STRATEGIC-TIMING, ACTUAL-USAGE fire on every run. Coverage plateau cannot be claimed without them.
Breadth must fail loudly, never silently. Every breadth mechanism (directory-first, operator-slicing, link-crawl, counter-evidence) emits a yield signal: the count of unique URLs it produced. Zero yield is a valid signal and drives adaptation (e.g. sparse-topic mode); a missing signal is a bug. Coordinator rejects any findings file without directory_yield and retrieval_path populated.
Use native tool parameters, not prompt exhortations, for enforceable constraints. Domain blocklist: WebSearch blocked_domains parameter. Domain allowlist: allowed_domains. LLMs ignore prompt-level instructions under search pressure; the API parameter cannot be ignored.
Source-entity independence ≠ domain independence. Two URLs on different domains owned by the same publishing entity (*.substack.com, blog-of-lab-X, media-conglomerate-syndication) are not independent sources for corroboration. The breadth auditor reports unique_publishing_entities, not just unique_domains — the tighter number.
Anti-Rationalization Counter-Table
The coordinator WILL be tempted to skip steps. These are the talking points it must reject.
| Excuse | Reality |
|---|---|
| "Coverage is good enough — we've hit most dimensions." | No. "Good enough" is the label for the termination check, not a reason to skip it. Run the coverage plateau check and emit the honest number. Under-coverage disclosed is fine; under-coverage hidden is a bug. |
| "This source looks fine — no need to assess its tier." | No. Every source gets a tier classification (primary/secondary/unverified). A blog post is not a primary source because it cites one. Tier drives Evidence Quality — skipping it silently inflates the report's credibility. |
| "A single citation is sufficient for this claim." | No. Single-source claims MUST carry corroboration: single_source and MUST be surfaced in the report's Single-Source Claims section. Never promote a single-source claim to "established fact." |
| "The source is recent enough — within a couple of years." | No. For fast-moving topics, recency_class is computed from the 12-month threshold, not from vibes. A 2-year-old paper on a fast-moving topic is stale and must be flagged. |
| "No need to search for counter-evidence — findings are consistent." | No. Consistency across a coordinator-generated framing is not evidence of truth. Every claim needs counter_evidence_searched. "I didn't find any" is a valid answer; "I didn't look" is not. |
| "A primary source confirmed it — we can call it settled." | No. A single primary source is still single_source. Independence check requires 2+ sources that do not cite each other and are not from the same publishing entity. One paper is not settled science. |
| "This forum post / personal blog counts as a citation." | No. unverified tier sources do NOT count toward corroboration. They may appear in the source table as context, but the claim's corroboration field only counts primary + secondary. |
| "Exhaustion threshold met — we're done with this direction." | No. Exhaustion ≥ 4 is one termination trigger, not a license to skip the coverage plateau check or the counter-evidence search. Exhaustion measures what was found, not what was missed. |
| "Honest coverage report can wait — the findings are the main product." | No. The coverage report IS the product. An overclaimed research report is worse than a short one. Termination label + Coverage % + Evidence Quality + single-source count are non-optional. |
| "Just one more round will close the gap." | Maybe. If the frontier is non-empty and coverage plateau hasn't been reached, continue. Saturation is the preferred termination condition — stopping early with unexplored frontier is worse than running extra rounds. The coordinator self-extends in --auto mode until saturation. Only stop when: frontier empties, coverage plateau is confirmed, or the absolute hard ceiling is hit. |
| "The second agent repeated the first — that's corroboration." | No. Two agents reading the same sources is not independent corroboration. Independence is a source-level property, not an agent-level property. |
| "Counter-evidence was weak, so I didn't cite it." | No. Disconfirming evidence that was found MUST be cited inline with the claim and the field set to yes_disconfirming_evidence_present. Omitting it is selection bias dressed up as editorial judgment. |
| "I'll merge stale-source claims into the main text — flagging them is clutter." | No. Stale sources get counted and surfaced in the final report. The reader cannot assess freshness if the coordinator launders the date. |
| "The seed framing is obviously correct — I don't need a pre-mortem." | No. The most expensive research failures are framing errors. The Phase 0e pre-mortem costs a single cheap (Scout-tier) agent and catches the wrong-framing category before direction expansion. Skip it and you anchor the entire run on an unchecked premise. |
| "Entity X came up briefly but isn't worth a direction." | No. If it was named in a finding, it's a lead. Either dedupe it against explored directions (and record the dedupe decision) or spawn a direction. Silent skipping is how critical parallel efforts, adjacent teams, and competing tools routinely disappear from final reports. |
| "Docs say X is official — no need to verify with code." | No. ACTUAL-USAGE requires search/code (or equivalent) whenever the domain supports it. "Official" is about policy; code is about reality. Policy-only claims must carry corroboration: single_source and note "adoption unverified." |
| "Cross-cutting dimensions are optional for simple topics." | No. They exist to catch what the seed framing obscures. "Simple" topics are where the most framing errors hide — a narrow seed + skipped cross-cutting dimensions = confidently wrong report. |
When the coordinator finds itself about to reach for any of these excuses: it stops, updates the structured field, and proceeds the right way. The extra structure is the cost of honest output.
Self-Review Checklist
State file is valid JSON after every round
No direction has status "in_progress" after round completes
Final report includes Cross-Cutting Dimension Coverage table
For every "official/standard/canonical" claim, the findings file notes whether code-level usage was verified
Agent Prompt Template
When spawning each research agent:
You are a research agent. Your task is to thoroughly research ONE specific direction. Think carefully and step-by-step — source quality assessment, counter-evidence searching, and Claims Register classification are load-bearing decisions that drive the final report's credibility. This is harder than it looks; do not rush.
**Your research question:** {direction.question}
**Dimension:** {direction.dimension}
**Depth:** {direction.depth} | **Priority:** {direction.priority}
**Topic velocity:** {topic_velocity} (recency threshold: {recency_threshold_months} months; sources older than this are `stale`)
**Coverage fingerprint (dedup only — do NOT let this anchor your findings):**
Already-explored directions: {list of explored direction titles}
These are listed so you avoid repeating them — NOT to constrain your conclusions.
**What we've found so far (research context):**
{coordinator_findings_summary}
Dominant framing so far: {dominant_framing}
⚠️ If your research points in a different direction, follow it. Do not assume the dominant framing is correct.
**Instructions:**
1. **Retrieval adapters — multi-index parallel fanout (REQUIRED — breadth gate).** Any single search tool has ONE ranking bias. Real breadth requires hitting multiple indexes with different biases. The coordinator gives you a pre-fetched **Adapter Pool** in `{adapter_pool_path}` — a JSON file containing seed URLs/results already harvested from 3–5 adapters applicable to this topic (see Adapter Registry section). Start your research from that pool. The coordinator auto-discovers available retrieval tools at run start and selects adapters from these categories:
- **Web search** (general index bias) — WebSearch if available
- **Academic/bibliographic** — SemanticScholar, arXiv, OpenAlex, Crossref APIs if web-accessible
- **Community/practitioner** — HackerNews, StackExchange, Reddit APIs if web-accessible
- **Code/implementation** — Sourcegraph, GitHub search, code-level keyword search
- **Internal docs** — Confluence, Manuals, NECP, RAG endpoints, Google Docs search
- **Communication** — Slack search, email archives, meeting transcripts
- **Encyclopedic** — Wikipedia external links, curated knowledge bases
**Environment adaptation:** In public-web environments, use public APIs. In internal/corporate environments, substitute internal search tools (code search, doc search, Slack, people directories). The principle — multiple independent indexes with different biases — is constant; the specific tools are not. If fewer than 3 retrieval paths are available, log `COVERAGE_CAVEAT_LIMITED_RETRIEVAL_PATHS` and proceed.
The coordinator selects applicable adapters by topic-class heuristic (see Adapter Registry section). You receive the deduplicated result pool. If the Adapter Pool contains ≥5 primary-tier results, you MAY skip general web search for this direction and spend budget on counter-evidence + depth retrieval.
2. **LLM-as-retrieval-memory (orthogonal index — REQUIRED).** Search engines rank by PageRank-style popularity. Your training corpus ranked sources by a different signal entirely. Before any WebSearch, enumerate from memory: list 8–15 specific sources on this topic you recall (author names, paper titles, repo names, canonical blog posts, key books), each with approximate year + why it's relevant. Write them to `{llm_memory_seeds_path}`. Then WebFetch each to verify existence + extract URL. Verified seeds enter the Source Table with `retrieval_path: llm_memory_verified`. Unverified or fabricated ones are discarded and flagged — do NOT include them in findings. This is the single most orthogonal retrieval path available; it surfaces sources that are seminal-but-unranked, historical-but-deprioritized, or niche-expert-known-but-not-viral.
3. **Yield signals (mandatory for every path):** record per-path counts in the Source Table header: `adapter_pool_yield: N_i per adapter | llm_memory_verified: M | websearch_yield: W | total_unique_domains: D | total_unique_publishing_entities: E`. If any path yielded 0, tag it — the coordinator uses per-path zero-yield signals to drive sparse-topic detection and adapter-pool pruning. Silent path failure is a bug.
4. **WebSearch with operator slicing and native blocklist.** When you do use WebSearch:
- **Working operators** (verified to be honored): `site:edu`, `site:gov`, `site:{specific-domain}`, `-site:{excluded-domain}`, `filetype:pdf`. Use these to cut into unexplored strata.
- **Do NOT rely on** `inurl:`, `after:`, `before:`, `intitle:` — WebSearch silently drops these. For date constraints, put the year in the query text (e.g. `"X 2025"`, `"X 2020"`).
- **Native domain blocklist** — pass already-cited domains via WebSearch's `blocked_domains` parameter (not prompt-exhortation). The coordinator supplies this list to you in `{blocked_domains}` below.
5. **When to search vs rely on training data:** WebSearch is REQUIRED for any factual claim that lands in your Claims Register — recency matters, training data is stale, and every claim needs an inline source URL (Source Table requirement). Training-data-only knowledge is acceptable ONLY for definitional/foundational context that frames or scopes the search. Findings that lack a fetched source are rejected at synthesis — do not report them.
6. **Search budget hygiene (soft target, NOT hard allocation):** the Adapter Pool is pre-fetched by the coordinator and does NOT consume your WebSearch budget. Your budget applies only to WebSearch + counter-evidence + depth-fetch. Target mix:
- **Haiku (8 slots):** LLM-memory verification (up to 3 WebFetches) + counter-evidence for TOP 3 load-bearing claims + 2–3 WebSearch. If counter-evidence would exceed budget, document skipped claims with `counter_evidence_searched: no_budget_exhausted` — do NOT silently drop them.
- **Sonnet (15 slots):** LLM-memory verification (up to 6) + counter-evidence per load-bearing claim + remainder WebSearch.
This is a target. If Adapter Pool already yielded ≥5 primary-tier sources and LLM-memory verified ≥3, you MAY skip WebSearch entirely.
7. Go deep — follow references, check citations, look for contradictions
8. Assess source quality for EACH source (primary / secondary / unverified) — see definitions below
9. Record each source's publication date, publishing entity, AND `retrieval_path` (one of: `directory`, `websearch`, `websearch+operator`, `citation-follow`, `llm_memory_verified`). The `retrieval_path` field is required — the coordinator uses it to audit how breadth was actually achieved.
10. For every load-bearing factual claim: run a counter-evidence search. Record the outcome as `yes_none_found`, `yes_disconfirming_evidence_present` (cite the disconfirming source inline), or `no_search_skipped` (only for definitional claims)
11. Write your findings to: {findings_path}
12. Use the FORMAT specified: Findings, Claims Register, Source Table (with `retrieval_path` column), Mini-Synthesis, New Directions, Exhaustion Assessment. Include the yield-signal header line from instruction 3 at the top of the Source Table.
13. Every claim row in the Claims Register needs (a sample row is sketched after this template):
- `corroboration`: `single_source` | `two_independent_sources` | `three_or_more_independent_sources` (sources independent iff they don't cite each other AND aren't from the same publishing entity; unverified-tier sources don't count)
- `counter_evidence_searched`: one of the three values above
- `recency_class`: `fresh` | `stale` | `undated` (freshest source wins for the claim)
14. Every claim in text needs an inline source URL
15. New directions are OPTIONAL — if this is a terminal node, write "None — terminal node."
16. Do NOT report new directions that are paraphrases of already-explored topics
17. **Unconsumed Leads (required — do NOT skip):** scan your findings for every named entity (team, tool, repo, person, framework, project) that you mentioned but did NOT independently research. Report them in a `## Unconsumed Leads` section with one line each: `- <entity>: <why it's worth researching>`. If genuinely none: write "None — all referenced entities were core to this direction's scope." The coordinator uses this section to drain missed leads across rounds.
16. **"Official" claims require code-level verification:** for any claim that something is "official," "standard," "canonical," "paved road," or "the way X is done at <org>," run at least one code search (e.g. `search/code` or equivalent) to verify ACTUAL adoption in production code. If you can only find docs but no code, mark the claim `corroboration: single_source` and note "policy-only, adoption unverified" in the Claims Register.
19. **Restricted-access document handling:** when you encounter a source behind auth (Google Docs, Confluence, Notion, internal wikis, paywalls, Slack threads), try the domain-appropriate authenticated tool first in this order: (a) domain-specific MCP tool if one is available for this source type (e.g. `mcp__google-docs`, `mcp__jira`, `mcp__slack`), (b) internal search/gh API, (c) WebFetch as last resort. If NONE work, do NOT silently drop the source — record it in `## Unconsumed Leads` with status `access_blocked` and a one-line note describing (i) what the document is expected to contain based on context, (ii) which tool would resolve it. Never fabricate content from a document you could not access. The coordinator uses `access_blocked` leads to surface a "Known blind spots" section in the final report.
**Source quality tiers:**
- `primary`: peer-reviewed research, official documentation, primary data — note specific evidence
- `secondary`: credible journalism, institutional reports (expert blogs require explicit credentials to qualify)
- `unverified`: forums, personal blogs, social media, paywalled sources (you can't read it = can't verify it)
**Search budget (default, override via `state.json → search_budget_per_tier`):** 8 searches for Scout-tier agents; 15 searches for Researcher-tier agents. (See the Model-tier glossary above — tier labels, not model IDs, drive this allocation.)
(Counter-evidence searches count against the budget — plan accordingly.)
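To make the claim-row and yield-header requirements concrete, here is a minimal sketch of one parsed Claims Register row and one Source Table header line. The on-disk findings file remains markdown as specified above; the claim text, URL, and all values shown here are purely illustrative.

```python
# Illustrative only: one Claims Register row carrying the fields instruction 13 requires.
claim_row = {
    "claim": "Library X is the most widely adopted option for task Y",   # hypothetical claim
    "source_url": "https://example.org/survey-2025",                     # hypothetical inline source URL
    "corroboration": "two_independent_sources",     # single_source | two_independent_sources | three_or_more_independent_sources
    "counter_evidence_searched": "yes_none_found",  # or yes_disconfirming_evidence_present | no_search_skipped
    "recency_class": "fresh",                       # fresh | stale | undated (freshest source wins)
}

# Yield-signal header (instruction 3), written at the top of the Source Table.
source_table_header = (
    "adapter_pool_yield: 4 | llm_memory_verified: 2 | websearch_yield: 5 | "
    "total_unique_domains: 9 | total_unique_publishing_entities: 7"
)
```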
Two-Phase Deep Dive (optional, for low-exhaustion directions)
When a direction returns exhaustion_score ≤ 2:
Scout pass (Haiku, 5 searches): Identify 3-5 most relevant papers, return as short list
Researcher pass (Sonnet, 10 searches): Deep dive using scout's list as starting context
Costs ~7 Scout-tier equivalents, versus roughly 60–100 Scout-tier equivalents for a full Deep Dive re-exploration.
Trigger: exhaustion_score <= 2 AND duplication[direction] < 2
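A minimal sketch of the trigger check, assuming the coordinator keeps per-direction exhaustion scores and a duplication counter in its state; the field names are illustrative.

```python
def should_two_phase_deep_dive(direction_id: str, state: dict) -> bool:
    """Fire the Scout-then-Researcher two-phase dive only for low-exhaustion,
    low-duplication directions (exhaustion_score <= 2 AND duplication < 2)."""
    exhaustion = state["directions"][direction_id]["exhaustion_score"]
    duplication = state.get("duplication", {}).get(direction_id, 0)
    return exhaustion <= 2 and duplication < 2
```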
--auto Flag
When --auto is passed:
All prospective round gates are skipped
Runs until saturation (coverage plateau or frontier exhaustion), NOT until max_rounds
max_rounds triggers a progress log entry, not a stop — the coordinator continues past it
Absolute hard ceiling: max_rounds * 5 (prevents truly infinite runs on pathological topics)
⚠️ Prefer saturation over budget limits — stopping early with unexplored frontier is the worse outcome
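A sketch of how the --auto stopping rule could be expressed. The saturation and frontier-exhaustion tests themselves are assumed to be computed elsewhere by the coordinator; only the precedence between them, max_rounds, and the hard ceiling is shown.

```python
def auto_mode_should_stop(round_num: int, max_rounds: int,
                          saturated: bool, frontier_empty: bool) -> bool:
    """Under --auto, max_rounds is only a log point; the run stops on saturation,
    frontier exhaustion, or the absolute hard ceiling of max_rounds * 5."""
    hard_ceiling = max_rounds * 5
    if saturated or frontier_empty:
        return True          # preferred stop: nothing left worth exploring
    if round_num >= hard_ceiling:
        return True          # pathological-topic guard; never run unbounded
    if round_num >= max_rounds:
        print(f"progress: round {round_num} exceeds max_rounds={max_rounds}; continuing (--auto)")
    return False
```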
Adapter canary verification: on first use of each adapter in a run, the coordinator fires that adapter's canary query. Parse failure / missing fields / below-threshold result count → adapter_drift_suspected. The adapter is marked unavailable for this run, logged to state.json → adapter_drift_log, and COVERAGE_CAVEAT_ADAPTER_DRIFT: {name} is emitted in the final report. If >3 adapters drift in the same run → SYSTEM_ALERT: probable_mass_api_change is surfaced to the user at the next round gate (recommend canary refresh before continuing).
Iterative Gates (v6 upgrade from v4's one-shot)
v4.1 cascade-gate and language-gap-gate fired at most once. v6 allows bounded iteration:
Each gate has max_iterations: 3 (hard cap).
After each injected round, re-run the auditor. If the triggering condition persists AND iteration_count < 3 AND budget remains ≥ 1 round → fire again.
Exponential cost check: each iteration N costs roughly 1.5× the previous (because injected directions target increasingly-specific cluster outliers / missing languages — narrower = more searches). Coordinator computes projected remaining cost per iteration; user gate at iteration_count ≥ 2 even under --auto.
gate_iteration_log in state.json records: {gate_type, iteration, trigger_reason, round_budget_remaining, outcome: {resolved|persists|abandoned_budget}}.
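A sketch of the bounded gate iteration. The helpers run_auditor, inject_round, and ask_user are placeholders for coordinator behavior described elsewhere in this skill; the 1.5× cost growth is the calibration assumption stated above.

```python
def iterate_gate(gate_type, run_auditor, inject_round, ask_user, budget_rounds):
    """Re-fire a cascade / language-gap gate up to max_iterations = 3 while its
    trigger persists, with a user gate from iteration 2 onward (even under --auto).
    Each iteration is assumed to cost roughly 1.5x the previous one."""
    gate_iteration_log, projected_cost = [], 1.0
    for iteration in range(1, 4):                              # hard cap: 3
        trigger_reason = run_auditor(gate_type)                # falsy => condition resolved
        entry = {"gate_type": gate_type, "iteration": iteration,
                 "trigger_reason": trigger_reason, "round_budget_remaining": budget_rounds}
        if not trigger_reason:
            entry["outcome"] = "resolved"
            gate_iteration_log.append(entry)
            break
        if budget_rounds < 1:
            entry["outcome"] = "abandoned_budget"
            gate_iteration_log.append(entry)
            break
        if iteration >= 2 and not ask_user(gate_type, projected_cost):
            entry["outcome"] = "persists"                      # user declined another iteration
            gate_iteration_log.append(entry)
            break
        inject_round(gate_type, trigger_reason)
        budget_rounds -= 1
        projected_cost *= 1.5
        entry["outcome"] = "persists"
        gate_iteration_log.append(entry)
    return gate_iteration_log
```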
Retrieval-Path Attribution Validation (v6)
retrieval_path in the Source Table is agent-reported in v1–v5, and agents lazy-label every URL `websearch` because it is the default. v6 adds a coordinator-side cross-check:
Coordinator records a retrieval_path_ledger per round: every URL that entered Adapter Pool is tagged with the actual mechanism that produced it (directory, adapter:{name}, browse_follow, llm_memory_verified, citation-follow).
At round-end, coordinator diffs agent-reported retrieval_path against ledger. Mismatches → auto-corrected to ledger value + logged to retrieval_path_mismatch_log. Three mismatches from the same agent in a run → that agent's future findings are flagged retrieval_path_self_report_untrusted in final report.
The Breadth Auditor's yield_per_path table uses ledger values, not agent reports — so breadth claims can't be laundered through mislabeling.
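A sketch of the round-end cross-check, assuming the ledger and the agent-reported Source Table rows have both been loaded into dicts keyed by URL; the flag string mirrors the wording above.

```python
def reconcile_retrieval_paths(agent_rows: dict, ledger: dict, mismatch_log: list,
                              agent_id: str, mismatch_counts: dict) -> None:
    """Agent-reported retrieval_path is advisory; the coordinator's ledger is
    authoritative. Mismatches are auto-corrected and logged; three mismatches
    from one agent flag its future findings as self-report-untrusted."""
    for url, row in agent_rows.items():
        actual = ledger.get(url)
        if actual is None:
            continue                        # URL never entered the Adapter Pool: keep the self-report
        if row["retrieval_path"] != actual:
            mismatch_log.append({"agent": agent_id, "url": url,
                                 "reported": row["retrieval_path"], "actual": actual})
            row["retrieval_path"] = actual  # auto-correct to the ledger value
            mismatch_counts[agent_id] = mismatch_counts.get(agent_id, 0) + 1
    if mismatch_counts.get(agent_id, 0) >= 3:
        print(f"flag: {agent_id} -> retrieval_path_self_report_untrusted")
```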
Independent Synthesis Critic (Phase 5 Pass 6, v6)
In v0–v5, Phase 5 Pass 2 (theme extraction) and Pass 3 (final-report writing) are coordinator-done, with mini-syntheses as input rather than raw findings. Coordinator bias → cherry-picked theme confirmation. v6 adds Pass 6: an independent Synthesis Critic that reviews the FINAL REPORT against the RAW FINDINGS to catch theme-extraction bias, not just critique quality. The independence-invariant applies at synthesis too: the critic's output is appended, not edited by the coordinator.
Separately, v6 adds a run-halt condition. When the Breadth Auditor reports citation_cascade_risk: strong AND the risk persists after injection rounds:
→ the coordinator halts further rounds and emits RUN_HALTED_TOPIC_APPEARS_ILL_FORMED with: (a) the caveat list, (b) what WAS found, (c) the recommendation: "Either the seed topic is malformed, the information does not exist in accessible sources, or infrastructure has degraded. Refine the seed or consult the adapter_drift_log before retrying."
This prevents burning budget on runs where every signal says "we can't research this."
Claim-Temporal Consistency Check (v6)
Every claim row with a temporal qualifier ("as of", "in 2025", "current", "recently") is validated:
If claim text contains year/date: source's publication_date must be within ±12 months (unless claim itself is historical, detected by "in 1999" vs "as of 2025" structure).
Mismatches → temporal_inconsistency: true; claim downgraded to corroboration: single_source regardless of actual source count (because temporal drift invalidates corroboration logic).
Check runs at Phase 4 fact-verification, not at agent-side — coordinator-owned.
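A sketch of the coordinator-owned check, assuming publication dates are available as years and that historical claims ("in 1999") have already been distinguished from present-tense ones ("as of 2025"); the ±12 months window is approximated as ±1 calendar year.

```python
import re

def check_claim_temporal(claim_text: str, source_pub_year: int, historical: bool) -> bool:
    """Return True if a claim's temporal qualifier is consistent with its source's
    publication date. Historical claims are exempt. On failure the caller sets
    temporal_inconsistency: true and downgrades corroboration to single_source."""
    match = re.search(r"\b(19|20)\d{2}\b", claim_text)   # naive year extraction
    if historical or match is None:
        return True
    return abs(int(match.group(0)) - source_pub_year) <= 1
```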
Per-Theme Weighted Evidence Quality (v6)
Final report's single "Evidence Quality" score (v0–v5) conflates strong themes with weak ones. v6 decomposes:
Each theme in final report carries its own evidence-quality score, not one run-wide average. Reader sees which themes are load-bearing vs. which are gossamer.
Golden Rules Appended (v6 — #16–#18)
Adapter connectivity ≠ adapter correctness. A 200 OK response with junk JSON is worse than a 404. Canary-verify semantic shape per run.
Retrieval-path labels are coordinator-owned, not agent-owned. Trust the ledger, not the self-report. An agent's credibility degrades with every mismatch.
Compound caveats mean the run is broken. Three concurrent COVERAGE_CAVEAT tags is a stop-the-line signal, not a license to write a confidently-caveated report.
Anti-Rationalization Counter-Table Additions (v6)
| Excuse | Reality |
|---|---|
| "Adapter returned 200, we're good." | No. Check the canary fingerprint. APIs silently rename fields; 200 with junk JSON is a worse failure mode than 404. |
| "The gate fired once, we're done with that kind of issue." | No. v6 iterates gates up to 3×. One failed injection doesn't mean the problem isn't solvable — it means this specific injection didn't solve it. |
| "Agent said the source came from citation-follow, that's what the report says." | No. The coordinator's retrieval ledger is authoritative. Agent self-reports are advisory until validated against the ledger. |
| "The caveats are all surfaced — we did our job." | No. Three or more compound caveats = topic is ill-formed. Halt, don't launder. |
| "Evidence Quality is 0.7, the report is solid." | No. Per-theme weighted scores reveal which themes are load-bearing. A single average hides weak themes under strong ones. |
| "Synthesis is mini-synthesis-based — critics already reviewed it." | No. The Synthesis Critic (Pass 6) reviews the FINAL REPORT against the RAW FINDINGS to catch theme-extraction bias, not just critique quality. The independence-invariant applies at synthesis too. |
Adversarial-Source Heuristics (v7 — at Fact Verification)
Commercial / SEO-gamed / astroturfed / content-farm sources can pass v0–v6 filters if they have primary-source-shaped metadata. v7 adds automatic down-ranking signals applied at Phase 4 fact-verification:
For each source in Source Table, coordinator computes:
domain_age_months: If WebFetch is available, check web.archive.org/web/{date}/*/domain.tld for first-seen timestamp. If <6 months → new_domain_risk: true. If WebFetch/web access unavailable (internal-only environment), skip this check and log domain_age_check: skipped_no_web_access.
sibling_density: WebSearch site:domain.tld + count results. If >100k results with near-identical URL patterns (heuristic: >5 URLs sharing a stem like /reviews/, /guides/, /best-{word}) → content_farm_risk: true
press_release_signature: body text starts with "[CITY, STATE] –" or contains the phrase "For immediate release" or identical lead paragraphs appear on ≥3 other domains → syndication_detected: true
affiliate_link_density: count of ?ref=, /aff/, ?tag=, ?utm_, shortener-redirect patterns in outbound links. If >30% of outbound links are affiliate → commercial_intent_high: true
Sources with ≥2 risk flags → downgraded to unverified tier regardless of claimed tier; contribute 0.2× to corroboration weight. Flagged in final report's Single-Source Claims section as [ADVERSARIAL_SOURCE_RISK].
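A sketch of the flag-aggregation step, assuming the four per-source signals have already been computed as booleans upstream; the ≥2-flag threshold and 0.2× weight mirror the rule above.

```python
def adversarial_downgrade(source: dict) -> dict:
    """Count risk flags and downgrade the source if two or more are set.
    Downgraded sources keep their flags visible for the Single-Source Claims
    section ([ADVERSARIAL_SOURCE_RISK]) and weigh 0.2x toward corroboration."""
    flags = [source.get("new_domain_risk", False),
             source.get("content_farm_risk", False),
             source.get("syndication_detected", False),
             source.get("commercial_intent_high", False)]
    if sum(flags) >= 2:
        source["tier"] = "unverified"            # regardless of claimed tier
        source["corroboration_weight"] = 0.2
        source["adversarial_source_risk"] = True
    return source
```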
Agent-Level Hallucination Probe (Phase 3.75, v7)
Phase 4 (fact-verification) spot-checks report-level claims, but by the time it runs, agent findings can already contain fabricated URLs, misattributed quotes, or invented statistics that pass dedup and appear primary-tier.
Phase 3.75 runs after each round's agent output, before unconsumed-leads recovery:
For each agent, random-sample 3 of its load-bearing claims (those with corroboration != single_source to avoid redundancy with Phase 4 single-source flagging)
WebFetch the cited URL, grep for claim text (±20% keyword overlap allowed)
If claim text NOT findable in source → mark hallucination_suspected: true on that claim, downgrade corroboration one tier, log to hallucination_log
If ≥2 of 3 sampled claims fail for the same agent → mark that agent's remaining findings agent_credibility_degraded; coordinator surfaces "Agent-level hallucination risk" as a caveat
Budget (calibration default — override via state.json → hallucination_probe_budget): 3 fetches per agent per round. At the default cap of 6 agents/round this works out to ~18 extra calls/round, roughly 6 Scout-tier equivalents/round. Scale with max_agents_per_round; raise for high-stakes runs where fabricated URLs are expensive, lower for exploratory runs where probe cost outweighs risk.
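A sketch of the per-agent sampling and overlap test, assuming a fetch_text(url) helper exists in the coordinator; "claim text findable in the source" is approximated here as keyword overlap, matching the ±20% allowance above.

```python
import random

def keyword_overlap(claim: str, page_text: str) -> float:
    """Fraction of the claim's (lowercased) keywords that appear in the page."""
    words = {w for w in claim.lower().split() if len(w) > 3}
    return sum(w in page_text.lower() for w in words) / max(len(words), 1)

def probe_agent(claims: list[dict], fetch_text, budget: int = 3) -> int:
    """Sample up to `budget` multi-source claims, fetch each cited URL, and mark
    hallucination_suspected when keyword overlap falls below ~0.8. Returns the
    failure count; >= 2 failures means agent_credibility_degraded for the caller."""
    candidates = [c for c in claims if c["corroboration"] != "single_source"]
    failures = 0
    for claim in random.sample(candidates, min(budget, len(candidates))):
        page = fetch_text(claim["source_url"])
        if keyword_overlap(claim["claim"], page) < 0.8:
            claim["hallucination_suspected"] = True   # caller downgrades corroboration one tier
            failures += 1
    return failures
```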
Cross-Round Direction Momentum (v7)
v0–v6 treats every round independently except via unconsumed-leads. v7 adds explicit learning across rounds:
After each round, coordinator computes per-dimension yield: (unique_primary_sources / search_budget_spent). The highest-yield dimension gets momentum_bonus: +1 priority tier for directions in subsequent rounds. The lowest-yield dimension gets anti_momentum: -1 priority tier (directions deprioritized, not eliminated).
Written to state.json → dimension_momentum. Resets every 3 rounds to prevent over-fitting. Golden Rule 19: coordinator never eliminates a dimension based on momentum alone — the cross-cutting mandatory dimensions always run at least once.
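A sketch of the per-round momentum update, assuming per-dimension tallies of unique primary sources and spent search budget are available; the reset every 3 rounds and the "mandatory dimensions always run" rule are left to the caller.

```python
def update_dimension_momentum(yields: dict[str, tuple[int, int]], state: dict) -> None:
    """yields maps dimension -> (unique_primary_sources, search_budget_spent).
    Highest-yield dimension gets +1 priority tier, lowest gets -1; nothing is
    eliminated on momentum alone (Golden Rule 19)."""
    ratios = {dim: srcs / max(spent, 1) for dim, (srcs, spent) in yields.items()}
    best = max(ratios, key=ratios.get)
    worst = min(ratios, key=ratios.get)
    state.setdefault("dimension_momentum", {})
    state["dimension_momentum"][best] = +1    # momentum_bonus
    state["dimension_momentum"][worst] = -1   # anti_momentum: deprioritized, not dropped
```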
Per-Subtopic Locus Refinement (v7)
v4's language_locus and v3's topic class are each classified once per run. For topics with multiple sub-goals (extracted in Phase 1 sub-goal extraction), each sub-goal gets its own locus determination:
After sub-goal extraction, coordinator fires one additional Haiku call per sub-goal (reusing the Phase 0f + Phase 0g prompts, scoped to the sub-goal question). Writes state.json → subgoal_locus[{subgoal_id}]. Adapter pool for directions owned by a given sub-goal uses that sub-goal's locus, not the seed-level locus.
Cost: +1 Scout-tier agent per sub-goal beyond the first. Prevents the multi-era / multi-domain topic from getting locked into a single locus determined by the loudest sub-goal.
User Pivot Injection (v7)
v0–v6 user interaction is binary at gates: continue / stop. v7 adds mid-run pivot: at any round gate, user may respond redirect:{refined_seed_or_focus}. Coordinator:
Writes refined seed to state.json → pivot_log
Re-runs Phase 0f + Phase 0g + Phase 1 dimension re-assessment (NOT Phase 0b/0c/0e — those stay locked from original seed)
Preserves existing findings but marks them pre_pivot_context
Round counter resumes (pivot does not reset max_rounds)
Final report includes both pre-pivot and post-pivot findings with a PIVOT_BOUNDARY section marker
Prevents the user from having to abandon a half-good run to restart.
Claim-Provenance Graph (v7)
For topics with ≥20 claims in the final report, the coordinator emits a provenance graph (GraphViz DOT or JSON adjacency) as claim_provenance.dot: nodes are claims and their cited sources, and edges trace each claim back to the sources that support it, so cascades in which many claims rest on a handful of original studies are visible instead of hidden by raw claim count.
Separately, v7 tracks per-agent retrieval cost efficiency: agents whose cost-per-unique-URL exceeds 2× the round median are flagged inefficient_retrieval. Over multiple runs, this feeds back to the Adapter Registry: adapters consistently contributing inefficient agents get deprioritized.
Temporal-Source-Cluster Detection (v7)
If a disproportionate share of cited sources clusters in a narrow time window, emit COVERAGE_CAVEAT_TEMPORAL_CLUSTER: {window}. Default threshold: ≥60% in a 6-month window for stable topics, ≥80% for fast_moving topics (where temporal clustering is expected). Threshold is configurable via state.json → temporal_cluster_threshold. This catches runs that unintentionally capture a hype-cycle snapshot rather than a longitudinal view. User may rerun with explicit date-range directions to resolve.
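A sketch of the cluster test, assuming each source carries a publication date; the 0.6 / 0.8 thresholds mirror the defaults above and are overridable via state.json → temporal_cluster_threshold.

```python
from datetime import date

def temporal_cluster_caveat(pub_dates: list[date], fast_moving: bool,
                            window_months: int = 6) -> str | None:
    """Return a COVERAGE_CAVEAT_TEMPORAL_CLUSTER string if >=60% (stable topics)
    or >=80% (fast_moving topics) of dated sources fall inside any 6-month window."""
    threshold = 0.8 if fast_moving else 0.6
    if not pub_dates:
        return None
    months = sorted(d.year * 12 + (d.month - 1) for d in pub_dates)
    for i, start in enumerate(months):
        in_window = sum(1 for m in months[i:] if m - start < window_months)
        if in_window / len(months) >= threshold:
            return (f"COVERAGE_CAVEAT_TEMPORAL_CLUSTER: {window_months}-month window "
                    f"starting {start // 12}-{start % 12 + 1:02d}")
    return None
```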
Golden Rules Appended (v7 — #19–#21)
Momentum is a bias correction, not a filter. Cross-cutting mandatory dimensions run every round regardless of momentum. A zero-yield dimension in round N may yield in round N+1 with refined queries.
Agent hallucination is a tier-1 failure mode. Spot-check every agent's non-single-source claims every round. A confident agent fabricating 2 of 3 sampled claims poisons the entire run if un-flagged.
Commercial intent ≠ invalid source. Adversarial-source flags DOWNGRADE, they do not reject. A content farm article may still cite a real primary source — use the flag to adjust confidence, not to discard.
Anti-Rationalization Counter-Table Additions (v7)
| Excuse | Reality |
|---|---|
| "The source has a clean URL, looks legit." | No. Check domain age, sibling density, syndication. A 3-month-old .com with 100k SEO-shaped URLs is not a primary source. |
| "The agent cited a real URL, the claim's fine." | No. The Phase 3.75 hallucination probe verifies the claim APPEARS in the source. Cited ≠ supported. |
| "This dimension yielded nothing last round — skip it." | No. Momentum is a priority adjustment, not a filter. Cross-cutting dimensions run every round. |
| "One Language Locus classification is enough for the whole topic." | No. Sub-goal-scoped locus catches multi-era/multi-domain drift. Cost per sub-goal is 1 Scout-tier agent. |
| "Final report has 40 claims, we're well-corroborated." | No. Render the provenance graph. If 40 claims trace to 3 original studies, cascading is hidden by raw count. |
| "Evidence Quality is averaged — 0.7 means strong." | v6 addressed this per-theme. v7 adds per-source cost efficiency. An agent producing one primary source at Researcher-tier cost is fine; one producing 20 duplicates at the same cost is not. |
| "All sources are from 2025, that's current." | No. Temporal-cluster detection: 60% in 6 months = hype-cycle snapshot. Flag + consider rerun with date-range directions. |
Contrarian Pre-Synthesis (v8)
Before Phase 5 synthesis, spawn a Contrarian Agent (Sonnet, independent). Its one job: argue against the dominant framing the coordinator summary has converged on.
Inputs: all mini-syntheses + coordinator summary's current dominant_framing field + full Claims Register.
Prompt:
The coordinator summary claims the dominant framing of this research is: "{dominant_framing}".
Your job: argue the strongest case AGAINST this framing using ONLY evidence already in the findings. If the evidence doesn't support a strong counter-case, say so explicitly.
Output:
COUNTER_CASE_START
counter_framing: "..."
supporting_findings: [{finding_id, claim_summary}, ...] // must cite actual findings, not fabricate
strength_vs_dominant: strong | moderate | weak | no_viable_counter
overlooked_angles: [angles the dominant framing dismisses or doesn't address]
COUNTER_CASE_END
Coordinator rules:
strength_vs_dominant: strong → dominant framing demoted; final report presents BOTH framings as contested, with a "Framing Debate" section
strength_vs_dominant: weak → counter framing listed as "Minority view" in appendix
no_viable_counter → dominant framing confirmed, BUT coordinator logs this outcome for audit (a "no viable counter" result on every run signals Contrarian Agent is rubber-stamping; threshold >80% → agent prompt revision required in skill-level maintenance)
Independence-invariant applies — Contrarian Agent output is never edited by coordinator, appended verbatim to final report as "Contrarian Audit" section.
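A sketch of how the coordinator could route the parsed COUNTER_CASE block. The rules above only pin down strong, weak, and no_viable_counter; grouping moderate with weak is an assumption of this sketch, and the >80% rubber-stamp threshold is tracked across runs, not within one.

```python
def route_counter_case(counter: dict, report: dict, audit_log: list) -> None:
    """Apply the coordinator rules for strength_vs_dominant. The contrarian output
    itself is appended verbatim elsewhere (independence-invariant); this only
    decides how the dominant framing is presented."""
    strength = counter["strength_vs_dominant"]
    if strength == "strong":
        report["framing_status"] = "contested"
        report.setdefault("sections", []).append("Framing Debate")
    elif strength in ("moderate", "weak"):
        # "moderate" handling is not specified above; treated like "weak" here as an assumption.
        report.setdefault("appendix", []).append({"minority_view": counter["counter_framing"]})
    else:  # no_viable_counter
        report["framing_status"] = "confirmed"
        audit_log.append("no_viable_counter")  # >80% rate across runs => revise the contrarian prompt
```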
Run Reproducibility Hash (v8)
At Phase 2 (Initialize State), the coordinator computes a run fingerprint: a hash over the run's configuration, covering at minimum the adapter versions and model identifiers in use (the exact field set is implementation-defined).
Written to state.json → run_fingerprint. Surfaced at the top of the final report.
Two runs with matching fingerprints but different outputs = LLM non-determinism (expected, minor), OR underlying source state changed (flag).
Different fingerprints cannot be compared directly — final report explicitly notes "Reproducibility: results are a snapshot; fingerprint {hash}".
If user attempts to resume a run with a mismatched fingerprint (e.g. adapter versions have drifted since original run) → coordinator warns and offers fresh start or continue-with-caveat.
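One way the fingerprint could be computed, assuming the hashed field set is the adapter versions and model identifiers noted above; the exact composition and hash length are not mandated here.

```python
import hashlib, json

def run_fingerprint(adapter_versions: dict[str, str], models: dict[str, str]) -> str:
    """Stable hash over the run's configuration; written to state.json ->
    run_fingerprint and surfaced at the top of the final report."""
    payload = json.dumps({"adapters": adapter_versions, "models": models}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```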
Cross-Run Adapter State Cache (v8)
v4.1's adapter_validation_log is per-run. v8 persists a rolling 30-day cache in ~/.claude/deep-research-cache/adapter_health.jsonl: known-broken adapters are skipped without burning fresh canary calls, and chronically unreliable adapters are flagged across runs.
Perspective-Lens Distribution (v8)
The final report's per-theme weighted evidence score (v6) now also reports perspective_distribution: {academic: 40%, industry: 30%, ...}. A theme with 100% industry perspective on a "best practices" topic is flagged COVERAGE_CAVEAT_SINGLE_PERSPECTIVE. This complements the Breadth Auditor — it catches framing bias the auditor misses because all sources are different URLs but from the same perspective class.
Semantic-Staleness Check (v8)
For claims that use domain jargon, coordinator runs a 1-call Haiku check per unique term: "Is this term the currently-preferred name for {concept} as of {current_year}, or has it been superseded? If superseded, by what?"
Claims using superseded jargon (e.g. "artificial general intelligence" where the current field term is "frontier models") → tagged semantic_staleness_risk: true and flagged in report. Prevents confidently-cited sources with dated vocabulary from anchoring modern claims.
Cost: ~5 Scout-tier calls per run. Only fires for topics with identifiable domain jargon (detected by >=3 claims sharing same technical terms).
Coverage-Confidence Decoupling (v8)
v0–v7 blur Coverage % (how many angles explored) with Evidence Quality (how strong evidence per angle). v8 decomposes final report header into:
breadth_score: 0–1, fraction of applicable dimensions explored
depth_score: 0–1, mean per-direction exhaustion score
corroboration_score: 0–1, fraction of claims with ≥2 independent sources
counter_evidence_score: 0–1, fraction of claims where counter-evidence was actively searched
fabrication_risk_score: 0–1, inverse of hallucination-probe pass rate (v7)
perspective_diversity_score: 0–1, Shannon entropy over perspective_lens distribution (v8)
temporal_diversity_score: 0–1, inverse of temporal cluster concentration (v7)
Header reads: Coverage: breadth=0.82, depth=0.67, corroboration=0.71, counter_evidence=0.58, fabrication_risk=0.12, perspective_diversity=0.45, temporal_diversity=0.62. User gets honest orthogonal signals, not a single summary score that averages them.
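A sketch of how the decomposed header might be assembled once the per-axis inputs have been computed; the normalized Shannon entropy shown for perspective diversity is one reasonable choice, not a mandated formula.

```python
import math

def shannon_diversity(dist: dict[str, float]) -> float:
    """Normalized Shannon entropy over a perspective-lens distribution (0..1)."""
    ps = [p for p in dist.values() if p > 0]
    if len(ps) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in ps)
    return h / math.log(len(ps))

def coverage_header(scores: dict[str, float]) -> str:
    """Render the orthogonal (never averaged) coverage axes for the report header."""
    order = ["breadth", "depth", "corroboration", "counter_evidence",
             "fabrication_risk", "perspective_diversity", "temporal_diversity"]
    return "Coverage: " + ", ".join(f"{k}={scores[k]:.2f}" for k in order)
```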
Golden Rules Appended (v8 — #22–#24)
Dominant framing survives because it wasn't challenged. Contrarian Pre-Synthesis is non-optional. A report that says "framing confirmed" without a Contrarian Audit section has not verified its own framing.
Single-perspective-class coverage on a multi-stakeholder topic is a framing failure. A 100%-industry report on "best practices" is an industry whitepaper, not research. Flag and force explicit acknowledgment.
Confidence scores are orthogonal, not averaged. Breadth, depth, corroboration, counter-evidence, fabrication-risk, perspective diversity, temporal diversity are different questions. A single composite score hides weak axes under strong ones.
Anti-Rationalization Counter-Table Additions (v8)
| Excuse | Reality |
|---|---|
| "The framing is obvious from findings." | No. The Contrarian Agent argues against it using the same findings. If it can build even a moderate case, your "obvious" framing isn't. |
| "No need to hash the run — findings speak for themselves." | No. The reproducibility fingerprint captures which adapter versions + models generated the findings. Without it, re-running 30 days later might produce different results from silently-changed APIs. |
| "Adapters are checked per run — we're good." | v4.1's per-run canary is fine but wastes calls on known-broken adapters. The v8 cross-run cache saves budget and flags chronic-unreliable adapters. |
| "All our sources are legit primary papers." | Check the perspective_lens distribution. Primary + 100% academic on an applied topic = industry perspective absent = framing hole. |
| "The term is well-known in the field." | Semantic-staleness check: terms get superseded. A 2015-era canonical term cited in a 2026 report signals the research found old material. |
| "Coverage is 0.82, we're good." | Decompose: breadth, depth, corroboration, counter_evidence, fabrication_risk, perspective_diversity, temporal_diversity. A 0.82 average can hide any one axis at 0.3. |
| "The contrarian agent said 'no viable counter' — case closed." | Rubber-stamping pattern. If this fires on every run, the Contrarian Agent prompt needs revision. Track the rate. |
Ceiling assessment (v8)
After v8, the remaining known gaps require one or more of the following:
External infrastructure the skill cannot conjure (embedding models, vector DBs, MCP servers for CNKI/Baidu, authenticated paywall access)
Human judgment the skill cannot replace (final-answer question "is this a good research report for MY purpose")
Tool-level changes outside the skill's scope (WebSearch index choice, WebFetch reliability, LLM hallucination rates)
Iterations v6→v7→v8 each added ~4-8 mechanisms of decreasing marginal impact. Further versions would increasingly add ceremony without reaching new retrieval surface, bias classes, or failure modes that the current spec doesn't already observe and caveat. The skill's coverage of observable failure modes is effectively saturated relative to what its available tools can reach.
Adapter Registry (v3)
Each adapter is a deterministic fetch recipe. The coordinator (Phase 3.3, before agent spawn) selects applicable adapters by the topic-class heuristic below, runs them in parallel, deduplicates URLs, writes results to deep-research-{run_id}/adapter_pools/direction-{id}.json, and passes the path to the research agent.
Each adapter entry is defined by these fields:
- `name`: string
- `applicable_topic_classes`: [string, ...]
- `endpoint`: URL template with `{topic}` placeholder
- `method`: "webfetch_json" | "webfetch_html" | "websearch"
- `parse_rules`: JSONPath or HTML selector → list of {url, title, year?, snippet?}
- `yield_signal`: per-call count + unique-URL count written to adapter_pools/{direction}.json header
- `failure_mode`: on 403/404/timeout → record `adapter_failed: {name}, reason: ...` in pool JSON, continue with other adapters
Any adapter call that returns 403 | 404 | timeout | empty is recorded as adapter_failed in the pool JSON — coordinator never retries the same adapter within a run
Per-run adapter call budget (calibration default — override via state.json → adapter_budget): 5 adapters × 1 call per direction = 5 fetches per direction at Phase 3.3. For a 20-direction seed that lands around 100 adapter calls and ~30s wall-clock with parallelization — but both numbers are estimates that scale linearly with initial_directions × adapters_per_direction and depend on network latency. Treat as a rough sizing signal, not a guarantee; runs over congested links or with slow adapters routinely take 2–3× longer.
If ≥ 3 of 5 adapters fail for a given direction, coordinator tags the direction low_pool_yield — the research agent compensates by shifting its budget to WebSearch
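A sketch of the Phase 3.3 fanout the registry describes, assuming a run_adapter(adapter, topic) helper that applies the adapter's parse_rules and returns {url, title, ...} dicts or raises on failure; budget, dedup, failure recording, and the low_pool_yield rule follow the text above.

```python
import json, pathlib

def build_adapter_pool(direction: dict, adapters: list[dict], run_adapter, run_dir: str) -> str:
    """Run up to 5 applicable adapters once each, dedupe by URL, record per-adapter
    yield and failures, and write the pool JSON the research agent starts from."""
    pool, yields, failures, seen = [], {}, [], set()
    for adapter in adapters[:5]:                       # default budget: 5 adapters x 1 call
        try:
            results = run_adapter(adapter, direction["question"])
        except Exception as exc:                       # 403 / 404 / timeout / empty
            failures.append({"adapter_failed": adapter["name"], "reason": str(exc)})
            continue                                   # never retry the same adapter within a run
        yields[adapter["name"]] = len(results)
        for r in results:
            if r["url"] not in seen:
                seen.add(r["url"])
                pool.append({**r, "retrieval_path": f"adapter:{adapter['name']}"})
    if len(failures) >= 3:
        direction["low_pool_yield"] = True             # agent shifts its budget to WebSearch
    path = pathlib.Path(run_dir, "adapter_pools", f"direction-{direction['id']}.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"yield_signal": yields, "failures": failures, "results": pool}, indent=2))
    return str(path)
```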
Cross-Lingual Adapter Extension (v4)
Activated when state.language_locus.authoritative_languages contains non-en entries. For each non-en language in the locus, coordinator:
Translates seed query into target language via LLM (Haiku) before firing language-specific adapters. Writes translations to xlang_queries/{lang}.txt. Each WebSearch / API call is tagged with query_language.
Picks language-specific adapter variants; per the v4.1 status-honesty note below, only two classes of variants are currently verified working:
- `{lang}.wikipedia.org` (encyclopedic)
- OpenAlex API with `filter=language:{xx}` (academic/bibliographic)
Status honesty (v4.1): Of the 15 language-specific endpoints originally proposed, only 2 classes are verified working via WebFetch from Claude Code: (a) {lang}.wikipedia.org and (b) OpenAlexAPI with filter=language:{xx}. HAL, SciELO, J-STAGE, CiNii, Baidu Scholar, CNKI were tested live and all returned 403 / redirects / ECONNREFUSED / auth walls. Do not add an adapter to this table without a successful live probe recorded in state.json → adapter_validation_log. Non-academic / community / industry content in non-EN languages is reached via WebSearch country-TLD operators (verified working), NOT via specialized scrapers.
Check adapter_validation_log before adding new ones
Adapter validation protocol (v4.1, mandatory): Before any cross-lingual adapter runs for the first time in a run, the coordinator performs a 1-call live probe. If probe returns non-2xx OR empty body within 10s, the adapter is marked unavailable_this_run in state.json → adapter_validation_log and skipped for the remainder of the run. This prevents the language-gap gate from injecting rounds that quietly 403 on every adapter and then declaring the gap "unresolved."
Honest scope of cross-lingual coverage: With verified-working adapters only, v4.1 reliably reaches:
Non-EN encyclopedic and academic/bibliographic metadata (via {lang}.wikipedia.org and OpenAlex language filters)
Non-EN community / industry / news (via WebSearch country-TLD operators — breadth varies by WebSearch index bias)
v4.1 does NOT reliably reach: paywalled non-EN databases (CNKI, AskZad), JS-heavy non-EN archives (CiNii, J-STAGE), region-blocked sources (Baidu Scholar). These are flagged in the final report as COVERAGE_CAVEAT_LANGUAGE_GAP_PARTIAL with the specific source types unreachable.
Source-language tagging. Every source entering the Source Table gets source_language (ISO 639-1):
URL TLD heuristic first (*.de, *.jp, *.cn)
Domain-specific default if known (zh.wikipedia.org → zh)
LLM content inspection as fallback (cheap — 100-char prefix check)
Unknown → source_language: und (undetermined). Unresolved `und` sources are surfaced in the final report under "Language-ambiguous sources."
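A sketch of the tagging fallback chain, assuming a cheap detect_language_llm(prefix) helper for the final step; the TLD and domain maps shown are illustrative, not exhaustive.

```python
from urllib.parse import urlparse

TLD_LANG = {"de": "de", "jp": "ja", "cn": "zh", "fr": "fr"}          # illustrative subset
DOMAIN_LANG = {"zh.wikipedia.org": "zh", "ja.wikipedia.org": "ja"}   # known domain defaults

def tag_source_language(url: str, content_prefix: str, detect_language_llm) -> str:
    """ISO 639-1 tag via: (1) URL TLD heuristic, (2) known domain default,
    (3) cheap LLM inspection of a ~100-char prefix, else 'und' (undetermined)."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld in TLD_LANG:
        return TLD_LANG[tld]
    if host in DOMAIN_LANG:
        return DOMAIN_LANG[host]
    guess = detect_language_llm(content_prefix[:100])
    return guess if guess else "und"
```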
Cross-lingual synthesis. Non-en findings enter the coordinator summary with LLM-translated English abstracts, preserving the original-language URL + a [translated_from: {lang}] tag. Native-language quotes are retained verbatim for the final report's Evidence section.
Cost impact. ~2-4 Scout-tier equivalents additional per run when cross-lingual mode activates. LLM translation of 5 seed queries × 3 non-en languages ≈ 15 Scout-tier calls. Adapter fetches add ~10–15 retrieval calls. Gated by user consent at Phase 1 scope declaration; skipped entirely for en_dominant topics.
Execution routing (sagaflow-first — Contract 6)
Sagaflow is the default execution path. The in-session workflow above is the FALLBACK, used only when the sagaflow worker is confirmed unavailable. See _shared/execution-model-contracts.md Contract 6.
Routing sequence (mandatory before any in-session work):
Run sagaflow doctor
If healthy → launch via sagaflow below. Stop. Do not run in-session.
If unhealthy → log SAGAFLOW_UNAVAILABLE, proceed with in-session fallback.
Substitute <SEED> with the research topic/question and <N> with the direction budget (default 5, max ~15). The workflow writes its report to ~/.sagaflow/runs/<run_id>/research-report.md with the executive summary, per-direction findings, cross-cutting analysis, fact-verification spot-checks, coverage, sources, and termination label.