| name | deep-research |
| description | Advanced deep research (L2) — runs /research (L1) first, then adds reflection loop, contradiction detection, and tree depth 2 for follow-up questions. ~12 min, 20-30 sources, ~2000 word report. Use for serious questions, technology choices, non-trivial investigations. |
| user_invocable | true |
Deep Research — L2 Advanced
⚠️ Read this first. This skill has four mandatory verification checkpoints baked into the pipeline. They prevent the "hollow synthesis" failure mode where the model skips the scrape step and writes a report citing sources that don't exist on disk. Run the Bash verification blocks at every checkpoint. Do not skip them. See "Why these checkpoints exist" for the full story.
Advanced research tier. Composes on top of L1 (/research) by adding:
- Reflection loop — find gaps in L1 report, do second-pass search
- Contradiction detection — explicitly surface disagreements between sources
- Tree depth 2 — follow-up questions per branch
- Confidence grading per claim (High / Medium / Low)
Position in ladder: L2. Calls L1 as its foundation. Called by L3 (expert-research), L4, L5.
When to Use
- "Which is better, X or Y?" (comparisons)
- Technology choices: "Is it worth migrating from A to B?"
- Non-trivial "How does X work?" questions where L1 might miss nuance
- Investigations that need checking multiple angles
- Default when the user asks for a "detailed", "serious", or "deep" answer
Escalate to:
/expert-research (L3): when the decision is strategic or you need a critic agent
/academic-research (L4): when you need scientific or academic sources
/ultra-research (L5): when you want a full knowledge base
Budget
- Time: ~12 min total (L1: ~5 min + L2 layer: ~7 min)
- Tavily credits: ~50 (L1: ~20 + L2: ~30)
- Sub-questions: 5 (L1: 3 + L2: +2 follow-ups)
- Sources: 20–30 (L1: 10–15 + L2: +10–15)
- Output: structured report ~2000 words
Pipeline
┌─────────────────────────────────────────┐
│ STAGE 1: Execute /research (L1) │
│ — plan → search → read → summarize │
│ — produces L1/report.md + summaries │
└──────────────────┬──────────────────────┘
│
🛑 CHECKPOINT 1: Verify L1
│
▼
┌─────────────────────────────────────────┐
│ STAGE 2: L2 Reflection Layer │
│ 1. GAP ANALYSIS — find holes in L1 │
│ 2. FOLLOW-UP PLAN — 2 new sub-qs │
│ 3. DEEPER SEARCH — search the gaps │
│ 🛑 CHECKPOINT 2: Verify URLs │
│ 4. READ + SUMMARIZE new sources │
│ 🛑 CHECKPOINT 3: Verify scrapes │
│ 5. CONTRADICTION DETECTION │
│ 6. CONFIDENCE GRADING │
│ 7. ENRICHED SYNTHESIS │
│ 8. BIBLIOGRAPHY │
│ 🛑 CHECKPOINT 4: Final verify │
└─────────────────────────────────────────┘
Artifacts Directory
.firecrawl/research/<slug>/
├── L1/ # produced by /research
│ ├── plan.md
│ ├── sources/
│ ├── report.md
│ └── bibliography.md
└── L2/ # produced by this skill
├── gap-analysis.md # what L1 missed
├── followup-plan.md # 2 new sub-questions
├── search-4.json # raw search results, subq 4
├── search-5.json # raw search results, subq 5
├── sources/ # NEW sources only (don't duplicate L1)
│ ├── 16-<slug>.md
│ ├── 16-<slug>.sum.md
│ └── ...
├── contradictions.md # flagged disagreements
├── confidence.md # per-claim confidence levels
├── report.md # enriched report (supersedes L1/report.md)
└── bibliography.md # combined L1 + L2 sources
Stage 1: Execute L1
Invoke the research skill directly. Do NOT duplicate L1 logic inline.
Use the Skill tool:
Skill: research
Args: <user's query>
Wait for L1 to complete.
🛑 CHECKPOINT 1: Verify L1 Foundation
v0.4.0: uses shared verification library. See verify-research.sh for the actual checks.
SLUG="<slug>"
VERIFY_LIB="$HOME/.claude/scripts/lib/verify-research.sh"
[ -f "$VERIFY_LIB" ] || VERIFY_LIB="scripts/lib/verify-research.sh"
[ -f "$VERIFY_LIB" ] || { echo "❌ verify-research.sh not found — run scripts/install.sh"; exit 1; }
source "$VERIFY_LIB"
verify_l2_checkpoint_1 "$SLUG" || exit 1
Only proceed if this prints ✅ CHECKPOINT 1 PASSED. The function verifies L1 foundation (plan, report, bibliography, ≥10 summaries, ≥10 matching scrapes). If L1 didn't produce these artifacts, abort L2 and report the L1 problem to the user.
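The real checks live in verify-research.sh, which is not reproduced here. As a rough illustration of what the listed conditions amount to, a function with the same intent might look like the sketch below. The name `sketch_l1_foundation` and the file-matching rules are assumptions for illustration, not the library's actual code.

```shell
# Hypothetical sketch: NOT the real verify_l2_checkpoint_1 from verify-research.sh.
# Assumes the L1 layout shown in "Artifacts Directory".
sketch_l1_foundation() {
  local dir=".firecrawl/research/$1/L1" f sums=0 scrapes=0
  # Required top-level artifacts must exist and be non-empty.
  for f in plan.md report.md bibliography.md; do
    [ -s "$dir/$f" ] || { echo "missing or empty: $dir/$f"; return 1; }
  done
  # Count summaries, and how many of them have a non-empty raw scrape beside them.
  for f in "$dir"/sources/*.sum.md; do
    [ -e "$f" ] || continue              # the glob matched nothing
    sums=$((sums + 1))
    if [ -s "${f%.sum.md}.md" ]; then scrapes=$((scrapes + 1)); fi
  done
  if [ "$sums" -lt 10 ] || [ "$scrapes" -lt 10 ]; then
    echo "need >=10 summaries with matching scrapes, found $sums summaries / $scrapes scrapes"
    return 1
  fi
  echo "L1 foundation looks complete ($sums summaries, $scrapes scrapes)"
}
```

Called as `sketch_l1_foundation "$SLUG"`, it returns non-zero on the first missing artifact, which mirrors the "abort L2 and report the L1 problem" rule.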
Stage 2: L2 Layer
Step 2.1: GAP ANALYSIS
Read L1/report.md and all L1/sources/*.sum.md. Identify:
- Unanswered questions — parts of the original query L1 didn't cover
- Surface-level answers — topics L1 treated shallowly that deserve more depth
- Missing perspectives — angles not represented (critics, alternatives, real users)
- Outdated sources — claims based on pre-2025 material that need verification
- Unverified claims — strong assertions with only one source
Write L2/gap-analysis.md:
# Gap Analysis
**Based on:** L1/report.md
## Unanswered
- Original query asks about X, but L1 doesn't address Y
- ...
## Shallow coverage
- L1 mentions Z but doesn't explain mechanism
- ...
## Missing perspectives
- No critical/skeptical sources
- No real-world user experiences
- ...
## Unverified claims
- "X is 3x faster" — only one source, no benchmark data
- ...
Step 2.2: FOLLOW-UP PLAN
Based on gaps, generate 2 new sub-questions (tree depth 2 — branching from L1's 3 sub-questions).
Write L2/followup-plan.md:
# Follow-up plan
## New sub-questions
4. <subq4 — targets gap A>
5. <subq5 — targets gap B>
## Search strategy
- subq4 → [specific queries, possibly contrarian]
- subq5 → [...]
## Target: 10-15 NEW sources (not already in L1)
Step 2.3: DEEPER SEARCH
Run parallel searches for the 2 new sub-questions:
mkdir -p .firecrawl/research/$SLUG/L2
firecrawl search "<subq4>" --limit 8 --json -o .firecrawl/research/$SLUG/L2/search-4.json &
firecrawl search "<subq5>" --limit 8 --json -o .firecrawl/research/$SLUG/L2/search-5.json &
wait
Also run a Tavily search for each sub-question with search_depth: "advanced".
v0.2.2: persist Tavily results. Tavily goes through MCP — its output lives in conversation context by default, not on disk. After each mcp__tavily__tavily_search call, use the Write tool to save the returned JSON to .firecrawl/research/$SLUG/L2/tavily-N.json. This makes Tavily findings auditable and survivable across context compaction.
Special L2 search tactics:
- Add contrarian terms: "problems with X", "X issues", "X vs alternatives"
- Search for recent critiques: "X 2026 criticism"
- Look for real benchmarks: "X benchmark comparison"
- Russian-language search for additional angles
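The query patterns above can be generated mechanically from a topic. The helper below is a minimal sketch; `contrarian_queries` is a hypothetical name, and the year defaults to the current one as an assumption.

```shell
# Hypothetical helper: expands a topic into the contrarian query patterns listed above.
# Usage: contrarian_queries <topic> [year]
contrarian_queries() {
  local topic="$1" year="${2:-$(date +%Y)}"
  printf '%s\n' \
    "problems with $topic" \
    "$topic issues" \
    "$topic vs alternatives" \
    "$topic $year criticism" \
    "$topic benchmark comparison"
}
```

Each output line can be passed as the query string of a search call, one query per line.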
Step 2.3a-pplx: PERPLEXITY ANSWER CHANNEL (optional, parallel, v0.6.0+)
Fault-tolerant — if Perplexity MCP isn't installed or DEEP_RESEARCH_DISABLE_PERPLEXITY=1 is set, this step is skipped silently.
Perplexity returns an already-synthesized answer with citations — a different output shape from Tavily/Exa (URL lists) and Codex (raw model response). For L2 contradictions, this gives a third independent "opinion" that can be directly cross-checked against Claude's synthesis of scraped sources.
mcp__perplexity-ask__perplexity_ask with:
messages: [{role: "user", content: "Research gap: <gap description from L2/gap-analysis.md>. Include inline [N] citations and Sources: list."}]
Save the response to .firecrawl/research/$SLUG/L2/perplexity-gap.json via the Write tool.
In Step 2.5 (contradictions), compare Perplexity's answer against Claude's L2 synthesis:
- Same conclusion, same sources → high confidence
- Same conclusion, different sources → triangulation (good)
- Different conclusion → flag in contradictions.md and investigate which is right
Perplexity's citations are candidate URLs for additional Firecrawl scraping if they're not already in L1/L2 sources.
Step 2.3a-exa: EXA NEURAL CHANNEL (optional, parallel, v0.5.0+)
Fault-tolerant — if Exa MCP is not installed or DEEP_RESEARCH_DISABLE_EXA=1 is set, this step is skipped. Independent from Codex — can run together or separately.
Exa provides neural semantic search with its own index. Complements Tavily (keyword-ish) and Firecrawl (extraction). For L2 gap-fill, Exa's web_search_advanced_exa with category filters often finds sources Tavily misses.
mcp__exa__web_search_advanced_exa with:
query: <gap description from L2/gap-analysis.md>
num_results: 8
type: "auto"
category: "news" # or "research paper" for technical gaps, omit for general
contents: { text: { max_characters: 20000 } }
Save the response JSON to .firecrawl/research/$SLUG/L2/exa-gap.json via the Write tool (same persistence pattern as Tavily — MCP responses don't survive context compaction).
In Step 2.5 (contradictions) and Step 2.7 (synthesis), merge Exa findings:
- URLs in Exa response but not in Tavily/Firecrawl → candidates for additional scrape
- Exa's score field (relevance) can flag high-confidence sources
- If Exa and Tavily point to conflicting sources → surface in contradictions.md
If Exa is unavailable, skip silently. The skill's existing Tavily+Firecrawl+Codex channels still work.
See docs/EXA_INTEGRATION.md for full integration reference.
Step 2.3a: CODEX CROSS-MODEL CHANNEL (optional, parallel)
Added in v0.2. This step is optional and fault-tolerant — if Codex isn't installed, auth is expired, or it times out, the skill continues without it. The report notes which mode was used.
Run a parallel research pass through OpenAI Codex CLI (GPT-5.4 with live web search). This gives a second, independent search index and model — disagreements between Claude's findings and Codex's findings are the strongest signal for contradictions.md.
CODEX_HELPER="$HOME/.claude/scripts/codex-research.sh"
[ -x "$CODEX_HELPER" ] || CODEX_HELPER="scripts/codex-research.sh"
CODEX_GAP_PROMPT="You are a research assistant working in parallel with another model on this query:
<ORIGINAL QUERY>
Focus specifically on filling these gaps identified in the first-pass research:
- <gap 1 from L2/gap-analysis.md>
- <gap 2 from L2/gap-analysis.md>
Return 5-10 key facts with source URLs. Include recent critiques and contrarian viewpoints if relevant. Be concise (≤800 words)."
if [ -x "$CODEX_HELPER" ]; then
bash "$CODEX_HELPER" 180 \
".firecrawl/research/$SLUG/L2/codex-gap.md" \
"$CODEX_GAP_PROMPT" &
CODEX_PID=$!
else
echo "⏭️ Codex helper not found — skipping cross-model channel"
CODEX_PID=""
fi
After Step 2.4 (scrape) completes, wait for Codex and record the outcome:
if [ -n "$CODEX_PID" ]; then
wait "$CODEX_PID" 2>/dev/null
if [ -s ".firecrawl/research/$SLUG/L2/codex-gap.md" ]; then
echo "✅ Codex cross-model channel: output available"
else
cat ".firecrawl/research/$SLUG/L2/codex-gap.md.status" 2>/dev/null
fi
fi
The status file will say one of: SUCCESS, SKIPPED, AUTH_FAILED, RATE_LIMITED, TIMEOUT, FAILED. Use this in Step 2.5 (contradictions) and Step 2.7 (synthesis) to decide whether to incorporate Codex findings.
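Turning the status file into the report note can be sketched as below. This assumes the status word is the first line of the `.status` file, which the text implies but does not spell out; `codex_note` is a hypothetical helper and the note wording is adapted.

```shell
# Hypothetical helper: maps Codex output and status to a one-line note for the report.
# Assumes the first line of "<output>.status" holds the status word.
codex_note() {
  local out="$1" status
  # Non-empty output means the cross-model channel produced findings.
  if [ -s "$out" ]; then
    echo "cross-model run: Codex findings available"
    return 0
  fi
  status=$(head -n 1 "$out.status" 2>/dev/null | tr -d '[:space:]')
  case "$status" in
    SKIPPED|AUTH_FAILED|RATE_LIMITED|TIMEOUT|FAILED) ;;
    *) status="UNKNOWN" ;;               # missing or unrecognized status file
  esac
  echo "Note: single-model run (Codex channel was $status). Claims have Claude-only verification."
}
```

For example, `codex_note ".firecrawl/research/$SLUG/L2/codex-gap.md"` prints the line to carry into the Confidence Summary.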
🛑 CHECKPOINT 2: Verify Searches Produced URLs
source "$VERIFY_LIB"
verify_l2_checkpoint_2 "$SLUG" || exit 1
Checks: ≥2 search JSON files, each non-trivially sized (≥500 bytes), ≥10 unique URLs total. Only proceed if this prints ✅ CHECKPOINT 2 PASSED.
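To make the listed conditions concrete, here is a sketch of an equivalent check. It is not the real `verify_l2_checkpoint_2`. Because the JSON schema of the search output is not shown here, it simply extracts anything that looks like an http(s) URL; both function names are hypothetical.

```shell
# Hypothetical sketch: NOT the real verify_l2_checkpoint_2.
# Counts distinct http(s) URLs across the given files without assuming a JSON schema.
count_unique_urls() {
  grep -ohE 'https?://[^"[:space:]\\]+' "$@" 2>/dev/null | sort -u | wc -l | tr -d ' '
}

sketch_search_check() {
  local dir=".firecrawl/research/$1/L2" f files=0 urls size
  for f in "$dir"/search-*.json; do
    [ -e "$f" ] || continue              # the glob matched nothing
    size=$(wc -c < "$f" | tr -d ' ')
    # Files under 500 bytes are treated as empty or error responses.
    [ "$size" -ge 500 ] || { echo "too small: $f ($size bytes)"; return 1; }
    files=$((files + 1))
  done
  [ "$files" -ge 2 ] || { echo "need >=2 search files, found $files"; return 1; }
  urls=$(count_unique_urls "$dir"/search-*.json)
  [ "$urls" -ge 10 ] || { echo "need >=10 unique URLs, found $urls"; return 1; }
  echo "searches OK: $files files, $urls unique URLs"
}
```

The loose URL match can over-count (for example, image links inside results), so treat it as a lower bar than a schema-aware check.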
Step 2.4: READ + SUMMARIZE (MANDATORY — DO NOT SKIP)
⚠️ This step is the primary hollow-synthesis pitfall. Do NOT rely on search snippets. You MUST run firecrawl scrape for each chosen URL. Each scrape must produce a real file on disk. The checkpoint after this step will verify.
Deduplicate against L1 sources. Pick 10–15 NEW URLs.
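Deduplication can be done on disk rather than from memory. The sketch below assumes L1 summaries carry a `**URL:**` line in the same format as the L2 summary template; `filter_new_urls` is a hypothetical helper, and trailing-slash normalization is the only URL canonicalization it attempts.

```shell
# Hypothetical helper, not part of the skill's scripts.
# Reads candidate URLs on stdin (one per line) and prints only those that do not
# already appear in a "**URL:**" line of an L1 summary.
# Usage: filter_new_urls <research-root> < candidates.txt
filter_new_urls() {
  local l1="$1/L1/sources" known
  known=$(mktemp)
  # Collect URLs already covered by L1; tolerate an empty sources directory.
  cat "$l1"/*.sum.md 2>/dev/null \
    | sed -n 's/^\*\*URL:\*\* *//p' | sed 's:/*$::' | sort -u > "$known"
  # Strip trailing slashes, drop repeats (keeping first-seen order), drop known URLs.
  sed 's:/*$::' | awk '!seen[$0]++' | grep -vxFf "$known" || true
  rm -f "$known"
}
```

A typical call would be `filter_new_urls ".firecrawl/research/$SLUG" < candidates.txt`, where `candidates.txt` is a hypothetical file holding the URLs collected from the L2 searches.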
Scrape in parallel — one firecrawl scrape per URL — and write summaries with same structure as L1 (.sum.md with Key facts, Key quotes, Notes).
mkdir -p .firecrawl/research/$SLUG/L2/sources
firecrawl scrape "<URL1>" --only-main-content -o .firecrawl/research/$SLUG/L2/sources/16-<slug>.md &
firecrawl scrape "<URL2>" --only-main-content -o .firecrawl/research/$SLUG/L2/sources/17-<slug>.md &
firecrawl scrape "<URL3>" --only-main-content -o .firecrawl/research/$SLUG/L2/sources/18-<slug>.md &
wait
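For 10 to 15 URLs, the three hand-written lines above generalize to a loop. The sketch below reuses the same `firecrawl scrape ... --only-main-content -o` invocation; the `scrape_batch` name and the way a file slug is derived from the URL are assumptions.

```shell
# Hypothetical helper: scrapes every URL on stdin in parallel, numbering files
# sequentially from a given start. Slug derivation from the URL is an assumption.
# Usage: scrape_batch <slug> <first-number> < urls.txt
scrape_batch() {
  local dir=".firecrawl/research/$1/L2/sources" n="$2" url name
  mkdir -p "$dir"
  while IFS= read -r url; do
    [ -n "$url" ] || continue            # skip blank lines
    # Drop the scheme, collapse non-alphanumerics to "-", cap the length.
    name=$(printf '%s' "$url" \
      | sed -e 's|^https\{0,1\}://||' -e 's|[^A-Za-z0-9]\{1,\}|-|g' \
      | cut -c1-40 | sed 's|-$||')
    firecrawl scrape "$url" --only-main-content -o "$dir/$n-$name.md" &
    n=$((n + 1))
  done
  wait                                   # block until every scrape has finished
}
```

For example, `scrape_batch "$SLUG" 16 < new-urls.txt` would write `16-….md`, `17-….md`, and so on, where `new-urls.txt` is a hypothetical file of deduplicated URLs.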
After scraping, write a .sum.md summary companion for EACH scraped file (same structure as L1):
# Source 16: <title>
**URL:** <url>
**Type:** [official-docs | tech-blog | community | news | academic]
**Date:** <publication date if available>
**Quality:** [A | B | C]
**Relevant to:** [subq4 | subq5]
## Key facts
- Fact 1 (specific, citable)
- ...
## Key quotes
> "exact quote 1"
## Notes
- Contradicts source X on Y (if applicable)
- Gap: doesn't address W
🛑 CHECKPOINT 3: Verify Scrapes Completed (CRITICAL — prevents hollow synthesis)
This is the most important checkpoint in the pipeline. Skipping it is how "hollow synthesis" happens.
source "$VERIFY_LIB"
verify_l2_checkpoint_3 "$SLUG" || exit 1
Checks: ≥8 L2 scrapes, each ≥500 bytes (not error page), each has matching .sum.md companion ≥1000 bytes. If this fails, DO NOT write the L2 report. Re-run the scrape step.
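The conditions above can be pictured as the following sketch. It is not the real `verify_l2_checkpoint_3`; the function names are hypothetical and the thresholds are copied from the description.

```shell
# Hypothetical sketch: NOT the real verify_l2_checkpoint_3.
file_size() { wc -c < "$1" | tr -d ' '; }

sketch_scrape_check() {
  local dir=".firecrawl/research/$1/L2/sources" f sum ok=0
  for f in "$dir"/*.md; do
    [ -e "$f" ] || continue              # the glob matched nothing
    case "$f" in *.sum.md) continue ;; esac   # summaries are checked via their scrape
    sum="${f%.md}.sum.md"
    # A scrape under 500 bytes is most likely an error page.
    [ "$(file_size "$f")" -ge 500 ] || { echo "scrape too small: $f"; return 1; }
    [ -f "$sum" ] || { echo "missing summary: $sum"; return 1; }
    [ "$(file_size "$sum")" -ge 1000 ] || { echo "summary too small: $sum"; return 1; }
    ok=$((ok + 1))
  done
  [ "$ok" -ge 8 ] || { echo "need >=8 verified scrapes, found $ok"; return 1; }
  echo "scrapes OK: $ok verified"
}
```

Because it stops at the first bad file, the error message names exactly which URL to re-scrape or which summary to rewrite.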
Step 2.5: CONTRADICTION DETECTION
Read all summaries (L1 + L2). If L2/codex-gap.md exists, read it too — Codex's independent findings often surface contradictions Claude's scrapes miss.
Look for direct disagreements:
- Two sources claim different numbers
- Two sources recommend opposite approaches
- Source claims X is best, another says X is deprecated
- Old source vs new source
Write L2/contradictions.md:
# Contradictions found
## C1: [Topic]
**Source A** ([1]): "claim A"
**Source B** ([7]): "opposite claim B"
**Analysis:** which is more recent? more authoritative? specific context?
**Resolution:** [which to trust and why, OR "unresolved, flag in report"]
If truly no contradictions exist, write the file with # Contradictions found\n\nNone — all sources in agreement on major points. — do not omit the file.
Step 2.6: CONFIDENCE GRADING
For each major claim in the enriched report, assign:
- High — confirmed by 3+ independent authoritative sources, recent, no contradictions
- Medium — confirmed by 2 sources, or 1 authoritative source, minor gaps
- Low — single source, old, or contradicted by other sources
Write L2/confidence.md listing major claims with their grade.
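One way to apply the grading rules consistently is to encode them as a small function. The sketch below is one possible reading, not the skill's official rubric: `grade_claim` is hypothetical, and the tie-breaks (for example, three non-authoritative sources grading as Medium) are assumptions.

```shell
# Hypothetical helper: one possible encoding of the High / Medium / Low rules above.
# Args: <independent source count> <authoritative: yes|no> <recent: yes|no> <contradicted: yes|no>
grade_claim() {
  local n="$1" auth="$2" recent="$3" contra="$4"
  # Contradicted or stale claims are Low regardless of source count.
  if [ "$contra" = yes ] || [ "$recent" = no ]; then echo Low; return 0; fi
  # Three or more independent authoritative sources.
  if [ "$n" -ge 3 ] && [ "$auth" = yes ]; then echo High; return 0; fi
  # Two or more sources, or a single authoritative one.
  if [ "$n" -ge 2 ] || { [ "$n" -eq 1 ] && [ "$auth" = yes ]; }; then
    echo Medium; return 0
  fi
  echo Low
}
```

For instance, `grade_claim 1 no yes no` yields `Low`, while `grade_claim 3 yes yes no` yields `High`.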
Step 2.7: ENRICHED SYNTHESIS
Produce L2/report.md — this supersedes L1/report.md.
Before writing, re-read ALL .sum.md files (both L1 and L2). The report must synthesize from summaries, not from memory of what you searched.
If L2/codex-gap.md exists and has content, incorporate its findings alongside Claude's findings. Mark Codex-sourced facts with (cross-model) tag in the report so the reader knows which claims have two-model backing (higher confidence) vs one-model (lower confidence).
If Codex was unavailable (check L2/codex-gap.md.status): add a note to the Confidence Summary section: Note: single-model run — Codex channel was <SKIPPED|AUTH_FAILED|TIMEOUT|...>. Claims have Claude-only verification.
# <Topic Title>
**Query:** <original>
**Level:** L2 (includes L1)
**Sources:** <L1 count> + <L2 count> = <total>
**Generated:** <date>
## TL;DR
[Core answer, 3-4 sentences. Include confidence marker: "✓ high confidence"]
## Executive Summary
[1 paragraph, ~200 words. Main findings with citations.]
## <Section 1: from L1 subq1, enriched with L2 findings>
[Full answer with [N] citations. Mark confidence where relevant: "(High)", "(Medium)".]
## <Section 2: from L1 subq2, enriched>
[...]
## <Section 3: from L1 subq3, enriched>
[...]
## <Section 4: from L2 subq4 — deeper dive>
[New angle not covered in L1]
## <Section 5: from L2 subq5 — deeper dive>
[Another new angle]
## Contradictions & Debates
[Explicit section listing disagreements from contradictions.md]
- **On X:** Source [1] says A, source [7] says B. Resolution: [explanation]
- **On Y:** ...
## Confidence Summary
- High confidence: [list of claims]
- Medium confidence: [list]
- Low confidence: [list, flagged as "needs verification"]
## Key takeaways
- Point 1 [1, 3, 16] (High)
- Point 2 [2, 5, 19] (Medium)
- ...
## Open questions
- What even L2 research couldn't resolve
- Where user might want to escalate to L3/L4
## Recommendation
[Concrete recommendation with explicit reasoning, accounting for confidence levels]
Target length: 1800–2500 words.
Citation rule: every [N] citation must correspond to a real bibliography entry. Do not invent citation numbers.
Step 2.8: Bibliography
Write L2/bibliography.md merging L1 and L2 sources. Keep L1 numbering intact (1–15), L2 sources continue (16+).
# Bibliography
## L1 sources (from research)
1. **[Title]**(URL) — Quality: A | Type: official-docs
Contribution: primary source for X
2. ...
15. ...
## L2 sources (new in deep-research)
16. **[Title]**(URL) — Quality: A | Type: tech-blog
Contribution: benchmark data, contradicts [3] on Y
17. ...
🛑 CHECKPOINT 4: Final Verification Before Delivering Report
source "$VERIFY_LIB"
verify_l2_checkpoint_4 "$SLUG" || exit 1
Checks: L2 report ≥1500 words; bibliography, contradictions, confidence, gap-analysis, followup-plan all present; every [N] citation (including multi-cite [1, 3, 16]) maps to a bibliography entry.
Only deliver the report to the user if this prints ✅ CHECKPOINT 4 PASSED.
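The citation part of this checkpoint can be illustrated with the sketch below. It is not the real `verify_l2_checkpoint_4`; both function names are hypothetical, and it assumes bibliography entries start a line with `N. ` as in the template above.

```shell
# Hypothetical sketch: NOT the real verify_l2_checkpoint_4.
# Prints every citation number used in a report, one per line, sorted numerically.
cited_numbers() {
  # Matches [7] and multi-cites such as [1, 3, 16], then splits on non-digits.
  grep -oE '\[[0-9]+(, *[0-9]+)*\]' "$1" \
    | tr -c '0-9\n' '\n' | grep -E '^[0-9]+$' | sort -un
}

# Fails if any cited number has no "N. " entry in the bibliography.
sketch_citation_check() {
  local report="$1" bib="$2" n missing=""
  for n in $(cited_numbers "$report"); do
    grep -qE "^$n\. " "$bib" || missing="$missing $n"
  done
  [ -z "$missing" ] || { echo "citations without a bibliography entry:$missing"; return 1; }
  echo "all citations resolve"
}
```

A call such as `sketch_citation_check L2/report.md L2/bibliography.md` (run from the research directory) lists the offending numbers, which is usually enough to spot an invented citation.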
Final Output
- Display L2/report.md in chat
- Mention artifacts:
Full materials are in .firecrawl/research/<slug>/
- Show summary stats:
📊 L2 stats: <L1> L1 sources + <L2> L2 sources, <N> contradictions, confidence: H:<n> M:<n> L:<n>
- Offer escalation:
Need a critical review? /expert-research (L3) adds a critic agent and fact-checking.
Need academic sources? /academic-research (L4) brings in arXiv and Scholar.
Why these checkpoints exist
Earlier versions of this skill had a recurring failure mode called "hollow synthesis": the model would run the search step, look at Tavily and Firecrawl search-result snippets, and jump directly to writing the report — silently skipping the actual firecrawl scrape step.
The resulting report looked plausible but was structurally broken:
- Cited [L2-N] references with no corresponding file in L2/sources/
- Had no bibliography.md
- Synthesized from snippets, not full content
- Read confidently but had zero audit trail
Root cause: under context pressure (long conversation, many tools already called), the model economizes by treating rich search-result snippets as if they were full sources. The skill said "scrape 10–15 URLs" but had no enforcement.
The checkpoints above are designed to catch this failure before a report is written:
- Checkpoint 1 guarantees L1 actually ran before L2 starts
- Checkpoint 2 guarantees searches returned URLs
- Checkpoint 3 guarantees scrapes produced real files of non-trivial size with matching summaries
- Checkpoint 4 guarantees the final report only cites sources that exist in the bibliography
Rule of thumb: if .firecrawl/research/<slug>/L2/sources/ is empty or missing .sum.md companions, something went wrong. Stop, diagnose, fix. No synthesis allowed on a broken foundation.
Rules
- Always call L1 first: never inline the L1 logic
- Never duplicate L1 sources: L2 adds NEW sources only
- Contradictions are a feature: surface them, don't hide them
- Confidence is honest: a claim resting on a single non-authoritative source is graded Low
- Gap analysis is grounded: base it on actual L1 content, not assumptions
- Parallel operations: searches and scrapes run concurrently
- Preserve L1 artifacts: never modify or delete L1/* files
- Run EVERY checkpoint: all four Bash verification blocks must print ✅ PASSED before proceeding
- Scrape is mandatory: snippets are not sources; no scrape means no citation
- No hollow synthesis: if L2/sources/ has fewer than 8 real .md files, stop and re-scrape
Called from higher levels
L3/L4/L5 call this skill to get an L2-grade foundation before adding their own layers. When called from higher levels:
- Produce all L2 artifacts as usual
- All four checkpoints still apply
- Higher level reads L2/report.md, L2/contradictions.md, L2/confidence.md as its input