| name | deepsearch |
| description | Deep research skill: iterative multi-hop investigation using SearXNG (meta-search) and Firecrawl (web exploration). Handles both quick web searches and structured comparative research. Use when researching information on the internet, debugging issues, finding solutions, conducting comparative studies, or answering questions requiring cross-domain context. |
DeepSearch — Unified Research Skill
Iterative multi-hop research combining meta-search discovery (SearXNG) and deep web exploration (Firecrawl). Replaces separate web-search, research, research-deep, and research-report skills.
Prerequisites — Auto-start
Before any research, always ensure services are up. Run this check at the start of every deepsearch invocation:
curl -sf "http://localhost:8888/search?q=test&format=json" > /dev/null 2>&1 || \
(cd ~/code/led8/ai/spark/agent-deepsearch && make up-searxng && sleep 5)
curl -sf -X POST "http://localhost:3002/v1/scrape" -H "Content-Type: application/json" -d '{"url":"https://example.com"}' > /dev/null 2>&1 || \
(cd ~/code/led8/ai/spark/agent-deepsearch && make up-firecrawl && sleep 20)
This is not optional — run it silently every time, don't ask the user.
Modes
Default mode: Iterative research
/deepsearch <query>
Multi-hop investigation with confidence-based iteration. For debugging, technical questions, finding solutions, understanding topics.
Outline mode: Structured comparative research
/deepsearch --outline <topic>
Generates research framework (items + fields), then deep-researches each item systematically. For technology selection, benchmarks, comparative studies.
Core Algorithm
REPEAT until confidence = HIGH or max_iterations (3) reached:
1. Memory check (agent-memory)
2. Decompose query into sub-questions
3. SearXNG search → discover URLs + snippets
4. Firecrawl scrape → read key pages, follow interesting links
5. Extract + rank results by relevance
6. Detect pivots (findings contradicting assumptions)
7. Build synthesis → evaluate confidence
8. IF confidence < HIGH → refine queries → iterate
9. Store key findings in memory
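The loop above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: `run_iteration` is a hypothetical callable standing in for steps 1-7, and the cap of 3 comes from the Confidence Evaluation section.

```python
from typing import Callable, List, Tuple

# Ordering of the three confidence levels used by this skill.
LEVELS = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def research_loop(
    query: str,
    run_iteration: Callable[[str, List[str]], Tuple[List[str], str]],
    max_iterations: int = 3,
) -> Tuple[List[str], str, int]:
    """Iterate until confidence is HIGH or the iteration cap is hit.

    run_iteration (hypothetical) receives the query and the findings so far,
    and returns (new_findings, confidence_level).
    """
    findings: List[str] = []
    confidence = "LOW"
    iterations = 0
    while iterations < max_iterations:
        iterations += 1
        new_findings, confidence = run_iteration(query, findings)
        findings.extend(new_findings)
        if LEVELS[confidence] >= LEVELS["HIGH"]:
            break  # good enough, stop iterating
    return findings, confidence, iterations
```

Passing the accumulated findings back into each iteration lets later passes refine queries instead of repeating earlier searches.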
Tool Reference
SearXNG — Discovery
Basic search:
curl -s "http://localhost:8888/search?q=YOUR+QUERY&format=json" | jq '.results[:10]'
With filters:
curl -s "http://localhost:8888/search?q=QUERY&format=json&time_range=month"
curl -s "http://localhost:8888/search?q=QUERY&format=json&language=en"
curl -s "http://localhost:8888/search?q=QUERY&format=json&categories=it"
curl -s "http://localhost:8888/search?q=QUERY&format=json&engines=github,stackoverflow"
curl -s "http://localhost:8888/search?q=QUERY&format=json&pageno=2"
Response structure:
{
"results": [
{"title": "...", "url": "...", "content": "...", "engine": "google", "score": 0.95}
],
"suggestions": ["related query 1", "related query 2"],
"infoboxes": [{"infobox": "...", "content": "...", "urls": [...]}],
"number_of_results": 1234
}
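As an illustration of consuming this structure, the sketch below deduplicates results by URL and sorts them by `score`. The function name `top_results` is hypothetical, and treating a missing score as 0.0 is an assumption.

```python
import json
from typing import Dict, List


def top_results(raw_json: str, limit: int = 10) -> List[Dict[str, str]]:
    """Return up to `limit` results, deduplicated by URL, highest score first."""
    payload = json.loads(raw_json)
    seen = set()
    unique = []
    for item in payload.get("results", []):
        url = item.get("url")
        if not url or url in seen:
            continue  # skip entries without a URL and repeated URLs
        seen.add(url)
        unique.append(item)
    # Assumption: a missing or null score counts as 0.0.
    unique.sort(key=lambda r: r.get("score") or 0.0, reverse=True)
    return [
        {"title": r.get("title", ""), "url": r["url"], "content": r.get("content", "")}
        for r in unique[:limit]
    ]
```

When several engines return the same URL, the first occurrence is kept, so its title and snippet are the ones reported.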
Engine routing by research type:
| Type | Engines param |
|---|---|
| General web | google,bing,duckduckgo,brave |
| IT/Dev | github,stackoverflow,dockerhub |
| Academic | arxiv,semantic+scholar,google+scholar,pubmed |
| Forums/Community | reddit |
| News | google+news |
| Knowledge | wikipedia,wikidata |
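The routing table can be turned into a small URL builder. This is a sketch only: the short keys (`general`, `it`, and so on) are hypothetical labels for the table rows, and the base URL is the local instance used throughout this document.

```python
from urllib.parse import urlencode

# Engines per research type, copied from the routing table.
# The dictionary keys are hypothetical short labels for each row.
ENGINES_BY_TYPE = {
    "general": "google,bing,duckduckgo,brave",
    "it": "github,stackoverflow,dockerhub",
    "academic": "arxiv,semantic scholar,google scholar,pubmed",
    "forums": "reddit",
    "news": "google news",
    "knowledge": "wikipedia,wikidata",
}


def search_url(query: str, research_type: str = "general",
               base: str = "http://localhost:8888") -> str:
    """Build a SearXNG JSON search URL with the engines for a research type."""
    params = {
        "q": query,
        "format": "json",
        "engines": ENGINES_BY_TYPE[research_type],
    }
    # urlencode escapes spaces as '+' and commas as '%2C'.
    return f"{base}/search?{urlencode(params)}"
```

Engine names containing spaces are written with `+` in the table because that is how a space is encoded in a query string; `urlencode` produces the same encoding automatically.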
Firecrawl — Exploration
Scrape a single page (markdown output):
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}' | jq '.data.markdown'
Scrape with only main content (no nav/footer):
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"], "onlyMainContent": true}'
Extract links from a page:
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["links"]}' | jq '.data.links'
Crawl a site (follow links recursively):
CRAWL_ID=$(curl -s -X POST "http://localhost:3002/v1/crawl" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "limit": 10, "maxDepth": 2}' | jq -r '.id')
curl -s "http://localhost:3002/v1/crawl/$CRAWL_ID" | jq '.status, .completed, .total'
curl -s "http://localhost:3002/v1/crawl/$CRAWL_ID" | jq '.data[].markdown'
Map a site (discover all URLs without scraping):
curl -s -X POST "http://localhost:3002/v1/map" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' | jq '.links'
Resilient Scraping (403 / cookie walls)
Why this matters: this self-hosted Firecrawl only has two real engines — fetch (plain HTTP) and playwright (headless browser). The cloud-only anti-bot engines (fire-engine, chrome-cdp, tlsclient, stealth proxy) and actions (click-to-dismiss cookie banners) are NOT available. So you cannot bypass aggressive bot protection by clicking. Instead, use the tiered strategy below and fall back gracefully — never loop on a blocked URL.
Failure signatures — detect and react
| Signature | Meaning | Action |
|---|---|---|
| "success": false + "All scraping engines failed" | Anti-bot blocked all engines (often a 403) | Go to next tier |
| statusCode: 403 (or 401/429) | Forbidden / rate-limited / auth wall | Go to next tier |
| markdown is near-empty or only a cookie/consent banner | Cookie-consent wall returned instead of content | Go to next tier |
| SSL/TLS error message | Bad certificate | Retry with skipTlsVerification: true |
Hard rule: max 2 retries per URL. If still blocked, drop it and pick a different source (source diversity beats fighting one wall).
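The signature table can be expressed as a small classifier. This is a hedged sketch: it assumes the scrape response exposes the error text under `error` and the HTTP status under `data.metadata.statusCode`, and the 200-character threshold for "near-empty" is an arbitrary illustrative value, not something this skill defines.

```python
from typing import Any, Dict


def next_action(response: Dict[str, Any], min_chars: int = 200) -> str:
    """Map a scrape response to 'ok', 'next_tier', or 'retry_skip_tls'.

    Assumed field locations: response['error'] for the message and
    response['data']['metadata']['statusCode'] for the HTTP status.
    """
    error = str(response.get("error", "")).lower()
    data = response.get("data") or {}
    status = (data.get("metadata") or {}).get("statusCode")
    markdown = data.get("markdown") or ""

    # Certificate problems get a different remedy than anti-bot blocks.
    if any(word in error for word in ("ssl", "tls", "certificate")):
        return "retry_skip_tls"
    if response.get("success") is False:
        return "next_tier"
    if status in (401, 403, 429):
        return "next_tier"
    # Thin content usually means a consent banner came back instead of the page.
    if len(markdown.strip()) < min_chars:
        return "next_tier"
    return "ok"
```

The length check is only a rough proxy for cookie walls; a page that is legitimately short would also be escalated under this rule.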
Tier 1 — basic (default)
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "URL", "formats": ["markdown"], "onlyMainContent": true}' | jq '.data.markdown'
Tier 2 — force browser render (on 403 / thin content / cookie wall)
Forces the Playwright engine (waitFor is only supported there locally), waits for JS to render, and sends a realistic desktop User-Agent:
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{
"url": "URL",
"formats": ["markdown"],
"onlyMainContent": true,
"waitFor": 3000,
"skipTlsVerification": true,
"removeBase64Images": true,
"headers": {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"}
}' | jq '.data.markdown'
Tier 3 — external fallbacks (still blocked)
Try in order, then give up on this URL:
1. Jina Reader proxy:
curl -s "https://r.jina.ai/URL"
2. Wayback Machine snapshot, scraped through Firecrawl:
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "https://web.archive.org/web/2/URL", "formats": ["markdown"], "onlyMainContent": true}' | jq '.data.markdown'
3. If all fail: use the SearXNG snippet (.results[].content) as the source and pick a DIFFERENT URL covering the same sub-question.
Optional infra lever: routing the Playwright service through a PROXY_SERVER (see agent-deepsearch docs/scraping-resilience.md) is the single biggest fix for IP-based 403s.
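To make the escalation order concrete, the sketch below builds the ordered list of requests for one URL without sending anything. The function name `fallback_plan` and the dictionary layout are hypothetical; the endpoints and request bodies mirror the tier commands above.

```python
from typing import Any, Dict, List

SCRAPE_ENDPOINT = "http://localhost:3002/v1/scrape"
DESKTOP_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def fallback_plan(url: str) -> List[Dict[str, Any]]:
    """Return the requests to try, in order, for a single target URL."""
    base = {"formats": ["markdown"], "onlyMainContent": True}
    return [
        # Tier 1: plain scrape.
        {"tier": 1, "method": "POST", "endpoint": SCRAPE_ENDPOINT,
         "body": {"url": url, **base}},
        # Tier 2: browser render with a desktop User-Agent.
        {"tier": 2, "method": "POST", "endpoint": SCRAPE_ENDPOINT,
         "body": {"url": url, **base, "waitFor": 3000,
                  "skipTlsVerification": True, "removeBase64Images": True,
                  "headers": {"User-Agent": DESKTOP_UA}}},
        # Tier 3a: external reader proxy.
        {"tier": 3, "method": "GET", "endpoint": f"https://r.jina.ai/{url}",
         "body": None},
        # Tier 3b: archived snapshot, scraped through the local service.
        {"tier": 3, "method": "POST", "endpoint": SCRAPE_ENDPOINT,
         "body": {"url": f"https://web.archive.org/web/2/{url}", **base}},
    ]
```

A caller would walk this list, stop at the first usable response, and fall back to the search snippet plus a different source if every entry fails.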
Default Mode Workflow
1. Memory Check (always first)
agent-memory memory get-context --query "<topic>" --include-long-term 2>/dev/null
- High-quality results → narrow scope, skip known parts
- No results → full research
2. Query Decomposition
Parse the research goal into atomic sub-questions:
Query: "Best approach for rate limiting in FastAPI"
→ Sub-questions:
1. What built-in rate limiting does FastAPI/Starlette offer?
2. What third-party libraries exist?
3. What are Redis-based vs in-memory tradeoffs?
4. What do production deployments recommend?
Rules:
- 3-7 sub-questions per query
- Each should be answerable independently
- Separate dependent sub-questions (must run sequentially) from independent ones (can run in parallel)
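One way to handle the sequential-versus-parallel rule is to group sub-questions into batches with a topological sort. This is a sketch using Python's standard `graphlib` module (3.9+); the function name and the `q1`..`q4` identifiers are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group sub-questions into batches that can run in parallel.

    `dependencies` maps each sub-question id to the ids it depends on.
    Every batch only depends on sub-questions from earlier batches.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For example, if `q3` needs the answer to `q2` and `q4` needs both `q1` and `q3`, then `q1` and `q2` run together first, followed by `q3`, then `q4`.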
3. Search & Explore (iterative)
For each sub-question:
Step 3a — Discover (SearXNG):
curl -s "http://localhost:8888/search?q=SUBQUESTION&format=json&engines=RELEVANT_ENGINES" | jq '.results[:10] | .[] | {title, url, content}'
Step 3b — Explore (Firecrawl):
Pick the 2-3 most promising URLs from results:
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "PROMISING_URL", "formats": ["markdown"], "onlyMainContent": true}' | jq '.data.markdown'
If a scrape returns 403, near-empty content, or a cookie wall, escalate using the tiers in Resilient Scraping (Tier 2 browser render → Tier 3 fallbacks → different source). Never retry the same URL more than twice.
Step 3c — Follow links:
If a page references something relevant, explore it:
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "CURRENT_PAGE", "formats": ["links"]}' | jq '.data.links[] | select(contains("relevant-keyword"))'
curl -s -X POST "http://localhost:3002/v1/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "LINKED_PAGE", "formats": ["markdown"], "onlyMainContent": true}'
4. Pivot Detection
| Signal | Action |
|---|---|
| Info contradicts assumption | Change search direction |
| Resource deprecated/removed | Search for alternatives and archived copies |
| Feature not supported | Compare alternatives immediately |
| Single-source claim | Seek corroboration |
5. Confidence Evaluation
| Level | Criteria |
|---|---|
| HIGH | All sub-questions answered, multiple sources agree, no gaps |
| MEDIUM | Most answered, some gaps, single-source claims |
| LOW | Key questions unanswered, conflicting info, major gaps |
If LOW or MEDIUM: refine queries, add new sub-questions, iterate.
Maximum 3 iterations before presenting best available answer.
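The criteria table can be approximated in code. This sketch makes several assumptions that the table does not state: "multiple sources" is read as at least two, "most answered" as more than half, and the `SubQuestion` fields (including the `key` flag) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubQuestion:
    text: str
    answered: bool
    source_count: int = 0      # independent sources supporting the answer
    conflicting: bool = False  # sources disagree
    key: bool = False          # hypothetical flag for must-answer questions


def confidence(subs: List[SubQuestion]) -> str:
    """Return 'HIGH', 'MEDIUM', or 'LOW' for a set of sub-questions."""
    if not subs:
        return "LOW"
    # Conflicts or an unanswered key question force LOW.
    if any(s.conflicting for s in subs):
        return "LOW"
    if any(s.key and not s.answered for s in subs):
        return "LOW"
    answered = [s for s in subs if s.answered]
    # HIGH: everything answered and corroborated (assumed: >= 2 sources).
    if len(answered) == len(subs) and all(s.source_count >= 2 for s in subs):
        return "HIGH"
    # MEDIUM: most answered (assumed: more than half).
    if len(answered) * 2 > len(subs):
        return "MEDIUM"
    return "LOW"
```

A fully answered set with one single-source claim lands on MEDIUM, which matches the "Single-source claim → Seek corroboration" pivot rule.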
6. Memory Storage
Store non-obvious findings:
agent-memory memory remember --kind fact --subject "TOPIC" --predicate "KEY_FINDING" --object "DETAIL" --confidence 0.9 2>/dev/null
Store only: gotchas, decisions, constraints, non-obvious facts.
Don't store: obvious facts, session logs, temporary info.
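Findings often contain quotes or shell metacharacters, so building the command as an argument list is safer than interpolating strings. The sketch below uses only the flags shown in the command above; the function name is hypothetical, and the 0.0-1.0 range check is an assumption about what `--confidence` accepts.

```python
import shlex
from typing import List


def remember_command(subject: str, predicate: str, detail: str,
                     confidence: float = 0.9) -> List[str]:
    """Build the argv for storing one fact with agent-memory."""
    # Assumption: confidence is a probability-like value in [0, 1].
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    return [
        "agent-memory", "memory", "remember",
        "--kind", "fact",
        "--subject", subject,
        "--predicate", predicate,
        "--object", detail,
        "--confidence", str(confidence),
    ]


def as_shell(argv: List[str]) -> str:
    """Render the argv as a safely quoted shell command line."""
    return shlex.join(argv)
```

The list form can be passed directly to `subprocess.run`, while `as_shell` is useful when the command has to be logged or pasted into a terminal.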
Outline Mode Workflow (--outline)
Step 1: Generate Framework
Based on topic + model knowledge, produce:
- Items list: research objects to investigate
- Fields framework: dimensions to compare
Present to user for confirmation (add/remove items/fields).
Step 2: Web Supplement
Search SearXNG for missing items:
curl -s "http://localhost:8888/search?q=TOPIC+comparison+2024+2025&format=json&time_range=year" | jq '.results[:15]'
Supplement items and fields based on findings.
Step 3: Generate Outline Files
Save to ./{topic_slug}/:
outline.yaml:
topic: "Research topic"
items:
- name: "Item 1"
category: "Category"
description: "Brief description"
execution:
batch_size: 5
output_dir: ./results
fields.yaml:
field_categories:
- category: "basic_info"
fields:
- name: "field_name"
description: "What this field captures"
detail_level: "moderate"
required: true
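To show how the `required` flag might be used, here is a sketch that checks one item's output against an already-loaded fields.yaml. It is not the skill's validate_json.py script, whose behavior is not described here; it assumes the per-item JSON is a flat mapping of field name to value, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def missing_required(fields_spec: Dict[str, Any], item: Dict[str, Any]) -> List[str]:
    """List required field names that are absent or empty in `item`.

    `fields_spec` is the parsed content of fields.yaml.
    Assumption: `item` maps field names directly to values.
    """
    missing: List[str] = []
    for category in fields_spec.get("field_categories", []):
        for field in category.get("fields", []):
            name = field["name"]
            value = item.get(name)
            # Treat None and empty containers/strings as missing.
            if field.get("required") and value in (None, "", [], {}):
                missing.append(name)
    return missing
```

Numeric zero and `False` are deliberately treated as present, since they can be valid answers for a field.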
Step 4: Deep Research (per item)
For each item, run the full iterative flow (steps 1-6 of default mode) targeting that specific item + all fields.
Output structured JSON per item to {output_dir}/{item_slug}.json.
Validate:
python ~/.pi/agent/skills/deepsearch/scripts/validate_json.py -f fields.yaml -j results/item.json
Batch checkpoint every batch_size items — pause and ask user for approval.
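The batching and output naming can be sketched as below. The slug rule (lowercase, non-alphanumeric runs collapsed to hyphens) is an assumption, since this document does not define how `item_slug` is derived; the helper names are hypothetical.

```python
import re
from typing import Iterator, List


def slugify(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def batches(items: List[str], batch_size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def output_path(output_dir: str, item_name: str) -> str:
    """Build the per-item JSON path, e.g. ./results/item-1.json."""
    return f"{output_dir.rstrip('/')}/{slugify(item_name)}.json"
```

A driver would research every item in one batch, write each result to its `output_path`, and then pause for user approval before starting the next batch.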
Step 5: Report
After all items complete, synthesize into a markdown report covering all fields, all items, with comparison tables where applicable.
Output Format
Default mode output:
## Summary
[2-3 sentences: what was found, recommended path forward]
## Findings
### [Sub-question 1]
[Answer with evidence]
- Source: [url]
### [Sub-question 2]
[Answer with evidence]
- Source: [url]
## Confidence: [HIGH/MEDIUM/LOW]
[Why this level]
## Gaps
- [Unanswered questions, if any]
## Sources
1. [title](url) — [engine, relevance note]
2. [title](url) — [engine, relevance note]
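The default-mode template can be filled programmatically. This is a minimal sketch: the function name and the dictionary keys (`question`, `answer`, `source`, `title`, `url`, `note`) are hypothetical, and the output follows the section order of the template above.

```python
from typing import Dict, List


def render_report(summary: str, findings: List[Dict[str, str]],
                  confidence: str, reason: str, gaps: List[str],
                  sources: List[Dict[str, str]]) -> str:
    """Render a default-mode report following the template's section order."""
    out = ["## Summary", summary, "## Findings"]
    for f in findings:
        out += [f"### {f['question']}", f["answer"], f"- Source: {f['source']}"]
    out += [f"## Confidence: {confidence}", reason, "## Gaps"]
    # Keep the Gaps section visible even when nothing is missing.
    out += [f"- {g}" for g in gaps] or ["- None"]
    out.append("## Sources")
    # \u2014 is the dash character used between link and note in the template.
    out += [
        f"{i}. [{s['title']}]({s['url']}) \u2014 {s['note']}"
        for i, s in enumerate(sources, 1)
    ]
    return "\n".join(out)
```

Keeping the renderer separate from the research loop makes it straightforward to reuse the same findings for the outline-mode report.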
Outline mode output:
Per-item JSON + final markdown report with comparison tables.
Key Principles
- Memory first — always check before searching
- Discover then explore — SearXNG finds URLs, Firecrawl reads them
- Follow the links — interesting pages reference other interesting pages
- Iterate on confidence — don't stop at surface-level results
- Pivot on contradictions — adapt strategy when assumptions break
- Source diversity — multiple weak sources > single strong source
- Rank semantically — relevance > authority > freshness > depth
- Store durably — non-obvious findings go to memory for future use