| name | webcrawler |
| description | Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output. |
Webcrawler Skill
Intelligent documentation harvesting agent that recursively crawls documentation websites and extracts structured content about specific subjects.
Last Updated: 2026-01-23
Quick Start
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://docs.python.org/3/library/asyncio.html" \
--subject "asyncio" \
--depth 2 \
--output .tmp/docs/python-asyncio/
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://react.dev/" \
--subject "React" \
--depth 3 \
--output .tmp/docs/react/
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://expressjs.com/en/4x/api.html" \
--subject "Express API" \
--filter "api" \
--output .tmp/docs/express-api/
Core Workflow
- Initialize Crawl — Provide base URL and subject focus
- Discover Pages — Recursively find all linked documentation pages
- Filter Content — Keep only pages matching the subject criteria
- Extract Content — Convert HTML to clean markdown
- Organize Output — Structure files in a navigable hierarchy
- Generate Index — Create a master index with all harvested pages
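The discovery and filtering steps above amount to a depth-limited breadth-first traversal. The following is a minimal sketch of that idea, not the code in crawl_docs.py: `LINKS` is a hypothetical in-memory link graph standing in for real HTTP requests, and `crawl` with its substring filter is an illustrative name and assumption.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
LINKS = {
    "/docs/": ["/docs/asyncio", "/docs/threading", "/blog/"],
    "/docs/asyncio": ["/docs/asyncio/tasks", "/docs/"],
    "/docs/asyncio/tasks": ["/docs/asyncio/streams"],
    "/docs/threading": [],
    "/blog/": [],
}

def crawl(start, max_depth, max_pages=100, keyword=None):
    """Breadth-first discovery with a depth limit and a page cap."""
    seen = {start}
    queue = deque([(start, 0)])
    kept = []
    fetched = 0
    while queue and fetched < max_pages:
        url, depth = queue.popleft()
        fetched += 1
        # Filter step (assumed): keep pages whose URL mentions the keyword.
        if keyword is None or keyword in url:
            kept.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return kept

print(crawl("/docs/", max_depth=2, keyword="asyncio"))
```

With `max_depth=2` the page `/docs/asyncio/streams` is never reached, which is the effect --depth is described as having.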
Scripts
crawl_docs.py — Main Documentation Crawler
The primary crawling script that handles recursive page discovery and content extraction.
python skills/webcrawler/scripts/crawl_docs.py \
--url <base-url> \
--subject <topic> \
--output <directory> \
--depth <n> \
--filter <pattern> \
--delay <seconds> \
--max-pages <n> \
--same-domain \
--include-code \
--format <md|json|both>
Outputs:
index.md — Master index with links to all pages
pages/*.md — Individual markdown files per page
metadata.json — Crawl metadata and page inventory
content.json — Structured JSON with all extracted content
extract_page.py — Single Page Extractor
Extract content from a single documentation page.
python skills/webcrawler/scripts/extract_page.py \
--url <page-url> \
--output <file> \
--format <md|json> \
--include-links
filter_docs.py — Post-Crawl Filtering
Filter already-crawled documentation by subject or pattern.
python skills/webcrawler/scripts/filter_docs.py \
--input <crawl-dir> \
--subject <topic> \
--output <directory> \
--threshold <0.0-1.0>
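How filter_docs.py scores a page against --threshold is not specified here. As one plausible reading, the sketch below scores a page by the fraction of subject terms it contains; `relevance`, `filter_pages`, the sample pages, and the default threshold are all assumptions for illustration.

```python
import re

def relevance(text, subject):
    """Fraction of subject terms that occur in the text (0.0 to 1.0)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    terms = re.findall(r"[a-z0-9]+", subject.lower())
    if not terms:
        return 0.0
    return sum(term in words for term in terms) / len(terms)

def filter_pages(pages, subject, threshold=0.5):
    """Return the names of pages scoring at or above the threshold."""
    return [name for name, text in pages.items()
            if relevance(text, subject) >= threshold]

# Hypothetical crawl results: file name -> extracted text.
pages = {
    "aws.md": "Configure the AWS provider with a region and credentials.",
    "azure.md": "The Azure provider authenticates with a service principal.",
    "intro.md": "Terraform is an infrastructure as code tool.",
}
print(filter_pages(pages, "AWS Provider", threshold=1.0))
print(filter_pages(pages, "AWS Provider", threshold=0.5))
```

A higher threshold keeps fewer, more relevant pages; a lower one trades precision for coverage.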
Configuration
Rate Limiting & Politeness
The crawler respects robots.txt and implements polite crawling:
- Default delay: 0.5s between requests
- User-Agent: Identifies as documentation harvester
- robots.txt: Honored by default (disable with --ignore-robots)
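The politeness rules above can be sketched with the standard library's `urllib.robotparser`. This is an illustration under assumptions, not the crawler's actual code: `USER_AGENT`, `build_robots`, and `polite_fetch_order` are hypothetical names, and the robots.txt rules are passed in as lines so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "docs-harvester"  # hypothetical agent name

def build_robots(lines):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

def polite_fetch_order(urls, robots, delay=0.5, sleep=time.sleep):
    """Yield URLs allowed by robots.txt, pausing `delay` seconds between them."""
    first = True
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt
        if not first:
            sleep(delay)
        first = False
        yield url

robots = build_robots(["User-agent: *", "Disallow: /private/"])
urls = [
    "https://example.com/docs/a",
    "https://example.com/private/x",
    "https://example.com/docs/b",
]
pauses = []  # record pauses instead of sleeping, to keep the demo instant
print(list(polite_fetch_order(urls, robots, sleep=pauses.append)), pauses)
```

Passing `sleep` as a parameter keeps the delay logic easy to exercise without waiting.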
Domain Handling
| Mode | Behavior |
|---|---|
| --same-domain | Only crawl pages on the starting domain |
| --same-path | Only crawl pages under the starting URL path |
| --allow-subdomains | Include subdomains (e.g., api.example.com) |
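One way to read the three modes is as a scope check applied to each discovered link. The sketch below is an assumption about the semantics, not the crawler's implementation: `in_scope` is a hypothetical helper, and "subdomain" is taken to mean a subdomain of the starting host.

```python
from urllib.parse import urlsplit

def in_scope(url, start, mode="same-domain"):
    """Decide whether `url` may be crawled given the starting URL and mode."""
    target, base = urlsplit(url), urlsplit(start)
    host = target.hostname or ""
    base_host = base.hostname or ""
    if mode == "allow-subdomains":
        # Assumption: only subdomains of the starting host count.
        return host == base_host or host.endswith("." + base_host)
    if host != base_host:
        return False
    if mode == "same-path":
        prefix = base.path if base.path.endswith("/") else base.path + "/"
        return target.path == base.path or target.path.startswith(prefix)
    return True  # same-domain

start = "https://example.com/docs/"
print(in_scope("https://example.com/blog", start, "same-domain"))
print(in_scope("https://example.com/blog", start, "same-path"))
print(in_scope("https://api.example.com/v1", start, "allow-subdomains"))
```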
Content Extraction
The crawler cleans each page as it extracts content:
- Main content detection: Finds <main>, <article>, or content containers
- Navigation removal: Strips headers, footers, sidebars
- Code preservation: Maintains code blocks with language hints
- Link normalization: Converts relative links to absolute
- Image handling: Optionally downloads and references images
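Two of the steps above, main content detection and link normalization, can be illustrated with the standard library alone. The crawler itself lists beautifulsoup4 as a dependency, so treat this `html.parser` version as a stand-in sketch; `MainLinkParser` is a hypothetical class name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class MainLinkParser(HTMLParser):
    """Collect absolute links found inside <main> or <article>."""

    CONTENT_TAGS = {"main", "article"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0   # nesting level inside content containers
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.depth > 0:
            self.depth -= 1

html = """
<nav><a href="/home">Home</a></nav>
<main><a href="tasks.html">Tasks</a> <a href="../api/">API</a></main>
<footer><a href="/legal">Legal</a></footer>
"""
parser = MainLinkParser("https://example.com/docs/guide/index.html")
parser.feed(html)
print(parser.links)
```

Links in the `<nav>` and `<footer>` elements are ignored because they sit outside the content containers.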
Output Structure
.tmp/docs/<subject>/
├── index.md # Master index with TOC
├── metadata.json # Crawl metadata
├── content.json # Structured JSON export
└── pages/
├── getting-started.md
├── installation.md
├── api-reference.md
├── configuration/
│ ├── basic.md
│ └── advanced.md
└── troubleshooting.md
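A layout like the one above implies some mapping from page URLs to files under pages/. The exact rule is not documented here, so the following is only a sketch of one plausible scheme; `page_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def page_path(url, base_url):
    """Map a page URL to a markdown path under pages/ (hypothetical scheme)."""
    base = urlsplit(base_url).path.rstrip("/")
    path = urlsplit(url).path
    # Drop the base path so files are laid out relative to the crawl root.
    if path == base or path.startswith(base + "/"):
        path = path[len(base):]
    relative = path.strip("/") or "index"
    return str(PurePosixPath("pages") / PurePosixPath(relative).with_suffix(".md"))

base = "https://example.com/docs/"
print(page_path("https://example.com/docs/configuration/basic.html", base))
print(page_path("https://example.com/docs/getting-started", base))
```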
Index Format
# <Subject> Documentation
> Crawled from: <base-url>
> Pages: <count>
> Date: <timestamp>
## Table of Contents
- [Getting Started](pages/getting-started.md)
- [Installation](pages/installation.md)
- [API Reference](pages/api-reference.md)
- Configuration
- [Basic](pages/configuration/basic.md)
- [Advanced](pages/configuration/advanced.md)
- [Troubleshooting](pages/troubleshooting.md)
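An index in this format can be produced from a list of page titles and paths. The sketch below assumes a flat list without the nested Configuration group; `build_index` and its arguments are hypothetical names.

```python
def build_index(subject, base_url, timestamp, pages):
    """Render index.md text; `pages` is a list of (title, relative_path)."""
    lines = [
        f"# {subject} Documentation",
        "",
        f"> Crawled from: {base_url}",
        f"> Pages: {len(pages)}",
        f"> Date: {timestamp}",
        "",
        "## Table of Contents",
        "",
    ]
    lines += [f"- [{title}]({path})" for title, path in pages]
    return "\n".join(lines)

print(build_index(
    "Example",
    "https://docs.example.com",
    "2026-01-23",
    [("Getting Started", "pages/getting-started.md"),
     ("Installation", "pages/installation.md")],
))
```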
Common Workflows
1. Harvest API Documentation
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://api.example.com/docs" \
--subject "Example API" \
--depth 4 \
--filter "/api/" \
--output .tmp/docs/example-api/
2. Build RAG Knowledge Base
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://docs.example.com" \
--subject "Example Docs" \
--depth 3 \
--format json \
--output .tmp/rag/example/
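Before embedding, harvested text is usually split into overlapping chunks. The schema of content.json is not described here, so this sketch works on a plain string; `chunk_text` and its character-based sizes are assumptions, not part of the skill.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

page_text = "asyncio is a library to write concurrent code. " * 10
chunks = chunk_text(page_text, size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.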
3. Offline Documentation Mirror
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://kubernetes.io/docs/concepts/" \
--subject "Kubernetes Concepts" \
--depth 5 \
--max-pages 500 \
--include-images \
--output .tmp/docs/k8s-concepts/
4. Focused Topic Extraction
python skills/webcrawler/scripts/crawl_docs.py \
--url "https://developer.hashicorp.com/terraform/docs" \
--subject "Terraform" \
--depth 3 \
--output .tmp/docs/terraform-full/
python skills/webcrawler/scripts/filter_docs.py \
--input .tmp/docs/terraform-full/ \
--subject "AWS Provider" \
--output .tmp/docs/terraform-aws/
Best Practices
Crawling
- Start shallow: Begin with --depth 1 to test, then increase
- Use filters: Narrow scope with --filter patterns
- Set page limits: Use --max-pages to prevent runaway crawls
- Respect rate limits: Increase --delay for slower servers
Content Quality
- Subject focus: Be specific with --subject for better filtering
- Review index: Check index.md to verify crawl coverage
- Post-filter: Use filter_docs.py to refine results
Storage
- Use .tmp/: Store crawled docs in the temp directory
- Organize by subject: Create subdirectories per topic
- Version with dates: Add timestamps for recurring crawls
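The storage conventions above can be combined into a single output path per crawl. A small sketch, assuming a hypothetical `dated_output_dir` helper and a slug-plus-ISO-date naming scheme that the skill does not prescribe:

```python
from datetime import date
from pathlib import Path

def dated_output_dir(subject, day=None, root=".tmp/docs"):
    """Build a per-subject, per-date directory path (naming is an assumption)."""
    day = day or date.today()
    slug = "-".join(subject.lower().split())
    return Path(root) / f"{slug}-{day.isoformat()}"

print(dated_output_dir("Kubernetes Concepts", date(2026, 1, 23)).as_posix())
```

The resulting path can be passed to --output so that recurring crawls do not overwrite each other.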
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| 403 Forbidden | Blocked by server | Increase delay, check robots.txt |
| Empty pages | JavaScript-rendered content | Use --render-js (requires Playwright) |
| Too many pages | Unbounded crawl | Lower depth, use filters |
| Duplicate content | Same page via multiple URLs | Handled by URL normalization, which is enabled by default |
| Missing code blocks | Extraction issue | Check --include-code is enabled |
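URL normalization, mentioned above as the guard against duplicate content, typically means reducing equivalent URLs to one canonical form. The rules the crawler applies are not listed here; this sketch assumes lowercasing the scheme and host, dropping default ports and fragments, and stripping trailing slashes. `normalize` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    """Return a canonical form of `url` for duplicate detection."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Assumption: /docs/ and /docs refer to the same page.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))

urls = [
    "https://example.com/docs/",
    "https://EXAMPLE.com/docs#top",
    "https://example.com:443/docs",
]
print({normalize(u) for u in urls})
```

All three URLs collapse to a single entry, so the page would be fetched once.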
Dependencies
Required Python packages:
pip install requests beautifulsoup4 html2text lxml
# Optional: only needed for --render-js
pip install playwright && playwright install
AGI Framework Integration
Qdrant Memory Integration
Before executing complex tasks with this skill:
python3 execution/memory_manager.py auto --query "<task summary>"
Decision Tree:
- Cache hit? Use cached response directly — no need to re-process.
- Memory match? Inject context_chunks into your reasoning.
- No match? Proceed normally, then store results:
python3 execution/memory_manager.py store \
--content "Description of what was decided/solved" \
--type decision \
--tags webcrawler <relevant-tags>
Note: Storing automatically updates both Vector (Qdrant) and Keyword (BM25) indices.
Agent Team Collaboration
- Strategy: This skill communicates via the shared memory system.
- Orchestration: Invoked by orchestrator via intelligent routing.
- Context Sharing: Always read previous agent outputs from memory before starting.
Local LLM Support
When available, use local Ollama models for embedding and lightweight inference:
- Embeddings: nomic-embed-text via Qdrant memory system
- Lightweight analysis: Local models reduce API costs for repetitive patterns