Run any Skill in Manus with one click

$pwd:

webcrawler

Name: Webcrawler
Author: techwavedev

// Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output.

Run Skill in Manus

$ git log --oneline --stat

stars:3

forks:2

updated:April 12, 2026 at 22:03

File Explorer

5 files

SKILL.md

readonly

related-skills.json

same repository

claude-code-design.md

from "techwavedev/agi-agent-kit"

Answer questions about Claude Code / Anthropic agent design patterns (CLAUDE.md layering, skills anatomy, progressive disclosure, memory-first, orchestration, sub-agents, worktree isolation, Karpathy loop). Primary backend is a NotebookLM RAG over Anthropic design notebooks; falls back to bundled references when the notebook is not registered.

2026-04-153

notebooklm-internal.md

from "techwavedev/agi-agent-kit"

INTERNAL ONLY. Forked write-capable variant of the public notebooklm skill. Adds programmatic source ingestion (add_source) for the X/YouTube → NotebookLM pipeline (#119). Headless, non-interactive, cron-driven. NEVER ship publicly.

2026-04-153

ui-ux-pro-max.md

from "techwavedev/agi-agent-kit"

UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 9 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind, shadcn/ui). Actions: plan, build, cr...

2026-04-123

image-ai-generator.md

from "techwavedev/agi-agent-kit"

Generates images via Openrouter API using AI image models. Supports two modes: test (cheap model for iteration) and production (high-quality model for final output). Handles prompt construction, API calls, base64 decoding, and file saving. Supports reference images (logos, mascots) for brand-consistent generation.

2026-04-123

qdrant-memory.md

from "techwavedev/agi-agent-kit"

Intelligent token optimization through Qdrant-powered semantic caching and long-term memory. Use for (1) Semantic Cache - avoid LLM calls entirely for semantically similar queries with 100% token savings, (2) Long-Term Memory - retrieve only relevant context chunks instead of full conversation history with 80-95% context reduction, (3) Hybrid Search - combine vector similarity with keyword filtering for technical queries, (4) Memory Management - store and retrieve conversation memories, decisions, and code patterns with metadata filtering. Triggers when needing to cache responses, remember past interactions, optimize context windows, or implement RAG patterns.

2026-04-123

activecampaign-automation.md

from "techwavedev/agi-agent-kit"

Automate ActiveCampaign tasks via Rube MCP (Composio): manage contacts, tags, list subscriptions, automation enrollment, and tasks. Always search tools first for current schemas.

2026-04-123

package.json

"author": "techwavedev"

"repository": "techwavedev/agi-agent-kit"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

name

webcrawler

description

Documentation harvesting agent for crawling and extracting content from documentation websites. Use for crawling documentation sites and extracting all pages about a subject, building offline knowledge bases from online docs, harvesting API references, tutorials, or guides from documentation portals, creating structured markdown exports from multi-page documentation, and downloading and organizing technical docs for embedding or RAG pipelines. Supports recursive crawling with depth control, content filtering, and structured output.

Webcrawler Skill

Intelligent documentation harvesting agent that recursively crawls documentation websites and extracts structured content about specific subjects.

Last Updated: 2026-01-23

Quick Start

# Crawl Python documentation about async/await
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://docs.python.org/3/library/asyncio.html" \
  --subject "asyncio" \
  --depth 2 \
  --output .tmp/docs/python-asyncio/

# Crawl React documentation
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://react.dev/" \
  --subject "React" \
  --depth 3 \
  --output .tmp/docs/react/

# Extract only API reference pages
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://expressjs.com/en/4x/api.html" \
  --subject "Express API" \
  --filter "api" \
  --output .tmp/docs/express-api/

Core Workflow

Initialize Crawl — Provide base URL and subject focus
Discover Pages — Recursively find all linked documentation pages
Filter Content — Keep only pages matching the subject criteria
Extract Content — Convert HTML to clean markdown
Organize Output — Structure files in a navigable hierarchy
Generate Index — Create a master index with all harvested pages

Scripts

`crawl_docs.py` — Main Documentation Crawler

The primary crawling script that handles recursive page discovery and content extraction.

python skills/webcrawler/scripts/crawl_docs.py \
  --url <base-url>           # Starting URL (required)
  --subject <topic>          # Subject focus for filtering (required)
  --output <directory>       # Output directory (default: .tmp/crawled/)
  --depth <n>                # Max crawl depth (default: 2)
  --filter <pattern>         # URL path filter pattern (optional)
  --delay <seconds>          # Delay between requests (default: 0.5)
  --max-pages <n>            # Maximum pages to crawl (default: 100)
  --same-domain              # Stay within same domain (default: true)
  --include-code             # Preserve code blocks (default: true)
  --format <md|json|both>    # Output format (default: both)

Outputs:

index.md — Master index with links to all pages
pages/*.md — Individual markdown files per page
metadata.json — Crawl metadata and page inventory
content.json — Structured JSON with all extracted content

`extract_page.py` — Single Page Extractor

Extract content from a single documentation page.

python skills/webcrawler/scripts/extract_page.py \
  --url <page-url>           # Page to extract (required)
  --output <file>            # Output file (default: stdout)
  --format <md|json>         # Output format (default: md)
  --include-links            # Include internal links (default: true)

`filter_docs.py` — Post-Crawl Filtering

Filter already-crawled documentation by subject or pattern.

python skills/webcrawler/scripts/filter_docs.py \
  --input <crawl-dir>        # Crawled docs directory (required)
  --subject <topic>          # Subject to filter for (required)
  --output <directory>       # Filtered output directory (required)
  --threshold <0.0-1.0>      # Relevance threshold (default: 0.3)

Configuration

Rate Limiting & Politeness

The crawler respects robots.txt and implements polite crawling:

Default delay: 0.5s between requests
User-Agent: Identifies as documentation harvester
robots.txt: Honored by default (disable with --ignore-robots)

Domain Handling

Mode	Behavior
`--same-domain`	Only crawl pages on the starting domain
`--same-path`	Only crawl pages under the starting URL path
`--allow-subdomains`	Include subdomains (e.g., api.example.com)

Content Extraction

The crawler uses intelligent content extraction:

Main content detection — Finds <main>, <article>, or content containers
Navigation removal — Strips headers, footers, sidebars
Code preservation — Maintains code blocks with language hints
Link normalization — Converts relative links to absolute
Image handling — Optionally downloads and references images

Output Structure

.tmp/docs/<subject>/
├── index.md              # Master index with TOC
├── metadata.json         # Crawl metadata
├── content.json          # Structured JSON export
└── pages/
    ├── getting-started.md
    ├── installation.md
    ├── api-reference.md
    ├── configuration/
    │   ├── basic.md
    │   └── advanced.md
    └── troubleshooting.md

Index Format

# <Subject> Documentation

> Crawled from: <base-url>
> Pages: <count>
> Date: <timestamp>

## Table of Contents

- [Getting Started](pages/getting-started.md)
- [Installation](pages/installation.md)
- [API Reference](pages/api-reference.md)
- Configuration
  - [Basic](pages/configuration/basic.md)
  - [Advanced](pages/configuration/advanced.md)
- [Troubleshooting](pages/troubleshooting.md)

Common Workflows

1. Harvest API Documentation

# Crawl API docs with deep recursion
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://api.example.com/docs" \
  --subject "Example API" \
  --depth 4 \
  --filter "/api/" \
  --output .tmp/docs/example-api/

2. Build RAG Knowledge Base

# Crawl and export as JSON for embedding
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://docs.example.com" \
  --subject "Example Docs" \
  --depth 3 \
  --format json \
  --output .tmp/rag/example/

# The content.json can be fed directly to embedding pipelines

3. Offline Documentation Mirror

# Full documentation harvest
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://docs.kubernetes.io/docs/concepts/" \
  --subject "Kubernetes Concepts" \
  --depth 5 \
  --max-pages 500 \
  --include-images \
  --output .tmp/docs/k8s-concepts/

4. Focused Topic Extraction

# Crawl, then filter to specific topic
python skills/webcrawler/scripts/crawl_docs.py \
  --url "https://developer.hashicorp.com/terraform/docs" \
  --subject "Terraform" \
  --depth 3 \
  --output .tmp/docs/terraform-full/

# Filter to AWS provider only
python skills/webcrawler/scripts/filter_docs.py \
  --input .tmp/docs/terraform-full/ \
  --subject "AWS Provider" \
  --output .tmp/docs/terraform-aws/

Best Practices

Crawling

Start shallow — Begin with --depth 1 to test, then increase
Use filters — Narrow scope with --filter patterns
Set page limits — Use --max-pages to prevent runaway crawls
Respect rate limits — Increase --delay for slower servers

Content Quality

Subject focus — Be specific with --subject for better filtering
Review index — Check index.md to verify crawl coverage
Post-filter — Use filter_docs.py to refine results

Storage

Use .tmp/ — Store crawled docs in the temp directory
Organize by subject — Create subdirectories per topic
Version with dates — Add timestamps for recurring crawls

Troubleshooting

Issue	Cause	Solution
403 Forbidden	Blocked by server	Increase delay, check robots.txt
Empty pages	JavaScript-rendered content	Use `--render-js` (requires Playwright)
Too many pages	Unbounded crawl	Lower depth, use filters
Duplicate content	Same page via multiple URLs	Enabled by default (URL normalization)
Missing code blocks	Extraction issue	Check `--include-code` is enabled

Dependencies

Required Python packages:

pip install requests beautifulsoup4 html2text lxml
# Optional for JavaScript rendering:
pip install playwright && playwright install

Related Skills

qdrant-memory — Store crawled docs in vector database for RAG
pdf-reader — Extract text from PDF documentation

External Resources

Scrapy Documentation — For complex crawling needs
html2text — HTML to Markdown conversion
BeautifulSoup — HTML parsing

AGI Framework Integration

Qdrant Memory Integration

Before executing complex tasks with this skill:

python3 execution/memory_manager.py auto --query "<task summary>"

Decision Tree:

Cache hit? Use cached response directly — no need to re-process.
Memory match? Inject context_chunks into your reasoning.
No match? Proceed normally, then store results:

python3 execution/memory_manager.py store \
  --content "Description of what was decided/solved" \
  --type decision \
  --tags webcrawler <relevant-tags>

Note: Storing automatically updates both Vector (Qdrant) and Keyword (BM25) indices.

Agent Team Collaboration- Strategy: This skill communicates via the shared memory system.

Orchestration: Invoked by orchestrator via intelligent routing.
Context Sharing: Always read previous agent outputs from memory before starting.

Local LLM Support

When available, use local Ollama models for embedding and lightweight inference:

Embeddings: nomic-embed-text via Qdrant memory system
Lightweight analysis: Local models reduce API costs for repetitive patterns

webcrawler

More from this repository

More from this repository

Webcrawler Skill

Quick Start

Core Workflow

Scripts

crawl_docs.py — Main Documentation Crawler

extract_page.py — Single Page Extractor

filter_docs.py — Post-Crawl Filtering

Configuration

Rate Limiting & Politeness

Domain Handling

Content Extraction

Output Structure

Index Format

Common Workflows

1. Harvest API Documentation

2. Build RAG Knowledge Base

3. Offline Documentation Mirror

4. Focused Topic Extraction

Best Practices

Crawling

Content Quality

Storage

Troubleshooting

Dependencies

Related Skills

External Resources

AGI Framework Integration

Qdrant Memory Integration

Agent Team Collaboration- Strategy: This skill communicates via the shared memory system.

Local LLM Support

Webcrawler Skill

Quick Start

Core Workflow

Scripts

crawl_docs.py — Main Documentation Crawler

extract_page.py — Single Page Extractor

filter_docs.py — Post-Crawl Filtering

Configuration

Rate Limiting & Politeness

Domain Handling

Content Extraction

Output Structure

Index Format

Common Workflows

1. Harvest API Documentation

2. Build RAG Knowledge Base

3. Offline Documentation Mirror

4. Focused Topic Extraction

Best Practices

Crawling

Content Quality

Storage

Troubleshooting

Dependencies

Related Skills

External Resources

AGI Framework Integration

Qdrant Memory Integration

Agent Team Collaboration- Strategy: This skill communicates via the shared memory system.

Local LLM Support

`crawl_docs.py` — Main Documentation Crawler

`extract_page.py` — Single Page Extractor

`filter_docs.py` — Post-Crawl Filtering

`crawl_docs.py` — Main Documentation Crawler

`extract_page.py` — Single Page Extractor

`filter_docs.py` — Post-Crawl Filtering