| name | openclaw-ultra-scraping |
| description | Web scraping, crawling, and data extraction with stealth anti-bot bypass (Cloudflare Turnstile, CAPTCHAs) out of the box. Use when: (1) scraping websites that block normal requests, (2) extracting structured data from web pages, (3) crawling multiple pages with concurrency, (4) taking screenshots of web pages, (5) extracting links, (6) any web scraping task needs stealth or anti-detection, (7) the user asks to scrape, crawl, or extract from URLs, (8) Cloudflare or other bot protection must be bypassed. Supports CSS/XPath selectors, adaptive element tracking (survives site redesigns), multi-session spiders, pause/resume crawls, proxy rotation, and async operations. Powered by MyClaw.ai. |
|
OpenClaw Ultra Scraping
Adaptive web scraping framework for OpenClaw agents. Handles everything from single-page extraction to full-scale concurrent crawls with anti-bot bypass.
Setup
Run once before first use:
bash scripts/setup.sh
This installs all dependencies and browser engines into /opt/scrapling-venv.
Quick Start: CLI Script
The bundled scripts/scrape.py provides a unified CLI:
PYTHON=/opt/scrapling-venv/bin/python3
$PYTHON scripts/scrape.py fetch "https://example.com" --css ".content"
$PYTHON scripts/scrape.py extract "https://example.com" --css "h1"
$PYTHON scripts/scrape.py fetch "https://protected-site.com" --stealth --solve-cloudflare --css ".data"
$PYTHON scripts/scrape.py fetch "https://spa-site.com" --dynamic --css ".product"
$PYTHON scripts/scrape.py links "https://example.com" --filter "\.pdf$"
$PYTHON scripts/scrape.py crawl "https://example.com" --depth 2 --concurrency 10 --css ".item" -o results.json
$PYTHON scripts/scrape.py fetch "https://example.com" -f markdown -o page.md
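When the CLI is driven from Python rather than from a shell, the invocations above can be assembled as an argument list. The following is a minimal sketch, not part of the bundled scripts: build_scrape_command and run_scrape are hypothetical helpers, and they use only the subcommands and flags shown above (--css, --stealth, --solve-cloudflare, --dynamic, -f, -o).

```python
import subprocess

# Interpreter and script paths as given in this document.
PYTHON = "/opt/scrapling-venv/bin/python3"
SCRIPT = "scripts/scrape.py"


def build_scrape_command(subcommand, url, css=None, stealth=False,
                         solve_cloudflare=False, dynamic=False,
                         fmt=None, output=None):
    """Return the argv list for one scrape.py call (hypothetical helper)."""
    cmd = [PYTHON, SCRIPT, subcommand, url]
    if stealth:
        cmd.append("--stealth")
    if solve_cloudflare:
        cmd.append("--solve-cloudflare")
    if dynamic:
        cmd.append("--dynamic")
    if css:
        cmd += ["--css", css]
    if fmt:
        cmd += ["-f", fmt]
    if output:
        cmd += ["-o", output]
    return cmd


def run_scrape(cmd):
    """Run the command and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_scrape(cmd) to execute it.
    cmd = build_scrape_command("fetch", "https://protected-site.com",
                               css=".data", stealth=True,
                               solve_cloudflare=True)
    print(cmd)
```

Passing a list to subprocess.run avoids shell quoting problems with selectors such as ".item > a".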
Quick Start: Python
For complex tasks, write Python directly using the venv:
from scrapling.fetchers import Fetcher, StealthyFetcher
# Fast HTTP request that impersonates Chrome
page = Fetcher.get('https://example.com', impersonate='chrome')
titles = page.css('h1::text').getall()
# Stealth browser fetch that also solves the Cloudflare challenge
page = StealthyFetcher.fetch('https://protected.com', headless=True, solve_cloudflare=True)
data = page.css('.product').getall()
Fetcher Selection Guide
| Scenario | Fetcher | Flag |
|---|---|---|
| Normal sites, fast scraping | Fetcher | (default) |
| JS-rendered SPAs | DynamicFetcher | --dynamic |
| Cloudflare/anti-bot protected | StealthyFetcher | --stealth |
| Cloudflare Turnstile challenge | StealthyFetcher | --stealth --solve-cloudflare |
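The table can be read as a small decision function. The sketch below is illustrative only: choose_fetcher is a hypothetical helper, and the precedence it applies when several conditions hold (bot protection wins over JS rendering) is an assumption, since the table does not say how to combine scenarios.

```python
def choose_fetcher(js_rendered=False, protected=False, turnstile=False):
    """Map a scenario from the table to (fetcher class name, CLI flags).

    Assumption: protection-related scenarios take precedence over
    JS rendering when more than one condition is true.
    """
    if turnstile:
        return "StealthyFetcher", ["--stealth", "--solve-cloudflare"]
    if protected:
        return "StealthyFetcher", ["--stealth"]
    if js_rendered:
        return "DynamicFetcher", ["--dynamic"]
    # Normal sites: the default fetcher needs no extra flag.
    return "Fetcher", []


if __name__ == "__main__":
    name, flags = choose_fetcher(js_rendered=True)
    print(name, flags)
```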
Selector Cheat Sheet
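page.css('.class')                    # elements matching a CSS selector
page.css('.class::text').getall()     # text of every match, as a list
page.xpath('//div[@id="main"]')       # elements matching an XPath expression
page.find_all('div', class_='item')   # filter by tag name and attributes
page.find_by_text('keyword')          # match elements by their text content
page.css('.item', adaptive=True)      # relocate elements after the page layout changes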
Advanced Features
- Adaptive tracking: auto_save=True on first run, adaptive=True later; elements are found even after a site redesign
- Proxy rotation: Pass proxy="http://host:port" or use ProxyRotator
- Sessions: FetcherSession, StealthySession, DynamicSession for cookie/state persistence
- Spider framework: Scrapy-like concurrent crawling with pause/resume
- Async support: All fetchers have async variants
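To illustrate the proxy rotation idea, here is a minimal round-robin sketch. It is a stand-in written for this example, not the library's ProxyRotator, whose interface is covered in references/api-reference.md. The class name, method name, and proxy hosts are all placeholders.

```python
from itertools import cycle


class RoundRobinProxies:
    """Stand-in rotator for illustration; not the library's ProxyRotator."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        # cycle() repeats the pool endlessly in its original order.
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL, wrapping around at the end."""
        return next(self._pool)


if __name__ == "__main__":
    # Placeholder hosts; replace with real proxy endpoints.
    rotator = RoundRobinProxies(["http://proxy-a:8080", "http://proxy-b:8080"])
    for _ in range(3):
        # Each value would be passed as proxy=... to a fetcher call.
        print(rotator.next_proxy())
```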
For full API details, read references/api-reference.md.