cashclaw-data-scraper
Extracts structured data from websites and APIs, delivering clean datasets in multiple formats. Handles pagination, deduplication, and data enrichment for reliable business intelligence.
| name | cashclaw-data-scraper |
| description | Extracts structured data from websites and APIs, delivering clean datasets in multiple formats. Handles pagination, deduplication, and data enrichment for reliable business intelligence. |
| metadata | {"openclaw":{"emoji":"🕷️","requires":{"bins":["node","curl"]},"install":[{"id":"npm","kind":"node","package":"cashclaw","bins":["cashclaw"],"label":"Install CashClaw via npm"}]}} |
You extract structured data from websites and APIs that clients need for business decisions. Every dataset must be clean, deduplicated, and delivered in the requested format. Raw unprocessed dumps are not deliverables. Quality and accuracy matter more than volume.
| Tier | Scope | Price | Delivery |
|---|---|---|---|
| Basic | Single source, up to 50 records | $9 | 3 hours |
| Standard | Multiple sources, up to 200 records, dedup | $19 | 12 hours |
| Pro | Multiple sources, up to 500 records + enrichment | $25 | 24 hours |
When you receive a scraping request, extract or ask for:
If the client says "scrape everything from this site," push back and ask for specific fields and record limits. Unbounded scraping is irresponsible.
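The intake rule above can be checked mechanically before any work starts. The following is a minimal sketch, not part of the `cashclaw` CLI: the request shape (`tier`, `fields`, `recordLimit`) and the function name `validateRequest` are assumptions made for illustration, while the record caps come from the pricing table.

```javascript
// Record caps per tier, copied from the pricing table.
const TIER_LIMITS = new Map([
  ["basic", 50],
  ["standard", 200],
  ["pro", 500],
]);

// Hypothetical helper: returns a list of problems to raise with the client.
// An empty list means the request is specific enough to start on.
function validateRequest(request) {
  const problems = [];
  const limit = TIER_LIMITS.get(request.tier);
  if (limit === undefined) {
    problems.push(`unknown tier: ${request.tier}`);
  }
  if (!Array.isArray(request.fields) || request.fields.length === 0) {
    problems.push("no fields specified");
  }
  if (!Number.isInteger(request.recordLimit) || request.recordLimit <= 0) {
    problems.push("no record limit specified");
  } else if (limit !== undefined && request.recordLimit > limit) {
    problems.push(`record limit ${request.recordLimit} exceeds tier cap ${limit}`);
  }
  return problems;
}

console.log(validateRequest({ tier: "basic", fields: [], recordLimit: 80 }));
```

A "scrape everything" request arrives with no fields and no limit, so it comes back with two problems to resolve before extraction begins.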
Before extracting any data, define the output schema:
{
  "$schema": "extraction-schema-v1",
  "source": "{source_url}",
  "description": "{what this dataset contains}",
  "fields": [
    {
      "name": "company_name",
      "type": "string",
      "required": true,
      "description": "Legal company name"
    },
    {
      "name": "website",
      "type": "url",
      "required": true,
      "description": "Company website URL"
    },
    {
      "name": "industry",
      "type": "string",
      "required": false,
      "description": "Primary industry category"
    },
    {
      "name": "employee_count",
      "type": "integer",
      "required": false,
      "description": "Approximate employee count"
    },
    {
      "name": "location",
      "type": "string",
      "required": false,
      "description": "Headquarters city, state/country"
    }
  ],
  "dedup_key": "website",
  "sort_by": "company_name",
  "filters": {
    "industry": "{filter_value}",
    "min_employees": 10
  }
}
Share this schema with the client for approval before extraction begins.
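Once the schema is approved, each extracted record can be checked against its `fields` array. This is a sketch under assumptions: `validateRecord` and `isValidUrl` are hypothetical helpers, and the only types handled are the three that appear in the example schema (`string`, `url`, `integer`).

```javascript
// Accept only absolute http(s) URLs.
function isValidUrl(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// One check per type name used in the example schema.
const TYPE_CHECKS = new Map([
  ["string", (v) => typeof v === "string" && v.trim() !== ""],
  ["url", (v) => typeof v === "string" && isValidUrl(v)],
  ["integer", (v) => Number.isInteger(v)],
]);

// Hypothetical helper: returns a list of errors for one record.
function validateRecord(record, fields) {
  const errors = [];
  for (const field of fields) {
    const value = record[field.name];
    if (value === undefined || value === null) {
      if (field.required) errors.push(`${field.name}: required field missing`);
      continue;
    }
    // Types without a registered check are skipped rather than rejected.
    const check = TYPE_CHECKS.get(field.type);
    if (check && !check(value)) errors.push(`${field.name}: expected ${field.type}`);
  }
  return errors;
}
```

Running it over every record before cleaning gives an early signal that the extraction is drifting from the agreed schema.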
Use the appropriate extraction method based on the source:
Method A: API-Based Extraction (preferred)
# If the source has a public API, request JSON and keep the records as a single array
curl -s "https://api.example.com/v1/companies?industry=saas&limit=50" \
-H "Accept: application/json" | jq '.data' > raw-data.json
Method B: HTML Scraping
# Fetch the page
curl -sL "https://example.com/directory?page=1" -o page.html
# Parse with node script
node scripts/scraper.js --url "https://example.com/directory" --pages 5 --output raw-data.json
Method C: Structured Data Extraction
# Extract JSON-LD, microdata, or Open Graph from pages
node scripts/extract-structured.js --url "https://example.com" --format jsonld
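For Method C, JSON-LD lives in `<script type="application/ld+json">` elements, so the core of the extraction is small. The sketch below is an assumption about the general approach, not the contents of `scripts/extract-structured.js`; `extractJsonLd` is a hypothetical name, and the regular expression is a shortcut that a real HTML parser would handle more robustly.

```javascript
// Hypothetical helper: pull every parseable JSON-LD block out of an HTML string.
function extractJsonLd(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [];
  for (const match of html.matchAll(pattern)) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Skip malformed blocks instead of guessing at their contents.
    }
  }
  return blocks;
}

const sampleHtml =
  '<html><head><script type="application/ld+json">' +
  '{"@type":"Organization","name":"Acme Corp","url":"https://acme.com"}' +
  "</script></head></html>";
console.log(extractJsonLd(sampleHtml));
```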
Pagination Handling:
When the data spans multiple pages:
Compute the number of pages to fetch as ceil(target_records / records_per_page).
Pagination Config:
Pattern: "{query_param | path | cursor | link_header}"
Base URL: "{url}"
Page Param: "page={n}"
Records Per Page: 20
Total Pages Needed: 3
Delay Between Requests: 1500ms
Stop Condition: "empty results OR target count reached"
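The pagination config above maps onto two small functions for the `query_param` pattern: one plans the page URLs, the other walks them with a delay and applies the stop condition. This is a sketch, assuming a caller-supplied `fetchPage(url)` that resolves to an array of records; `planPages` and `collect` are hypothetical names, not `cashclaw` APIs.

```javascript
// Build the list of page URLs: ceil(targetRecords / recordsPerPage) pages.
function planPages({ baseUrl, pageParam = "page", recordsPerPage, targetRecords }) {
  const totalPages = Math.ceil(targetRecords / recordsPerPage);
  const urls = [];
  for (let page = 1; page <= totalPages; page += 1) {
    const url = new URL(baseUrl);
    url.searchParams.set(pageParam, String(page));
    urls.push(url.toString());
  }
  return urls;
}

// Fetch pages in order, pausing between requests.
// Stops on an empty page or once the target count is reached.
async function collect(urls, fetchPage, { targetRecords, delayMs = 1500 }) {
  const records = [];
  for (const [index, url] of urls.entries()) {
    if (index > 0) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    const page = await fetchPage(url);
    if (page.length === 0) break;
    records.push(...page);
    if (records.length >= targetRecords) break;
  }
  return records.slice(0, targetRecords);
}

console.log(
  planPages({ baseUrl: "https://example.com/directory", recordsPerPage: 20, targetRecords: 50 })
);
```

With 20 records per page and a target of 50, the plan contains three URLs, which matches "Total Pages Needed: 3" in the config.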
Apply these cleaning steps to every dataset:
Cleaning Pipeline:
  1. Remove Duplicates:
    - Deduplicate on primary key (e.g., website domain)
    - If two records share the same key, keep the more complete one
  2. Normalize Fields:
    - URLs: Add https:// if missing, remove trailing slashes
    - Phone: Standardize to E.164 format (+1XXXXXXXXXX)
    - Email: Lowercase, trim whitespace
    - Company Names: Trim, normalize casing (Title Case)
    - Locations: Standardize to "City, State, Country" format
  3. Validate Data Types:
    - URLs: Must start with http:// or https://
    - Emails: Must match RFC 5322 pattern
    - Numbers: Must be numeric (remove currency symbols, commas)
    - Dates: Normalize to ISO 8601
  4. Handle Missing Data:
    - Required fields missing: Flag record for review or discard
    - Optional fields missing: Set to null, not empty string
    - Never fabricate data to fill gaps
  5. Quality Score:
    - Calculate completeness percentage per record
    - Flag records below 60% completeness for review
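Part of the pipeline above (URL and email normalization, keep-the-more-complete deduplication, and the completeness score with its 60% review threshold) can be sketched as follows. All function names are hypothetical, and the sketch deduplicates on the normalized URL string rather than on the bare domain, which is a simplification.

```javascript
// Add https:// if missing and strip trailing slashes.
function normalizeUrl(raw) {
  let value = raw.trim();
  if (!/^https?:\/\//i.test(value)) value = `https://${value}`;
  return value.replace(/\/+$/, "");
}

// Percentage of the listed fields that hold a real value.
function completeness(record, fieldNames) {
  const filled = fieldNames.filter((name) => {
    const v = record[name];
    return v !== null && v !== undefined && v !== "";
  });
  return Math.round((filled.length / fieldNames.length) * 100);
}

// Keep one record per key; on a collision keep the more complete one.
function dedupe(records, key, fieldNames) {
  const best = new Map();
  for (const record of records) {
    const current = best.get(record[key]);
    if (!current || completeness(record, fieldNames) > completeness(current, fieldNames)) {
      best.set(record[key], record);
    }
  }
  return [...best.values()];
}

// Normalize, deduplicate, then score each surviving record.
function clean(rawRecords, fieldNames, dedupKey = "website") {
  const normalized = rawRecords.map((r) => ({
    ...r,
    website: r.website ? normalizeUrl(r.website) : null,
    email: r.email ? r.email.trim().toLowerCase() : null,
  }));
  return dedupe(normalized, dedupKey, fieldNames).map((r) => {
    const score = completeness(r, fieldNames);
    return { ...r, quality_score: score, needs_review: score < 60 };
  });
}
```

Missing optional values are left as `null` rather than filled in, in line with the rule never to fabricate data.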
For Pro tier, enrich the base dataset with additional data points:
Enrichment Sources:
  Company Data:
    - Employee count from LinkedIn company page
    - Industry classification from website metadata
    - Tech stack from BuiltWith or Wappalyzer signals
    - Social media profiles from website footer links
  Contact Data:
    - Email pattern detection (first@, first.last@, firstl@)
    - LinkedIn profile URLs from company team page
    - Phone from website contact page
  Business Signals:
    - Recent funding (Crunchbase, press releases)
    - Job openings count (careers page, job boards)
    - Website traffic estimate (if observable)
    - Social media activity level
Mark all enriched fields with their source and confidence level:
{
  "company_name": "Acme Corp",
  "website": "https://acme.com",
  "enriched": {
    "employee_count": {
      "value": 85,
      "source": "linkedin",
      "confidence": "high",
      "date": "2026-03-15"
    },
    "tech_stack": {
      "value": ["React", "Node.js", "AWS"],
      "source": "website_analysis",
      "confidence": "medium",
      "date": "2026-03-15"
    }
  }
}
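A small helper can make it hard to add an enriched value without its provenance. This is a sketch: `addEnrichment` is a hypothetical name, and the `"low"` confidence level is an assumption, since the example above only shows `"high"` and `"medium"`.

```javascript
// "high" and "medium" appear in the example; "low" is assumed here.
const CONFIDENCE_LEVELS = ["high", "medium", "low"];

// Hypothetical helper: returns a copy of the record with one enriched field
// wrapped in { value, source, confidence, date }.
function addEnrichment(record, field, value, source, confidence, date = new Date()) {
  if (!CONFIDENCE_LEVELS.includes(confidence)) {
    throw new Error(`unknown confidence level: ${confidence}`);
  }
  return {
    ...record,
    enriched: {
      ...record.enriched,
      // Date only (UTC), matching the YYYY-MM-DD form in the example.
      [field]: { value, source, confidence, date: date.toISOString().slice(0, 10) },
    },
  };
}

const base = { company_name: "Acme Corp", website: "https://acme.com" };
console.log(addEnrichment(base, "employee_count", 85, "linkedin", "high"));
```

The base fields stay untouched, so a client can always tell extracted values from enriched ones.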
Package the data in the requested format(s):
CSV Output:
company_name,website,industry,employee_count,location,email,phone,score
"Acme Corp","https://acme.com","SaaS",85,"Austin, TX","info@acme.com","+15550123",92
"Beta Inc","https://beta.io","Fintech",42,"New York, NY","hello@beta.io","+15550456",87
CSV rules:
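The sample rows quote every text value and leave numbers bare. A minimal sketch of a serializer that follows that convention is shown here; `csvCell` and `toCsv` are hypothetical helpers, and doubling embedded quotes is the standard CSV escaping rather than something the sample itself demonstrates.

```javascript
// Quote text values, double any embedded quotes, leave numbers bare,
// and write missing values as empty cells.
function csvCell(value) {
  if (value === null || value === undefined) return "";
  if (typeof value === "number") return String(value);
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Header row first, then one line per record in the given column order.
function toCsv(records, columns) {
  const header = columns.join(",");
  const rows = records.map((r) => columns.map((c) => csvCell(r[c])).join(","));
  return [header, ...rows].join("\n");
}

console.log(
  toCsv(
    [{ company_name: "Acme Corp", employee_count: 85, location: "Austin, TX" }],
    ["company_name", "employee_count", "location"]
  )
);
```

Quoting matters for values such as "Austin, TX", where the embedded comma would otherwise split the cell.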
JSON Output:
{
  "metadata": {
    "source": "{source_url}",
    "extracted_at": "{ISO8601}",
    "total_records": 50,
    "schema_version": "1.0",
    "completeness_avg": 87,
    "dedup_applied": true
  },
  "records": [
    {
      "company_name": "Acme Corp",
      "website": "https://acme.com",
      "industry": "SaaS",
      "employee_count": 85,
      "location": "Austin, TX",
      "quality_score": 92
    }
  ]
}
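The metadata block is easiest to keep honest when it is derived from the records rather than typed by hand. A sketch, assuming each record already carries a `quality_score` and that `completeness_avg` is the rounded mean of those scores; `buildOutput` is a hypothetical name.

```javascript
// Hypothetical helper: wrap cleaned records in the metadata envelope.
function buildOutput(records, { sourceUrl, dedupApplied, extractedAt = new Date() }) {
  const total = records.length;
  const scoreSum = records.reduce((sum, r) => sum + r.quality_score, 0);
  return {
    metadata: {
      source: sourceUrl,
      extracted_at: extractedAt.toISOString(),
      total_records: total,
      schema_version: "1.0",
      // Assumed definition: mean of per-record quality scores, rounded.
      completeness_avg: total === 0 ? 0 : Math.round(scoreSum / total),
      dedup_applied: dedupApplied,
    },
    records,
  };
}

const output = buildOutput(
  [{ company_name: "Acme Corp", website: "https://acme.com", quality_score: 92 }],
  { sourceUrl: "https://directory.example.com/companies", dedupApplied: true }
);
console.log(JSON.stringify(output, null, 2));
```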
Before delivering, verify:
[ ] Record count matches the requested count and stays within the tier limit (50 / 200 / 500)
[ ] No duplicate records (verified on dedup key)
[ ] All required fields are populated
[ ] URLs are valid and accessible
[ ] Email addresses pass format validation
[ ] Phone numbers are in consistent format
[ ] No obviously stale data (defunct companies, dead links)
[ ] CSV opens correctly in Excel/Google Sheets
[ ] JSON is valid (passes a linter)
[ ] Completeness score average is above 75%
[ ] Enrichment sources are documented (Pro tier)
[ ] Extraction report includes methodology
[ ] No personally identifiable information beyond business context
[ ] Data is sorted according to schema definition
[ ] Character encoding is UTF-8 throughout
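Several items on this checklist are mechanical and can be verified in code, leaving the judgment calls (stale data, PII, link accessibility) for manual review. The sketch below covers four of them; `preDeliveryChecks` and its option names are hypothetical.

```javascript
// A value counts as populated if it is not null, undefined, or an empty string.
function isPopulated(value) {
  return value !== null && value !== undefined && value !== "";
}

// Hypothetical helper: returns one boolean per automated checklist item.
function preDeliveryChecks(records, { requiredFields, dedupKey, tierLimit }) {
  const keys = records.map((r) => r[dedupKey]);
  const scoreSum = records.reduce((sum, r) => sum + r.quality_score, 0);
  const average = records.length === 0 ? 0 : scoreSum / records.length;
  return {
    withinTierLimit: records.length <= tierLimit,
    noDuplicates: new Set(keys).size === keys.length,
    requiredPopulated: records.every((r) => requiredFields.every((f) => isPopulated(r[f]))),
    completenessAbove75: average > 75,
  };
}

console.log(
  preDeliveryChecks(
    [{ company_name: "Acme Corp", website: "https://acme.com", quality_score: 92 }],
    { requiredFields: ["company_name", "website"], dedupKey: "website", tierLimit: 50 }
  )
);
```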
Every data extraction delivery includes:
deliverables/
data-{source}-{date}.csv - Clean dataset in CSV
data-{source}-{date}.json - Clean dataset in JSON
extraction-report.md - Methodology, stats, quality notes
# Data Extraction Report
**Source:** {source_url}
**Date:** {date}
**Tier:** {Basic|Standard|Pro}
## Summary
- Records Requested: {count}
- Records Delivered: {count}
- Completeness Average: {percent}%
- Duplicates Removed: {count}
## Schema
| Field | Type | Required | Population Rate |
|-------|------|----------|-----------------|
| company_name | string | yes | 100% |
| website | url | yes | 100% |
| industry | string | no | 85% |
| employee_count | integer | no | 72% |
## Methodology
- Sources used: {list}
- Pages scraped: {count}
- Extraction method: {API / HTML parsing / structured data}
- Deduplication key: {field}
## Data Quality Notes
- {Any issues encountered}
- {Fields with low population rates and why}
- {Recommendations for improving data quality}
## Ethical Compliance
- robots.txt respected: {yes/no}
- Rate limiting applied: {delay between requests}
- Terms of service reviewed: {compliant/concerns noted}
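The Population Rate column in the report's Schema table is the share of delivered records in which a field holds a real value. A sketch of how those rows could be generated from the schema's `fields` array and the final records; `populationRate` and `schemaRows` are hypothetical helpers.

```javascript
// Percentage of records where the field is not null, undefined, or empty.
function populationRate(records, fieldName) {
  if (records.length === 0) return 0;
  const populated = records.filter((r) => {
    const v = r[fieldName];
    return v !== null && v !== undefined && v !== "";
  }).length;
  return Math.round((populated / records.length) * 100);
}

// One Markdown table row per schema field, in the report's column order.
function schemaRows(records, fields) {
  return fields.map((f) => {
    const required = f.required ? "yes" : "no";
    return `| ${f.name} | ${f.type} | ${required} | ${populationRate(records, f.name)}% |`;
  });
}

const reportRecords = [
  { company_name: "Acme Corp", industry: "SaaS" },
  { company_name: "Beta Inc", industry: null },
];
const reportFields = [
  { name: "company_name", type: "string", required: true },
  { name: "industry", type: "string", required: false },
];
console.log(schemaRows(reportRecords, reportFields).join("\n"));
```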
These rules are non-negotiable:
# Basic extraction from a single source
cashclaw scrape --url "https://directory.example.com/companies" --fields "name,website,industry" --limit 50 --output data.csv
# Standard multi-source extraction
cashclaw scrape --urls "source1.com/list,source2.com/directory" --fields "name,website,email,phone" --limit 200 --dedup website --output data.json
# Pro extraction with enrichment
cashclaw scrape --url "https://directory.example.com" --fields "name,website,industry,size" --limit 500 --enrich --output data.csv data.json
# Validate an existing dataset
cashclaw scrape validate --input data.csv --schema schema.json --report quality-report.md