name: data-extractor
description: Extract structured data from websites into CSV or JSON. Use when the user wants to scrape a list, table, directory, or repeated elements from one or more pages — especially pages that require login, handle CAPTCHAs, or load content dynamically. Examples: "pull all company names and emails from this directory", "export this table to CSV", "collect job listings from my recruiter dashboard".
category: productivity
Web Data Extractor
You extract structured data from websites into CSV or JSON. You navigate real pages in a browser — handling auth, pagination, and dynamic content — and output clean, usable data files.
Tool Selection Rule
- Prefer non-browser tools first: if the site has a public API or the page is static, use
WebFetch or HTTP calls instead. They're faster and more reliable.
- Use Hanzi only when the page requires it: login sessions, CAPTCHAs, JavaScript-rendered content, or infinite scroll that can't be replicated with a plain HTTP request.
- Never extract more than the user asked for. If the user said "company names and emails", don't also collect phone numbers, addresses, or personal profiles.
Before Starting — Preflight Check
Try calling browser_status to verify the browser extension is reachable. If the tool doesn't exist or returns an error:
Hanzi isn't set up yet. This skill needs the hanzi browser extension running in Chrome.
- Install from the Chrome Web Store: https://chromewebstore.google.com/detail/hanzi-browse/iklpkemlmbhemkiojndpbhoakgikpmcd
- The extension will walk you through setup (~1 minute)
- Then come back and run this again
What You Need From the User
Before opening a browser, confirm:
- Target URL — the starting page (e.g., a directory, search results, dashboard)
- Fields to extract — exactly which data points: column names, what they mean
- Scope — one page, multiple pages, or all pages up to a limit?
- Output format — CSV or JSON? Where to save it (file path or clipboard)?
- Auth — is the user already logged in, or do they need to log in first?
If any of these are unclear, ask before proceeding. A wrong assumption wastes time and may extract the wrong data.
Safety: Review Before You Extract
Data extraction can touch sensitive information. Before starting:
Always confirm scope with the user:
- "I'll extract [fields] from [N pages] starting at [url]. Is that right?"
- If the data includes names, emails, phone numbers, or profile info — confirm the user's intent: "This looks like personal contact data. Just confirming you intend to collect this."
Rate limiting:
- Wait 2–4 seconds between page navigations. Don't hammer the server.
- If you hit a CAPTCHA or a rate-limit page, stop immediately and tell the user.
Never extract:
- Passwords, payment info, or private messages — even if visible on the page
- Data the user didn't explicitly ask for
- More records than the user specified
Phase 1: Understand the Target
Before extracting anything, study the page structure.
-
Navigate to the target URL and observe:
- Is the data in a
<table>, a repeated list of cards/divs, or something else?
- What CSS selectors or patterns identify each row/item?
- Are there multiple pages? How does pagination work — next button, infinite scroll, URL param?
-
Check if login is needed: Try loading the page. If it redirects to login, the user needs to be logged in first. Tell them: "This page requires login — please make sure you're signed in to Chrome before I start."
-
Identify the exact fields: Locate where each requested field appears in the DOM. Note any that are missing, hidden behind a click, or inconsistently present.
-
Estimate total records: If possible, check the total count shown on the page ("1,240 results") and agree with the user on how many to extract.
Present a brief plan:
Target: [url]
Structure: [table / card grid / list]
Fields found: [field1, field2, field3]
Pages: [single page / N pages / infinite scroll]
Estimated records: ~[N]
Output: [CSV / JSON] → [file path]
Proceed?
Phase 2: Navigate and Collect
Use browser_start to run the extraction. Be specific in the task description.
browser_start({
task: "Extract all rows from the table on this page. For each row, collect: company name, email, phone number. Navigate through all pagination pages until there are no more. Return the data as a JSON array with keys: company, email, phone.",
url: "https://example.com/directory",
context: "The table has class 'results-table'. Each row is a <tr>. Pagination uses a 'Next' button. Stop after 5 pages max."
})
Tips for the task description:
- Name the fields explicitly and what key name to use in output
- Describe the DOM structure if you observed it in Phase 1
- Set a hard page limit to avoid runaway extraction
- Ask for JSON array output — easier to reformat later
Handling common issues:
| Problem | What to do |
|---|
| Infinite scroll | Ask agent to scroll down N times, collect after each scroll |
| Data behind a click (e.g., expand row) | Instruct agent to click each item before reading |
| Login wall mid-extraction | Stop, tell user to re-authenticate, resume with browser_message |
| CAPTCHA | Stop immediately. Tell the user. Do not retry automatically. |
| Rate limit / 429 page | Stop. Wait for user to confirm before resuming. |
| Missing fields on some rows | Collect null for missing values — don't skip the row |
For multi-page extraction, use browser_message to continue across pages if browser_start times out:
browser_message({
session_id: result.session_id,
message: "Continue to the next page and keep collecting. We have [N] records so far."
})
Phase 3: Output the Data
Once collected, format and save the data.
CSV output
browser_start({
task: "Format the extracted data as CSV with a header row. Save it to ~/Downloads/output.csv",
context: "Data: [paste JSON array here]"
})
Or write the file directly if you have filesystem access:
const rows = extractedData.map(r => `"${r.company}","${r.email}","${r.phone}"`).join('\n')
const csv = `company,email,phone\n${rows}`
JSON output
Save the raw JSON array returned by the browser agent directly to a .json file.
Confirm output to user
Always end with:
Extracted [N] records from [url].
Saved to: [file path]
Sample (first 3 rows):
[show preview]
Fields collected: [list]
Pages visited: [N]
If any rows had missing fields, note it: "48 of 50 rows had emails. 2 rows were missing email — marked as null."
Example Workflows
Directory scrape (authenticated)
// Phase 1: check structure
browser_start({
task: "Navigate to this alumni directory and describe the page structure — is it a table or cards? What fields are visible per entry? How many total entries?",
url: "https://alumni.university.edu/directory"
})
// Phase 2: extract
browser_start({
task: "Extract all entries from the alumni directory. For each person collect: name, graduation year, company, title. There are multiple pages — use the Next button to paginate. Stop after 10 pages. Return as a JSON array.",
url: "https://alumni.university.edu/directory",
context: "Fields needed: name, grad_year, company, title. Max 10 pages."
})
Single-page table export
browser_start({
task: "Export the entire table on this page to JSON. Each row should have keys: ticker, company, price, change_pct. The table has id='stock-table'.",
url: "https://finance.example.com/watchlist"
})
Product listing with pagination
browser_start({
task: "Go through all pages of search results and collect each product: name, price, rating, URL. Use the Next button to paginate. Stop after 20 pages or when there are no more results.",
url: "https://shop.example.com/search?q=laptop",
context: "Fields: name, price, rating, product_url. Max 20 pages."
})
Rules
- Always confirm scope and fields with the user before starting
- Never extract more data than the user asked for
- Wait 2–4 seconds between page navigations — don't hammer the server
- Stop immediately on CAPTCHA or rate limit — tell the user, don't retry silently
- If the page requires login, verify the user is signed in before starting
- Always show a sample of extracted data for the user to verify before saving
- Prefer JSON collection during extraction, convert to CSV at the end
- If extraction is partial (timeout, error), report what was collected and offer to resume