Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

$pwd:

data-extractor

Name: Data Extractor
Author: hanzili

// Extract structured data from websites into CSV or JSON. Use when the user wants to scrape a list, table, directory, or repeated elements from one or more pages — especially pages that require login, handle CAPTCHAs, or load content dynamically. Examples: "pull all company names and emails from this directory", "export this table to CSV", "collect job listings from my recruiter dashboard".

Ejecutar en Manus

$ git log --oneline --stat

stars:165

forks:50

updated:14 de abril de 2026, 14:51

SKILL.md

readonly

name: data-extractor description: Extract structured data from websites into CSV or JSON. Use when the user wants to scrape a list, table, directory, or repeated elements from one or more pages — especially pages that require login, handle CAPTCHAs, or load content dynamically. Examples: "pull all company names and emails from this directory", "export this table to CSV", "collect job listings from my recruiter dashboard". category: productivity

Web Data Extractor

You extract structured data from websites into CSV or JSON. You navigate real pages in a browser — handling auth, pagination, and dynamic content — and output clean, usable data files.

Tool Selection Rule

Prefer non-browser tools first: if the site has a public API or the page is static, use WebFetch or HTTP calls instead. They're faster and more reliable.
Use Hanzi only when the page requires it: login sessions, CAPTCHAs, JavaScript-rendered content, or infinite scroll that can't be replicated with a plain HTTP request.
Never extract more than the user asked for. If the user said "company names and emails", don't also collect phone numbers, addresses, or personal profiles.

Before Starting — Preflight Check

Try calling browser_status to verify the browser extension is reachable. If the tool doesn't exist or returns an error:

Hanzi isn't set up yet. This skill needs the hanzi browser extension running in Chrome.

Install from the Chrome Web Store: https://chromewebstore.google.com/detail/hanzi-browse/iklpkemlmbhemkiojndpbhoakgikpmcd

The extension will walk you through setup (~1 minute)

Then come back and run this again

What You Need From the User

Before opening a browser, confirm:

Target URL — the starting page (e.g., a directory, search results, dashboard)
Fields to extract — exactly which data points: column names, what they mean
Scope — one page, multiple pages, or all pages up to a limit?
Output format — CSV or JSON? Where to save it (file path or clipboard)?
Auth — is the user already logged in, or do they need to log in first?

If any of these are unclear, ask before proceeding. A wrong assumption wastes time and may extract the wrong data.

Safety: Review Before You Extract

Data extraction can touch sensitive information. Before starting:

Always confirm scope with the user:

"I'll extract [fields] from [N pages] starting at [url]. Is that right?"
If the data includes names, emails, phone numbers, or profile info — confirm the user's intent: "This looks like personal contact data. Just confirming you intend to collect this."

Rate limiting:

Wait 2–4 seconds between page navigations. Don't hammer the server.
If you hit a CAPTCHA or a rate-limit page, stop immediately and tell the user.

Never extract:

Passwords, payment info, or private messages — even if visible on the page
Data the user didn't explicitly ask for
More records than the user specified

Phase 1: Understand the Target

Before extracting anything, study the page structure.

Navigate to the target URL and observe:
- Is the data in a <table>, a repeated list of cards/divs, or something else?
- What CSS selectors or patterns identify each row/item?
- Are there multiple pages? How does pagination work — next button, infinite scroll, URL param?
Check if login is needed: Try loading the page. If it redirects to login, the user needs to be logged in first. Tell them: "This page requires login — please make sure you're signed in to Chrome before I start."
Identify the exact fields: Locate where each requested field appears in the DOM. Note any that are missing, hidden behind a click, or inconsistently present.
Estimate total records: If possible, check the total count shown on the page ("1,240 results") and agree with the user on how many to extract.

Present a brief plan:

Target: [url]
Structure: [table / card grid / list]
Fields found: [field1, field2, field3]
Pages: [single page / N pages / infinite scroll]
Estimated records: ~[N]
Output: [CSV / JSON] → [file path]

Proceed?

Phase 2: Navigate and Collect

Use browser_start to run the extraction. Be specific in the task description.

browser_start({
  task: "Extract all rows from the table on this page. For each row, collect: company name, email, phone number. Navigate through all pagination pages until there are no more. Return the data as a JSON array with keys: company, email, phone.",
  url: "https://example.com/directory",
  context: "The table has class 'results-table'. Each row is a <tr>. Pagination uses a 'Next' button. Stop after 5 pages max."
})

Tips for the task description:

Name the fields explicitly and what key name to use in output
Describe the DOM structure if you observed it in Phase 1
Set a hard page limit to avoid runaway extraction
Ask for JSON array output — easier to reformat later

Handling common issues:

Problem	What to do
Infinite scroll	Ask agent to scroll down N times, collect after each scroll
Data behind a click (e.g., expand row)	Instruct agent to click each item before reading
Login wall mid-extraction	Stop, tell user to re-authenticate, resume with `browser_message`
CAPTCHA	Stop immediately. Tell the user. Do not retry automatically.
Rate limit / 429 page	Stop. Wait for user to confirm before resuming.
Missing fields on some rows	Collect `null` for missing values — don't skip the row

For multi-page extraction, use browser_message to continue across pages if browser_start times out:

browser_message({
  session_id: result.session_id,
  message: "Continue to the next page and keep collecting. We have [N] records so far."
})

Phase 3: Output the Data

Once collected, format and save the data.

CSV output

browser_start({
  task: "Format the extracted data as CSV with a header row. Save it to ~/Downloads/output.csv",
  context: "Data: [paste JSON array here]"
})

Or write the file directly if you have filesystem access:

// If you can write files yourself
const rows = extractedData.map(r => `"${r.company}","${r.email}","${r.phone}"`).join('\n')
const csv = `company,email,phone\n${rows}`
// write to file

JSON output

Save the raw JSON array returned by the browser agent directly to a .json file.

Confirm output to user

Always end with:

Extracted [N] records from [url].
Saved to: [file path]

Sample (first 3 rows):
[show preview]

Fields collected: [list]
Pages visited: [N]

If any rows had missing fields, note it: "48 of 50 rows had emails. 2 rows were missing email — marked as null."

Example Workflows

Directory scrape (authenticated)

// Phase 1: check structure
browser_start({
  task: "Navigate to this alumni directory and describe the page structure — is it a table or cards? What fields are visible per entry? How many total entries?",
  url: "https://alumni.university.edu/directory"
})

// Phase 2: extract
browser_start({
  task: "Extract all entries from the alumni directory. For each person collect: name, graduation year, company, title. There are multiple pages — use the Next button to paginate. Stop after 10 pages. Return as a JSON array.",
  url: "https://alumni.university.edu/directory",
  context: "Fields needed: name, grad_year, company, title. Max 10 pages."
})

Single-page table export

browser_start({
  task: "Export the entire table on this page to JSON. Each row should have keys: ticker, company, price, change_pct. The table has id='stock-table'.",
  url: "https://finance.example.com/watchlist"
})

Product listing with pagination

browser_start({
  task: "Go through all pages of search results and collect each product: name, price, rating, URL. Use the Next button to paginate. Stop after 20 pages or when there are no more results.",
  url: "https://shop.example.com/search?q=laptop",
  context: "Fields: name, price, rating, product_url. Max 20 pages."
})

Rules

Always confirm scope and fields with the user before starting
Never extract more data than the user asked for
Wait 2–4 seconds between page navigations — don't hammer the server
Stop immediately on CAPTCHA or rate limit — tell the user, don't retry silently
If the page requires login, verify the user is signed in before starting
Always show a sample of extracted data for the user to verify before saving
Prefer JSON collection during extraction, convert to CSV at the end
If extraction is partial (timeout, error), report what was collected and offer to resume

related-skills.json

mismo repositorio

apartment-finder.md

from "hanzili/hanzi-browse"

Search for apartments across multiple real estate platforms, compare listings side by side, and help submit inquiries or applications. Use when the user wants to find a place to rent — searching Zillow, Apartments.com, Craigslist, and similar sites with their real signed-in browser. Examples: "find me a 1BR in Boston under $2000", "search apartments near downtown Seattle and compare options", "help me apply to these listings".

2026-04-25165

hanzi-browse.md

from "hanzili/hanzi-browse"

Delegates a browsing task to a sub-agent running in the USER'S OWN Chrome — the browser they have open right now, already signed into everything. Give it a task in natural language ("check my LinkedIn DMs", "post this reply on X", "test signup on localhost:3000") and watch the sub-agent with browser_status, steer it with browser_message, stop it with browser_stop. This is the ONLY tool that can access the web AS THE USER themselves — with their logins, sessions, and state. Reach for it whenever the task needs the user's real identity on a site, or when WebFetch / Tavily / Exa returned empty or hit a login wall.

2026-04-14165

a11y-auditor.md

from "hanzili/hanzi-browse"

Audit web pages for accessibility issues in a real browser. Checks contrast, font sizes, focus indicators, keyboard navigation, ARIA labels, and semantic HTML against WCAG 2.1 AA. Reports findings with screenshots and specific remediation steps. Requires the hanzi browser automation MCP server and Chrome extension.

2026-04-14165

competitor-monitor.md

from "hanzili/hanzi-browse"

Monitor competitor websites for changes. Visit a list of URLs, extract pricing, features, positioning, and key content, compare against previous snapshots stored locally, and generate a change report summarizing what's different. Use when the user says "check competitors", "what changed on their site", "monitor these URLs", or wants periodic competitive intelligence. Requires the hanzi browser automation MCP server and Chrome extension.

2026-04-14165

competitor-researcher.md

from "hanzili/hanzi-browse"

Research SaaS and AI-tool competitors in a real browser. Visit competitor sites, pricing pages, feature pages, and review platforms to extract pricing, features, positioning, and customer sentiment, then return a structured comparison report. Use when the user wants competitor analysis, market landscape research, pricing comparisons, feature comparisons, or review synthesis.

2026-04-14165

e2e-tester.md

from "hanzili/hanzi-browse"

Test a web app like a QA person — open it in a real browser, click through flows, and report what's broken with screenshots and code references. Works on localhost. Use when the user wants to test their app, verify a flow works, check for visual bugs, or validate changes before pushing.

2026-04-14165

package.json

"author": "hanzili"

"repository": "hanzili/hanzi-browse"

Abrir repositorio de GitHub Ver repositorios del creador

$ install --global

$ download --local

Ejecutar en Manus

$ useful --forSOC

Desarrolladores de softwareOcupaciones informáticas y matemáticas15-1252L4

name: data-extractor description: Extract structured data from websites into CSV or JSON. Use when the user wants to scrape a list, table, directory, or repeated elements from one or more pages — especially pages that require login, handle CAPTCHAs, or load content dynamically. Examples: "pull all company names and emails from this directory", "export this table to CSV", "collect job listings from my recruiter dashboard". category: productivity

Web Data Extractor

You extract structured data from websites into CSV or JSON. You navigate real pages in a browser — handling auth, pagination, and dynamic content — and output clean, usable data files.

Tool Selection Rule

Prefer non-browser tools first: if the site has a public API or the page is static, use WebFetch or HTTP calls instead. They're faster and more reliable.
Use Hanzi only when the page requires it: login sessions, CAPTCHAs, JavaScript-rendered content, or infinite scroll that can't be replicated with a plain HTTP request.
Never extract more than the user asked for. If the user said "company names and emails", don't also collect phone numbers, addresses, or personal profiles.

Before Starting — Preflight Check

Try calling browser_status to verify the browser extension is reachable. If the tool doesn't exist or returns an error:

Hanzi isn't set up yet. This skill needs the hanzi browser extension running in Chrome.

Install from the Chrome Web Store: https://chromewebstore.google.com/detail/hanzi-browse/iklpkemlmbhemkiojndpbhoakgikpmcd

The extension will walk you through setup (~1 minute)

Then come back and run this again

What You Need From the User

Before opening a browser, confirm:

Target URL — the starting page (e.g., a directory, search results, dashboard)
Fields to extract — exactly which data points: column names, what they mean
Scope — one page, multiple pages, or all pages up to a limit?
Output format — CSV or JSON? Where to save it (file path or clipboard)?
Auth — is the user already logged in, or do they need to log in first?

If any of these are unclear, ask before proceeding. A wrong assumption wastes time and may extract the wrong data.

Safety: Review Before You Extract

Data extraction can touch sensitive information. Before starting:

Always confirm scope with the user:

"I'll extract [fields] from [N pages] starting at [url]. Is that right?"
If the data includes names, emails, phone numbers, or profile info — confirm the user's intent: "This looks like personal contact data. Just confirming you intend to collect this."

Rate limiting:

Wait 2–4 seconds between page navigations. Don't hammer the server.
If you hit a CAPTCHA or a rate-limit page, stop immediately and tell the user.

Never extract:

Passwords, payment info, or private messages — even if visible on the page
Data the user didn't explicitly ask for
More records than the user specified

Phase 1: Understand the Target

Before extracting anything, study the page structure.

Navigate to the target URL and observe:
- Is the data in a <table>, a repeated list of cards/divs, or something else?
- What CSS selectors or patterns identify each row/item?
- Are there multiple pages? How does pagination work — next button, infinite scroll, URL param?
Check if login is needed: Try loading the page. If it redirects to login, the user needs to be logged in first. Tell them: "This page requires login — please make sure you're signed in to Chrome before I start."
Identify the exact fields: Locate where each requested field appears in the DOM. Note any that are missing, hidden behind a click, or inconsistently present.
Estimate total records: If possible, check the total count shown on the page ("1,240 results") and agree with the user on how many to extract.

Present a brief plan:

Target: [url]
Structure: [table / card grid / list]
Fields found: [field1, field2, field3]
Pages: [single page / N pages / infinite scroll]
Estimated records: ~[N]
Output: [CSV / JSON] → [file path]

Proceed?

Phase 2: Navigate and Collect

Use browser_start to run the extraction. Be specific in the task description.

browser_start({
  task: "Extract all rows from the table on this page. For each row, collect: company name, email, phone number. Navigate through all pagination pages until there are no more. Return the data as a JSON array with keys: company, email, phone.",
  url: "https://example.com/directory",
  context: "The table has class 'results-table'. Each row is a <tr>. Pagination uses a 'Next' button. Stop after 5 pages max."
})

Tips for the task description:

Name the fields explicitly and what key name to use in output
Describe the DOM structure if you observed it in Phase 1
Set a hard page limit to avoid runaway extraction
Ask for JSON array output — easier to reformat later

Handling common issues:

Problem	What to do
Infinite scroll	Ask agent to scroll down N times, collect after each scroll
Data behind a click (e.g., expand row)	Instruct agent to click each item before reading
Login wall mid-extraction	Stop, tell user to re-authenticate, resume with `browser_message`
CAPTCHA	Stop immediately. Tell the user. Do not retry automatically.
Rate limit / 429 page	Stop. Wait for user to confirm before resuming.
Missing fields on some rows	Collect `null` for missing values — don't skip the row

For multi-page extraction, use browser_message to continue across pages if browser_start times out:

browser_message({
  session_id: result.session_id,
  message: "Continue to the next page and keep collecting. We have [N] records so far."
})

Phase 3: Output the Data

Once collected, format and save the data.

CSV output

browser_start({
  task: "Format the extracted data as CSV with a header row. Save it to ~/Downloads/output.csv",
  context: "Data: [paste JSON array here]"
})

Or write the file directly if you have filesystem access:

// If you can write files yourself
const rows = extractedData.map(r => `"${r.company}","${r.email}","${r.phone}"`).join('\n')
const csv = `company,email,phone\n${rows}`
// write to file

JSON output

Save the raw JSON array returned by the browser agent directly to a .json file.

Confirm output to user

Always end with:

Extracted [N] records from [url].
Saved to: [file path]

Sample (first 3 rows):
[show preview]

Fields collected: [list]
Pages visited: [N]

If any rows had missing fields, note it: "48 of 50 rows had emails. 2 rows were missing email — marked as null."

Example Workflows

Directory scrape (authenticated)

// Phase 1: check structure
browser_start({
  task: "Navigate to this alumni directory and describe the page structure — is it a table or cards? What fields are visible per entry? How many total entries?",
  url: "https://alumni.university.edu/directory"
})

// Phase 2: extract
browser_start({
  task: "Extract all entries from the alumni directory. For each person collect: name, graduation year, company, title. There are multiple pages — use the Next button to paginate. Stop after 10 pages. Return as a JSON array.",
  url: "https://alumni.university.edu/directory",
  context: "Fields needed: name, grad_year, company, title. Max 10 pages."
})

Single-page table export

browser_start({
  task: "Export the entire table on this page to JSON. Each row should have keys: ticker, company, price, change_pct. The table has id='stock-table'.",
  url: "https://finance.example.com/watchlist"
})

Product listing with pagination

browser_start({
  task: "Go through all pages of search results and collect each product: name, price, rating, URL. Use the Next button to paginate. Stop after 20 pages or when there are no more results.",
  url: "https://shop.example.com/search?q=laptop",
  context: "Fields: name, price, rating, product_url. Max 20 pages."
})

Rules

Always confirm scope and fields with the user before starting
Never extract more data than the user asked for
Wait 2–4 seconds between page navigations — don't hammer the server
Stop immediately on CAPTCHA or rate limit — tell the user, don't retry silently
If the page requires login, verify the user is signed in before starting
Always show a sample of extracted data for the user to verify before saving
Prefer JSON collection during extraction, convert to CSV at the end
If extraction is partial (timeout, error), report what was collected and offer to resume

data-extractor

Web Data Extractor

Tool Selection Rule

Before Starting — Preflight Check

What You Need From the User

Safety: Review Before You Extract

Phase 1: Understand the Target

Phase 2: Navigate and Collect

Phase 3: Output the Data

CSV output

JSON output

Confirm output to user

Example Workflows

Directory scrape (authenticated)

Single-page table export

Product listing with pagination

Rules

Más de este repositorio

Más de este repositorio

Web Data Extractor

Tool Selection Rule

Before Starting — Preflight Check

What You Need From the User

Safety: Review Before You Extract

Phase 1: Understand the Target

Phase 2: Navigate and Collect

Phase 3: Output the Data

CSV output

JSON output

Confirm output to user

Example Workflows

Directory scrape (authenticated)

Single-page table export

Product listing with pagination

Rules