web-directory-scraper

This skill should be used when the user asks to "scrape a directory", "gather company data", "get all the info from this site into a spreadsheet", "download member list", "extract data from this website", "把網站上的資料抓下來", "擷取名單", or "幫我抓這個網站的資料". Also applies when the user shares a URL to a paginated listing page and wants the data captured or exported — even without explicitly saying "scrape". If a URL contains a paginated list of entities (member lists, company databases, product catalogs, association rosters, supplier directories), this skill applies.

Overview

Install command

npx skills add https://github.com/liamlin/web-directory-scraper --skill web-directory-scraper

Copy and paste this command into Claude Code to install the skill.

Source

liamlin/web-directory-scraper

Stars: 2

Forks: 0

Updated: March 14, 2026, 17:36

SKILL.md

name

web-directory-scraper

description

Web Directory Scraper

A systematic workflow for capturing complete public directory data from websites into structured spreadsheets.

Core Principles

Perform all network requests in the browser. The browser has direct internet access and same-origin privileges; the VM often cannot reach external sites due to proxy restrictions. Treat the browser as the primary data engine — for fetching, parsing, and even generating the final Excel file. Reserve the VM for orchestration and verification only.

Discover the API first, but keep a strong fallback plan. Many modern sites load their data from a REST API, which is the fastest path. Many directory sites, especially older or regional ones, are fully server-rendered with no API; for those, in-browser fetch() + DOMParser is nearly as fast. True page-by-page browser navigation is the last resort.

Workflow Overview

1. Reconnaissance       → Visit site, understand structure, clarify scope with user
2. API Discovery        → Monitor network requests; if no API, analyze URL patterns
3. Data Collection      → Fetch data via API or HTML parsing (always in-browser)
4. Detail Enrichment    → Optionally scrape individual detail pages for richer data
5. Excel Generation     → Build spreadsheet in-browser (SheetJS) or via VM (openpyxl)
6. Verification         → Confirm completeness (expected vs actual count)

Step-by-Step Instructions

Step 1: Reconnaissance

Navigate to the target URL. Take a screenshot and read the page to understand:

What kind of data is listed (companies, people, products, etc.)
How many total records exist (look for indicators like "Total: X", "共 X 筆", "Showing 1-10 of 424", page count, etc.)
How pagination works (numbered pages, load-more button, infinite scroll, cursor-based) — or whether the directory uses category-based navigation instead (e.g., separate pages for "Class A members", "Class B members" with no pagination within each)
What fields are visible per record on the listing page vs. on individual detail pages

Use read_page to get the accessibility tree — it often reveals pagination info and total counts not immediately visible in screenshots.
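As a rough illustration of spotting the total-count indicators listed above, the sketch below scans visible page text for the three example formats. The function name `findTotalCount` and the exact patterns are assumptions for illustration; real sites will need their own patterns.

```javascript
// Hypothetical helper: pull the total record count out of visible page text.
// The patterns mirror the indicator examples above ("Total: X", "共 X 筆",
// "Showing 1-10 of 424"); they are not an exhaustive list.
function findTotalCount(pageText) {
  const patterns = [
    /Total:\s*([\d,]+)/i,
    /共\s*([\d,]+)\s*筆/,
    /Showing\s+[\d,]+\s*-\s*[\d,]+\s+of\s+([\d,]+)/i,
  ];
  for (const re of patterns) {
    const match = pageText.match(re);
    if (match) return Number(match[1].replace(/,/g, ""));
  }
  return null; // no indicator found; count pages manually instead
}
```

In the browser this would typically be called as `findTotalCount(document.body.innerText)`. A `null` result is a signal to fall back to reading the pagination controls.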

Clarify scope with the user before proceeding:

Which fields do they care about? ("all available" is a valid answer)
How many pages / records? (sometimes only a subset is needed)
Output format preference? (Excel is the default, but they might want CSV or JSON)

Step 2: API Discovery

This step determines the collection strategy. Start read_network_requests monitoring, then click "next page" in the browser to trigger data-fetching calls.

Path A — REST/GraphQL API found: XHR calls like /api/members?page=2 returning JSON indicate the ideal case. Note the endpoint, pagination parameter, and response structure. Proceed to Step 3, Method A.

Path B — No API, server-rendered HTML: If the only request is a full-page HTML load (e.g., /tch/m1.2-category-name), the site is server-rendered. Note the URL pagination pattern. Proceed to Step 3, Method B.

Path C — Data embedded in JS: If no XHR calls appear but the page has data, check <script> tags for __NEXT_DATA__, __NUXT__, or inline JSON. Extract from there.
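For Path C, a minimal sketch of pulling embedded JSON out of raw HTML might look like the following. It assumes the Next.js convention of a `<script id="__NEXT_DATA__">` tag holding JSON; the function name `extractNextData` is hypothetical, and where the records sit inside the parsed object varies by site.

```javascript
// Sketch: extract the JSON payload that Next.js embeds in the page HTML.
// Returns the parsed object, or null if the tag is absent or not valid JSON.
function extractNextData(html) {
  const match = html.match(
    /<script[^>]*id=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}
```

When running on the live page rather than on fetched HTML, `JSON.parse(document.getElementById("__NEXT_DATA__").textContent)` reaches the same data without a regex.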

Path D — Client-side rendered (hybrid): Some sites (especially Cyberbiz, Shopify) render content via JavaScript after page load. When fetch() returns HTML but visible data is missing from it, the content is JS-rendered. However, test each page type separately — the same site can mix approaches. Listing pages might be JS-rendered while detail pages return full data in raw HTML.

Step 3: Data Collection

All methods use in-browser fetch(). Direct HTTP from the VM is typically blocked by proxy restrictions.

Method A: API Fetching — Fetch the JSON API for every page in a single javascript_tool call. See examples/api-fetching.js for the complete implementation.
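The following is a minimal sketch of such a pagination loop, not the contents of examples/api-fetching.js. It assumes the endpoint takes a `page` query parameter and returns `{ items: [...] }` per page; both are placeholders to be replaced with whatever Step 2 uncovered. The `fetchFn` parameter is only there so the loop can be exercised outside a browser.

```javascript
// Sketch of Method A under stated assumptions: `?page=N` pagination and an
// `items` array in each JSON response. Failed pages are recorded, not skipped
// silently, so gaps can be retried or reported later.
async function fetchAllPages(endpoint, totalPages, fetchFn = fetch) {
  const records = [];
  const failedPages = [];
  for (let page = 1; page <= totalPages; page++) {
    try {
      const res = await fetchFn(`${endpoint}?page=${page}`);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      const body = await res.json();
      // Tag each record with its source page (hypothetical `_page` field).
      for (const item of body.items) records.push({ ...item, _page: page });
    } catch (err) {
      failedPages.push(page);
    }
  }
  return { records, failedPages };
}
```

A call such as `await fetchAllPages("/api/members", 43)` would then return every record plus the list of pages that need a retry.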

Method B: HTML Fetch + DOMParser — Fetch each page's HTML and parse with DOMParser in a single javascript_tool call. This runs at roughly the same speed as API fetching and is the key technique for server-rendered sites. See examples/html-parsing.js for the complete implementation, including CSS selector discovery.
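A rough sketch of the fetch() + DOMParser technique is shown below; it is an illustration, not the contents of examples/html-parsing.js. The row selector and the field-to-selector map are hypothetical and must come from inspecting the real listing page. The optional `deps` argument is an assumption added so the loop can run with stand-ins outside a browser.

```javascript
// Sketch of Method B: fetch each listing page as HTML, parse it off-screen,
// and read one record per matching row element.
async function scrapeListingPages(urls, rowSelector, fields, deps = {}) {
  const fetchFn = deps.fetchFn || fetch;
  const parseHtml =
    deps.parseHtml ||
    ((html) => new DOMParser().parseFromString(html, "text/html"));
  const records = [];
  for (const url of urls) {
    const res = await fetchFn(url);
    const doc = parseHtml(await res.text());
    for (const row of doc.querySelectorAll(rowSelector)) {
      const record = { _source: url }; // keep provenance for verification
      for (const [name, selector] of Object.entries(fields)) {
        const el = row.querySelector(selector);
        record[name] = el ? el.textContent.trim() : "";
      }
      records.push(record);
    }
  }
  return records;
}
```

Usage in the browser might look like `await scrapeListingPages(urls, "tr.member", { name: ".name", phone: ".phone" })`, with the selectors swapped for the site's own.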

Method C: Page-by-Page Navigation (last resort) — Only if fetch() doesn't work (e.g., the site requires cookies set by client-side JS, or uses anti-bot measures). Navigate to each page URL, extract data with javascript_tool, and store in localStorage incrementally. This is 10-50x slower than Methods A/B.
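Incremental storage for Method C could be sketched as follows. The key name `scrape:pages` and both function names are hypothetical. The `storage` parameter defaults to the browser's `localStorage`, which persists across navigations within the same origin.

```javascript
// Sketch: persist each navigated page's records so progress survives page loads.
const STORAGE_KEY = "scrape:pages"; // hypothetical key name

function savePage(pageNumber, records, storage = localStorage) {
  const pages = JSON.parse(storage.getItem(STORAGE_KEY) || "{}");
  pages[pageNumber] = records; // re-saving a page overwrites it, so retries are safe
  storage.setItem(STORAGE_KEY, JSON.stringify(pages));
  return Object.keys(pages).length; // number of pages stored so far
}

function loadAllRecords(storage = localStorage) {
  const pages = JSON.parse(storage.getItem(STORAGE_KEY) || "{}");
  return Object.keys(pages)
    .sort((a, b) => Number(a) - Number(b))
    .flatMap((p) => pages[p]);
}
```

Keying by page number means a page visited twice does not produce duplicate rows, and the stored page count gives a cheap progress check after each navigation.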

For batch sizing and timeout management details, consult references/batch-sizing.md.

Step 4: Detail Enrichment (Optional)

Many directories show minimal info on listing pages but have much richer data on detail pages (phone, fax, address, website, certifications, etc.).

When to enrich: When the user wants "all available info" and the listing page only shows a subset. Check one detail page first to see what extra fields exist.

How to enrich: Use Promise.all to fetch detail pages in parallel batches (10 concurrent requests, about 200 records per javascript_tool call). See examples/detail-enrichment.js for the complete implementation with locale-adaptable regex extraction.
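The batching pattern can be sketched as below; this is an illustration under assumptions, not the contents of examples/detail-enrichment.js. `fetchDetail` is a hypothetical caller-supplied function that takes one listing record and resolves to the extra fields found on its detail page.

```javascript
// Sketch: run detail fetches in fixed-size parallel batches with Promise.all.
// A failure on one record is captured on that record instead of rejecting
// the whole batch.
async function enrichInBatches(records, fetchDetail, batchSize = 10) {
  const enriched = [];
  for (let i = 0; i < records.length; i += batchSize) {
    const batch = records.slice(i, i + batchSize);
    const results = await Promise.all(
      batch.map(async (record) => {
        try {
          return { ...record, ...(await fetchDetail(record)) };
        } catch (err) {
          return { ...record, _error: String(err.message || err) };
        }
      })
    );
    enriched.push(...results); // Promise.all preserves input order
  }
  return enriched;
}
```

Waiting for each batch to finish before starting the next caps concurrency at `batchSize`, which keeps the load on the target site predictable. Records carrying an `_error` field can be retried in a later call.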

Step 5: Excel Generation

Large datasets (more than 200 records): Build the Excel file directly in the browser with SheetJS. This bypasses the browser-to-VM data transfer bottleneck. Also generate a raw JSON backup download as a safety net. See examples/excel-generation.js for all approaches.
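A minimal sketch of the in-browser export follows; it is not the contents of examples/excel-generation.js. It assumes the SheetJS library has already been loaded on the page as the global `XLSX`. The function names, default file names, and sheet name are placeholders.

```javascript
// Sketch: convert records to a header row plus data rows in a fixed column order.
function toRows(records, columns) {
  const body = records.map((r) =>
    columns.map((c) => (r[c] == null ? "" : String(r[c])))
  );
  return [columns.slice(), ...body];
}

// Browser only: build the workbook with SheetJS and trigger a download.
function downloadExcel(records, columns, filename = "directory.xlsx") {
  const sheet = XLSX.utils.aoa_to_sheet(toRows(records, columns));
  const book = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(book, sheet, "Data");
  XLSX.writeFile(book, filename);
}

// Browser only: raw JSON backup as a second download.
function downloadJsonBackup(records, filename = "directory-backup.json") {
  const blob = new Blob([JSON.stringify(records, null, 2)], {
    type: "application/json",
  });
  const link = document.createElement("a");
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  link.click();
  setTimeout(() => URL.revokeObjectURL(link.href), 1000);
}
```

Fixing the column order up front keeps rows aligned even when some records lack a field.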

Medium datasets (50-200 records): Use the get_page_text bridge — write data to document.body as plain text, read with get_page_text, then build a formatted spreadsheet on the VM with openpyxl.
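The browser side of this bridge might be sketched as below. Tab-separated text is an assumed serialization (the source does not name one), and the function names are hypothetical. Tabs and line breaks inside values are collapsed to spaces so each record stays on one line.

```javascript
// Sketch: serialize records to tab-separated text, one record per line.
function toTsv(records, columns) {
  const clean = (v) =>
    String(v == null ? "" : v).replace(/[\t\r\n]+/g, " ").trim();
  const lines = [columns.join("\t")];
  for (const r of records) {
    lines.push(columns.map((c) => clean(r[c])).join("\t"));
  }
  return lines.join("\n");
}

// Browser only: replace the page body with the text so get_page_text can read it.
function writeToBody(text) {
  const pre = document.createElement("pre");
  pre.textContent = text;
  document.body.innerHTML = "";
  document.body.appendChild(pre);
}
```

On the VM, the text read back with get_page_text can be split on newlines and tabs before being handed to openpyxl.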

Small datasets (<50 records): Transfer data directly via javascript_tool return value.

Step 6: Verification

Verification matters because partial data is worse than no data — silent gaps (e.g., missing pages 4-41 out of 43) can go unnoticed if the file looks full. Run verification in the browser where the data still lives. See examples/verification.js for the verification script.

Report clearly: "Collected X records from Y pages. Expected total: Z. All pages accounted for."
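A completeness check in this spirit could be sketched as follows; it is an illustration, not the contents of examples/verification.js. It assumes each record was tagged during collection with a hypothetical `_page` field (its source page number) and has a unique key field, here defaulting to `id`.

```javascript
// Sketch: compare collected records against the expected total, list pages
// that produced no records, and count duplicate keys.
function verifyScrape(records, expectedTotal, totalPages, keyField = "id") {
  const perPage = {};
  for (const r of records) perPage[r._page] = (perPage[r._page] || 0) + 1;

  const missingPages = [];
  for (let p = 1; p <= totalPages; p++) {
    if (!perPage[p]) missingPages.push(p);
  }

  const duplicates = records.length - new Set(records.map((r) => r[keyField])).size;
  const complete =
    missingPages.length === 0 &&
    duplicates === 0 &&
    records.length === expectedTotal;

  const report =
    `Collected ${records.length} records from ${Object.keys(perPage).length} pages. ` +
    `Expected total: ${expectedTotal}. ` +
    (missingPages.length
      ? `Missing pages: ${missingPages.join(", ")}.`
      : "All pages accounted for.");

  return { complete, perPage, missingPages, duplicates, report };
}
```

The per-page breakdown catches silent gaps that a total count alone can hide, for example when duplicated pages happen to offset missing ones.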

Additional Resources

Reference Files

references/troubleshooting.md — Common pitfalls (empty HTML from fetch(), JS timeout, truncated output, rate limiting) with causes and solutions, plus a "What to Avoid" checklist
references/batch-sizing.md — Practical batch sizes for API fetching, HTML parsing, and detail enrichment; timeout recovery; data transfer strategy by dataset size; CJK text considerations

Example Files

Working JavaScript examples for each workflow step:

examples/api-fetching.js — Method A: JSON API pagination loop
examples/html-parsing.js — Method B: fetch() + DOMParser with CSS selector discovery
examples/detail-enrichment.js — Step 4: Parallel detail page fetching with regex extraction
examples/excel-generation.js — Step 5: SheetJS, get_page_text bridge, and JSON backup approaches
examples/verification.js — Step 6: Completeness check with per-page breakdown
