Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

$pwd:

scrape

Name: Scrape
Author: brightdata

// Scrape web content as clean markdown/HTML/JSON via the Bright Data CLI (`bdata scrape`). Use when the user wants to fetch a page, extract content from a list of URLs, or crawl paginated listings. Hands off to `data-feeds` for supported platforms (Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, etc.) and to `search` when URLs must be discovered first. Requires the Bright Data CLI; proactively guides install + login if missing.

Ejecutar en Manus

$ git log --oneline --stat

stars:137

forks:22

updated:19 de abril de 2026, 15:48

Explorador de archivos

4 archivos

SKILL.md

readonly

name

scrape

description

Scrape web content as clean markdown/HTML/JSON via the Bright Data CLI (`bdata scrape`). Use when the user wants to fetch a page, extract content from a list of URLs, or crawl paginated listings. Hands off to `data-feeds` for supported platforms (Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, etc.) and to `search` when URLs must be discovered first. Requires the Bright Data CLI; proactively guides install + login if missing.

Bright Data — Scrape

Get clean content (markdown, HTML, JSON, screenshot) from one or more URLs via the Bright Data CLI. This skill owns the "fetch raw or lightly-structured content" job. For platform-specific structured data (Amazon, LinkedIn, TikTok, etc.), stop and use data-feeds instead — you'll get clean JSON without selector logic.

Setup gate (run first)

Before any scrape, verify the CLI is installed and authenticated:

if ! command -v bdata >/dev/null 2>&1; then
    echo "bdata CLI not installed — see bright-data-best-practices/references/cli-setup.md"
elif ! bdata zones >/dev/null 2>&1; then
    echo "bdata not authenticated — run: bdata login  (or: bdata login --device for SSH)"
fi

If either check fails, halt and route the user to skills/bright-data-best-practices/references/cli-setup.md. Do not attempt the legacy curl fallback silently — ask the user first.

Pick your path

Situation	Action
Single URL	`bdata scrape <url> -f markdown`
Small list (≤ ~20 URLs)	shell loop, 1 at a time (see `references/patterns.md`)
Larger list (dozens+)	`xargs -P 4` with parallelism cap (see `references/patterns.md`)
Paginated listing	scrape page 1 → extract next-page URL → append → repeat (see `references/examples.md`)
JS-heavy / login-gated / interaction-required	escalate to `bdata browser` (see `brightdata-cli` skill)
Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, …	stop — hand off to `data-feeds`
No URL yet, just a topic	hand off to `search`

Action

Core commands:

# Clean markdown (default)
bdata scrape "https://example.com/article" -f markdown -o article.md

# Raw HTML (when you need the DOM)
bdata scrape "https://example.com" -f html -o page.html

# Structured JSON (when the Unlocker returns parsed fields)
bdata scrape "https://example.com" -f json --pretty -o page.json

# Visual snapshot (saves PNG)
bdata scrape "https://example.com" -f screenshot -o page.png

# Geo-targeted (override the exit country)
bdata scrape "https://example.com" --country de -f markdown

Full flag reference: references/flags.md.

Verification gate (run before claiming success)

Non-empty output: test -s "$out_path" — or, for stdout, at least 200 bytes of content.
Not a block page — grep the output for any of these signatures (case-insensitive):
- Access Denied
- Just a moment
- Attention Required
- Checking your browser
- captcha
- cf-browser-verification
- cloudflare (with < 2KB total body)
Expected markers present for the task: e.g., a product page should contain a price pattern (\$\d); an article should contain at least one <h1> or # heading.
On failure, escalation ladder:
- Retry with a different --country (e.g., --country de if the origin site is US)
- Escalate to bdata browser for full JS rendering (hand off to brightdata-cli skill)

Do not report success until all checks above pass.

Red flags

Claiming success without inspecting the output.
Silencing errors with 2>/dev/null — you'll miss auth failures and rate-limit errors.
Running bdata scrape on Amazon/LinkedIn/TikTok/Instagram/YouTube/Reddit URLs — these are supported by data-feeds and return structured data directly. Scraping loses the structure.
Scraping the same URL repeatedly in the same task — cache the first result.
Looping bdata scrape sequentially for large lists instead of using xargs -P 4 (or similar) with a parallelism cap.
Using curl against api.brightdata.com directly — legacy path; only when the CLI isn't available.

References

references/flags.md — every flag with when-to-use notes.
references/patterns.md — shell-loop batching, xargs parallelism, pagination recipe, retry/backoff, block-page recovery chain, legacy curl fallback.
references/examples.md — (1) single page → markdown, (2) batch a list of URLs with parallelism cap, (3) paginated listing, (4) block-page recovery.

related-skills.json

mismo repositorio

scraper-studio.md

from "brightdata/skills"

Build and run AI-generated Bright Data scrapers from the terminal via `bdata scraper create` and `bdata scraper run`. Use this skill whenever the user wants to generate a scraper from a natural-language description, build a custom scraper without writing code, turn a URL + plain-English description into a reusable scraper, or run an existing Bright Data collector against a URL and get the data back. Triggers on phrases like 'build me a scraper for', 'create a scraper that extracts', 'generate a scraper from a description', 'turn this URL into a scraper', 'run this scraper on', 'run my collector', 'scraper studio', `scraper create`, `scraper run`, `collector_id`, `automate_template`, or `/dca/`. Covers the AI flow (template create → trigger AI generation → poll progress), the run flow (async + poll by default, `--sync` for fast pages), and the silent auto-fallback to the batch endpoint when a URL expands past the realtime page limit. Requires the Bright Data CLI.

2026-05-27137

bright-data-best-practices.md

from "brightdata/skills"

Build production-ready Bright Data integrations with best practices baked in. Reference documentation for developers using coding assistants (Claude Code, Cursor, etc.) to implement web scraping, search, browser automation, and structured data extraction. Covers Web Unlocker API, SERP API, Web Scraper API, and Browser API (Scraping Browser).

2026-05-24137

brightdata-proxy.md

from "brightdata/skills"

Generate working code that routes HTTP requests through Bright Data proxy networks (Datacenter, ISP, Residential, Mobile) and help users decide which network and IP pool type to use (shared pool, shared IPs, or dedicated IPs). Use this skill whenever the user mentions Bright Data, brightdata.com, BD proxies, brd.superproxy.io, geo.brdtest.com, a brd-customer- proxy username, a Bright Data zone, the superproxy host, or wants to scrape or route requests through Bright Data — including questions about proxy URL format, country or session or IP or sticky-session targeting, SSL certificate setup for residential or mobile proxies, KYC verification, ignoring SSL errors, choosing between shared pool and shared IPs and dedicated IPs, or integrating Bright Data into Python requests/httpx/aiohttp, Node fetch/axios, Playwright, Puppeteer, Selenium, or Scrapy.

2026-05-24137

brightdata-sdk.md

from "brightdata/skills"

Web data extraction and discovery using the Bright Data Python SDK. Use when user asks to "scrape", "get data from", "extract", "search for", or "find" information from websites. Also use when user mentions specific platforms like Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, Reddit, Pinterest, Zillow, Crunchbase, or DigiKey, or asks for "bulk data", "historical data", or "dataset". Covers scraping, searching, datasets, and browser automation.

2026-04-27137

agent-onboarding.md

from "brightdata/skills"

Onboard an agent to Bright Data. Use when a coding agent first encounters Bright Data — for live web work (search, scrape, structured data), for wiring Bright Data into product code, for installing the agent skill bundle, or for getting an API key. One install command sets up the CLI, agent skills, and authentication. Routes the reader to the right path: live tools, app integration, MCP, auth-only, or direct REST without any install.

2026-04-26137

seo-audit.md

from "brightdata/skills"

When the user wants to audit, review, or diagnose SEO issues on their site. Uses live web data via the Bright Data CLI for accurate detection of JS-injected schema, hreflang, canonicals, and live SERP-based ranking checks. Also use when the user mentions "SEO audit," "technical SEO," "why am I not ranking," "SEO issues," "on-page SEO," "meta tags review," "SEO health check," "my traffic dropped," "lost rankings," "not showing up in Google," "site isn't ranking," "Google update hit me," "page speed," "core web vitals," "crawl errors," or "indexing issues." Use this even if the user just says something vague like "my SEO is bad" or "help with SEO" — start with an audit. For building pages at scale to target keywords, see programmatic-seo. For implementing structured data, see schema-markup. For AI search optimization, see ai-seo.

2026-04-26137

package.json

"author": "brightdata"

"repository": "brightdata/skills"

Abrir repositorio de GitHub Ver repositorios del creador

$ install --global

$ download --local

Ejecutar en Manus

$ useful --forSOC

Desarrolladores de softwareOcupaciones informáticas y matemáticas15-1252L4

name

scrape

description

Bright Data — Scrape

Setup gate (run first)

Before any scrape, verify the CLI is installed and authenticated:

if ! command -v bdata >/dev/null 2>&1; then
    echo "bdata CLI not installed — see bright-data-best-practices/references/cli-setup.md"
elif ! bdata zones >/dev/null 2>&1; then
    echo "bdata not authenticated — run: bdata login  (or: bdata login --device for SSH)"
fi

If either check fails, halt and route the user to skills/bright-data-best-practices/references/cli-setup.md. Do not attempt the legacy curl fallback silently — ask the user first.

Pick your path

Situation	Action
Single URL	`bdata scrape <url> -f markdown`
Small list (≤ ~20 URLs)	shell loop, 1 at a time (see `references/patterns.md`)
Larger list (dozens+)	`xargs -P 4` with parallelism cap (see `references/patterns.md`)
Paginated listing	scrape page 1 → extract next-page URL → append → repeat (see `references/examples.md`)
JS-heavy / login-gated / interaction-required	escalate to `bdata browser` (see `brightdata-cli` skill)
Amazon, LinkedIn, TikTok, Instagram, YouTube, Reddit, …	stop — hand off to `data-feeds`
No URL yet, just a topic	hand off to `search`

Action

Core commands:

# Clean markdown (default)
bdata scrape "https://example.com/article" -f markdown -o article.md

# Raw HTML (when you need the DOM)
bdata scrape "https://example.com" -f html -o page.html

# Structured JSON (when the Unlocker returns parsed fields)
bdata scrape "https://example.com" -f json --pretty -o page.json

# Visual snapshot (saves PNG)
bdata scrape "https://example.com" -f screenshot -o page.png

# Geo-targeted (override the exit country)
bdata scrape "https://example.com" --country de -f markdown

Full flag reference: references/flags.md.

Verification gate (run before claiming success)

Non-empty output: test -s "$out_path" — or, for stdout, at least 200 bytes of content.
Not a block page — grep the output for any of these signatures (case-insensitive):
- Access Denied
- Just a moment
- Attention Required
- Checking your browser
- captcha
- cf-browser-verification
- cloudflare (with < 2KB total body)
Expected markers present for the task: e.g., a product page should contain a price pattern (\$\d); an article should contain at least one <h1> or # heading.
On failure, escalation ladder:
- Retry with a different --country (e.g., --country de if the origin site is US)
- Escalate to bdata browser for full JS rendering (hand off to brightdata-cli skill)

Do not report success until all checks above pass.

Red flags

Claiming success without inspecting the output.
Silencing errors with 2>/dev/null — you'll miss auth failures and rate-limit errors.
Running bdata scrape on Amazon/LinkedIn/TikTok/Instagram/YouTube/Reddit URLs — these are supported by data-feeds and return structured data directly. Scraping loses the structure.
Scraping the same URL repeatedly in the same task — cache the first result.
Looping bdata scrape sequentially for large lists instead of using xargs -P 4 (or similar) with a parallelism cap.
Using curl against api.brightdata.com directly — legacy path; only when the CLI isn't available.

References

references/flags.md — every flag with when-to-use notes.
references/patterns.md — shell-loop batching, xargs parallelism, pagination recipe, retry/backoff, block-page recovery chain, legacy curl fallback.
references/examples.md — (1) single page → markdown, (2) batch a list of URLs with parallelism cap, (3) paginated listing, (4) block-page recovery.

scrape

Bright Data — Scrape

Setup gate (run first)

Pick your path

Action

Verification gate (run before claiming success)

Red flags

References

Más de este repositorio

Más de este repositorio

Bright Data — Scrape

Setup gate (run first)

Pick your path

Action

Verification gate (run before claiming success)

Red flags

References