Exécutez n'importe quel Skill dans Manus
en un clic

Exécutez n'importe quel Skill dans Manus en un clic

headless-fallback

Étoiles5

Forks0

Mis à jour25 juin 2026 à 12:51

Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto|always|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost.

Installation

Installer avec Codex ou Claude Copiez ce prompt, collez-le dans Codex, Claude ou un autre assistant, puis laissez-le vérifier la page du skill et l'installer pour vous.

Exécuter dans Manus

Source

xberg-io

xberg-io/plugins

Ouvrir le dépôt GitHub Voir les dépôts du créateur

Téléchargement

Exécuter dans Manus

SKILL.md

readonly

name	headless-fallback
description	Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto\|always\|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost.

Headless fallback

Some pages are unscrapable without a real browser — SPA shells, infinite scroll, Cloudflare interstitials, JS-rendered article bodies. Crawlberg ships with an optional headless-Chrome backend driven by chromiumoxide.

Modes

--browser-mode auto    # default — try static first, fall back to browser on JS/WAF
--browser-mode always  # skip static, go straight to browser
--browser-mode never   # static only, fail closed

`auto` (default)

The engine fetches statically, then inspects the response. It launches headless Chrome and re-fetches when it sees:

WAF responses from one of 8 detected vendors (Cloudflare, Akamai, AWS WAF, Imperva, DataDome, PerimeterX, Sucuri, F5).
SPA shells: <noscript> warnings, near-empty <body> with heavy JS.
Heuristic JS-render-required signals.

This is the right default. The browser only spins up when needed.

`always`

Skip the static probe entirely. Use when:

The user already told you the page needs JS.
You are scraping a site you know is React/Vue/Svelte SPA.
You need <script>-emitted state that never lands in static HTML.

crawlberg scrape https://spa.example.com --browser-mode always --format markdown

`never`

Static only — the browser path is disabled. Use when:

You are in a hot loop where a stray Chrome launch would blow the budget.
You are running in a sandbox without a Chrome binary.
The user explicitly wants only static fetches.

In never mode, JS-only pages return empty/stub content. Inspect markdown.content and markdown.warnings before treating the result as final.

Symptoms that point to headless

In --browser-mode never or when you suspect the auto detector missed a signal:

markdown.content is short, nav-only, or just a loading message.
status_code is 200 but metadata.headings is empty on a page that clearly has headings.
markdown.warnings mentions JS-render-required or WAF detection.
403/406/503 with WAF response headers (server: cloudflare, cf-mitigated, x-amz-cf-id, set-cookie: __cf_bm=…).

Re-run with --browser-mode always. If that succeeds, leave it set for that host.

External CDP endpoint

Point at an already-running Chrome (Browserless, Steel, your own) instead of launching locally:

crawlberg scrape https://example.com \
  --browser-mode always \
  --browser-endpoint ws://browser.internal:9222/devtools/browser/<id> \
  --format markdown

The endpoint must be a WebSocket URL — ws:// or wss://. The CLI rejects anything else with a clear error.

Use external CDP when:

You are running in containers or CI without a local Chrome.
You want a shared, warm browser pool across many crawl jobs.
You need browser-side residential proxies or stealth configuration the local Chrome cannot provide.

Performance cost

Headless Chrome is expensive relative to a static fetch:

Cold start: 1-3 seconds the first time it launches.
Per-page overhead: 500 ms-2 s for NetworkIdle wait, plus the page's own JS load time.
Memory: each tab takes 100-300 MB; long crawls should bound --concurrent.

Mitigations:

Stay in --browser-mode auto — the engine only pays the cost when it needs to.
Use --browser-endpoint to share one warm browser across jobs.
Drop --concurrent when you know the crawl will route through Chrome.

Wait strategies

Pass via --config JSON when you need control:

crawlberg scrape https://example.com --browser-mode always \
  --config '{"browser":{"wait_strategy":{"type":"Selector","selector":".article-body"}}}'

Supported strategies:

NetworkIdle (default) — wait until the network goes quiet.
Selector — wait until a CSS selector resolves.
Fixed — wait a fixed duration.

extra_wait adds milliseconds on top of the wait strategy if the page keeps loading content after the primary signal.

Persistent profiles

crawlberg scrape https://app.example.com --browser-mode always \
  --config '{"browser_profile":"prod","save_browser_profile":true}'

Profile names are path-traversal-validated. Use them to keep cookies, localStorage, and login state across runs without re-authenticating.

Plus depuis ce dépôt

même dépôt

automating-the-browser

xberg-io/plugins

Use when extracting a page needs scripted interaction first — click, type, press a key, scroll, wait, screenshot, or run JS before capturing the DOM. Covers `crawlberg interact <url> --actions` with the real action schema, result shape, limits, and external-CDP options.

2026-06-255

crawlberg

xberg-io/plugins

Crawl, scrape, and convert websites to Markdown using the local crawlberg CLI and its MCP server. Use when the user wants to fetch a page, follow links across a domain, enumerate URLs, or drive a real browser. Covers installation, the subcommands (scrape, crawl, map, interact, mcp, serve), output formats (JSON + Markdown), browser fallback, and when to prefer the MCP server over shelling out.

2026-06-255

crawling-a-site

xberg-io/plugins

Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `crawlberg crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection.

2026-06-255

mapping-urls

xberg-io/plugins

Use when the user wants the list of URLs on a site rather than the page content — sitemap analysis, link planning, or seeding another tool. Covers `crawlberg map <url>` with `--limit`, `--search`, robots, output, and how it differs from a full crawl.

2026-06-255

scraping-html-to-markdown

xberg-io/plugins

Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `crawlberg scrape <url>`, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages.

2026-06-255

serving-the-api

xberg-io/plugins

Use when the user wants a long-running HTTP service for scrape/crawl/map instead of one-shot CLI calls or the MCP server — for example wiring crawlberg into other apps over REST. Covers `crawlberg serve`, the Firecrawl-v1-compatible endpoints, `--host`/`--port`, and when to prefer it.

2026-06-255

name	headless-fallback
description	Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto\|always\|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost.

Headless fallback

Modes

--browser-mode auto    # default — try static first, fall back to browser on JS/WAF
--browser-mode always  # skip static, go straight to browser
--browser-mode never   # static only, fail closed

`auto` (default)

The engine fetches statically, then inspects the response. It launches headless Chrome and re-fetches when it sees:

WAF responses from one of 8 detected vendors (Cloudflare, Akamai, AWS WAF, Imperva, DataDome, PerimeterX, Sucuri, F5).
SPA shells: <noscript> warnings, near-empty <body> with heavy JS.
Heuristic JS-render-required signals.

This is the right default. The browser only spins up when needed.

`always`

Skip the static probe entirely. Use when:

The user already told you the page needs JS.
You are scraping a site you know is React/Vue/Svelte SPA.
You need <script>-emitted state that never lands in static HTML.

crawlberg scrape https://spa.example.com --browser-mode always --format markdown

`never`

Static only — the browser path is disabled. Use when:

You are in a hot loop where a stray Chrome launch would blow the budget.
You are running in a sandbox without a Chrome binary.
The user explicitly wants only static fetches.

In never mode, JS-only pages return empty/stub content. Inspect markdown.content and markdown.warnings before treating the result as final.

Symptoms that point to headless

In --browser-mode never or when you suspect the auto detector missed a signal:

markdown.content is short, nav-only, or just a loading message.
status_code is 200 but metadata.headings is empty on a page that clearly has headings.
markdown.warnings mentions JS-render-required or WAF detection.
403/406/503 with WAF response headers (server: cloudflare, cf-mitigated, x-amz-cf-id, set-cookie: __cf_bm=…).

Re-run with --browser-mode always. If that succeeds, leave it set for that host.

External CDP endpoint

Point at an already-running Chrome (Browserless, Steel, your own) instead of launching locally:

crawlberg scrape https://example.com \
  --browser-mode always \
  --browser-endpoint ws://browser.internal:9222/devtools/browser/<id> \
  --format markdown

The endpoint must be a WebSocket URL — ws:// or wss://. The CLI rejects anything else with a clear error.

Use external CDP when:

You are running in containers or CI without a local Chrome.
You want a shared, warm browser pool across many crawl jobs.
You need browser-side residential proxies or stealth configuration the local Chrome cannot provide.

Performance cost

Headless Chrome is expensive relative to a static fetch:

Cold start: 1-3 seconds the first time it launches.
Per-page overhead: 500 ms-2 s for NetworkIdle wait, plus the page's own JS load time.
Memory: each tab takes 100-300 MB; long crawls should bound --concurrent.

Mitigations:

Stay in --browser-mode auto — the engine only pays the cost when it needs to.
Use --browser-endpoint to share one warm browser across jobs.
Drop --concurrent when you know the crawl will route through Chrome.

Wait strategies

Pass via --config JSON when you need control:

crawlberg scrape https://example.com --browser-mode always \
  --config '{"browser":{"wait_strategy":{"type":"Selector","selector":".article-body"}}}'

Supported strategies:

NetworkIdle (default) — wait until the network goes quiet.
Selector — wait until a CSS selector resolves.
Fixed — wait a fixed duration.

extra_wait adds milliseconds on top of the wait strategy if the page keeps loading content after the primary signal.

Persistent profiles

crawlberg scrape https://app.example.com --browser-mode always \
  --config '{"browser_profile":"prod","save_browser_profile":true}'

Profile names are path-traversal-validated. Use them to keep cookies, localStorage, and login state across runs without re-authenticating.

headless-fallback

Headless fallback

Modes

auto (default)

always

never

Symptoms that point to headless

External CDP endpoint

Performance cost

Wait strategies

Persistent profiles

Plus depuis ce dépôt

Plus depuis ce dépôt

Headless fallback

Modes

auto (default)

always

never

Symptoms that point to headless

External CDP endpoint

Performance cost

Wait strategies

Persistent profiles

`auto` (default)

`always`

`never`

`auto` (default)

`always`

`never`