تشغيل أي مهارة في Manus بنقرة واحدة

crawling-a-site

النجوم٥

التفرعات٠

آخر تحديث٢٥ يونيو ٢٠٢٦ في ١٢:٥١

Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `crawlberg crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection.

التثبيت

التثبيت باستخدام Codex أو Claude انسخ هذا Prompt والصقه في Codex أو Claude أو مساعد آخر ليراجع صفحة Skill ويثبّتها لك.

تشغيل في Manus

المصدر

xberg-io

xberg-io/plugins

فتح مستودع GitHub عرض مستودعات المنشئ

تنزيل

تشغيل في Manus

SKILL.md

readonly

name	crawling-a-site
description	Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `crawlberg crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection.

Crawling a site

Reach for crawlberg crawl when one URL is not enough — the user wants the docs site, the blog, the marketing pages, or the whole domain.

Quick recipe

crawlberg crawl https://example.com \
  --depth 3 \
  --max-pages 200 \
  --concurrent 8 \
  --rate-limit 250 \
  --stay-on-domain \
  --respect-robots-txt \
  --format markdown

Defaults you should usually override:

--depth 2 is shallow — set it explicitly.
--max-pages is unbounded by default; cap it for any unknown site.
--concurrent 10 is aggressive for small hosts; drop to 4-8 for third-party sites.

Flag surface

Flag	Default	Purpose
`--depth`, `-d`	`2`	Maximum hop count from the seed URL.
`--max-pages`, `-n`	—	Hard cap on pages fetched. Set this on any unknown site.
`--concurrent`, `-c`	`10`	Parallel in-flight requests.
`--rate-limit`	`200`	Milliseconds between requests to the same origin.
`--stay-on-domain`	off	Skip links that leave the seed domain.
`--respect-robots-txt`	off	Honour `robots.txt`. Pass it for any third-party host.
`--proxy`	—	HTTP, HTTPS, or SOCKS5 proxy URL.
`--user-agent`	—	Override the request UA. Be honest.
`--timeout`	`30000`	Per-request timeout in ms.
`--browser-mode`	`auto`	`auto`, `always`, `never` — see the headless-fallback skill.
`--browser-endpoint`	—	External CDP `ws://` URL.
`--format`	`json`	`json` or `markdown`.
`--config`	—	Inline JSON or `@file.json` for the full `CrawlConfig`.

Multiple seed URLs are accepted positionally — the engine fans out with batch_crawl and aggregates results.

When to pick which flags

Docs sites you own

crawlberg crawl https://docs.example.com \
  --depth 5 --max-pages 1000 --concurrent 16 --rate-limit 100 \
  --stay-on-domain --format markdown > docs.md

Higher concurrency and lower rate limits are fine on infrastructure you control.

Third-party sites

crawlberg crawl https://blog.unknown.example \
  --depth 2 --max-pages 50 --concurrent 4 --rate-limit 500 \
  --stay-on-domain --respect-robots-txt --format markdown

Stay shallow, cap pages, throttle hard, obey robots.

Multi-seed batch

crawlberg crawl \
  https://example.com/blog \
  https://example.com/docs \
  https://example.com/pricing \
  --depth 2 --max-pages 100 --stay-on-domain --format json

JSON output for batch is an array of { seed_url, result } entries — each result is a full crawl payload or { error: ... }.

Output

Markdown mode

---
URL: https://example.com/page-one
---
# Page One

… markdown content …

---
URL: https://example.com/page-two
---
…

JSON mode

Top-level CrawlResult with pages: [...]. Each page carries the rendered Markdown plus metadata, links, images, JSON-LD, and HTTP response info. Read result.pages[i].markdown.content for the Markdown string.

Politeness checklist

Pass --respect-robots-txt on every third-party crawl.
Cap --max-pages — a runaway BFS can issue tens of thousands of requests.
Bump --rate-limit for hosts that show signs of stress (5xx, slowdowns).
Identify yourself via --user-agent crawlberg (contact@example.com).

Common pitfalls

No pages returned. The seed page may be JS-only — the engine falls back to headless automatically in --browser-mode auto, but never mode will silently produce an empty crawl. Re-run with --browser-mode always or check the headless-fallback skill.
Crawl leaves the domain. Pass --stay-on-domain. Combine with allow_subdomains: true in --config JSON to include subdomains.
Slow crawl. The default rate limit is 200 ms per origin — multiple seed URLs on the same host still share the bucket. Spread seeds across hosts or raise --concurrent for unrelated origins.
Memory growth. Each page carries full Markdown plus structured data. Stream JSON output to a file rather than holding it in memory; set --max-pages aggressively if downstream cannot keep up.

When to reach for `map` instead

If the user only needs the list of URLs (sitemap analysis, link planning, seeding another tool), use crawlberg map <url> — it skips rendering and returns a flat MapResult with hundreds of URLs in seconds.

المزيد من هذا المستودع

نفس المستودع

automating-the-browser

xberg-io/plugins

Use when extracting a page needs scripted interaction first — click, type, press a key, scroll, wait, screenshot, or run JS before capturing the DOM. Covers `crawlberg interact <url> --actions` with the real action schema, result shape, limits, and external-CDP options.

2026-06-255

crawlberg

xberg-io/plugins

Crawl, scrape, and convert websites to Markdown using the local crawlberg CLI and its MCP server. Use when the user wants to fetch a page, follow links across a domain, enumerate URLs, or drive a real browser. Covers installation, the subcommands (scrape, crawl, map, interact, mcp, serve), output formats (JSON + Markdown), browser fallback, and when to prefer the MCP server over shelling out.

2026-06-255

headless-fallback

xberg-io/plugins

Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto|always|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost.

2026-06-255

mapping-urls

xberg-io/plugins

Use when the user wants the list of URLs on a site rather than the page content — sitemap analysis, link planning, or seeding another tool. Covers `crawlberg map <url>` with `--limit`, `--search`, robots, output, and how it differs from a full crawl.

2026-06-255

scraping-html-to-markdown

xberg-io/plugins

Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `crawlberg scrape <url>`, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages.

2026-06-255

serving-the-api

xberg-io/plugins

Use when the user wants a long-running HTTP service for scrape/crawl/map instead of one-shot CLI calls or the MCP server — for example wiring crawlberg into other apps over REST. Covers `crawlberg serve`, the Firecrawl-v1-compatible endpoints, `--host`/`--port`, and when to prefer it.

2026-06-255

name	crawling-a-site
description	Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `crawlberg crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection.

Crawling a site

Reach for crawlberg crawl when one URL is not enough — the user wants the docs site, the blog, the marketing pages, or the whole domain.

Quick recipe

crawlberg crawl https://example.com \
  --depth 3 \
  --max-pages 200 \
  --concurrent 8 \
  --rate-limit 250 \
  --stay-on-domain \
  --respect-robots-txt \
  --format markdown

Defaults you should usually override:

--depth 2 is shallow — set it explicitly.
--max-pages is unbounded by default; cap it for any unknown site.
--concurrent 10 is aggressive for small hosts; drop to 4-8 for third-party sites.

Flag surface

Flag	Default	Purpose
`--depth`, `-d`	`2`	Maximum hop count from the seed URL.
`--max-pages`, `-n`	—	Hard cap on pages fetched. Set this on any unknown site.
`--concurrent`, `-c`	`10`	Parallel in-flight requests.
`--rate-limit`	`200`	Milliseconds between requests to the same origin.
`--stay-on-domain`	off	Skip links that leave the seed domain.
`--respect-robots-txt`	off	Honour `robots.txt`. Pass it for any third-party host.
`--proxy`	—	HTTP, HTTPS, or SOCKS5 proxy URL.
`--user-agent`	—	Override the request UA. Be honest.
`--timeout`	`30000`	Per-request timeout in ms.
`--browser-mode`	`auto`	`auto`, `always`, `never` — see the headless-fallback skill.
`--browser-endpoint`	—	External CDP `ws://` URL.
`--format`	`json`	`json` or `markdown`.
`--config`	—	Inline JSON or `@file.json` for the full `CrawlConfig`.

Multiple seed URLs are accepted positionally — the engine fans out with batch_crawl and aggregates results.

When to pick which flags

Docs sites you own

crawlberg crawl https://docs.example.com \
  --depth 5 --max-pages 1000 --concurrent 16 --rate-limit 100 \
  --stay-on-domain --format markdown > docs.md

Higher concurrency and lower rate limits are fine on infrastructure you control.

Third-party sites

crawlberg crawl https://blog.unknown.example \
  --depth 2 --max-pages 50 --concurrent 4 --rate-limit 500 \
  --stay-on-domain --respect-robots-txt --format markdown

Stay shallow, cap pages, throttle hard, obey robots.

Multi-seed batch

crawlberg crawl \
  https://example.com/blog \
  https://example.com/docs \
  https://example.com/pricing \
  --depth 2 --max-pages 100 --stay-on-domain --format json

JSON output for batch is an array of { seed_url, result } entries — each result is a full crawl payload or { error: ... }.

Output

Markdown mode

---
URL: https://example.com/page-one
---
# Page One

… markdown content …

---
URL: https://example.com/page-two
---
…

JSON mode

Politeness checklist

Pass --respect-robots-txt on every third-party crawl.
Cap --max-pages — a runaway BFS can issue tens of thousands of requests.
Bump --rate-limit for hosts that show signs of stress (5xx, slowdowns).
Identify yourself via --user-agent crawlberg (contact@example.com).

Common pitfalls

No pages returned. The seed page may be JS-only — the engine falls back to headless automatically in --browser-mode auto, but never mode will silently produce an empty crawl. Re-run with --browser-mode always or check the headless-fallback skill.
Crawl leaves the domain. Pass --stay-on-domain. Combine with allow_subdomains: true in --config JSON to include subdomains.
Slow crawl. The default rate limit is 200 ms per origin — multiple seed URLs on the same host still share the bucket. Spread seeds across hosts or raise --concurrent for unrelated origins.
Memory growth. Each page carries full Markdown plus structured data. Stream JSON output to a file rather than holding it in memory; set --max-pages aggressively if downstream cannot keep up.

crawling-a-site

Crawling a site

Quick recipe

Flag surface

When to pick which flags

Docs sites you own

Third-party sites

Multi-seed batch

Output

Markdown mode

JSON mode

Politeness checklist

Common pitfalls

When to reach for map instead

المزيد من هذا المستودع

المزيد من هذا المستودع

Crawling a site

Quick recipe

Flag surface

When to pick which flags

Docs sites you own

Third-party sites

Multi-seed batch

Output

Markdown mode

JSON mode

Politeness checklist

Common pitfalls

When to reach for map instead

When to reach for `map` instead

When to reach for `map` instead