Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

crawlberg

Sterne5

Forks0

Aktualisiert25. Juni 2026 um 12:51

Crawl, scrape, and convert websites to Markdown using the local crawlberg CLI and its MCP server. Use when the user wants to fetch a page, follow links across a domain, enumerate URLs, or drive a real browser. Covers installation, the subcommands (scrape, crawl, map, interact, mcp, serve), output formats (JSON + Markdown), browser fallback, and when to prefer the MCP server over shelling out.

Installation

Mit Codex oder Claude installieren Kopieren Sie diesen Prompt, fügen Sie ihn in Codex, Claude oder einen anderen Assistant ein und lassen Sie die Skill-Seite prüfen und installieren.

In Manus ausführen

Quelle

xberg-io

xberg-io/plugins

GitHub-Repository öffnen Creator-Repositorys ansehen

Download

In Manus ausführen

SKILL.md

readonly

Mehr aus diesem Repository

gleiches Repository

automating-the-browser

xberg-io/plugins

Use when extracting a page needs scripted interaction first — click, type, press a key, scroll, wait, screenshot, or run JS before capturing the DOM. Covers `crawlberg interact <url> --actions` with the real action schema, result shape, limits, and external-CDP options.

2026-06-255

crawling-a-site

xberg-io/plugins

Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `crawlberg crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection.

2026-06-255

headless-fallback

xberg-io/plugins

Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto|always|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost.

2026-06-255

mapping-urls

xberg-io/plugins

Use when the user wants the list of URLs on a site rather than the page content — sitemap analysis, link planning, or seeding another tool. Covers `crawlberg map <url>` with `--limit`, `--search`, robots, output, and how it differs from a full crawl.

2026-06-255

scraping-html-to-markdown

xberg-io/plugins

Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `crawlberg scrape <url>`, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages.

2026-06-255

serving-the-api

xberg-io/plugins

Use when the user wants a long-running HTTP service for scrape/crawl/map instead of one-shot CLI calls or the MCP server — for example wiring crawlberg into other apps over REST. Covers `crawlberg serve`, the Firecrawl-v1-compatible endpoints, `--host`/`--port`, and when to prefer it.

2026-06-255

name	crawlberg
description	Crawl, scrape, and convert websites to Markdown using the local crawlberg CLI and its MCP server. Use when the user wants to fetch a page, follow links across a domain, enumerate URLs, or drive a real browser. Covers installation, the subcommands (scrape, crawl, map, interact, mcp, serve), output formats (JSON + Markdown), browser fallback, and when to prefer the MCP server over shelling out.
license	MIT
metadata	{"author":"kreuzberg-dev","version":"0.1.0","repository":"https://github.com/xberg-io/crawlberg"}

Crawlberg

Crawlberg is a Rust-native web crawler and scraper. It fetches static HTML with reqwest, falls back to headless Chrome when a page needs JS or trips a WAF, and converts every result to clean Markdown via the built-in HTML→Markdown engine.

Use this skill when the user wants to:

Scrape a single URL to Markdown plus structured metadata.
Crawl a site following links bounded by depth, page count, and concurrency.
Enumerate URLs from sitemaps without paying for rendering.
Drive a real browser (click, type, scroll) and capture the resulting DOM.
Run the same operations from another agent harness via MCP tools.

Installation

The plugin shells out to a crawlberg binary on PATH. Install one of:

brew install xberg-io/tap/crawlberg
# or run without a persistent install (the CLI proxy package self-installs the binary):
npx @kreuzberg/crawlberg-cli --help
uvx --from crawlberg-cli crawlberg --help
# or build from source:
cargo install --git https://github.com/xberg-io/crawlberg crawlberg-cli --features all

The serve and mcp subcommands are gated behind non-default cargo features (api and mcp). The Homebrew tap is built with all features, so both subcommands work out of the box. A from-source build must pass --features mcp (and --features api for serve), or --features all, to include them.

Verify:

crawlberg --version

Headless fallback needs Chrome/Chromium reachable locally (chromiumoxide launches it on demand). Skip the install if you only plan to use --browser-mode never.

Command map

crawlberg scrape <url>     # single page → JSON or Markdown
crawlberg crawl <url...>   # follow links, BFS, depth-bounded
crawlberg map <url>        # enumerate URLs via sitemaps + link extraction
crawlberg interact <url>   # browser actions: click, type, scroll
crawlberg mcp              # MCP server (stdio) — auto-registered (`mcp` feature)
crawlberg serve            # REST API server (`api` feature)

Batch behaviour is built into crawl: pass multiple seed URLs and the engine fans out via batch_crawl internally.

Shared flags

Flag	Default	Notes
`--format`	`json`	`json` or `markdown`.
`--timeout`	`30000`	Request timeout in milliseconds.
`--browser-mode`	`auto`	`auto`, `always`, or `never`.
`--browser-endpoint`	—	Optional CDP `ws://` or `wss://` URL.
`--respect-robots-txt`	off	Pass to obey `robots.txt`.
`--config <json>`	—	Inline JSON or `@file.json` to override defaults.

The --config flag accepts the full CrawlConfig schema. Anything you set explicitly on the CLI overrides the corresponding JSON field.

Scrape a single page

crawlberg scrape https://example.com --format markdown

JSON output (default) carries the rendered Markdown, page metadata (PageMetadata), links by category, images, feeds, JSON-LD blocks, and HTTP response metadata. Use Markdown output when piping into a file the user will read.

See the scraping-html-to-markdown skill for the full flag surface.

Crawl a site

crawlberg crawl https://example.com \
  --depth 3 --max-pages 200 --concurrent 8 --rate-limit 250 \
  --stay-on-domain --respect-robots-txt --format markdown

Crawling is BFS by default, bounded by --depth, --max-pages, and --concurrent. Per-domain politeness is enforced by --rate-limit (milliseconds between requests to the same origin).

See the crawling-a-site skill for the recommended defaults and the full flag surface.

Map URLs

crawlberg map https://example.com --limit 500 --search docs --format markdown

map reads sitemap.xml (and nested sitemaps), then falls back to link extraction from the seed page. It does not render pages — use it to plan a crawl or to feed URLs into another tool.

Browser interaction

crawlberg interact https://example.com \
  --actions '[{"type":"click","selector":"#load-more"},
              {"type":"wait","milliseconds":500},
              {"type":"scrape"}]'

Action types are click, type, press, scroll, wait, screenshot, executeJs, and scrape (to wait for an element, use wait with a selector field). The result wraps the final HTML under interaction.final_html. See the automating-the-browser skill for the full action schema and limits.

MCP server

When this plugin is installed in a Claude Code / Codex / Cursor / Gemini / opencode harness, the MCP server is auto-registered:

crawlberg mcp

mcp is a stdio-transport server and takes no arguments. It requires a binary built with the mcp feature (see Installation).

Prefer MCP tools over shelling out when both are available:

Typed schemas surface argument errors before the call.
Results stream back as structured tool output instead of stdout text.
No --format juggling — the harness pulls whatever shape it needs.

Fall back to the CLI when you need to script a pipeline, capture stderr, or chain with shell tools.

Headless fallback

In --browser-mode auto (default), the engine:

Fetches statically via reqwest.
Detects WAF blocks (8 vendors) and JS-only shells.
Re-fetches through headless Chrome with a real fingerprint when needed.

Force the browser path with --browser-mode always when you already know the page needs JS. Use --browser-mode never for hot loops where the cost of a stray Chrome launch is unacceptable.

Point --browser-endpoint ws://host:9222/devtools/browser/<id> at an already-running Chrome to skip the local launch.

See the headless-fallback skill for symptoms, costs, and external-CDP patterns.

Output formats

Mode	Use when
`json`	Downstream consumer needs metadata, links, images, etc.
`markdown`	Human reader or LLM-context payload.

Markdown output skips metadata. If you need both, run with --format json and read result.markdown.content.

Robots, rate limits, ethics

--respect-robots-txt is off by default; pass it for any crawl on a host you do not own.
The default --rate-limit 200 already produces a polite cadence; raise it for shared hosts.
Identify the crawler honestly via --user-agent. Do not impersonate a browser unless the operator has approved it.

Cross-references

skills/crawling-a-site/SKILL.md — multi-page crawl with depth, page caps, concurrency, rate limits, and domain scoping.
skills/scraping-html-to-markdown/SKILL.md — single-page rendering, the Markdown output shape, and common pitfalls.
skills/mapping-urls/SKILL.md — map: sitemap + link URL discovery, filtering, and seeding a crawl.
skills/automating-the-browser/SKILL.md — interact: the full scripted action schema, limits, and result shape.
skills/serving-the-api/SKILL.md — serve: the Firecrawl-v1-compatible REST API server and its endpoints.
skills/headless-fallback/SKILL.md — when and how to force the browser backend.