scraping-html-to-markdown

Updated June 25, 2026 at 12:51

Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `crawlberg scrape <url>`, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages.

Source

xberg-io

xberg-io/plugins

SKILL.md

name	scraping-html-to-markdown
description	Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `crawlberg scrape <url>`, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages.

Scraping HTML to Markdown

crawlberg scrape <url> is the right tool when the user has a single page in mind. It returns Markdown plus a full structured payload (metadata, links, images, JSON-LD, HTTP response info).

Quick recipe

crawlberg scrape https://example.com/article --format markdown

Use JSON output (the default) when downstream code needs the metadata:

crawlberg scrape https://example.com/article --format json

Flag surface

| Flag | Default | Purpose |
| --- | --- | --- |
| `--format` | `json` | `json` or `markdown`. |
| `--timeout` | `30000` | Per-request timeout in ms. |
| `--proxy` | none | HTTP, HTTPS, or SOCKS5 proxy URL. |
| `--user-agent` | none | Override request UA. |
| `--respect-robots-txt` | off | Honour `robots.txt`. |
| `--browser-mode` | `auto` | `auto`, `always`, or `never`; see the headless-fallback skill. |
| `--browser-endpoint` | none | External CDP `ws://` URL. |
| `--config` | none | Inline JSON or `@file.json` for full `CrawlConfig`. |

Output shape

Markdown mode

Prints the rendered Markdown only. Use when piping to a file the user will read, or when the result becomes LLM context downstream.

JSON mode

Top-level PageResult with:

url, final_url (after redirects), status_code.
markdown: { content, fit_content, warnings } — fit_content is a pruned LLM-optimised variant.
metadata: Open Graph, Twitter Card, Dublin Core, article tags, JSON-LD, headings (H1–H6), feeds, favicons, hreflang.
links: arrays for Internal, External, Anchor, and Document.
images: <img>, <picture>, srcset, og:image.
tables: structured table data preserved separately from Markdown.
response: HTTP headers, content type, charset, body size.

When scripting, parse the JSON into a result object and read result.markdown.content for the Markdown string.
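
A minimal sketch of pulling those fields out in a script, assuming the JSON was first saved with `crawlberg scrape <url> --format json > page.json` and that `jq` is installed. The helper name `save_page` and the file names are hypothetical; the field names are the ones listed above.

```shell
# save_page is a hypothetical helper, not part of crawlberg.
# $1: path to a saved PageResult JSON file
# $2: path where the Markdown should be written
save_page() {
  # -r prints the raw string instead of a JSON-quoted one.
  jq -r '.markdown.content' "$1" > "$2" || return 1
  # One-line summary: HTTP status and the URL after redirects.
  jq -r '"\(.status_code) \(.final_url)"' "$1"
}
```

Calling `save_page page.json article.md` writes the Markdown to `article.md` and prints the status code and final URL on one line.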

Common pitfalls

Empty or stub content

The static fetch returned only a JavaScript shell, not the rendered page. Symptoms in JSON output:

markdown.content is short or only contains nav/footer chrome.
markdown.warnings mentions JS-render-required.
metadata.headings is empty when the page clearly has headings.

Re-run with --browser-mode always and see the headless-fallback skill.
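
The re-run can be scripted. The sketch below works in Markdown mode and uses only the first symptom (very short content) as its signal. The 200-byte threshold and the helper names `is_stub` and `scrape_with_fallback` are assumptions chosen for illustration, not crawlberg behaviour.

```shell
# Hypothetical threshold: Markdown smaller than this many bytes counts as a stub.
STUB_BYTES=${STUB_BYTES:-200}

# Succeeds when the file at $1 is smaller than STUB_BYTES.
is_stub() {
  size=$(wc -c < "$1")
  # $((size)) normalises the padded number some wc implementations print.
  [ "$((size))" -lt "$STUB_BYTES" ]
}

# $1: URL, $2: output path for the Markdown.
scrape_with_fallback() {
  crawlberg scrape "$1" --format markdown > "$2" || return 1
  if is_stub "$2"; then
    # The static fetch looks like a JS shell: retry through the browser.
    crawlberg scrape "$1" --format markdown --browser-mode always > "$2"
  fi
}
```

A size check is a blunt heuristic: a page that is legitimately short also triggers the slower browser run, so tune `STUB_BYTES` to the kind of pages being scraped.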

WAF block

Auto mode detects 8 WAF vendors and retries through headless Chrome automatically. If you forced --browser-mode never, no retry happens and the WAF's block response is returned as the result. Check response.status_code: a 403, 406, or 503 together with WAF headers (server: cloudflare, x-amz-cf-id, etc.) is the giveaway.
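
The check described above can be sketched as a small shell predicate. It assumes the status code and the response headers have already been pulled out of the JSON output, with the headers flattened to one `name: value` pair per line. The helper name `looks_like_waf_block` is hypothetical, and the pattern covers only the two header hints named above.

```shell
# looks_like_waf_block is a hypothetical helper.
# $1: HTTP status code
# $2: response headers as text, one "name: value" pair per line
looks_like_waf_block() {
  case "$1" in
    403|406|503) ;;   # the statuses the text above calls out
    *) return 1 ;;
  esac
  # Case-insensitive match on the two header hints; extend for other vendors.
  printf '%s\n' "$2" | grep -qiE '^(server:[[:space:]]*cloudflare|x-amz-cf-id:)'
}
```

For example, `looks_like_waf_block 403 'server: cloudflare'` succeeds, while the same headers with status 200 do not.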

Robots.txt blocking the fetch

If --respect-robots-txt is set and the path is disallowed, the scrape returns an error rather than partial content. Drop the flag only on hosts you own or have authorisation for.
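
A script that keeps the flag on should treat the error as a skip instead of writing an empty file. The sketch below assumes the error surfaces as a non-zero exit status, which the text above does not spell out; the helper name `scrape_respecting_robots` is hypothetical.

```shell
# $1: URL, $2: output path for the Markdown.
scrape_respecting_robots() {
  if crawlberg scrape "$1" --respect-robots-txt --format markdown > "$2"; then
    return 0
  fi
  # Assumed: a disallowed path makes the command exit non-zero.
  rm -f "$2"   # do not leave an empty or partial file behind
  echo "skipped $1: scrape failed (robots.txt may disallow it)" >&2
  return 1
}
```

The exit status alone does not say why the scrape failed, so the message only suggests robots.txt as a possible cause.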

Wrong charset

Most pages declare UTF-8. Pages that declare the wrong charset show up as mojibake in markdown.content. Override via --config '{"force_encoding":"latin-1"}', substituting the page's actual encoding.
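
When the same override is needed repeatedly, the `@file.json` form of `--config` from the flag table avoids quoting JSON on every command line. A minimal sketch follows; the helper name `write_encoding_config` and the file name are hypothetical, and `force_encoding` is the only field set because it is the only one the text above names.

```shell
# write_encoding_config is a hypothetical helper.
# $1: encoding name, $2: path of the config file to create.
write_encoding_config() {
  # Emits the same JSON as the inline example, with the encoding filled in.
  printf '{"force_encoding":"%s"}\n' "$1" > "$2"
}
```

After `write_encoding_config latin-1 encoding.json`, the scrape becomes `crawlberg scrape <url> --config @encoding.json --format markdown`.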

Examples

Scrape an article for downstream LLM context

crawlberg scrape https://blog.example.com/post-123 --format markdown \
  > /tmp/article.md

Scrape with proxy and custom UA

crawlberg scrape https://example.com \
  --proxy http://proxy.internal:3128 \
  --user-agent "crawlberg (research@example.com)" \
  --format json

Extract just the OG metadata

crawlberg scrape https://example.com --format json \
  | jq '.metadata | {title: .og.title, description: .og.description, image: .og.image}'

When to reach for crawl or interact instead

The user wants the whole site, not one page → crawling-a-site skill.
The user needs to click, type, or scroll before extracting → use crawlberg interact with the action list.
The user only wants the list of URLs → crawlberg map.
