تشغيل أي مهارة في Manus بنقرة واحدة

html-to-markdown

النجوم٥

التفرعات٠

آخر تحديث٢٥ يونيو ٢٠٢٦ في ١٢:٥١

Convert HTML to Markdown, Djot, or plain text with structured extraction. Use when writing code that calls html-to-markdown APIs in Rust, Python, TypeScript, Go, Ruby, PHP, Java, C#, Elixir, R, C, or WASM. Covers installation, conversion, configuration, metadata extraction, tables, document structure, inline images, URL fetching, and CLI usage.

التثبيت

التثبيت باستخدام Codex أو Claude انسخ هذا Prompt والصقه في Codex أو Claude أو مساعد آخر ليراجع صفحة Skill ويثبّتها لك.

تشغيل في Manus

المصدر

xberg-io

xberg-io/plugins

فتح مستودع GitHub عرض مستودعات المنشئ

تنزيل

تشغيل في Manus

مستكشف الملفات

7 ملفات

SKILL.md

readonly

المزيد من هذا المستودع

نفس المستودع

automating-the-browser

xberg-io/plugins

Use when extracting a page needs scripted interaction first — click, type, press a key, scroll, wait, screenshot, or run JS before capturing the DOM. Covers `crawlberg interact <url> --actions` with the real action schema, result shape, limits, and external-CDP options.

2026-06-255

crawlberg

xberg-io/plugins

Crawl, scrape, and convert websites to Markdown using the local crawlberg CLI and its MCP server. Use when the user wants to fetch a page, follow links across a domain, enumerate URLs, or drive a real browser. Covers installation, the subcommands (scrape, crawl, map, interact, mcp, serve), output formats (JSON + Markdown), browser fallback, and when to prefer the MCP server over shelling out.

2026-06-255

crawling-a-site

xberg-io/plugins

Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `crawlberg crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection.

2026-06-255

headless-fallback

xberg-io/plugins

Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto|always|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost.

2026-06-255

mapping-urls

xberg-io/plugins

Use when the user wants the list of URLs on a site rather than the page content — sitemap analysis, link planning, or seeding another tool. Covers `crawlberg map <url>` with `--limit`, `--search`, robots, output, and how it differs from a full crawl.

2026-06-255

scraping-html-to-markdown

xberg-io/plugins

Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `crawlberg scrape <url>`, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages.

2026-06-255

name	html-to-markdown
description	Convert HTML to Markdown, Djot, or plain text with structured extraction. Use when writing code that calls html-to-markdown APIs in Rust, Python, TypeScript, Go, Ruby, PHP, Java, C#, Elixir, R, C, or WASM. Covers installation, conversion, configuration, metadata extraction, tables, document structure, inline images, URL fetching, and CLI usage.
license	MIT
metadata	{"author":"kreuzberg-dev","version":"0.1.0","repository":"https://github.com/xberg-io/html-to-markdown"}

html-to-markdown

html-to-markdown is a high-performance HTML→Markdown converter with a Rust core and 12 native language bindings. It converts HTML to CommonMark Markdown, Djot, or plain text in a single pass, optionally extracting metadata, tables, inline images, and a structured document tree.

Use this skill when writing code that:

Converts HTML strings, files, or live URLs to Markdown, Djot, or plain text
Extracts metadata (title, OG tags, JSON-LD/Microdata/RDFa, headers, links, images, language) from HTML
Extracts structured table data (GFM markdown + cell grids) from HTML
Extracts a structured document-structure tree
Extracts inline images (data URIs, SVGs) from HTML
Uses preprocessing to clean noisy HTML (ads, navigation, forms) before conversion

Capability map

Capability	CLI	SDKs
HTML→Markdown / Djot / plain text	`html-to-markdown FILE`	`convert(html, options)`
Read HTML from stdin / file / URL	`cat f \| …`, `FILE`, `--url URL`	`convert(htmlString, …)`
~30 config options (headings, code blocks, lists, escaping, wrapping…)	flags	`ConversionOptions`
Metadata extraction	`--json` (default-extracted)	`result.metadata`
Table extraction	`--json` → `tables[]`	`result.tables`
Document structure tree	`--json --include-structure`	`include_document_structure=true`
Inline image extraction	`--json --extract-inline-images`	`extract_images=true`
HTML preprocessing	`--preprocess [--preset …]`	`PreprocessingOptions`
Extraction-only (no Markdown body)	`--json --no-content`	options + read fields

Installation

CLI

# (Homebrew 6.0+ requires explicit trust for third-party taps)
brew trust xberg-io/tap
brew install xberg-io/tap/html-to-markdown
# or run without a persistent install (the CLI proxy package self-installs the binary):
npx @kreuzberg/html-to-markdown-cli --help
uvx --from html-to-markdown-cli html-to-markdown --help
# or download a prebuilt binary from the latest GitHub release:
#   https://github.com/xberg-io/html-to-markdown/releases/latest
# or build from source:
cargo install --git https://github.com/xberg-io/html-to-markdown html-to-markdown-cli

Language SDKs

pip install html-to-markdown                                # Python
npm install @kreuzberg/html-to-markdown                      # TypeScript / Node.js
cargo add html-to-markdown-rs                                # Rust (features: metadata default; full = all)
gem install html-to-markdown                                 # Ruby
composer require xberg-io/html-to-markdown              # PHP
go get github.com/xberg-io/html-to-markdown/packages/go/v3   # Go
dotnet add package KreuzbergDev.HtmlToMarkdown               # C#
npm install @kreuzberg/html-to-markdown-wasm                 # WASM

Java (Maven): dev.kreuzberg:html-to-markdown
Elixir: {:html_to_markdown, "~> 3.6"} in mix.exs
R: install.packages("htmltomarkdown", repos = "https://xberg-io.r-universe.dev")
C (FFI): pre-built .so / .dll / .dylib from GitHub releases

CLI vs SDK — which to use

CLI — one-shot conversions, shell pipelines, fetching a single URL, ad-hoc metadata/table extraction via --json | jq. Flags only, no subcommands. FILE is positional; omit or use - for stdin.
SDK — embedding conversion in application code, batch processing, custom element conversion (visitor pattern, Rust), and tight loops where process spawn overhead matters.
MCP server — html-to-markdown mcp exposes convert/extract as agent tools, so an MCP client can convert HTML directly with no shell-out. This plugin auto-registers it; see the using-the-mcp-server skill.

Both share the same ConversionResult shape, so output is interchangeable.

When to use html-to-markdown vs kreuzberg vs crawlberg

html-to-markdown — you already have HTML (a string, a file, or a single URL) and want clean Markdown plus structured metadata/tables. No OCR, no document parsing, no crawling.
kreuzberg — you have documents (PDF, Office, images, email, archives) and need full text/table/metadata extraction with optional OCR. Use it when the input is not already HTML.
crawlberg — you need to crawl or scrape many pages, follow links, and handle JS-rendered sites with a headless-Chrome fallback. It uses html-to-markdown internally for the HTML→Markdown step.

Rule of thumb: single HTML in → Markdown out = html-to-markdown. Many URLs / a site = crawlberg. Non-HTML documents = kreuzberg.

CLI quick start

# Convert a file to stdout
html-to-markdown input.html

# Convert and save
html-to-markdown input.html -o output.md

# Read from stdin
cat page.html | html-to-markdown

# Fetch and convert a URL
html-to-markdown --url https://example.com > out.md

# Full ConversionResult as JSON (content, tables, metadata, images, warnings)
html-to-markdown --json input.html

# JSON with document structure tree
html-to-markdown --json --include-structure input.html

# Extraction-only (no Markdown body)
html-to-markdown --json --no-content input.html

# Aggressive web-page cleanup
html-to-markdown input.html --preprocess --preset aggressive

SDK quick start

Rust

use html_to_markdown_rs::convert;

let result = convert("<h1>Hello World</h1><p>A paragraph.</p>", None)?;
println!("{}", result.content.unwrap_or_default());

Python

from html_to_markdown import convert

result = convert("<h1>Hello World</h1><p>A paragraph.</p>")
print(result.content)   # # Hello World\n\nA paragraph.
print(result.metadata)  # title, links, headers, …

TypeScript / Node.js

import { convert } from "@kreuzberg/html-to-markdown";

// Node's convert() returns a JSON string — always JSON.parse() it.
const result = JSON.parse(convert("<h1>Hello World</h1><p>A paragraph.</p>"));
console.log(result.content);

ConversionResult fields

All languages return the same structure (dict, object, or struct).

Field	Description
`content`	Converted text (Markdown/Djot/plain). `null` only in extraction-only mode.
`metadata`	Title, OG, headers, links, images, structured data.
`tables`	Tables with `grid` (structured cells) and `markdown` fields.
`images`	Extracted inline images (requires inline-image extraction).
`document`	Structured document tree when structure extraction is enabled.
`warnings`	Non-fatal processing warnings (`message`, `kind`).

Configuration

All languages expose the same ~30 options. See references/configuration.md for the complete table. Common ones:

Option	Values	Default
`heading_style`	`atx`, `underlined`, `atx-closed`	`atx`
`code_block_style`	`backticks`, `indented`, `tildes`	`backticks`
`output_format`	`markdown`, `djot`	`markdown`
`wrap` / `wrap_width`	bool / 20–500	off / `80`
`autolinks` (SDK) / `--no-autolinks` (CLI)	bool / flag	`true` (on); disable in CLI with `--no-autolinks`
preprocessing	`minimal` / `standard` / `aggressive`	off

Rust (builder)

use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle, OutputFormat};

let options = ConversionOptions::builder()
    .heading_style(HeadingStyle::Atx)
    .output_format(OutputFormat::Markdown)
    .wrap(true)
    .wrap_width(100)
    .build();
let result = convert(html, Some(options))?;

Python (dataclass)

from html_to_markdown import convert, ConversionOptions, PreprocessingOptions

result = convert(
    html,
    ConversionOptions(heading_style="atx", wrap=True, wrap_width=100),
    PreprocessingOptions(enabled=True, preset="aggressive"),
)

Metadata extraction

Metadata is extracted by default; in the CLI it appears under metadata in --json output. Fields include document (title, description, language, charset, open_graph), headers, links (with link_type), images, and structured_data (JSON-LD/Microdata/RDFa). See the extracting-metadata skill for details.

Table extraction

Tables appear in result.tables, each with a pre-rendered markdown string and a structured cell grid. Markdown tables also appear inline in content. See the extracting-tables skill.

Document structure extraction

Enable structure extraction (--include-structure on the CLI, include_document_structure=true in SDKs) to get a semantic node tree under document. Node types include heading, paragraph, list, list_item, table, image, code, quote, group, metadata_block.

Common pitfalls

convert() returns a result object, not a string. Access .content for the Markdown text.
Node.js convert() returns a JSON string. Always JSON.parse(convert(html)) — NAPI-RS serializes the result for performance.
--json outputs JSON, not Markdown. Omit --json for plain Markdown.
--include-structure, --extract-inline-images, and --no-content require --json.
CLI has no subcommands. Everything is a flag; FILE is positional.
--preset, --keep-navigation, --keep-forms require --preprocess.

Additional resources

CLI Reference — every flag, JSON shape, exit codes
Configuration Reference — all 30+ options with defaults
Rust API Reference — signatures, builder, feature flags
Python API Reference — functions, dataclasses, type hints
TypeScript API Reference — functions, interfaces, Buffer support
Other Bindings — Go, Ruby, PHP, Java, C#, Elixir, R, WASM, C FFI

GitHub: https://github.com/xberg-io/html-to-markdown