| name | html-to-markdown |
| description | Convert HTML to Markdown, Djot, or plain text with structured extraction. Use when writing code that calls html-to-markdown APIs in Rust, Python, TypeScript, Go, Ruby, PHP, Java, C#, Elixir, R, C, or WASM. Covers installation, conversion, configuration, metadata extraction, tables, document structure, inline images, URL fetching, and CLI usage. |
| license | MIT |
| metadata | {"author":"kreuzberg-dev","version":"0.1.0","repository":"https://github.com/xberg-io/html-to-markdown"} |
html-to-markdown
html-to-markdown is a high-performance HTML→Markdown converter with a Rust core and 12 native language bindings. It converts HTML to CommonMark Markdown, Djot, or plain text in a single pass, optionally extracting metadata, tables, inline images, and a structured document tree.
Use this skill when writing code that:
- Converts HTML strings, files, or live URLs to Markdown, Djot, or plain text
- Extracts metadata (title, OG tags, JSON-LD/Microdata/RDFa, headers, links, images, language) from HTML
- Extracts structured table data (GFM markdown + cell grids) from HTML
- Extracts a structured document-structure tree
- Extracts inline images (data URIs, SVGs) from HTML
- Uses preprocessing to clean noisy HTML (ads, navigation, forms) before conversion
Capability map
| Capability | CLI | SDKs |
|---|
| HTML→Markdown / Djot / plain text | html-to-markdown FILE | convert(html, options) |
| Read HTML from stdin / file / URL | cat f | …, FILE, --url URL | convert(htmlString, …) |
| ~30 config options (headings, code blocks, lists, escaping, wrapping…) | flags | ConversionOptions |
| Metadata extraction | --json (default-extracted) | result.metadata |
| Table extraction | --json → tables[] | result.tables |
| Document structure tree | --json --include-structure | include_document_structure=true |
| Inline image extraction | --json --extract-inline-images | extract_images=true |
| HTML preprocessing | --preprocess [--preset …] | PreprocessingOptions |
| Extraction-only (no Markdown body) | --json --no-content | options + read fields |
Installation
CLI
brew trust xberg-io/tap
brew install xberg-io/tap/html-to-markdown
npx @kreuzberg/html-to-markdown-cli --help
uvx --from html-to-markdown-cli html-to-markdown --help
cargo install --git https://github.com/xberg-io/html-to-markdown html-to-markdown-cli
Language SDKs
pip install html-to-markdown
npm install @kreuzberg/html-to-markdown
cargo add html-to-markdown-rs
gem install html-to-markdown
composer require xberg-io/html-to-markdown
go get github.com/xberg-io/html-to-markdown/packages/go/v3
dotnet add package KreuzbergDev.HtmlToMarkdown
npm install @kreuzberg/html-to-markdown-wasm
- Java (Maven):
dev.kreuzberg:html-to-markdown
- Elixir:
{:html_to_markdown, "~> 3.6"} in mix.exs
- R:
install.packages("htmltomarkdown", repos = "https://xberg-io.r-universe.dev")
- C (FFI): pre-built
.so / .dll / .dylib from GitHub releases
CLI vs SDK — which to use
- CLI — one-shot conversions, shell pipelines, fetching a single URL, ad-hoc metadata/table extraction via
--json | jq. Flags only, no subcommands. FILE is positional; omit or use - for stdin.
- SDK — embedding conversion in application code, batch processing, custom element conversion (visitor pattern, Rust), and tight loops where process spawn overhead matters.
- MCP server —
html-to-markdown mcp exposes convert/extract as agent tools, so an MCP client can convert HTML directly with no shell-out. This plugin auto-registers it; see the using-the-mcp-server skill.
Both share the same ConversionResult shape, so output is interchangeable.
When to use html-to-markdown vs kreuzberg vs crawlberg
- html-to-markdown — you already have HTML (a string, a file, or a single URL) and want clean Markdown plus structured metadata/tables. No OCR, no document parsing, no crawling.
- kreuzberg — you have documents (PDF, Office, images, email, archives) and need full text/table/metadata extraction with optional OCR. Use it when the input is not already HTML.
- crawlberg — you need to crawl or scrape many pages, follow links, and handle JS-rendered sites with a headless-Chrome fallback. It uses html-to-markdown internally for the HTML→Markdown step.
Rule of thumb: single HTML in → Markdown out = html-to-markdown. Many URLs / a site = crawlberg. Non-HTML documents = kreuzberg.
CLI quick start
html-to-markdown input.html
html-to-markdown input.html -o output.md
cat page.html | html-to-markdown
html-to-markdown --url https://example.com > out.md
html-to-markdown --json input.html
html-to-markdown --json --include-structure input.html
html-to-markdown --json --no-content input.html
html-to-markdown input.html --preprocess --preset aggressive
SDK quick start
Rust
use html_to_markdown_rs::convert;
let result = convert("<h1>Hello World</h1><p>A paragraph.</p>", None)?;
println!("{}", result.content.unwrap_or_default());
Python
from html_to_markdown import convert
result = convert("<h1>Hello World</h1><p>A paragraph.</p>")
print(result.content)
print(result.metadata)
TypeScript / Node.js
import { convert } from "@kreuzberg/html-to-markdown";
const result = JSON.parse(convert("<h1>Hello World</h1><p>A paragraph.</p>"));
console.log(result.content);
ConversionResult fields
All languages return the same structure (dict, object, or struct).
| Field | Description |
|---|
content | Converted text (Markdown/Djot/plain). null only in extraction-only mode. |
metadata | Title, OG, headers, links, images, structured data. |
tables | Tables with grid (structured cells) and markdown fields. |
images | Extracted inline images (requires inline-image extraction). |
document | Structured document tree when structure extraction is enabled. |
warnings | Non-fatal processing warnings (message, kind). |
Configuration
All languages expose the same ~30 options. See references/configuration.md for the complete table. Common ones:
| Option | Values | Default |
|---|
heading_style | atx, underlined, atx-closed | atx |
code_block_style | backticks, indented, tildes | backticks |
output_format | markdown, djot | markdown |
wrap / wrap_width | bool / 20–500 | off / 80 |
autolinks (SDK) / --no-autolinks (CLI) | bool / flag | true (on); disable in CLI with --no-autolinks |
| preprocessing | minimal / standard / aggressive | off |
Rust (builder)
use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle, OutputFormat};
let options = ConversionOptions::builder()
.heading_style(HeadingStyle::Atx)
.output_format(OutputFormat::Markdown)
.wrap(true)
.wrap_width(100)
.build();
let result = convert(html, Some(options))?;
Python (dataclass)
from html_to_markdown import convert, ConversionOptions, PreprocessingOptions
result = convert(
html,
ConversionOptions(heading_style="atx", wrap=True, wrap_width=100),
PreprocessingOptions(enabled=True, preset="aggressive"),
)
Metadata extraction
Metadata is extracted by default; in the CLI it appears under metadata in --json output. Fields include document (title, description, language, charset, open_graph), headers, links (with link_type), images, and structured_data (JSON-LD/Microdata/RDFa). See the extracting-metadata skill for details.
Table extraction
Tables appear in result.tables, each with a pre-rendered markdown string and a structured cell grid. Markdown tables also appear inline in content. See the extracting-tables skill.
Document structure extraction
Enable structure extraction (--include-structure on the CLI, include_document_structure=true in SDKs) to get a semantic node tree under document. Node types include heading, paragraph, list, list_item, table, image, code, quote, group, metadata_block.
Common pitfalls
convert() returns a result object, not a string. Access .content for the Markdown text.
- Node.js
convert() returns a JSON string. Always JSON.parse(convert(html)) — NAPI-RS serializes the result for performance.
--json outputs JSON, not Markdown. Omit --json for plain Markdown.
--include-structure, --extract-inline-images, and --no-content require --json.
- CLI has no subcommands. Everything is a flag;
FILE is positional.
--preset, --keep-navigation, --keep-forms require --preprocess.
Additional resources
- CLI Reference — every flag, JSON shape, exit codes
- Configuration Reference — all 30+ options with defaults
- Rust API Reference — signatures, builder, feature flags
- Python API Reference — functions, dataclasses, type hints
- TypeScript API Reference — functions, interfaces, Buffer support
- Other Bindings — Go, Ruby, PHP, Java, C#, Elixir, R, WASM, C FFI
GitHub: https://github.com/xberg-io/html-to-markdown