plugins

Use when extracting a page needs scripted interaction first — click, type, press a key, scroll, wait, screenshot, or run JS before capturing the DOM. Covers `crawlberg interact <url> --actions` with the real action schema, result shape, limits, and external-CDP options.

crawlberg

Crawl, scrape, and convert websites to Markdown using the local crawlberg CLI and its MCP server. Use when the user wants to fetch a page, follow links across a domain, enumerate URLs, or drive a real browser. Covers installation, the subcommands (scrape, crawl, map, interact, mcp, serve), output formats (JSON + Markdown), browser fallback, and when to prefer the MCP server over shelling out.

crawling-a-site

Use when the user wants to follow links across a domain and capture every reachable page as Markdown. Covers `crawlberg crawl` with depth, page caps, concurrency, rate limiting, domain scoping, robots, and output selection.

headless-fallback

Use when a static fetch returns nothing useful and the page needs a real browser. Covers `--browser-mode auto|always|never`, external CDP via `--browser-endpoint`, symptoms of JS-only pages and WAF blocks, and the performance cost.

mapping-urls

Use when the user wants the list of URLs on a site rather than the page content — sitemap analysis, link planning, or seeding another tool. Covers `crawlberg map <url>` with `--limit`, `--search`, robots, output, and how it differs from a full crawl.

scraping-html-to-markdown

Use when the user wants a single page rendered as clean Markdown plus structured metadata. Covers `crawlberg scrape <url>`, JSON vs Markdown output, what metadata is returned, and how to handle JS-heavy pages.

serving-the-api

Use when the user wants a long-running HTTP service for scrape/crawl/map instead of one-shot CLI calls or the MCP server — for example wiring crawlberg into other apps over REST. Covers `crawlberg serve`, the Firecrawl-v1-compatible endpoints, `--host`/`--port`, and when to prefer it.

fetching-and-converting-urls

Use when fetching a live URL and converting it to Markdown. Covers --url, custom user agents, preprocessing for noisy pages, and the --json ConversionResult shape.

html-to-markdown

Convert HTML to Markdown, Djot, or plain text with structured extraction. Use when writing code that calls html-to-markdown APIs in Rust, Python, TypeScript, Go, Ruby, PHP, Java, C#, Elixir, R, C, or WASM. Covers installation, conversion, configuration, metadata extraction, tables, document structure, inline images, URL fetching, and CLI usage.

xberg-enterprise

Managed Kreuzberg document intelligence at api.xberg.io. Use when the user wants cloud extraction with webhook delivery, presigned uploads for large files, document versioning and diffing, sandbox keys, or per-project usage tracking — instead of running the local kreuzberg CLI. Covers authentication, the 12 REST endpoints, request/response shapes, error model, and SDK options.

offloading-extraction

Use when the user wants to extract a document via the cloud rather than the local kreuzberg CLI. Covers POST /v1/extract — JSON vs multipart bodies, URL crawls, options block, webhook attachment, and the async response shape.

sandbox-keys

Use when the user wants to try Xberg Enterprise without signing up, or needs an ephemeral key for evaluation, demos, or CI integration tests. Covers POST /v1/sandbox/key — the no-auth endpoint, quota, TTL, and cleanup expectations.

tracking-cloud-jobs

Use when an extraction job has been submitted and the result needs to be retrieved. Covers GET /v1/jobs/{id}, polling cadence with exponential backoff, terminal status detection, and webhook delivery (signature verification, retry semantics).
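
The polling cadence mentioned above can be sketched roughly as follows. This is a minimal illustration, not the Xberg client: `fetch_status` is a hypothetical callable standing in for a GET /v1/jobs/{id} request, and the terminal status names, base delay, factor, and cap are all assumptions.

```python
import time

# Hypothetical terminal statuses; the real API's status names may differ.
TERMINAL = frozenset({"completed", "failed"})


def backoff_delays(base=1.0, factor=2.0, cap=30.0):
    """Yield delays that grow geometrically and never exceed the cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def poll_until_terminal(fetch_status, max_attempts=8, sleep=time.sleep):
    """Call fetch_status until it returns a terminal status.

    fetch_status is a placeholder for whatever performs the job lookup;
    sleep is injectable so the loop can be exercised without waiting.
    """
    delays = backoff_delays()
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(next(delays))
    raise TimeoutError("job did not reach a terminal status")
```

Capping the delay keeps long-running jobs from being checked too rarely, and bounding the attempts gives the caller a clear failure instead of an endless loop.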

managing-cloud-usage

Use when the user asks about quota, billing visibility, or processed-page counts. Covers GET /v1/usage: query params, response shape, and when to report usage proactively to the user.

presigned-uploads

Use when the user has files larger than ~50 MB to extract via the cloud, or when base64-encoding the body would be wasteful. Covers the three-step presign / PUT / confirm flow against POST /v1/uploads/presign and POST /v1/uploads/confirm.

versioning-documents

Use when the user wants to retrieve a stored document and its extraction result, list a document's versions, or diff two versions. Covers GET /v1/documents/{id}, GET /v1/documents/{id}/versions, and the sync-with-async-fallback diff at GET /v1/documents/{id}/diff plus its poll endpoint.

kreuzberg

Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.

liter-llm

Universal LLM API client for 143 providers with native bindings for 14 languages. Use when writing code that calls LLM APIs via liter-llm in Python, TypeScript, Rust, Go, Java, C#, Ruby, PHP, Elixir, WASM, or C, when running the OpenAI-compatible proxy, or when calling LLMs through the MCP server. Covers chat, streaming, tool calling, embeddings, image generation, speech, transcription, moderation, web search, OCR, reranking, provider routing, middleware, and configuration.

running-the-proxy

Use when running the `liter-llm api` OpenAI-compatible gateway — virtual keys, per-key rate limits, budgets, cost tracking, and model routing. Covers the TOML config and the 22 REST endpoints.

tree-sitter-language-pack

Parse and extract code intelligence from 306 programming languages using tree-sitter grammars. Use when writing code that parses source, extracts structure/imports/exports/symbols/docstrings/comments, detects a language, runs syntax diagnostics, or produces syntax-aware chunks for LLMs — in Rust, Python, Node.js/TypeScript, or the ts-pack CLI. Covers installation, the CLI surface, the SDK surface, and parser-cache management.

using-the-mcp-server

Use when parsing source, extracting code structure, or detecting a language through the tree-sitter-language-pack MCP server's tools, rather than shelling out to the ts-pack CLI. Covers the tool surface, the auto-installing launcher, and when MCP beats the CLI or SDK.

2026-06-22

converting-html

Use when converting HTML to Markdown, Djot, or plain text. Covers output formats, heading and code-block styles, lists, escaping, wrapping, and HTML preprocessing.

extracting-metadata

Use when extracting metadata from HTML — title, description, language, Open Graph, JSON-LD / Microdata / RDFa, headers, links, and images. Covers the --json output shape and the --extract-metadata flag.

extracting-tables

Use when extracting tabular data from HTML. Covers GFM Markdown tables, the structured tables array (grid cells plus pre-rendered markdown), and <br> handling in table cells.

using-the-mcp-server

Use when converting HTML to Markdown or extracting metadata and tables through the html-to-markdown MCP server's tools, rather than shelling out to the CLI. Covers the tool surface, the auto-installing launcher, and when MCP beats the CLI or SDK.

batch-extraction

Use when extracting from many files at once with shared config, bounded parallelism, per-file overrides, and error recovery. Covers the `batch` command, `--file-configs`, `--max-concurrent`, and output layout.

chunking

Use when splitting extracted text into chunks for LLM context windows or RAG ingestion. Covers chunk size, overlap, markdown/yaml/semantic chunkers, tokenizer-based sizing, and the standalone `chunk` command.

extracting-keywords

Use when extracting keywords (YAKE/RAKE) from documents — and, secondarily, when detecting document language or generating embeddings for RAG and search. Covers the keyword config (and its feature gating), `--detect-language`, and the standalone `embed` command with real flags.

calling-llms

Use when sending chat completions through liter-llm and routing to a specific provider via the `provider/model` prefix. Covers the chat call shape, provider routing, model_hint, message roles, and error categories.
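
As a rough illustration of the `provider/model` prefix convention, the sketch below splits a model string on its first slash. The helper `split_model` and the example model string are hypothetical; this is not liter-llm's own routing code, only a way to picture how such a prefix can be read.

```python
def split_model(model: str) -> tuple[str, str]:
    """Split 'provider/model' on the first slash only.

    Hypothetical helper for illustration. Anything after the first
    slash is kept intact, since model names may contain slashes.
    """
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


# Example with a made-up model string.
provider, name = split_model("openai/gpt-4o")
```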

embeddings-and-search

Use when generating embeddings, calling the 12 web-search providers, or running OCR over documents with the 4 OCR providers through liter-llm. Covers embed, search, and ocr methods plus reranking.

streaming-responses

Use when streaming tokens incrementally from an LLM via liter-llm over SSE or async iterators. Covers chat_stream, delta handling, and null-content chunks.

tool-calling

Use when defining functions/tools for an LLM to call through liter-llm, or requesting structured JSON outputs. Covers tool schemas, tool_calls handling, and response formats.

using-the-mcp-server

Use when calling LLM APIs through the liter-llm MCP server's 22 tools, or when deciding whether MCP beats the CLI or SDK. Covers the tool surface, the auto-installing launcher, and authentication.

chunking-for-llms

Use when the user wants to split source code into chunks for an LLM context window without breaking syntax mid-construct. Covers `ts-pack process --chunk-size`, why syntax-aware splits beat fixed-byte splits, picking a size, and the chunk JSON shape.

detecting-languages

Use when the user wants to know which programming language a file or snippet is. Covers implicit detection in `ts-pack parse`/`process`, confirming support with `ts-pack list`/`info`, and the SDK detection functions for path, extension, and raw content.

extracting-code-structure

Use when the user wants structured code metadata from a source file — functions, classes, imports, exports, symbols, docstrings, comments, or syntax diagnostics. Covers `ts-pack process` feature flags, the JSON result shape, and the default feature set.

managing-parsers

Use when the user needs to manage the tree-sitter parser cache — prefetch parsers for offline or CI runs, list what is downloaded, inspect a language, find the cache directory, or clean it. Covers `ts-pack download`, `list`, `info`, `cache-dir`, `clean`, and `init`.

parsing-source

Use when the user wants a tree-sitter syntax tree for a source file — an s-expression dump or JSON tree. Covers `ts-pack parse`, language auto-detection vs `--language`, stdin input, and reading `has_errors`.

extracting-tables

Use when extracting tabular data from PDFs, spreadsheets, or images. Covers layout-aware table detection, table model selection, output formats (markdown / JSON cells), and known limits.

2026-06-08

extracting-with-ocr