| name | chunking |
| description | Use when splitting extracted text into chunks for LLM context windows or RAG ingestion. Covers chunk size, overlap, markdown/yaml/semantic chunkers, tokenizer-based sizing, and the standalone `chunk` command. |
Chunking
Use this when feeding documents into an LLM context window or a vector
store. Kreuzberg chunks two ways: inline during extraction (chunks land on
result.chunks), or standalone via the chunk command for text you
already have. Sizing is character-based by default, or token-based when a
tokenizer model is supplied.
Inline during extraction
Turn on chunking with --chunk and the chunks appear on the structured
result under chunks:
kreuzberg extract report.pdf --chunk --format json | jq '.chunks | length'
kreuzberg extract report.pdf --chunk --chunk-size 1500 --chunk-overlap 300 --format json
Overlap must be smaller than chunk size — the CLI rejects
--chunk-overlap >= --chunk-size. When you set only --chunk-overlap
against an existing config, an overlap that exceeds the size is clamped to
chunk_size / 4.
Standalone chunk command
Chunk text you already have, from --text or stdin. Output defaults to
JSON:
kreuzberg chunk --text "long document text ..." --chunk-size 800 --chunk-overlap 100
kreuzberg extract notes.md | kreuzberg chunk --chunk-size 500 --format json
JSON output carries chunks (array of strings), chunk_count, the
resolved config (max_characters, overlap, chunker_type), and
input_size_bytes. Use --format text for a human-readable dump with
--- chunk N --- separators.
Note: in the JSON output, chunker_type is rendered capitalized ("Text",
"Markdown", "Yaml", "Semantic") because it is emitted via Rust's Debug
formatting, whereas the --chunker-type input flag is lowercase
(text, markdown, yaml, semantic). Lowercase the value before
comparing if you parse it back.
Chunker types
--chunker-type selects the splitting strategy (standalone chunk
command):
| Type | Behavior |
|---|
text | Default. Plain character-window splitting with overlap. |
markdown | Markdown-aware — splits on structure (headings, blocks) where possible. |
yaml | YAML-aware splitting for structured config/data documents. |
semantic | Topic-boundary splitting driven by --topic-threshold (0.0–1.0, default 0.75). |
kreuzberg chunk --text "$(cat README.md)" --chunker-type markdown
kreuzberg chunk --text "$(cat transcript.txt)" --chunker-type semantic --topic-threshold 0.6
Token-based sizing
By default --chunk-size counts characters. To size chunks by tokens for
a specific model, pass --chunking-tokenizer with a HuggingFace tokenizer
id. On the extract command this implicitly enables chunking. Requires the
chunking-tokenizers feature (present in the default CLI build).
kreuzberg extract report.pdf --chunking-tokenizer Xenova/gpt-4o --format json
kreuzberg chunk --text "$(cat doc.txt)" --chunking-tokenizer Xenova/gpt-4o --chunk-size 512
With a tokenizer set, --chunk-size is interpreted in tokens, not
characters.
Config file alternative
Field names in config files are snake_case under [chunking]:
[chunking]
max_characters = 1000
overlap = 200
chunker_type = "markdown"
kreuzberg extract report.pdf --config kreuzberg.toml --format json
CLI flags map to config fields as --chunk-size → max_characters and
--chunk-overlap → overlap. In config files use the snake_case names.
Programmatic access
From Python, enable chunking on the config and read result.chunks:
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig
config = ExtractionConfig(
chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
)
result = extract_file_sync("report.pdf", config=config)
for chunk in result.chunks:
print(len(chunk))
Python ChunkingConfig uses max_chars / max_overlap. Rust uses
max_characters / overlap. See references/python-api.md and
references/rust-api.md in the sibling kreuzberg skill.
Picking parameters
- RAG / vector store — 500–1000 chars (or 256–512 tokens) with
10–20% overlap. Use
markdown chunking for docs to keep sections whole.
- LLM summarization — larger chunks (1500–4000 chars) with small
overlap; size by tokens to stay under the model window.
- Topic segmentation —
semantic chunker; tune --topic-threshold
down for finer splits, up for coarser ones.
Common pitfalls
- Overlap ≥ size — rejected on
extract; clamped to size / 4 when
only overlap is changed against an existing config.
- Tokenizer without the feature —
--chunking-tokenizer errors if the
CLI was built without chunking-tokenizers. The default build includes it.
- Empty input — the standalone
chunk command bails on empty text;
provide --text or pipe non-empty stdin.
See references/configuration.md for the full [chunking] schema and
references/cli-reference.md for every chunk flag.