chunking

Stars: 5

Forks: 0

Last updated: 20 June 2026 at 10:09

Use when splitting extracted text into chunks for LLM context windows or RAG ingestion. Covers chunk size, overlap, markdown/yaml/semantic chunkers, tokenizer-based sizing, and the standalone `chunk` command.

Source

xberg-io

xberg-io/plugins

SKILL.md

name: chunking
description: Use when splitting extracted text into chunks for LLM context windows or RAG ingestion. Covers chunk size, overlap, markdown/yaml/semantic chunkers, tokenizer-based sizing, and the standalone `chunk` command.

Chunking

Use this when feeding documents into an LLM context window or a vector store. Kreuzberg chunks two ways: inline during extraction (chunks land on result.chunks), or standalone via the chunk command for text you already have. Sizing is character-based by default, or token-based when a tokenizer model is supplied.

Inline during extraction

Pass --chunk to turn on chunking; the chunks then appear in the structured result under the chunks key:

# 1000-char chunks, 200-char overlap (defaults when --chunk is on)
kreuzberg extract report.pdf --chunk --format json | jq '.chunks | length'

# Explicit size + overlap
kreuzberg extract report.pdf --chunk --chunk-size 1500 --chunk-overlap 300 --format json

Overlap must be smaller than chunk size: the CLI rejects --chunk-overlap >= --chunk-size. If you override only --chunk-overlap while the chunk size comes from an existing config, an overlap that reaches or exceeds the configured size is clamped to chunk_size / 4.
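The two overlap rules can be sketched in Python. This is a minimal illustration of the behaviour as described in this section, not Kreuzberg's own code: the helper name `resolve_overlap`, the `size_was_given` flag, and the use of integer division for `chunk_size / 4` are all assumptions made for the example.

```python
def resolve_overlap(chunk_size: int, overlap: int, size_was_given: bool) -> int:
    """Return the overlap to use, following the rules described above.

    Hypothetical helper: name, signature, and integer rounding are
    assumptions for illustration, not part of the Kreuzberg CLI.
    """
    if overlap < chunk_size:
        return overlap  # already valid, use as-is
    if size_was_given:
        # Both flags were passed explicitly: described as a hard error.
        raise ValueError(
            f"--chunk-overlap ({overlap}) must be smaller than "
            f"--chunk-size ({chunk_size})"
        )
    # Only the overlap was overridden against an existing config: clamp it.
    return chunk_size // 4


print(resolve_overlap(1000, 200, size_was_given=True))    # 200
print(resolve_overlap(1000, 1500, size_was_given=False))  # 250
```

Checking the pair yourself before invoking the CLI turns a rejected command or a silently clamped overlap into an explicit decision.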

Standalone `chunk` command

Chunk text you already have, from --text or stdin. Output defaults to JSON:

# From a flag
kreuzberg chunk --text "long document text ..." --chunk-size 800 --chunk-overlap 100

# From stdin (pipe extracted content straight in)
kreuzberg extract notes.md | kreuzberg chunk --chunk-size 500 --format json

JSON output carries chunks (array of strings), chunk_count, the resolved config (max_characters, overlap, chunker_type), and input_size_bytes. Use --format text for a human-readable dump with --- chunk N --- separators.

Note: in the JSON output, chunker_type is rendered capitalized ("Text", "Markdown", "Yaml", "Semantic") because it is emitted via Rust's Debug formatting, whereas the --chunker-type input flag is lowercase (text, markdown, yaml, semantic). Lowercase the value before comparing if you parse it back.
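A short Python sketch of reading that JSON back, including the lowercase normalization of chunker_type. The field names come from the description above, but the nesting of the resolved config under a `config` key is an assumption, and the sample payload is hand-written for illustration rather than captured from the CLI.

```python
import json


def summarize_chunk_output(raw: str) -> dict:
    """Summarize the JSON emitted by the standalone chunk command.

    Assumes the resolved config sits under a "config" key; adjust if the
    real output nests it differently.
    """
    data = json.loads(raw)
    config = data.get("config", {})
    return {
        "chunk_count": data["chunk_count"],
        # Output is capitalized ("Markdown"); the input flag is lowercase.
        "chunker_type": str(config.get("chunker_type", "")).lower(),
        "longest_chunk": max((len(c) for c in data["chunks"]), default=0),
    }


# Hand-written sample payload (hypothetical values).
sample = json.dumps({
    "chunks": ["# Intro", "Body text"],
    "chunk_count": 2,
    "config": {"max_characters": 800, "overlap": 100, "chunker_type": "Markdown"},
    "input_size_bytes": 17,
})
print(summarize_chunk_output(sample))
# {'chunk_count': 2, 'chunker_type': 'markdown', 'longest_chunk': 9}
```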

Chunker types

--chunker-type selects the splitting strategy (standalone chunk command):

| Type | Behavior |
| --- | --- |
| `text` | Default. Plain character-window splitting with overlap. |
| `markdown` | Markdown-aware: splits on structure (headings, blocks) where possible. |
| `yaml` | YAML-aware splitting for structured config/data documents. |
| `semantic` | Topic-boundary splitting driven by `--topic-threshold` (0.0–1.0, default 0.75). |

# Markdown-aware chunking keeps headings and blocks intact
kreuzberg chunk --text "$(cat README.md)" --chunker-type markdown

# Semantic chunking — lower threshold = more, smaller topic chunks
kreuzberg chunk --text "$(cat transcript.txt)" --chunker-type semantic --topic-threshold 0.6
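To make "character-window splitting with overlap" concrete, here is a minimal Python sketch of the idea behind the `text` chunker. It is a conceptual illustration only: the function `window_chunks` is hypothetical, and Kreuzberg's real chunker may choose boundaries differently (for example at word or sentence breaks).

```python
def window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` chars.

    Hypothetical illustration, not Kreuzberg's implementation.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(window_chunks("abcdefghij", size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each window starts `size - overlap` characters after the previous one, which is why an overlap equal to or larger than the size cannot work: the window would never advance.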

Token-based sizing

By default --chunk-size counts characters. To size chunks by tokens for a specific model, pass --chunking-tokenizer with a Hugging Face tokenizer id. On the extract command this implicitly enables chunking. This requires the chunking-tokenizers build feature, which the default CLI build includes.

# Size chunks by GPT-4o tokens during extraction
kreuzberg extract report.pdf --chunking-tokenizer Xenova/gpt-4o --format json

# Or on the standalone command
kreuzberg chunk --text "$(cat doc.txt)" --chunking-tokenizer Xenova/gpt-4o --chunk-size 512

With a tokenizer set, --chunk-size is interpreted in tokens, not characters.
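The change of unit can be illustrated with a small Python sketch. It uses whitespace splitting as a stand-in tokenizer, which is an assumption for the sake of a runnable example: a real Hugging Face tokenizer produces subword tokens and will count differently. The function name `chunk_by_tokens` is hypothetical.

```python
from typing import Callable


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    overlap_tokens: int,
    tokenize: Callable[[str], list] = str.split,  # stand-in tokenizer
) -> list[str]:
    """Group tokens into chunks of at most `max_tokens`, sharing `overlap_tokens`.

    Hypothetical illustration of token-based sizing, not Kreuzberg's code.
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    tokens = tokenize(text)
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already covers the tail
    return chunks


text = "one two three four five six seven"
print(chunk_by_tokens(text, max_tokens=3, overlap_tokens=1))
# ['one two three', 'three four five', 'five six seven']
```

Every chunk holds exactly three tokens, yet the chunks are 13, 15, and 14 characters long, which is why a character budget and a token budget are not interchangeable.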

Config file alternative

Field names in config files are snake_case under [chunking]:

[chunking]
max_characters = 1000
overlap = 200
chunker_type = "markdown"

kreuzberg extract report.pdf --config kreuzberg.toml --format json

CLI flags map to config fields as follows: --chunk-size → max_characters and --chunk-overlap → overlap.
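The flag-to-field mapping can be sketched as a small Python helper that renders a [chunking] section from CLI-style values. The helper `render_chunking_toml` is hypothetical; only the three field names shown in the config example above are used.

```python
from typing import Optional


def render_chunking_toml(
    chunk_size: int,
    chunk_overlap: int,
    chunker_type: Optional[str] = None,
) -> str:
    """Render a [chunking] TOML section from CLI-style values.

    Hypothetical helper; the mapping follows this section:
    --chunk-size -> max_characters, --chunk-overlap -> overlap.
    """
    lines = [
        "[chunking]",
        f"max_characters = {chunk_size}",
        f"overlap = {chunk_overlap}",
    ]
    if chunker_type is not None:
        lines.append(f'chunker_type = "{chunker_type}"')
    return "\n".join(lines) + "\n"


print(render_chunking_toml(1000, 200, "markdown"), end="")
```

The printed text matches the config example above, so the output can be written to kreuzberg.toml and passed with --config.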

Programmatic access

From Python, enable chunking on the config and read result.chunks:

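from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig

# Enable chunking on the extraction config (sizes are in characters).
config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
)
result = extract_file_sync("report.pdf", config=config)

# Print the length of each chunk produced during extraction.
for chunk in result.chunks:
    print(len(chunk))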

Python ChunkingConfig uses max_chars / max_overlap. Rust uses max_characters / overlap. See references/python-api.md and references/rust-api.md in the sibling kreuzberg skill.

Picking parameters

- RAG / vector store: 500–1000 chars (or 256–512 tokens) with 10–20% overlap. Use markdown chunking for docs to keep sections whole.
- LLM summarization: larger chunks (1500–4000 chars) with small overlap; size by tokens to stay under the model window.
- Topic segmentation: semantic chunker; tune --topic-threshold down for finer splits, up for coarser ones.
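The percentage guidance for RAG translates into concrete overlap values with simple integer arithmetic. A minimal sketch, where the helper name `overlap_for` and the choice of integer percentages are assumptions for illustration:

```python
def overlap_for(chunk_size: int, percent: int) -> int:
    """Return the overlap for a chunk size and a whole-number percentage.

    Hypothetical helper; rounds down so the result is always an integer.
    """
    if not 0 <= percent < 100:
        raise ValueError("percent must be between 0 and 99")
    return chunk_size * percent // 100


# The 10-20% band at both ends of the 500-1000 character range.
for size in (500, 1000):
    print(size, overlap_for(size, 10), overlap_for(size, 20))
# 500 50 100
# 1000 100 200
```

So a 1000-character chunk with 20% overlap corresponds to the 1000/200 defaults mentioned earlier for --chunk.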

Common pitfalls

- Overlap ≥ size: rejected on extract; clamped to size / 4 when only overlap is changed against an existing config.
- Tokenizer without the feature: --chunking-tokenizer errors if the CLI was built without chunking-tokenizers. The default build includes it.
- Empty input: the standalone chunk command bails on empty text; provide --text or pipe non-empty stdin.

See references/configuration.md for the full [chunking] schema and references/cli-reference.md for every chunk flag.
