batch-extraction

Stars: 5

Forks: 0

Updated: June 20, 2026 at 10:09

Use when extracting from many files at once with shared config, bounded parallelism, per-file overrides, and error recovery. Covers the `batch` command, `--file-configs`, `--max-concurrent`, and output layout.

Source

xberg-io

xberg-io/plugins

SKILL.md

name	batch-extraction
description	Use when extracting from many files at once with shared config, bounded parallelism, per-file overrides, and error recovery. Covers the `batch` command, `--file-configs`, `--max-concurrent`, and output layout.

Batch extraction

Use this when processing a directory or glob of documents in one pass. kreuzberg batch shares one extraction config across every file, runs extractions concurrently, and returns one structured array — failures on individual files do not abort the run.

Basic usage

# Glob expands to many paths; results come back as a JSON array (default)
kreuzberg batch *.pdf

# Mixed formats, markdown content for LLM ingestion
kreuzberg batch docs/*.docx --content-format markdown

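# Recurse with find; -exec ... {} + passes each path as its own argument,
# so names containing spaces are not word-split as they would be with $(find ...)
find ./corpus -name '*.pdf' -exec kreuzberg batch {} +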

batch defaults to --format json (vs --format text for single extract). Each array entry is a full extraction result, so downstream code can index by position into the input path list.

kreuzberg batch reports/*.pdf \
  | jq '.[] | {chars: (.content | length), mime: .mime_type}'
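The positional correspondence described above can be sketched in Python. This is a minimal illustration, assuming the JSON array preserves input order and that each entry carries the `content` and `mime_type` fields used in the jq example. The helper `pair_results` and the sample payload are hypothetical and not part of kreuzberg.

```python
import json


def pair_results(paths, batch_json):
    """Pair each input path with the result at the same array position."""
    results = json.loads(batch_json)
    if len(results) != len(paths):
        raise ValueError(f"expected {len(paths)} results, got {len(results)}")
    return [
        {"path": p, "chars": len(r.get("content") or ""), "mime": r.get("mime_type")}
        for p, r in zip(paths, results)
    ]


# Hypothetical payload standing in for the output of `kreuzberg batch a.pdf b.txt`
sample = json.dumps([
    {"content": "hello", "mime_type": "application/pdf"},
    {"content": "", "mime_type": "text/plain"},
])
paired = pair_results(["a.pdf", "b.txt"], sample)
```

The length check is a design choice: if the array and the path list ever disagree in size, positional indexing would silently attribute results to the wrong files, so failing early is safer.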

Parallelism

--max-concurrent caps how many files extract at once (default: CPU count). Lower it on memory-constrained hosts or when OCR/ML models are active, since each in-flight extraction holds its own buffers:

# Cap at 4 concurrent extractions
kreuzberg batch scans/*.pdf --ocr true --max-concurrent 4

--max-threads additionally caps total internal threads (Rayon, ONNX intra-op, the batch semaphore) for tightly constrained environments:

kreuzberg batch *.pdf --max-concurrent 2 --max-threads 4
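One way to pick a value for --max-concurrent is to bound it by both CPU count and a memory budget. The sketch below is only an illustration: `pick_max_concurrent`, `batch_command`, and the per-file memory figure are hypothetical, and the per-file estimate is something you would have to measure for your own documents and models.

```python
import os


def pick_max_concurrent(memory_budget_mb, per_file_mb, cpu_count=None):
    """Return a --max-concurrent value bounded by CPU count and memory."""
    cpus = cpu_count or os.cpu_count() or 1
    # ASSUMPTION: per_file_mb is a measured peak per in-flight extraction
    by_memory = max(1, memory_budget_mb // per_file_mb)
    return max(1, min(cpus, by_memory))


def batch_command(paths, max_concurrent):
    """Build the argv list for a bounded batch run."""
    return ["kreuzberg", "batch", *paths, "--max-concurrent", str(max_concurrent)]


limit = pick_max_concurrent(memory_budget_mb=4096, per_file_mb=1024, cpu_count=16)
argv = batch_command(["scans/a.pdf", "scans/b.pdf"], limit)
```

The resulting list can be handed to `subprocess.run(argv)` without going through a shell.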

Per-file config overrides

A single shared config does not always fit. --file-configs points at a JSON file mapping each path to its own override object, merged on top of the shared config for that file only:

{
  "scan.pdf": { "force_ocr": true },
  "report.pdf": { "output_format": "markdown" },
  "data.xlsx": { "output_format": "json" }
}

kreuzberg batch scan.pdf report.pdf data.xlsx --file-configs overrides.json

Keys are file paths (matching the paths passed on the command line); values are per-file extraction config objects in snake_case, the same shape as a config file.
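For larger batches the overrides file can be generated rather than written by hand. The following is a minimal sketch: the selection rules (a `scans/` prefix, the `.xlsx` suffix) and the helper `build_overrides` are hypothetical, while the `force_ocr` and `output_format` keys are taken from the example above.

```python
import json
from pathlib import PurePosixPath


def build_overrides(paths):
    """Map each path, exactly as passed, to an override object (or skip it)."""
    overrides = {}
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        if path.startswith("scans/"):  # hypothetical rule: scanned input
            overrides[path] = {"force_ocr": True}
        elif suffix == ".xlsx":  # hypothetical rule: spreadsheets
            overrides[path] = {"output_format": "json"}
    return overrides


paths = ["scans/old.pdf", "report.pdf", "data.xlsx"]
overrides_json = json.dumps(build_overrides(paths), indent=2)
```

Writing `overrides_json` to a file and passing the same `paths` on the command line keeps the keys and the arguments identical by construction.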

Output layout

For text/toon output with image extraction, --output-dir controls where referenced image files (e.g. image_0.png) are written; the directory must already exist. JSON output embeds image bytes inline and ignores --output-dir.

mkdir -p out/images
kreuzberg batch slides/*.pptx --extract-images true --output-dir out/images --format text
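When the batch is launched from a script, the directory can be created just before the command is built. This is a sketch under the assumption that the flags are exactly those shown above; `image_batch_command` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def image_batch_command(paths, output_dir):
    """Create output_dir if needed, then build the argv for a text-format run."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    return [
        "kreuzberg", "batch", *paths,
        "--extract-images", "true",
        "--output-dir", str(out),
        "--format", "text",
    ]


# Demonstrate against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out" / "images"
    argv = image_batch_command(["slides/deck.pptx"], target)
    created = target.is_dir()
```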

Error recovery

Batch extraction is fault-tolerant per file: one unreadable or corrupt document does not stop the rest. Inspect results for partial content and surfaced errors rather than relying on the process exit code alone. Pair with --max-concurrent to avoid exhausting memory when a few large files sit in a big batch.
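Inspecting results per file might look like the sketch below. The text above does not say how a failed file appears in the result array, so the `error` key used here is purely an assumption. `partition_results` is a hypothetical helper, and `is_failure` should be replaced with a check that matches the real result shape.

```python
def partition_results(paths, results, is_failure=None):
    """Split (path, result) pairs into successes and failures."""
    # ASSUMPTION: a failed entry carries a non-empty "error" field. This key
    # is hypothetical; pass is_failure to match the actual output.
    if is_failure is None:
        is_failure = lambda r: bool(r.get("error"))
    ok, failed = [], []
    for path, result in zip(paths, results):
        (failed if is_failure(result) else ok).append((path, result))
    return ok, failed


results = [
    {"content": "text"},
    {"content": "", "error": "corrupt file"},  # hypothetical failure entry
]
ok, failed = partition_results(["a.pdf", "b.pdf"], results)
retry_paths = [path for path, _ in failed]
```

Collecting `retry_paths` makes it straightforward to rerun only the failed files, for example with a lower --max-concurrent.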

Shared config

Every extract flag also applies to batch (OCR, chunking, layout, content format, etc.) and is shared across all files unless a --file-configs entry overrides it:

kreuzberg batch invoices/*.pdf \
  --layout --layout-table-model slanet_wireless \
  --content-format markdown --max-concurrent 8

A config file works too. It is discovered automatically by searching from the current working directory upward, or it can be passed explicitly with --config:

output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

kreuzberg batch corpus/*.pdf --config kreuzberg.toml
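The upward discovery mentioned above follows a common pattern that can be sketched as a walk through parent directories. This illustrates the general technique only and is not kreuzberg's implementation: the function `discover_config` is hypothetical, and the file names and search order the CLI actually uses may differ.

```python
import tempfile
from pathlib import Path


def discover_config(start, filename="kreuzberg.toml"):
    """Return the nearest config file at or above start, or None."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


# Demonstrate with a throwaway tree: config at the root, search from a subfolder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "kreuzberg.toml").write_text('output_format = "markdown"\n')
    nested = root / "corpus" / "2024"
    nested.mkdir(parents=True)
    found = discover_config(nested)
    found_name = found.name
    found_in_root = found.parent == root
```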

Programmatic access

From Python, use the batch helpers (async and sync):

import asyncio

from kreuzberg import ExtractionConfig, batch_extract_files, batch_extract_files_sync

config = ExtractionConfig(output_format="markdown")

# Sync: blocks until every file has been processed
results = batch_extract_files_sync(["a.pdf", "b.docx"], config=config)


# Async: await must run inside a coroutine, driven here by asyncio.run
async def main():
    return await batch_extract_files(["a.pdf", "b.docx", "c.xlsx"], config=config)


results = asyncio.run(main())

for result in results:
    print(len(result.content))

Node.js mirrors this with batchExtractFiles; Rust uses batch_extract_file (requires the tokio-runtime feature). See references/python-api.md, references/nodejs-api.md, and references/rust-api.md in the sibling kreuzberg skill.

MCP

When the kreuzberg MCP server is registered, prefer the batch_extract_files tool over shelling out — it takes the file list and a config object and returns structured results directly.

Common pitfalls

Default format differs — batch defaults to --format json, extract to --format text. Set --format explicitly if a script depends on one shape.
--output-dir must exist — the CLI does not create it.
Memory blowups — large batches with OCR/layout active need a lower --max-concurrent; the default is CPU count.
--file-configs path keys — must match the paths as passed on the command line, not absolute-resolved variants.
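The path-key pitfall can be caught before the batch runs by comparing the override keys with the argument list. This is a minimal sketch with a hypothetical helper, `unmatched_override_keys`. What the CLI does with a key that matches no input is not stated here, so the check only reports such keys.

```python
def unmatched_override_keys(paths, overrides):
    """Return override keys that do not exactly match any path as passed."""
    passed = set(paths)
    return sorted(key for key in overrides if key not in passed)


paths = ["docs/scan.pdf", "report.pdf"]
overrides = {
    "docs/scan.pdf": {"force_ocr": True},
    "/home/user/report.pdf": {"output_format": "markdown"},  # absolute variant
}
stale = unmatched_override_keys(paths, overrides)
```

Here `stale` contains the absolute key, which signals that it should be rewritten as `report.pdf` to match the command line.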

See references/cli-reference.md for the full batch flag set.
