offloading-extraction

Updated June 25, 2026 at 06:10

Use when the user wants to extract a document via the cloud rather than the local kreuzberg CLI. Covers POST /v1/extract — JSON vs multipart bodies, URL crawls, options block, webhook attachment, and the async response shape.

Source: xberg-io/plugins


Offloading extraction

POST /v1/extract is the single submit endpoint. It returns 202 Accepted with job_ids (extraction) and crawl_job_ids (URL crawls) — never the extraction result inline. Pair every submit with either a poll loop (tracking-cloud-jobs skill) or a webhook.

When to reach for this

File is on a remote URL.
File is on disk but the local kreuzberg CLI is not installed.
You want server-side parallelism for a batch.
The user wants results delivered by webhook so the caller does not have to block.
Exception: if the file is larger than ~50 MB, use presigned-uploads instead, because the base64 JSON body would be too big.
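The criteria above, together with the size hints attached to each submission shape in this document, can be sketched as a small routing helper. The function name `choose_submission`, the returned labels, and the exact byte cut-offs are illustrative assumptions; the source only gives approximate limits (~1 MB, ~50 MB).

```python
from typing import Optional

MB = 1024 * 1024

# Approximate thresholds quoted in this document. The limits the API
# actually enforces are not stated, so treat these as assumptions.
MULTIPART_RECOMMENDED_ABOVE = 1 * MB
PRESIGNED_REQUIRED_ABOVE = 50 * MB


def choose_submission(size_bytes: Optional[int], is_remote_url: bool = False) -> str:
    """Return a hypothetical label for the submission shape to use."""
    if is_remote_url:
        # Remote content is submitted by URL, so local size does not matter.
        return "url"
    if size_bytes is None or size_bytes < 0:
        raise ValueError("size_bytes is required for local files")
    if size_bytes > PRESIGNED_REQUIRED_ABOVE:
        return "presigned-upload"
    if size_bytes > MULTIPART_RECOMMENDED_ABOVE:
        return "multipart"
    return "base64-json"
```

A caller would typically pass `os.path.getsize(path)` for a local file and branch on the returned label.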

Endpoint

POST https://api.xberg.io/v1/extract
Authorization: Bearer $KREUZBERG_API_KEY
Content-Type: application/json | multipart/form-data

Returns 202 Accepted with ExtractResponse.

Three submission shapes

1. Base64 JSON (small files, <5 MB recommended)

curl -X POST https://api.xberg.io/v1/extract \
  -H "Authorization: Bearer $KREUZBERG_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<JSON
{
  "documents": [
    {
      "filename": "invoice.pdf",
      "mime_type": "application/pdf",
      "data": "$(base64 -w0 invoice.pdf)"
    }
  ],
  "options": {
    "extraction_config": {
      "output_format": "markdown",
      "ocr": { "backend": "tesseract", "language": "eng" }
    }
  }
}
JSON
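The same JSON body can be assembled in Python with only the standard library, which avoids relying on shell command substitution and on the `-w0` flag of GNU `base64`. This is a minimal sketch: the helper name `build_base64_body` is hypothetical, and the field names are copied from the curl example above.

```python
import base64
import json
from typing import Any, Dict, Optional


def build_base64_body(
    filename: str,
    raw: bytes,
    mime_type: str,
    extraction_config: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize one document into the JSON body shown in the curl example."""
    body: Dict[str, Any] = {
        "documents": [
            {
                "filename": filename,
                "mime_type": mime_type,
                # Standard base64 on a single line, like `base64 -w0`.
                "data": base64.b64encode(raw).decode("ascii"),
            }
        ]
    }
    if extraction_config is not None:
        body["options"] = {"extraction_config": extraction_config}
    return json.dumps(body)


# Sample bytes stand in for the contents of invoice.pdf.
payload = build_base64_body(
    "invoice.pdf",
    b"%PDF-1.7 sample bytes",
    "application/pdf",
    {"output_format": "markdown"},
)
```

The resulting string would be sent as the request body with `Content-Type: application/json` and the Bearer header shown in the Endpoint section.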

2. Multipart (binary, recommended for anything over ~1 MB)

curl -X POST https://api.xberg.io/v1/extract \
  -H "Authorization: Bearer $KREUZBERG_API_KEY" \
  -F "file=@invoice.pdf;type=application/pdf" \
  -F 'options={"extraction_config":{"output_format":"markdown"}};type=application/json'

Add a webhook part as a JSON string:

  -F 'webhook={"url":"https://hooks.example.com/x","secret":"shh"};type=application/json'
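For readers who want to see what curl's `-F` flags put on the wire, the sketch below encodes the same three parts (`file`, `options`, `webhook`) by hand with the standard library. The function name `encode_multipart` is hypothetical; the part names come from the curl commands above, and whether the server accepts additional parts is not stated in the source.

```python
import json
import uuid
from typing import List, Optional, Tuple


def encode_multipart(
    filename: str,
    raw: bytes,
    mime_type: str,
    options: Optional[dict] = None,
    webhook: Optional[dict] = None,
) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    chunks: List[bytes] = []

    def add_part(headers: str, payload: bytes) -> None:
        # Each part: boundary line, headers, blank line, payload, CRLF.
        chunks.append(f"--{boundary}\r\n{headers}\r\n\r\n".encode("utf-8"))
        chunks.append(payload + b"\r\n")

    # Note: filename is not escaped here; avoid quotes in real filenames.
    add_part(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}",
        raw,
    )
    # JSON parts mirror the `options` and `webhook` -F flags above.
    for name, value in (("options", options), ("webhook", webhook)):
        if value is not None:
            add_part(
                f'Content-Disposition: form-data; name="{name}"\r\n'
                "Content-Type: application/json",
                json.dumps(value).encode("utf-8"),
            )
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"
```

The returned content type must be sent as the `Content-Type` header so the server can find the boundary. In practice an HTTP client library would normally do this encoding for you.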

3. URL crawl

curl -X POST https://api.xberg.io/v1/extract \
  -H "Authorization: Bearer $KREUZBERG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [{"url": "https://example.com/docs"}],
    "crawl_config": {"max_depth": 2, "max_pages": 50, "stay_on_domain": true},
    "webhook": {"url": "https://hooks.example.com/x"}
  }'

URL crawls return crawl_job_ids instead of (or alongside) job_ids.
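A crawl request body like the one above could be built and sanity-checked before sending. This is a sketch only: `build_crawl_body` is a hypothetical helper, the defaults simply repeat the values from the curl example, and the bounds checks are assumptions rather than documented API limits.

```python
from typing import Any, Dict, List, Optional


def build_crawl_body(
    urls: List[str],
    max_depth: int = 2,
    max_pages: int = 50,
    stay_on_domain: bool = True,
    webhook_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the JSON-serializable body for a URL crawl submission."""
    if not urls:
        raise ValueError("provide at least one URL")
    # Assumed sanity bounds; the real server-side limits are not documented here.
    if max_depth < 0 or max_pages < 1:
        raise ValueError("max_depth must be >= 0 and max_pages must be >= 1")
    body: Dict[str, Any] = {
        "urls": [{"url": url} for url in urls],
        "crawl_config": {
            "max_depth": max_depth,
            "max_pages": max_pages,
            "stay_on_domain": stay_on_domain,
        },
    }
    if webhook_url is not None:
        body["webhook"] = {"url": webhook_url}
    return body
```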

Response (202)

{
  "job_ids": ["550e8400-e29b-41d4-a716-446655440000"],
  "crawl_job_ids": [],
  "status": "pending"
}

status is always pending at submit time; the per-job status is retrieved via GET /v1/jobs/{id}.
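Because the 202 response carries only identifiers, client code usually needs to pull them out and decide what to poll. The sketch below parses the response shape shown above; `SubmitResult` and `parse_submit_response` are hypothetical names, and only the `/v1/jobs/{id}` path for extraction jobs is taken from the source (no polling path for crawl jobs is given, so none is generated).

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class SubmitResult:
    job_ids: List[str]
    crawl_job_ids: List[str]
    status: str

    def job_status_paths(self) -> List[str]:
        """Paths to poll for extraction jobs, per GET /v1/jobs/{id}."""
        return [f"/v1/jobs/{job_id}" for job_id in self.job_ids]


def parse_submit_response(status_code: int, payload: Dict[str, Any]) -> SubmitResult:
    """Validate the status code and extract the identifiers."""
    if status_code != 202:
        raise RuntimeError(f"expected 202 Accepted, got {status_code}")
    return SubmitResult(
        job_ids=list(payload.get("job_ids", [])),
        crawl_job_ids=list(payload.get("crawl_job_ids", [])),
        status=payload.get("status", "pending"),
    )


result = parse_submit_response(
    202,
    {
        "job_ids": ["550e8400-e29b-41d4-a716-446655440000"],
        "crawl_job_ids": [],
        "status": "pending",
    },
)
```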

The `options` block

Shape mirrors the local ExtractionConfig:

{
  "extraction_config": {
    "output_format": "markdown",
    "ocr": { "backend": "tesseract", "language": "eng+deu" },
    "extract_tables": true,
    "extract_images": false,
    "chunking": { "max_chars": 4000, "overlap": 200 }
  }
}

Supported output_format values: markdown, text, json, djot, html. Default is markdown.
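Validating the block client-side can catch a bad `output_format` before a round trip. The following is a minimal sketch: `build_options` is a hypothetical helper, the key names are copied from the example above, and the overlap check is an assumption about what a sensible chunking config looks like, not a documented rule.

```python
from typing import Any, Dict, List, Optional

SUPPORTED_OUTPUT_FORMATS = ("markdown", "text", "json", "djot", "html")


def build_options(
    output_format: str = "markdown",
    ocr_languages: Optional[List[str]] = None,
    chunk_max_chars: Optional[int] = None,
    chunk_overlap: int = 0,
) -> Dict[str, Any]:
    """Build an `options` block, rejecting unsupported output formats."""
    if output_format not in SUPPORTED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    config: Dict[str, Any] = {"output_format": output_format}
    if ocr_languages:
        # Multiple languages are joined with "+", as in "eng+deu" above.
        config["ocr"] = {"backend": "tesseract", "language": "+".join(ocr_languages)}
    if chunk_max_chars is not None:
        # Assumed constraint: overlap must be smaller than the chunk size.
        if not 0 <= chunk_overlap < chunk_max_chars:
            raise ValueError("chunk_overlap must be >= 0 and < chunk_max_chars")
        config["chunking"] = {"max_chars": chunk_max_chars, "overlap": chunk_overlap}
    return {"extraction_config": config}
```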

The `webhook` block

{
  "url": "https://hooks.example.com/x",
  "secret": "shared-secret-32-bytes-min",
  "metadata": { "request_id": "abc123", "user_id": "u_42" }
}

secret is the HMAC key used to sign the webhook payload — see tracking-cloud-jobs for verification. metadata is echoed back in the delivered payload, useful for correlating server-side requests.
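To illustrate how an HMAC-signed payload is typically checked on the receiving side, here is a sketch using the standard library. It rests on assumptions the source does not confirm: that the signature is HMAC-SHA256 computed over the raw request body and transmitted as a hex string. The actual algorithm, encoding, and header name are defined in tracking-cloud-jobs, which takes precedence.

```python
import hashlib
import hmac


def sign_payload(secret: str, raw_body: bytes) -> str:
    """Compute a hex HMAC-SHA256 of the raw body (assumed scheme)."""
    return hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, raw_body: bytes, received_signature: str) -> bool:
    """Compare signatures in constant time to avoid timing leaks."""
    expected = sign_payload(secret, raw_body)
    return hmac.compare_digest(expected, received_signature)
```

Whatever the real scheme is, the check should run against the exact bytes received, before any JSON parsing or re-serialization, since re-encoding can change the byte sequence and invalidate the signature.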

TypeScript SDK

import { KreuzbergCloud } from "@kreuzberg/cloud";
import { readFile } from "node:fs/promises";

const client = new KreuzbergCloud({ apiKey: process.env.KREUZBERG_API_KEY! });

const data = await readFile("invoice.pdf");
const job = await client.extract({
  file: { name: "invoice.pdf", data, mimeType: "application/pdf" },
  options: { extractionConfig: { outputFormat: "markdown" } },
});
console.log(job.id, job.status);

For submit + wait in one call:

const result = await client.extractAndWait({
  file: { name: "invoice.pdf", data },
});
console.log(result.result?.content);

Python SDK

import os
from pathlib import Path

from xberg_enterprise import KreuzbergCloud

with KreuzbergCloud(api_key=os.environ["KREUZBERG_API_KEY"]) as client:
    job = client.extract(file=Path("invoice.pdf"))
    print(job.id, job.status)

Submit + wait (run inside the same `with` block so that `client` is still open):

job = client.extract_and_wait(file=Path("invoice.pdf"))
print(job.result.content if job.result else job.status)

Batch submission

JSON: pass multiple entries in documents. Multipart: repeat the file part. SDKs expose extractBatch / extract_batch helpers that fan out correctly per platform (parallel HTTP for the async Python client, sequential for the sync one).
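For the JSON route, a batch is just a longer `documents` list. The sketch below builds that list from several in-memory files and guesses each MIME type with the standard library, failing early instead of sending a type the server would reject with the `400` listed under Errors. `build_documents` is a hypothetical helper and is not part of either SDK.

```python
import base64
import mimetypes
from typing import Dict, Iterable, List, Tuple


def build_documents(files: Iterable[Tuple[str, bytes]]) -> List[Dict[str, str]]:
    """Turn (filename, raw bytes) pairs into entries for `documents`."""
    documents: List[Dict[str, str]] = []
    for filename, raw in files:
        # guess_type returns (type, encoding); type is None if unknown.
        mime_type, _ = mimetypes.guess_type(filename)
        if mime_type is None:
            raise ValueError(f"cannot determine MIME type for {filename!r}")
        documents.append(
            {
                "filename": filename,
                "mime_type": mime_type,
                "data": base64.b64encode(raw).decode("ascii"),
            }
        )
    return documents
```

The returned list would be placed under the `documents` key of the request body. Since base64 inflates each file by roughly a third, large batches are likely better served by multipart or presigned uploads.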

Errors

| Status | Cause | Fix |
| --- | --- | --- |
| `400` | Empty `documents` and `urls` | Provide at least one. |
| `400` | Bad MIME type | Use a real RFC 6838 type, e.g. `application/pdf`. |
| `401` | Missing Bearer | Set `Authorization` header. |
| `413` | Request body too large | Switch to presigned uploads. |
| `429` | Quota or rate limit | Back off; check `quota_remaining` via `/v1/usage`. |
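The table above maps naturally onto a small dispatch function plus a backoff schedule for `429`. This is a sketch under assumptions: the action labels, the exponential schedule, and its default parameters are illustrative and not prescribed by the API. The source only says to back off.

```python
import random
from typing import Iterator


def next_action(status: int) -> str:
    """Map a status code from the table above to a hypothetical action label."""
    actions = {
        400: "fix-request",
        401: "set-authorization-header",
        413: "presigned-upload",
        429: "retry-with-backoff",
    }
    # Codes not covered by the table are left for the caller to handle.
    return actions.get(status, "unknown")


def backoff_delays(
    max_attempts: int = 5, base: float = 1.0, cap: float = 30.0, jitter: float = 0.0
) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    for attempt in range(max_attempts):
        delay = min(cap, base * 2 ** attempt)
        # Optional random jitter spreads out retries from many clients.
        yield delay + random.uniform(0, jitter)
```

A caller would sleep for each yielded delay between retries of a `429`, and stop retrying once the iterator is exhausted.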

Next step

After every submit, hand off to the tracking-cloud-jobs skill — cloud extraction is asynchronous and the result is delivered via either polling or webhook callback. Never assume a result is ready immediately after the 202 response.
