extracting-tables

Updated: June 8, 2026, 18:00

Use when extracting tabular data from PDFs, spreadsheets, or images. Covers layout-aware table detection, table model selection, output formats (markdown / JSON cells), and known limits.

Source: xberg-io/plugins (GitHub repository)

SKILL.md

name	extracting-tables
description	Use when extracting tabular data from PDFs, spreadsheets, or images. Covers layout-aware table detection, table model selection, output formats (markdown / JSON cells), and known limits.

Extracting tables

Use this when the user wants structured tabular data — financial statements, scientific tables, invoices, spreadsheet-style PDFs. Kreuzberg detects tables via a layout model (RT-DETR v2) and reconstructs cell structure with a configurable table model.

Basic usage

# Markdown tables embedded in the content stream
kreuzberg extract report.pdf --layout --content-format markdown

# Structured JSON output; tables appear under the top-level tables array
kreuzberg extract report.pdf --layout --format json

--layout turns on layout-aware extraction; without it, tables fall back to plain text reflow and you lose cell boundaries.

Output shapes

Two surfaces, picked via --format (CLI shape) and --content-format (content rendering):

Markdown tables in content — --content-format markdown. Tables appear inline as | col | col | blocks. Good for LLM ingestion.
Structured tables array — --format json. Each entry has cells[][] (rows × cols), markdown (pre-rendered), page_index, bbox. Use this when downstream code needs exact cell access.

Both are populated at once when --layout is on. The tables array is always structured; only the content stream changes representation, following --content-format.

kreuzberg extract financials.pdf --layout --format json \
  | jq '.tables[] | {page: .page_index, rows: (.cells | length)}'
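
The same per-table summary can be sketched in Python once the JSON output has been captured. The payload below is hand-written and contains only the fields the function reads (cells, markdown, page_index); the values and the summarize_tables name are illustrative assumptions, not real Kreuzberg output.

```python
import json

# Hypothetical payload shaped like `kreuzberg extract ... --layout --format json`.
# Field names follow the description above; the values are invented for the example.
SAMPLE_OUTPUT = """
{
  "tables": [
    {
      "cells": [["Item", "Amount"], ["Rent", "1200"]],
      "markdown": "| Item | Amount |\\n| --- | --- |\\n| Rent | 1200 |",
      "page_index": 0
    }
  ]
}
"""


def summarize_tables(payload: dict) -> list[dict]:
    """Return one {page, rows, cols} summary per entry in payload["tables"]."""
    summaries = []
    for table in payload.get("tables", []):
        cells = table.get("cells") or []
        summaries.append({
            "page": table.get("page_index"),
            "rows": len(cells),
            # Use the widest row in case rows are ragged.
            "cols": max((len(row) for row in cells), default=0),
        })
    return summaries


if __name__ == "__main__":
    print(summarize_tables(json.loads(SAMPLE_OUTPUT)))
```

In practice the JSON string would come from the CLI's standard output rather than a literal.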

Table models

--layout-table-model picks the reconstruction backend:

| Model | Best for | Notes |
| --- | --- | --- |
| `tatr` | dense complex tables (academic, financial) | Default. Heaviest, highest accuracy. |
| `slanet_auto` | dispatches per-table to wired/wireless | Good when table styles are mixed. |
| `slanet_wired` | tables with visible borders | Faster than `tatr`. |
| `slanet_wireless` | tables without borders (whitespace-separated) | For invoices, simple grids. |
| `slanet_plus` | hybrid wired / wireless | Lighter than `slanet_auto`. |
| `disabled` | layout detection only, no table structure | Use to skip table model cost. |

kreuzberg extract bank-statement.pdf \
  --layout --layout-table-model tatr --content-format markdown

Lower --layout-confidence when the layout model misses tables (default threshold ~0.5):

kreuzberg extract noisy-scan.pdf --layout --layout-confidence 0.3
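
When these flags are assembled from a script, a small helper can catch a mistyped model name before the CLI is invoked. This is a minimal sketch: build_extract_command is a hypothetical helper, it uses only flags shown in this document, and the 0 to 1 range check on the confidence value is an assumption.

```python
from typing import Optional

# Model names copied from the table above.
TABLE_MODELS = {
    "tatr", "slanet_auto", "slanet_wired",
    "slanet_wireless", "slanet_plus", "disabled",
}


def build_extract_command(
    path: str,
    table_model: Optional[str] = None,
    confidence: Optional[float] = None,
    content_format: Optional[str] = None,
) -> list[str]:
    """Build an argv list for a layout-aware `kreuzberg extract` call."""
    cmd = ["kreuzberg", "extract", path, "--layout"]
    if table_model is not None:
        if table_model not in TABLE_MODELS:
            raise ValueError(f"unknown table model: {table_model!r}")
        cmd += ["--layout-table-model", table_model]
    if confidence is not None:
        # Assumption: the threshold is a probability between 0 and 1.
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        cmd += ["--layout-confidence", str(confidence)]
    if content_format is not None:
        cmd += ["--content-format", content_format]
    return cmd


if __name__ == "__main__":
    print(build_extract_command("noisy-scan.pdf", confidence=0.3))
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting problems with file names that contain spaces.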

Spreadsheets

.xlsx, .ods, .csv, .tsv are extracted by dedicated parsers — no layout model needed. Each sheet becomes a markdown table (or structured table) automatically:

kreuzberg extract workbook.xlsx --content-format markdown
kreuzberg extract data.csv --format json

Pass --no-cache=true only when iterating on the same file with different configs.

Config file alternative

# `output_format` in config files equals `--content-format` on the CLI.
output_format = "markdown"

[layout_detection]
enabled = true
confidence_threshold = 0.5
table_model = "tatr"

Then:

kreuzberg extract report.pdf --format json

Programmatic access

From Python, structured tables live on result.tables:

from kreuzberg import extract_file_sync, ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout_detection=LayoutDetectionConfig(enabled=True, table_model="tatr"),
    output_format="markdown",
)
result = extract_file_sync("report.pdf", config=config)
for table in result.tables:
    print(table.markdown)            # rendered markdown
    if table.cells and table.cells[0]:
        print(table.cells[0][0])     # top-left cell; skipped for empty tables

Node.js mirrors this (extractFile, result.tables, camelCase fields). See references/python-api.md and references/nodejs-api.md in the sibling kreuzberg skill for full type signatures.
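
Downstream code often wants rows keyed by column name rather than raw cells[][] indexing. The sketch below assumes the first row of cells is the header row, which this document does not guarantee; table_to_records is a hypothetical helper that works on plain lists and does not call the Kreuzberg API.

```python
def table_to_records(cells: list[list[str]]) -> list[dict[str, str]]:
    """Turn a rows-by-columns grid into one dict per data row.

    Assumes cells[0] holds the column headers.
    """
    if not cells:
        return []
    header, *rows = cells
    records = []
    for row in rows:
        # Pad short rows so every record carries all header keys;
        # cells beyond the header width are dropped by zip().
        padded = list(row) + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


if __name__ == "__main__":
    grid = [["Item", "Amount"], ["Rent", "1200"], ["Power"]]
    for record in table_to_records(grid):
        print(record)
```

Duplicate header names would collapse into a single key, so tables with repeated column titles need the headers disambiguated first.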

Known limitations

Merged cells — reconstructed as repeated values across the spanned region; the merge is not preserved as metadata in v0.1.
Rotated tables — enable --ocr-auto-rotate true for image-based PDFs before extraction.
Nested tables — flattened. Detection succeeds; structural nesting is lost.
Multi-page tables — each page yields a separate tables[] entry. Stitch by matching column headers if needed.
ONNX Runtime required — layout and table models are unavailable in WASM builds and on the Android x86_64 emulator; native targets ship full support.
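
Stitching multi-page tables by matching column headers can be sketched as follows. The function works on plain dicts with cells and page_index keys, assumes the header row is repeated on every page, and only merges fragments on consecutive pages; stitch_tables and the first_page / last_page keys are illustrative names, not part of the Kreuzberg output.

```python
def stitch_tables(tables: list[dict]) -> list[dict]:
    """Merge table fragments on consecutive pages that share a header row."""
    stitched: list[dict] = []
    for table in sorted(tables, key=lambda t: t["page_index"]):
        cells = table["cells"]
        prev = stitched[-1] if stitched else None
        continues = (
            prev is not None
            and cells
            and prev["cells"]
            and cells[0] == prev["cells"][0]                 # same header row
            and table["page_index"] == prev["last_page"] + 1  # next page
        )
        if continues:
            prev["cells"].extend(cells[1:])  # skip the repeated header
            prev["last_page"] = table["page_index"]
        else:
            stitched.append({
                "first_page": table["page_index"],
                "last_page": table["page_index"],
                "cells": [list(row) for row in cells],
            })
    return stitched


if __name__ == "__main__":
    pages = [
        {"page_index": 0, "cells": [["Date", "Amount"], ["01-01", "10"]]},
        {"page_index": 1, "cells": [["Date", "Amount"], ["01-02", "20"]]},
    ]
    print(stitch_tables(pages))
```

Only the most recently stitched table is compared, so a page holding several tables can break a chain; a continuation page that omits the header row is not detected either.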

Common failure modes

Empty tables with --layout on — confidence threshold too high or table model mismatched. Drop --layout-confidence to 0.3, try --layout-table-model tatr.
Markdown tables look ragged — switch --layout-table-model to slanet_wired for bordered grids or slanet_wireless for invoices.
Slow extraction — tatr is heavy. Use slanet_auto or slanet_plus as a default; reach for tatr only when accuracy matters.
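
The remedies above can be chained into a simple fallback ladder: start with a lighter model, lower the confidence threshold, and reach for tatr last. This is a sketch under assumptions: extract_with_fallback and the run callable are hypothetical, and run is expected to execute the command and return the parsed JSON output as a dict.

```python
from typing import Callable, Optional

# Extra flags per attempt, ordered from cheapest to heaviest.
ATTEMPTS = [
    ["--layout-table-model", "slanet_auto"],
    ["--layout-table-model", "slanet_auto", "--layout-confidence", "0.3"],
    ["--layout-table-model", "tatr", "--layout-confidence", "0.3"],
]


def extract_with_fallback(
    run: Callable[[list[str]], dict], path: str
) -> Optional[dict]:
    """Try each attempt in order; return the first result with a non-empty table."""
    for extra in ATTEMPTS:
        cmd = ["kreuzberg", "extract", path, "--layout", "--format", "json", *extra]
        result = run(cmd)
        if any(table.get("cells") for table in result.get("tables", [])):
            return result
    return None  # every attempt came back without table cells


if __name__ == "__main__":
    # Stand-in runner that never finds a table, to show the control flow.
    print(extract_with_fallback(lambda cmd: {"tables": []}, "scan.pdf"))
```

A real run callable could wrap subprocess.run with capture_output=True and pass the standard output to json.loads; injecting it keeps the ladder testable without the CLI installed.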

See references/cli-reference.md for the full layout flag set and references/advanced-features.md for the layout pipeline internals.
