extracting-keywords

Stars: 5

Forks: 0

Updated: June 20, 2026 at 10:09

Use when extracting keywords (YAKE/RAKE) from documents — and, secondarily, when detecting document language or generating embeddings for RAG and search. Covers the keyword config (and its feature gating), `--detect-language`, and the standalone `embed` command with real flags.

Source

xberg-io

xberg-io/plugins

SKILL.md

name	extracting-keywords
description	Use when extracting keywords (YAKE/RAKE) from documents — and, secondarily, when detecting document language or generating embeddings for RAG and search. Covers the keyword config (and its feature gating), `--detect-language`, and the standalone `embed` command with real flags.

Extracting keywords, language, and embeddings

Use this for the enrichment surface around extraction: statistical keyword extraction, language detection, and vector embeddings. Keywords and language detection ride along with extraction and land on the result; embeddings are produced by a dedicated embed command.

Keywords (YAKE / RAKE)

Keyword extraction is configured via the [keywords] config block (or inline JSON) — there is no single --keywords CLI flag. When enabled, extracted keywords appear on result.keywords. Two algorithms are available:

YAKE ("yake") — statistical, unsupervised single-document extraction. Good general default.
RAKE ("rake") — co-occurrence / phrase-based. Favors multi-word key phrases.

Feature-gated: keyword extraction requires the CLI to be built with the keywords-yake and/or keywords-rake Cargo features (both are in the default/full build). If the CLI was built without them, the [keywords] config block is silently ignored — result.keywords simply stays empty rather than erroring. The "yake" algorithm needs keywords-yake; "rake" needs keywords-rake.

Enable via inline JSON on the CLI:

kreuzberg extract paper.pdf --format json \
  --config-json '{"keywords":{"algorithm":"yake","max_keywords":15,"language":"en"}}' \
  | jq '.keywords'
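
When the command is assembled from a script, hand-quoting the inline JSON is easy to get wrong. The following is a minimal Python sketch that builds the same argument list, assuming only the flags shown above (`--format`, `--config-json`). The helper name `build_extract_argv` is hypothetical, and the sketch only constructs the arguments; it does not invoke the CLI.

```python
import json
import shlex


def build_extract_argv(path, keywords):
    """Build the argv for `kreuzberg extract` with an inline keywords config.

    `build_extract_argv` is a hypothetical helper; it uses only the flags
    that appear in this document.
    """
    # Compact separators keep the payload short; key order follows the dict.
    config_json = json.dumps({"keywords": keywords}, separators=(",", ":"))
    return ["kreuzberg", "extract", path, "--format", "json",
            "--config-json", config_json]


argv = build_extract_argv(
    "paper.pdf",
    {"algorithm": "yake", "max_keywords": 15, "language": "en"},
)
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` avoids shell quoting altogether, which is usually the safer choice when the keyword settings come from user input.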

Or in a config file:

[keywords]
algorithm = "rake"       # "yake" or "rake"
max_keywords = 10        # default 10
min_score = 0.0          # filter below this score (ranges differ per algorithm)
ngram_range = [1, 3]     # unigrams..trigrams (default)
language = "en"          # stopword language; omit to skip stopword filtering

kreuzberg extract report.pdf --config kreuzberg.toml --format json | jq '.keywords'

Field notes:

max_keywords caps how many keywords are returned (default 10).
min_score filters low-scoring keywords; note YAKE scores are lower-is-better while RAKE scores are higher-is-better, so a single threshold behaves differently per algorithm.
ngram_range is [min, max]: [1,1] unigrams only, [1,2] adds bigrams, [1,3] (default) adds trigrams.
language enables stopword filtering for that language; omit it to disable stopword filtering entirely.
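
To make the `ngram_range` bounds concrete, here is a small Python sketch that enumerates the contiguous n-grams a given `[min, max]` range admits. It only illustrates what the range allows as candidates; it is not how YAKE or RAKE select or score phrases internally, and `candidate_ngrams` is a hypothetical helper.

```python
def candidate_ngrams(tokens, ngram_range=(1, 3)):
    """Return every contiguous n-gram whose length falls in [min, max]."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["vector", "search", "index"]
print(candidate_ngrams(tokens, (1, 1)))       # ['vector', 'search', 'index']
print(candidate_ngrams(tokens, (1, 2)))       # adds 'vector search', 'search index'
print(len(candidate_ngrams(tokens, (1, 3))))  # 6
```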

Language detection

Language detection is a real CLI flag: --detect-language. Detected languages appear on result.detected_languages:

kreuzberg extract multilingual.pdf --detect-language true --format json \
  | jq '.detected_languages'

In a config file it lives under [language_detection]:

[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false

The CLI flag enables detection with min_confidence = 0.8 and single-language mode; use the config block to detect multiple languages or tune confidence.
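
The sketch below shows one plausible reading of how `min_confidence` and `detect_multiple` interact. It is an assumption for illustration, not taken from the Kreuzberg source: candidates below the threshold are dropped, and single-language mode keeps only the most confident survivor. The `(language, confidence)` pairs and the `select_languages` helper are hypothetical.

```python
def select_languages(candidates, min_confidence=0.8, detect_multiple=False):
    """Filter (language, confidence) pairs the way the two settings suggest.

    Assumption: candidates below min_confidence are dropped, and
    detect_multiple=False keeps only the most confident survivor.
    """
    kept = sorted(
        (c for c in candidates if c[1] >= min_confidence),
        key=lambda c: c[1],
        reverse=True,
    )
    if not detect_multiple:
        kept = kept[:1]
    return [lang for lang, _ in kept]


# Hypothetical detector output for a mixed English/German document.
candidates = [("en", 0.93), ("de", 0.85), ("fr", 0.40)]
print(select_languages(candidates))                        # ['en']
print(select_languages(candidates, detect_multiple=True))  # ['en', 'de']
```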

Embeddings (`embed` command)

The standalone embed command produces vector embeddings for text from --text (repeatable) or stdin. It does not run extraction — pipe extracted content in if you want document embeddings.

# Local ONNX preset model (default provider)
kreuzberg embed --text "first passage" --text "second passage" --preset balanced

# Embed extracted document text
kreuzberg extract report.pdf | kreuzberg embed --preset quality

Presets for the local provider: fast, balanced (default), quality, multilingual. Output defaults to JSON (--format json).

--provider selects the embedding source:

| Provider | Flag | Notes |
| --- | --- | --- |
| `local` | `--preset <fast\|balanced\|quality\|multilingual>` | Default. ONNX model, no API key. |
| `llm` | `--model <id>` `--api-key <key>` | liter-llm routing, e.g. `openai/text-embedding-3-small`. |
| `plugin` | `--plugin <name>` | A backend pre-registered in-process via the plugin API. |

# Provider-hosted embeddings via an LLM
kreuzberg embed --text "query text" \
  --provider llm --model openai/text-embedding-3-small --api-key "$OPENAI_API_KEY"

Local embedding presets must be downloaded first if not cached. Pre-warm them with the cache command:

kreuzberg cache warm --embedding-model balanced   # one preset
kreuzberg cache warm --all-embeddings             # all four presets
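
Once vectors are available, a typical search or RAG step ranks passages by cosine similarity to a query vector. The sketch below illustrates that step in plain Python. This document does not describe the JSON shape that `embed` emits, so the example starts from plain lists of floats; the toy vectors and both helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_passages(query, passages))  # [1, 2, 0]
```

Query and passage vectors should come from the same provider and preset (or model); vectors from different models are not comparable.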

Programmatic access

Keywords and detected languages live on the extraction result:

from kreuzberg import extract_file_sync, ExtractionConfig

result = extract_file_sync(
    "paper.pdf",
    config=ExtractionConfig(),  # configure keywords/language_detection on the config
)
print(result.keywords)             # extracted keywords (when enabled)
print(result.detected_languages)   # detected languages (when enabled)

See references/python-api.md and references/configuration.md in the sibling kreuzberg skill for the keyword / language-detection config classes and the embedding presets.

Common pitfalls

No --keywords flag — keyword extraction is config-only. Use --config-json '{"keywords":{...}}' or a [keywords] config block.
min_score direction — lower is better for YAKE, higher is better for RAKE; pick the threshold to match the algorithm.
Embeddings ≠ extraction — embed only takes raw text. Pipe kreuzberg extract output into it for document vectors.
Cold embedding models — first local run downloads the preset; run kreuzberg cache warm --all-embeddings to pre-populate.
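
The `min_score` direction pitfall also applies to any post-processing that sorts keywords. A minimal Python sketch, assuming keywords have already been reduced to `(text, score)` pairs: the pair shape, the sample scores, and the `best_first` helper are hypothetical, and only the score direction per algorithm comes from this document.

```python
# Score direction per algorithm, as described in this document.
LOWER_IS_BETTER = {"yake": True, "rake": False}


def best_first(keywords, algorithm):
    """Sort (text, score) pairs so the strongest keyword comes first."""
    return sorted(
        keywords,
        key=lambda kw: kw[1],
        reverse=not LOWER_IS_BETTER[algorithm],
    )


# Hypothetical scores; real ranges differ per algorithm and document.
yake_keywords = [("embedding", 0.12), ("search", 0.04), ("page", 0.60)]
rake_keywords = [("vector search", 8.5), ("page", 1.0), ("keyword extraction", 9.0)]

print(best_first(yake_keywords, "yake")[0][0])  # search
print(best_first(rake_keywords, "rake")[0][0])  # keyword extraction
```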

See references/advanced-features.md for the embeddings pipeline and references/cli-reference.md for the embed and cache warm flag sets.
