chunking-embeddings
Chunking, embeddings, and RAG pipeline integration
| Field | Value |
|---|---|
| description | Chunking, embeddings, and RAG pipeline integration |
| name | chunking-embeddings |
| priority | critical |
Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration
Location: crates/xberg/src/chunking/, crates/xberg/src/embeddings.rs
Extracted Text
|
[1. Normalization] -> Clean whitespace, remove control chars
|
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
|
[3. Overlap Management] -> Control context window overlap
|
[4. Optional Embedding] -> Generate vectors with FastEmbed
|
Output: Vec<Chunk> with text, vectors, metadata
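As an illustration of stage 1, here is a minimal sketch of text normalization. The function name `normalize_text` is hypothetical and not taken from the xberg crate. It assumes that line breaks should survive normalization so that later paragraph-based splitting still has something to work with; the actual implementation may make a different choice.

```rust
/// Hypothetical helper (not the crate's API): removes control characters and
/// collapses runs of spaces/tabs inside each line, while keeping line breaks.
pub fn normalize_text(input: &str) -> String {
    input
        .lines()
        .map(|line| {
            line.chars()
                // Keep whitespace (tabs are control chars too) so it can be
                // collapsed below; drop every other control character.
                .filter(|c| !c.is_control() || c.is_whitespace())
                .collect::<String>()
                // split_whitespace + join trims the line and collapses runs.
                .split_whitespace()
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Blank lines are preserved as empty lines, so a `\n\n` paragraph separator is still visible to the chunking stage.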
Location: crates/xberg/src/chunking/mod.rs
| Strategy | Pattern | Best For |
|---|---|---|
| Fixed-Size | Sliding window with configurable overlap | Uniform chunks for embedding models with fixed token limits |
| Semantic | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
| Syntax-Aware | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG |
| Recursive (LangChain pattern) | Try separators in order: "\n\n", "\n", " " (space), "" (empty string) | General-purpose default; splits at the coarsest separator that keeps chunks within the size limit |
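Two of the strategies in the table can be sketched with the standard library alone. The code below is an illustrative approximation, not the code in chunking/mod.rs: the function names are hypothetical, `fixed_size_chunks` counts whitespace-separated words as a stand-in for tokens, and `recursive_split` measures size in characters and omits overlap for brevity.

```rust
/// Hypothetical sketch: sliding window over whitespace-separated words.
pub fn fixed_size_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let step = chunk_size - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + chunk_size).min(tokens.len());
        chunks.push(tokens[start..end].join(" "));
        if end == tokens.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}

/// Hypothetical sketch: split on the first separator, merge pieces that fit,
/// and retry oversized pieces with the remaining (finer) separators.
pub fn recursive_split(text: &str, separators: &[&str], chunk_size: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    if text.chars().count() <= chunk_size {
        return if text.is_empty() { Vec::new() } else { vec![text.to_string()] };
    }
    // No separators left: hard split every `chunk_size` characters.
    let Some((sep, rest)) = separators.split_first() else {
        let chars: Vec<char> = text.chars().collect();
        return chars
            .chunks(chunk_size)
            .map(|c| c.iter().collect::<String>())
            .collect();
    };
    let mut chunks = Vec::new();
    let mut current = String::new();
    for piece in text.split(*sep).filter(|p| !p.is_empty()) {
        let piece_len = piece.chars().count();
        let joined_len = current.chars().count() + sep.chars().count() + piece_len;
        if !current.is_empty() && joined_len <= chunk_size {
            // The piece still fits: merge it into the chunk being built.
            current.push_str(sep);
            current.push_str(piece);
            continue;
        }
        if !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        if piece_len <= chunk_size {
            current.push_str(piece);
        } else {
            // Too large on its own: recurse with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size));
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

For example, `recursive_split(text, &["\n\n", "\n", " "], 512)` first tries paragraph breaks and only falls back to line breaks or spaces for paragraphs that exceed 512 characters. The hard split at the end plays the role of the empty-string separator.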
Key config fields per strategy (see struct definitions in chunking/mod.rs):
- Fixed-Size: chunk_size, overlap, trim_whitespace
- Semantic: target_chunk_size, min/max_chunk_size, semantic_threshold, use_sentence_boundaries
- Syntax-Aware: chunk_by (Paragraph/Section/Heading/Sentence/CodeBlock), max_chunk_size, respect_code_blocks
- Recursive: separators[], chunk_size, overlap

Location: crates/xberg/src/chunking/mod.rs
| Preset | Chunk Size | Overlap | Strategy | Use Case |
|---|---|---|---|---|
| Balanced | 512 tokens | 50 | Semantic | RAG sweet spot |
| Compact | 256 tokens | 32 | Fixed-Size | Dense vectors |
| Extended | 1024 tokens | 100 | Recursive | Full context |
| Minimal | 128 tokens | 16 | (default) | Lightweight embeddings |
Usage: set config.chunking.preset = Some("balanced") in ExtractionConfig.
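The preset table can be read as a simple lookup from a preset name to chunking parameters. The sketch below mirrors the table's values; the types `Strategy` and `PresetParams` and the function `resolve_preset` are hypothetical names for illustration, and case-insensitive matching is an assumption rather than documented behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    FixedSize,
    Semantic,
    Recursive,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PresetParams {
    pub chunk_size: usize,
    pub overlap: usize,
    /// `None` stands for the "(default)" strategy of the Minimal row.
    pub strategy: Option<Strategy>,
}

/// Hypothetical lookup: returns `None` for an unknown preset name.
pub fn resolve_preset(name: &str) -> Option<PresetParams> {
    let (chunk_size, overlap, strategy) = match name.to_ascii_lowercase().as_str() {
        "balanced" => (512, 50, Some(Strategy::Semantic)),
        "compact" => (256, 32, Some(Strategy::FixedSize)),
        "extended" => (1024, 100, Some(Strategy::Recursive)),
        "minimal" => (128, 16, None),
        _ => return None,
    };
    Some(PresetParams { chunk_size, overlap, strategy })
}
```

Returning `Option` rather than panicking lets a caller fall back to explicit per-strategy fields when the preset string is misspelled.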
Location: crates/xberg/src/embeddings.rs
| Model | Dimensions | Notes |
|---|---|---|
| BAAI/bge-small-en-v1.5 (default) | 384 | Fast, excellent for RAG |
| BAAI/bge-small-zh-v1.5 | 384 | Chinese optimized |
| BAAI/bge-base-en-v1.5 | 768 | Better quality, slower |
| jinaai/jina-embeddings-v2-base-en | 768 | Long context (up to 8192 tokens) |
| Custom(path) | varies | Custom ONNX model path |
TextEmbeddingManager provides singleton-cached models per config. Pattern:
- get_or_init_model() -- lazy-loads ONNX model (downloads if needed), caches in Arc<RwLock<HashMap>>
- embed_chunks() -- collects chunk texts, calls model.embed(texts, batch_size), zips results back to ChunkWithEmbedding

Default config: batch_size=256, device=CPU, parallel_requests=4.
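The lazy-load-and-cache pattern behind get_or_init_model() can be sketched with the standard library. This is not the TextEmbeddingManager implementation: `DummyModel` stands in for a loaded ONNX model, `ModelCache` and the `load` closure parameter are hypothetical, and keying the cache by a string name is an assumption about how "per config" caching might work.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for a loaded embedding model (hypothetical type).
pub struct DummyModel {
    pub name: String,
    pub dimensions: usize,
}

/// Hypothetical cache: one shared model instance per key.
#[derive(Default, Clone)]
pub struct ModelCache {
    models: Arc<RwLock<HashMap<String, Arc<DummyModel>>>>,
}

impl ModelCache {
    /// Returns the cached model for `key`, calling `load` only on a miss.
    pub fn get_or_init_model(
        &self,
        key: &str,
        load: impl FnOnce() -> DummyModel,
    ) -> Arc<DummyModel> {
        {
            // Fast path: a shared read lock is enough when the model exists.
            let guard = self.models.read().unwrap();
            if let Some(model) = guard.get(key) {
                return Arc::clone(model);
            }
        } // read lock released here, before the write lock is requested

        // Slow path: take the write lock. `entry` re-checks the key, so a
        // model inserted by another thread in the meantime is reused.
        let mut guard = self.models.write().unwrap();
        let model = guard
            .entry(key.to_string())
            .or_insert_with(|| Arc::new(load()));
        Arc::clone(model)
    }
}
```

In this sketch the write lock is held while `load` runs, which keeps the logic simple but blocks other readers during a slow download. Whether the real manager accepts that tradeoff is not stated in the source.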
Embeddings require ONNX Runtime. Feature-gated via:
[features]
embeddings = ["dep:fastembed", "dep:ort"]
Install: brew install onnxruntime (macOS) / apt install libonnxruntime libonnxruntime-dev (Linux). Verify: echo $ORT_DYLIB_PATH.
The full extraction-to-RAG pipeline:

1. extract_file(path, config) -> ExtractionResult
2. result.content -> Vec<Chunk>
3. TextEmbeddingManager::embed_chunks() -> Vec<ChunkWithEmbedding>
4. RagDocument { file_path, metadata, chunks } ready for vector DB ingestion

See ChunkWithEmbedding struct in types.rs: contains text, embedding: Vec<f32>, dimensions, norm, metadata.
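The step that zips embedding vectors back onto chunk texts can be sketched as follows. The struct here is a simplified, hypothetical mirror of ChunkWithEmbedding (the metadata field is omitted), `attach_embeddings` is an illustrative name, and treating `norm` as the L2 norm of the vector is an assumption; the source does not say how it is computed.

```rust
/// Simplified, hypothetical mirror of the struct described in types.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkWithEmbedding {
    pub text: String,
    pub embedding: Vec<f32>,
    pub dimensions: usize,
    pub norm: f32,
}

/// Euclidean (L2) length of a vector.
pub fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Pairs each chunk text with its vector, in order. Fails if the model
/// returned a different number of vectors than there were texts.
pub fn attach_embeddings(
    texts: Vec<String>,
    vectors: Vec<Vec<f32>>,
) -> Result<Vec<ChunkWithEmbedding>, String> {
    if texts.len() != vectors.len() {
        return Err(format!(
            "expected {} embeddings, got {}",
            texts.len(),
            vectors.len()
        ));
    }
    Ok(texts
        .into_iter()
        .zip(vectors)
        .map(|(text, embedding)| ChunkWithEmbedding {
            // Computed before `embedding` is moved into the struct below.
            dimensions: embedding.len(),
            norm: l2_norm(&embedding),
            text,
            embedding,
        })
        .collect())
}
```

Storing the norm alongside the vector would let a vector store compute cosine similarity without re-deriving vector lengths, which may be why the struct carries it.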