---
name: rag-patterns
description: "RAG: embeddings, chunking, hybrid search (BM25+vector), reranking, CRAG, multi-hop. Triggers: RAG, embedding, pgvector, Qdrant, Pinecone, Weaviate, reranker, semantic search."
effort: medium
user-invocable: false
allowed-tools: Read
---

# RAG Patterns Skill

## Core Patterns

### 1. Hybrid Search
Combine dense (vector) and sparse (BM25) retrieval with RRF fusion:

```python
result = await hybrid_search_kb(
    query="rate limiting configuration",
    service="nginx",
    limit=10,
)
```
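
RRF itself is only a few lines. A minimal sketch of the fusion step, assuming each retriever returns a ranked list of document IDs (names here are illustrative, not the tool's internals):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over rankers of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([dense_ids, bm25_ids])[:10]
```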

### 2. Corrective RAG (CRAG)
Self-correcting retrieval with relevance validation:

```python
result = await crag_search(
    query="fuzzy query",
    relevance_threshold=0.4,
    max_retries=2,
)

# Or route through the unified entry point:
result = await smart_query(query="...", use_crag=True)
```
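
Under the hood this is a retrieve-grade-retry loop. A rough sketch of the control flow, with `grade_relevance` and `rewrite_query` as assumed helpers (not actual tool internals):

```python
async def crag_loop(query: str, relevance_threshold: float = 0.4,
                    max_retries: int = 2) -> list[dict]:
    """Retrieve, grade each hit, and retry with a rewritten query when results are weak."""
    for _ in range(max_retries + 1):
        results = await hybrid_search_kb(query=query, limit=10)
        kept = [r for r in results
                if grade_relevance(query, r["text"]) >= relevance_threshold]
        if kept:
            return kept
        query = rewrite_query(query)  # expand, decompose, or reformulate
    return []  # better to report "insufficient context" than to hallucinate
```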

### 3. HyDE (Hypothetical Document Embeddings)
Generate hypothetical answers for better retrieval on conceptual queries:

```python
result = await smart_query(
    query="conceptual question about design patterns",
    use_hyde=True,
)
```
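
The trick is to embed a hypothetical answer rather than the question itself, since answers sit closer to the documents in embedding space. A sketch, with `llm_complete`, `embed`, and `vector_search` as assumed helpers:

```python
async def hyde_search(query: str, limit: int = 10) -> list[dict]:
    """Draft a plausible answer, then retrieve by the draft's embedding."""
    draft = await llm_complete(
        f"Write a short passage that would plausibly answer: {query}"
    )
    vector = await embed(draft)  # embed the hypothetical answer, not the query
    return await vector_search(vector, limit=limit)
```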

### 4. Multi-hop Retrieval
Complex queries requiring multiple retrieval steps:

```python
result = await multi_hop_search(
    query="Compare nginx with varnish for Magento cache",
    max_hops=3,
)

# Or route through the unified entry point:
result = await smart_query(query="compare A vs B", use_multi_hop=True)
```
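
Conceptually, multi-hop decomposes the question and retrieves once per sub-question. A sketch, with `decompose` (sketched under Query Enhancement below) and `synthesize` as assumed helpers:

```python
async def multi_hop_loop(query: str, max_hops: int = 3) -> str:
    """One retrieval pass per sub-question, then a grounded synthesis step."""
    context: list[dict] = []
    for sub_q in await decompose(query, max_hops):
        hits = await hybrid_search_kb(query=sub_q, limit=5)
        context.extend(hits)  # later hops may also condition on earlier findings
    return synthesize(query, context)
```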

## Indexing Best Practices

| Aspect | Recommendation |
|---|---|
| Chunk size | 512-1024 tokens |
| Overlap | 10-20% of chunk |
| Structure | Preserve headers, sections |
| Metadata | Include title, path, date, category, tags |
| Frontmatter | YAML with standardized fields |
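
A minimal structure-aware chunker matching the table above: split on markdown headers first, then enforce the token budget. `count_tokens` and `split_paragraphs` are assumed helpers (e.g. tiktoken for counting):

```python
import re

def chunk_by_headers(markdown: str, max_tokens: int = 1024) -> list[str]:
    """Split on H2/H3 boundaries so each chunk stays one coherent section."""
    sections = re.split(r"(?m)^(?=#{2,3} )", markdown)
    chunks: list[str] = []
    for section in sections:
        if not section.strip():
            continue
        if count_tokens(section) <= max_tokens:
            chunks.append(section)
        else:
            # fall back to paragraph-level splits for oversized sections
            chunks.extend(split_paragraphs(section, max_tokens))
    return chunks
```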

### Frontmatter Template

```yaml
---
title: "Document Title"
service: {project-name}
category: reference|howto|procedures|troubleshooting|decisions|best-practices
tags: [tag1, tag2, tag3]
last_updated: "YYYY-MM-DD"
---
```

## MCP Tools Reference (v5.5.0)

| Tool | Use Case | Speed |
|---|---|---|
| `smart_query` ⭐ | Default for 90% of queries | 2-4s |
| `hybrid_search_kb` | Raw vector + text search | <1s |
| `get_document` | Full document content | <1s |
| `crag_search` | Vague/fuzzy queries | 1-3s |
| `multi_hop_search` | Complex reasoning | 20-30s |

### Tool Selection Guide

```python
smart_query("specific technical question")                    # default choice
crag_search("jak to skonfigurować")                           # fuzzy query (Polish: "how to configure this")
multi_hop_search("nginx vs varnish performance comparison")   # multi-step reasoning
get_document(path="kb/reference/architecture.md")             # known document path
```

## Quality Metrics

| Metric | Description | Target |
|---|---|---|
| Faithfulness | Answer based on context | >70% |
| Relevancy | Answer addresses question | >70% |
| Context Precision | Found context is accurate | >60% |
| Latency (p95) | Response time | <2s |
| Precision@k | Relevant results in top-k | >80% |
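
Most of these need an evaluation harness, but Precision@k is simple enough to compute directly against a golden dataset; a sketch:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that appear in the golden set."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, k=5)  # -> 0.6
```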

## Retrieval Optimization

### Reranking

```python
# Retrieve a wide candidate set, rerank with a cross-encoder, keep the best few
initial_results = await hybrid_search_kb(query, limit=20)
reranked = rerank_results(initial_results, query)
final_results = reranked[:5]
```
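
One way to implement `rerank_results` is a cross-encoder via `sentence-transformers`; the model choice here is an example (see the reranker gotcha below for latency):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # load once, reuse across queries

def rerank_results(results: list[dict], query: str) -> list[dict]:
    """Score each (query, passage) pair jointly and sort by descending relevance."""
    pairs = [(query, r["text"]) for r in results]  # assumes hits carry a "text" field
    scores = reranker.predict(pairs)
    ranked = sorted(zip(results, scores), key=lambda pair: pair[1], reverse=True)
    return [r for r, _ in ranked]
```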

### Context Window Management
- Place critical info at start/end (serial position effect)
- Summarize long documents before insertion
- Use tiered context: critical → supporting → background (see the sketch below)
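
A sketch of tiered assembly under a fixed token budget (`count_tokens` assumed, as above):

```python
def build_context(critical: list[str], supporting: list[str],
                  background: list[str], budget: int = 8000) -> str:
    """Fill the window tier by tier; critical chunks are always kept."""
    parts: list[str] = []
    used = 0
    for tier in (critical, supporting, background):
        for chunk in tier:
            cost = count_tokens(chunk)
            if used + cost > budget and tier is not critical:
                return "\n\n".join(parts)  # budget exhausted; drop lower tiers
            parts.append(chunk)
            used += cost
    return "\n\n".join(parts)
```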

### Query Enhancement
- Query expansion with synonyms
- Query decomposition for complex questions (sketched below)
- Entity extraction for filtering
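
A decomposition sketch (`llm_complete` assumed, as in the HyDE example):

```python
async def decompose(query: str, max_parts: int = 3) -> list[str]:
    """Ask the model to split a complex question into standalone sub-questions."""
    raw = await llm_complete(
        f"Split this question into at most {max_parts} standalone "
        f"sub-questions, one per line:\n{query}"
    )
    return [line.strip() for line in raw.splitlines() if line.strip()][:max_parts]
```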

## Anti-Patterns
❌ Don't:
- Skip reranking for final results
- Use very large chunks (>2000 tokens)
- Ignore metadata in retrieval
- Trust LLM output without citation
- Use `latest` for model versions
✅ Do:
- Use top-k=20 → rerank → top-5
- Chunk semantically (by section)
- Enrich metadata at indexing time
- Require source attribution in answers
- Pin model versions for reproducibility

## RAG System Implementation

### Key Files (Typical Structure)

```text
scripts/
├── search_core.py          # Core search
├── query_enhancements.py   # HyDE, query expansion
├── corrective_rag.py       # CRAG
├── multi_hop.py            # Multi-hop
├── unified_indexer.py      # Indexing
└── rag_evaluator.py        # Evaluation
```

### Running RAG Commands

Direct execution:

```bash
make index
python scripts/evaluate_rag.py
python scripts/knowledge_gaps.py --detect
```

Docker execution (if containerized):

```bash
docker exec {app-container} make index
docker exec {api-container} python3 scripts/evaluate_rag.py
docker exec {api-container} python3 scripts/knowledge_gaps.py --detect
```

## Rules
- MUST chunk by document structure (headers, lists, code fences), not by fixed byte/token count — structure-aware chunking recovers 20-40% of retrieval quality on technical docs
- MUST always use hybrid search (BM25 + vector) for keyword-heavy queries — pure vector search misses exact identifiers (function names, config keys)
- NEVER trust a single embedding model on multilingual corpora; pair with a bilingual model or translate queries at the edge
- NEVER index without content-hash change detection — full rebuilds on every change waste embedding budget and corrupt orphan tracking (see the sketch after this list)
- CRITICAL: every response includes verifiable citations (source path + exact chunk). A RAG answer without traceable sources is a hallucination wearing a badge.
- MANDATORY: evaluate with a golden dataset (faithfulness, relevancy, context precision) before promoting any pipeline change to production
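
A sketch of content-hash change detection for the indexing rule above; the JSON manifest layout is an assumption, not an existing format:

```python
import hashlib
import json
from pathlib import Path

def changed_files(root: Path, manifest_path: Path) -> list[Path]:
    """Return only files whose content hash differs from the stored manifest."""
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    changed: list[Path] = []
    for path in sorted(root.rglob("*.md")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if manifest.get(str(path)) != digest:
            changed.append(path)        # re-embed only this file
            manifest[str(path)] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return changed
```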

## Gotchas
- Top-k cosine similarity is not relevance — semantically close chunks may be topically wrong. Always compare hybrid vs pure-vector scores on a held-out set before committing to one.
- Default embedding models (e.g., `text-embedding-ada-002`) underperform on long technical docs (>8k tokens). For long-form content consider chunking before embedding, not embedding then slicing.
- Chunk overlap (10-20%) helps narrative text but duplicates storage and token cost. Code and structured tables do not benefit from overlap — disable per content type.
- Cross-encoder rerankers (e.g., `bge-reranker`) add 100-300ms per query. For real-time UX, rerank only the top-20 candidates, not the top-100.
- RAG failure modes are structural (retrieval, routing, chunking), not prompt-level. Before "tuning the prompt", check retrieval metrics — a prompt fix on top of broken retrieval is theater.
- Query rewriting (HyDE, hypothetical doc generation) improves some queries and degrades others. A/B test before enabling globally; a blanket "always rewrite" often regresses simple lookups.

## When NOT to Load

- For executing a reindex: use `/index` (task skill)
- For measuring RAG quality: use `/evaluate` (task skill)
- For a documentation chunking strategy without an index: this skill assumes you already have a vector store; use `/architecture-decision` for pipeline choice
- For MCP-specific retrieval via `smart_query()`: the tool is already built; reach for this skill only when tuning the underlying index
- For prompt engineering alone, without retrieval concerns: use `/prompt-caching-patterns` or the relevant language skill