| name | optimizing-rag |
| description | Optimize RAG performance with reranking, caching, parallel processing, and production deployment patterns. Use when improving retrieval quality, adding rerankers, deploying to production, implementing caching strategies, or optimizing for scale and latency. |
Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns. Focus on quick wins and production-grade improvements.
Benefit: Brings any embedding model up to competitive retrieval quality
Effort: Minimal code change
from llama_index.postprocessor.cohere_rerank import CohereRerank
# For English content
reranker = CohereRerank(
top_n=5,
model="rerank-english-v3.0",
api_key="YOUR_COHERE_API_KEY"
)
# For Thai/multilingual content
reranker = CohereRerank(
top_n=5,
model="rerank-multilingual-v3.0",
api_key="YOUR_COHERE_API_KEY"
)
# Apply to query engine
query_engine = index.as_query_engine(
similarity_top_k=10, # Retrieve more candidates
node_postprocessors=[reranker] # Rerank to top 5
)
Best Practice: Retrieve more candidates than you need (up to 10x the final count), then rerank down to the final top_n
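The retrieve-then-rerank pattern can be sketched without any external service. This is a minimal illustration, not the Cohere or LlamaIndex implementation: `word_overlap` and `phrase_match` are hypothetical stand-ins for embedding similarity and a reranker model, chosen only so the example runs offline.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    similarity_top_k: int = 10,
    top_n: int = 5,
) -> List[str]:
    """Stage 1 scores the whole corpus cheaply; stage 2 rescores only the survivors."""
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:similarity_top_k]
    return sorted(
        candidates, key=lambda doc: precise_score(query, doc), reverse=True
    )[:top_n]


def word_overlap(query: str, doc: str) -> float:
    """Hypothetical stand-in for embedding similarity: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Hypothetical stand-in for a reranker: an exact phrase match outranks overlap."""
    return 10.0 if query.lower() in doc.lower() else word_overlap(query, doc)


corpus = [
    "land deed transfer fee",
    "transfer of land deed",
    "fee schedule",
    "deed land",
    "weather report",
]
result = retrieve_then_rerank(
    "land deed", corpus, word_overlap, phrase_match, similarity_top_k=3, top_n=2
)
```

The cheap stage cannot tell "deed land" from "land deed", so it passes both through; the precise stage is what drops the wrong-order document. Widening `similarity_top_k` gives the second stage more chances to recover a good document the first stage ranked poorly.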
Benefit: 391s → 31s for 32 PDF files
Effort: One parameter change
from llama_index.core import SimpleDirectoryReader
# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()
# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
num_workers=10 # Adjust based on CPU cores
)
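One way to pick `num_workers` rather than hard-coding 10 is to cap it by the available cores and the number of files. The helper below is a hypothetical sketch (`pick_num_workers` is not part of LlamaIndex); the right cap for a given machine still needs measuring.

```python
import os
from typing import Optional


def pick_num_workers(requested: int = 10, n_files: Optional[int] = None) -> int:
    """Cap the worker count by CPU cores and, if known, by the number of files."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    workers = min(requested, cores)
    if n_files is not None:
        workers = min(workers, n_files)  # more workers than files adds no speedup
    return max(workers, 1)


num_workers = pick_num_workers(requested=10, n_files=32)
```

The resulting value can be passed as `load_data(num_workers=num_workers)`.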
Benefit: Fewer API calls, better throughput
Effort: Configuration change
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
model="text-embedding-3-small",
embed_batch_size=100 # Up from default 10
)
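The effect of the batch size on request volume is simple arithmetic. The sketch below assumes one API request per batch, which is an assumption about how batching maps to requests rather than a statement about the OpenAI client internals.

```python
import math
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def request_count(n_texts: int, batch_size: int) -> int:
    """Number of requests needed, assuming one request per batch."""
    return math.ceil(n_texts / batch_size)


small_batches = request_count(1000, 10)
large_batches = request_count(1000, 100)
```

For 1,000 chunks, moving from a batch size of 10 to 100 cuts the request count by a factor of ten under this assumption.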
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
model="BAAI/bge-reranker-large",
top_n=5
)
query_engine = index.as_query_engine(
similarity_top_k=10,
node_postprocessors=[reranker]
)
Performance: OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR
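For reference, hit rate and MRR can be computed as follows. This is a minimal sketch assuming each query has exactly one expected document ID; the function name and input layout are illustrative and do not reproduce any particular evaluation library.

```python
from typing import List, Tuple


def hit_rate_and_mrr(
    ranked_ids_per_query: List[List[str]],
    expected_id_per_query: List[str],
) -> Tuple[float, float]:
    """Hit rate: share of queries whose expected ID appears in the results.
    MRR: mean of 1/rank of the expected ID (0 when it is missing)."""
    if not expected_id_per_query:
        raise ValueError("need at least one query")
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, expected in zip(ranked_ids_per_query, expected_id_per_query):
        if expected in ranked:
            hits += 1
            reciprocal_rank_sum += 1.0 / (ranked.index(expected) + 1)
    n = len(expected_id_per_query)
    return hits / n, reciprocal_rank_sum / n


# Three queries: expected doc at rank 1, at rank 2, and not retrieved at all.
hit_rate, mrr = hit_rate_and_mrr(
    [["a", "b"], ["c", "d"], ["e", "f"]],
    ["a", "d", "z"],
)
```

Hit rate only asks whether the right document was retrieved; MRR also rewards placing it near the top, which is why reranking tends to move MRR more than hit rate.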
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
# Create pipeline with transformations
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(chunk_size=512),
OpenAIEmbedding()
]
)
# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")
# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents) # Only processes new/changed docs
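The idea behind skipping unchanged inputs can be illustrated with a content-hash cache. This is a conceptual sketch only: `HashCache` is a hypothetical class, and LlamaIndex's actual cache keys and storage format may differ.

```python
import hashlib
from typing import Callable, Dict, List


class HashCache:
    """Caches transform results keyed by a SHA-256 hash of the input text."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def run(self, texts: List[str], transform: Callable[[str], str]) -> List[str]:
        results = []
        for text in texts:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key not in self._store:  # only new or changed text is processed
                self._store[key] = transform(text)
            results.append(self._store[key])
        return results


processed: List[str] = []


def expensive_transform(text: str) -> str:
    """Stand-in for chunking plus embedding; records which inputs were processed."""
    processed.append(text)
    return text.upper()


cache = HashCache()
first = cache.run(["doc a", "doc b"], expensive_transform)
second = cache.run(["doc a", "doc b", "doc c"], expensive_transform)
```

On the second run only "doc c" reaches the transform. Editing a document changes its hash, so an edited document is reprocessed while the stale entry stays in the store until it is cleared.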
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore
# Distributed caching
ingest_cache = IngestionCache(
cache=RedisKVStore.from_host_and_port(
host="redis-server",
port=6379
),
collection="rag_pipeline_cache"
)
pipeline = IngestionPipeline(
transformations=[...],
cache=ingest_cache
)
# Cache shared across instances
nodes = pipeline.run(documents=documents)
Metadata Pre-filtering (Sub-50ms):
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
# Filter before vector search (90% reduction possible)
filters = MetadataFilters(
filters=[
ExactMatchFilter(key="category", value="technical"),
ExactMatchFilter(key="year", value="2024")
]
)
query_engine = index.as_query_engine(
filters=filters, # Narrow search space first
similarity_top_k=5
)
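Conceptually, pre-filtering shrinks the candidate set before any similarity scoring happens. The sketch below uses plain dictionaries as nodes, which is an assumption made for illustration; it does not show how a vector store applies filters internally.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]


def prefilter(nodes: List[Node], required: Dict[str, Any]) -> List[Node]:
    """Keep only nodes whose metadata equals every required key/value pair."""
    return [
        node
        for node in nodes
        if all(node["metadata"].get(key) == value for key, value in required.items())
    ]


nodes = [
    {"text": "GPU tuning guide", "metadata": {"category": "technical", "year": "2024"}},
    {"text": "Holiday schedule", "metadata": {"category": "hr", "year": "2024"}},
    {"text": "Legacy API notes", "metadata": {"category": "technical", "year": "2019"}},
]
candidates = prefilter(nodes, {"category": "technical", "year": "2024"})
```

Exact-match filters compare values as stored, so a year indexed as the integer 2024 would not match the string "2024" in this sketch. Keeping metadata types consistent between ingestion and query avoids silently empty results.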
Document Summary Retrieval (For 100+ docs):
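from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
# Synthesizer that writes one summary per document at index time
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
# Two-stage: document-level → chunk-level
summary_index = DocumentSummaryIndex.from_documents(
documents,
response_synthesizer=response_synthesizer
)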
# similarity_top_k applies to the embedding-based retriever mode
retriever = summary_index.as_retriever(retriever_mode="embedding", similarity_top_k=3)
Chunk Decoupling (Precision + Context):
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
# Embed sentences, retrieve with context windows
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=3, # Sentences before/after
window_metadata_key="window"
)
# Replace with window for synthesis
postprocessor = MetadataReplacementPostProcessor(
target_metadata_key="window"
)
query_engine = index.as_query_engine(
node_postprocessors=[postprocessor],
similarity_top_k=6
)
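What the sentence-window approach stores per node can be shown in a few lines. This is a simplified sketch assuming the text is already split into sentences; `build_sentence_windows` is a hypothetical helper, not the `SentenceWindowNodeParser` implementation.

```python
from typing import Dict, List


def build_sentence_windows(
    sentences: List[str], window_size: int = 3
) -> List[Dict[str, str]]:
    """Pair each sentence (the text to embed) with its surrounding window (the text to synthesize from)."""
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({"text": sentence, "window": " ".join(sentences[start:end])})
    return nodes


sentences = ["A.", "B.", "C.", "D.", "E."]
window_nodes = build_sentence_windows(sentences, window_size=1)
```

The single sentence keeps the embedding precise, while the window restores enough context for the LLM to answer. Swapping `text` for `window` after retrieval is the step the metadata replacement postprocessor performs.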
from llama_index.core.postprocessor import SimilarityPostprocessor
# Progressive refinement
query_engine = index.as_query_engine(
similarity_top_k=20,
node_postprocessors=[
SimilarityPostprocessor(similarity_cutoff=0.7), # Filter low scores
CohereRerank(top_n=10), # Rerank top candidates
MetadataReplacementPostProcessor(...) # Expand context
]
)
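The ordering of stages matters: cheap filters run first so the expensive reranker sees fewer nodes. The sketch below models each stage as a plain function over a list of dictionaries; the `score` and `rerank` fields are hypothetical stand-ins for the similarity score and the reranker score.

```python
from typing import Callable, Dict, List

Node = Dict[str, object]
Stage = Callable[[List[Node]], List[Node]]


def similarity_cutoff(cutoff: float) -> Stage:
    """Drop nodes whose first-stage similarity score is below the cutoff."""
    return lambda nodes: [n for n in nodes if n["score"] >= cutoff]


def rerank_top_n(top_n: int) -> Stage:
    """Order nodes by the (stand-in) reranker score and keep the best top_n."""
    return lambda nodes: sorted(nodes, key=lambda n: n["rerank"], reverse=True)[:top_n]


def run_stages(nodes: List[Node], stages: List[Stage]) -> List[Node]:
    """Apply each stage in order, feeding its output to the next."""
    for stage in stages:
        nodes = stage(nodes)
    return nodes


nodes = [
    {"id": "a", "score": 0.91, "rerank": 0.40},
    {"id": "b", "score": 0.75, "rerank": 0.95},
    {"id": "c", "score": 0.65, "rerank": 0.99},
    {"id": "d", "score": 0.72, "rerank": 0.60},
]
final = run_stages(nodes, [similarity_cutoff(0.7), rerank_top_n(2)])
```

Node "c" has the best reranker score but never reaches the reranker because the 0.7 cutoff removes it first. A cutoff that is too aggressive trades recall for speed, so it is worth checking how many nodes survive it on real queries.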
src/ Pipeline
Add Reranking to All 7 Strategies:
src/10_basic_query_engine.py → Add CohereRerank
src/16_hybrid_search.py → Add reranker after fusion
Enable Caching:
src/09_enhanced_batch_embeddings.py → Add pipeline caching
src/02_prep_doc_for_embedding.py → Cache preprocessing
src-iLand/ Pipeline
Thai-Optimized Reranking:
# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
top_n=5,
model="rerank-multilingual-v3.0" # Thai support
)
Fast Metadata Filtering (Already implemented):
src-iLand/retrieval/fast_metadata_index.py → Sub-50ms filtering
Batch Processing Optimization:
src-iLand/docs_embedding/batch_embedding.py → Increase batch_size
src-iLand/data_processing/ → Enable parallel loading
Load these when you need comprehensive details:
reference-reranking.md: Complete reranking guide
reference-production.md: Production optimization patterns
reference-advanced-retrieval.md: Advanced retrieval strategies
Step 1: Choose reranker
See reference-reranking.md for comparison
Step 2: Install dependencies
pip install llama-index-postprocessor-cohere-rerank
# OR
pip install llama-index-postprocessor-flag-embedding-reranker
Step 3: Update query engine
Increase similarity_top_k from 5 → 10
Set top_n=5
Step 4: Test impact
Step 5: Deploy to all strategies
Step 1: Choose caching backend
Step 2: Wrap pipeline
pipeline = IngestionPipeline(
transformations=[splitter, embedder],
# cache=... (local or Redis)
)
Step 3: Initial run (builds cache)
nodes = pipeline.run(documents=documents)
pipeline.persist("./cache") # Local only
Step 4: Subsequent runs (uses cache)
Step 5: Monitor cache size
Step 1: Review production checklist
See reference-production.md for full guide
Step 2: Implement error handling
Step 3: Add monitoring
Step 4: Set up caching
Step 5: Deploy with redundancy
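For the error-handling step, a common starting point is retrying transient failures with exponential backoff. The helper below is a generic sketch, not taken from reference-production.md; `call_with_retry` and its parameters are hypothetical, and the `sleep` argument exists only so the delays can be observed in tests.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `retries` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:  # in production, catch only transient error types
            last_error = exc
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"failed after {retries} attempts") from last_error


attempts: List[int] = []
delays: List[float] = []


def flaky_query() -> str:
    """Stand-in for a reranker or LLM call that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


answer = call_with_retry(flaky_query, retries=3, base_delay=0.5, sleep=delays.append)
```

Wrapping the reranker call this way lets a query fall back to the un-reranked results when the final `RuntimeError` is raised, instead of failing the whole request.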
Step 1: Add metadata filtering
Step 2: Implement document summaries
Step 3: Enable fast metadata indexing
src-iLand/retrieval/fast_metadata_index.py
Step 4: Use async operations
retriever = index.as_retriever(use_async=True)
Step 5: Monitor and tune
| Embedding | Without Rerank | With Cohere Rerank | Improvement |
|---|---|---|---|
| OpenAI | 0.870 hit rate | 0.927 hit rate | +6.6% |
| JinaAI Base | 0.880 hit rate | 0.933 hit rate | +6.0% |
| bge-large | 0.820 hit rate | 0.876 hit rate | +6.8% |
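The Improvement column is a relative gain over the baseline, not a difference in percentage points. The short check below reproduces the column from the hit rates in the table.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100


rows = {
    "OpenAI": (0.870, 0.927),
    "JinaAI Base": (0.880, 0.933),
    "bge-large": (0.820, 0.876),
}
improvements = {
    name: round(relative_improvement(before, after), 1)
    for name, (before, after) in rows.items()
}
```

In absolute terms each row gains between 5.3 and 5.7 points of hit rate.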
Reranking Best Practices:
Caching Cautions:
Production Essentials:
This skill includes utility scripts in the scripts/ directory:
Validates RAG configuration before deployment:
python .claude/skills/optimizing-rag/scripts/validate_config.py \
--config-file ./config.yaml
Checks:
Measures retrieval performance:
python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
--index-path ./index \
--queries-file ./test_queries.txt
Reports:
After optimizing:
Use the evaluating-rag skill to measure improvements with hit rate and MRR