| name | rlama |
| description | Build and query fully local RAG knowledge bases from documents (PDF, MD, code, etc.) using RLAMA and Ollama — no cloud, no data leaving the machine. Triggers on 'local knowledge base', 'search documents', 'document Q&A', 'RAG query', 'ingest files', 'semantic search'. |
| allowed-tools | Bash(rlama:*), Read |
RLAMA - Local RAG System
RLAMA (Retrieval-Augmented Language Model Adapter) provides fully local, offline RAG for semantic search over your documents.
When to Use This Skill
- Building knowledge bases from local documents
- Searching personal notes, research papers, or code documentation
- Document-based Q&A without sending data to the cloud
- Indexing project documentation for quick semantic lookup
- Creating searchable archives of PDFs, markdown, or code files
Prerequisites
RLAMA requires Ollama running locally:
ollama list                   # verify Ollama is responding and which models are pulled
brew services start ollama    # start Ollama (macOS/Homebrew) if it is not running
Quick Reference
Query a RAG (Default: Retrieve-Only)
Always use retrieve-only mode by default. Claude synthesizes far better answers than local 7B models. The raw chunks give Claude direct evidence to reason over and cite.
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query"                   # default: top chunks as plain text
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" -k 20             # retrieve more chunks
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --json            # machine-readable output
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --rebuild-cache   # force the embedding cache to rebuild
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py --list                                    # list indexed collections
First run per collection builds an embedding cache (~60s for 4K chunks). Subsequent queries are <1s.
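Why the first run is slow and later queries are not: each chunk only needs to be embedded once. The sketch below illustrates that pattern; the cache location and hash-keyed layout are assumptions for illustration, not rlama_retrieve.py's actual internals.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path.home() / ".rlama" / "cache"        # assumed location, for illustration only

def cached_embedding(chunk_text: str, embed_fn) -> list[float]:
    """Return a cached embedding if present, otherwise compute and store it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():                                # cache hit: the <1s path
        return json.loads(path.read_text())
    vector = embed_fn(chunk_text)                    # cache miss: the slow first-run path
    path.write_text(json.dumps(vector))
    return vector
```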
Local LLM Query (Fallback Only)
Use rlama run only when Claude is not in the loop (e.g., standalone CLI usage, cron jobs, scripts):
rlama run <rag-name> --query "your question here"
rlama run <rag-name> --query "explain the authentication flow" --context-size 30
rlama run <rag-name> --query "what are the API endpoints?" --show-context
Script wrapper for cleaner output:
python3 ~/.claude/skills/rlama/scripts/rlama_query.py <rag-name> "your query"
python3 ~/.claude/skills/rlama/scripts/rlama_query.py my-docs "what is the main idea?" --show-sources
External LLM synthesis (optional): retrieve chunks and synthesize in one step via OpenRouter, TogetherAI, Ollama, or any OpenAI-compatible endpoint:
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --synth-model anthropic/claude-sonnet-4
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --provider togetherai
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --provider ollama
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag-name> "your query" --synthesize --endpoint https://my-api.com/v1/chat/completions
Environment variables for synthesis:
| Variable | Provider |
|---|---|
| OPENROUTER_API_KEY | OpenRouter (default, auto-detected first) |
| TOGETHER_API_KEY | TogetherAI |
| SYNTH_API_KEY | Custom endpoint (via --endpoint) |
| (none needed) | Ollama (local, no auth) |
Provider auto-detection: model names with / → OpenRouter, otherwise → TogetherAI. Falls back to whichever API key is set.
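The heuristic is simple enough to sketch directly. The function name, return values, and exact precedence between the model name and available API keys below are assumptions for illustration, not the script's internals:

```python
import os
from typing import Optional

def pick_provider(model: Optional[str], explicit: Optional[str] = None) -> str:
    """Choose a synthesis provider using the heuristic described above."""
    if explicit:                                  # an explicit --provider always wins
        return explicit
    if model and "/" in model:                    # e.g. anthropic/claude-sonnet-4
        return "openrouter"
    if model:                                     # bare model name, e.g. a TogetherAI model id
        return "togetherai"
    # No model given: fall back to whichever API key is set, OpenRouter first.
    if os.environ.get("OPENROUTER_API_KEY"):
        return "openrouter"
    if os.environ.get("TOGETHER_API_KEY"):
        return "togetherai"
    return "ollama"                               # local, no key required
```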
Quality tiers:
| Tier | Method | Quality | Latency | Default? |
|---|---|---|---|---|
| Best | Retrieve-only → Claude synthesizes | Strongest synthesis | ~1s retrieve | YES |
| Good | --synthesize --synth-model anthropic/claude-sonnet-4 | Strong, cited | ~3s | |
| Decent | --synthesize --provider togetherai (Llama 70B) | Solid for factual | ~2s | |
| Reasoning | --synthesize --reasoning (Qwen 3.5 9B) | Strong local, cited | ~8s | |
| Local | --synthesize --provider ollama (Qwen 2.5 7B) | Basic, may hedge | ~5s | |
| Baseline | rlama_query.py (RLAMA built-in) | Weakest, no prompt control | ~3s | |
Small local models (7B) use a tuned prompt optimized for Qwen (structured output, anti-hedge, domain-keyword aware). Cloud providers use a strict research-grade prompt with mandatory citations. Reasoning mode (--reasoning) uses qwen3.5:9b with the strict prompt and 4096 max tokens—best local option for complex cross-document synthesis.
First run builds an embedding cache (~30s for 3K chunks, ~10min for 25K chunks). Subsequent queries are <1s. Large RAGs use incremental checkpointing—if Ollama crashes mid-build, re-run to resume from the last checkpoint. Individual chunks are truncated to 5K chars to stay within the embedding model's context window.
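A conceptual sketch of the truncate-and-checkpoint behaviour described above; the checkpoint file and interval below are assumptions, not the script's actual internals:

```python
import json
from pathlib import Path

MAX_CHUNK_CHARS = 5000           # stay inside the embedding model's context window
CHECKPOINT_EVERY = 500           # assumed interval, for illustration

def embed_with_checkpoints(chunks, embed_fn, checkpoint_path: Path):
    """Embed chunks, persisting partial progress so a crashed build can resume."""
    done = json.loads(checkpoint_path.read_text()) if checkpoint_path.exists() else {}
    for i, chunk in enumerate(chunks):
        key = str(i)
        if key in done:                          # already embedded before the crash
            continue
        done[key] = embed_fn(chunk[:MAX_CHUNK_CHARS])   # truncate oversized chunks
        if (i + 1) % CHECKPOINT_EVERY == 0:
            checkpoint_path.write_text(json.dumps(done))
    checkpoint_path.write_text(json.dumps(done))
    return [done[str(i)] for i in range(len(chunks))]
```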
Benchmarking:
python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --retrieval-only
python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --provider ollama --verbose
python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --provider ollama --case 0
python3 ~/.claude/skills/rlama/scripts/rlama_bench.py <rag-name> --provider ollama --json
Scores: retrieval precision, topic coverage, grounding, directness (anti-hedge), composite (0-100).
Create a RAG
Index documents from a folder into a new RAG system:
rlama rag llama3.2 <rag-name> <folder-path>
rlama rag llama3.2 my-notes ~/Notes
rlama rag llama3.2 project-docs ./docs
rlama rag llama3.2 research-papers ~/Papers
rlama rag llama3.2 codebase ./src --exclude-dir=node_modules,dist,.git --exclude-ext=.log,.tmp
rlama rag llama3.2 markdown-docs ./docs --process-ext=.md,.txt
rlama rag llama3.2 my-rag ./docs --chunking=semantic --chunk-size=1500 --chunk-overlap=300
Chunking strategies (the sketch below shows how chunk size and overlap interact):
- hybrid (default) - combines semantic and fixed chunking
- semantic - respects document structure (paragraphs, sections)
- fixed - fixed-size character chunks
- hierarchical - preserves document hierarchy
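To make chunk-size and chunk-overlap concrete, here is a minimal fixed-strategy chunker. It illustrates the idea only; the function name is made up and this is not RLAMA's implementation:

```python
def fixed_chunks(text: str, chunk_size: int = 1500, overlap: int = 300) -> list[str]:
    """Split text into fixed-size character chunks with a sliding overlap,
    mirroring --chunking=fixed --chunk-size=1500 --chunk-overlap=300."""
    step = chunk_size - overlap        # each chunk starts 1200 chars after the previous one
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]

# A 4000-character document yields chunks starting at offsets 0, 1200, 2400 and 3600.
```

Semantic and hierarchical chunking replace the fixed step with boundaries taken from the document structure.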
List RAG Systems
rlama list
rlama list-docs <rag-name>
rlama list-chunks <rag-name> --document=filename.pdf
Manage Documents
Add documents to existing RAG:
rlama add-docs <rag-name> <folder-or-file>
rlama add-docs my-notes ~/Notes/new-notes
rlama add-docs research ./papers/new-paper.pdf
Remove a document:
rlama remove-doc <rag-name> <document-id>
rlama remove-doc my-notes old-note.md
rlama remove-doc research outdated-paper.pdf
rlama remove-doc my-notes old-note.md --force
Delete a RAG
rlama delete <rag-name>
rm -rf ~/.rlama/<rag-name>    # or remove the stored data directly (default data dir)
Advanced Features
Web Crawling
Create a RAG from website content:
rlama crawl-rag llama3.2 docs-rag https://docs.example.com
rlama crawl-add-docs my-rag https://blog.example.com
Directory Watching
Automatically update RAG when files change:
rlama watch <rag-name> <folder-path>
rlama check-watched <rag-name>
rlama watch-off <rag-name>
Website Watching
Monitor websites for content updates:
rlama web-watch <rag-name> https://docs.example.com
rlama check-web-watched <rag-name>
rlama web-watch-off <rag-name>
Reranking
Improve result relevance with reranking:
rlama add-reranker <rag-name>
rlama update-reranker <rag-name> --reranker-weight=0.8
rlama rag llama3.2 my-rag ./docs --disable-reranker
API Server
Run RLAMA as an API server for programmatic access:
rlama api --port 11249
curl -X POST http://localhost:11249/rag \
-H "Content-Type: application/json" \
-d '{
"rag_name": "my-docs",
"prompt": "What are the key points?",
"context_size": 20
}'
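The same request from Python, useful when embedding RLAMA in another tool. This assumes the server started above is running on its default port; the response fields depend on the RLAMA version, so inspect resp.json():

```python
import requests

resp = requests.post(
    "http://localhost:11249/rag",
    json={
        "rag_name": "my-docs",
        "prompt": "What are the key points?",
        "context_size": 20,
    },
    timeout=120,            # local generation can take a while
)
resp.raise_for_status()
print(resp.json())          # field names depend on the RLAMA version
```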
Model Management
rlama update-model <rag-name> <new-model>              # switch the model used by an existing RAG
rlama update-model my-rag deepseek-r1:8b
rlama rag hf.co/username/repo my-rag ./docs            # use a Hugging Face model
rlama rag hf.co/username/repo:Q4_K_M my-rag ./docs     # pin a specific quantization
export OPENAI_API_KEY="your-key"                       # or use an OpenAI-hosted model
rlama rag gpt-4-turbo my-openai-rag ./docs
Configuration
Data Directory
By default, RLAMA stores data in ~/.rlama/. Change this with --data-dir:
rlama --data-dir=/path/to/custom list
rlama --data-dir=/projects/rag-data rag llama3.2 project-rag ./docs
export RLAMA_DATA_DIR="/path/to/custom"
Ollama Configuration
rlama --host=192.168.1.100 --port=11434 run my-rag
export OLLAMA_HOST="http://192.168.1.100:11434"
Default Model
The skill uses qwen2.5:7b by default (changed from llama3.2 in Jan 2026). To use the legacy default or another model explicitly:
python3 ~/.claude/skills/rlama/scripts/rlama_manage.py create my-rag ./docs --legacy
rlama rag deepseek-r1:8b my-rag ./docs
rlama run my-rag --query "question" -m deepseek-r1:8b
Recommended models:
| Model | Size | Best For |
|---|---|---|
| qwen2.5:7b | 7B | Default: fast RAG queries (recommended) |
| qwen3.5:9b | 9B | Reasoning mode: deeper synthesis, strict citations (--reasoning) |
| llama3.2 | 3B | Fast, legacy default (use --legacy) |
| deepseek-r1:8b | 8B | Complex questions |
| llama3.3:70b | 70B | Highest quality (slow) |
Reasoning mode (--reasoning flag) uses qwen3.5:9b for local Ollama synthesis with the strict research-grade prompt (normally reserved for cloud providers). This gives research-quality cited answers without leaving the machine. Override the model via RLAMA_REASONING_MODEL env var.
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag> "complex query" --synthesize --reasoning
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag> "complex query" --synthesize --reasoning --think
python3 ~/.claude/skills/rlama/scripts/rlama_retrieve.py <rag> "query" --synthesize --provider ollama --synth-model qwen3.5:9b
| Flag | Model | Think | Prompt | Max Tokens | Timeout |
|---|---|---|---|---|---|
| (default) | qwen2.5:7b | off | light (anti-hedge) | 2048 | 120s |
| --reasoning | qwen3.5:9b | off | strict (cited) | 4096 | 300s |
| --reasoning --think | qwen3.5:9b | on | strict (cited) | 4096 | 300s |
Thinking mode produces internal chain-of-thought reasoning before the answer. The thinking text is included in JSON output (synthesis.thinking field) but not printed in plain text mode. Use for ambiguous cross-document analysis where you want to see the model's working.
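To capture the working programmatically, run with --json and read the synthesis.thinking field. The sketch below assumes the script prints a single JSON document to stdout with a top-level synthesis object, as implied above; other field names are not guaranteed:

```python
import json
import subprocess
from pathlib import Path

script = Path.home() / ".claude/skills/rlama/scripts/rlama_retrieve.py"
out = subprocess.run(
    ["python3", str(script), "my-rag", "complex cross-document query",
     "--synthesize", "--reasoning", "--think", "--json"],
    capture_output=True, text=True, check=True,
).stdout

synthesis = json.loads(out).get("synthesis", {})
print(synthesis.get("thinking", "(no thinking captured)"))   # the model's working
```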
Supported File Types
RLAMA indexes these formats:
- Text: .txt, .md, .markdown
- Documents: .pdf, .docx, .doc
- Code: .py, .js, .ts, .go, .rs, .java, .rb, .cpp, .c, .h
- Data: .json, .yaml, .yml, .csv
- Web: .html, .htm
- Org-mode: .org
Example Workflows
Personal Knowledge Base
rlama rag llama3.2 personal-kb ~/Documents
rlama add-docs personal-kb ~/Notes
rlama add-docs personal-kb ~/Downloads/papers
rlama run personal-kb --query "what did I write about project management?"
Code Documentation
rlama rag llama3.2 project-docs ./docs ./README.md
rlama run project-docs --query "how does authentication work?" --context-size 25
Research Papers
rlama rag llama3.2 papers ~/Papers --exclude-ext=.bib
rlama add-docs papers ./new-paper.pdf
rlama run papers --query "what methods are used for evaluation?" --context-size 30
Interactive Wizard
For guided RAG creation:
rlama wizard
Resilient Indexing (Skip Problem Files)
For folders with mixed content where some files may exceed embedding context limits (e.g., large PDFs), use the resilient script that processes files individually and skips failures:
python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py create my-rag ~/Documents
python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py add my-rag ~/MoreDocs
python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py create research ~/Papers --docs-only
python3 ~/.claude/skills/rlama/scripts/rlama_resilient.py create my-rag ~/Docs --legacy
The script reports which files were added and which were skipped due to errors.
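The core pattern is to add one file at a time and keep going on failure. A minimal sketch of that approach using the rlama add-docs command (the real rlama_resilient.py also handles --docs-only, --legacy, and reporting):

```python
import subprocess
from pathlib import Path

def resilient_add(rag_name: str, folder: str):
    """Add files individually so one oversized or unreadable file cannot abort the run."""
    added, skipped = [], []
    for path in sorted(p for p in Path(folder).rglob("*") if p.is_file()):
        result = subprocess.run(["rlama", "add-docs", rag_name, str(path)],
                                capture_output=True, text=True)
        (added if result.returncode == 0 else skipped).append(path)
    print(f"added {len(added)} files, skipped {len(skipped)}")
    return added, skipped
```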
Progress Monitoring
Monitor long-running RLAMA operations in real-time using the logging system.
Tail the Log File
tail -f ~/.rlama/logs/rlama.log
tail -f ~/.rlama/logs/rlama.log | grep my-rag
tail -f ~/.rlama/logs/rlama.log | jq -r '"\(.ts) [\(.cat)] \(.msg)"'
tail -f ~/.rlama/logs/rlama.log | jq -r 'select(.data.i) | "\(.ts) [\(.cat)] \(.data.i)/\(.data.total) \(.data.file // .data.status)"'
Check Operation Status
python3 ~/.claude/skills/rlama/scripts/rlama_status.py
python3 ~/.claude/skills/rlama/scripts/rlama_status.py --recent
python3 ~/.claude/skills/rlama/scripts/rlama_status.py --all
python3 ~/.claude/skills/rlama/scripts/rlama_status.py --follow
python3 ~/.claude/skills/rlama/scripts/rlama_status.py --json
Log File Format
Logs are written in JSON Lines format to ~/.rlama/logs/rlama.log:
{"ts": "2026-02-03T12:34:56.789", "level": "info", "cat": "INGEST", "msg": "Progress 45/100", "data": {"op_id": "ingest_abc123", "i": 45, "total": 100, "file": "doc.pdf", "eta_sec": 85}}
Operations State
Active and recent operations are tracked in ~/.rlama/logs/operations.json:
{
"active": {
"ingest_abc123": {
"type": "ingest",
"rag_name": "my-docs",
"started": "2026-02-03T12:30:00",
"processed": 45,
"total": 100,
"eta_sec": 85
}
},
"recent": [...]
}
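For programmatic checks without the status script, the state file can be read directly; the fields below match the structure shown above:

```python
import json
from pathlib import Path

state = json.loads((Path.home() / ".rlama" / "logs" / "operations.json").read_text())
for op_id, op in state.get("active", {}).items():
    pct = 100 * op["processed"] / max(op["total"], 1)
    print(f"{op_id}: {op['type']} {op['rag_name']} "
          f"{op['processed']}/{op['total']} ({pct:.0f}%), eta {op.get('eta_sec', '?')}s")
```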
Troubleshooting
"Ollama not found"
ollama --version              # confirm Ollama is installed
ollama list                   # confirm the server responds
brew services start ollama    # start the service (macOS/Homebrew)
ollama serve                  # or run the server in the foreground
"Model not found"
ollama pull llama3.2          # generation model
ollama pull nomic-embed-text  # embedding model
Slow Indexing
- Use smaller embedding models
- Exclude large binary files: --exclude-ext=.bin,.zip,.tar
- Exclude build directories: --exclude-dir=node_modules,dist,build
Poor Query Results
- Increase context size: --context-size=30
- Use a better model: -m deepseek-r1:8b
- Re-index with semantic chunking: --chunking=semantic
- Enable reranking: rlama add-reranker <rag-name>
Index Corruption
rm -rf ~/.rlama/<rag-name>                     # remove the corrupted index
rlama rag llama3.2 <rag-name> <folder-path>    # rebuild it from the source folder
CLI Reference
Full command reference available at:
rlama --help
rlama <command> --help
Or see references/rlama-commands.md for complete documentation.