rag-chatbot
Guide for building RAG (retrieval-augmented generation) chatbots and Q&A systems with Elasticsearch. Use when a developer wants to build a chatbot, Q&A system, or AI assistant that answers questions from their own data.
| name | rag-chatbot |
| description | Guide for building RAG (retrieval-augmented generation) chatbots and Q&A systems with Elasticsearch. Use when a developer wants to build a chatbot, Q&A system, or AI assistant that answers questions from their own data. |
Guide developers through building retrieval-augmented generation (RAG) systems with Elasticsearch as the retrieval backend. Use this guide when they want a chatbot, Q&A interface, or AI assistant that answers from their own documents.
This skill provides deep implementation detail for RAG and chatbot patterns. It is not the main conversation driver.
After applying the guidance here, re-read /elasticsearch-onboarding to resume the structured onboarding playbook (Steps 1–7: intent → data → mapping → build → test → iterate). That playbook controls sequencing, the one-question-at-a-time rule, and the Dev Tools API-snippet workflow. If /elasticsearch-onboarding has not been loaded yet in this conversation, load it now — it is the primary conversation flow for all Elasticsearch search onboarding.
Apply this guide when the developer wants generated answers grounded in their own data. Do not use it when the developer only needs search results (not generated answers); point them to keyword-search or vector-hybrid-search instead.
Verify models before recommending: Check the latest Elastic docs before recommending embedding models or LLMs. Elastic offers managed models via EIS (Elastic Inference Service) — the developer may not need an external API key. Jina v3 is the current default embedding model for semantic_text on EIS; Jina v5-small is available for cost-sensitive workloads. EIS also provides managed rerankers (Jina Reranker v2/v3). Check https://www.elastic.co/docs/explore-analyze/elastic-inference/eis for current models.
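RAG has four stages: chunk the source documents, index the chunks with embeddings, retrieve the chunks most relevant to a question, and generate an answer from them with an LLM. The query-time flow: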
 User Question
       │
       ▼
┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│ Embed query │────▶│  Elasticsearch   │────▶│ LLM (GPT,   │
│             │     │  kNN retrieval   │     │ Claude, etc)│
└─────────────┘     └──────────────────┘     └─────────────┘
                             │                      │
                        Top-k chunks        Answer + sources
Chunking strategy depends on document structure. Ask the developer about their content.
| Strategy | When to Use | Chunk Size |
|---|---|---|
| Fixed-size | Uniform text, no clear sections | 500-1000 tokens |
| Paragraph-based | Well-structured docs with natural breaks | 1 paragraph per chunk |
| Section-based | Documents with headers (H1/H2/H3) | 1 section per chunk |
| Recursive | Mixed content, need flexibility | Set on the splitter (for example, LangChain's RecursiveCharacterTextSplitter) |
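The section-based row of the table can be sketched in a few lines of Python. This is a minimal illustration, assuming the source documents are Markdown with `#`-style headers; the function name `chunk_by_section` is hypothetical, and the `section_header` key is chosen to line up with the field of the same name in the index mapping. It does not handle `#` characters inside code fences.

```python
import re

# Matches Markdown H1-H3 headers such as "# Title" or "### Details".
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def chunk_by_section(text: str) -> list[dict]:
    """Split Markdown text into one chunk per H1/H2/H3 section."""
    chunks = []
    header = ""  # text before the first header gets an empty header
    lines = []

    def flush():
        # Emit the section collected so far, skipping empty sections.
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"section_header": header, "content": body})

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

Very long sections may still need a second pass with a fixed-size or recursive splitter so that each chunk stays within the embedding model's input limit.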
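Important considerations: overlap adjacent chunks so that text cut at a boundary still appears whole in one chunk, and keep source metadata (title, URL, parent document ID) on every chunk so answers can cite where they came from.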
Python chunking example:
def chunk_documents(documents: list[dict], chunk_size: int = 500, overlap: int = 100) -> list[dict]:
    """Split documents into overlapping chunks with metadata.

    Sizes are counted in whitespace-separated words, not model tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for doc in documents:
        words = doc["content"].split()
        for chunk_index, start in enumerate(range(0, len(words), step)):
            chunks.append({
                "content": " ".join(words[start:start + chunk_size]),
                "source_title": doc.get("title", ""),
                "source_url": doc.get("url", ""),
                "chunk_index": chunk_index,  # position within the parent document
                "parent_doc_id": doc.get("id", ""),
            })
            if start + chunk_size >= len(words):
                break  # the rest of the document is already covered
    return chunks
LangChain chunking:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # measured in characters by default
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
# `documents` must be a list of LangChain Document objects
chunks = splitter.split_documents(documents)
Store chunk text, embedding, and metadata for retrieval and source citation. The dims value must match the output dimension of the embedding model.
PUT /knowledge-base
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      },
      "source_title": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "source_url": { "type": "keyword" },
      "section_header": { "type": "text" },
      "parent_doc_id": { "type": "keyword" },
      "chunk_index": { "type": "integer" },
      "created_at": { "type": "date" }
    }
  }
}
Use an ingest pipeline to embed chunks at index time.
PUT _ingest/pipeline/embed-knowledge-base
{
  "processors": [
    {
      "inference": {
        "model_id": "e5-multilingual",
        "input_output": [
          {
            "input_field": "content",
            "output_field": "embedding"
          }
        ]
      }
    }
  ]
}
Bulk index chunks:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(cloud_id="...", api_key="...")

def index_chunks(chunks: list[dict]) -> tuple[int, list]:
    actions = [
        {"_index": "knowledge-base", "_source": chunk, "pipeline": "embed-knowledge-base"}
        for chunk in chunks
    ]
    return helpers.bulk(es, actions, raise_on_error=False, raise_on_exception=False)
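Because the bulk call above is made with `raise_on_error=False`, failures come back in the return value instead of raising, so they are easy to miss. The sketch below shows one way to summarize that return value. The helper name `summarize_bulk_result` is hypothetical, and the shape of each error item (a dict keyed by action type with `status` and `error` entries) is an assumption about what the Python client returns; inspect a real failure before relying on it.

```python
from collections import Counter


def summarize_bulk_result(result: tuple[int, list]) -> dict:
    """Summarize the (success_count, errors) tuple returned by helpers.bulk."""
    success_count, errors = result
    reasons = Counter()
    for item in errors:
        # Assumed shape: {"index": {"status": 400, "error": {"type": "..."}}}.
        for details in item.values():
            error = details.get("error", {})
            # The error may be a structured dict or a plain string.
            error_type = error.get("type", "unknown") if isinstance(error, dict) else str(error)
            reasons[error_type] += 1
    return {"indexed": success_count, "failed": len(errors), "by_type": dict(reasons)}
```

A typical use would be `summarize_bulk_result(index_chunks(chunks))`, logging the summary and stopping the ingest if `failed` is non-zero.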
GET /knowledge-base/_search
{
  "knn": {
    "field": "embedding",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "e5-multilingual",
        "model_text": "How do I configure index mappings?"
      }
    },
    "k": 5,
    "num_candidates": 50
  },
  "_source": ["content", "source_title", "source_url", "section_header"]
}
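Combine keyword and semantic retrieval for more robust results. Reciprocal rank fusion (RRF) merges separate ranked result sets, so the keyword query and the kNN search are expressed as two retrievers rather than as clauses of one bool query (retriever syntax, available in recent Elasticsearch versions):
POST /knowledge-base/_search
{
  "size": 5,
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": { "match": { "content": "configure index mappings" } }
          }
        },
        {
          "knn": {
            "field": "embedding",
            "query_vector_builder": {
              "text_embedding": {
                "model_id": "e5-multilingual",
                "model_text": "How do I configure index mappings?"
              }
            },
            "k": 5,
            "num_candidates": 50
          }
        }
      ]
    }
  },
  "_source": ["content", "source_title", "source_url"]
}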
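To make the fusion step of the hybrid request concrete, here is a small client-side sketch of reciprocal rank fusion. It is illustrative only: Elasticsearch performs this server-side, and the function name `rrf_fuse` and the sample document IDs are hypothetical. Each document scores the sum of `1 / (rank_constant + rank)` over every list it appears in, with ranks starting at 1; 60 is used here as the rank constant.

```python
def rrf_fuse(rankings: list[list[str]], rank_constant: int = 60, top_n: int = 5) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


keyword_hits = ["a", "b", "c"]   # hypothetical keyword ranking
vector_hits = ["c", "a", "d"]    # hypothetical kNN ranking
fused = rrf_fuse([keyword_hits, vector_hits])
```

Documents that rank well in both lists ("a" and "c" here) rise above documents that appear in only one, which is why hybrid retrieval helps when either signal alone is noisy.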
GET /knowledge-base/_search
{
  "knn": {
    "field": "embedding",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "e5-multilingual",
        "model_text": "How do I configure mappings?"
      }
    },
    "k": 5,
    "num_candidates": 50,
    "filter": {
      "term": { "source_title.keyword": "Elasticsearch Guide" }
    }
  }
}
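When filters come from user input, it can help to build the kNN clause in code rather than by hand. The sketch below is one possible shape, not a fixed API: the helper name `build_knn_search` and its parameters are hypothetical, and it assumes the `source_title.keyword` and `created_at` fields from the mapping above plus the `e5-multilingual` model ID used throughout this guide.

```python
from typing import Optional


def build_knn_search(question: str, k: int = 5,
                     source_titles: Optional[list[str]] = None,
                     created_after: Optional[str] = None) -> dict:
    """Build a kNN clause with optional metadata filters."""
    filters = []
    if source_titles:
        filters.append({"terms": {"source_title.keyword": source_titles}})
    if created_after:
        filters.append({"range": {"created_at": {"gte": created_after}}})
    knn = {
        "field": "embedding",
        "query_vector_builder": {
            "text_embedding": {"model_id": "e5-multilingual", "model_text": question}
        },
        "k": k,
        "num_candidates": k * 10,
    }
    if filters:
        # All filters must match; they restrict the candidates kNN may return.
        knn["filter"] = {"bool": {"filter": filters}}
    return knn
```

The returned dict can be passed as the `knn` argument of the Python client's `search` call.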
Pass retrieved chunks to an LLM with a grounded prompt.
from openai import OpenAI
from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id="...", api_key="...")
llm = OpenAI()

def retrieve(query: str, k: int = 5) -> tuple[str, list[dict]]:
    """Return the numbered context string and the list of sources for a query."""
    resp = es.search(
        index="knowledge-base",
        knn={
            "field": "embedding",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": "e5-multilingual",
                    "model_text": query
                }
            },
            "k": k,
            "num_candidates": k * 10
        },
        source=["content", "source_title", "source_url"]
    )
    context_parts = []
    sources = []
    for i, hit in enumerate(resp["hits"]["hits"], start=1):
        src = hit["_source"]
        context_parts.append(f"[{i}] {src['content']}")
        sources.append({"title": src.get("source_title", ""), "url": src.get("source_url", "")})
    return "\n\n".join(context_parts), sources

def ask(question: str, k: int = 5) -> dict:
    # 1. Retrieve relevant chunks and build the numbered context
    context, sources = retrieve(question, k=k)
    # 2. Generate a grounded answer
    completion = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Answer the user's question using ONLY the provided context. "
                "Cite sources using [1], [2], etc. "
                "If the context doesn't contain enough information, say so."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.2
    )
    return {
        "answer": completion.choices[0].message.content,
        "sources": sources
    }
For multi-turn conversations, include chat history in the prompt and optionally reformulate the question into a standalone query before retrieval.
def ask_with_history(question: str, history: list[dict], k: int = 5) -> dict:
    # Reformulate the question using chat history for better retrieval
    if history:
        reformulation = llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": (
                    "Rewrite the user's question as a standalone search query, "
                    "incorporating context from the conversation history."
                )},
                *history,
                {"role": "user", "content": question}
            ],
            temperature=0
        )
        search_query = reformulation.choices[0].message.content
    else:
        search_query = question
    # Retrieve with the reformulated query (retrieval only, no answer generated yet)
    context, sources = retrieve(search_query, k=k)
    # Generate with full conversation context
    messages = [
        {"role": "system", "content": (
            "Answer the user's question using the provided context. "
            "Cite sources using [1], [2], etc. "
            "Consider the conversation history for context."
        )},
        *history,
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ]
    completion = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.2
    )
    return {
        "answer": completion.choices[0].message.content,
        "sources": sources
    }
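Passing the whole chat history on every turn makes prompts grow without bound. One simple mitigation, sketched below, is to keep only the most recent messages that fit a budget. The helper name `trim_history` and both limits are hypothetical, and character count is used only as a rough stand-in for tokens; a real deployment would likely count tokens with the model's tokenizer.

```python
def trim_history(history: list[dict], max_turns: int = 4, max_chars: int = 4000) -> list[dict]:
    """Keep the most recent messages that fit within a turn and character budget."""
    # One turn is assumed to be a user message plus an assistant reply.
    recent = history[-2 * max_turns:] if max_turns > 0 else []
    kept = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(recent):
        size = len(message.get("content", ""))
        if used + size > max_chars:
            break
        kept.append(message)
        used += size
    kept.reverse()  # restore chronological order
    return kept
```

The trimmed list can be passed wherever the full `history` list would otherwise go, for both the reformulation call and the final generation call.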
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    body = request.get_json(silent=True) or {}
    question = body.get("question", "")
    history = body.get("history", [])
    if not question:
        return jsonify({"error": "Missing question"}), 400
    result = ask_with_history(question, history)
    return jsonify(result)
| Lever | Effect |
|---|---|
| Chunk size | Smaller = more precise retrieval, less context per chunk. Larger = more context but noisier. |
| k (num results) | More chunks = more context for the LLM but risks dilution. Start with 3-5. |
| Hybrid retrieval | Adds keyword matching; helps when questions contain specific terms or identifiers. |
| Reranking | Retrieve 20, rerank to top 5 with a cross-encoder for best precision. |
| Metadata filtering | Scope retrieval to relevant sources, time ranges, or categories. |
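The reranking lever in the table ("retrieve 20, rerank to top 5") can be sketched with a pluggable scoring function. This is an illustration under stated assumptions: `rerank` and `term_overlap` are hypothetical names, hits are assumed to have the `_source.content` shape returned by the searches above, and the toy term-overlap scorer merely stands in for a real cross-encoder or a managed reranker.

```python
from typing import Callable


def rerank(question: str, hits: list[dict],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[dict]:
    """Re-order retrieved hits by a relevance score and keep the best top_n."""
    scored = [(score_fn(question, hit["_source"]["content"]), hit) for hit in hits]
    # Stable sort: hits with equal scores keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in scored[:top_n]]


def term_overlap(question: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of question terms found in the passage."""
    terms = set(question.lower().split())
    if not terms:
        return 0.0
    return len(terms & set(passage.lower().split())) / len(terms)
```

In practice the first-stage search would request around 20 hits, and `score_fn` would call the reranking model on each question and passage pair.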
| Question | Answer |
|---|---|
| "The chatbot hallucinates" | Strengthen the system prompt ("only use provided context"), reduce temperature, add "I don't know" instructions. |
| "Answers are too vague" | Reduce chunk size for more precise passages; increase k for more context. |
| "How do I cite sources?" | Include source metadata in retrieval, reference in prompt with numbered citations. |
| "How do I handle long documents?" | Chunk with overlap; consider hierarchical retrieval (retrieve chunk, then fetch parent section). |
| "How do I update the knowledge base?" | Re-chunk and re-index changed documents. Use parent_doc_id to delete old chunks before re-indexing. |
| "Which LLM should I use?" | GPT-4o-mini for cost efficiency, GPT-4o or Claude for quality. Any OpenAI-compatible API works. |
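The knowledge-base update answer in the table (delete old chunks by `parent_doc_id`, then re-index) might look like the sketch below. The function name `replace_document_chunks` is hypothetical, and the `delete_by_query` and `bulk` keyword arguments are assumed to match the 8.x Python client; verify them against the client version in use. The sketch does not inspect the bulk response for per-item errors.

```python
def replace_document_chunks(es, parent_doc_id: str, new_chunks: list[dict],
                            index: str = "knowledge-base",
                            pipeline: str = "embed-knowledge-base") -> int:
    """Delete a document's old chunks, then index its new ones."""
    # Remove every chunk that belongs to the changed document.
    es.delete_by_query(
        index=index,
        query={"term": {"parent_doc_id": parent_doc_id}},
        refresh=True,
    )
    if not new_chunks:
        return 0
    # The bulk API takes alternating action and source entries.
    operations = []
    for chunk in new_chunks:
        operations.append({"index": {}})
        operations.append({**chunk, "parent_doc_id": parent_doc_id})
    es.bulk(index=index, pipeline=pipeline, operations=operations)
    return len(new_chunks)
```

Deleting before indexing means a reader could briefly see no chunks for that document; if that matters, index the new chunks under a version field first and delete the old version afterwards.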