
generate-rag-dataset

Name: Generate Rag Dataset
Author: langwatch

Generate a synthetic evaluation dataset from your RAG knowledge base. Creates diverse Q&A pairs with expected answers and relevant context, ready for LangWatch experiments and platform import. Use when you need test data for your RAG pipeline.


updated:April 24, 2026 at 09:38

SKILL.md


name	generate-rag-dataset
description	Generate a synthetic evaluation dataset from your RAG knowledge base. Creates diverse Q&A pairs with expected answers and relevant context, ready for LangWatch experiments and platform import. Use when you need test data for your RAG pipeline.
license	MIT
compatibility	Requires LangWatch SDK. Works with Claude Code and similar coding agents.
metadata	{"category":"recipe"}

Generate a RAG Evaluation Dataset

This recipe analyzes your RAG knowledge base and generates a comprehensive Q&A evaluation dataset.

Step 1: Analyze the Knowledge Base

Read the codebase to find the knowledge base:

Document files (PDFs, markdown, text files)
Database schemas (if documents are stored in a DB)
Vector store configuration (what's being embedded)
Chunking strategy (how documents are split)

Read every document you can access. Understand:

What topics does the knowledge base cover?
What's the depth of information?
What terminology is used?
What are the boundaries (what's NOT covered)?
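
The discovery part of this step could be sketched as a small inventory script. This is a minimal sketch, assuming the knowledge base is stored as files under a single directory; the helper name `inventory_knowledge_base` and the extension list are illustrative and not part of the LangWatch SDK. It only locates candidate documents. Reading them and judging topics, depth, and boundaries is still up to you.

```python
from collections import Counter
from pathlib import Path

# Assumed set of document extensions; adjust to match your knowledge base.
DOC_EXTENSIONS = {".pdf", ".md", ".txt"}


def inventory_knowledge_base(root: str) -> dict:
    """List candidate knowledge-base documents under `root` (hypothetical helper)."""
    files = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in DOC_EXTENSIONS
    )
    # Count documents per extension to get a rough picture of the corpus.
    by_type = Counter(p.suffix.lower() for p in files)
    return {"files": [str(p) for p in files], "by_type": dict(by_type)}
```

Calling `inventory_knowledge_base("docs")` (the path is an example) returns the matching file paths plus a per-extension count, which gives a quick checklist of what to read before writing any questions.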

Step 2: Generate Diverse Question Types

Create questions across these categories:

Factual Recall

Direct questions answerable from a single passage:

"What is the recommended threshold for X?"
"When should Y be applied?"

Multi-Hop Reasoning

Questions requiring information from multiple passages:

"Given condition A and condition B, what should be done?"
"How do X and Y interact when Z occurs?"

Comparison

Questions comparing concepts within the knowledge base:

"What's the difference between approach A and approach B?"
"When should you use X instead of Y?"

Edge Cases

Questions about boundary conditions or unusual scenarios:

"What happens if the measurement is outside normal range?"
"What if two recommendations conflict?"

Negative Cases

Questions about topics NOT covered by the knowledge base:

"Does the system support Z?" (when it doesn't)
Questions requiring external knowledge the KB doesn't have

These help test that the agent correctly says "I don't know" rather than hallucinating.
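
One way to keep the five categories balanced is to plan the mix before generating anything. The sketch below is an assumption-laden illustration: only the label `factual_recall` appears in this recipe, so the other four label strings and the helper `plan_question_mix` are hypothetical names chosen for the example. It splits a target size evenly and hands any remainder to the earlier categories.

```python
# Category labels; only "factual_recall" comes from the recipe, the rest are assumed.
QUESTION_TYPES = [
    "factual_recall",
    "multi_hop",
    "comparison",
    "edge_case",
    "negative",
]


def plan_question_mix(total: int) -> dict[str, int]:
    """Split `total` questions evenly across the categories (hypothetical helper)."""
    if total < len(QUESTION_TYPES):
        raise ValueError("need at least one question per category")
    base, remainder = divmod(total, len(QUESTION_TYPES))
    # The first `remainder` categories receive one extra question each.
    return {
        qtype: base + (1 if i < remainder else 0)
        for i, qtype in enumerate(QUESTION_TYPES)
    }
```

An even split is only a starting point. If your knowledge base has few cross-referenced passages, for example, you may want fewer multi-hop questions than the plan suggests.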

Step 3: Include Context Per Row

For each Q&A pair, include the relevant document chunk(s) that contain the answer. This enables:

Platform experiments without the full RAG pipeline
Evaluating answer quality independent of retrieval quality
Testing with different prompts using the same retrieved context

Format:

{
    "input": "When should I irrigate apple orchards?",
    "expected_output": "Irrigate when soil moisture exceeds 35 kPa...",
    "context": "## Irrigation Management\nSoil moisture threshold for apple orchards: maintain between 25-35 kPa...",
    "question_type": "factual_recall"
}
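
A row in this format can be checked mechanically before it enters the dataset. The following is a minimal sketch; `REQUIRED_FIELDS` mirrors the four keys shown above, while the helper name `validate_row` and the rule that `context` may be an empty string (for example on negative cases, where no chunk contains the answer) are assumptions made for illustration.

```python
REQUIRED_FIELDS = ("input", "expected_output", "context", "question_type")


def validate_row(row: dict) -> list[str]:
    """Return the problems found in one dataset row; an empty list means it passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if not isinstance(value, str):
            problems.append(f"missing field: {field}")
        # Assumption: context may be empty, every other field must have text.
        elif field != "context" and not value.strip():
            problems.append(f"empty field: {field}")
    return problems
```

Running this over every row and refusing to export while any list is non-empty catches dropped keys early, before they surface as blank columns in an experiment.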

Step 4: Export Formats

Create both:

Python DataFrame (for SDK experiments)

import pandas as pd

# `dataset` is the list of row dicts built in Step 3, one dict per Q&A pair
df = pd.DataFrame(dataset)
df.to_csv("rag_evaluation_dataset.csv", index=False)

Platform-Ready CSV

Export with columns: input, expected_output, context, question_type. This file can be imported directly into LangWatch platform datasets.
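
Because `context` values usually contain newlines and markdown, it is worth confirming that the CSV survives a round trip. The sketch below swaps pandas for Python's standard `csv` module so the column order is explicit; the helper names `write_platform_csv` and `read_platform_csv` are hypothetical, and nothing here is a LangWatch API.

```python
import csv

# Column order taken from the export instructions above.
COLUMNS = ["input", "expected_output", "context", "question_type"]


def write_platform_csv(rows: list[dict], path: str) -> None:
    """Write rows with a fixed column order; keys outside COLUMNS are ignored."""
    # newline="" lets the csv module handle line endings and embedded newlines.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def read_platform_csv(path: str) -> list[dict]:
    """Read the file back so multi-line context cells can be compared."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Comparing the rows returned by `read_platform_csv` with the originals is a cheap check that quoting did not mangle any multi-line context before you upload the file.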

Step 5: Validate Dataset Quality

Before using the dataset:

Check topic coverage — are all knowledge base topics represented?
Verify answers are actually in the context — no hallucinated expected outputs
Check question diversity — not all the same type
Verify negative cases have appropriate "I don't know" expected outputs
Run a quick experiment to baseline accuracy
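
Some of these checks can be partly automated. The sketch below is a rough heuristic under stated assumptions, not a substitute for reading the rows: the `negative` label, the refusal phrases in `REFUSAL_MARKERS`, the word-overlap threshold, and the helper `check_dataset` are all hypothetical. It counts question types, flags negative cases whose expected output does not read like a refusal, and flags answers that share few content words with their context so a person can review them.

```python
import re
from collections import Counter

# Assumed phrases that mark an "I don't know" style expected output.
REFUSAL_MARKERS = ("i don't know", "not covered", "no information")


def _content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def check_dataset(rows: list[dict], min_overlap: float = 0.5) -> dict:
    """Summarize type diversity and flag rows that need manual review."""
    type_counts = Counter(r["question_type"] for r in rows)
    weak_negatives = []
    possibly_ungrounded = []
    for r in rows:
        answer = r["expected_output"]
        if r["question_type"] == "negative":
            # Negative cases should decline rather than assert an answer.
            if not any(m in answer.lower() for m in REFUSAL_MARKERS):
                weak_negatives.append(r["input"])
            continue
        words = _content_words(answer)
        shared = words & _content_words(r["context"])
        overlap = len(shared) / len(words) if words else 0.0
        # Low overlap suggests the answer may not come from the context.
        if overlap < min_overlap:
            possibly_ungrounded.append(r["input"])
    return {
        "type_counts": dict(type_counts),
        "weak_negatives": weak_negatives,
        "possibly_ungrounded": possibly_ungrounded,
    }
```

Word overlap will miss paraphrased answers and can pass wrong answers that reuse the context's vocabulary, so treat the flagged rows as a review queue rather than a verdict.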

Common Mistakes

Do NOT generate questions without reading the actual knowledge base first
Do NOT skip negative cases — testing "I don't know" is crucial for RAG
Do NOT use the same question pattern for every entry — diversify types
Do NOT forget to include the relevant context per row
Do NOT generate expected outputs that aren't actually in the knowledge base


package.json

"author": "langwatch"

"repository": "langwatch/skills"

