一键导入
一键导入
Create and curate evaluation datasets with ground truth for TruLens
Configure feedback functions and selectors for TruLens evaluations
Configure and use feedback functions as runtime blocking guardrails
Execute TruLens evaluations and view results
Diagnose low evaluation scores and generate actionable improvement recommendations
| skill_spec_version | 0.1.0 |
| name | trulens-evaluation-workflow |
| version | 1.0.0 |
| description | Systematically evaluate your LLM application with TruLens |
| tags | ["trulens","llm","evaluation","workflow","orchestration"] |
A systematic approach to evaluating your LLM application.
Use this skill when you want to:
Before implementing, always ask the user these questions:
Ask: "Which evaluation metrics would you like to use?"
| App Type | Recommended Metrics | Description |
|---|---|---|
| RAG | RAG Triad | Context Relevance, Groundedness, Answer Relevance |
| Agent | Agent GPA | Tool Selection, Tool Calling, Execution Efficiency, etc. |
| Simple | Answer Relevance | Basic input-to-output relevance check |
| Custom | Ask user | Let user describe what they want to evaluate |
For Agents, also ask:
┌─────────────────────────────────────────────────────────────────┐
│ TruLens Evaluation Workflow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. INSTRUMENT 2. CURATE 3. CONFIGURE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Capture data │ → │ Build test │ → │ Choose │ │
│ │ from your │ │ datasets │ │ metrics │ │
│ │ app │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ │
│ └─────────────────────┬─────────────────────┘ │
│ ↓ │
│ 4. RUN & ANALYZE │
│ ┌──────────────┐ │
│ │ Execute evals│ │
│ │ & iterate │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
| Step | Skill | When to Use |
|---|---|---|
| 1. Instrument | instrumentation/ | Setting up a new app, adding custom spans, capturing specific data for evals |
| 2. Curate | dataset-curation/ | Creating test datasets, storing ground truth, ingesting external logs |
| 3. Configure | evaluation-setup/ | Choosing metrics (RAG triad vs Agent GPA), setting up feedback functions |
| 4. Run | running-evaluations/ | Executing evaluations, viewing results, comparing versions |
Answer these questions to find where to start:
"I have a new LLM app that isn't instrumented yet"
→ Start with instrumentation/ skill
"My app is instrumented but I don't have test data"
→ Go to dataset-curation/ skill
"I have data but haven't set up evaluations"
→ Go to evaluation-setup/ skill
"Everything is set up, I just need to run evals"
→ Go to running-evaluations/ skill
"I want to see traces of my app's execution"
→ Use instrumentation/ - capture spans and view in dashboard
"I want to evaluate my RAG's retrieval quality"
→ Use evaluation-setup/ - configure RAG Triad metrics
"I want to evaluate my agent's tool usage"
→ Use evaluation-setup/ - configure Agent GPA metrics
"I want to compare two versions of my app"
→ Use running-evaluations/ - version comparison pattern
"I want to evaluate against known correct answers"
→ Use dataset-curation/ - create ground truth dataset
TruLlama or TruChainTruGraph (for LangGraph/Deep Agents)Note: For LangGraph-based frameworks like Deep Agents, always use TruGraph rather than manual @instrument() decorators. TruGraph automatically creates the correct span types and captures all graph transitions.
"Do I need to use all four skills?" No. Instrumentation and evaluation-setup are essential. Dataset-curation is optional (for ground truth comparisons). Running-evaluations is needed to execute and view results.
"What order should I use them?" Generally: Instrument → (optionally) Curate → Configure → Run. But you can revisit any step as needed.
"Can I add more evaluations later?" Yes. You can always add new feedback functions and re-run evaluations on existing traces.
"How do I know if my app is a RAG or Agent?"
If your app does both (e.g., agentic RAG), use metrics from both categories.
If you're unsure which skill to use, describe your goal and I'll guide you to the right one.
TruGraph for LangGraph-based apps (including Deep Agents).on_input() and .on_output() feedback shortcuts require RECORD_ROOT spans@instrument(span_type=SpanType.AGENT) will NOT work with selector shortcutsSome LangGraph/Deep Agents versions use NotRequired type annotations that older Pydantic versions can't handle. If you see PydanticForbiddenQualifier errors, update to the latest TruLens version.