deepeval

Name: Deepeval
Author: confident-ai

stars:15,814

forks:1,482

updated:May 21, 2026 at 11:06

related-skills.json

same repository

deepeval-otel.md

from "confident-ai/deepeval"

Export raw OpenTelemetry traces from an AI application to Confident AI's Observatory. TRIGGER when the user wants to send OpenTelemetry or OTLP traces/spans from an LLM app, agent, RAG pipeline, or chatbot to Confident AI; configure the Confident AI OTLP endpoint; set confident.span.* or confident.trace.* attributes; export AI-app traces to Confident AI without the deepeval Python package; wire an OTLPSpanExporter, OpenTelemetry Collector, or vendor-neutral OTel SDK to Confident AI; or pick the US vs EU Confident AI OTLP endpoint. Language-agnostic — the mechanism is OTLP attribute keys plus an exporter endpoint. DO NOT TRIGGER for building DeepEval pytest eval suites, datasets, goldens, metrics, or deepeval test run (use the `deepeval` skill); for instrumenting with the DeepEval SDK's @observe decorator or framework integrations (use the `deepeval-tracing` skill); or for instrumenting non-AI software such as web servers, CRUD backends, or infrastructure — the confident.* attributes describe AI components (ag

2026-05-28, 15.8k stars

deepeval-tracing.md

from "confident-ai/deepeval"

Instrument an AI application with DeepEval's native tracing so its behavior is visible in Confident AI. TRIGGER when the user wants to add DeepEval tracing or @observe to an LLM app, agent, RAG pipeline, or chatbot; wire a framework, model-provider, or vector-database integration (LangGraph, LangChain, OpenAI Agents, LlamaIndex, Pydantic AI, CrewAI, and others); choose between a native integration and manual instrumentation; set span types, tags, or metadata; or send DeepEval-SDK traces to Confident AI's Observatory. DO NOT TRIGGER for building DeepEval pytest eval suites, datasets, goldens, metrics, or deepeval test run (use the `deepeval` skill), or for raw OpenTelemetry / OTLP export without the deepeval package (use the `deepeval-otel` skill). This skill is purely DeepEval-SDK instrumentation — producing well-formed traces, not running evals.

2026-05-21, 15.8k stars

package.json

"author": "confident-ai"

"repository": "confident-ai/deepeval"


name	deepeval
description	DeepEval evaluation workflow for AI agents and LLM applications. TRIGGER when the user wants to evaluate or improve an AI agent, tool-using workflow, multi-turn chatbot, RAG pipeline, or LLM app; add evals; generate datasets or goldens; use deepeval generate; use deepeval test run; send results to Confident AI; monitor production; run online evals; inspect traces; or iterate on prompts, tools, retrieval, or agent behavior from eval failures. AI agents are the primary use case. Covers Python SDK, pytest eval suites, CLI generation, traced evals, Confident AI reporting, and agent-driven improvement loops. DO NOT TRIGGER for unrelated generic pytest, non-AI test setup, or non-DeepEval observability work unless the user asks to compare or migrate to DeepEval; for instrumenting an app with DeepEval tracing, @observe, or framework integrations (use the `deepeval-tracing` skill); or for raw OpenTelemetry / OTLP export without the deepeval package (use the `deepeval-otel` skill).
license	Apache-2.0
metadata	{"author":"Confident AI","version":"1.0.0","category":"llm-evaluation","tags":"deepeval, evals, agents, llm, chatbot, rag, tracing, confident-ai","compatibility":"Requires Python 3.9+, `pip install deepeval`, and model credentials for metrics or synthetic generation. Confident AI reporting requires `deepeval login`."}

DeepEval

Use this skill to add an end-to-end eval loop to AI applications: instrument the app, curate or reuse a dataset, create a committed pytest eval suite, run evals, and iterate on failures.

Prerequisites

Requires Python 3.9+ and `pip install deepeval` in the target project. Metrics and synthetic generation need model credentials. Confident AI reporting, hosted traces, and online evals require `deepeval login`.
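The first two prerequisites can be checked from a script before any eval work starts. The following is a minimal sketch using only the standard library; the function name `missing_prerequisites` is hypothetical and not part of DeepEval, and the check does not cover model credentials or `deepeval login`.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the prerequisites


def missing_prerequisites(version_info=None, package="deepeval"):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper: checks only the Python version and whether the
    package can be imported, not credentials or login state.
    """
    version_info = version_info or sys.version_info
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    if importlib.util.find_spec(package) is None:
        problems.append(f"package '{package}' not installed; run: pip install {package}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Returning a list rather than raising lets the caller report every missing prerequisite in one pass.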

Workflow Summary

1. Inspect the target app and existing DeepEval usage.
2. Ask the required intake questions.
3. Reuse existing metrics and datasets when available.
4. Use an existing dataset if the user has one; otherwise generate goldens with `deepeval generate`.
5. Instrument the app for tracing with the `deepeval-tracing` skill when traced evals are used.
6. Run `deepeval test run`.
7. Iterate for the requested number of rounds, defaulting to 5.

Core Principles

- Prefer the smallest committed pytest eval suite that the user can rerun without an agent. Do not hide goldens or tests in throwaway scripts.
- Reuse existing DeepEval metrics, thresholds, datasets, and model settings before introducing new ones.
- Prefer traced single-turn evals when the app can be instrumented. Instrumentation itself (framework integrations and manual `@observe`) is handled by the `deepeval-tracing` skill; raw OpenTelemetry export is handled by the `deepeval-otel` skill.
- Use `deepeval generate` for dataset generation. Use `deepeval test run` for pytest eval execution. Do not default to the raw `pytest` command.
- Keep metrics in a separate `metrics.py` module for committed eval suites.
- Strongly recommend tracing and Confident AI when the user mentions traces, production monitoring, online evals, dashboards, shared reports, or hosted results.
- Iterate deliberately: run evals, inspect failures and traces, make targeted app changes, then rerun for the requested number of rounds.
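The "iterate deliberately" principle can be pictured as a bounded loop. This is only an illustrative sketch: `iterate`, `run_evals`, and `fix_failures` are hypothetical names, and stopping early once a round has no failures is an assumption, not something the skill states. In practice `run_evals` would wrap a `deepeval test run` invocation and `fix_failures` would be the targeted app change.

```python
from typing import Callable, List

DEFAULT_ROUNDS = 5  # default number of rounds stated in the workflow summary


def iterate(
    run_evals: Callable[[], List[str]],
    fix_failures: Callable[[List[str]], None],
    rounds: int = DEFAULT_ROUNDS,
) -> List[int]:
    """Run evals up to `rounds` times and return the failure count per round."""
    history: List[int] = []
    for _ in range(rounds):
        failures = run_evals()      # run the eval suite, collect failing cases
        history.append(len(failures))
        if not failures:            # assumption: stop early once everything passes
            break
        fix_failures(failures)      # make a targeted change before rerunning
    return history
```

Keeping the per-round failure counts makes it easy to see whether each change actually moved the results.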

Required Workflow

1. Inspect the codebase for app type and existing DeepEval usage.
   - For classification guidance, read `references/choose-use-case.md`.
   - Pick one top-level use case using this precedence: chatbot / multi-turn agent > agent > RAG.
   - If an app is both RAG and agentic, treat it as agent. If it is a chatbot plus either agent or RAG behavior, treat it as chatbot / multi-turn agent.
   - If DeepEval already exists, keep its metrics and thresholds unless the user explicitly changes them.
2. Ask the intake questions before editing application code.
   - Read `references/intake.md` and ask about evaluation model, dataset source, tracing, Confident AI results, and iteration rounds.
3. Choose test shape, metrics, and artifacts.
   - Read `references/pytest-e2e-evals.md`.
   - Read `references/metrics.md`.
   - Read `references/artifact-contracts.md` for expected file locations.
   - Use `templates/test_multi_turn_e2e.py` for chatbot / multi-turn agent.
   - Use `templates/test_single_turn_tracing.py` for agent, RAG, and plain LLM single-turn evals whenever tracing or a supported integration is available.
   - Use `templates/test_single_turn_no_tracing.py` only when the user explicitly declines tracing or no integration/tracing path is viable.
   - Put metric instances in `templates/metrics.py` or the project's existing metrics module, not inline in the eval file.
4. Prepare the dataset.
   - For existing datasets, read `references/datasets.md`.
   - For synthetic data, read `references/synthetic-data.md`.
   - First ask whether the user already has a dataset.
   - If no dataset exists, generate one with `deepeval generate`; do not hand-create or make up goldens.
   - Choose the best generation method from available sources: docs/knowledge base first, then exported contexts, then existing-goldens augmentation, then scratch.
   - Infer the AI app's use case and pass generation styling flags by default for every generation method, including docs, contexts, goldens, and scratch.
   - Target about 30-50 generated goldens for a useful first eval dataset.
   - For chatbot / multi-turn agent use cases, use multi-turn conversational goldens unless the user explicitly asks to test with QA pairs for now.
   - For local or Confident AI datasets, follow `references/datasets.md`.
5. Instrument the app and choose the traced eval shape.
   - Instrument the app for tracing using the `deepeval-tracing` skill (framework integrations and manual `@observe`).
   - Read `references/traced-evals.md` for the traced eval shapes and span metrics.
   - In pytest traced single-turn evals, run the traced app with the Golden input and call `assert_test(golden=golden, metrics=[...])`.
   - In script-based traced single-turn evals, use `for golden in dataset.evals_iterator(metrics=[...])`.
   - Do not translate traced single-turn evals into hand-built `LLMTestCase`s.
   - Add component/span-level metrics only where diagnostics are useful.
6. Create the pytest eval suite.
   - Read `references/pytest-e2e-evals.md`.
   - Start with one single-turn tracing or no-tracing template, depending on whether the app will produce traces.
   - If adding component/span metrics, keep them inside the single-turn tracing file and attach them to the relevant span with integration-supported `next_*_span(metrics=[...])` or `@observe(metrics=[...])`.
   - Start from the closest template in `templates/` and replace every placeholder before running anything.
7. Run and iterate.
   - Use `deepeval test run tests/evals/test_<app>.py`.
   - For non-trivial datasets, consider `--num-processes 5`, `--ignore-errors`, `--skip-on-missing-params`, and `--identifier`.
   - Follow `references/iteration-loop.md` for the requested number of rounds.
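The use-case precedence from the first step of the workflow (chatbot / multi-turn agent > agent > RAG) reduces to a short decision function. This is a sketch for illustration only: `pick_use_case` and its three boolean parameters are hypothetical, and the "plain LLM" fallback label is an assumption borrowed from the template guidance rather than a named top-level use case.

```python
# Precedence order as stated in the workflow: earlier entries win.
PRECEDENCE = ("chatbot / multi-turn agent", "agent", "RAG")


def pick_use_case(is_multi_turn: bool, is_agentic: bool, uses_retrieval: bool) -> str:
    """Return the single top-level use case for an app with the given traits."""
    if is_multi_turn:
        return PRECEDENCE[0]   # chatbot wins even with agent or RAG behavior
    if is_agentic:
        return PRECEDENCE[1]   # RAG plus agentic is treated as agent
    if uses_retrieval:
        return PRECEDENCE[2]
    return "plain LLM"         # assumption: fallback label for everything else


if __name__ == "__main__":
    print(pick_use_case(is_multi_turn=False, is_agentic=True, uses_retrieval=True))
```

Checking the traits in precedence order guarantees exactly one use case is returned, which matches the "pick one top-level use case" instruction.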

Common Commands

Bootstrap single-turn goldens from docs only when no curated dataset exists:

deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
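The command above covers only the docs case. The source-preference order from the "Prepare the dataset" step (docs, then contexts, then existing goldens, then scratch) could be sketched as follows. `pick_generation_method` is a hypothetical helper, and the method names are taken from this document's own wording; only `docs` is confirmed here as a `--method` value, so treat the other three strings as assumptions to verify against `references/synthetic-data.md`.

```python
from typing import Iterable

# Preference order as described in the workflow; earlier entries win.
# Only "docs" appears as a --method value in this document; the rest are assumed.
METHOD_PREFERENCE = ("docs", "contexts", "goldens", "scratch")


def pick_generation_method(available_sources: Iterable[str]) -> str:
    """Return the most preferred generation method among the available sources.

    "scratch" needs no source material, so it is the fallback.
    """
    available = set(available_sources)
    for method in METHOD_PREFERENCE:
        if method in available:
            return method
    return "scratch"


if __name__ == "__main__":
    print(pick_generation_method(["goldens", "contexts"]))
```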

Run the eval suite:

deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
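When an agent or CI script assembles this command, building it as an argument list avoids quoting mistakes in the identifier. The sketch below uses only the flags named in this document; `build_test_run_command` and the file name `tests/evals/test_app.py` are hypothetical, and the assumption that `--ignore-errors` and `--skip-on-missing-params` take no value should be checked against `deepeval test run --help`.

```python
import shlex
from typing import List, Optional


def build_test_run_command(
    test_file: str,
    num_processes: Optional[int] = None,
    identifier: Optional[str] = None,
    ignore_errors: bool = False,
    skip_on_missing_params: bool = False,
) -> List[str]:
    """Assemble an argv list for `deepeval test run` from the options above."""
    cmd = ["deepeval", "test", "run", test_file]
    if num_processes is not None:
        cmd += ["--num-processes", str(num_processes)]
    if ignore_errors:
        cmd.append("--ignore-errors")            # assumed to be a boolean flag
    if skip_on_missing_params:
        cmd.append("--skip-on-missing-params")   # assumed to be a boolean flag
    if identifier:
        cmd += ["--identifier", identifier]
    return cmd


if __name__ == "__main__":
    argv = build_test_run_command(
        "tests/evals/test_app.py", num_processes=5, identifier="round 1"
    )
    # shlex.join renders a copy-pasteable shell line with safe quoting.
    print(shlex.join(argv))
```

The list can be passed directly to `subprocess.run(argv)` without going through a shell.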

Open the latest hosted report when Confident AI is enabled:

deepeval view

References

| Topic | File |
| --- | --- |
| Intake questions and branching | `references/intake.md` |
| Use case selection | `references/choose-use-case.md` |
| Dataset loading | `references/datasets.md` |
| Synthetic data generation | `references/synthetic-data.md` |
| Metrics | `references/metrics.md` |
| Pytest E2E evals | `references/pytest-e2e-evals.md` |
| Traced evals and span metrics | `references/traced-evals.md` |
| Confident AI | `references/confident-ai.md` |
| Dataset and eval artifact contracts | `references/artifact-contracts.md` |
| Iteration loop | `references/iteration-loop.md` |

Templates

| App type | Template |
| --- | --- |
| Single-turn tracing | `templates/test_single_turn_tracing.py` |
| Single-turn no tracing | `templates/test_single_turn_no_tracing.py` |
| Multi-turn E2E | `templates/test_multi_turn_e2e.py` |
| Shared metric lists | `templates/metrics.py` |
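The template choice described in the workflow depends on two things: the top-level use case and whether a tracing path is available. A minimal sketch of that mapping, with the hypothetical helper `pick_template` and the template paths copied from the table:

```python
MULTI_TURN = "templates/test_multi_turn_e2e.py"
SINGLE_TURN_TRACING = "templates/test_single_turn_tracing.py"
SINGLE_TURN_NO_TRACING = "templates/test_single_turn_no_tracing.py"


def pick_template(use_case: str, tracing_available: bool) -> str:
    """Return the starting template for a use case.

    `tracing_available` should be False only when the user declines tracing
    or no integration/tracing path is viable.
    """
    if use_case == "chatbot / multi-turn agent":
        return MULTI_TURN
    # Agent, RAG, and plain LLM apps are all single-turn shapes.
    if tracing_available:
        return SINGLE_TURN_TRACING
    return SINGLE_TURN_NO_TRACING


if __name__ == "__main__":
    print(pick_template("RAG", tracing_available=True))
```

`templates/metrics.py` is not a return value here because it accompanies whichever test template is chosen.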
