| description | Use when writing eval code, configuring eval infrastructure, creating golden datasets, setting up PromptRegistry, authoring CI eval gates, or working with any eval tool: DeepEval, Ragas, Giskard OSS v3, Promptfoo, Langfuse, Arize Phoenix, adk eval, ADK User Simulation, Vertex GenAI Eval. Covers per-agent accuracy thresholds, CI tier structure (R1-R4), MCP eval suites, golden dataset structure, and PromptRegistry architecture. Also covers pytest harness configuration (asyncio_mode, InMemoryRunner, parametrize-over-golden). |
| metadata | {"triggers":"deepeval, ragas, giskard, promptfoo, langfuse, phoenix, PromptRegistry, golden dataset, evalset, eval-smoke, eval-adk, redteam, MCPUseMetric, Faithfulness, ToolCallAccuracy, HallucinationsV1, tests/golden, dataset_manifest, seed-prompts, R1 gate, R2 nightly, R3 pre-release, R4 canary, ci tier, eval threshold, ArenaGEval, GeminiModel judge, set_default_generator, promptfooconfig.yaml, Scenario, Suite, make eval-all-local, make redteam-promptfoo, make giskard-scan, make mcp-eval-all, make seed-prompts-local, eval-reviewer, per-agent threshold, golden case, asyncio_mode, InMemoryRunner, pytest-asyncio, dataset_manifest.yaml, diff_langfuse_prompts, emergency pack, prompt registry, prompt lifecycle","related-skills":"google-adk, adk-eval-guide, python-dev, adk-observability-guide","domain":"agentic-ai","role":"specialist","scope":"evaluation","output-format":"code"} |