evals-write-spec
Write LLM evaluation spec files with datasets, tasks, and evaluators using the @kbn/evals Playwright fixture. Use when authoring new eval specs, adding datasets or evaluators, or debugging evaluation test failures.
| name | evals-write-spec |
| disable-model-invocation | true |
| description | Write LLM evaluation spec files with datasets, tasks, and evaluators using the @kbn/evals Playwright fixture. Use when authoring new eval specs, adding datasets or evaluators, or debugging evaluation test failures. |
Eval specs use the evaluate Playwright fixture (not test). A spec file follows this structure:
import { evaluate, tags, selectEvaluators, type Example, type TaskOutput } from '@kbn/evals';

evaluate.describe('Suite name', { tag: tags.serverless.observability.complete }, () => {
  evaluate.beforeAll(async ({ fetch, log }) => {
    // one-time setup: install docs, create agents, load archives
  });

  evaluate.afterAll(async ({ fetch, log }) => {
    // teardown: uninstall docs, delete agents, unload archives
  });

  evaluate('test name', async ({ executorClient, connector }) => {
    await executorClient.runExperiment(
      { datasets: [dataset], task },
      evaluators
    );
  });
});
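To make the data flow in the skeleton concrete, here is a minimal in-memory sketch of what running an experiment amounts to: each example's input goes through the task, and every evaluator scores the resulting output against the expected output. This is not the @kbn/evals implementation. `runExperimentSketch` and the `Sketch*` types are hypothetical stand-ins, and the real `executorClient.runExperiment` additionally exports scores to ES.

```typescript
// Hypothetical local stand-ins; the real types come from @kbn/evals.
interface SketchExample<I, O> {
  input: I;
  output: O; // the expected output
}

interface SketchEvaluator<I, O, T> {
  name: string;
  evaluate: (args: { input: I; output: T; expected: O }) => Promise<{ score: number }>;
}

interface SketchResult {
  evaluator: string;
  exampleIndex: number;
  score: number;
}

// Runs the task once per example, then scores the task output with every evaluator.
async function runExperimentSketch<I, O, T>(
  experiment: {
    datasets: Array<{ name: string; examples: Array<SketchExample<I, O>> }>;
    task: (example: SketchExample<I, O>) => Promise<T>;
  },
  evaluators: Array<SketchEvaluator<I, O, T>>
): Promise<SketchResult[]> {
  const results: SketchResult[] = [];
  for (const dataset of experiment.datasets) {
    for (let exampleIndex = 0; exampleIndex < dataset.examples.length; exampleIndex++) {
      const example = dataset.examples[exampleIndex];
      const output = await experiment.task(example);
      for (const evaluator of evaluators) {
        const { score } = await evaluator.evaluate({
          input: example.input,
          output,
          expected: example.output,
        });
        results.push({ evaluator: evaluator.name, exampleIndex, score });
      }
    }
  }
  return results;
}
```

The sketch yields one score per (example, evaluator) pair, which is why a focused dataset and a small evaluator list keep local runs quick.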
When a suite has a custom src/evaluate.ts, import from there instead of @kbn/evals:
import { evaluate } from '../src/evaluate';
Every evaluate.describe must have a tag. Common choices:
| Tag | When to use |
|---|---|
| tags.serverless.observability.complete | Observability domain evals |
| tags.serverless.security.complete | Security domain evals |
| tags.serverless.search | Search domain evals |
| tags.stateful.classic | Stateful-only evals |
Import tags from @kbn/scout or @kbn/evals (re-exported).
A dataset is an array of examples with typed input, output (expected), and optional metadata:
type MyExample = Example<
  { question: string },
  { expectedAnswer: string },
  { tags?: string[] }
>;

const dataset = {
  name: 'my-dataset',
  description: 'What this dataset tests',
  examples: [
    {
      input: { question: 'What is 2+2?' },
      output: { expectedAnswer: '4' },
      metadata: { tags: ['math'] },
    },
  ],
};
Keep datasets focused. For local iteration, use --grep to run a subset:
node scripts/evals start --grep "my test name"
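One possible way to keep a dataset focused is to derive a smaller dataset from a larger one using the optional metadata. The sketch below assumes examples carry a `tags` array in their metadata, as in the example above. `subsetByTag` and the `Tagged*` types are hypothetical local helpers, not part of @kbn/evals.

```typescript
// Local stand-ins mirroring the MyExample shape above; not imported from @kbn/evals.
interface TaggedExample {
  input: { question: string };
  output: { expectedAnswer: string };
  metadata?: { tags?: string[] };
}

interface TaggedDataset {
  name: string;
  description: string;
  examples: TaggedExample[];
}

// Hypothetical helper: returns a new dataset containing only the examples that carry `tag`.
function subsetByTag(dataset: TaggedDataset, tag: string): TaggedDataset {
  return {
    name: `${dataset.name}-${tag}`,
    description: `${dataset.description} (subset: ${tag})`,
    examples: dataset.examples.filter((example) =>
      (example.metadata?.tags ?? []).some((candidate) => candidate === tag)
    ),
  };
}

const fullDataset: TaggedDataset = {
  name: 'my-dataset',
  description: 'What this dataset tests',
  examples: [
    {
      input: { question: 'What is 2+2?' },
      output: { expectedAnswer: '4' },
      metadata: { tags: ['math'] },
    },
    {
      input: { question: 'What is the capital of France?' },
      output: { expectedAnswer: 'Paris' },
      metadata: { tags: ['geography'] },
    },
  ],
};

// Only the example tagged 'math' remains.
const mathOnly = subsetByTag(fullDataset, 'math');
```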
The task function receives an example and returns the output to evaluate:
task: async ({ input }) => {
  const result = await someKibanaApi(input.question);
  return { answer: result.content };
}
Tasks can use any fixture available in the evaluate callback: fetch, inferenceClient, connector, esClient, kbnClient, or custom fixtures like chatClient.
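Because a task usually depends on a fixture, one option is to build it with a small factory that closes over the client. The following is a sketch under assumptions: `AnswerClient`, `createAnswerTask`, and `stubClient` are hypothetical names, and the stub stands in for whatever fixture (for example chatClient) the suite actually provides.

```typescript
// Hypothetical client interface standing in for a fixture such as chatClient.
interface AnswerClient {
  ask: (question: string) => Promise<{ content: string }>;
}

// Builds a task that closes over the client, so a spec can pass in the real fixture.
function createAnswerTask(client: AnswerClient) {
  return async ({ input }: { input: { question: string } }): Promise<{ answer: string }> => {
    const result = await client.ask(input.question);
    // Return only the fields the evaluators need to score.
    return { answer: result.content };
  };
}

// Stub client for exercising the task without a running Kibana.
const stubClient: AnswerClient = {
  ask: async (question) => ({ content: question === 'What is 2+2?' ? '4' : 'unknown' }),
};

const answerTask = createAnswerTask(stubClient);
```

Keeping the task's return value small makes evaluators easier to write, since they receive exactly this object as `output`.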
There are two ways to provide evaluators to runExperiment:
- selectEvaluators -- typed wrapper that enforces Example/TaskOutput generics

CODE evaluators are deterministic and make no LLM call. Use them for binary checks:
{
  name: 'NonEmpty',
  kind: 'CODE',
  evaluate: async ({ output }) => ({
    // Treat a missing documents array as empty so the comparison is always numeric
    score: (output?.documents?.length ?? 0) > 0 ? 1 : 0,
  }),
}
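Another common binary check compares the task output with the expected answer. The sketch below assumes the evaluator receives both `output` and `expected`, as the LLM-based evaluators in this document do. `CodeEvaluatorSketch` and `exactMatch` are hypothetical local names, not types or exports of @kbn/evals.

```typescript
// Local stand-in for the evaluator shape shown above; the real type comes from @kbn/evals.
interface CodeEvaluatorSketch<TOutput, TExpected> {
  name: string;
  kind: 'CODE';
  evaluate: (args: {
    output: TOutput | undefined;
    expected: TExpected;
  }) => Promise<{ score: number }>;
}

// Deterministic binary check: 1 when the normalized answer equals the expected one, else 0.
const exactMatch: CodeEvaluatorSketch<{ answer: string }, { expectedAnswer: string }> = {
  name: 'ExactMatch',
  kind: 'CODE',
  evaluate: async ({ output, expected }) => {
    // Normalize whitespace and case so trivial formatting differences do not fail the check.
    const actual = (output?.answer ?? '').trim().toLowerCase();
    const wanted = expected.expectedAnswer.trim().toLowerCase();
    return { score: actual === wanted ? 1 : 0 };
  },
};
```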
Use evaluators.criteria(criteriaArray) for subjective quality checks. The judge LLM scores each criterion:
evaluators.criteria([
  'The response correctly identifies the top users.',
  'The response includes risk scores.',
]).evaluate({ input, output, expected, metadata })
correctnessAnalysis compares the task output against the expected answer:
evaluators.correctnessAnalysis().evaluate({ input, output, expected, metadata })
groundednessAnalysis checks whether the output is grounded in the provided context:
evaluators.groundednessAnalysis().evaluate({ input, output, expected, metadata })
Trace-based evaluators are available from evaluators.traceBasedEvaluators:

- inputTokens, outputTokens, cachedTokens -- token usage
- toolCalls -- number of tool calls
- latency -- span latency in seconds

These read from the tracing ES cluster and require EDOT to be running.

For retrieval-augmented generation (RAG) with ground truth, @kbn/evals exports factory functions for precision, recall, and F1 at K:
import { createPrecisionAtKEvaluator, createRecallAtKEvaluator, createF1AtKEvaluator } from '@kbn/evals';
See evaluator-patterns.md for full examples.
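For intuition about what these metrics measure, here is a sketch of the standard precision@k, recall@k, and F1@k definitions over document IDs. It is an independent illustration, not the code behind createPrecisionAtKEvaluator and its siblings, whose options and input shapes are not shown in this document. All function names are hypothetical, and the retrieved IDs are assumed to be unique.

```typescript
// Counts how many of the top-k retrieved IDs appear in the ground-truth list.
function hitsAtK(retrieved: string[], relevant: string[], k: number): number {
  return retrieved.slice(0, Math.max(k, 0)).filter((id) => relevant.indexOf(id) !== -1).length;
}

// Fraction of the top-k results that are relevant.
function precisionAtK(retrieved: string[], relevant: string[], k: number): number {
  return k > 0 ? hitsAtK(retrieved, relevant, k) / k : 0;
}

// Fraction of all relevant documents that appear in the top-k results.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  return relevant.length > 0 ? hitsAtK(retrieved, relevant, k) / relevant.length : 0;
}

// Harmonic mean of precision@k and recall@k; 0 when both are 0.
function f1AtK(retrieved: string[], relevant: string[], k: number): number {
  const precision = precisionAtK(retrieved, relevant, k);
  const recall = recallAtK(retrieved, relevant, k);
  return precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
}

const retrievedIds = ['a', 'b', 'c', 'd'];
const relevantIds = ['a', 'c', 'e'];

// With k = 3 the top results are a, b, c: two of them are relevant,
// and two of the three relevant documents were found.
const precisionAt3 = precisionAtK(retrievedIds, relevantIds, 3);
const recallAt3 = recallAtK(retrievedIds, relevantIds, 3);
const f1At3 = f1AtK(retrievedIds, relevantIds, 3);
```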
| Fixture | Scope | Description |
|---|---|---|
| executorClient | worker | Runs experiments, exports scores to ES |
| inferenceClient | worker | Inference REST client bound to connector |
| connector | worker | The model connector being evaluated |
| evaluationConnector | worker | The judge connector |
| evaluators | worker | DefaultEvaluators (criteria, correctness, groundedness, trace-based) |
| fetch | worker | HttpHandler for Kibana API calls |
| esClient | worker | Elasticsearch client (Scout cluster) |
| kbnClient | worker | Kibana client with retries |
| traceEsClient | worker | ES client for trace queries |
| evaluationsEsClient | worker | ES client for evaluation score storage |
| log | worker | ToolingLog for structured logging |
| repetitions | worker | Number of experiment repetitions |
| config | worker | Scout server config (hosts, auth) |
evaluateDataset Pattern

For suites with many specs that share the same task + evaluator wiring, extract a reusable helper:
src/evaluate_dataset.ts:
import type { DefaultEvaluators, EvalsExecutorClient } from '@kbn/evals';
import type { MyChatClient } from './chat_client';

// MyExample, myCriteriaEvaluator, and myToolCallsEvaluator are assumed to be
// defined or imported elsewhere in the suite.
export type EvaluateDataset = (opts: {
  dataset: { name: string; description: string; examples: MyExample[] };
}) => Promise<void>;

export function createEvaluateDataset({
  chatClient, evaluators, executorClient,
}: {
  chatClient: MyChatClient;
  evaluators: DefaultEvaluators;
  executorClient: EvalsExecutorClient;
}): EvaluateDataset {
  return async ({ dataset }) => {
    await executorClient.runExperiment(
      {
        datasets: [dataset],
        task: async ({ input }) => {
          const response = await chatClient.converse({ messages: [{ message: input.question }] });
          return { messages: response.messages, steps: response.steps };
        },
      },
      [myCriteriaEvaluator, myToolCallsEvaluator]
    );
  };
}
In the spec:
import { tags } from '@kbn/evals';
import { evaluate as base } from '../src/evaluate';
import type { EvaluateDataset } from '../src/evaluate_dataset';
import { createEvaluateDataset } from '../src/evaluate_dataset';

const evaluate = base.extend<{ evaluateDataset: EvaluateDataset }, {}>({
  evaluateDataset: [
    // Await use() so the fixture stays alive until the test finishes
    async ({ chatClient, evaluators, executorClient }, use) => {
      await use(createEvaluateDataset({ chatClient, evaluators, executorClient }));
    },
    { scope: 'test' },
  ],
});

evaluate.describe('My suite', { tag: tags.serverless.search }, () => {
  evaluate('my test', async ({ evaluateDataset }) => {
    await evaluateDataset({ dataset: { name: '...', description: '...', examples: [...] } });
  });
});
Use evaluate.beforeAll / evaluate.afterAll for expensive one-time operations:
- Install docs: /internal/product_doc_base/install
- Create agents: fetch or kbnClient
- Load archives: esArchiver.load(archivePath) (requires custom fixture)

Always clean up in afterAll -- delete agents, uninstall docs, unload archives.
# Full interactive flow
node scripts/evals start
# Specify model and judge
node scripts/evals start --model <connector-id> --judge <connector-id>
# Filter to a specific test
node scripts/evals start --grep "my test name"
# Run directly (services already running)
node scripts/evals run --model <connector-id> --judge <connector-id>
- Set a tag on every evaluate.describe -- Scout validates tags at runtime.
- Always include afterAll cleanup -- leftover agents/docs pollute subsequent runs.
- Use --grep to target a single evaluate() block.
- Do not import evaluate from @kbn/evals when the suite has a custom src/evaluate.ts -- you'll miss custom fixtures.
- Do not use test instead of evaluate -- the evaluate fixture provides all the evals-specific wiring.

See also: evals-create-suite skill