evals-write-spec

Name: Evals Write Spec
Author: elastic

Write LLM evaluation spec files with datasets, tasks, and evaluators using the @kbn/evals Playwright fixture. Use when authoring new eval specs, adding datasets or evaluators, or debugging evaluation test failures.

Skill metadata

Stars: 21,125

Forks: 8,587

Updated: May 29, 2026 at 21:18


Source

elastic

elastic/kibana


name	evals-write-spec
disable-model-invocation	true
description	Write LLM evaluation spec files with datasets, tasks, and evaluators using the @kbn/evals Playwright fixture. Use when authoring new eval specs, adding datasets or evaluators, or debugging evaluation test failures.

Write Eval Specs

Spec File Anatomy

Eval specs use the evaluate Playwright fixture (not test). A spec file follows this structure:

import { evaluate, tags, selectEvaluators, type Example, type TaskOutput } from '@kbn/evals';

evaluate.describe('Suite name', { tag: tags.serverless.observability.complete }, () => {
  evaluate.beforeAll(async ({ fetch, log }) => {
    // one-time setup: install docs, create agents, load archives
  });

  evaluate.afterAll(async ({ fetch, log }) => {
    // teardown: uninstall docs, delete agents, unload archives
  });

  evaluate('test name', async ({ executorClient, connector }) => {
    // `dataset`, `task`, and `evaluators` are placeholders; see Datasets, Tasks, and Evaluators
    await executorClient.runExperiment(
      { datasets: [dataset], task },
      evaluators
    );
  });
});

When a suite has a custom src/evaluate.ts, import from there instead of @kbn/evals:

import { evaluate } from '../src/evaluate';

Datasets

A dataset is an array of examples with typed input, output (expected), and optional metadata:

type MyExample = Example<
  { question: string },
  { expectedAnswer: string },
  { tags?: string[] }
>;

const dataset = {
  name: 'my-dataset',
  description: 'What this dataset tests',
  examples: [
    {
      input: { question: 'What is 2+2?' },
      output: { expectedAnswer: '4' },
      metadata: { tags: ['math'] },
    },
  ],
};

Keep datasets focused. For local iteration, use --grep to run a subset:

node scripts/evals start --grep "my test name"
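Besides --grep, one way to keep a dataset focused during local iteration is to derive a smaller dataset from the metadata tags. The sketch below is only an illustration: the `Example` and `Dataset` interfaces are local stand-ins for the shapes shown above (the real types come from @kbn/evals), and `filterByTag` is a hypothetical helper, not part of the package.

```typescript
// Local stand-ins for the shapes shown above; the real types come from @kbn/evals.
interface Example<TInput, TOutput, TMetadata> {
  input: TInput;
  output: TOutput;
  metadata?: TMetadata;
}

interface Dataset<TExample> {
  name: string;
  description: string;
  examples: TExample[];
}

type MyExample = Example<
  { question: string },
  { expectedAnswer: string },
  { tags?: string[] }
>;

// Hypothetical helper: returns a copy of the dataset that keeps only the
// examples whose metadata carries the given tag. The input is not mutated.
function filterByTag(dataset: Dataset<MyExample>, tag: string): Dataset<MyExample> {
  return {
    ...dataset,
    name: `${dataset.name}-${tag}`,
    examples: dataset.examples.filter(
      (example) => example.metadata?.tags?.includes(tag) ?? false
    ),
  };
}

const fullDataset: Dataset<MyExample> = {
  name: 'my-dataset',
  description: 'What this dataset tests',
  examples: [
    {
      input: { question: 'What is 2+2?' },
      output: { expectedAnswer: '4' },
      metadata: { tags: ['math'] },
    },
    {
      input: { question: 'What is the capital of France?' },
      output: { expectedAnswer: 'Paris' },
      metadata: { tags: ['geography'] },
    },
    {
      // No metadata: this example is dropped by any tag filter.
      input: { question: 'What color is the sky?' },
      output: { expectedAnswer: 'Blue' },
    },
  ],
};

const mathOnly = filterByTag(fullDataset, 'math');
```

Giving the derived dataset its own name keeps its results separate from the full dataset if both are run.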

Tasks

The task function receives an example and returns the output to evaluate:

task: async ({ input }) => {
  const result = await someKibanaApi(input.question);
  return { answer: result.content };
}

Tasks can use any fixture available in the evaluate callback: fetch, inferenceClient, connector, esClient, kbnClient, or custom fixtures like chatClient.

Evaluators

There are two ways to provide evaluators to runExperiment:

Inline array -- pass evaluator objects directly (simple suites)
selectEvaluators -- typed wrapper that enforces Example/TaskOutput generics

CODE Evaluators

Deterministic, no LLM call. Use for binary checks:

{
  name: 'NonEmpty',
  kind: 'CODE',
  evaluate: async ({ output }) => ({
    score: (output?.documents?.length ?? 0) > 0 ? 1 : 0,
  }),
}
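A CODE evaluator is easier to test when its scoring logic lives in a pure function that the evaluator object wraps. The following is a minimal sketch under that assumption: `EvaluatorParams` and `CodeEvaluator` are local stand-ins for the real @kbn/evals types, and `scoreContains` and `containsExpectedAnswer` are hypothetical names used only for illustration.

```typescript
// Local stand-ins for the evaluator shapes; the real types come from @kbn/evals.
interface EvaluatorParams<TOutput, TExpected> {
  output: TOutput | undefined;
  expected: TExpected;
}

interface CodeEvaluator<TOutput, TExpected> {
  name: string;
  kind: 'CODE';
  evaluate: (params: EvaluatorParams<TOutput, TExpected>) => Promise<{ score: number }>;
}

// Pure scoring function: 1 if the answer contains the expected text
// (case-insensitive), 0 otherwise or when the answer is missing or empty.
function scoreContains(answer: string | undefined, expectedAnswer: string): number {
  if (!answer) {
    return 0;
  }
  return answer.toLowerCase().includes(expectedAnswer.toLowerCase()) ? 1 : 0;
}

// Hypothetical evaluator that delegates to the pure function above.
const containsExpectedAnswer: CodeEvaluator<
  { answer: string },
  { expectedAnswer: string }
> = {
  name: 'ContainsExpectedAnswer',
  kind: 'CODE',
  evaluate: async ({ output, expected }) => ({
    score: scoreContains(output?.answer, expected.expectedAnswer),
  }),
};
```

Because the check is deterministic, the scoring function can be unit-tested without running an experiment.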

LLM-as-Judge Criteria

Use evaluators.criteria(criteriaArray) for subjective quality checks. The judge LLM scores each criterion:

evaluators.criteria([
  'The response correctly identifies the top users.',
  'The response includes risk scores.',
]).evaluate({ input, output, expected, metadata })

Correctness Analysis

Compares output against expected answer:

evaluators.correctnessAnalysis().evaluate({ input, output, expected, metadata })

Groundedness Analysis

Checks if output is grounded in provided context:

evaluators.groundednessAnalysis().evaluate({ input, output, expected, metadata })

Trace-Based Evaluators

Available from evaluators.traceBasedEvaluators:

inputTokens, outputTokens, cachedTokens -- token usage
toolCalls -- number of tool calls
latency -- span latency in seconds

These read from the tracing ES cluster and require EDOT to be running.

RAG Evaluators

For retrieval-augmented generation with ground truth:

import { createPrecisionAtKEvaluator, createRecallAtKEvaluator, createF1AtKEvaluator } from '@kbn/evals';

See references/evaluator-patterns.md for full examples.
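For orientation, the metrics behind these evaluator names can be sketched as plain functions. This shows the conventional definitions of precision@k, recall@k, and F1@k, not the @kbn/evals implementation. The function names are hypothetical, and the sketch assumes retrieved document IDs are unique and ordered by rank.

```typescript
// Number of relevant documents among the top-k retrieved IDs.
// Assumes `retrieved` is ordered by rank and contains no duplicates.
function countHits(retrieved: string[], relevant: Set<string>, k: number): number {
  return retrieved.slice(0, Math.max(0, k)).filter((id) => relevant.has(id)).length;
}

// precision@k = (relevant documents in the top k) / k
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) {
    return 0;
  }
  return countHits(retrieved, relevant, k) / k;
}

// recall@k = (relevant documents in the top k) / (all relevant documents)
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) {
    return 0;
  }
  return countHits(retrieved, relevant, k) / relevant.size;
}

// F1@k = harmonic mean of precision@k and recall@k
function f1AtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const precision = precisionAtK(retrieved, relevant, k);
  const recall = recallAtK(retrieved, relevant, k);
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}

// Example: four ranked results, three ground-truth documents.
const retrievedIds = ['a', 'b', 'c', 'd'];
const relevantIds = new Set(['a', 'c', 'e']);

const p2 = precisionAtK(retrievedIds, relevantIds, 2); // 1 hit in top 2 -> 1/2
const r2 = recallAtK(retrievedIds, relevantIds, 2);    // 1 of 3 relevant -> 1/3
const f2 = f1AtK(retrievedIds, relevantIds, 2);
```

Dividing precision by k even when fewer than k documents are returned is one common convention; the library evaluators may handle that case differently.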

Available Fixtures

Fixture	Scope	Description
`executorClient`	worker	Runs experiments, exports scores to ES
`inferenceClient`	worker	Inference REST client bound to connector
`connector`	worker	The model connector being evaluated
`evaluationConnector`	worker	The judge connector
`evaluators`	worker	`DefaultEvaluators` (criteria, correctness, groundedness, trace-based)
`fetch`	worker	`HttpHandler` for Kibana API calls
`esClient`	worker	Elasticsearch client (Scout cluster)
`kbnClient`	worker	Kibana client with retries
`traceEsClient`	worker	ES client for trace queries
`evaluationsEsClient`	worker	ES client for evaluation score storage
`log`	worker	`ToolingLog` for structured logging
`repetitions`	worker	Number of experiment repetitions
`config`	worker	Scout server config (hosts, auth)

The `evaluateDataset` Pattern

For suites with many specs that share the same task + evaluator wiring, extract a reusable helper:

src/evaluate_dataset.ts:

import type { DefaultEvaluators, EvalsExecutorClient } from '@kbn/evals';
import type { MyChatClient } from './chat_client';
// MyExample is the suite's own Example type (see Datasets); define or import it in this file.

export type EvaluateDataset = (opts: {
  dataset: { name: string; description: string; examples: MyExample[] };
}) => Promise<void>;

export function createEvaluateDataset({
  chatClient, evaluators, executorClient,
}: {
  chatClient: MyChatClient;
  evaluators: DefaultEvaluators;
  executorClient: EvalsExecutorClient;
}): EvaluateDataset {
  return async ({ dataset }) => {
    await executorClient.runExperiment(
      {
        datasets: [dataset],
        task: async ({ input }) => {
          const response = await chatClient.converse({ messages: [{ message: input.question }] });
          return { messages: response.messages, steps: response.steps };
        },
      },
      // Placeholders for suite-specific evaluators (for example, built from `evaluators`)
      [myCriteriaEvaluator, myToolCallsEvaluator]
    );
  };
}

In the spec:

import { tags } from '@kbn/evals';
import { evaluate as base } from '../src/evaluate';
import type { EvaluateDataset } from '../src/evaluate_dataset';
import { createEvaluateDataset } from '../src/evaluate_dataset';

const evaluate = base.extend<{ evaluateDataset: EvaluateDataset }, {}>({
  evaluateDataset: [
    async ({ chatClient, evaluators, executorClient }, use) => {
      await use(createEvaluateDataset({ chatClient, evaluators, executorClient }));
    },
    { scope: 'test' },
  ],
});

evaluate.describe('My suite', { tag: tags.serverless.search }, () => {
  evaluate('my test', async ({ evaluateDataset }) => {
    await evaluateDataset({ dataset: { name: '...', description: '...', examples: [...] } });
  });
});

Setup and Teardown

Use evaluate.beforeAll / evaluate.afterAll for expensive one-time operations:

Install product docs: POST to /internal/product_doc_base/install
Create agents/rules: Use fetch or kbnClient
Load ES archives: Use esArchiver.load(archivePath) (requires custom fixture)

Always clean up in afterAll -- delete agents, uninstall docs, unload archives.
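One way to make cleanup harder to forget is to register an undo step at the moment each resource is created, then run all registered steps in afterAll. The sketch below illustrates that idea with a hypothetical `CleanupStack` class. It is not part of @kbn/evals, and the resources here are stand-ins that only record the order in which cleanup ran.

```typescript
type Cleanup = () => Promise<void>;

// Hypothetical helper: collects cleanup callbacks and runs them in reverse
// registration order, so resources created last are removed first.
class CleanupStack {
  private readonly cleanups: Cleanup[] = [];

  register(cleanup: Cleanup): void {
    this.cleanups.push(cleanup);
  }

  // Runs every registered cleanup even if one fails, then rethrows the
  // first error so the failure is still visible.
  async runAll(): Promise<void> {
    let failed = false;
    let firstError: unknown;
    for (;;) {
      const cleanup = this.cleanups.pop();
      if (!cleanup) {
        break;
      }
      try {
        await cleanup();
      } catch (error) {
        if (!failed) {
          failed = true;
          firstError = error;
        }
      }
    }
    if (failed) {
      throw firstError;
    }
  }
}

// Stand-in usage: each callback records its label instead of calling Kibana.
const cleanupOrder: string[] = [];
const cleanupStack = new CleanupStack();

cleanupStack.register(async () => {
  cleanupOrder.push('uninstall docs');
});
cleanupStack.register(async () => {
  cleanupOrder.push('delete agent');
});

const cleanupDone = cleanupStack.runAll();
```

In a spec, the `register` calls would sit next to the setup calls in evaluate.beforeAll, and evaluate.afterAll would await `runAll()`.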

Running Locally

# Full interactive flow
node scripts/evals start

# Specify model and judge
node scripts/evals start --model <connector-id> --judge <connector-id>

# Filter to a specific test
node scripts/evals start --grep "my test name"

# Run directly (services already running)
node scripts/evals run --model <connector-id> --judge <connector-id>

Common Mistakes

Forgetting the tag on evaluate.describe -- Scout validates tags at runtime.
Missing afterAll cleanup -- leftover agents/docs pollute subsequent runs.
Overly large datasets for local iteration -- use --grep to target a single evaluate() block.
Importing evaluate from @kbn/evals when the suite has a custom src/evaluate.ts -- you'll miss custom fixtures.
Using test instead of evaluate -- the evaluate fixture provides all the evals-specific wiring.

References

Evaluator type examples with real code: references/evaluator-patterns.md
Suite scaffolding: use the evals-create-suite skill

Tags

Tag	When to use
`tags.serverless.observability.complete`	Observability domain evals
`tags.serverless.security.complete`	Security domain evals
`tags.serverless.search`	Search domain evals
`tags.stateful.classic`	Stateful-only evals
