evals-create-suite

Scaffold a new LLM evaluation suite package with Playwright config, evaluate fixture, and package files. Use when creating a new eval suite, adding an evals package for a plugin, or setting up the boilerplate for offline LLM evaluations.

Install command

npx skills add https://github.com/elastic/kibana --skill evals-create-suite

Copy and paste this command into Claude Code to install the skill

Source

elastic/kibana

Stars: 21,125

Forks: 8,587

Updated: April 23, 2026 at 22:14

name	evals-create-suite
disable-model-invocation	true
description	Scaffold a new LLM evaluation suite package with Playwright config, evaluate fixture, and package files. Use when creating a new eval suite, adding an evals package for a plugin, or setting up the boilerplate for offline LLM evaluations.

Create an Eval Suite

Overview

Eval suites live in dedicated kbn-evals-suite-<name> packages. Each suite is a self-contained Playwright project that uses the evaluate fixture from @kbn/evals to run LLM experiments with datasets, tasks, and evaluators.

Inputs to Collect

Suite name (kebab-case, e.g. my-feature)
Parent directory under x-pack/ (e.g. x-pack/platform/packages/shared/ai-infra/ or x-pack/solutions/security/test/)
Owner GitHub team handle (e.g. @elastic/appex-ai-infra)
Group (platform, security, observability, search)
Visibility (shared or private)
Whether custom fixtures are needed (chat client, esArchiver, supertest, etc.)
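
The inputs above can be captured in a small typed record before any files are written. This is a minimal sketch, not part of `@kbn/evals`: the `EvalSuiteInputs` shape, its field names, and `validateEvalSuiteInputs` are all hypothetical and only mirror the list above.

```typescript
// Hypothetical record of the inputs listed above; field names are illustrative only.
type SuiteGroup = 'platform' | 'security' | 'observability' | 'search';
type SuiteVisibility = 'shared' | 'private';

interface EvalSuiteInputs {
  suiteName: string; // kebab-case, e.g. "my-feature"
  parentDir: string; // e.g. "x-pack/platform/packages/shared/ai-infra/"
  owner: string; // GitHub team handle, e.g. "@elastic/appex-ai-infra"
  group: SuiteGroup;
  visibility: SuiteVisibility;
  needsCustomFixtures: boolean;
}

// Lowercase alphanumeric words separated by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Returns a list of human-readable problems; an empty list means the inputs look usable.
function validateEvalSuiteInputs(inputs: EvalSuiteInputs): string[] {
  const problems: string[] = [];
  if (!KEBAB_CASE.test(inputs.suiteName)) {
    problems.push(`suite name "${inputs.suiteName}" is not kebab-case`);
  }
  if (!inputs.parentDir.startsWith('x-pack/')) {
    problems.push(`parent directory "${inputs.parentDir}" is not under x-pack/`);
  }
  if (!inputs.owner.startsWith('@')) {
    problems.push(`owner "${inputs.owner}" is not a GitHub team handle`);
  }
  return problems;
}
```

Collecting problems into a list, rather than throwing on the first one, lets all missing or malformed inputs be reported back in a single round.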

Do NOT Use `node scripts/scout.js generate`

Eval suites are not standard Scout test configs. The Scout generator creates test/scout/ directories, which Scout's CI discovery glob picks up. Discovery then fails because eval configs use createPlaywrightEvalsConfig (not createPlaywrightConfig) and eval packages contain non-JS files (such as .text prompt files) that Playwright cannot parse.

The Scout team has explicitly asked that eval configs live outside test/scout/ directories. All eval suites place their playwright.config.ts in the package root.
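
A quick guard against this placement mistake might look like the sketch below. The helper name `isUnderScoutDir` is hypothetical, and because the exact Scout discovery glob is not given here, the check only looks for a `test/scout` path segment pair.

```typescript
// Returns true when a config path sits somewhere below a test/scout/ directory.
// Assumption: matching the adjacent "test" and "scout" segments approximates Scout's discovery.
function isUnderScoutDir(configPath: string): boolean {
  // Normalize Windows separators so the check works on any platform.
  const segments = configPath.replace(/\\/g, '/').split('/');
  return segments.some(
    (segment, index) => segment === 'test' && segments[index + 1] === 'scout'
  );
}
```

For example, a config at the package root of a `kbn-evals-suite-<name>` package passes the check, while any path containing `test/scout/` is flagged.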

Directory Layout

kbn-evals-suite-<name>/
├── evals/
│   └── <name>.spec.ts          # evaluation spec(s)
├── src/
│   └── evaluate.ts             # re-export or extend the base evaluate fixture
├── playwright.config.ts        # MUST be in package root, NOT under test/scout/
├── package.json
├── kibana.jsonc
└── tsconfig.json

File Templates

`kibana.jsonc`

{
  "type": "functional-tests",
  "id": "@kbn/evals-suite-<name>",
  "owner": "@elastic/<team>",
  "group": "<platform|security|observability|search>",
  "visibility": "<shared|private>"
}

type must be "functional-tests" -- not "shared-common" or "plugin".
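
As an illustration, the manifest can be generated from the collected inputs so that `type` is fixed rather than typed by hand. `buildEvalSuiteManifest` is a hypothetical helper written for this sketch, not a Kibana utility.

```typescript
interface EvalSuiteManifest {
  type: 'functional-tests'; // the only value the template allows
  id: string;
  owner: string;
  group: 'platform' | 'security' | 'observability' | 'search';
  visibility: 'shared' | 'private';
}

// Builds the kibana.jsonc content from the template above.
function buildEvalSuiteManifest(
  suiteName: string,
  owner: string,
  group: EvalSuiteManifest['group'],
  visibility: EvalSuiteManifest['visibility']
): EvalSuiteManifest {
  return {
    type: 'functional-tests',
    id: `@kbn/evals-suite-${suiteName}`,
    owner,
    group,
    visibility,
  };
}

// Serialized form, ready to write to kibana.jsonc.
const exampleManifestText = JSON.stringify(
  buildEvalSuiteManifest('my-feature', '@elastic/appex-ai-infra', 'platform', 'shared'),
  null,
  2
);
```

Typing `type` as the literal `'functional-tests'` means the compiler rejects `"shared-common"` or `"plugin"` before the file is ever written.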

`package.json`

{
  "name": "@kbn/evals-suite-<name>",
  "private": true,
  "version": "1.0.0",
  "license": "Elastic License 2.0"
}

`tsconfig.json`

{
  "extends": "@kbn/tsconfig-base/tsconfig.json",
  "compilerOptions": {
    "outDir": "target/types",
    "types": ["jest", "node"]
  },
  "include": ["**/*.ts"],
  "exclude": ["target/**/*"],
  "kbn_references": [
    "@kbn/evals",
    "@kbn/scout"
  ]
}

Add any additional package refs your suite imports to kbn_references (e.g. @kbn/inference-common, @kbn/es-archiver).
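
One way to spot missing entries is to compare the `@kbn/*` packages a file imports against `kbn_references`. The sketch below uses a simple regular expression and two hypothetical helpers; it is not how Kibana's own tooling resolves references, and it ignores dynamic imports and `require` calls.

```typescript
// Collects "@kbn/<package>" names from static import/export statements in a source string.
function findKbnImports(source: string): string[] {
  const pattern = /from\s+['"](@kbn\/[^'"\/]+)/g;
  const found = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    found.add(match[1]);
  }
  return Array.from(found).sort();
}

// Lists imported @kbn packages that are absent from kbn_references.
function missingKbnReferences(source: string, kbnReferences: string[]): string[] {
  return findKbnImports(source).filter((pkg) => !kbnReferences.includes(pkg));
}
```

Running this over a spec that imports `@kbn/inference-common` against the template's default references would report that one package as missing.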

`playwright.config.ts`

import Path from 'path';
import { createPlaywrightEvalsConfig } from '@kbn/evals';

export default createPlaywrightEvalsConfig({
  testDir: Path.resolve(__dirname, './evals'),
  timeout: 30 * 60_000,
});

Options:

testDir (required) -- directory containing .spec.ts files
timeout (optional, default 5 * 60_000) -- per-test timeout in ms
repetitions (optional, default 1) -- overridable via EVALUATION_REPETITIONS env var
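
The defaults above can be pictured as a small resolution step. This is only a sketch of the described behavior, not the `createPlaywrightEvalsConfig` implementation: `resolveEvalsOptionsSketch` is a hypothetical name, and it assumes a valid `EVALUATION_REPETITIONS` value takes precedence over the `repetitions` option.

```typescript
interface EvalsOptionsSketch {
  testDir: string;
  timeout?: number;
  repetitions?: number;
}

interface ResolvedEvalsOptionsSketch {
  testDir: string;
  timeout: number;
  repetitions: number;
}

// Applies the documented defaults; the env-over-option precedence is an assumption.
function resolveEvalsOptionsSketch(
  options: EvalsOptionsSketch,
  env: Record<string, string | undefined>
): ResolvedEvalsOptionsSketch {
  const fromEnv = Number.parseInt(env.EVALUATION_REPETITIONS ?? '', 10);
  const envIsValid = Number.isInteger(fromEnv) && fromEnv > 0;
  return {
    testDir: options.testDir,
    timeout: options.timeout ?? 5 * 60_000, // 5 minutes per test by default
    repetitions: envIsValid ? fromEnv : (options.repetitions ?? 1),
  };
}
```

With the template's `timeout: 30 * 60_000`, each test gets 30 minutes instead of the 5-minute default.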

`src/evaluate.ts`

Simple (no custom fixtures):

import { evaluate } from '@kbn/evals';

export { evaluate };

Extended (with custom fixtures):

import { evaluate as base } from '@kbn/evals';
import { MyChatClient } from './chat_client';

export const evaluate = base.extend<
  {},
  { chatClient: MyChatClient }
>({
  chatClient: [
    async ({ fetch, log, connector }, use) => {
      await use(new MyChatClient(fetch, log, connector.id));
    },
    { scope: 'worker' },
  ],
});
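
The `./chat_client` module imported above is not shown. A minimal sketch of what it could contain follows; the `HttpFetchSketch` and `LoggerSketch` types are stand-ins for the real fixture types, and the `converse` method and its request body are hypothetical, so adjust them to the endpoint your suite actually calls.

```typescript
// Stand-ins for the fixture types; the real ones come from @kbn/evals and may differ.
type HttpFetchSketch = (
  path: string,
  options: { method: string; body: string }
) => Promise<unknown>;

interface LoggerSketch {
  debug(message: string): void;
}

class MyChatClient {
  private readonly fetch: HttpFetchSketch;
  private readonly log: LoggerSketch;
  private readonly connectorId: string;

  constructor(fetch: HttpFetchSketch, log: LoggerSketch, connectorId: string) {
    this.fetch = fetch;
    this.log = log;
    this.connectorId = connectorId;
  }

  // Sends one message to a chat endpoint. The body shape is an assumption for illustration.
  async converse(input: string): Promise<unknown> {
    this.log.debug(`Sending message via connector ${this.connectorId}`);
    return this.fetch('/api/agent_builder/converse', {
      method: 'POST',
      body: JSON.stringify({ connector_id: this.connectorId, input }),
    });
  }
}
```

Taking `fetch` and `log` through the constructor keeps the client easy to exercise with fakes, independent of a running Kibana.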

When to Extend `evaluate`

Use the base evaluate directly when your task calls Kibana APIs through the built-in fetch, inferenceClient, or executorClient fixtures.

Extend when you need:

A chat client that wraps a specific Kibana API endpoint (e.g. /api/agent_builder/converse)
An evaluateDataset helper that encapsulates the runExperiment + evaluator wiring for a consistent pattern across specs
esArchiver for loading/unloading ES archives in setup/teardown
supertest for direct HTTP assertions against Kibana
Domain-specific API clients (e.g. QuickstartClient)

Real examples

| Suite | Approach | Why |
| --- | --- | --- |
| `llm-tasks` | Base `evaluate` directly | Calls task functions in-process; custom CODE evaluators inline |
| `agent-builder` | Extended with `chatClient` + Phoenix executor | Needs HTTP chat client and external Phoenix executor |
| `security-solution-evals` | Extended with `chatClient`, `esArchiver`, `supertest`, `quickApiClient` | Domain-heavy setup: loads ES archives, uses generated API client |

Suite Registration

Add an entry to .buildkite/pipelines/evals/evals.suites.json:

{
  "id": "<name>",
  "name": "<Human Readable Name>",
  "configPath": "<repo-relative path to playwright.config.ts>",
  "tags": ["<group>", "<name>"],
  "ciLabels": ["evals:<name>"]
}

Registration is optional for local dev (suites are auto-discovered from createPlaywrightEvalsConfig imports), but required for CI labeling and node scripts/evals list.
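
The registration entry follows mechanically from the suite name, group, and package location, so it can be derived rather than hand-written. `buildSuiteRegistration` below is a hypothetical helper that only fills in the template above.

```typescript
interface EvalSuiteRegistration {
  id: string;
  name: string;
  configPath: string;
  tags: string[];
  ciLabels: string[];
}

// Fills in the evals.suites.json entry template for one suite.
function buildSuiteRegistration(
  suiteName: string,
  displayName: string,
  group: string,
  packageDir: string // repo-relative path to the kbn-evals-suite-<name> package
): EvalSuiteRegistration {
  // Drop trailing slashes so the joined config path has exactly one separator.
  const dir = packageDir.replace(/\/+$/, '');
  return {
    id: suiteName,
    name: displayName,
    configPath: `${dir}/playwright.config.ts`,
    tags: [group, suiteName],
    ciLabels: [`evals:${suiteName}`],
  };
}
```

Because `configPath` is built from the package directory, the config stays in the package root, consistent with the layout described earlier.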

Post-Scaffold Steps

1. Run yarn kbn bootstrap to register the new package.
2. Verify the suite appears: node scripts/evals list.
3. Create your first spec file under evals/ (see the evals-write-spec skill).
4. Run locally: node scripts/evals start --model <connector-id> --judge <connector-id>.

Common Mistakes

Placing configs under test/scout/ -- Scout's CI discovery will find them and crash. Keep playwright.config.ts in the package root.
Using node scripts/scout.js generate -- this creates Scout test scaffolds, not eval suites. Scaffold manually using the templates above.
Setting type to anything other than "functional-tests" in kibana.jsonc.
Forgetting @kbn/evals in kbn_references -- causes TS resolution failures.
Using Path.join instead of Path.resolve for testDir -- Playwright needs an absolute path.
Importing evaluate from @kbn/evals in evals/ specs when the suite's src/evaluate.ts exports an extended fixture -- the spec then runs without the custom fixtures. When extending, always import evaluate from the suite's own src/evaluate.
Forgetting to run yarn kbn bootstrap after creating the package.
