golem-skill-harness
Developing, testing, and running Golem skill tests with the skill test harness. Use when creating new skills, writing scenario YAML files, running skill tests locally, or debugging skill test failures.
| name | golem-skill-harness |
| description | Developing, testing, and running Golem skill tests with the skill test harness. Use when creating new skills, writing scenario YAML files, running skill tests locally, or debugging skill test failures. |
The skill test harness lives in golem-skills/tests/harness/. It drives coding agents (Claude Code, OpenCode, Codex) through scenario YAML files, verifying that skills are activated and produce correct results. Skill definitions live in golem-skills/skills/.
Skills in golem-skills/skills/ are organized by language scope:
golem-skills/skills/
  common/                  # Language-independent skills (included for all languages)
    golem-new-project/
      SKILL.md
  rust/                    # Rust-specific skills (included only for Rust projects)
    golem-add-rust-crate/
      SKILL.md
  ts/                      # TypeScript-specific skills (included only for TS projects)
    golem-add-npm-package/
      SKILL.md
  scala/                   # Scala-specific skills (included only for Scala projects)
  moonbit/                 # MoonBit-specific skills (included only for MoonBit projects)
When golem new creates a project, it embeds the common/ skills plus the language-specific skills into the project's .agents/skills/ and .claude/skills/ directories.
Skills are embedded in the golem / golem-cli binaries. If you add or modify a skill under golem-skills/skills/, you must recompile the binaries before the changes take effect — including before running the skill test harness.
cargo make build-release-full
Without this step, golem new will still emit the old skill content, and the harness will test against stale skills.
Each SKILL.md is also republished as a How-To Guide on learn.golem.cloud under docs/src/content/how-to-guides/. After adding or editing a skill, regenerate those MDX pages:
cargo make generate-docs-skills
CI's check-docs-skills task will fail any PR that changes golem-skills/skills/ without also updating the generated MDX.
- golem binary in $GOLEM_PATH/target/release/ or $GOLEM_PATH/target/debug/. Build with cargo build -p golem (debug) or cargo build -p golem --release (release). The harness prefers the release build and falls back to debug.
- Golem server: the harness starts golem server run --data-dir <workspaces/golem-server-data> --clean and stops it when done. If a server is already running on port 9881, the harness fails with an error to avoid conflicts.
- Agent CLI: claude (Claude Code), opencode, or codex
- File watcher: fswatch on macOS, inotify-tools on Linux
- Golem repository location ($GOLEM_PATH): auto-detection searches from cwd for sdks/rust/golem-rust and sdks/ts/packages directories (same markers as golem-cli). If auto-detection also fails, the harness exits with an error. The resolved target directory (target/release or target/debug) is prepended to PATH so all spawned processes (including agent drivers) use the correct golem and golem-cli binaries.
- Rust: cargo-component and wasm32-wasip2 target
- TypeScript: pnpm, wasm-rquickjs-cli, TS SDK built (cargo make build-sdk-ts)
- MoonBit: moon (MoonBit toolchain), wasm-tools

cd golem-skills/tests/harness
npm install
npm run build
The build script runs ESLint then tsc, so lint errors will fail the build.
The harness uses ESLint 9 with typescript-eslint for linting and Prettier for formatting. Configuration files:
- eslint.config.js: ESLint flat config with typescript-eslint recommended rules
- .prettierrc: Prettier config (2-space indent, double quotes, trailing commas, 100 char width)

cd golem-skills/tests/harness
npm run lint # Check for lint errors
npm run lint:fix # Auto-fix lint errors
npm run format:check # Check formatting without changing files
npm run format # Auto-format all source files
Always run npm run lint:fix and npm run format before committing harness changes. CI enforces both lint (via npm run build) and formatting (via npm run format:check).
cd golem-skills/tests/harness
npm test
From golem-skills/tests/harness/:
npx tsx src/run.ts [options]
| Option | Description | Default |
|---|---|---|
| --agent <name> | Agent driver: claude-code, opencode, codex, or all | all |
| --language <lang> | Language: ts, rust, or all | all |
| --scenario <name> | Run only the named scenario | all scenarios |
| --scenarios <dir> | Path to scenario YAML directory | ./scenarios |
| --output <dir> | Results output directory | ./results |
| --timeout <seconds> | Global timeout per step | 300 |
| --dry-run | Validate scenarios without executing | false |
| --resume-from <id> | Resume from a specific step ID | — |
| --workspace <path> | Override workspace directory | — |
| --merge-reports <dir> | Merge summary.json files into aggregated report | — |
# Run a single scenario with Claude Code for Rust
npx tsx src/run.ts --agent claude-code --language rust --scenario golem-new-project-rust
# Dry-run to validate YAML
npx tsx src/run.ts --dry-run --scenario golem-db-app-ts
# Resume a failed scenario from a specific step, reusing a previous workspace
npx tsx src/run.ts --agent claude-code --language ts --scenario golem-db-app-ts \
--resume-from build-and-deploy --workspace ./workspaces/<run-id>/golem-db-app-ts/ts
# Merge reports from multiple CI runs
npx tsx src/run.ts --merge-reports ./ci-results --output ./merged
Each harness run generates a unique run ID (UUID). Without --workspace, each scenario gets its own directory at <cwd>/workspaces/<run-id>/<scenario-name>/<language>/. With --workspace, the same structure is created under the specified root: <workspace>/<run-id>/<scenario-name>/<language>/. Workspace directories are never deleted, so you can inspect the results after the run.
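The directory layout described above can be sketched as follows. This is only an illustration under stated assumptions, not the harness's own code: the helper name workspaceDir and its parameters are hypothetical, and the run ID is assumed to come from Node's crypto.randomUUID.

```typescript
import { randomUUID } from "node:crypto";
import * as path from "node:path";

// Hypothetical helper: builds <root>/<run-id>/<scenario-name>/<language>.
// The root is <cwd>/workspaces by default, or the --workspace value if given.
function workspaceDir(
  runId: string,
  scenario: string,
  language: string,
  workspaceOverride?: string,
): string {
  const root = workspaceOverride ?? path.join(process.cwd(), "workspaces");
  return path.join(root, runId, scenario, language);
}

const runId = randomUUID();
console.log(workspaceDir(runId, "golem-db-app-ts", "ts"));
console.log(workspaceDir(runId, "golem-db-app-ts", "ts", "/tmp/my-workspace"));
```

Because the run ID is part of the path, two runs of the same scenario never share a directory unless the path is reused deliberately.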
The harness manages the Golem server automatically:
- Starts golem server run --data-dir <workspaces/<run-id>/golem-server-data> --clean and waits up to 60 seconds for the healthcheck to pass.
- Uses a clean start (--clean) to ensure a fresh state for each scenario.
- Verifies that the local Golem profile exists and the server is still reachable.

Create the skill under the appropriate subdirectory of golem-skills/skills/:
- common/<skill-name>/SKILL.md: for language-independent skills
- rust/<skill-name>/SKILL.md: for Rust-specific skills
- ts/<skill-name>/SKILL.md: for TypeScript-specific skills
- scala/<skill-name>/SKILL.md: for Scala-specific skills
- moonbit/<skill-name>/SKILL.md: for MoonBit-specific skills

Use YAML frontmatter:
---
name: my-new-skill
description: "What the skill does. Use when <trigger conditions>."
---
# Skill Title
Instructions for the agent...
golem-cli --help

If the skill is relevant to one or more golem-cli subcommands, add a SkillBinding entry. When an automated coding agent invokes golem-cli ... --help inside a Golem application that has the skill installed, a Relevant skills: block linking to the skill's SKILL.md is then appended to that command's long help.
Edit cli/golem-cli/src/agent_help_hints/builtin_skill_map.rs and add a row to SKILL_BINDINGS:
// Common (language-independent) skill:
SkillBinding {
    cli_path: &["agent", "delete"],
    basename: "golem-delete-agent",
    kind: SkillKind::Common,
    summary: "Delete an agent instance.",
},
// Per-language skill (one variant per listed language; folder is
// `<basename>-<lang>` where lang is rust|ts|scala|moonbit):
SkillBinding {
    cli_path: &["secret", "create"],
    basename: "golem-add-secret",
    kind: SkillKind::PerLanguage(ALL_LANGS),
    summary: "Add a typed secret available to your agents.",
},
Rules:
- cli_path is the chain of subcommand names exactly as they appear in clap's tree (kebab-case, e.g. &["agent", "cancel-invocation"]).
- basename is the skill folder name without any language suffix.
- kind is SkillKind::Common for language-independent skills, or SkillKind::PerLanguage(...) for per-language ones.
- summary is a one-line, language-agnostic description shown above the file links.
- cli_path can appear in multiple bindings; they are merged into a single block under that command in source order.
- A binding whose skill is not present in <app_dir>/.agents/skills/ is silently skipped at runtime, so adding speculative bindings is safe.

Two compile-time tests guard the table:

- every_binding_basename_exists_in_golem_skills_repo: fails if the named skill folder is missing from golem-skills/skills/.
- every_binding_path_resolves_in_clap_tree: fails if cli_path doesn't match a real subcommand (catches CLI renames).

Run them with:
cargo test -p golem-cli --lib -- agent_help_hints
After creating or modifying a skill, recompile so the changes are embedded:
cargo make build-release-full
Create golem-skills/tests/harness/scenarios/<scenario-name>.yaml:
name: "my-scenario"
settings:
  timeout_per_subprompt: 300
  golem_server:
    custom_request_port: 9006
steps:
  - id: "step-one"
    prompt: "Do something using the skill"
    expectedSkills:
      - "my-new-skill"
    verify:
      build: true
npx tsx src/run.ts --agent claude-code --language rust --scenario my-scenario
name: "scenario-name"              # Required. Unique scenario identifier.
settings:
  timeout_per_subprompt: 300       # Default timeout for prompt steps (seconds)
  golem_server:
    router_port: 9881              # Golem router port (for healthcheck)
    custom_request_port: 9006      # Sets GOLEM_CUSTOM_REQUEST_PORT env var
  cleanup: true                    # Whether to clean workspace before run
prerequisites:
  env:                             # Extra env vars set during execution
    DATABASE_URL: "postgres://..."
skip_if:                           # Skip entire scenario conditionally
  language: "ts"                   # Skip when language is "ts"
  agent: "codex"                   # Skip when agent is "codex"
  os: "windows"                    # Skip when OS matches (darwin→macos, win32→windows)
steps: [...]                       # Required. At least one step.
Every step must have exactly one action field. Common fields available on all steps:
- id: "unique-step-id"     # Optional. Used for --resume-from.
  timeout: 600             # Override step timeout (seconds)
  expect: { ... }          # Assertions (see below)
  retry:                   # Retry on failure
    attempts: 3
    delay: 5               # Seconds between retries
  only_if:                 # Run only when conditions match
    language: "rust"
    agent: "claude-code"
    os: "macos"
  skip_if:                 # Skip when conditions match
    language: "ts"
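How only_if and skip_if might be evaluated can be sketched as below. The type and function names are hypothetical, and the sketch makes one notable assumption that the text does not confirm: when several keys are listed in a condition, all of them must match. The OS mapping covers only the two cases named in the schema comments (darwin→macos, win32→windows).

```typescript
type Conditions = { language?: string; agent?: string; os?: string };
type RunContext = { language: string; agent: string; platform: string };

// Maps Node's process.platform value to the OS name used in scenario files.
function osName(platform: string): string {
  if (platform === "darwin") return "macos";
  if (platform === "win32") return "windows";
  return platform;
}

// Assumption: every key present in the condition must match (AND semantics).
function matches(cond: Conditions, ctx: RunContext): boolean {
  return (
    (cond.language === undefined || cond.language === ctx.language) &&
    (cond.agent === undefined || cond.agent === ctx.agent) &&
    (cond.os === undefined || cond.os === osName(ctx.platform))
  );
}

// A step runs when only_if (if present) matches and skip_if (if present) does not.
function shouldRun(
  step: { only_if?: Conditions; skip_if?: Conditions },
  ctx: RunContext,
): boolean {
  if (step.only_if !== undefined && !matches(step.only_if, ctx)) return false;
  if (step.skip_if !== undefined && matches(step.skip_if, ctx)) return false;
  return true;
}

const ctx: RunContext = { language: "rust", agent: "claude-code", platform: process.platform };
console.log(shouldRun({ skip_if: { language: "ts" } }, ctx)); // true: the language is rust
```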
prompt - Send a prompt to the coding agent

- id: "create-app"
  prompt: "Create a new Golem application called my-app with Rust."
  expectedSkills:                # Skills that MUST be activated
    - "golem-new-project"
  allowedExtraSkills:            # Extra skills that are OK to activate
    - "golem-db-app-rust"
  strictSkillMatch: false        # If true, ONLY expectedSkills may activate
  continueSession: true          # Continue previous agent session and keep cumulative
                                 # skill tracking for that prompt session.
                                 # Set to false to start a fresh agent session with
                                 # fresh skill tracking.
  verify:
    build: true                  # Run `golem build` after the prompt
    deploy: true                 # Run `golem build` + `golem deploy --yes`
create_project - Create a Golem project directly (without an agent prompt)

Runs golem new <name> --template <language> --yes in the workspace, automatically using the current language as the template. Useful when a scenario needs a pre-existing project without involving the agent.

- id: "setup-project"
  create_project:
    name: "my-app"
  verify:
    build: true
    deploy: true

With language-conditional presets:

- id: "setup-project"
  create_project:
    name: "my-app"
    presets:
      rust: ["some-rust-preset"]
      ts: ["some-ts-preset"]
  verify:
    build: true
    deploy: true
shell - Run a shell command

- id: "check-files"
  shell:
    command: "ls"
    args: ["my-app/golem.yaml"]
    cwd: "subdirectory"          # Relative to workspace
  expect:
    exit_code: 0
    stdout_contains: "golem.yaml"
http - Make an HTTP request

- id: "call-api"
  http:
    url: "http://my-app.localhost:9006/path"
    method: "POST"               # GET, POST, PUT, DELETE, PATCH
    headers:
      Content-Type: "application/json"
    body: '{"key": "value"}'
  expect:
    status: 200
    body_contains: "expected text"
    body_matches: "regex.*pattern"
invoke - Invoke a Golem agent function via CLI

- id: "call-function"
  invoke:
    agent: 'CounterAgent("my-counter")'
    method: "increment"
    args: '"hello"'              # Optional function arguments
  expect:
    stdout_contains: "1"

Use the real method name as it appears in source code, not a kebab-cased external name. For cross-language scenarios, method and args can be language-conditional:

- id: "call-function"
  invoke:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "create_item"
      ts: "createItem"
      scala: "createItem"
    args: '{id: "item-1", name: "Hammer"}'
Prompts must use language-appropriate method name casing (snake_case for Rust/MoonBit, camelCase for TypeScript/Scala) — not kebab-case. Invocation steps must also use the source-language method names that the generated code actually exposes.
invoke_json - Invoke with --json output

Same as invoke but requests JSON-formatted CLI output. Supports result_json assertions with JSONPath.

result_json assertions are evaluated against the unwrapped invocation result value, not the full CLI envelope. That means:

- Record results: use field paths like $.id
- Scalar results: use $
- List results: use $ or list element paths like $[0].id

- id: "call-json"
  invoke_json:
    agent: 'MyAgent("test")'
    method: "getData"
  expect:
    result_json:
      - path: "$.name"
        equals: "test"
      - path: "$.items[0]"
        contains: "expected"

Cross-language example:

- id: "create-item"
  invoke_json:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "create_item"
      ts: "createItem"
      scala: "createItem"
    args: '{id: "item-1", name: "Hammer"}'
  expect:
    result_json:
      - path: "$.id"
        equals: "item-1"
      - path: "$.name"
        equals: "Hammer"
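To make the result_json semantics concrete, here is a minimal sketch of how equals and contains could be checked against an unwrapped result. It is not the harness implementation: resolvePath and checkAssertion are hypothetical names, and the path resolver handles only a small JSONPath subset ($, .field, and [index]) rather than full JSONPath.

```typescript
import { isDeepStrictEqual } from "node:util";

type JsonAssertion = { path: string; equals?: unknown; contains?: string };

// Resolves a small JSONPath subset: "$", ".field", and "[index]".
function resolvePath(value: unknown, path: string): unknown {
  const rest = path.startsWith("$") ? path.slice(1) : null;
  const tokens = rest === null ? null : (rest.match(/\.[A-Za-z_]\w*|\[\d+\]/g) ?? []);
  if (rest === null || tokens === null || tokens.join("") !== rest) {
    throw new Error(`unsupported path in this sketch: ${path}`);
  }
  let current: unknown = value;
  for (const token of tokens) {
    if (current === null || typeof current !== "object") return undefined;
    const key = token.startsWith(".") ? token.slice(1) : Number(token.slice(1, -1));
    current = (current as Record<string | number, unknown>)[key];
  }
  return current;
}

// equals: deep equality; contains: substring match on the stringified value.
function checkAssertion(result: unknown, assertion: JsonAssertion): boolean {
  const actual = resolvePath(result, assertion.path);
  if (assertion.equals !== undefined && !isDeepStrictEqual(actual, assertion.equals)) {
    return false;
  }
  if (assertion.contains !== undefined) {
    const text = typeof actual === "string" ? actual : JSON.stringify(actual);
    if (text === undefined || !text.includes(assertion.contains)) return false;
  }
  return true;
}

const result = { id: "item-1", name: "Hammer" };
console.log(checkAssertion(result, { path: "$.id", equals: "item-1" })); // true
```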
create_agent - Create a Golem agent

- id: "make-agent"
  create_agent:
    name: 'MyAgent("instance-1")'
    env:
      KEY: "value"
    config:
      setting: "value"

delete_agent - Delete a Golem agent

- id: "remove-agent"
  delete_agent:
    name: 'MyAgent("instance-1")'

trigger - Fire-and-forget agent function call

- id: "trigger-bg"
  trigger:
    agent: 'MyAgent("test")'
    method: "backgroundTask"

Like invoke and invoke_json, trigger.method can be language-conditional when Rust, TypeScript, and Scala use different method casing.
check_file - Assert on file contents

Reads a file relative to the golem project directory and runs assertions against its contents. The file content is treated as stdout for assertion purposes.

- id: "check-output"
  check_file:
    path: "output.txt"
  expect:
    stdout_contains: "expected text"
    stdout_not_contains: "unwanted text"
    stdout_matches: "regex.*pattern"
mcp_call - Call an MCP server method

Initializes an MCP session via the Streamable HTTP transport, then sends a JSON-RPC method call. Session management (initialize + session ID forwarding) is handled automatically.

- id: "list-tools"
  mcp_call:
    url: "http://my-app.localhost:9007/mcp"
    method: "tools/list"
  expect:
    status: 200
    body_contains: "my-tool-name"

With parameters (e.g., calling a tool):

- id: "call-tool"
  mcp_call:
    url: "http://my-app.localhost:9007/mcp"
    method: "tools/call"
    params:
      name: "CounterAgent-increment"
      arguments:
        name: "my-counter"
  expect:
    status: 200
    body_contains: "1"

sleep - Wait for a duration

- id: "wait"
  sleep: 5                       # seconds
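The rule that every step has exactly one action field can be sketched as a small validator over the step types listed above. The function name actionOf is hypothetical, and the real harness may validate scenarios differently (for example with a schema library).

```typescript
// Action fields taken from the step types documented above.
const ACTION_FIELDS = [
  "prompt",
  "create_project",
  "shell",
  "http",
  "invoke",
  "invoke_json",
  "create_agent",
  "delete_agent",
  "trigger",
  "check_file",
  "mcp_call",
  "sleep",
] as const;

type ActionField = (typeof ACTION_FIELDS)[number];

// Returns the single action field of a step, or throws if there are zero or several.
function actionOf(step: Record<string, unknown>): ActionField {
  const present = ACTION_FIELDS.filter((field) => field in step);
  if (present.length !== 1) {
    const found = present.length === 0 ? "none" : present.join(", ");
    throw new Error(`step must have exactly one action field, found: ${found}`);
  }
  return present[0];
}

console.log(actionOf({ id: "wait", sleep: 5 })); // sleep
```

Common fields such as id, timeout, expect, retry, only_if, and skip_if are ignored by this check because they are not action fields.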
Assertions (expect)

Available assertion fields:

| Field | Applies To | Description |
|---|---|---|
| exit_code | shell, invoke | Assert process exit code |
| stdout_contains | shell, invoke, check_file, mcp_call | Stdout includes substring |
| stdout_not_contains | shell, invoke, check_file, mcp_call | Stdout must NOT include substring |
| stdout_matches | shell, invoke, check_file, mcp_call | Stdout matches regex |
| status | http, mcp_call | HTTP response status code |
| body_contains | http, mcp_call | Response body includes substring |
| body_matches | http, mcp_call | Response body matches regex |
| result_json | invoke_json | JSONPath assertions on parsed JSON result |
Regex-based assertions use JavaScript RegExp syntax because the harness evaluates them with
Node.js. --dry-run validates that stdout_matches and body_matches compile successfully.
Use JavaScript-compatible patterns such as \\d+, (?:...), and [\\s\\S]* for cross-line
matches. Do not use PCRE-only inline flags such as (?s).
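The difference between a JavaScript-compatible pattern and a PCRE-only one can be seen with a short check like the following. The helper name compilePattern is hypothetical; it only illustrates the kind of compile check that a dry run could perform using the standard RegExp constructor.

```typescript
// Tries to compile a pattern with the JavaScript RegExp engine.
// Returns the compiled RegExp, or the SyntaxError raised by the constructor.
function compilePattern(pattern: string): RegExp | Error {
  try {
    return new RegExp(pattern);
  } catch (error) {
    return error as Error;
  }
}

// Cross-line match written the JavaScript way, using [\s\S]* instead of (?s).
const portable = compilePattern("total: \\d+[\\s\\S]*done");
// PCRE-style inline flag: rejected by the JavaScript engine.
const pcreOnly = compilePattern("(?s)total: \\d+.*done");

console.log(portable instanceof RegExp); // true
console.log(pcreOnly instanceof Error); // true
```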
result_json entries support:
- path: JSONPath expression (e.g., $.name, $.items[0].id)
- equals: Exact match (deep equality)
- contains: Substring match on stringified value

prompt, expectedSkills, allowedExtraSkills, verify, create_project, invoke.method, invoke_json.method, trigger.method, invoke.args, invoke_json.args, and trigger.args can be language-conditional:
- id: "create-project"
  prompt:
    ts: "Create a new Golem application with TypeScript."
    rust: "Create a new Golem application with Rust."
  expectedSkills:
    ts: ["golem-new-project", "golem-db-app-ts"]
    rust: ["golem-new-project", "golem-db-app-rust"]
Another common pattern is language-specific invocation naming:
- id: "list-items"
  invoke_json:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "list_items"
      ts: "listItems"
      scala: "listItems"
      moonbit: "list_items"
When method arguments contain records or other composite types, use per-language args because golem agent invoke parses arguments using language-specific syntax. Rust, TypeScript, and MoonBit use { field: value } with :, while Scala uses TypeName(field = value) with =:
- id: "create-item"
  invoke_json:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "create_item"
      ts: "createItem"
      scala: "createItem"
      moonbit: "create_item"
    args:
      rust: '{ id: "item-1", name: "Hammer" }'
      ts: '{ id: "item-1", name: "Hammer" }'
      scala: 'Item(id = "item-1", name = "Hammer")'
      moonbit: '{ id: "item-1", name: "Hammer" }'
For simple scalar arguments (strings, numbers, booleans), the syntax is the same across all
languages, so a plain args string suffices:
args: '"item-1"'
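One way to picture a language-conditional field is as a value that is either a plain string or a per-language map. The sketch below is illustrative only: the types and the resolveForLanguage name are hypothetical, and returning undefined for a language with no entry is an assumption about behavior the text does not specify.

```typescript
type Language = "rust" | "ts" | "scala" | "moonbit";

// A field is either one value for all languages or a map keyed by language.
type LanguageConditional = string | Partial<Record<Language, string>>;

// Picks the value for the current language.
// Assumption: a missing per-language entry resolves to undefined.
function resolveForLanguage(
  value: LanguageConditional,
  language: Language,
): string | undefined {
  if (typeof value === "string") return value;
  return value[language];
}

const method: LanguageConditional = {
  rust: "create_item",
  ts: "createItem",
  scala: "createItem",
  moonbit: "create_item",
};

console.log(resolveForLanguage(method, "ts")); // createItem
console.log(resolveForLanguage('"item-1"', "scala")); // "item-1"
```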
- Prefer create_project for setup when the scenario is not specifically testing project creation. This keeps skill activation expectations focused on the behavior under test.
- Prefer invoke_json over invoke for behavioral verification. It is more stable for assertions, especially for records, lists, and other structured return values.
- Use language-conditional method fields whenever Rust, TypeScript, Scala, and MoonBit differ in method casing or naming style. MoonBit uses snake_case (same as Rust).
- Use language-appropriate method names in prompts: TypeScript and Scala use camelCase (e.g., createItem, getTag), Rust and MoonBit use snake_case (e.g., create_item, get_tag). If a prompt mentions method names, use per-language prompt syntax even if the rest of the text is identical. Agents (especially Codex) may interpret kebab-case method names literally and generate code with computed property syntax like async ["create-item"](), producing kebab-case WIT exports that don't match the invoke/invoke_json step's expected method names.
- Use a per-language prompt only when the wording genuinely differs between languages (e.g., different method names, file names, or syntax). If the prompt is essentially the same for all languages except for method name casing, still use per-language prompts for correctness. The agent already knows the project language from the AGENTS.md guide and will pick the right REPL language, file extension, etc.
- To capture side effects made over HTTP, set settings.golem_server.custom_request_port so the app has a known HTTP endpoint, then ask the agent to add a second agent type with an HTTP mount that acts as the "other side." For example, a SideEffectRecorder agent with POST /record (appends an event string to an internal list) and GET /events (returns the full event history as JSON). The agent under test then makes HTTP requests to this recorder during its operation. After the invocation, the scenario can use an http step to GET /events and assert on the recorded sequence. This pattern mirrors how the worker executor tests use a TestHttpServer to capture side-effect ordering, but uses a real Golem agent instead, so no external infrastructure is needed. See transactions-1-fallible-rollback-http-ledger.yaml for a concrete example where OrderLedger serves this role, recording reserve/charge/refund/release history via HTTP endpoints and exposing a GET /state endpoint that the harness asserts against.

Steps support {{variable}} substitution. Built-in variables:
| Variable | Value |
|---|---|
| {{workspace}} | Absolute workspace path |
| {{scenario}} | Scenario name |
| {{agent}} | Current agent name |
| {{language}} | Current language |
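The substitution of the built-in variables above can be sketched with a single regular expression replace. The function name substitute is hypothetical, and leaving unknown variables untouched is an assumption; the harness may instead treat them as errors.

```typescript
// Replaces {{name}} placeholders with values from `vars`.
// Assumption: unknown placeholders are left as-is rather than rejected.
function substitute(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (whole: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : whole,
  );
}

const vars = {
  workspace: "/tmp/workspaces/run-1/my-scenario/rust",
  scenario: "my-scenario",
  agent: "claude-code",
  language: "rust",
};

console.log(substitute("{{workspace}}/my-app ({{language}})", vars));
// /tmp/workspaces/run-1/my-scenario/rust/my-app (rust)
```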
The harness detects whether an agent actually read a skill using two mechanisms:
- fswatch (macOS) or inotifywait (Linux) monitors SKILL.md file access events

Both mechanisms feed into expectedSkills / allowedExtraSkills / strictSkillMatch verification. Skill tracking is scoped to the current prompt session: follow-up prompts accumulate activations, while the first prompt in a scenario and any prompt with continueSession: false start a fresh tracking session.
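The verification against expectedSkills, allowedExtraSkills, and strictSkillMatch could look roughly like the sketch below. The names are hypothetical, and the handling of extra skills is one possible reading of the field descriptions, labeled here as an assumption: in strict mode only expectedSkills are permitted, and otherwise expectedSkills plus allowedExtraSkills are permitted. The two result codes are taken from the failure classification used by the harness.

```typescript
type SkillExpectation = {
  expectedSkills: string[];
  allowedExtraSkills?: string[];
  strictSkillMatch?: boolean;
};

type SkillFailure = "SKILL_NOT_ACTIVATED" | "SKILL_MISMATCH";

// Compares the skills activated in the current prompt session with the expectation.
// Assumption: non-strict mode permits expectedSkills plus allowedExtraSkills only.
function verifySkills(
  activated: string[],
  expectation: SkillExpectation,
): SkillFailure | undefined {
  const seen = new Set(activated);
  if (expectation.expectedSkills.some((skill) => !seen.has(skill))) {
    return "SKILL_NOT_ACTIVATED";
  }
  const permitted = new Set(expectation.expectedSkills);
  if (!expectation.strictSkillMatch) {
    for (const skill of expectation.allowedExtraSkills ?? []) permitted.add(skill);
  }
  if (activated.some((skill) => !permitted.has(skill))) {
    return "SKILL_MISMATCH";
  }
  return undefined;
}

console.log(
  verifySkills(["golem-new-project", "golem-db-app-rust"], {
    expectedSkills: ["golem-new-project"],
    allowedExtraSkills: ["golem-db-app-rust"],
  }),
); // undefined: all expected skills were read and the extra one is allowed
```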
| Agent | CLI Command | Skill Directories | Session Support |
|---|---|---|---|
| claude-code | claude --print --permission-mode bypassPermissions | .claude/skills/ | Yes (sessionId) |
| opencode | opencode run | .claude/skills/, .agents/skills/ | No |
| codex | codex exec --dangerously-bypass-approvals-and-sandbox | .agents/skills/ | Yes (session_id) |
The driver copies/symlinks all skills from the --skills directory into the agent's expected skill directories within the workspace.
Failed steps are automatically classified:
| Code | Category | Meaning |
|---|---|---|
| SKILL_NOT_ACTIVATED | agent | Expected skill was not read by the agent |
| SKILL_MISMATCH | agent | Unexpected extra skills were activated |
| BUILD_FAILED | build | golem build failed |
| DEPLOY_FAILED | deploy | golem deploy failed |
| INVOKE_FAILED | deploy | Agent function invocation failed |
| INVOKE_JSON_FAILED | deploy | JSON agent invocation failed |
| SHELL_FAILED | infra | Shell command returned non-zero exit |
| HTTP_FAILED | network | HTTP request failed or timed out |
| MCP_CALL_FAILED | network | MCP call failed (init, session, or method error) |
| CREATE_PROJECT_FAILED | infra | golem new project creation failed |
| CREATE_AGENT_FAILED | infra | golem agent new failed |
| DELETE_AGENT_FAILED | infra | golem agent delete failed |
| FILE_CHECK_FAILED | assertion | Could not read file for check_file step |
| ASSERTION_FAILED | assertion | Output didn't match expect assertions |
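When post-processing results, the table above can be mirrored as a lookup from failure code to category. This is a hypothetical helper for report consumers, not part of the harness; the codes and categories are copied from the table, while the names FAILURE_CATEGORIES and groupByCategory are illustrative.

```typescript
// Failure code → category, copied from the classification table.
const FAILURE_CATEGORIES = {
  SKILL_NOT_ACTIVATED: "agent",
  SKILL_MISMATCH: "agent",
  BUILD_FAILED: "build",
  DEPLOY_FAILED: "deploy",
  INVOKE_FAILED: "deploy",
  INVOKE_JSON_FAILED: "deploy",
  SHELL_FAILED: "infra",
  HTTP_FAILED: "network",
  MCP_CALL_FAILED: "network",
  CREATE_PROJECT_FAILED: "infra",
  CREATE_AGENT_FAILED: "infra",
  DELETE_AGENT_FAILED: "infra",
  FILE_CHECK_FAILED: "assertion",
  ASSERTION_FAILED: "assertion",
} as const;

type FailureCode = keyof typeof FAILURE_CATEGORIES;

// Groups a list of failure codes by category, preserving first-seen order.
function groupByCategory(codes: FailureCode[]): Record<string, FailureCode[]> {
  const groups: Record<string, FailureCode[]> = {};
  for (const code of codes) {
    const category = FAILURE_CATEGORIES[code];
    (groups[category] ??= []).push(code);
  }
  return groups;
}

console.log(groupByCategory(["BUILD_FAILED", "HTTP_FAILED", "MCP_CALL_FAILED"]));
```

Grouping by category makes it easier to separate agent behavior problems (agent) from environment problems (infra, network) when triaging a failed run.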
Results are written to --output (default ./results/):

- <agent>-<language>-<scenario-name>.json with step-by-step results
- A GitHub Actions step summary when GITHUB_STEP_SUMMARY is set

Skills in golem-skills/skills/ (see Skill Directory Structure for layout):

- common/golem-new-project: scaffolding with golem new
- rust/golem-add-rust-crate: adding Rust crate dependencies
- ts/golem-add-npm-package: adding npm package dependencies
- scala/golem-add-scala-dependency: adding Scala library dependencies
- moonbit/golem-add-moonbit-package: adding MoonBit mooncakes dependencies

Scenarios in golem-skills/tests/harness/scenarios/:

- create-a-new-project.yaml: project creation, build, deploy, and invoke
- add-third-party-dependency.yaml: add a third-party dependency, use it in code, and verify