bat-story-eval
Compare MCP tool behavior between target and baseline versions using pre-built and custom stories with diff-based triage.
| name | bat-story-eval |
| description | Compare MCP tool behavior between target and baseline versions using pre-built and custom stories with diff-based triage. |
| disable-model-invocation | true |
| argument-hint | --baseline v6.6.1 [--agents gemini] [--stories s01,s02] |
| allowed-tools | Bash, Read, Write, Glob, Grep, Task |
You are the evaluator. Follow these steps IN ORDER. Do not skip steps.
From $ARGUMENTS, extract:
- `--baseline`: REQUIRED. Git tag/branch of the released version (e.g., v6.6.1).
- `--agents`: Agent list (default: gemini). Comma-separated.
- `--stories`: Force specific pre-built story IDs (e.g., s01,s02). Overrides triage selection.
- `--all-stories`: Skip triage, run ALL pre-built stories.
- `--keep-container`: Keep HA containers alive after run for manual inspection.
- `--model`: Model for Claude agent (e.g., haiku, sonnet).

If $ARGUMENTS is `--help` or missing `--baseline`, show usage and stop:
/bat-story-eval --baseline v6.6.1
/bat-story-eval --baseline v6.6.1 --agents gemini,claude
/bat-story-eval --baseline v6.6.1 --stories s01,s02
/bat-story-eval --baseline v6.6.1 --all-stories --agents claude --model haiku
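The flags above can be illustrated with a small parser. This is only a sketch of the argument contract, not code from the skill itself: the function names `build_parser` and `parse_eval_args` are hypothetical, and the defaults simply mirror the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    p = argparse.ArgumentParser(prog="/bat-story-eval")
    p.add_argument("--baseline", required=True,
                   help="Git tag/branch of the released version, e.g. v6.6.1")
    p.add_argument("--agents", default="gemini",
                   help="Comma-separated agent list")
    p.add_argument("--stories", default=None,
                   help="Comma-separated pre-built story IDs; overrides triage")
    p.add_argument("--all-stories", action="store_true",
                   help="Skip triage and run all pre-built stories")
    p.add_argument("--keep-container", action="store_true",
                   help="Keep HA containers alive after the run")
    p.add_argument("--model", default=None,
                   help="Model for the Claude agent, e.g. haiku")
    return p


def parse_eval_args(argv: list[str]) -> dict:
    """Parse argv and split the comma-separated values into lists."""
    ns = build_parser().parse_args(argv)
    return {
        "baseline": ns.baseline,
        "agents": ns.agents.split(","),
        "stories": ns.stories.split(",") if ns.stories else None,
        "all_stories": ns.all_stories,
        "keep_container": ns.keep_container,
        "model": ns.model,
    }
```

With this sketch, a missing `--baseline` makes argparse print usage and exit, which matches the "show usage and stop" rule.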
cd /home/julien/github/ha-mcp/worktree/uat-stories
git diff <baseline>..HEAD -- src/ha_mcp/ --stat
git diff <baseline>..HEAD -- src/ha_mcp/ --name-only
Classify changed files:

- `tools/tools_*.py`: specific tool implementations changed
- `client/`, `server.py`, `errors.py`, `tools/util_helpers.py`: affects all tools
- `utils/`, `resources/`: may affect all tools

Skip if `--stories` or `--all-stories` was passed.
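The file classification can be sketched as a small function over the output of `git diff --name-only`. The category labels `tool`, `core`, `shared`, and `other` are hypothetical names chosen for this example; only the path patterns come from the list above.

```python
from fnmatch import fnmatch

# Path patterns from the classification list, relative to src/ha_mcp/.
CORE = ("client/", "server.py", "errors.py", "tools/util_helpers.py")
SHARED = ("utils/", "resources/")


def classify(path: str) -> str:
    """Return a hypothetical category label for one changed file."""
    rel = path.removeprefix("src/ha_mcp/")
    for entry in CORE:
        # Directory entries match by prefix, file entries match exactly.
        if rel == entry or (entry.endswith("/") and rel.startswith(entry)):
            return "core"      # affects all tools
    if any(rel.startswith(prefix) for prefix in SHARED):
        return "shared"        # may affect all tools
    if fnmatch(rel, "tools/tools_*.py"):
        return "tool"          # a specific tool implementation changed
    return "other"


def any_core_change(paths: list[str]) -> bool:
    """True when at least one changed file falls in the core group."""
    return any(classify(p) == "core" for p in paths)
```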
- Read `tests/uat/stories/catalog/s*.yaml` (title, description, prompt, setup)
- Changes to `client/`, `server.py`, or `errors.py` -> all stories selected

Read the diff carefully. Your job is to catch regressions. For each changed code path NOT covered by selected pre-built stories, ask: "could this break something a user would notice?" If yes, design a custom story to test that hypothesis.
Guidelines: Always create at least 1 custom story. Each must test a distinct regression hypothesis — don't create stories that overlap. Stop when you've covered the risky gaps.
Write each as /tmp/custom_c<NN>.yaml using the standard story format:
id: c01
title: "Short description of what is being tested"
category: custom
weight: 5
description: >
  Rationale: [what changed in the diff and why this scenario tests it]
setup:
  - tool: ha_config_set_helper
    args:
      helper_type: "input_boolean"
      name: "Test Entity Name"
prompt: >
  [Natural language request a real user would make that exercises the changed code]
teardown: []
verify:
  questions:
    - "Did the agent achieve the expected outcome?"
    - "Did it use the expected tools?"
expected:
  tools_should_use:
    - ha_search_entities
  description: >
    [What a correct agent should do]
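Before running a custom story, it may help to check that it has the same shape as the template. The sketch below works on an already-parsed story (a plain dict). Treating every top-level key of the template as required is an assumption made for this example, and `check_story` is a hypothetical helper, not part of the repository.

```python
# Top-level keys that appear in the story template above.
# Assumption: all of them are required for a custom story.
REQUIRED_KEYS = (
    "id", "title", "category", "weight", "description",
    "setup", "prompt", "teardown", "verify", "expected",
)


def check_story(story: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in story]
    # Only inspect nested fields when their parent key is present.
    if "verify" in story and not story["verify"].get("questions"):
        problems.append("verify.questions is empty")
    if "expected" in story and not story["expected"].get("tools_should_use"):
        problems.append("expected.tools_should_use is empty")
    return problems
```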
Design principles:
For EACH agent, run all stories against the baseline version. One container per agent, reused across all stories.
cd /home/julien/github/ha-mcp/worktree/uat-stories
uv run python tests/uat/stories/run_story.py \
catalog/<first_story>.yaml \
--agents <agent> --keep-container \
--branch <baseline> \
--results-file local/uat-results.jsonl
CAPTURE from stderr: HA URL (e.g., http://localhost:32771), token, session file path.
After each story, verify via ha_query.py using the story's verify.questions:
uv run python tests/uat/stories/scripts/ha_query.py \
--ha-url http://localhost:PORT --ha-token TOKEN \
--agent <agent> \
"Does an automation with alias 'Sunset Porch Light' exist?"
Record each answer as confirmed / denied / unclear.
Run remaining pre-built stories on the same container:
uv run python tests/uat/stories/run_story.py \
catalog/<next_story>.yaml \
--agents <agent> --ha-url http://localhost:PORT --ha-token TOKEN \
--branch <baseline> \
--results-file local/uat-results.jsonl
Verify each immediately after running.
uv run python tests/uat/stories/run_story.py \
/tmp/custom_c01.yaml \
--agents <agent> --ha-url http://localhost:PORT --ha-token TOKEN \
--branch <baseline> \
--results-file local/uat-results.jsonl
Verify each via ha_query.py using the custom story's verify.questions.
docker stop $(docker ps -q --filter "ancestor=ghcr.io/home-assistant/home-assistant:2026.1.3") 2>/dev/null
Repeat Step 1 for the target (local code). Same stories, same order, fresh container.
The only difference: omit --branch so run_story.py uses local code.
uv run python tests/uat/stories/run_story.py \
catalog/<first_story>.yaml \
--agents <agent> --keep-container \
--results-file local/uat-results.jsonl
Same container reuse for remaining stories (--ha-url). Same verification after each.
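The baseline and target invocations differ only in `--branch`, and the first and later stories differ only in `--keep-container` versus `--ha-url`/`--ha-token`. A hedged sketch of a command builder that captures those rules follows; `run_story_cmd` is a hypothetical helper, while the flags are the ones shown in the commands above.

```python
from typing import Optional


def run_story_cmd(
    story_path: str,
    agent: str,
    branch: Optional[str] = None,
    ha_url: Optional[str] = None,
    ha_token: Optional[str] = None,
    results_file: str = "local/uat-results.jsonl",
) -> list[str]:
    """Build the argv for one story run.

    branch=None means the target run (local code); a tag or branch name
    means the baseline run. Without ha_url/ha_token this is the first story,
    which starts a container and keeps it alive for the stories that follow.
    """
    cmd = ["uv", "run", "python", "tests/uat/stories/run_story.py",
           story_path, "--agents", agent]
    if ha_url and ha_token:
        cmd += ["--ha-url", ha_url, "--ha-token", ha_token]
    else:
        cmd += ["--keep-container"]
    if branch:
        cmd += ["--branch", branch]
    cmd += ["--results-file", results_file]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` so that stderr is available for capturing the HA URL, token, and session file path.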
For each story on each version, read the session file captured during the run.
Gemini sessions (JSON):
python3 -c "
import json, sys
data = json.load(open(sys.argv[1]))
for msg in data.get('messages', []):
    for tc in msg.get('toolCalls', []):
        print(f\"  {tc['name']} ({tc.get('status', '?')})\")
" /path/to/session.json
Claude sessions (JSONL):
python3 -c "
import json, sys
for line in open(sys.argv[1]):
    entry = json.loads(line)
    if entry.get('type') == 'assistant':
        for b in entry.get('message', {}).get('content', []):
            if b.get('type') == 'tool_use':
                print(f\"  {b['name']}\")
" /path/to/session.jsonl
Compare against expected.tools_should_use:
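A minimal sketch of that comparison, assuming the tool names extracted from the session match the names in the story file exactly (if the agent CLI adds a prefix to tool names, strip it first). `compare_tools` is a hypothetical helper.

```python
def compare_tools(used: list[str], expected: list[str]) -> dict:
    """Compare tools seen in a session against expected.tools_should_use."""
    used_set = set(used)
    return {
        # Expected tools the agent never called, in story order.
        "missing": [tool for tool in expected if tool not in used_set],
        # Tools the agent called that the story did not list.
        "extra": sorted(used_set - set(expected)),
        "all_expected_used": all(tool in used_set for tool in expected),
    }
```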
| Black-Box | White-Box | Score |
|---|---|---|
| Entity correct + right structure | Right tools | pass |
| Entity correct + right structure | Wrong tools or recovered errors | pass (with notes) |
| Entity correct + wrong structure | Any | partial |
| Entity not created | Any | fail |
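The scoring table reduces to a short decision function. The sketch below uses three hypothetical booleans for the black-box and white-box findings; how each boolean is determined is still up to the evaluator.

```python
def score_story(entity_created: bool, structure_ok: bool, tools_ok: bool) -> str:
    """Map the black-box / white-box findings to a score, row by row.

    tools_ok is False when the agent used the wrong tools or had to
    recover from errors along the way.
    """
    if not entity_created:
        return "fail"                 # white-box result is irrelevant
    if not structure_ok:
        return "partial"              # white-box result is irrelevant
    if tools_ok:
        return "pass"
    return "pass (with notes)"
```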
Primary metrics (decide pass/fail on these):
Secondary metrics (report but don't decide on these alone):
# Gemini: input includes cached, so subtract
billable = (input - cached) + output + thoughts
# Claude: input_tokens is already non-cached
billable = input + output
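# Runnable sketch of the two formulas above, plus the baseline-to-target delta.
# The function and parameter names are hypothetical; only the arithmetic
# comes from this document.

```python
from typing import Optional


def billable_gemini(input_tokens: int, cached: int, output: int, thoughts: int) -> int:
    """Gemini: input includes cached tokens, so subtract them."""
    return (input_tokens - cached) + output + thoughts


def billable_claude(input_tokens: int, output: int) -> int:
    """Claude: input_tokens is already non-cached."""
    return input_tokens + output


def delta_pct(baseline: int, target: int) -> Optional[int]:
    """Rounded percentage change from baseline to target (None if no baseline)."""
    if baseline == 0:
        return None
    return round((target - baseline) / baseline * 100)
```

For example, `delta_pct(36262, 34100)` gives `-6`, and a result above `30` is the point at which a KV-cache check becomes worthwhile.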
For each story+agent:
Append eval results as NEW lines (never modify existing):
record["eval_score"] = "pass" # or "partial" or "fail"
record["eval_notes"] = "Entity created, triggers verified"
record["eval_trend"] = "stable" # or "new", "improved", "decreased"
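A hedged sketch of the append step. The trend rule is an assumption: scores are ranked fail < partial < pass, and `new` is used when no baseline score exists. `trend` and `append_eval` are hypothetical helpers.

```python
import json
from typing import Optional

# Assumed ordering of scores, worst to best.
RANK = {"fail": 0, "partial": 1, "pass": 2}


def trend(baseline_score: Optional[str], target_score: str) -> str:
    """Derive eval_trend from the two scores."""
    if baseline_score is None:
        return "new"
    base, target = RANK[baseline_score], RANK[target_score]
    if target == base:
        return "stable"
    return "improved" if target > base else "decreased"


def append_eval(path: str, record: dict, score: str, notes: str,
                baseline_score: Optional[str] = None) -> dict:
    """Append one eval record as a new JSONL line; existing lines are untouched."""
    out = dict(record, eval_score=score, eval_notes=notes,
               eval_trend=trend(baseline_score, score))
    with open(path, "a", encoding="utf-8") as fh:  # "a" never rewrites old lines
        fh.write(json.dumps(out) + "\n")
    return out
```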
Diff: <baseline>..HEAD — N files changed in src/ha_mcp/
Selected pre-built: s01, s03, s05 (3 stories — tools_automation.py, tools_search.py changed)
Custom stories: c01, c02 (2 stories — covering error handling, fuzzy search threshold)
Skipped: s02, s04, s06-s12 (tools unchanged)
| Story | Agent | Baseline | Target | Trend | Baseline Tokens | Target Tokens | Delta |
|-------|--------|----------|--------|--------|-----------------|---------------|-------|
| s01 | gemini | pass | pass | stable | 36,262 | 34,100 | -6% |
| s03 | gemini | pass | pass | stable | 42,000 | 41,500 | -1% |
For EACH custom story, output a full section:
#### c01: [Title]
**Rationale**: [What changed in the diff and why this tests it]
**Setup**:
- Created input_boolean "Sophisticated Kitchen Sensor" via FastMCP
**Test prompt**: "[The exact prompt sent to the agent]"
**Verification**:
| Question | Baseline | Target |
|----------|----------|--------|
| Found the entity? | confirmed | confirmed |
| Used ha_search_entities? | confirmed | confirmed |
**Score**: baseline=pass, target=pass, trend=stable
**Tokens**: baseline=28,500, target=27,200 (-5%)
If any trend = decreased:
git diff <baseline>..HEAD

When a story has >30% more billable tokens vs baseline, check for KV-cache misses:
# data: the Gemini session JSON, already loaded with json.load()
for i, msg in enumerate(data["messages"]):
    tok = msg.get("tokens", {})
    cached = tok.get("cached", 0)
    total = tok.get("input", 0)
    print(f"Turn {i+1}: input={total:,} cached={cached:,} non-cached={total-cached:,}")
A turn with cached=0 after a non-cold-start turn = KV-cache miss (provider-side, not a code regression).
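That rule can be turned into a small detector. This sketch assumes that only the first turn counts as a cold start and that messages without input tokens (for example, user turns) should be ignored; `cache_miss_turns` is a hypothetical helper.

```python
def cache_miss_turns(messages: list[dict]) -> list[int]:
    """Return 1-based turn numbers that look like KV-cache misses."""
    misses = []
    for i, msg in enumerate(messages):
        tok = msg.get("tokens", {})
        has_input = tok.get("input", 0) > 0
        # Turn 1 is the cold start, so cached=0 there is expected.
        if i > 0 and has_input and tok.get("cached", 0) == 0:
            misses.append(i + 1)
    return misses
```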
Compare tool description sizes between versions:
uv run python tests/uat/stories/scripts/measure_tools.py \
--output local/tool-sizes-target.json
uv run python tests/uat/stories/scripts/measure_tools.py \
--output local/tool-sizes-baseline.json --branch <baseline>
Flag >5% total size increase (directly impacts token cost per turn).
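The 5% check could be scripted as below. The layout of the JSON written by `measure_tools.py` is not shown in this document, so the sketch assumes a flat mapping of tool name to description size; adjust `load_sizes` if the real files are structured differently. All function names here are hypothetical.

```python
import json


def load_sizes(path: str) -> dict[str, int]:
    """Load a size file. Assumed shape: {"tool_name": size, ...}."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def size_increase_pct(baseline: dict[str, int], target: dict[str, int]) -> float:
    """Percentage change in total description size from baseline to target."""
    base_total = sum(baseline.values())
    target_total = sum(target.values())
    return (target_total - base_total) / base_total * 100


def should_flag(baseline: dict[str, int], target: dict[str, int],
                threshold_pct: float = 5.0) -> bool:
    """True when the total size grew by more than the threshold."""
    return size_increase_pct(baseline, target) > threshold_pct
```

Typical use would be `should_flag(load_sizes("local/tool-sizes-baseline.json"), load_sizes("local/tool-sizes-target.json"))`.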
| File | Purpose |
|---|---|
| tests/uat/stories/run_story.py | Story runner (container, setup, agent CLI) |
| tests/uat/stories/scripts/ha_query.py | Query live HA via agent+MCP for verification |
| tests/uat/stories/catalog/s*.yaml | Pre-built story definitions |
| local/uat-results.jsonl | Historical results (gitignored) |
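- `--baseline` is required: it's both the diff source and the control group
- One container per agent: the first story uses `--keep-container`, the rest use `--ha-url`
- Custom stories are written to `/tmp/` (ephemeral); full details reported in Step 6
- Run from `/home/julien/github/ha-mcp/worktree/uat-stories` for `uv run`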