
Name: Bat Story Eval
Author: homeassistant-ai

Description: Compare MCP tool behavior between target and baseline versions using pre-built and custom stories with diff-based triage.


Stars: 3,173

Forks: 124

Updated: February 17, 2026 at 07:44




name	bat-story-eval
description	Compare MCP tool behavior between target and baseline versions using pre-built and custom stories with diff-based triage.
disable-model-invocation	true
argument-hint	--baseline v6.6.1 [--agents gemini] [--stories s01,s02]
allowed-tools	Bash, Read, Write, Glob, Grep, Task

BAT Story Evaluation

You are the evaluator. Follow these steps IN ORDER. Do not skip steps.

Parse Arguments

From $ARGUMENTS, extract:

--baseline: REQUIRED. Git tag/branch of the released version (e.g., v6.6.1).
--agents: Agent list (default: gemini). Comma-separated.
--stories: Force specific pre-built story IDs (e.g., s01,s02). Overrides triage selection.
--all-stories: Skip triage, run ALL pre-built stories.
--keep-container: Keep HA containers alive after run for manual inspection.
--model: Model for Claude agent (e.g., haiku, sonnet).

If $ARGUMENTS is --help or missing --baseline, show usage and stop:

/bat-story-eval --baseline v6.6.1
/bat-story-eval --baseline v6.6.1 --agents gemini,claude
/bat-story-eval --baseline v6.6.1 --stories s01,s02
/bat-story-eval --baseline v6.6.1 --all-stories --agents claude --model haiku
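
The argument rules above can be sketched with Python's argparse. This is only an illustration of the parsing logic, not part of the skill: the function name parse_eval_args is hypothetical, while the flags and the gemini default are the ones listed above.

```python
import argparse


def parse_eval_args(argv):
    """Parse /bat-story-eval arguments (hypothetical helper, for illustration)."""
    parser = argparse.ArgumentParser(prog="/bat-story-eval")
    # --baseline is required; argparse prints usage and exits if it is missing
    parser.add_argument("--baseline", required=True,
                        help="Git tag/branch of the released version, e.g. v6.6.1")
    parser.add_argument("--agents", default="gemini",
                        help="Comma-separated agent list")
    parser.add_argument("--stories", default=None,
                        help="Comma-separated pre-built story IDs; overrides triage")
    parser.add_argument("--all-stories", action="store_true",
                        help="Skip triage and run all pre-built stories")
    parser.add_argument("--keep-container", action="store_true",
                        help="Keep HA containers alive after the run")
    parser.add_argument("--model", default=None,
                        help="Model for the Claude agent, e.g. haiku")
    args = parser.parse_args(argv)
    # Split the comma-separated values into lists
    args.agents = args.agents.split(",")
    args.stories = args.stories.split(",") if args.stories else None
    return args
```

Passing --help or omitting --baseline makes argparse print usage and exit, which matches the "show usage and stop" rule.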

Step 0: Triage (Diff Analysis + Custom Story Design)

0a. Compute Diff

cd /home/julien/github/ha-mcp/worktree/uat-stories
git diff <baseline>..HEAD -- src/ha_mcp/ --stat
git diff <baseline>..HEAD -- src/ha_mcp/ --name-only

Classify changed files:

Tool modules (tools/tools_*.py): specific tool implementations changed
Core code (client/, server.py, errors.py, tools/util_helpers.py): affects all tools
Utilities (utils/, resources/): may affect all tools
No src/ changes: only tests/docs/config — select 2 smoke-test stories
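
The classification above could be sketched as a small path matcher. This is a minimal illustration, assuming the paths come from git diff --name-only and are therefore prefixed with src/ha_mcp/; the function name and the "other" and "non-src" labels are hypothetical.

```python
from fnmatch import fnmatchcase

PREFIX = "src/ha_mcp/"
# Patterns are relative to src/ha_mcp/ and taken from the list above
CORE_PATTERNS = ("client/*", "server.py", "errors.py", "tools/util_helpers.py")
UTILITY_PATTERNS = ("utils/*", "resources/*")


def classify_changed_file(path):
    """Return 'core', 'tool', 'utility', 'other', or 'non-src' for one changed path."""
    if not path.startswith(PREFIX):
        return "non-src"  # tests/docs/config only
    rel = path[len(PREFIX):]
    if any(fnmatchcase(rel, pattern) for pattern in CORE_PATTERNS):
        return "core"  # affects all tools
    if fnmatchcase(rel, "tools/tools_*.py"):
        return "tool"  # a specific tool implementation
    if any(fnmatchcase(rel, pattern) for pattern in UTILITY_PATTERNS):
        return "utility"  # may affect all tools
    return "other"  # hypothetical bucket for files the list does not name
```

If no path classifies as anything but "non-src", the smoke-test rule applies and 2 stories are selected.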

0b. Select Pre-built Stories

Skip if --stories or --all-stories was passed.

1. Read the diff from 0a.
2. Read all story YAMLs in tests/uat/stories/catalog/s*.yaml (title, description, prompt, setup).
3. For each story, reason about whether the diff could affect its outcome:
   - What tools/code paths would this story exercise?
   - Do any of those overlap with what changed?
4. Apply these rules:
   - Story likely exercises changed code -> selected
   - Core code changed (client/, server.py, errors.py) -> all stories selected
   - No src/ changes -> 2 representative stories as smoke test
5. Report which stories were selected and why (one sentence per story).

0c. Design Custom Stories (at least 1)

Read the diff carefully. Your job is to catch regressions. For each changed code path NOT covered by selected pre-built stories, ask: "could this break something a user would notice?" If yes, design a custom story to test that hypothesis.

Guidelines: Always create at least 1 custom story. Each must test a distinct regression hypothesis — don't create stories that overlap. Stop when you've covered the risky gaps.

Write each as /tmp/custom_c<NN>.yaml using the standard story format:

id: c01
title: "Short description of what is being tested"
category: custom
weight: 5
description: >
  Rationale: [what changed in the diff and why this scenario tests it]

setup:
  - tool: ha_config_set_helper
    args:
      helper_type: "input_boolean"
      name: "Test Entity Name"

prompt: >
  [Natural language request a real user would make that exercises the changed code]

teardown: []

verify:
  questions:
    - "Did the agent achieve the expected outcome?"
    - "Did it use the expected tools?"

expected:
  tools_should_use:
    - ha_search_entities
  description: >
    [What a correct agent should do]

Design principles:

Focus on code paths that changed in the diff
Plausible user scenarios, not synthetic edge cases
Setup creates realistic HA state via FastMCP in-memory steps
Prompts are what a real user would type
At least 1. Each tests a distinct regression hypothesis. Stop when gaps are covered.

Step 1: Run Baseline Version

For EACH agent, run all stories against the baseline version. One container per agent, reused across all stories.

1a. Start container with first story

cd /home/julien/github/ha-mcp/worktree/uat-stories
uv run python tests/uat/stories/run_story.py \
  catalog/<first_story>.yaml \
  --agents <agent> --keep-container \
  --branch <baseline> \
  --results-file local/uat-results.jsonl

CAPTURE from stderr: HA URL (e.g., http://localhost:32771), token, session file path.

1b. Verify, then run remaining pre-built stories

After each story, verify via ha_query.py using the story's verify.questions:

uv run python tests/uat/stories/scripts/ha_query.py \
  --ha-url http://localhost:PORT --ha-token TOKEN \
  --agent <agent> \
  "Does an automation with alias 'Sunset Porch Light' exist?"

Record each answer as confirmed / denied / unclear.

Run remaining pre-built stories on the same container:

uv run python tests/uat/stories/run_story.py \
  catalog/<next_story>.yaml \
  --agents <agent> --ha-url http://localhost:PORT --ha-token TOKEN \
  --branch <baseline> \
  --results-file local/uat-results.jsonl

Verify each immediately after running.

1c. Run custom stories on same container

uv run python tests/uat/stories/run_story.py \
  /tmp/custom_c01.yaml \
  --agents <agent> --ha-url http://localhost:PORT --ha-token TOKEN \
  --branch <baseline> \
  --results-file local/uat-results.jsonl

Verify each via ha_query.py using the custom story's verify.questions.

1d. Stop container

docker stop $(docker ps -q --filter "ancestor=ghcr.io/home-assistant/home-assistant:2026.1.3") 2>/dev/null

Step 2: Run Target Version

Repeat Step 1 for the target (local code). Same stories, same order, fresh container.

The only difference: omit --branch so run_story.py uses local code.

uv run python tests/uat/stories/run_story.py \
  catalog/<first_story>.yaml \
  --agents <agent> --keep-container \
  --results-file local/uat-results.jsonl

Reuse the same container for the remaining stories (--ha-url) and verify after each one, as in Step 1.

Step 3: White-Box Analysis

For each story on each version, read the session file captured during the run.

Gemini sessions (JSON):

python3 -c "
import json, sys
data = json.load(open(sys.argv[1]))
for msg in data.get('messages', []):
    for tc in msg.get('toolCalls', []):
        print(f\"  {tc['name']} ({tc.get('status', '?')})\")
" /path/to/session.json

Claude sessions (JSONL):

python3 -c "
import json, sys
for line in open(sys.argv[1]):
    entry = json.loads(line)
    if entry.get('type') == 'assistant':
        for b in entry.get('message', {}).get('content', []):
            if b.get('type') == 'tool_use':
                print(f\"  {b['name']}\")
" /path/to/session.jsonl

Compare against expected.tools_should_use:

All expected tools used? (High weight)
Tool failures with recovery? (Medium weight)
Total tool call count (Low weight, note it)
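
The high-weight check (all expected tools used) can be sketched as a set comparison over the tool names printed by the snippets above. This is a minimal sketch: the function name is hypothetical, and it assumes the names in the session file match expected.tools_should_use exactly, with no agent-specific prefix.

```python
def missing_expected_tools(expected, used):
    """Return the expected tool names that never appear in the session's tool calls."""
    used_set = set(used)
    # Preserve the order of expected.tools_should_use in the result
    return [name for name in expected if name not in used_set]
```

An empty result means every expected tool was called at least once.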

Step 4: Score & Compare

Scoring Matrix

| Black-Box | White-Box | Score |
|-----------|-----------|-------|
| Entity correct + right structure | Right tools | pass |
| Entity correct + right structure | Wrong tools or recovered errors | pass (with notes) |
| Entity correct + wrong structure | Any | partial |
| Entity not created | Any | fail |
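
The matrix can be read as a short decision function. This is a sketch only: the function and parameter names are hypothetical, and treating any entity that is not correct as "fail" is an assumption, since the matrix only lists the "not created" case.

```python
def score_story(entity_correct, structure_ok, clean_tool_use):
    """Map black-box and white-box findings to a score from the matrix.

    clean_tool_use is True when the expected tools were used and no errors
    had to be recovered from.
    """
    if not entity_correct:
        return "fail"  # white-box result does not matter
    if not structure_ok:
        return "partial"  # white-box result does not matter
    return "pass" if clean_tool_use else "pass (with notes)"
```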

Metrics

Primary metrics (decide pass/fail on these):

Black-box score (entity correct, structure correct)
White-box tool selection (expected tools used)
Error recovery (failures handled gracefully)

Secondary metrics (report but don't decide on these alone):

Billable tokens — directional cost signal, flag >30% increase for investigation but don't auto-fail
Cached tokens / cache hit ratio — useful context for cost analysis, but varies based on provider-side KV-cache behavior
Tool call count / turns — varies between runs due to agent exploration
Duration — noisy (network, KV-cache misses, server load), only flag large (>2x) outliers
Tool description size delta (Step 8)

Extracting Billable Tokens

# Gemini: input includes cached, so subtract
billable = (input - cached) + output + thoughts

# Claude: input_tokens is already non-cached
billable = input + output
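
As a runnable version of the two formulas above, a minimal sketch could look like this. The function and parameter names are hypothetical; the arithmetic is exactly the one stated for each agent.

```python
def billable_tokens(agent, input_tokens, output_tokens,
                    cached_tokens=0, thought_tokens=0):
    """Compute billable tokens for one run using the per-agent formulas above."""
    if agent == "gemini":
        # Gemini: input includes cached tokens, so subtract them
        return (input_tokens - cached_tokens) + output_tokens + thought_tokens
    if agent == "claude":
        # Claude: input_tokens is already non-cached
        return input_tokens + output_tokens
    raise ValueError(f"no billable-token formula for agent: {agent}")
```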

Trend (target vs baseline)

For each story+agent:

Both pass -> stable
Target pass, baseline fail -> improved
Target fail, baseline pass -> decreased (REGRESSION)
Custom story, first run -> new
Billable tokens >30% higher -> cost investigation (even if pass — check Step 7 for KV-cache misses before concluding regression)
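
The trend rules can be sketched as two small helpers. Both names are hypothetical. Combinations the list does not cover (for example both fail, or any partial score) return None here so the evaluator decides explicitly; that fallback is an assumption, not a rule from this document.

```python
def classify_trend(baseline, target, first_run=False):
    """Return 'new', 'stable', 'improved', 'decreased', or None if no rule applies."""
    if first_run:
        return "new"  # custom story, first run
    if baseline == "pass" and target == "pass":
        return "stable"
    if baseline == "fail" and target == "pass":
        return "improved"
    if baseline == "pass" and target == "fail":
        return "decreased"  # REGRESSION
    return None  # not covered by the listed rules


def needs_cost_investigation(baseline_tokens, target_tokens, threshold=0.30):
    """True when target billable tokens exceed baseline by more than the threshold."""
    if baseline_tokens <= 0:
        return False  # no meaningful baseline to compare against
    return (target_tokens - baseline_tokens) / baseline_tokens > threshold
```

The cost check is independent of the pass/fail trend: a story can be stable and still be flagged for investigation.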

Step 5: Update JSONL

Append eval results as NEW lines (never modify existing):

record["eval_score"] = "pass"  # or "partial" or "fail"
record["eval_notes"] = "Entity created, triggers verified"
record["eval_trend"] = "stable"  # or "new", "improved", "decreased"
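
An append-only update might be sketched as follows. The helper name is hypothetical, and it assumes record is the dict parsed from the run's existing line in local/uat-results.jsonl; only the three eval_* keys come from this document.

```python
import json


def append_eval_result(results_path, record, score, notes, trend):
    """Append a copy of record with eval fields as a NEW line; never rewrite old lines."""
    new_record = dict(record)  # copy so the caller's dict is left untouched
    new_record["eval_score"] = score  # "pass", "partial" or "fail"
    new_record["eval_notes"] = notes
    new_record["eval_trend"] = trend  # "stable", "new", "improved" or "decreased"
    # Mode "a" only ever adds to the end of the file
    with open(results_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(new_record) + "\n")
    return new_record
```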

Step 6: Report

Triage Summary

Diff: <baseline>..HEAD — N files changed in src/ha_mcp/
Selected pre-built: s01, s03, s05 (3 stories — tools_automation.py, tools_search.py changed)
Custom stories: c01, c02 (2 stories — covering error handling, fuzzy search threshold)
Skipped: s02, s04, s06-s12 (tools unchanged)

Pre-built Story Results

| Story | Agent  | Baseline | Target | Trend  | Baseline Tokens | Target Tokens | Delta |
|-------|--------|----------|--------|--------|-----------------|---------------|-------|
| s01   | gemini | pass     | pass   | stable | 36,262          | 34,100        | -6%   |
| s03   | gemini | pass     | pass   | stable | 42,000          | 41,500        | -1%   |

Custom Story Details

For EACH custom story, output a full section:

#### c01: [Title]

**Rationale**: [What changed in the diff and why this tests it]

**Setup**:
- Created input_boolean "Sophisticated Kitchen Sensor" via FastMCP

**Test prompt**: "[The exact prompt sent to the agent]"

**Verification**:
| Question | Baseline | Target |
|----------|----------|--------|
| Found the entity? | confirmed | confirmed |
| Used ha_search_entities? | confirmed | confirmed |

**Score**: baseline=pass, target=pass, trend=stable
**Tokens**: baseline=28,500, target=27,200 (-5%)

Regressions

If any trend = decreased:

Flag prominently
Suggest re-run to check flakiness
Show relevant section of git diff <baseline>..HEAD

Step 7: Investigate Outliers

When a story has >30% more billable tokens vs baseline, check for KV-cache misses:

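import json, sys
data = json.load(open(sys.argv[1]))  # Gemini session JSON, path passed as first argument
# Print per-turn token usage; for Gemini, "input" already includes cached tokens
for i, msg in enumerate(data["messages"]):
    tok = msg.get("tokens", {})
    cached = tok.get("cached", 0)
    total = tok.get("input", 0)
    print(f"Turn {i+1}: input={total:,} cached={cached:,} non-cached={total-cached:,}")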

A turn with cached=0 after a non-cold-start turn = KV-cache miss (provider-side, not a code regression).
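
The miss rule can be automated with a short helper. This sketch rests on two assumptions that are not stated in this document: a turn only counts as a miss once an earlier turn has reported cached > 0 (the cache was warm), and messages without input tokens are skipped. The function name is hypothetical; the tokens, input and cached keys are the ones used in the snippet above.

```python
def find_kv_cache_misses(messages):
    """Return 1-based turn numbers that report cached=0 after the cache was warm."""
    misses = []
    cache_was_warm = False
    for turn, msg in enumerate(messages, start=1):
        tok = msg.get("tokens") or {}
        total = tok.get("input", 0)
        cached = tok.get("cached", 0)
        if total == 0:
            continue  # no token data for this message (assumed not a model turn)
        if cached > 0:
            cache_was_warm = True
        elif cache_was_warm:
            misses.append(turn)  # warm cache earlier, nothing cached now
    return misses
```

If the flagged turns account for the extra billable tokens, the increase is likely provider-side rather than a code regression.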

Step 8: Tool Description Size

Compare tool description sizes between versions:

uv run python tests/uat/stories/scripts/measure_tools.py \
  --output local/tool-sizes-target.json
uv run python tests/uat/stories/scripts/measure_tools.py \
  --output local/tool-sizes-baseline.json --branch <baseline>

Flag >5% total size increase (directly impacts token cost per turn).

Key Files

| File | Purpose |
|------|---------|
| `tests/uat/stories/run_story.py` | Story runner (container, setup, agent CLI) |
| `tests/uat/stories/scripts/ha_query.py` | Query live HA via agent+MCP for verification |
| `tests/uat/stories/catalog/s*.yaml` | Pre-built story definitions |
| `local/uat-results.jsonl` | Historical results (gitignored) |

Important Notes

--baseline is required: it's both the diff source and the control group
Run pre-built stories BEFORE custom stories (cleanest state)
ALWAYS verify each story via ha_query.py before running the next
Reuse containers: first story starts it (--keep-container), rest use --ha-url
Custom story YAMLs go to /tmp/ (ephemeral); full details reported in Step 6
See "Metrics" section in Step 4 for primary vs secondary metric classification
The working directory MUST be /home/julien/github/ha-mcp/worktree/uat-stories for uv run
