
butler-test-condensation

Guide for discovering, analyzing, and pruning the Butlers test suite. Use when working on test condensation beads (Phase 1 epic bu-rhztl closed; Phase 2 epic bu-hg8rl active), assessing test bloat, identifying pruning targets, or rewriting tests to be contract-driven. Triggers on test reduction, test pruning, test consolidation, or condensation tasks for this project. Also use when a fresh session needs to assess test health, create new condensation beads, or resume in-progress condensation work.


Install command

npx skills add https://github.com/Tzeusy/butlers --skill butler-test-condensation

Copy and paste this command into Claude Code to install the skill

Source

Tzeusy/butlers


UpdatedMay 3, 2026 at 08:50


Butler Test Condensation

Systematic reduction of the Butlers test suite to ~2,000 contract-driven tests. Each surviving test must trace to an architectural invariant, an RFC wire contract, or an OpenSpec capability.

Epic History

Phase 1 epic bu-rhztl (2026-04-04 → 2026-04-06, CLOSED): condensed 13,675 → 2,196 tests across 10 PRs. Three-tier architecture established.
Phase 2 maintenance epic (2026-05-03 onward): the suite has grown to ~3,700 tests, driven by new domains (Chronicles: 519 tests) and drift in tests/api/ (492 tests against a target of 200). See bd list --parent <phase-2-epic-id> for current children.

Before You Start

Rediscover current state — never trust hardcoded counts in this skill:

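# -H forces grep to print the filename even when find passes a single file; without it the count lands in $1 and the sum is empty
CURRENT=$(find tests/ -name '*.py' -exec grep -cH 'def test_' {} + 2>/dev/null | awk -F: '{sum+=$NF} END {print sum+0}')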
echo "Current test count: $CURRENT (skill baseline was 3,704 on 2026-05-03; Phase 1 closed at 2,196)"

Check Phase 2 epic status: run bd list --status all, find the active "Phase 2" epic, and note which beads are open or in progress
Read your bead: bd show <bead-id> for targets and acceptance criteria
Load doctrine: read about/heart-and-soul/ for invariants, relevant RFCs in about/legends-and-lore/
Run scoped discovery on your domain — see references/discovery.md

If your measured counts differ by more than 10% from this skill's numbers, update references/domains.md before starting work.

Resuming Mid-Epic

If beads are already in-progress or completed:

# 1. What's done, what's available?
bd list --parent <epic-id>
bd ready

# 2. Check predecessor's progress on your domain
git log --oneline -- tests/YOUR_DOMAIN/ | head -10

# 3. Measure current state of your domain
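# -H keeps the filename in grep's output, so the sum stays correct for a domain with a single test file
find tests/YOUR_DOMAIN -name '*.py' -exec grep -cH 'def test_' {} + 2>/dev/null | awk -F: '{sum+=$NF} END {print sum+0}'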

Phase 1 (bu-rhztl) is closed. For Phase 2 work, the contract-extraction backfill bead should generally complete before structural condensation in domains that depend on Tier 1 promotion (e.g., chronicler).

Three-Tier Test Architecture

All tests must map to exactly one tier. If a test doesn't fit, it's a pruning target.

Tier 1: Architectural Invariants (~200 tests) — `tests/contracts/`

Heart-and-soul non-negotiables. Tagged @pytest.mark.contract. Each test docstring cites its RFC/principle. 15 invariants:

Schema isolation — butler can't query another butler's schema (RFC 0006)
MCP-only inter-butler — no cross-butler imports or direct DB calls
Daemon determinism — 17-phase startup order, failure propagation (RFC 0001)
Tool surface isolation — ephemeral MCP config scoping (RFC 0002)
Module composition — topo sort, cycle detection, cascade failure (RFC 0002)
Module boundaries — modules MUST NOT modify core infrastructure (RFC 0002)
Credential tier resolution — Tier 0->1->2 precedence, no plaintext leakage (RFC 0006)
Approval gates — sensitive ops intercepted, can't be bypassed
Graceful shutdown — drain, reverse-order on_shutdown (RFC 0001)
Session lifecycle — request_id UUIDv7 propagation, tool call capture (RFC 0001)
Identity resolution — 3-table JOIN, owner bootstrap, unknowns (RFC 0004)
Context bus — signal TTL, write permissions, supersession (RFC 0009)
Routing pipeline — dedup, thread affinity, triage rules, priority (RFC 0003)
Connector-as-transport — connectors normalize to ingest.v1 only; no routing/classification logic
Staffer routing exclusion — staffers excluded from user-message routing candidates (RFC 0003)

Tier 2: Wire Contracts (~500-800 tests)

RFC-defined schemas and state machines:

ingest.v1 envelope schema (RFC 0003)
route inbox state machine: accepted->processing->processed/errored (RFC 0001)
Module ABC contract: register_tools, migrations, on_startup/on_shutdown (RFC 0002)
Migration chain execution (schema outcomes, not SQL strings)
API response contracts (Pydantic schema validation, not field-by-field)
Cross-butler briefing view + 5 guardrails (RFC 0010)
Insight delivery: candidate schema, dedup key format, cooldown, anti-spam (RFC 0011)
Finance transaction model: tiered dedup, CRUD, soft-delete-only (RFC 0012)

Tier 3: Capability Behavior (~800-1200 tests)

OpenSpec-driven. Map each spec's WHEN/THEN Scenarios to test functions. Tests exercise behavior through the MCP tool interface or the public API, not through internal helpers. Assertions are structural (non-None, correct type, non-empty) rather than tied to exact values (exact strings, specific counts, ordering). See references/classification.md for the decision matrix.
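To illustrate the assertion style, here is a minimal sketch. `create_note` is a hypothetical stand-in for a call made through an MCP tool or public API; it is not a real Butlers function, and the returned fields are invented for the example.

```python
from datetime import datetime, timezone


def create_note(text: str) -> dict:
    """Hypothetical stand-in for a call through an MCP tool or public API."""
    return {"id": "note-1", "text": text, "created_at": datetime.now(timezone.utc)}


def test_create_note_returns_persisted_note():
    """WHEN a note is created THEN a note with an id and timestamp is returned."""
    result = create_note("buy milk")

    # Structural assertions: presence, type, non-emptiness.
    assert result is not None
    assert isinstance(result["id"], str) and result["id"]
    assert isinstance(result["created_at"], datetime)

    # Avoided under this guidance, because it pins an exact value:
    #   assert result["id"] == "note-1"
```

The test keeps passing if the id format or timestamp precision changes, which is the point of asserting on structure.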

Condensation Workflow Per Domain

Scope: Run discovery commands from references/discovery.md scoped to your domain
Classify: Apply the decision matrix in references/classification.md to each test
Write replacements: For each test slated for deletion that covers a unique behavior, write a behavioral replacement through the MCP tool/public API interface
Verify: Run uv run pytest tests/YOUR_DOMAIN -q --tb=short — zero failures
Delete: Remove old implementation tests
Gate: Pass quality gates (see below)
Count: Verify test count meets bead acceptance criteria

Quality Gates

Before marking a bead complete:

Green suite: uv run pytest tests/YOUR_DOMAIN -q --tb=short — 0 failures
Count target met: compare against references/beads.md targets
No lost edge cases: for each deleted file, verify its unique behaviors are covered by remaining tests (grep for the error/edge case in surviving tests)
Contract tests pass: uv run pytest tests/contracts/ -q -m contract (Phase 1 is closed, so tests/contracts/ exists and this gate always applies)
Commit documents delta: "Condense X tests: N → M (details of what was removed)"

Updating OpenSpec When Tests Reveal Gaps

During condensation, you may find tests that validate behavior NOT in any spec:

If the behavior is essential (users rely on it) → create an OpenSpec change to document it
If the behavior is an implementation detail → delete the test
If the spec contradicts the test → update the spec to match current behavior

Document your decision in a commit message or bead comment.

Domain-Specific Guidance

references/domains.md — per-domain targets, file inventories, strategies.

Test Classification Decision Matrix

references/classification.md — how to decide keep/delete/rewrite for any test.

Beads Epic

references/beads.md — dependency graph, bead IDs, lifecycle.

