| name | test-refine |
| description | Refines an existing Go test suite — removes trivial/duplicate tests, strengthens weak assertions, reshapes tests around invariants, and closes branch-coverage gaps. Uses code-guided coverage and (when available) gremlins mutation-testing survivor data rather than relying on line coverage alone. Use when test quality is uneven, after a test-generation pass, before opening a PR, or as a quality gate on critical paths (consensus, channel state, payment flows). Triggers: "refine these tests", "tests are bloated", "tighten assertions", "remove trivial tests", "audit test quality", "/test-refine". |
Test Refine
A test suite can be voluminous and yet weak: tests that overlap, assert nothing meaningful, run code without checking outputs, or miss the branches that actually matter. Line coverage doesn't catch this — a 100% line-covered suite can have zero real assertions. This skill operates on an existing test suite and refines it.
This is orthogonal to test generation skills (test-forge, property-based-testing):
test-forge generates new tests.
property-based-testing designs PBT for new or existing functions.
mutation-testing validates whether existing tests catch behavior changes.
test-refine (this skill) consumes the above signals and changes the test suite — strengthening assertions, removing dead weight, reshaping for invariants, closing branch-coverage gaps.
When to Use
- After a
test-forge pass — sharpen the generated tests.
- Before opening a PR — clean up the diff so reviewers see strong tests.
- Periodically on critical packages — guard against test-quality drift.
- Following a mutation-testing run with survivors — convert each LIVED mutant into a tightened test.
Two-Phase Workflow
The skill runs read-only triage first, produces a markdown report, then applies fixes only after the user reviews and approves.
Phase A — Triage (read-only)
~/.claude/skills/test-refine/scripts/triage.sh
~/.claude/skills/test-refine/scripts/triage.sh --scope file --target ./internal/wallet/wallet_test.go
~/.claude/skills/test-refine/scripts/triage.sh --scope diff
~/.claude/skills/test-refine/scripts/triage.sh --scope repo
~/.claude/skills/test-refine/scripts/triage.sh --scope package --use-mutations
~/.claude/skills/test-refine/scripts/triage.sh --scope diff --use-mutations
The triage script:
- Resolves scope.
- Runs
go test -cover -covermode=atomic to capture per-function statement+branch coverage.
- Optionally calls
~/.claude/skills/mutation-testing/scripts/unleash.sh for LIVED mutant data.
- Runs the AST analyzers (
detect-smells.go, detect-duplicates.go, domain-checks.go).
- Scores findings via
score.go (composite priority — see below).
- Renders the markdown report.
Phase B — Batch Fix (after user approves report)
The user reviews the report. Each finding has a checkbox. The user checks boxes for the fixes they approve, then:
~/.claude/skills/test-refine/scripts/apply-fixes.sh \
--report .reviews/test-refinement/2026-05-06-wallet.md
The script applies only the checked items:
- Strengthen assertions in place.
- Add missing branch tests.
- Reshape into invariants (rewrite using
rapid for flagged cases).
- Remove tests only if explicitly checked.
After fixes, it re-runs go test ./... -race -count=1, appends an "After" metrics section to the same report, and surfaces the diff.
Safety rule: a test is never removed unless its checkbox in the report is explicitly checked. This honors the global rule "never remove/skip tests without asking" from ~/.claude/CLAUDE.md.
Composite Priority Score
When a triage produces dozens of findings, ranking matters. The composite score combines three signals:
priority = w_risk × risk_score(file_path)
+ w_severity × severity(smell_id)
+ w_gap × branch_gap(function)
| Component | Range | Source |
|---|
risk_score | 0.2–1.0 | File-path heuristic: `consensus |
severity | 0.3 (L) / 0.6 (M) / 1.0 (H) | Smell catalog severity (references/smell-catalog.md) |
branch_gap | 0.0–1.0 | Uncovered branches in the target function / total branches |
Default weights: 0.5 / 0.3 / 0.2. Override via --weights risk=0.6,severity=0.3,gap=0.1.
Smell Catalog (summary)
Full catalog with Go examples in references/smell-catalog.md.
| ID | Smell | Severity |
|---|
| S01 | No assertions at all (recognises testify, helper-named functions, and t.Run subtests) | High |
| S02 | Tautological assertion (x == x) | High |
| S03 | Getter/setter trivial test | Medium |
| S04 | Asserts no panic only | High |
| S05 | Unchecked error from SUT | High |
| S06 | Sensitive equality on rendered text — fmt.Sprint* and non-canonical .String(). Skipped for canonical types (chainhash.Hash, UUID, big.Int, time.Time, OutPoint, etc.) | Medium |
| S07 | Conditional/skipped assertion | Medium |
| S08 | Duplicate test body (semantic) | Medium |
| S09 | Assertion roulette — only fires when ≥4 bare asserts share the same call+RHS shape and the test has ≥8 bare asserts; confidence 0.4 (advisory) | Low |
| S10 | Expect-the-expected (want derived from got) | High |
| S11 | Side-effect not asserted | Medium |
| S12 | Mutation-survivor zone (gremlins data required) | High |
Domain-Specific Checks
For systems / distributed / Bitcoin / networking code, "good test" goes beyond standard assertion strength. The skill also checks four dimensions detailed in references/domain-checks.md:
- Concurrency: SUT uses goroutines/channels/mutexes? Test must run with
-race-friendly patterns and exercise concurrent calls. Sequential test of concurrent code is flagged.
- Failure-mode: SUT returns
error or accepts context.Context? Tests must cover error path, cancellation, timeout. For network/disk SUTs, missing fault-injection tests are flagged. Inspired by Jepsen nemesis patterns.
- Property/invariant: Detects PBT candidates — serialization roundtrips, parser/encoder pairs, normalizers, comparators. Proposes conversion to
rapid PBT (see references/reshape-to-invariants.md).
- Determinism:
time.Now(), rand.* without seeds, os.Getenv, goroutine-ordering assumptions in tests are flagged. Suggests injectable clocks/RNGs (DST-style).
What Counts as Trivial (and Removable)
Findings flagged for removal candidacy (still always require user approval):
- No assertions (S01).
- Getter/setter-only tests (S03) — testing language semantics, not behavior.
- Tautological assertions (S02).
- Zero-mutation-killers (S12 special case): when gremlins data shows a test exists in the package whose removal causes no change in
test_efficacy. The test catches no mutants the rest of the suite doesn't already catch.
Naming Convention
Test function names must not contain underscores. Use TestEncodeTxRoundtrip, not TestEncodeTx_Roundtrip. For variants of the same logical test, use t.Run("subtest name", func(t *testing.T) {...}). The reshape pass enforces this — any rewrite the skill proposes uses subtests, not underscored function names. See references/strengthening-patterns.md.
Aggressive Reshape Posture
The user opted into "aggressive — reshape tests for invariants". Examples:
- A test that asserts
Marshal(x) == fixed_bytes for a single value → reshape into a rapid roundtrip property: Unmarshal(Marshal(x)) == x for arbitrary x.
- A test that asserts a single state transition → reshape into a
rapid.StateMachine covering all transitions.
- Multiple tests covering near-identical inputs → consolidate into a table-driven test with the union of cases.
Every reshape proposal in the report shows the original test verbatim alongside the proposed rewrite. If the user rejects the reshape, no change is made.
Critical Pitfall Guards
- Never auto-delete. The report's checkbox grammar (
- [ ] Remove TestX) is the only way a test is removed. Bulk approval is not a thing.
- Reshape passing tests of mission-critical code with care. Phase B optionally re-runs
gremlins unleash on touched packages to verify reshape didn't weaken mutation score. Pass --verify-mutations.
- PBT conversion for side-effect-bearing code uses state-machine PBT. Pure-function PBT is wrong here. Such cases are flagged separately and link to
property-based-testing/references/strategies.md.
- Equivalent mutants are not test gaps. When gremlins reports
LIVED, the AST is checked for equivalence patterns (mutated value immediately overwritten, mutation in unreachable code, associative no-op) before flagging as S12.
Output Artifacts
The committed markdown report lives at:
.reviews/test-refinement/<YYYY-MM-DD>-<scope-slug>.md
Contents:
- Header banners: warn when the data layer is incomplete (coverage = 0%, gremlins skipped) or when priority is severity-only because branch-gap is unavailable. The user sees this before trusting the ranking.
- Before metrics: test count, assertion count, statement+branch coverage %,
test_efficacy if mutation data available.
- Top-N findings table: rank, priority, confidence, fix-kind, file:line, smell, severity, test → SUT. The full message is not in the table — it's in the per-finding detail block, so nothing gets truncated.
- Per-finding detail block: smell, suggestion, an
Apply fix checkbox, a False positive — won't fix checkbox (so disagreement is recordable across re-runs), and a collapsible <details> block with the test function body so the reviewer can sanity-check without context-switching to the source file.
- Domain-check section: concurrency / failure-mode / PBT / determinism findings. Runs for all scopes (package, file, diff, repo) — the diff-scope variant fans out across the unique set of affected package directories.
- Removal candidates list: each with one-line justification and explicit checkbox.
- PBT conversion candidates: specific functions + property sketch using
rapid.
- (After fixes) "After" metrics appended to the same file.
Integration
- Reuses
property-based-testing skill's references for PBT conversion templates.
- Consumes
mutation-testing skill output (gremlins JSON) when --use-mutations is set.
- Invocable from
/code-review and /pre-pr-review flows as a sub-step.
Further Reading