| name | testing-practices |
| description | Use when deciding what or how to test, or reviewing test quality — adding tests for new logic, judging whether a change needs coverage, or critiquing a test that mocks heavily, reads the clock, hits the network, or asserts on internals. Steers toward testing behavior over implementation, one concept per test, deterministic runs, and covering error paths — not just the happy path. Runs scripts/check_test_hygiene.sh to flag flaky/brittle/falsely-green smells; exits nonzero for a pre-commit hook or CI. Not for running the suite, debugging a failing test, test-tooling/fixture setup, speeding up CI, or red-green TDD (use the TDD skill to write a failing test first). |
Testing Practices
Overview
Claude's default test is a happy-path unit test that mocks everything around the
function and asserts the function called its collaborators in a certain order. That
test is green, brittle, and proves almost nothing — it locks in the current
implementation and stays green when the real behavior breaks. This skill pushes
toward tests that earn their keep: they assert observable behavior, they're
deterministic, they cover the failure paths, and each one checks a single concept.
scripts/check_test_hygiene.sh statically flags the smells that make a suite lie
(real clock, network, unseeded randomness, sleeps, over-mocking) and exits nonzero
when it finds any, so it can gate CI.
This is a guide to what to test and how, not a tutorial on your test framework. It assumes
you know how to write an assertion; it tells you which assertions are worth writing.
When to Use
Use this when:
- Writing tests for new code, or adding tests to cover a fix.
- Reviewing a diff's tests, or when a suite feels flaky, always-green, or slow.
- A change adds logic with no test, or adds a test that mocks heavily, sleeps,
reads now(), or pokes at private attributes.
Do NOT use this when:
- You need general lint/format/style enforcement; that's code-style.
- You want a behavioral critique of the production change; that's adversarial-review.
- You're choosing a framework or runner; this skill is framework-agnostic.
Running the hygiene check
scripts/check_test_hygiene.sh
scripts/check_test_hygiene.sh tests/api/
It scans only test files (test_*.py, *_test.py, *.test.*, *.spec.*).
Findings are heuristics: a hit means take a look, not that there is necessarily a
defect. Acknowledge a deliberate exception with an inline # test-hygiene: ok comment
on that line. Tune the over-mocking threshold with TEST_HYGIENE_MOCK_LIMIT.
Wire it as a hook or Action
It exits nonzero when it has findings, so enforcement is one line. In a pre-commit hook:
.claude/skills/testing-practices/scripts/check_test_hygiene.sh || exit 1
GitHub Action step:
- run: .claude/skills/testing-practices/scripts/check_test_hygiene.sh
What to test (and how)
Test behavior, not implementation
Assert on what the unit produces or causes given an input — the return value, the
emitted event, the row written — not on how it got there. Tests that assert "this
private method was called" or reach into private attributes break on every refactor
even when behavior is identical, and pass even when behavior is wrong. If you can't
test a behavior without inspecting internals, that's a design smell in the code, not
a reason to test internals.
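As a rough illustration of the difference, here is a minimal sketch. The Cart class, its _items field, and the test name are hypothetical, invented only for this example. The test asserts on the total the cart reports, while the commented-out line shows the brittle alternative that inspects private storage.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    """Hypothetical unit under test; not taken from any real codebase."""

    # Private detail: SKU -> (price_cents, quantity). Tests should not look here.
    _items: dict = field(default_factory=dict)

    def add(self, sku: str, price_cents: int, qty: int = 1) -> None:
        _, existing_qty = self._items.get(sku, (price_cents, 0))
        self._items[sku] = (price_cents, existing_qty + qty)

    def total_cents(self) -> int:
        return sum(price * qty for price, qty in self._items.values())


def test_total_sums_price_times_quantity():
    # Behavior: given these inputs, this is the observable result.
    cart = Cart()
    cart.add("apple", price_cents=50, qty=3)
    cart.add("pear", price_cents=120)
    assert cart.total_cents() == 270

    # Brittle alternative (avoid): pins the storage layout, so replacing the
    # dict with a list would break the test even though totals are unchanged.
    # assert cart._items == {"apple": (50, 3), "pear": (120, 1)}
```

The behavioral test survives any refactor of how Cart stores its lines, and it still fails if the arithmetic is wrong.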
One concept per test
Each test pins down a single behavior, and its name says which one
(test_refund_rejected_when_already_refunded). Don't pile six unrelated assertions
into one test: when it fails you can't tell which behavior broke, and the first
failed assert hides the rest. Multiple asserts that together verify one concept
(e.g. status code + body of one response) are fine.
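A small sketch of how that split might look, reusing the test name mentioned above. The Order class and AlreadyRefunded error are hypothetical stand-ins, and plain try/except is used so the example does not assume a particular test framework.

```python
class AlreadyRefunded(Exception):
    """Hypothetical domain error, invented for this example."""


class Order:
    """Hypothetical unit under test."""

    def __init__(self, amount_cents: int) -> None:
        self.amount_cents = amount_cents
        self.refunded = False

    def refund(self) -> int:
        if self.refunded:
            raise AlreadyRefunded("order already refunded")
        self.refunded = True
        return self.amount_cents


def test_refund_returns_full_amount_and_marks_order():
    # Two asserts, one concept: what a successful refund does.
    order = Order(amount_cents=1500)
    assert order.refund() == 1500
    assert order.refunded is True


def test_refund_rejected_when_already_refunded():
    # A separate concept gets its own test and its own name.
    order = Order(amount_cents=1500)
    order.refund()
    try:
        order.refund()
    except AlreadyRefunded:
        pass
    else:
        raise AssertionError("second refund should have been rejected")
```

If the second test goes red, its name alone says which behavior broke, and the first test still reports independently.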
Cover the error paths, not just the happy path
The happy path is the path you already know works. The bugs live in the edges:
empty input, the timeout, the duplicate, the permission denied, the malformed
payload, the boundary value. For every happy-path test, ask "what's the failure mode
here?" and test that the code fails correctly: right exception, right message,
no partial write, no swallowed error.
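One way to pin down "fails correctly" is sketched below. The transfer function and InsufficientFunds error are hypothetical, made up for illustration. The tests check the exception type, the message, and that no balance changed when the call was rejected.

```python
class InsufficientFunds(Exception):
    """Hypothetical domain error, invented for this example."""


def transfer(balances: dict, src: str, dst: str, amount: int) -> None:
    """Hypothetical unit: move `amount` from src to dst, all or nothing."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if balances[src] < amount:
        raise InsufficientFunds(f"{src} has {balances[src]}, needs {amount}")
    balances[src] -= amount
    balances[dst] += amount


def test_transfer_rejects_overdraft_without_partial_write():
    balances = {"alice": 100, "bob": 0}
    try:
        transfer(balances, "alice", "bob", 150)
    except InsufficientFunds as exc:
        assert "needs 150" in str(exc)  # right message
    else:
        raise AssertionError("overdraft should raise InsufficientFunds")
    # No partial write: both balances are exactly as they were.
    assert balances == {"alice": 100, "bob": 0}


def test_transfer_rejects_zero_amount():
    # Boundary value: zero is neither a valid transfer nor a silent no-op.
    balances = {"alice": 100, "bob": 0}
    try:
        transfer(balances, "alice", "bob", 0)
    except ValueError:
        pass
    else:
        raise AssertionError("zero amount should raise ValueError")
    assert balances == {"alice": 100, "bob": 0}
```

The final assertion in each test is the part most often forgotten: it is what catches a version that debits the source before discovering the problem.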
Keep tests deterministic
A test that passes or fails depending on the wall clock, the network, or an unseeded
RNG is worse than no test — it trains the team to ignore red. Inject a fixed time
instead of reading now(). Seed or fix randomness. Mock the network boundary. Wait
on a condition, never sleep(). The hygiene script flags all of these.
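The three techniques above (inject the time, seed the RNG, wait on a condition) might look like the following sketch. All function names here are hypothetical. The units take the clock value and RNG as parameters so the test controls them, and the threaded test waits on a threading.Event with a timeout instead of sleeping.

```python
import random
import threading
from datetime import datetime, timedelta, timezone


def is_expired(issued_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """Hypothetical unit: the caller passes `now`; the function never reads the clock."""
    return now - issued_at >= ttl


def pick_winner(entrants: list, rng: random.Random) -> str:
    """Hypothetical unit: randomness comes from an injected, seedable RNG."""
    return rng.choice(entrants)


def test_token_expires_exactly_at_ttl():
    issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    ttl = timedelta(minutes=30)
    assert not is_expired(issued, ttl, now=issued + timedelta(minutes=29, seconds=59))
    assert is_expired(issued, ttl, now=issued + ttl)


def test_pick_winner_is_reproducible_for_a_given_seed():
    entrants = ["ana", "bo", "cy"]
    first = pick_winner(entrants, random.Random(42))
    second = pick_winner(entrants, random.Random(42))
    assert first == second
    assert first in entrants


def test_worker_signals_completion():
    done = threading.Event()
    results = []

    def worker():
        results.append("ok")
        done.set()

    threading.Thread(target=worker).start()
    # Wait on the condition with an upper bound; returns as soon as it is set.
    assert done.wait(timeout=2.0)
    assert results == ["ok"]
```

The timeout in done.wait() is only a ceiling for the failure case. On the passing path the test proceeds the moment the worker signals, so it is both faster and more reliable than a fixed sleep.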
Gotchas
- Testing implementation details. Asserting on call order, private methods, or
internal state couples the test to today's code. Refactor → red test → you "fix"
the test → you've validated nothing. Assert observable behavior only.
- Over-mocking hides real breakage. Mock the function under test's boundaries
(network, DB, clock), not its neighbors. When you mock the collaborator you're
actually integrating with, the test asserts your wiring matches your mock — and
stays green when the real collaborator's contract changes. High mock density (the
script's per-file count) is the tell.
- Many asserts, one test, no diagnosis. When a 10-assert test fails on assert
#2, you learn nothing about asserts #3–10 and can't name the broken behavior.
Split by concept.
- Hidden non-determinism. datetime.now(), time.time(), random() without a
seed, uuid4(), real HTTP, and sleep()-based timing all make tests flaky or
irreproducible. The failure shows up at 2am in CI, not on your machine. Inject the
clock, seed the RNG, mock the boundary, wait on events.
- Only the happy path. A suite that's all happy-path gives false confidence:
coverage looks high, but every interesting failure mode is untested. Error paths
are where the value is.
- Asserting the mock instead of the result. mock.assert_called_with(...) tests
that you called your own mock, which is tautological. Prefer asserting the real output or
the real side effect the behavior produces.
- Snapshot tests as a behavior substitute. A blob snapshot passes review with
one glance and then "updates" silently whenever output drifts. Use targeted
assertions on the parts that matter.
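To make the over-mocking and assert-the-mock gotchas concrete, here is a minimal sketch. The register function and InMemoryUserRepo fake are hypothetical, invented for this example. The first test replaces the DB boundary with a small in-memory fake and asserts on the outcome. The second function shows the mock-only check for contrast.

```python
from unittest.mock import Mock


class InMemoryUserRepo:
    """Hypothetical fake standing in for the DB boundary."""

    def __init__(self) -> None:
        self._rows = {}

    def save(self, email: str, name: str) -> None:
        self._rows[email] = name

    def get(self, email: str):
        return self._rows.get(email)


def register(repo, email: str, name: str) -> None:
    """Hypothetical unit: normalizes the email, then persists the user."""
    repo.save(email.strip().lower(), name)


def test_registered_user_can_be_looked_up_by_normalized_email():
    # Asserts the side effect a caller would observe: the user is retrievable.
    repo = InMemoryUserRepo()
    register(repo, "  Ada@Example.com ", "Ada")
    assert repo.get("ada@example.com") == "Ada"


def contrast_mock_only_check():
    # Weaker: this restates the call made inside register(). A bare Mock
    # accepts any method and any arguments, so it keeps passing even if the
    # real repository's save() signature changes.
    repo = Mock()
    register(repo, "  Ada@Example.com ", "Ada")
    repo.save.assert_called_with("ada@example.com", "Ada")
```

A fake costs a few lines more than a Mock, but it has real behavior, so the test exercises the same read-after-write path that production code relies on.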
Files
SKILL.md — this file; what to test, how, and the smells to avoid.
scripts/check_test_hygiene.sh — static linter for test files: flags real clock,
network, unseeded randomness, sleeps, and over-mocking density. Exits nonzero;
usable as a pre-commit hook or CI step.