# do-test
| Field | Value |
|---|---|
| name | do-test |
| description | Use when running the test suite. Parses arguments, dispatches test runners (potentially in parallel), and aggregates results. Triggered by 'run tests', 'test this', or any request about testing. |
| argument-hint | [test-path-or-filter] |
You are the test orchestrator. You parse arguments, dispatch test runners (potentially in parallel), and aggregate results into a summary.
TEST_ARGS: $ARGUMENTS
If TEST_ARGS is empty or literally $ARGUMENTS: The skill argument substitution did not run. Look at the user's original message in the conversation — they invoked this as /do-test <argument>. Extract whatever follows /do-test as the value of TEST_ARGS. Do NOT stop or report an error; just use the argument from the message.
Before running tests, scan for any additional test-related skill docs in the project:
ls .claude/skills/*test*/*.md .claude/skills-global/*test*/*.md 2>/dev/null
Read any discovered files. They may define additional test runners, targets, or configurations beyond what this skill covers (e.g., mobile tests, browser tests, performance benchmarks). Incorporate their instructions alongside the defaults below.
Parse TEST_ARGS to determine what to run:
| Input | Behavior |
|---|---|
| (empty) | Run all test directories + lint checks |
| unit | Run tests/unit/ + lint |
| integration | Run tests/integration/ + lint |
| e2e | Run tests/e2e/ + lint |
| tools | Run tests/tools/ + lint |
| performance | Run tests/performance/ + lint |
| tests/unit/test_bridge_logic.py | Run that specific file + lint |
| --changed | Detect changed files, map to test files, run those + lint |
| --no-lint | Skip ruff/black checks (combinable with any above) |
| unit --no-lint | Run tests/unit/ without lint |
| --changed --no-lint | Changed-file tests without lint |
| --direct | Force direct execution, skip parallel agent dispatch |
| unit --direct | Run tests/unit/ directly (combinable with any target) |
| frontend <url> "<scenario>" | Run a browser-based UI test via frontend-tester subagent |
| happy-paths | Run python tools/happy_path_runner.py tests/happy-paths/scripts/ directly via bash. No subagent dispatch. |
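For illustration only, here is a minimal sketch of how these argument forms could be tokenized before applying the parsing rules below. The variable names (`TARGET`, `RUN_LINT`, `DIRECT`, `CHANGED`) are hypothetical, not part of the skill contract:

```bash
# Hypothetical tokenizer: split TEST_ARGS into one target plus flags.
TARGET="all"; RUN_LINT=1; DIRECT=0; CHANGED=0
for tok in $TEST_ARGS; do
  case "$tok" in
    --no-lint) RUN_LINT=0 ;;
    --direct)  DIRECT=1 ;;
    --changed) CHANGED=1 ;;
    frontend|happy-paths) TARGET="$tok"; break ;;  # remaining args belong to that mode
    *) TARGET="$tok" ;;                            # test type or file/directory path
  esac
done
echo "target=$TARGET lint=$RUN_LINT direct=$DIRECT changed=$CHANGED"
```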
Parsing rules:
- Extract flags first: `--changed`, `--no-lint`
- If the target is `frontend`, route to Frontend Testing (see below). If the target is `happy-paths`, route to Happy Path Testing (see below). Neither runs pytest.
- Otherwise the target is a test type (`unit`, `integration`, `e2e`, `tools`, `performance`) or a file/directory path
- If no target is given and there is no `--changed`, the target is "all"
- Allow the `--direct` flag alongside `--changed` and `--no-lint`

## Changed-File Detection (`--changed`)

When `--changed` is specified:
Determine the diff base:
CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
if [ "$CURRENT_BRANCH" = "main" ]; then
DIFF_BASE="HEAD~1"
else
DIFF_BASE="main"
fi
Get changed files:
git diff --name-only "$DIFF_BASE"...HEAD -- '*.py'
Map changed files to test files using these conventions:
- bridge/*.py -> tests/unit/test_bridge*.py
- tools/*.py -> tests/tools/test_*.py
- agent/*.py -> tests/unit/test_agent*.py
- monitoring/*.py -> tests/unit/test_monitoring*.py
- foo/bar.py -> tests/*/test_bar.py (generic fallback)

Filter to existing files -- only include test files that actually exist on disk.
If no test files are found after mapping, report "No test files found for changed files" and skip test execution (lint still runs unless --no-lint).
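A minimal sketch of this mapping step, assuming the conventions above (the `CHANGED_FILES` variable is illustrative and would hold the output of the git diff command):

```bash
# Map each changed source file to candidate test files, keeping only those that exist.
TEST_FILES=""
for f in $CHANGED_FILES; do
  base=$(basename "$f" .py)
  case "$f" in
    bridge/*)     candidates="tests/unit/test_bridge*.py" ;;
    tools/*)      candidates="tests/tools/test_*.py" ;;
    agent/*)      candidates="tests/unit/test_agent*.py" ;;
    monitoring/*) candidates="tests/unit/test_monitoring*.py" ;;
    *)            candidates="tests/*/test_${base}.py" ;;
  esac
  for t in $candidates; do               # unquoted so the glob expands
    [ -f "$t" ] && TEST_FILES="$TEST_FILES $t"
  done
done
TEST_FILES=$(echo "$TEST_FILES" | tr ' ' '\n' | sort -u | tr '\n' ' ')
echo "${TEST_FILES:-no test files found}"
```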
| Name | Value | Description |
|---|---|---|
| PARALLEL_DISPATCH_THRESHOLD | 50 | Number of test files above which parallel subagent dispatch is used instead of sequential execution. Below this threshold, run tests in-process to avoid subagent overhead. |
Integration test check: If the plan has an Agent Integration section describing cross-component wiring (tool A feeds component B), verify at least one test exercises the full chain -- not just each component in isolation.
When the user requests a specific test type or file, run it directly in the current agent. No need for parallel dispatch -- the overhead is not worth it for a single runner.
# For a type like "unit":
pytest tests/unit/ -v --tb=short
# For a specific file:
pytest tests/unit/test_bridge_logic.py -v --tb=short
# For --changed with resolved files:
pytest tests/unit/test_foo.py tests/tools/test_bar.py -v --tb=short
If lint is enabled, run lint sequentially after tests:
python -m ruff check .
black --check .
When running all tests and the total number of test files exceeds PARALLEL_DISPATCH_THRESHOLD (50), dispatch parallel subagents via the Task tool for each test directory that exists. This maximizes throughput. Below the threshold, run all suites sequentially in-process to avoid subagent overhead.
Step 0: Decide execution mode
Before dispatching parallel agents, determine if direct execution is more efficient:
TEST_FILE_COUNT=$(find tests/ -name "test_*.py" 2>/dev/null | wc -l | tr -d ' ')
Run tests DIRECTLY (no agent dispatch) if ANY of these are true:
- `TEST_FILE_COUNT` is at or below `PARALLEL_DISPATCH_THRESHOLD` (50)
- The `--direct` flag is set

When running directly, execute as a single command:
pytest tests/ -v --tb=short
Then run lint (if enabled) and skip to Result Aggregation.
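Putting the threshold and flag together, a minimal decision sketch (the `DIRECT` variable is illustrative, e.g. set while parsing `--direct`; it is not defined by this skill):

```bash
# Decide execution mode: direct in-process run vs parallel subagent dispatch.
PARALLEL_DISPATCH_THRESHOLD=50
TEST_FILE_COUNT=$(find tests/ -name "test_*.py" 2>/dev/null | wc -l | tr -d ' ')
if [ "${DIRECT:-0}" -eq 1 ] || [ "$TEST_FILE_COUNT" -le "$PARALLEL_DISPATCH_THRESHOLD" ]; then
  MODE="direct"      # single pytest invocation in this agent
else
  MODE="parallel"    # dispatch one test-engineer subagent per suite directory
fi
echo "mode=$MODE ($TEST_FILE_COUNT test files)"
```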
Only dispatch parallel agents if:
- `TEST_FILE_COUNT` exceeds `PARALLEL_DISPATCH_THRESHOLD` (50)
- The user did not pass the `--direct` flag

Step 1: Discover test directories
Check which of these directories exist and contain test files:
- tests/unit/
- tests/integration/
- tests/e2e/
- tests/performance/
- tests/tools/

Also check for top-level test files in `tests/` (files matching `test_*.py` directly in the tests directory).
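A discovery sketch, assuming the directory layout above:

```bash
# Enumerate suite directories that exist and contain at least one test file.
SUITES=""
for d in tests/unit tests/integration tests/e2e tests/performance tests/tools; do
  ls "$d"/test_*.py >/dev/null 2>&1 && SUITES="$SUITES $d"
done
# Top-level test files directly under tests/ get dispatched as their own group.
TOP_LEVEL=$(ls tests/test_*.py 2>/dev/null)
echo "suites:$SUITES"
echo "top-level files: ${TOP_LEVEL:-none}"
```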
Step 2: Dispatch parallel agents
For each existing test directory/group, create a Task:
Task({
description: "Run [suite-name] tests",
subagent_type: "test-engineer",
model: "sonnet",
prompt: "Run the following test command and report results:
cd [CWD]
pytest [test-path] -v --tb=short
Report: number of tests passed, failed, skipped, and any failure details.
Output the raw pytest output.",
run_in_background: true
})
If lint is enabled, dispatch a lint agent in parallel too:
Task({
description: "Run lint checks",
subagent_type: "validator",
model: "sonnet",
prompt: "Run lint checks in [CWD]:
cd [CWD]
python -m ruff check .
black --check .
Report: pass/fail for each tool, and any issues found.",
run_in_background: true
})
Step 3: Wait for agents with timeout fallback
Monitor all background tasks. Set a 2-minute timeout from dispatch.
If all agents complete within 2 minutes: Collect their outputs normally and proceed to Result Aggregation.
If any agent has NOT returned output after 2 minutes:
"Agent timeout: [suite-name] test-engineer did not return within 2 minutes"pytest tests/ -v --tb=short
python -m ruff check .
black --check .
This fallback ensures test results are always collected, even when agent dispatch fails.
## Result Aggregation

After all runners complete, present a summary table:
## Test Results
| Suite | Status | Passed | Failed | Skipped | Duration |
|-------|--------|--------|--------|---------|----------|
| unit | PASS | 42 | 0 | 2 | 3.1s |
| integration | FAIL | 8 | 1 | 0 | 12.4s |
| tools | PASS | 15 | 0 | 0 | 1.8s |
| lint (ruff) | PASS | - | - | - | 0.5s |
| lint (black) | PASS | - | - | - | 0.3s |
### Failures
**integration::test_api_auth.py::test_expired_token**
AssertionError: Expected 401, got 200
File "tests/integration/test_api_auth.py", line 45
Final verdict:
- **ALL TESTS PASSED** if every suite and lint check succeeded
- Otherwise, list the failures and proceed to baseline verification below

## Baseline Verification

When test failures are detected, do NOT claim failures are "pre-existing" without evidence. Instead, dispatch the baseline-verifier subagent to classify each failure by running it against main.
Run baseline verification when ALL of these are true:
- At least one test failed on the current branch
- The current branch is not `main` (baseline comparison only makes sense on feature branches)

Skip baseline verification if:
- The current branch is `main` (no baseline to compare against)

Before dispatching failures to the baseline verifier, retry ONLY the failing tests once more on the current branch to detect intermittent (flaky) failures:
python -m pytest <FAILING_TEST_IDS> -v --tb=short 2>&1
Classify retry results:
- Tests that pass on retry are FLAKY. Report them in the results table but they do NOT count as failures or regressions. Do NOT send them to baseline verification.
- Tests that fail again on retry are real failures and stay in FAILING_TEST_IDS.

Why this matters: Flaky tests that fail on the branch but pass on main get misclassified as regressions. A single retry catches the most common intermittent failures (timing-dependent tests, LLM classifier non-determinism, resource contention) without the overhead of a full baseline worktree.
Add flaky tests to the results table:
### Flaky Tests (passed on retry)
| Test | Verdict | Notes |
|------|---------|-------|
| `tests/unit/test_timing.py::test_stall_detection` | FLAKY | Passed on retry (intermittent) |
| `tests/unit/test_classifier.py::test_bare_ref` | FLAKY | Passed on retry (LLM non-determinism) |
These tests are intermittently failing and should be investigated, but they do not block the pipeline.
After the flaky filter, update FAILING_TEST_IDS to contain only the tests that still failed on retry. Proceed to Step 1 with this reduced list.
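A sketch of this retry-and-filter step, assuming `FAILING_TEST_IDS` holds whitespace-separated pytest node IDs. It retries each test individually to classify it; the single-command retry shown above works too if you parse its output instead:

```bash
# Retry only the failing tests once; tests that now pass are flaky, the rest stay failing.
STILL_FAILING=""
FLAKY=""
for tid in $FAILING_TEST_IDS; do
  if python -m pytest "$tid" -v --tb=short >/dev/null 2>&1; then
    FLAKY="$FLAKY $tid"
  else
    STILL_FAILING="$STILL_FAILING $tid"
  fi
done
FAILING_TEST_IDS="$STILL_FAILING"   # only these go to baseline verification
echo "flaky:${FLAKY:-none}"
```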
Parse the pytest output to extract all failing test node IDs. These look like:
tests/unit/test_foo.py::test_bar FAILED
tests/integration/test_api.py::TestAuth::test_expired_token FAILED
Collect them into a list: FAILING_TEST_IDS
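One way to extract these, assuming the pytest output was saved to a file (the path `/tmp/pytest_output.txt` is illustrative):

```bash
# Collect failing test node IDs from saved pytest output.
FAILING_TEST_IDS=$(grep -E '^[^ ]+::[^ ]+ FAILED' /tmp/pytest_output.txt | awk '{print $1}' | sort -u)
echo "$FAILING_TEST_IDS"
```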
If the Observer passed context from a previous /do-test OUTCOME (available in the prompt context), look for:
- `regression_fix_attempt`: The current attempt number (integer)
- `persistent_regressions`: The list of regression test IDs from the prior run

These are used for the circuit breaker in Step 5.
Dispatch the baseline-verifier subagent to classify failures:
Task({
description: "Baseline verification: classify test failures against main",
subagent_type: "baseline-verifier",
prompt: "
Classify these failing test node IDs by running them against main:
failing_test_ids:
<one test ID per line>
worktree_path: <current CWD for reference>
Follow the instructions in your agent definition exactly.
Return the structured JSON classification.
"
})
Wait for the subagent to complete and parse the returned JSON.
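The verifier's exact JSON schema is defined in its agent definition, not here. As an illustration only, a parse sketch assuming field names that mirror the classification table below (all assumed, not confirmed):

```bash
# Hypothetical parse of the verifier's returned JSON into table rows (requires jq).
echo "$VERIFIER_JSON" | jq -r '.classifications[] | "| \(.test) | \(.branch) | \(.main) | **\(.verdict)** |"'
```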
Replace the generic "Failures" section with a verified classification table:
### Failure Classification (verified against main at <baseline_commit>)
| Test | Branch | Main | Verdict |
|------|--------|------|---------|
| tests/unit/test_foo.py::test_bar | FAILED | PASSED | **REGRESSION** |
| tests/unit/test_old.py::test_legacy | FAILED | FAILED | pre-existing |
| tests/e2e/test_flow.py::test_deleted | FAILED | N/A | inconclusive |
**Summary:**
- Regressions: 1 (blocking)
- Pre-existing: 1 (does not block merge)
- Inconclusive: 1 (manual review recommended)
Track regression fix attempts across pipeline invocations to prevent infinite test-patch-test loops.
Reading the counter:
- If `regression_fix_attempt` was provided in the context from a prior OUTCOME, start from that value; otherwise treat this as the first attempt

Incrementing the counter:
- Compare the current `regressions` list with `persistent_regressions` from the prior run
- If the same regressions persist, increment `regression_fix_attempt` by 1
- If the regressions are new (or this is the first failing run), set `regression_fix_attempt` to 1

Circuit breaker trigger:
- `MAX_REGRESSION_FIX_ATTEMPTS = 3`
- When `regression_fix_attempt >= MAX_REGRESSION_FIX_ATTEMPTS` and regressions still exist:
  - Emit `status: blocked` instead of `status: fail`
  - Set `next_skill: /do-plan`
  - Set `failure_reason: "Regression fix not converging after N attempts. Escalating to planning."`
  - Include `persistent_regressions` in artifacts so the planner has context

After baseline verification completes, the final verdict changes:
- Regressions present: `fail`
- Only flaky failures (no regressions, no other real failures): `partial`
- Circuit breaker triggered: `blocked`
- Inconclusive failures with no confirmed regressions: treat conservatively as `fail` pending manual review

Important: Only regressions block the pipeline. Pre-existing failures are reported but do NOT cause status: fail.
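For concreteness, a sketch of the attempt-counter bookkeeping described above. `PRIOR_ATTEMPT`, `PRIOR_REGRESSIONS`, and `CURRENT_REGRESSIONS` are illustrative names for values taken from the prior OUTCOME and the current run:

```bash
# Illustrative circuit-breaker bookkeeping (newline-separated regression lists assumed).
MAX_REGRESSION_FIX_ATTEMPTS=3
if [ -n "$CURRENT_REGRESSIONS" ]; then
  if [ "$(echo "$CURRENT_REGRESSIONS" | sort)" = "$(echo "$PRIOR_REGRESSIONS" | sort)" ]; then
    ATTEMPT=$(( ${PRIOR_ATTEMPT:-0} + 1 ))   # same regressions persist: count another attempt
  else
    ATTEMPT=1                                # new or different regressions: reset the counter
  fi
  if [ "$ATTEMPT" -ge "$MAX_REGRESSION_FIX_ATTEMPTS" ]; then
    echo "circuit breaker: emit status: blocked, next_skill: /do-plan"
  fi
fi
```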
## Frontend Testing (`frontend` target)

When TEST_ARGS starts with `frontend`, route to the frontend-tester subagent. Do not run pytest.
Input format:
/do-test frontend https://myapp.com "Login form submits and shows dashboard"
/do-test frontend https://myapp.com "Checkout flow completes successfully" -- steps: click add-to-cart, click checkout, fill address, submit
Dispatch a single frontend-tester subagent:
Task({
description: "Frontend test: <scenario>",
subagent_type: "frontend-tester",
prompt: "
URL: <url>
Scenario: <scenario>
Steps:
<extracted steps if provided, otherwise infer from scenario>
Expected: <inferred from scenario>
"
})
The frontend-tester agent owns all browser interaction via BYOB MCP (mcp__byob__browser_*) — the skill never drives the browser directly.
When running all tests (no target) and a tests/frontend/ directory exists with .json or .yaml scenario files, dispatch one frontend-tester subagent per scenario file in parallel alongside the pytest agents.
Scenario file format (for tests/frontend/):
{
"url": "https://myapp.com/login",
"scenario": "Login with valid credentials shows dashboard",
"steps": [
"Fill email field with test@example.com",
"Fill password field with password123",
"Click Login button"
],
"expected": "Dashboard page loads with user name visible"
}
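A sketch of reading these scenario files when building the per-file subagent prompts, assuming `jq` is available and the field names shown above:

```bash
# Print the fields that feed one frontend-tester prompt per scenario file.
for f in tests/frontend/*.json; do
  [ -f "$f" ] || continue
  jq -r '"URL: \(.url)\nScenario: \(.scenario)\nSteps: \(.steps | join("; "))\nExpected: \(.expected)"' "$f"
done
```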
Result aggregation: Include frontend results in the summary table alongside pytest suites:
| Suite | Status | Passed | Failed | Screenshot |
|-----------------|--------|--------|--------|------------|
| frontend/login | PASS | 1 | 0 | /tmp/... |
| frontend/checkout | FAIL | 0 | 1 | /tmp/... |
## Happy Path Testing (`happy-paths` target)

When TEST_ARGS starts with `happy-paths`, run the deterministic test runner directly. No subagent needed.
python tools/happy_path_runner.py tests/happy-paths/scripts/
The runner outputs a markdown summary table to stdout with pass/fail/error counts per script, followed by a JSON summary in an HTML comment block. Include results in the summary table alongside pytest and frontend suites.
If tests/happy-paths/scripts/ contains .sh files, include happy-paths execution
alongside pytest and frontend targets. Run via bash, not subagent.
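A minimal guard sketch for that all-tests case:

```bash
# Include happy-paths only when scripts are present; run directly, no subagent.
if ls tests/happy-paths/scripts/*.sh >/dev/null 2>&1; then
  python tools/happy_path_runner.py tests/happy-paths/scripts/
fi
```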
All commands run relative to the current working directory. Do not attempt to detect or navigate to worktrees. When /do-test is invoked:
- After `/do-build`: CWD is already the worktree -- commands run there

Simply use the CWD as-is. Run `pwd` once at the start to confirm and log it.
Error handling:
- If `pytest` is not installed, report the error clearly
- If the git diff for `--changed` fails, fall back to running all tests

After tests pass, run these additional quality scans and include results in the report:
Scan for except Exception: pass patterns that lack test coverage:
grep -rn "except.*Exception.*:" --include="*.py" agent/ bridge/ | grep -v "logger\|log\.\|warning\|error\|raise\|# .*tested" | head -20
Report any bare exception handlers found. Each should either:
- log the exception or re-raise it, or
- carry a comment explaining why a silent `pass` is acceptable (e.g., cleanup during shutdown)

If the test suite covers agent output processing code, verify that empty/None/whitespace inputs are tested:
grep -rn "def test.*empty\|def test.*none\|def test.*whitespace" tests/ --include="*.py" | wc -l
Flag if the changed files include output processing code but the test suite has zero empty input tests.
If any changed files contain inner functions or closures (functions defined inside other functions), flag whether those closures have dedicated test coverage:
grep -rn "def .*(" --include="*.py" agent/ bridge/ | grep "^.*:.*def .*:$" | head -10
Closures that replicate logic already tested elsewhere (e.g., inline routing logic that should call a shared function) are a test smell. Note them in the report.
After tests pass, scan for xfail-marked tests that are now passing (xpass). When a bug fix lands, the corresponding xfail marker should be removed and converted to a hard assertion. Stale xfails indicate the fix landed but the test wasn't updated.
Two forms of xfail exist and require different detection:
Decorator form (@pytest.mark.xfail): Pytest reports these as XPASS in test output when the test unexpectedly passes. Check the pytest output for XPASS entries.
Runtime form (pytest.xfail("reason") called inside the test body): These are invisible to XPASS detection because the call short-circuits the test before it reaches the assertion. A test with a runtime pytest.xfail() will show as xfail even when the underlying bug is fixed — it never gets a chance to pass. This is the more dangerous form because it silently hides regressions.
# Find ALL xfail markers (both decorator and runtime forms)
grep -rn 'pytest.mark.xfail\|pytest.xfail(' tests/ --include="*.py" | head -20
For decorator xfails: Check if pytest reports XPASS in the test output.
For runtime xfails: These ALWAYS require manual review. Flag every pytest.xfail( call found in test bodies:
- If the xfail is conditional (e.g., `if broken: pytest.xfail(...)`), check whether the condition is still true
- If it is unconditional, check whether the bug it references has since been fixed

For each stale xfail detected (either form):
- Report it, and recommend removing the xfail marker and converting the test back to a hard assertion
Important: Runtime pytest.xfail() is a stronger smell than decorator @pytest.mark.xfail. If --changed mode is active and the changed files include a bug fix, runtime xfails in related test files should be flagged as blockers, not just warnings.
Skip if: No xfail markers found in the test suite.
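A sketch that separates the two forms so each gets the right check (the output paths are illustrative, not required by the skill):

```bash
# Decorator form: cross-check against XPASS entries in the saved pytest output.
grep -rn 'pytest\.mark\.xfail' tests/ --include="*.py" > /tmp/xfail_decorators.txt
grep -c 'XPASS' /tmp/pytest_output.txt 2>/dev/null || echo "no XPASS entries"
# Runtime form: every hit needs manual review regardless of pytest output.
grep -rn 'pytest\.xfail(' tests/ --include="*.py" > /tmp/xfail_runtime.txt
```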
Before emitting the OUTCOME, run a mandatory Exception Swallow Gate on the diff. This gate blocks the TEST stage if new unguarded except Exception blocks are introduced.
When to run: Always — after tests pass, before OUTCOME emission. Scan the diff (not the full codebase) for new except Exception blocks only.
Gate logic:
# Get the diff of new/changed Python lines that add except Exception blocks
DIFF_BASE=$(git rev-parse --abbrev-ref HEAD | grep -q "^main$" && echo "HEAD~1" || echo "main")
DIFF_CONTENT=$(git diff "$DIFF_BASE"...HEAD -- '*.py' | grep '^+' | grep -v '^+++')
# Find line numbers (within the diff output) of new except Exception clauses
EXCEPT_LINE_NUMS=$(echo "$DIFF_CONTENT" | grep -n 'except.*Exception' | cut -d: -f1)
if [ -z "$EXCEPT_LINE_NUMS" ]; then
echo "EXCEPTION_SWALLOW_GATE: PASS (no new except Exception blocks)"
else
# For each new except Exception line, check the clause line AND the next 3 handler-body lines
FAILURES=""
TOTAL_LINES=$(echo "$DIFF_CONTENT" | wc -l)
while IFS= read -r lineno; do
# Extract the except clause line itself
clause_line=$(echo "$DIFF_CONTENT" | sed -n "${lineno}p")
# Extract up to 3 handler-body lines following the except clause
end_line=$((lineno + 3))
[ $end_line -gt $TOTAL_LINES ] && end_line=$TOTAL_LINES
body_lines=$(echo "$DIFF_CONTENT" | sed -n "$((lineno+1)),${end_line}p")
# Pass if the except clause has a valid swallow-ok comment (reason must be 10+ non-whitespace chars)
if echo "$clause_line" | grep -qE "# swallow-ok: .{10,}"; then
continue
fi
# Pass if the handler body contains logger, log., warning, error, or raise
if echo "$body_lines" | grep -qE "logger|log\.|warning|error|raise"; then
continue
fi
FAILURES="$FAILURES
$clause_line"
done <<< "$EXCEPT_LINE_NUMS"
if [ -z "$FAILURES" ]; then
echo "EXCEPTION_SWALLOW_GATE: PASS"
else
echo "EXCEPTION_SWALLOW_GATE: FAIL — new unguarded except Exception block(s):"
echo -e "$FAILURES"
echo ""
echo "Each new except Exception block must either:"
echo " 1. Contain logger, log., warning, error, or raise in the handler body (next 3 lines)"
echo " 2. Have an inline comment on the except line: # swallow-ok: {reason with 10+ chars}"
echo " Example: # swallow-ok: safe during shutdown, task already cancelled"
echo " Invalid: # swallow-ok: x (reason too short)"
echo " Invalid: # swallow-ok: (empty reason)"
echo "GATE_FAILED"
fi
fi
If gate fails: Emit <!-- OUTCOME {"status":"fail","stage":"TEST","artifacts":{"swallow_gate":"failed","new_swallows":[...]}} --> and stop. Do NOT emit a success OUTCOME.
Carve-out convention: To exempt a legitimate exception swallow, add an inline comment on the same line as the except clause:
except Exception: # swallow-ok: safe during shutdown, task already cancelled
pass
The reason must be at least 10 non-whitespace characters. Bare # swallow-ok: or whitespace-only reasons do NOT pass.
As the very last line of your final response, emit an OUTCOME contract so the pipeline can classify the test result programmatically:
- Success: `<!-- OUTCOME {"status":"success","stage":"TEST","artifacts":{"passed":<N>,"failed":0}} -->`
- Failure: `<!-- OUTCOME {"status":"fail","stage":"TEST","artifacts":{"passed":<N>,"failed":<N>}} -->`
- Partial (flaky only): `<!-- OUTCOME {"status":"partial","stage":"TEST","artifacts":{"passed":<N>,"failed":0,"flaky":<N>}} -->`

This structured output is parsed by classify_outcome() in agent/pipeline_state.py (Tier 0) before any text pattern matching.
- Use /tmp for any scratch work
- The -v --tb=short flags provide verbose test names with concise tracebacks