Run any Skill in Manus with one click

$pwd:

do-test

Name: Do Test
Author: tomcounsell

// Use when running the test suite. Parses arguments, dispatches test runners (potentially in parallel), and aggregates results. Triggered by 'run tests', 'test this', or any request about testing.

Run Skill in Manus

$ git log --oneline --stat

stars:12

forks:7

updated:May 6, 2026 at 06:11

File Explorer

2 files

SKILL.md

readonly

package.json

"author": "tomcounsell"

"repository": "tomcounsell/ai"

View GitHub Repository

$ install --globalskills.sh

$ download --local

Run Skill in Manus

[HINT] Download the complete skill directory including SKILL.md and all related files

Run any Skill with one click

name	do-test
description	Use when running the test suite. Parses arguments, dispatches test runners (potentially in parallel), and aggregates results. Triggered by 'run tests', 'test this', or any request about testing.
argument-hint	[test-path-or-filter]

Do Test

You are the test orchestrator. You parse arguments, dispatch test runners (potentially in parallel), and aggregate results into a summary.

Variables

TEST_ARGS: $ARGUMENTS

If TEST_ARGS is empty or literally $ARGUMENTS: The skill argument substitution did not run. Look at the user's original message in the conversation — they invoked this as /do-test <argument>. Extract whatever follows /do-test as the value of TEST_ARGS. Do NOT stop or report an error; just use the argument from the message.

Step 0: Discover Additional Test Skills

Before running tests, scan for any additional test-related skill docs in the project:

ls .claude/skills/*test*/*.md .claude/skills-global/*test*/*.md 2>/dev/null

Read any discovered files. They may define additional test runners, targets, or configurations beyond what this skill covers (e.g., mobile tests, browser tests, performance benchmarks). Incorporate their instructions alongside the defaults below.

Argument Parsing

Parse TEST_ARGS to determine what to run:

Input	Behavior
(empty)	Run all test directories + lint checks
`unit`	Run `tests/unit/` + lint
`integration`	Run `tests/integration/` + lint
`e2e`	Run `tests/e2e/` + lint
`tools`	Run `tests/tools/` + lint
`performance`	Run `tests/performance/` + lint
`tests/unit/test_bridge_logic.py`	Run that specific file + lint
`--changed`	Detect changed files, map to test files, run those + lint
`--no-lint`	Skip ruff/black checks (combinable with any above)
`unit --no-lint`	Run `tests/unit/` without lint
`--changed --no-lint`	Changed-file tests without lint
`--direct`	Force direct execution, skip parallel agent dispatch
`unit --direct`	Run `tests/unit/` directly (combinable with any target)
`frontend <url> "<scenario>"`	Run a browser-based UI test via `frontend-tester` subagent
`happy-paths`	Run `python tools/happy_path_runner.py tests/happy-paths/scripts/` directly via bash. No subagent dispatch.

Parsing rules:

Extract flags: --changed, --no-lint
If target is frontend, route to Frontend Testing (see below). If target is happy-paths, route to Happy Path Testing (see below). Neither runs pytest.
Whatever remains is the target: a test type name (unit, integration, e2e, tools, performance) or a file/directory path
If no target and no --changed, target is "all"
Extract --direct flag alongside --changed and --no-lint

Changed-File Detection (`--changed`)

When --changed is specified:

Determine the diff base:

CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
if [ "$CURRENT_BRANCH" = "main" ]; then
  DIFF_BASE="HEAD~1"
else
  DIFF_BASE="main"
fi

Get changed files:

git diff --name-only "$DIFF_BASE"...HEAD -- '*.py'

Map changed files to test files using these conventions:
- bridge/*.py -> tests/unit/test_bridge*.py
- tools/*.py -> tests/tools/test_*.py
- agent/*.py -> tests/unit/test_agent*.py
- monitoring/*.py -> tests/unit/test_monitoring*.py
- General rule: source file foo/bar.py -> tests/*/test_bar.py
- Test files themselves: include directly if they were changed
Filter to existing files -- only include test files that actually exist on disk.
If no test files are found after mapping, report "No test files found for changed files" and skip test execution (lint still runs unless --no-lint).

Constants

Name	Value	Description
`PARALLEL_DISPATCH_THRESHOLD`	50	Number of test files above which parallel subagent dispatch is used instead of sequential execution. Below this threshold, run tests in-process to avoid subagent overhead.

Integration test check: If the plan has an Agent Integration section describing cross-component wiring (tool A feeds component B), verify at least one test exercises the full chain -- not just each component in isolation.

Execution Strategy

Single Target (specific type or file)

When the user requests a specific test type or file, run it directly in the current agent. No need for parallel dispatch -- the overhead is not worth it for a single runner.

# For a type like "unit":
pytest tests/unit/ -v --tb=short

# For a specific file:
pytest tests/unit/test_bridge_logic.py -v --tb=short

# For --changed with resolved files:
pytest tests/unit/test_foo.py tests/tools/test_bar.py -v --tb=short

If lint is enabled, run lint sequentially after tests:

python -m ruff check .
black --check .

All Tests (no target specified)

When running all tests and the total number of test files exceeds PARALLEL_DISPATCH_THRESHOLD (50), dispatch parallel subagents via the Task tool for each test directory that exists. This maximizes throughput. Below the threshold, run all suites sequentially in-process to avoid subagent overhead.

Step 0: Decide execution mode

Before dispatching parallel agents, determine if direct execution is more efficient:

TEST_FILE_COUNT=$(find tests/ -name "test_*.py" 2>/dev/null | wc -l | tr -d ' ')

Run tests DIRECTLY (no agent dispatch) if ANY of these are true:

--direct flag is set
Test file count is below 50
Previous parallel dispatch in this session failed to produce output

When running directly, execute as a single command:

pytest tests/ -v --tb=short

Then run lint (if enabled) and skip to Result Aggregation.

Only dispatch parallel agents if:

No --direct flag
Test file count is 50 or more
No prior agent failures in this session

Step 1: Discover test directories

Check which of these directories exist and contain test files:

tests/unit/
tests/integration/
tests/e2e/
tests/performance/
tests/tools/

Also check for top-level test files in tests/ (files matching test_*.py directly in the tests directory).

Step 2: Dispatch parallel agents

For each existing test directory/group, create a Task:

Task({
  description: "Run [suite-name] tests",
  subagent_type: "test-engineer",
  model: "sonnet",
  prompt: "Run the following test command and report results:

    cd [CWD]
    pytest [test-path] -v --tb=short

    Report: number of tests passed, failed, skipped, and any failure details.
    Output the raw pytest output.",
  run_in_background: true
})

If lint is enabled, dispatch a lint agent in parallel too:

Task({
  description: "Run lint checks",
  subagent_type: "validator",
  model: "sonnet",
  prompt: "Run lint checks in [CWD]:

    cd [CWD]
    python -m ruff check .
    black --check .

    Report: pass/fail for each tool, and any issues found.",
  run_in_background: true
})

Step 3: Wait for agents with timeout fallback

Monitor all background tasks. Set a 2-minute timeout from dispatch.

If all agents complete within 2 minutes: Collect their outputs normally and proceed to Result Aggregation.

If any agent has NOT returned output after 2 minutes:

Abandon all pending agents (do not wait further)
Log which agents timed out: "Agent timeout: [suite-name] test-engineer did not return within 2 minutes"
Fall back to direct execution:
```
pytest tests/ -v --tb=short
```
Use the direct execution output for Result Aggregation
Run lint directly too if lint agents also timed out:
```
python -m ruff check .
black --check .
```

This fallback ensures test results are always collected, even when agent dispatch fails.

Result Aggregation

After all runners complete, present a summary table:

## Test Results

| Suite | Status | Passed | Failed | Skipped | Duration |
|-------|--------|--------|--------|---------|----------|
| unit | PASS | 42 | 0 | 2 | 3.1s |
| integration | FAIL | 8 | 1 | 0 | 12.4s |
| tools | PASS | 15 | 0 | 0 | 1.8s |
| lint (ruff) | PASS | - | - | - | 0.5s |
| lint (black) | PASS | - | - | - | 0.3s |

### Failures

**integration::test_api_auth.py::test_expired_token**
AssertionError: Expected 401, got 200
  File "tests/integration/test_api_auth.py", line 45

Final verdict:

If ALL suites pass: report ALL TESTS PASSED
If ANY suite fails: proceed to Failure Baseline Verification before reporting final verdict

Failure Baseline Verification

When test failures are detected, do NOT claim failures are "pre-existing" without evidence. Instead, dispatch the baseline-verifier subagent to classify each failure by running it against main.

When to Run

Run baseline verification when ALL of these are true:

One or more tests failed (pytest exit code 1)
The current branch is NOT main (baseline comparison only makes sense on feature branches)
There are fewer than 50 failing tests (above this threshold, baseline verification would be too slow)

Skip baseline verification if:

All tests passed (nothing to classify)
Running on main (no baseline to compare against)
More than 50 tests failed (likely a systemic issue, not individual regressions)

Step 0.5: Flaky Filter (Retry Before Baseline)

Before dispatching failures to the baseline verifier, retry ONLY the failing tests once more on the current branch to detect intermittent (flaky) failures:

python -m pytest <FAILING_TEST_IDS> -v --tb=short 2>&1

Classify retry results:

Tests that PASS on retry → classify as FLAKY. Report them in the results table but they do NOT count as failures or regressions. Do NOT send them to baseline verification.
Tests that still FAIL on retry → these are consistent failures. Send them to baseline verification as normal.
If ALL failures pass on retry → skip baseline verification entirely (all are flaky).

Why this matters: Flaky tests that fail on the branch but pass on main get misclassified as regressions. A single retry catches the most common intermittent failures (timing-dependent tests, LLM classifier non-determinism, resource contention) without the overhead of a full baseline worktree.

Add flaky tests to the results table:

### Flaky Tests (passed on retry)

| Test | Verdict | Notes |
|------|---------|-------|
| `tests/unit/test_timing.py::test_stall_detection` | FLAKY | Passed on retry (intermittent) |
| `tests/unit/test_classifier.py::test_bare_ref` | FLAKY | Passed on retry (LLM non-determinism) |

These tests are intermittently failing and should be investigated, but they do not block the pipeline.

After the flaky filter, update FAILING_TEST_IDS to contain only the tests that still failed on retry. Proceed to Step 1 with this reduced list.

Step 1: Collect Failing Test Node IDs

Parse the pytest output to extract all failing test node IDs. These look like:

tests/unit/test_foo.py::test_bar FAILED
tests/integration/test_api.py::TestAuth::test_expired_token FAILED

Collect them into a list: FAILING_TEST_IDS

Step 2: Check Regression Counter Context

If the Observer passed context from a previous /do-test OUTCOME (available in the prompt context), look for:

regression_fix_attempt: The current attempt number (integer)
persistent_regressions: The list of regression test IDs from the prior run

These are used for the circuit breaker in Step 5.

Step 3: Dispatch Baseline Verifier

Dispatch the baseline-verifier subagent to classify failures:

Task({
  description: "Baseline verification: classify test failures against main",
  subagent_type: "baseline-verifier",
  prompt: "
    Classify these failing test node IDs by running them against main:

    failing_test_ids:
    <one test ID per line>

    worktree_path: <current CWD for reference>

    Follow the instructions in your agent definition exactly.
    Return the structured JSON classification.
  "
})

Wait for the subagent to complete and parse the returned JSON.

Step 4: Integrate Classification into Results

Replace the generic "Failures" section with a verified classification table:

### Failure Classification (verified against main at <baseline_commit>)

| Test | Branch | Main | Verdict |
|------|--------|------|---------|
| tests/unit/test_foo.py::test_bar | FAILED | PASSED | **REGRESSION** |
| tests/unit/test_old.py::test_legacy | FAILED | FAILED | pre-existing |
| tests/e2e/test_flow.py::test_deleted | FAILED | N/A | inconclusive |

**Summary:**
- Regressions: 1 (blocking)
- Pre-existing: 1 (does not block merge)
- Inconclusive: 1 (manual review recommended)

Step 5: Regression Circuit Breaker

Track regression fix attempts across pipeline invocations to prevent infinite test-patch-test loops.

Reading the counter:

Check if regression_fix_attempt was provided in the context from a prior OUTCOME
If not present, this is attempt 0 (first test run)

Incrementing the counter:

Compare the current regressions list with persistent_regressions from the prior run
If the regression sets are identical (same test IDs): increment regression_fix_attempt by 1
If the regression set changed (different test IDs, even partially): reset regression_fix_attempt to 1

Circuit breaker trigger:

MAX_REGRESSION_FIX_ATTEMPTS = 3
If regression_fix_attempt >= MAX_REGRESSION_FIX_ATTEMPTS and regressions still exist:
- Emit status: blocked instead of status: fail
- Set next_skill: /do-plan
- Include failure_reason: "Regression fix not converging after N attempts. Escalating to planning."
- Include persistent_regressions in artifacts so the planner has context

Adjusted Final Verdict (with Baseline Verification)

After baseline verification completes, the final verdict changes:

ALL TESTS PASSED: No failures at all (baseline verification was skipped)
REGRESSIONS FOUND: Branch introduced new test failures. These MUST be fixed. Status: fail
PRE-EXISTING ONLY: All failures also fail on main. Branch did not make things worse. Status: partial
BLOCKED - ESCALATING: Regression fixes not converging after 3 attempts. Status: blocked
MIXED: Some regressions, some pre-existing. Regressions must be fixed. Status: fail

Important: Only regressions block the pipeline. Pre-existing failures are reported but do NOT cause status: fail.

Frontend Testing (`frontend` target)

When TEST_ARGS starts with frontend, route to the frontend-tester subagent. Do not run pytest.

Input format:

/do-test frontend https://myapp.com "Login form submits and shows dashboard"
/do-test frontend https://myapp.com "Checkout flow completes successfully" -- steps: click add-to-cart, click checkout, fill address, submit

Dispatch a single frontend-tester subagent:

Task({
  description: "Frontend test: <scenario>",
  subagent_type: "frontend-tester",
  prompt: "
URL: <url>
Scenario: <scenario>
Steps:
  <extracted steps if provided, otherwise infer from scenario>
Expected: <inferred from scenario>
  "
})

The frontend-tester agent owns all browser interaction via BYOB MCP (mcp__byob__browser_*) — the skill never drives the browser directly.

When running all tests (no target) and a tests/frontend/ directory exists with .json or .yaml scenario files, dispatch one frontend-tester subagent per scenario file in parallel alongside the pytest agents.

Scenario file format (for tests/frontend/):

{
  "url": "https://myapp.com/login",
  "scenario": "Login with valid credentials shows dashboard",
  "steps": [
    "Fill email field with test@example.com",
    "Fill password field with password123",
    "Click Login button"
  ],
  "expected": "Dashboard page loads with user name visible"
}

Result aggregation: Include frontend results in the summary table alongside pytest suites:

| Suite           | Status | Passed | Failed | Screenshot |
|-----------------|--------|--------|--------|------------|
| frontend/login  | PASS   | 1      | 0      | /tmp/...   |
| frontend/checkout | FAIL | 0      | 1      | /tmp/...   |

Happy Path Testing (`happy-paths` target)

When TEST_ARGS starts with happy-paths, run the deterministic test runner directly. No subagent needed.

Execution:

python tools/happy_path_runner.py tests/happy-paths/scripts/

Result format:

The runner outputs a markdown summary table to stdout with pass/fail/error counts per script, followed by a JSON summary in an HTML comment block. Include results in the summary table alongside pytest and frontend suites.

When running all tests:

If tests/happy-paths/scripts/ contains .sh files, include happy-paths execution alongside pytest and frontend targets. Run via bash, not subagent.

CWD-Relative Execution

All commands run relative to the current working directory. Do not attempt to detect or navigate to worktrees. When /do-test is invoked:

From /do-build: CWD is already the worktree -- commands run there
Directly by user: CWD is the main repo -- commands run there

Simply use the CWD as-is. Run pwd once at the start to confirm and log it.

Error Handling

If pytest is not installed, report the error clearly
If a test directory does not exist, skip it silently (do not fail)
If git commands fail for --changed, fall back to running all tests
Parse pytest exit codes: 0 = all passed, 1 = some failed, 2 = error, 5 = no tests collected

Quality Checks (Post-Test)

After tests pass, run these additional quality scans and include results in the report:

Exception Swallow Scan

Scan for except Exception: pass patterns that lack test coverage:

grep -rn "except.*Exception.*:" --include="*.py" agent/ bridge/ | grep -v "logger\|log\.\|warning\|error\|raise\|# .*tested" | head -20

Report any bare exception handlers found. Each should either:

Have a corresponding test asserting observable behavior (logger.warning, state change)
Be documented with a comment explaining why bare pass is acceptable (e.g., cleanup during shutdown)

Empty Input Check

If the test suite covers agent output processing code, verify that empty/None/whitespace inputs are tested:

grep -rn "def test.*empty\|def test.*none\|def test.*whitespace" tests/ --include="*.py" | wc -l

Flag if the changed files include output processing code but the test suite has zero empty input tests.

Closure Coverage Flag

If any changed files contain inner functions or closures (functions defined inside other functions), flag whether those closures have dedicated test coverage:

grep -rn "def .*(" --include="*.py" agent/ bridge/ | grep "^.*:.*def .*:$" | head -10

Closures that replicate logic already tested elsewhere (e.g., inline routing logic that should call a shared function) are a test smell. Note them in the report.

Stale xfail Hygiene Scan

After tests pass, scan for xfail-marked tests that are now passing (xpass). When a bug fix lands, the corresponding xfail marker should be removed and converted to a hard assertion. Stale xfails indicate the fix landed but the test wasn't updated.

Two forms of xfail exist and require different detection:

Decorator form (@pytest.mark.xfail): Pytest reports these as XPASS in test output when the test unexpectedly passes. Check the pytest output for XPASS entries.
Runtime form (pytest.xfail("reason") called inside the test body): These are invisible to XPASS detection because the call short-circuits the test before it reaches the assertion. A test with a runtime pytest.xfail() will show as xfail even when the underlying bug is fixed — it never gets a chance to pass. This is the more dangerous form because it silently hides regressions.

# Find ALL xfail markers (both decorator and runtime forms)
grep -rn 'pytest.mark.xfail\|pytest.xfail(' tests/ --include="*.py" | head -20

For decorator xfails: Check if pytest reports XPASS in the test output.

For runtime xfails: These ALWAYS require manual review. Flag every pytest.xfail( call found in test bodies:

If the call is guarded by a condition (e.g., if broken: pytest.xfail(...)), check whether the condition is still true
If the call is unconditional, flag it as "runtime xfail — cannot detect if bug is fixed, must be reviewed"

For each stale xfail detected (either form):

Flag it prominently in the quality report: "STALE XFAIL: tests/foo/test_bar.py:LINE — [decorator|runtime] form"
Include the file and line number for easy removal
Suggest: "This test should have its xfail marker removed and converted to a hard assertion"

Important: Runtime pytest.xfail() is a stronger smell than decorator @pytest.mark.xfail. If --changed mode is active and the changed files include a bug fix, runtime xfails in related test files should be flagged as blockers, not just warnings.

Skip if: No xfail markers found in the test suite.

Exception Swallow Gate

Before emitting the OUTCOME, run a mandatory Exception Swallow Gate on the diff. This gate blocks the TEST stage if new unguarded except Exception blocks are introduced.

When to run: Always — after tests pass, before OUTCOME emission. Scan the diff (not the full codebase) for new except Exception blocks only.

Gate logic:

# Get the diff of new/changed Python lines that add except Exception blocks
DIFF_BASE=$(git rev-parse --abbrev-ref HEAD | grep -q "^main$" && echo "HEAD~1" || echo "main")
DIFF_CONTENT=$(git diff "$DIFF_BASE"...HEAD -- '*.py' | grep '^+' | grep -v '^+++')

# Find line numbers (within the diff output) of new except Exception clauses
EXCEPT_LINE_NUMS=$(echo "$DIFF_CONTENT" | grep -n 'except.*Exception' | cut -d: -f1)

if [ -z "$EXCEPT_LINE_NUMS" ]; then
    echo "EXCEPTION_SWALLOW_GATE: PASS (no new except Exception blocks)"
else
    # For each new except Exception line, check the clause line AND the next 3 handler-body lines
    FAILURES=""
    TOTAL_LINES=$(echo "$DIFF_CONTENT" | wc -l)
    while IFS= read -r lineno; do
        # Extract the except clause line itself
        clause_line=$(echo "$DIFF_CONTENT" | sed -n "${lineno}p")
        # Extract up to 3 handler-body lines following the except clause
        end_line=$((lineno + 3))
        [ $end_line -gt $TOTAL_LINES ] && end_line=$TOTAL_LINES
        body_lines=$(echo "$DIFF_CONTENT" | sed -n "$((lineno+1)),${end_line}p")
        # Pass if the except clause has a valid swallow-ok comment (reason must be 10+ non-whitespace chars)
        if echo "$clause_line" | grep -qE "# swallow-ok: .{10,}"; then
            continue
        fi
        # Pass if the handler body contains logger, log., warning, error, or raise
        if echo "$body_lines" | grep -qE "logger|log\.|warning|error|raise"; then
            continue
        fi
        FAILURES="$FAILURES
$clause_line"
    done <<< "$EXCEPT_LINE_NUMS"

    if [ -z "$FAILURES" ]; then
        echo "EXCEPTION_SWALLOW_GATE: PASS"
    else
        echo "EXCEPTION_SWALLOW_GATE: FAIL — new unguarded except Exception block(s):"
        echo -e "$FAILURES"
        echo ""
        echo "Each new except Exception block must either:"
        echo "  1. Contain logger, log., warning, error, or raise in the handler body (next 3 lines)"
        echo "  2. Have an inline comment on the except line: # swallow-ok: {reason with 10+ chars}"
        echo "     Example: # swallow-ok: safe during shutdown, task already cancelled"
        echo "     Invalid: # swallow-ok: x   (reason too short)"
        echo "     Invalid: # swallow-ok:      (empty reason)"
        echo "GATE_FAILED"
    fi
fi

If gate fails: Emit  and stop. Do NOT emit a success OUTCOME.

Carve-out convention: To exempt a legitimate exception swallow, add an inline comment on the same line as the except clause:

except Exception:  # swallow-ok: safe during shutdown, task already cancelled
    pass

The reason must be at least 10 non-whitespace characters. Bare # swallow-ok: or whitespace-only reasons do NOT pass.

OUTCOME Contract Emission

As the very last line of your final response, emit an OUTCOME contract so the pipeline can classify the test result programmatically:

Success (all tests passed): 
Fail (test failures found): 
Partial (tests passed but with flaky tests or warnings):

This structured output is parsed by classify_outcome() in agent/pipeline_state.py (Tier 0) before any text pattern matching.

Notes

No temporary files in the repo -- use /tmp for any scratch work
Do not modify any source or test files -- this skill is read-only (it runs tests, it does not fix them)
Keep pytest output visible -- developers need to see the raw output for debugging
The -v --tb=short flags provide verbose test names with concise tracebacks

name	do-test
description	Use when running the test suite. Parses arguments, dispatches test runners (potentially in parallel), and aggregates results. Triggered by 'run tests', 'test this', or any request about testing.
argument-hint	[test-path-or-filter]

Do Test

You are the test orchestrator. You parse arguments, dispatch test runners (potentially in parallel), and aggregate results into a summary.

Variables

TEST_ARGS: $ARGUMENTS

Step 0: Discover Additional Test Skills

Before running tests, scan for any additional test-related skill docs in the project:

ls .claude/skills/*test*/*.md .claude/skills-global/*test*/*.md 2>/dev/null

Argument Parsing

Parse TEST_ARGS to determine what to run:

Input	Behavior
(empty)	Run all test directories + lint checks
`unit`	Run `tests/unit/` + lint
`integration`	Run `tests/integration/` + lint
`e2e`	Run `tests/e2e/` + lint
`tools`	Run `tests/tools/` + lint
`performance`	Run `tests/performance/` + lint
`tests/unit/test_bridge_logic.py`	Run that specific file + lint
`--changed`	Detect changed files, map to test files, run those + lint
`--no-lint`	Skip ruff/black checks (combinable with any above)
`unit --no-lint`	Run `tests/unit/` without lint
`--changed --no-lint`	Changed-file tests without lint
`--direct`	Force direct execution, skip parallel agent dispatch
`unit --direct`	Run `tests/unit/` directly (combinable with any target)
`frontend <url> "<scenario>"`	Run a browser-based UI test via `frontend-tester` subagent
`happy-paths`	Run `python tools/happy_path_runner.py tests/happy-paths/scripts/` directly via bash. No subagent dispatch.

Parsing rules:

Extract flags: --changed, --no-lint
If target is frontend, route to Frontend Testing (see below). If target is happy-paths, route to Happy Path Testing (see below). Neither runs pytest.
Whatever remains is the target: a test type name (unit, integration, e2e, tools, performance) or a file/directory path
If no target and no --changed, target is "all"
Extract --direct flag alongside --changed and --no-lint

Changed-File Detection (`--changed`)

When --changed is specified:

Determine the diff base:

CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
if [ "$CURRENT_BRANCH" = "main" ]; then
  DIFF_BASE="HEAD~1"
else
  DIFF_BASE="main"
fi

Get changed files:

git diff --name-only "$DIFF_BASE"...HEAD -- '*.py'

Map changed files to test files using these conventions:
- bridge/*.py -> tests/unit/test_bridge*.py
- tools/*.py -> tests/tools/test_*.py
- agent/*.py -> tests/unit/test_agent*.py
- monitoring/*.py -> tests/unit/test_monitoring*.py
- General rule: source file foo/bar.py -> tests/*/test_bar.py
- Test files themselves: include directly if they were changed
Filter to existing files -- only include test files that actually exist on disk.
If no test files are found after mapping, report "No test files found for changed files" and skip test execution (lint still runs unless --no-lint).

Constants

Name	Value	Description
`PARALLEL_DISPATCH_THRESHOLD`	50	Number of test files above which parallel subagent dispatch is used instead of sequential execution. Below this threshold, run tests in-process to avoid subagent overhead.

Execution Strategy

Single Target (specific type or file)

When the user requests a specific test type or file, run it directly in the current agent. No need for parallel dispatch -- the overhead is not worth it for a single runner.

# For a type like "unit":
pytest tests/unit/ -v --tb=short

# For a specific file:
pytest tests/unit/test_bridge_logic.py -v --tb=short

# For --changed with resolved files:
pytest tests/unit/test_foo.py tests/tools/test_bar.py -v --tb=short

If lint is enabled, run lint sequentially after tests:

python -m ruff check .
black --check .

All Tests (no target specified)

Step 0: Decide execution mode

Before dispatching parallel agents, determine if direct execution is more efficient:

TEST_FILE_COUNT=$(find tests/ -name "test_*.py" 2>/dev/null | wc -l | tr -d ' ')

Run tests DIRECTLY (no agent dispatch) if ANY of these are true:

--direct flag is set
Test file count is below 50
Previous parallel dispatch in this session failed to produce output

When running directly, execute as a single command:

pytest tests/ -v --tb=short

Then run lint (if enabled) and skip to Result Aggregation.

Only dispatch parallel agents if:

No --direct flag
Test file count is 50 or more
No prior agent failures in this session

Step 1: Discover test directories

Check which of these directories exist and contain test files:

tests/unit/
tests/integration/
tests/e2e/
tests/performance/
tests/tools/

Also check for top-level test files in tests/ (files matching test_*.py directly in the tests directory).

Step 2: Dispatch parallel agents

For each existing test directory/group, create a Task:

Task({
  description: "Run [suite-name] tests",
  subagent_type: "test-engineer",
  model: "sonnet",
  prompt: "Run the following test command and report results:

    cd [CWD]
    pytest [test-path] -v --tb=short

    Report: number of tests passed, failed, skipped, and any failure details.
    Output the raw pytest output.",
  run_in_background: true
})

If lint is enabled, dispatch a lint agent in parallel too:

Task({
  description: "Run lint checks",
  subagent_type: "validator",
  model: "sonnet",
  prompt: "Run lint checks in [CWD]:

    cd [CWD]
    python -m ruff check .
    black --check .

    Report: pass/fail for each tool, and any issues found.",
  run_in_background: true
})

Step 3: Wait for agents with timeout fallback

Monitor all background tasks. Set a 2-minute timeout from dispatch.

If all agents complete within 2 minutes: Collect their outputs normally and proceed to Result Aggregation.

If any agent has NOT returned output after 2 minutes:

Abandon all pending agents (do not wait further)
Log which agents timed out: "Agent timeout: [suite-name] test-engineer did not return within 2 minutes"
Fall back to direct execution:
```
pytest tests/ -v --tb=short
```
Use the direct execution output for Result Aggregation
Run lint directly too if lint agents also timed out:
```
python -m ruff check .
black --check .
```

This fallback ensures test results are always collected, even when agent dispatch fails.

Result Aggregation

After all runners complete, present a summary table:

## Test Results

| Suite | Status | Passed | Failed | Skipped | Duration |
|-------|--------|--------|--------|---------|----------|
| unit | PASS | 42 | 0 | 2 | 3.1s |
| integration | FAIL | 8 | 1 | 0 | 12.4s |
| tools | PASS | 15 | 0 | 0 | 1.8s |
| lint (ruff) | PASS | - | - | - | 0.5s |
| lint (black) | PASS | - | - | - | 0.3s |

### Failures

**integration::test_api_auth.py::test_expired_token**
AssertionError: Expected 401, got 200
  File "tests/integration/test_api_auth.py", line 45

Final verdict:

If ALL suites pass: report ALL TESTS PASSED
If ANY suite fails: proceed to Failure Baseline Verification before reporting final verdict

Failure Baseline Verification

When test failures are detected, do NOT claim failures are "pre-existing" without evidence. Instead, dispatch the baseline-verifier subagent to classify each failure by running it against main.

When to Run

Run baseline verification when ALL of these are true:

One or more tests failed (pytest exit code 1)
The current branch is NOT main (baseline comparison only makes sense on feature branches)
There are fewer than 50 failing tests (above this threshold, baseline verification would be too slow)

Skip baseline verification if:

All tests passed (nothing to classify)
Running on main (no baseline to compare against)
More than 50 tests failed (likely a systemic issue, not individual regressions)

Step 0.5: Flaky Filter (Retry Before Baseline)

Before dispatching failures to the baseline verifier, retry ONLY the failing tests once more on the current branch to detect intermittent (flaky) failures:

python -m pytest <FAILING_TEST_IDS> -v --tb=short 2>&1

Classify retry results:

Tests that PASS on retry → classify as FLAKY. Report them in the results table but they do NOT count as failures or regressions. Do NOT send them to baseline verification.
Tests that still FAIL on retry → these are consistent failures. Send them to baseline verification as normal.
If ALL failures pass on retry → skip baseline verification entirely (all are flaky).

Add flaky tests to the results table:

### Flaky Tests (passed on retry)

| Test | Verdict | Notes |
|------|---------|-------|
| `tests/unit/test_timing.py::test_stall_detection` | FLAKY | Passed on retry (intermittent) |
| `tests/unit/test_classifier.py::test_bare_ref` | FLAKY | Passed on retry (LLM non-determinism) |

These tests are intermittently failing and should be investigated, but they do not block the pipeline.

After the flaky filter, update FAILING_TEST_IDS to contain only the tests that still failed on retry. Proceed to Step 1 with this reduced list.

Step 1: Collect Failing Test Node IDs

Parse the pytest output to extract all failing test node IDs. These look like:

tests/unit/test_foo.py::test_bar FAILED
tests/integration/test_api.py::TestAuth::test_expired_token FAILED

Collect them into a list: FAILING_TEST_IDS

Step 2: Check Regression Counter Context

If the Observer passed context from a previous /do-test OUTCOME (available in the prompt context), look for:

regression_fix_attempt: The current attempt number (integer)
persistent_regressions: The list of regression test IDs from the prior run

These are used for the circuit breaker in Step 5.

Step 3: Dispatch Baseline Verifier

Dispatch the baseline-verifier subagent to classify failures:

Task({
  description: "Baseline verification: classify test failures against main",
  subagent_type: "baseline-verifier",
  prompt: "
    Classify these failing test node IDs by running them against main:

    failing_test_ids:
    <one test ID per line>

    worktree_path: <current CWD for reference>

    Follow the instructions in your agent definition exactly.
    Return the structured JSON classification.
  "
})

Wait for the subagent to complete and parse the returned JSON.

Step 4: Integrate Classification into Results

Replace the generic "Failures" section with a verified classification table:

### Failure Classification (verified against main at <baseline_commit>)

| Test | Branch | Main | Verdict |
|------|--------|------|---------|
| tests/unit/test_foo.py::test_bar | FAILED | PASSED | **REGRESSION** |
| tests/unit/test_old.py::test_legacy | FAILED | FAILED | pre-existing |
| tests/e2e/test_flow.py::test_deleted | FAILED | N/A | inconclusive |

**Summary:**
- Regressions: 1 (blocking)
- Pre-existing: 1 (does not block merge)
- Inconclusive: 1 (manual review recommended)

Step 5: Regression Circuit Breaker

Track regression fix attempts across pipeline invocations to prevent infinite test-patch-test loops.

Reading the counter:

Check if regression_fix_attempt was provided in the context from a prior OUTCOME
If not present, this is attempt 0 (first test run)

Incrementing the counter:

Compare the current regressions list with persistent_regressions from the prior run
If the regression sets are identical (same test IDs): increment regression_fix_attempt by 1
If the regression set changed (different test IDs, even partially): reset regression_fix_attempt to 1

Circuit breaker trigger:

MAX_REGRESSION_FIX_ATTEMPTS = 3
If regression_fix_attempt >= MAX_REGRESSION_FIX_ATTEMPTS and regressions still exist:
- Emit status: blocked instead of status: fail
- Set next_skill: /do-plan
- Include failure_reason: "Regression fix not converging after N attempts. Escalating to planning."
- Include persistent_regressions in artifacts so the planner has context

Adjusted Final Verdict (with Baseline Verification)

After baseline verification completes, the final verdict changes:

ALL TESTS PASSED: No failures at all (baseline verification was skipped)
REGRESSIONS FOUND: Branch introduced new test failures. These MUST be fixed. Status: fail
PRE-EXISTING ONLY: All failures also fail on main. Branch did not make things worse. Status: partial
BLOCKED - ESCALATING: Regression fixes not converging after 3 attempts. Status: blocked
MIXED: Some regressions, some pre-existing. Regressions must be fixed. Status: fail

Important: Only regressions block the pipeline. Pre-existing failures are reported but do NOT cause status: fail.

Frontend Testing (`frontend` target)

When TEST_ARGS starts with frontend, route to the frontend-tester subagent. Do not run pytest.

Input format:

/do-test frontend https://myapp.com "Login form submits and shows dashboard"
/do-test frontend https://myapp.com "Checkout flow completes successfully" -- steps: click add-to-cart, click checkout, fill address, submit

Dispatch a single frontend-tester subagent:

Task({
  description: "Frontend test: <scenario>",
  subagent_type: "frontend-tester",
  prompt: "
URL: <url>
Scenario: <scenario>
Steps:
  <extracted steps if provided, otherwise infer from scenario>
Expected: <inferred from scenario>
  "
})

The frontend-tester agent owns all browser interaction via BYOB MCP (mcp__byob__browser_*) — the skill never drives the browser directly.

Scenario file format (for tests/frontend/):

{
  "url": "https://myapp.com/login",
  "scenario": "Login with valid credentials shows dashboard",
  "steps": [
    "Fill email field with test@example.com",
    "Fill password field with password123",
    "Click Login button"
  ],
  "expected": "Dashboard page loads with user name visible"
}

Result aggregation: Include frontend results in the summary table alongside pytest suites:

| Suite           | Status | Passed | Failed | Screenshot |
|-----------------|--------|--------|--------|------------|
| frontend/login  | PASS   | 1      | 0      | /tmp/...   |
| frontend/checkout | FAIL | 0      | 1      | /tmp/...   |

Happy Path Testing (`happy-paths` target)

When TEST_ARGS starts with happy-paths, run the deterministic test runner directly. No subagent needed.

Execution:

python tools/happy_path_runner.py tests/happy-paths/scripts/

Result format:

When running all tests:

If tests/happy-paths/scripts/ contains .sh files, include happy-paths execution alongside pytest and frontend targets. Run via bash, not subagent.

CWD-Relative Execution

All commands run relative to the current working directory. Do not attempt to detect or navigate to worktrees. When /do-test is invoked:

From /do-build: CWD is already the worktree -- commands run there
Directly by user: CWD is the main repo -- commands run there

Simply use the CWD as-is. Run pwd once at the start to confirm and log it.

Error Handling

If pytest is not installed, report the error clearly
If a test directory does not exist, skip it silently (do not fail)
If git commands fail for --changed, fall back to running all tests
Parse pytest exit codes: 0 = all passed, 1 = some failed, 2 = error, 5 = no tests collected

Quality Checks (Post-Test)

After tests pass, run these additional quality scans and include results in the report:

Exception Swallow Scan

Scan for except Exception: pass patterns that lack test coverage:

grep -rn "except.*Exception.*:" --include="*.py" agent/ bridge/ | grep -v "logger\|log\.\|warning\|error\|raise\|# .*tested" | head -20

Report any bare exception handlers found. Each should either:

Have a corresponding test asserting observable behavior (logger.warning, state change)
Be documented with a comment explaining why bare pass is acceptable (e.g., cleanup during shutdown)

Empty Input Check

If the test suite covers agent output processing code, verify that empty/None/whitespace inputs are tested:

grep -rn "def test.*empty\|def test.*none\|def test.*whitespace" tests/ --include="*.py" | wc -l

Flag if the changed files include output processing code but the test suite has zero empty input tests.

Closure Coverage Flag

If any changed files contain inner functions or closures (functions defined inside other functions), flag whether those closures have dedicated test coverage:

grep -rn "def .*(" --include="*.py" agent/ bridge/ | grep "^.*:.*def .*:$" | head -10

Closures that replicate logic already tested elsewhere (e.g., inline routing logic that should call a shared function) are a test smell. Note them in the report.

Stale xfail Hygiene Scan

Two forms of xfail exist and require different detection:

Decorator form (@pytest.mark.xfail): Pytest reports these as XPASS in test output when the test unexpectedly passes. Check the pytest output for XPASS entries.
Runtime form (pytest.xfail("reason") called inside the test body): These are invisible to XPASS detection because the call short-circuits the test before it reaches the assertion. A test with a runtime pytest.xfail() will show as xfail even when the underlying bug is fixed — it never gets a chance to pass. This is the more dangerous form because it silently hides regressions.

# Find ALL xfail markers (both decorator and runtime forms)
grep -rn 'pytest.mark.xfail\|pytest.xfail(' tests/ --include="*.py" | head -20

For decorator xfails: Check if pytest reports XPASS in the test output.

For runtime xfails: These ALWAYS require manual review. Flag every pytest.xfail( call found in test bodies:

If the call is guarded by a condition (e.g., if broken: pytest.xfail(...)), check whether the condition is still true
If the call is unconditional, flag it as "runtime xfail — cannot detect if bug is fixed, must be reviewed"

For each stale xfail detected (either form):

Flag it prominently in the quality report: "STALE XFAIL: tests/foo/test_bar.py:LINE — [decorator|runtime] form"
Include the file and line number for easy removal
Suggest: "This test should have its xfail marker removed and converted to a hard assertion"

Skip if: No xfail markers found in the test suite.

Exception Swallow Gate

Before emitting the OUTCOME, run a mandatory Exception Swallow Gate on the diff. This gate blocks the TEST stage if new unguarded except Exception blocks are introduced.

When to run: Always — after tests pass, before OUTCOME emission. Scan the diff (not the full codebase) for new except Exception blocks only.

Gate logic:

# Get the diff of new/changed Python lines that add except Exception blocks
DIFF_BASE=$(git rev-parse --abbrev-ref HEAD | grep -q "^main$" && echo "HEAD~1" || echo "main")
DIFF_CONTENT=$(git diff "$DIFF_BASE"...HEAD -- '*.py' | grep '^+' | grep -v '^+++')

# Find line numbers (within the diff output) of new except Exception clauses
EXCEPT_LINE_NUMS=$(echo "$DIFF_CONTENT" | grep -n 'except.*Exception' | cut -d: -f1)

if [ -z "$EXCEPT_LINE_NUMS" ]; then
    echo "EXCEPTION_SWALLOW_GATE: PASS (no new except Exception blocks)"
else
    # For each new except Exception line, check the clause line AND the next 3 handler-body lines
    FAILURES=""
    TOTAL_LINES=$(echo "$DIFF_CONTENT" | wc -l)
    while IFS= read -r lineno; do
        # Extract the except clause line itself
        clause_line=$(echo "$DIFF_CONTENT" | sed -n "${lineno}p")
        # Extract up to 3 handler-body lines following the except clause
        end_line=$((lineno + 3))
        [ $end_line -gt $TOTAL_LINES ] && end_line=$TOTAL_LINES
        body_lines=$(echo "$DIFF_CONTENT" | sed -n "$((lineno+1)),${end_line}p")
        # Pass if the except clause has a valid swallow-ok comment (reason must be 10+ non-whitespace chars)
        if echo "$clause_line" | grep -qE "# swallow-ok: .{10,}"; then
            continue
        fi
        # Pass if the handler body contains logger, log., warning, error, or raise
        if echo "$body_lines" | grep -qE "logger|log\.|warning|error|raise"; then
            continue
        fi
        FAILURES="$FAILURES
$clause_line"
    done <<< "$EXCEPT_LINE_NUMS"

    if [ -z "$FAILURES" ]; then
        echo "EXCEPTION_SWALLOW_GATE: PASS"
    else
        echo "EXCEPTION_SWALLOW_GATE: FAIL — new unguarded except Exception block(s):"
        echo -e "$FAILURES"
        echo ""
        echo "Each new except Exception block must either:"
        echo "  1. Contain logger, log., warning, error, or raise in the handler body (next 3 lines)"
        echo "  2. Have an inline comment on the except line: # swallow-ok: {reason with 10+ chars}"
        echo "     Example: # swallow-ok: safe during shutdown, task already cancelled"
        echo "     Invalid: # swallow-ok: x   (reason too short)"
        echo "     Invalid: # swallow-ok:      (empty reason)"
        echo "GATE_FAILED"
    fi
fi

If gate fails: Emit  and stop. Do NOT emit a success OUTCOME.

Carve-out convention: To exempt a legitimate exception swallow, add an inline comment on the same line as the except clause:

except Exception:  # swallow-ok: safe during shutdown, task already cancelled
    pass

The reason must be at least 10 non-whitespace characters. Bare # swallow-ok: or whitespace-only reasons do NOT pass.

OUTCOME Contract Emission

As the very last line of your final response, emit an OUTCOME contract so the pipeline can classify the test result programmatically:

Success (all tests passed): 
Fail (test failures found): 
Partial (tests passed but with flaky tests or warnings):

This structured output is parsed by classify_outcome() in agent/pipeline_state.py (Tier 0) before any text pattern matching.

Notes

No temporary files in the repo -- use /tmp for any scratch work
Do not modify any source or test files -- this skill is read-only (it runs tests, it does not fix them)
Keep pytest output visible -- developers need to see the raw output for debugging
The -v --tb=short flags provide verbose test names with concise tracebacks

do-test

Do Test

Variables

Step 0: Discover Additional Test Skills

Argument Parsing

Changed-File Detection (--changed)

Constants

Execution Strategy

Single Target (specific type or file)

All Tests (no target specified)

Result Aggregation

Failure Baseline Verification

When to Run

Step 0.5: Flaky Filter (Retry Before Baseline)

Step 1: Collect Failing Test Node IDs

Step 2: Check Regression Counter Context

Step 3: Dispatch Baseline Verifier

Step 4: Integrate Classification into Results

Step 5: Regression Circuit Breaker

Adjusted Final Verdict (with Baseline Verification)

Frontend Testing (frontend target)

Happy Path Testing (happy-paths target)

Execution:

Result format:

When running all tests:

CWD-Relative Execution

Error Handling

Quality Checks (Post-Test)

Exception Swallow Scan

Empty Input Check

Closure Coverage Flag

Stale xfail Hygiene Scan

Exception Swallow Gate

OUTCOME Contract Emission

Notes

Do Test

Variables

Step 0: Discover Additional Test Skills

Argument Parsing

Changed-File Detection (--changed)

Constants

Execution Strategy

Single Target (specific type or file)

All Tests (no target specified)

Result Aggregation

Failure Baseline Verification

When to Run

Step 0.5: Flaky Filter (Retry Before Baseline)

Step 1: Collect Failing Test Node IDs

Step 2: Check Regression Counter Context

Step 3: Dispatch Baseline Verifier

Step 4: Integrate Classification into Results

Step 5: Regression Circuit Breaker

Adjusted Final Verdict (with Baseline Verification)

Frontend Testing (frontend target)

Happy Path Testing (happy-paths target)

Execution:

Result format:

When running all tests:

CWD-Relative Execution

Error Handling

Quality Checks (Post-Test)

Exception Swallow Scan

Empty Input Check

Closure Coverage Flag

Stale xfail Hygiene Scan

Exception Swallow Gate

OUTCOME Contract Emission

Notes

Changed-File Detection (`--changed`)

Frontend Testing (`frontend` target)

Happy Path Testing (`happy-paths` target)

Changed-File Detection (`--changed`)

Frontend Testing (`frontend` target)

Happy Path Testing (`happy-paths` target)