| name | flaky-test-diagnoser |
| description | Systematically diagnoses why a test is flaky by running multi-run experiments, isolation tests, ordering permutations, and timing analysis. Use when the user says a test is flaky, intermittent, non-deterministic, randomly failing, passes sometimes, or asks to debug test flakiness. |
| category | debug |
| capabilities | ["hypothesis-testing","root-cause-analysis","backoff-retry"] |
| best_for | ["Diagnosing intermittent or flaky test failures","Running multi-run experiments to isolate non-determinism","Distinguishing ordering, timing, shared state, and resource leak causes"] |
| not_for | ["Consistently failing tests (use deep-debug)","Reviewing code for defects (use deep-qa)","Fixing the flaky test after diagnosis (use autopilot)"] |
| input_types | ["code-path"] |
| output_types | ["diagnosis"] |
| output_signals | ["termination_label","hypothesis_count","fail_rate","root_cause_category"] |
| complexity | complex |
| model_tier | sonnet |
| cost_profile | high |
| execution | {"sagaflow":"required","temporal_skill":"flaky-test-diagnoser-temporal","estimated_duration":"20-60min"} |
| related_skills | [{"name":"deep-debug","relation":"alternative","note":"For consistently failing bugs rather than intermittent ones"},{"name":"autopilot","relation":"follow-up","note":"Fix the flaky test after diagnosis"}] |
| maturity | stable |
Flaky Test Diagnoser
Runs competing-hypothesis experiments to isolate the true root cause of a flaky test. Produces a diagnosis with reproducible evidence, or an honest inconclusive label. Never promotes a guess to "root cause."
This is a diagnostic skill, not a fixer. The goal is to explain why the test is intermittent, with falsifiable evidence, not to jump into patching code.
Execution contracts
Subagent watchdog: any time this skill spawns a test-running subagent via run_in_background=true (multi-run protocols, ordering bisection, distinguishing experiments), it MUST be armed with a staleness monitor per _shared/subagent-watchdog.md. Use Flavor A (monitor tail per spawn) with thresholds STALE=15 min, HUNG=45 min: test suites can legitimately take a long time, especially with N>=10 reruns, but a 45-min silent window is pathological. TaskOutput status reports PID liveness only; output-file mtime is the ground truth for progress. Contract inheritance: timed_out_heartbeat joins this skill's termination vocabulary as a peer of blocked_by_environment (a watchdog kill is a different failure class: the test-runner subagent hung, not the test environment). A watchdog-killed experiment contributes zero runs to the N>=10 count; the skill must retry or escalate.
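As an illustration of the mtime-is-ground-truth rule, a monitor loop might run a staleness check like the one below per spawn. This is a sketch under assumptions: OUT is the file the subagent appends to, and the kill/escalation mechanics follow _shared/subagent-watchdog.md, which remains the authority.

```bash
# Sketch only: classify a spawned test-runner by the age of its output file.
# $OUT is the output file the subagent appends to (an assumption of this sketch).
now=$(date +%s)
mtime=$(stat -c %Y "$OUT" 2>/dev/null || stat -f %m "$OUT")  # GNU stat, then BSD fallback
age_min=$(( (now - mtime) / 60 ))

if [ "$age_min" -ge 45 ]; then
  echo "HUNG: silent for ${age_min} min -> kill subagent, record timed_out_heartbeat"
elif [ "$age_min" -ge 15 ]; then
  echo "STALE: silent for ${age_min} min -> keep watching"
fi
```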
Workflow
- Bootstrap run state: create flaky-diag-{run_id}/ in CWD with state.json, hypotheses/, runs/, experiments/. See STATE.md.
- Detect test runner: identify the project's test framework and runner command from config files (see the detection sketch after this list). See RUNNERS.md.
- Confirm flakiness: run the target test N>=10 times in isolation, record pass/fail per run, and compute the fail rate. Record each run under runs/ (see the multi-run sketch after this list). See EXPERIMENTS.md.
- Generate competing hypotheses: produce >=3 deliberately different hypotheses from the 6 categories (ORDERING, TIMING, SHARED_STATE, EXTERNAL_DEPENDENCY, RESOURCE_LEAK, NON_DETERMINISM). Each hypothesis gets its own file in hypotheses/ following the schema in FORMAT.md. See ANALYSIS.md for category signatures.
- Gather evidence for AND against each hypothesis: run the structured experiments (isolation, ordering bisection, timing, environment, code reads) and record findings as evidence_for / evidence_against entries per hypothesis (the bisection sketch after this list shows the halving strategy). See EXPERIMENTS.md.
- Design a distinguishing experiment per surviving hypothesis: an experiment whose outcome uniquely discriminates that hypothesis from its rivals. An experiment that "merely supports" all hypotheses does NOT count. See FORMAT.md.
- Run distinguishing experiments: record outcomes under experiments/. Down-rank hypotheses whose distinctive prediction failed. See GOLDEN-RULES.md for the falsifiability gate.
- Rebuttal round: let the second-ranked hypothesis present its best rebuttal against the leader. The leader must answer with evidence, not assertion.
- Attempt red-green-red demonstration: if a leader has been selected, describe the minimal fix and have the user apply it (or apply it in an experimental branch). Rerun the multi-run protocol (N>=10). Only label root_cause_isolated_with_repro if the red-green-red cycle succeeds.
- Emit structured output and diagnosis report: each hypothesis gets a STRUCTURED_OUTPUT_START / STRUCTURED_OUTPUT_END block. The final report follows REPORT.md.
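A minimal sketch of the config-file probe behind the detect-test-runner step. The set of files checked here is an illustrative assumption; RUNNERS.md is the authority for detection and for the per-runner command templates.

```bash
# Sketch only: guess the runner from common config files.
detect_runner() {
  if [ -f pytest.ini ] || grep -qs '\[tool\.pytest' pyproject.toml; then
    echo "pytest"
  elif grep -qs '"jest"' package.json; then
    echo "jest"
  elif [ -f go.mod ]; then
    echo "go test"
  elif [ -f Cargo.toml ]; then
    echo "cargo test"
  else
    echo "unknown runner" >&2
    return 1
  fi
}
```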
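The confirm-flakiness step reduces to a loop of the following shape, assuming RUN_CMD holds the exact single-test command; the canonical multi-run protocol, including what each run record must contain, is in EXPERIMENTS.md.

```bash
# Sketch only: run the target test N times, log each run, report the fail rate.
N=10
fails=0
mkdir -p runs
for i in $(seq 1 "$N"); do
  if $RUN_CMD > "runs/run-$i.log" 2>&1; then   # word-splitting of RUN_CMD is intentional here
    echo "run $i: PASS"
  else
    echo "run $i: FAIL"
    fails=$((fails + 1))
  fi
done
echo "fail_rate: $fails/$N"
```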
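For the ordering-bisection experiment, the search over tests that run before the target halves the candidate list on each probe instead of re-running the suite once per candidate. A sketch assuming a pytest-style runner and an interference that reproduces deterministically; in practice each probe should itself be repeated per the multi-run protocol, and EXPERIMENTS.md has the canonical algorithm.

```bash
# Sketch only: bisect the ordered list of tests that precede $TARGET.
reproduces() { ! pytest "$@" "$TARGET" >/dev/null 2>&1; }  # true when the failure reproduces

bisect() {
  local tests=("$@")
  if [ "${#tests[@]}" -le 1 ]; then
    echo "interfering test: ${tests[0]}"
    return
  fi
  local mid=$(( ${#tests[@]} / 2 ))
  if reproduces "${tests[@]:0:mid}"; then
    bisect "${tests[@]:0:mid}"   # interferer is in the first half
  else
    bisect "${tests[@]:mid}"     # otherwise it is in the second half
  fi
}
```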
Honest termination labels
Pick exactly one. Never invent synonyms.
| Label | Meaning | Requires |
|---|---|---|
| root_cause_isolated_with_repro | Single hypothesis confirmed; fix demonstrated with red-green-red across N>=10 runs | Distinguishing experiment passed; all rivals empirically falsified; fix verified |
| narrowed_to_N_hypotheses | Could not discriminate between N surviving hypotheses; additional probes recommended | Each surviving hypothesis has evidence_for; no distinguishing experiment available within budget |
| inconclusive_after_N_runs | Could not reproduce flakiness, or no hypothesis gathered sufficient evidence | N>=20 runs executed; no consistent signal |
| blocked_by_environment | Cannot execute experiments (runner broken, missing deps, no shell access, tests require secrets) | Specific blocker identified, with the exact error output |
Never use "diagnosed", "likely root cause", "probably", "should be", or "root cause found" as a termination label without a red-green-red demo. See GOLDEN-RULES.md.
Self-review checklist
Before delivering the report, verify that ALL of the golden rules below hold and that the chosen termination label meets its "Requires" column.
Golden rules
Hard rules. See GOLDEN-RULES.md for the full text, anti-rationalization counter-table, and the falsifiability gate.
- Never promote a hypothesis to root cause without a distinguishing experiment that empirically rejects all rivals.
- "Diagnosed" requires red-green-red: test fails -> fix applied -> test passes deterministically across N>=10 runs. Without it, the label is narrowed_to_N_hypotheses.
- Always run isolation before ordering. If the test fails in isolation, ordering analysis is irrelevant.
- Bisect, never brute-force, when searching for an interfering test.
- Capture exact commands verbatim. Every experiment logs its shell command so the user can reproduce it.
- Minimum 10 runs for any statistical claim; flaky tests can have fail rates under 10% (see the detection-probability note after this list).
- Never modify test code during diagnosis beyond temporary instrumentation that is reverted before the report is emitted.
- Evidence for AND against every hypothesis. A hypothesis with only supporting evidence is unfalsifiable and must be flagged.
- The coordinator does not self-approve. Hypothesis ranking is based on experiment outcomes recorded in files, not coordinator opinion.
- Structured output is the contract. Per-hypothesis evaluations live between STRUCTURED_OUTPUT_START/STRUCTURED_OUTPUT_END markers; free-text commentary is ignored when computing the verdict.
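The 10-run floor in the rules above is just detection probability: the chance of seeing at least one failure in N runs of a test with per-run fail rate p is 1 - (1 - p)^N. At p = 0.05 (an illustrative figure), 10 runs miss the flake about 60% of the time, which is also why inconclusive_after_N_runs demands N>=20. A quick way to tabulate it:

```bash
# P(observe >= 1 failure in N runs) = 1 - (1 - p)^N
awk 'BEGIN { p = 0.05; for (n = 5; n <= 20; n += 5) printf "N=%2d: %.2f\n", n, 1 - (1 - p)^n }'
```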
Cancellation / resume
If interrupted mid-run: on the next invocation, read flaky-diag-{run_id}/state.json, identify the highest-generation completed stage, and replay from the next stage. Never recompute completed experiments; trust the recorded run artifacts. See STATE.md for the replay protocol.
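As a sketch of the replay decision, assuming (hypothetically) that state.json tracks stages as {name, generation, status} records; the actual schema and replay protocol are defined in STATE.md.

```bash
# Sketch only: find the last completed stage; field names here are hypothetical.
last=$(jq -r '[.stages[] | select(.status == "completed")] | max_by(.generation) | .name // "none"' \
  "flaky-diag-${RUN_ID}/state.json")
echo "last completed stage: $last; replay from its successor"
```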
Reference files
| File | Contents |
|---|---|
| RUNNERS.md | Test runner detection, command templates for pytest/jest/junit/go test/cargo test, and how to target a single test |
| EXPERIMENTS.md | Multi-run protocol, isolation protocol, ordering bisection algorithm, timing instrumentation, red-green-red demo protocol |
| ANALYSIS.md | Root cause categories, code pattern matching for flakiness signals, environment factor checklist, decision matrix |
| FORMAT.md | Per-hypothesis evaluation schema, STRUCTURED_OUTPUT_START markers, distinguishing-experiment template |
| STATE.md | state.json schema, per-stage state layout, resume protocol |
| GOLDEN-RULES.md | Full golden rules text, falsifiability gate, anti-rationalization counter-table |
| REPORT.md | Diagnosis report output template with honest termination labels |
Execution routing (sagaflow-first, Contract 6)
Sagaflow is the default execution path. The in-session workflow above is the FALLBACK, used only when the sagaflow worker is confirmed unavailable. See _shared/execution-model-contracts.md Contract 6.
Routing sequence (mandatory before any in-session work):
- Run sagaflow doctor.
- If healthy: launch via sagaflow below. Stop. Do not run in-session.
- If unhealthy: log SAGAFLOW_UNAVAILABLE and proceed with the in-session fallback.
Launch command:
Bash(
run_in_background=true,
command="sagaflow launch flaky-test-diagnoser --arg test='<TEST_ID>' --arg command='<RUN_CMD>' --arg n_runs=<N> --await"
)
Both test and command are REQUIRED. Substitute <TEST_ID> with the test identifier (e.g. tests/test_foo.py::test_bar), <RUN_CMD> with the exact shell command that runs that test, and <N> with the stability-run count (default 10). The workflow writes ~/.sagaflow/runs/<run_id>/report.md with pass/fail table, isolation+ordering results, timing analysis, top hypothesis with evidence, and proposed fix or mitigation.
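For instance, a fully substituted launch for the example test identifier above could look like this; the pytest invocation is an assumption, and the command argument should be whatever the runner-detection step produced:

```
Bash(
  run_in_background=true,
  command="sagaflow launch flaky-test-diagnoser --arg test='tests/test_foo.py::test_bar' --arg command='pytest tests/test_foo.py::test_bar' --arg n_runs=10 --await"
)
```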