| name | flaky-test-diagnoser |
| description | Systematically diagnoses why a test is flaky by running multi-run experiments, isolation tests, ordering permutations, and timing analysis. Use when the user says a test is flaky, intermittent, non-deterministic, randomly failing, passes sometimes, or asks to debug test flakiness. |
| category | debug |
| capabilities | ["hypothesis-testing","root-cause-analysis","backoff-retry"] |
| best_for | ["Diagnosing intermittent or flaky test failures","Running multi-run experiments to isolate non-determinism","Distinguishing ordering, timing, shared state, and resource leak causes"] |
| not_for | ["Consistently failing tests (use deep-debug)","Reviewing code for defects (use deep-qa)","Fixing the flaky test after diagnosis (use autopilot)"] |
| input_types | ["code-path"] |
| output_types | ["diagnosis"] |
| output_signals | ["termination_label","hypothesis_count","fail_rate","root_cause_category"] |
| complexity | complex |
| model_tier | sonnet |
| cost_profile | high |
| execution | {"sagaflow":"required","temporal_skill":"flaky-test-diagnoser-temporal","estimated_duration":"20-60min"} |
| related_skills | [{"name":"deep-debug","relation":"alternative","note":"For consistently failing bugs rather than intermittent ones"},{"name":"autopilot","relation":"follow-up","note":"Fix the flaky test after diagnosis"}] |
| maturity | stable |
Flaky Test Diagnoser
Runs competing-hypothesis experiments to isolate the true root cause of a flaky test. Produces a diagnosis with reproducible evidence, or an honest inconclusive label. Never promotes a guess to "root cause."
This is a diagnostic skill, not a fixer. The goal is to explain why the test is intermittent, with falsifiable evidence, not to jump into patching code.
Execution contracts
Subagent watchdog: any time this skill spawns a test-running subagent via run_in_background=true (multi-run protocols, ordering bisection, distinguishing experiments), it MUST be armed with a staleness monitor per _shared/subagent-watchdog.md. Use Flavor A (monitor tail per spawn) with thresholds STALE=15 min, HUNG=45 min: test suites can legitimately take a long time, especially with N>=10 reruns, but a 45-min silent window is pathological. TaskOutput status reports PID liveness only; output-file mtime is the ground truth for progress. Contract inheritance: timed_out_heartbeat joins this skill's termination vocabulary as a peer of blocked_by_environment (a watchdog kill is a different failure class: the test-runner subagent hung, not the test environment). A watchdog-killed experiment contributes zero runs to the N>=10 count; the skill must retry or escalate.
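As an illustration of the mtime-is-ground-truth rule, a monitor loop might run a staleness check like the one below per spawn. This is a sketch under assumptions: OUT is the file the subagent appends to, and the kill/escalation mechanics follow _shared/subagent-watchdog.md, which remains the authority.

```bash
# Sketch only: classify a spawned test-runner by the age of its output file.
# $OUT is the output file the subagent appends to (an assumption of this sketch).
now=$(date +%s)
mtime=$(stat -c %Y "$OUT" 2>/dev/null || stat -f %m "$OUT")  # GNU stat, then BSD fallback
age_min=$(( (now - mtime) / 60 ))

if [ "$age_min" -ge 45 ]; then
  echo "HUNG: silent for ${age_min} min -> kill subagent, record timed_out_heartbeat"
elif [ "$age_min" -ge 15 ]; then
  echo "STALE: silent for ${age_min} min -> keep watching"
fi
```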
Workflow
- Bootstrap run state: create flaky-diag-{run_id}/ in CWD with state.json, hypotheses/, runs/, experiments/. See STATE.md.
- Detect test runner: identify the project's test framework and runner command from config files (see the detection sketch after this list). See RUNNERS.md.
- Confirm flakiness: run the target test N>=10 times in isolation, record pass/fail per run, and compute the fail rate. Record each run under runs/ (see the multi-run sketch after this list). See EXPERIMENTS.md.
- Generate competing hypotheses: produce >=3 deliberately different hypotheses from the 6 categories (ORDERING, TIMING, SHARED_STATE, EXTERNAL_DEPENDENCY, RESOURCE_LEAK, NON_DETERMINISM). Each hypothesis gets its own file in hypotheses/ following the schema in FORMAT.md. See ANALYSIS.md for category signatures.
- Gather evidence for AND against each hypothesis: run the structured experiments (isolation, ordering bisection, timing, environment, code reads) and record findings as evidence_for / evidence_against entries per hypothesis (the bisection sketch after this list shows the halving strategy). See EXPERIMENTS.md.
- Design a distinguishing experiment per surviving hypothesis: an experiment whose outcome uniquely discriminates that hypothesis from its rivals. An experiment that "merely supports" all hypotheses does NOT count. See FORMAT.md.
- Run distinguishing experiments: record outcomes under experiments/. Down-rank hypotheses whose distinctive prediction failed. See GOLDEN-RULES.md for the falsifiability gate.
- Rebuttal round: let the second-ranked hypothesis present its best rebuttal against the leader. The leader must answer with evidence, not assertion.
- Attempt red-green-red demonstration: if a leader has been selected, describe the minimal fix and have the user apply it (or apply it in an experimental branch). Rerun the multi-run protocol (N>=10). Only label root_cause_isolated_with_repro if the red-green-red cycle succeeds.
- Emit structured output and diagnosis report: each hypothesis gets a STRUCTURED_OUTPUT_START / STRUCTURED_OUTPUT_END block. The final report follows REPORT.md.
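A minimal sketch of the config-file probe behind the detect-test-runner step. The set of files checked here is an illustrative assumption; RUNNERS.md is the authority for detection and for the per-runner command templates.

```bash
# Sketch only: guess the runner from common config files.
detect_runner() {
  if [ -f pytest.ini ] || grep -qs '\[tool\.pytest' pyproject.toml; then
    echo "pytest"
  elif grep -qs '"jest"' package.json; then
    echo "jest"
  elif [ -f go.mod ]; then
    echo "go test"
  elif [ -f Cargo.toml ]; then
    echo "cargo test"
  else
    echo "unknown runner" >&2
    return 1
  fi
}
```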
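The confirm-flakiness step reduces to a loop of the following shape, assuming RUN_CMD holds the exact single-test command; the canonical multi-run protocol, including what each run record must contain, is in EXPERIMENTS.md.

```bash
# Sketch only: run the target test N times, log each run, report the fail rate.
N=10
fails=0
mkdir -p runs
for i in $(seq 1 "$N"); do
  if $RUN_CMD > "runs/run-$i.log" 2>&1; then   # word-splitting of RUN_CMD is intentional here
    echo "run $i: PASS"
  else
    echo "run $i: FAIL"
    fails=$((fails + 1))
  fi
done
echo "fail_rate: $fails/$N"
```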
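For the ordering-bisection experiment, the search over tests that run before the target halves the candidate list on each probe instead of re-running the suite once per candidate. A sketch assuming a pytest-style runner and an interference that reproduces deterministically; in practice each probe should itself be repeated per the multi-run protocol, and EXPERIMENTS.md has the canonical algorithm.

```bash
# Sketch only: bisect the ordered list of tests that precede $TARGET.
reproduces() { ! pytest "$@" "$TARGET" >/dev/null 2>&1; }  # true when the failure reproduces

bisect() {
  local tests=("$@")
  if [ "${#tests[@]}" -le 1 ]; then
    echo "interfering test: ${tests[0]}"
    return
  fi
  local mid=$(( ${#tests[@]} / 2 ))
  if reproduces "${tests[@]:0:mid}"; then
    bisect "${tests[@]:0:mid}"   # interferer is in the first half
  else
    bisect "${tests[@]:mid}"     # otherwise it is in the second half
  fi
}
```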
Honest termination labels
Pick exactly one. Never invent synonyms.
| Label | Meaning | Requires |
|---|---|---|
| root_cause_isolated_with_repro | Single hypothesis confirmed; fix demonstrated with red-green-red across N>=10 runs | Distinguishing experiment passed; all rivals empirically falsified; fix verified |
| narrowed_to_N_hypotheses | Could not discriminate between N surviving hypotheses; additional probes recommended | Each surviving hypothesis has evidence_for; no distinguishing experiment available within budget |
| inconclusive_after_N_runs | Could not reproduce flakiness, or no hypothesis gathered sufficient evidence | N>=20 runs executed; no consistent signal |
| blocked_by_environment | Cannot execute experiments (runner broken, missing deps, no shell access, tests require secrets) | Specific blocker identified, with the exact error output |
Never use "diagnosed", "likely root cause", "probably", "should be", or "root cause found" as a termination label without a red-green-red demo. See GOLDEN-RULES.md.
Self-review checklist
Before delivering the report, verify that ALL of the golden rules below hold and that the chosen termination label meets its "Requires" column.
Golden rules
Hard rules. See GOLDEN-RULES.md for the full text, anti-rationalization counter-table, and the falsifiability gate.
- Never promote a hypothesis to root cause without a distinguishing experiment that empirically rejects all rivals.
- "Diagnosed" requires red-green-red: test fails -> fix applied -> test passes deterministically across N>=10 runs. Without it, the label is narrowed_to_N_hypotheses.
- Always run isolation before ordering. If the test fails in isolation, ordering analysis is irrelevant.
- Bisect, never brute-force, when searching for an interfering test.
- Capture exact commands verbatim. Every experiment logs its shell command so the user can reproduce it.
- Minimum 10 runs for any statistical claim; flaky tests can have fail rates under 10% (see the detection-probability note after this list).
- Never modify test code during diagnosis beyond temporary instrumentation that is reverted before the report is emitted.
- Evidence for AND against every hypothesis. A hypothesis with only supporting evidence is unfalsifiable and must be flagged.
- The coordinator does not self-approve. Hypothesis ranking is based on experiment outcomes recorded in files, not coordinator opinion.
- Structured output is the contract. Per-hypothesis evaluations live between STRUCTURED_OUTPUT_START/STRUCTURED_OUTPUT_END markers; free-text commentary is ignored when computing the verdict.
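The 10-run floor in the rules above is just detection probability: the chance of seeing at least one failure in N runs of a test with per-run fail rate p is 1 - (1 - p)^N. At p = 0.05 (an illustrative figure), 10 runs miss the flake about 60% of the time, which is also why inconclusive_after_N_runs demands N>=20. A quick way to tabulate it:

```bash
# P(observe >= 1 failure in N runs) = 1 - (1 - p)^N
awk 'BEGIN { p = 0.05; for (n = 5; n <= 20; n += 5) printf "N=%2d: %.2f\n", n, 1 - (1 - p)^n }'
```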
Cancellation / resume
If interrupted mid-run: on the next invocation, read flaky-diag-{run_id}/state.json, identify the highest-generation completed stage, and replay from the next stage. Never recompute completed experiments; trust the recorded run artifacts. See STATE.md for the replay protocol.
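As a sketch of the replay decision, assuming (hypothetically) that state.json tracks stages as {name, generation, status} records; the actual schema and replay protocol are defined in STATE.md.

```bash
# Sketch only: find the last completed stage; field names here are hypothetical.
last=$(jq -r '[.stages[] | select(.status == "completed")] | max_by(.generation) | .name // "none"' \
  "flaky-diag-${RUN_ID}/state.json")
echo "last completed stage: $last; replay from its successor"
```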
Reference files
| File | Contents |
|---|---|
| RUNNERS.md | Test runner detection, command templates for pytest/jest/junit/go test/cargo test, and how to target a single test |
| EXPERIMENTS.md | Multi-run protocol, isolation protocol, ordering bisection algorithm, timing instrumentation, red-green-red demo protocol |
| ANALYSIS.md | Root cause categories, code pattern matching for flakiness signals, environment factor checklist, decision matrix |
| FORMAT.md | Per-hypothesis evaluation schema, STRUCTURED_OUTPUT_START markers, distinguishing-experiment template |
| STATE.md | state.json schema, per-stage state layout, resume protocol |
| GOLDEN-RULES.md | Full golden rules text, falsifiability gate, anti-rationalization counter-table |
| REPORT.md | Diagnosis report output template with honest termination labels |
Execution routing (sagaflow-first, Contract 6)
Sagaflow is the default execution path. The in-session workflow above is the FALLBACK, used only when the sagaflow worker is confirmed unavailable. See _shared/execution-model-contracts.md Contract 6.
Routing sequence (mandatory before any in-session work):
- Run sagaflow doctor.
- If healthy: launch via sagaflow below. Stop. Do not run in-session.
- If unhealthy: log SAGAFLOW_UNAVAILABLE and proceed with the in-session fallback.
Launch command:
Bash(
run_in_background=true,
command="sagaflow launch flaky-test-diagnoser --arg test='<TEST_ID>' --arg command='<RUN_CMD>' --arg n_runs=<N> --await"
)
Both test and command are REQUIRED. Substitute <TEST_ID> with the test identifier (e.g. tests/test_foo.py::test_bar), <RUN_CMD> with the exact shell command that runs that test, and <N> with the stability-run count (default 10). The workflow writes ~/.sagaflow/runs/<run_id>/report.md with pass/fail table, isolation+ordering results, timing analysis, top hypothesis with evidence, and proposed fix or mitigation.
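For instance, a fully substituted launch for the example test identifier above could look like this; the pytest invocation is an assumption, and the command argument should be whatever the runner-detection step produced:

```
Bash(
  run_in_background=true,
  command="sagaflow launch flaky-test-diagnoser --arg test='tests/test_foo.py::test_bar' --arg command='pytest tests/test_foo.py::test_bar' --arg n_runs=10 --await"
)
```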