| name | deep-debug |
| description | The single entry point for ALL debugging — bugs, test failures, CI failures, Metaflow run failures, Jenkins build failures, or any unexpected behavior. Trigger phrases include "debug this", "why is this failing", "find the bug", "fix the bug", "root cause", "what's wrong with", "this is broken", "diagnose", "troubleshoot", "investigate this failure", "the test is failing", "this used to work", "why doesn't this work", "where's the bug", "CI failed", "PR is red", "why did my flow fail", "build failed", "Jenkins failure". Composes specialized sub-skills internally (debug-pr for CI, debug-run for Metaflow, debug-jenkins for builds) — the agent always enters through deep-debug, never routes to sub-skills directly. |
| user_invocable | true |
| argument | The bug description, symptom, error message, or reproduction context |
| category | debug |
| capabilities | ["hypothesis-testing","root-cause-analysis","parallel-agents","ensemble-judges"] |
| best_for | ["Diagnosing bugs with competing hypotheses","Root-cause analysis for test failures or production incidents","Debugging when the cause is unclear or disputed"] |
| not_for | ["Flaky tests (use flaky-test-diagnoser)","Reviewing code for defects (use deep-qa)","Simple known bugs that just need fixing (use autopilot)"] |
| input_types | ["code-path","artifact-file"] |
| output_types | ["diagnosis","code"] |
| output_signals | ["termination_label","hypothesis_count","fix_applied","test_passes"] |
| complexity | complex |
| model_tier | sonnet |
| cost_profile | high |
| execution | {"sagaflow":"required","temporal_skill":"deep-debug-temporal","estimated_duration":"15-45min"} |
| related_skills | [{"name":"flaky-test-diagnoser","relation":"alternative","note":"Specialized for intermittent/flaky test failures"},{"name":"deep-qa","relation":"alternative","note":"When you want to find defects rather than diagnose a symptom"},{"name":"autopilot","relation":"follow-up","note":"Drive a fix to completion after diagnosis"}] |
| maturity | stable |
Deep Debug Skill
Adversarial hypothesis-driven debugging. Given a bug, generate competing hypotheses across orthogonal dimensions, judge each independently (two-pass blind), run discriminating probes that falsify the weaker, fix the survivor with test-first discipline, and escalate to architectural review after 3 failed attempts. Output is a verified fix with a failing-test-now-passes proof, OR an honest termination label naming why the bug resisted a code-level fix.
Execution Model
All operations use Claude Code primitives. These contracts are non-negotiable:
- All data passed to agents via files, never inline. Evidence, hypothesis lists, probe specs, known-hypothesis IDs — all written to disk before the agent prompt. Inline data is silently truncated.
- State written before agent spawn, not after. `spawn_time_iso` is written to state.json before the Agent tool call. Spawn failure records `spawn_failed` status. Resume uses persisted state, never in-memory reconstruction.
- Structured output is the contract; free-text is ignored. Every judge, probe-verdict, and validator produces machine-parseable structured lines. Coordinator reads only structured fields. Unparseable output triggers fail-safe classification (`disputed`, never `rejected`); see the parsing sketch after this list. Hypothesis output files MUST contain STRUCTURED_OUTPUT_START/STRUCTURED_OUTPUT_END markers; files without these markers are treated as failed.
- No coordinator self-review of anything load-bearing. Hypothesis plausibility, evidence tier, rebuttal verdicts, probe winner classification — all delegated to independent agents. The coordinator orchestrates; it does not evaluate.
- Termination labels are honest. One of 7 defined labels. Never "probably fixed," never "looks right," never "no critical bugs remain." A fix is `Fixed — reproducing test now passes` only when the failing test exists, passes with the change, and the full suite is clean.
- Hard ceilings are absolute. `max_cycles = 3`, `max_probes_per_cycle = 3`, `fix_attempt_count ≤ 3`. Three fix attempts is evidence the hypothesis space is wrong — Phase 7 architectural escalation is mandatory, never optional.
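A minimal sketch of the fail-safe parse, in Python, assuming the structured block is a flat `KEY: value` list as in the judge template later in this file; the helper names are illustrative, not part of the skill's actual tooling.

```python
import re

def parse_structured_block(text: str) -> dict | None:
    """Return the KEY: value fields between the STRUCTURED_OUTPUT markers, or None."""
    match = re.search(r"STRUCTURED_OUTPUT_START(.*?)STRUCTURED_OUTPUT_END", text, re.S)
    if not match:
        return None                 # missing markers: the file is treated as failed
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields or None

def classify_verdict(raw_output: str) -> str:
    """Fail-safe: unparseable judge output becomes 'disputed', never 'rejected'."""
    fields = parse_structured_block(raw_output)
    if fields is None:
        return "disputed"
    return fields.get("PLAUSIBILITY", "disputed")
```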
Shared contracts: this skill inherits the four execution-model contracts (files-not-inline, state-before-agent-spawn, structured-output, independence-invariant) from _shared/execution-model-contracts.md. The items listed above are the skill-specific elaborations; the shared file is authoritative for the base contracts.
Pressure awareness: this skill applies the pressure circuit breakers from _shared/pressure-awareness.md. After 3 probes with no hypothesis falsification, regenerate the hypothesis space rather than probing a 4th time. The existing max_cycles = 3 and fix_attempt_count ≤ 3 ceilings serve as the same-approach ceiling; this contract adds the diminishing-returns check between probe cycles.
Cross-finding coherence: this skill applies the coherence-integrator pattern from _shared/cross-finding-coherence.md at Phase 3, after hypothesis agents complete and BEFORE the judge classifies plausibility. The integrator reads all hypothesis files simultaneously and annotates each hypothesis with cross-hypothesis relationships (contradictions indicating the same root cause viewed from different angles, emergent patterns suggesting a higher-level architectural issue, coverage gaps in the 8-dimension space). These annotations are included in judge input files.
Subagent watchdog: every run_in_background=true spawn (hypothesis agents, judges, evidence-gatherer, architect) MUST be armed with a staleness monitor per _shared/subagent-watchdog.md. Use Flavor A with thresholds STALE=5 min, HUNG=20 min for Sonnet hypothesis agents; STALE=3 min, HUNG=10 min for Haiku judges and evidence-gatherer. Debugging agents that hang silently are the exact failure mode this skill is meant to prevent — applying it to the skill itself is load-bearing. Contract inheritance: timed_out_heartbeat joins this skill's 7-label termination vocabulary at the per-lane level (hypothesis agent / judge / probe); stalled_watchdog and hung_killed join hypotheses.{id}.status. A watchdog-killed hypothesis lane never returns leading or rejected — its verdict is absent, not assessed.
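A sketch of the staleness check with the thresholds quoted above; the `last_heartbeat_iso` field and the exact status names are assumptions drawn from this paragraph, since _shared/subagent-watchdog.md is not reproduced here.

```python
from datetime import datetime, timezone

# Thresholds quoted above: Sonnet hypothesis agents vs. Haiku judges / evidence-gatherer.
THRESHOLDS_MIN = {"sonnet": {"stale": 5, "hung": 20}, "haiku": {"stale": 3, "hung": 10}}

def watchdog_status(last_heartbeat_iso: str, tier: str) -> str:
    """Classify a background lane: ok, stalled_watchdog, or hung_killed."""
    last = datetime.fromisoformat(last_heartbeat_iso)
    if last.tzinfo is None:
        last = last.replace(tzinfo=timezone.utc)
    silent_min = (datetime.now(timezone.utc) - last).total_seconds() / 60
    limits = THRESHOLDS_MIN[tier]
    if silent_min >= limits["hung"]:
        return "hung_killed"        # kill the lane; its verdict is absent, not assessed
    if silent_min >= limits["stale"]:
        return "stalled_watchdog"   # surface staleness, keep the lane alive for now
    return "ok"
```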
Philosophy
Debugging fails most often in three ways:
- Fixation on the symptom site — the stack trace points at file:line, the fix goes there, the real cause is 5 calls upstream. Solo debuggers rarely trace data-flow all the way back.
- Hypothesis anchoring — you form the first plausible hypothesis, evidence confirms it (because you're looking for confirmation), you fix it, the bug returns.
- Fix-and-retry spiral — 3 failed attempts is the signal the abstraction is wrong, but under time pressure debuggers try fix #4, #5, #6, each one revealing a new problem in a different location.
Deep-debug addresses all three structurally:
- Orthogonal dimensions force at least one hypothesis in the `concurrency`, `environment`, and `architecture` categories — the three most commonly under-investigated
- Independent judge with disconfirmation mandate prevents anchoring on the first fit
- Hard architectural escalation at 3 failed fixes converts fix-spiral into a conscious pattern-level decision
If you only take one idea from this skill: evidence-for and evidence-against are the hypothesis contract. A hypothesis without disconfirming searches is never leading; it's plausible at best. See EVIDENCE.md.
The Iron Law
NO FIXES WITHOUT A HYPOTHESIS THAT SURVIVES INDEPENDENT JUDGE + DISCRIMINATING PROBE
If Phase 3 judge has not classified the hypothesis as leading AND Phase 4 probe has not confirmed it against the next-best alternative, you cannot advance to Phase 5. Violating the letter of this process is violating the spirit of debugging.
When to Use
Use deep-debug for:
- Bugs that survived a quick-fix attempt (systematic-debugging already ran and the symptom remains or moved)
- Production incidents requiring post-mortem rigor
- Test failures that cannot be reproduced locally
- Performance regressions where the cause is ambiguous
- Build / deploy / signing / infrastructure failures with multi-layer pipelines
- Behavior divergence between environments (local ✓, CI ✗, or one deployment region only)
- Any bug where ≥ 2 plausible root causes exist and the coordinator is tempted to guess
Don't use for:
- Obvious typos, compiler errors with clear messages, or known-issue-with-known-fix cases → use `superpowers:systematic-debugging`
- Test flakiness where the target is "is this test flaky, and why specifically?" → use `flaky-test-diagnoser` (it's purpose-built)
- Bugs the user already has a hypothesis for and just wants implemented → implement directly; save deep-debug for when the cause is genuinely unknown
- Pure code review or static audit → use `deep-qa` with `--type code`
Workflow
Phase -1: Context-Aware Composition Routing
Before the debug cycle, detect if a specialized sub-skill applies. Invoke it via Skill() — deep-debug is the orchestrator, not a bypass.
| Context detected | Sub-skill to invoke | How to detect |
|---|---|---|
| GitHub PR URL or "CI failed on PR" | debug-pr | URL matches github.*/pull/\d+ or user mentions PR + CI failure |
| Metaflow pathspec or "flow failed" | debug-run | Input contains FlowName/RunID pattern or "metaflow" + failure language |
| Jenkins build URL or "build failed" | debug-jenkins | URL matches a Jenkins host or "jenkins" + failure language |
If a sub-skill matches: invoke it first to gather structured context (CI logs, run status, build output). Feed its output into Phase 0 as the symptom — then continue with deep-debug's hypothesis-driven cycle on the root cause. The sub-skill handles evidence collection; deep-debug handles root-cause analysis.
If no sub-skill matches: proceed directly to Phase 0.
Phase 0: Input Validation Gate
Before any cycle begins, validate the symptom. Batch clarifying questions — if multiple surface here, present them in one message as a numbered list. Never serially.
Step 0a — Safety check:
- Symptom describes a bug in the user's code, not a request to attack a third-party system
- If harmful: refuse
Step 0b — Scope check:
- Is the symptom specific enough to investigate? ("the app is slow" is too vague; "POST /api/orders p99 latency jumped from 200ms to 2s after {commit}" is fine)
- If too vague: ask for: exact error (copy-paste the message), reproduction steps (commands that trigger it), affected environment (local/CI/prod), and when it started (timestamp or "since merge X")
Step 0c — Reproduction confirmation:
- Attempt to reproduce immediately (run the provided command; trigger the flow)
- Record status in state: `confirmed` (every time), `intermittent` (1-in-N with rate), or `unreproducible` (cannot trigger)
- If `unreproducible` → user batch-question: can they provide fresh logs, a stack trace, or a less-filtered reproduction?
- Still `unreproducible` after user input → terminate with label `Cannot reproduce — investigation blocked` and exit cleanly. Do NOT hypothesize against a bug you cannot see.
Step 0d — Symptom locking:
- Extract a ≤200-word symptom statement — the exact error, the surface, when it started, the reproduction command
- Compute `symptom_sha256`; store in state.json
- This statement is the concept-anchor — every hypothesis is a causal explanation of this symptom. Later phases MUST NOT silently rewrite the symptom to match an emerging hypothesis. `symptom_sha256` catches tampering (a locking sketch follows).
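A minimal sketch of the lock-and-verify step, assuming state.json holds `symptom` and `symptom_sha256` keys as described; the function names are illustrative.

```python
import hashlib
import json

def lock_symptom(state_path: str, symptom: str) -> str:
    """Phase 0d: store the symptom statement and its sha256 in state.json."""
    digest = hashlib.sha256(symptom.encode("utf-8")).hexdigest()
    with open(state_path) as f:
        state = json.load(f)
    state["symptom"] = symptom
    state["symptom_sha256"] = digest
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
    return digest

def assert_symptom_unchanged(state_path: str) -> None:
    """Later phases re-check the hash; a mismatch means the symptom was tampered with."""
    with open(state_path) as f:
        state = json.load(f)
    current = hashlib.sha256(state["symptom"].encode("utf-8")).hexdigest()
    if current != state["symptom_sha256"]:
        raise RuntimeError("SYMPTOM_TAMPERED: symptom was rewritten after Phase 0 lock")
```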
Step 0e — Pre-mortem micro-round (blind-spot seeding):
Before Phase 2 expansion, spawn 1 Haiku agent with this prompt:
Given the symptom {symptom}, list 5 concrete ways this debugging investigation could miss the real cause. Cover these angles:
- Wrong dimension — the symptom looks like {apparent dimension} but is actually {other dimension}
- Measurement artifact — the bug might not exist; we could be reading stale/wrong evidence
- Environment assumption — local behavior assumed, production differs
- Framework-contract blindness — something assumed guaranteed that actually isn't
- Architectural drift — this is symptom N of an architectural problem; fixing here won't hold
Output to `deep-debug-{run_id}/premortem.md` with one concrete claim per angle.
Each flagged blind-spot becomes an angle with `priority: critical` and `source: premortem` in Phase 2.
Step 0f — Pre-run scope declaration:
Deep debug: "{symptom}"
Reproduction: {confirmed | intermittent (1-in-N) | unreproducible-investigation-blocked}
Dimensions: 8 (code-path, data-flow, recent-changes, environment, framework-contract, concurrency-timing, measurement-artifact, architectural-coupling)
Max cycles: {default 3} | Hard stop: {6} cycles | Budget cap: ${25.00}
Estimated cost: ~$3-8 per cycle (Sonnet hypothesis + Haiku judge + Haiku probes)
Invocation: {interactive | --auto | --hard-mode}
Continue? [y/N]
Skip gate if --auto.
Print: Starting deep-debug on: {symptom_excerpt} [run: {run_id}]
Phase 1: Evidence Gathering
Purpose: Collect the ground-truth artifacts every hypothesis in Phase 2 will be evaluated against. Evidence is tier-ranked per EVIDENCE.md.
Step 1a — Symptom capture (always):
- Exact error text / stack trace / log excerpt — copy verbatim into `deep-debug-{run_id}/evidence.md` §Symptom
- Timestamp of first occurrence (if known)
- Reproduction command (from Phase 0c)
Step 1b — Recent changes snapshot (always):
- `git log --since='7 days ago' --oneline -- {affected-paths}` → evidence.md §Recent Changes
- Lock file diffs (`package-lock.json` / `poetry.lock` / `Gemfile.lock`) vs. last-known-good, if known
- Config diffs
- Even if this looks unrelated to the symptom, record it. `recent-changes` is one of the 8 dimensions — agents will search for the connection.
Step 1c — Boundary instrumentation (when multi-layer):
- If the system has obvious layer boundaries (workflow → build → sign; API → service → DB; frontend → gateway → backend): add instrumentation at each layer to log inputs/outputs
- Run the reproduction once with instrumentation; capture all layer logs into evidence.md §Boundary Instrumentation
- This converts tier-5 stack-position guesses into tier-2 primary-artifact evidence for Phase 2
- Skip if single-layer or instrumentation would take > 30 minutes (add to Phase 4 probe list instead)
Step 1d — Lock evidence:
- evidence.md is append-only after Phase 1. Phase 4 probe results append to §Probe Log; nothing rewrites earlier sections.
Phase 2: Hypothesis Generation
Prospective gate (fires BEFORE spawning):
Coordinator outputs:
Cycle {N}: about to spawn 6 hypothesis agents + 1 outside-frame agent (Sonnet tier).
Estimated cost: ~${estimate} (cumulative: ${running_total} of ${budget_cap})
Dimensions queued: {list of angles' dimensions}
Required categories covered so far: {counts from state.json}
Continue? [y/N/redirect:<focus>]
Skip if --auto. If --auto AND cumulative budget exceeds 80% of cap: force user confirmation even in auto mode (budget backstop).
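A sketch of the gate decision, including the 80% budget backstop; the return values and argument names are illustrative, not a defined interface.

```python
def gate_decision(cycle_estimate: float, running_total: float,
                  budget_cap: float, auto_mode: bool) -> str:
    """Prospective gate before spawning hypothesis agents.

    Returns 'ask_user' when confirmation is required, 'proceed' otherwise.
    Even in --auto, crossing 80% of the budget cap forces a confirmation.
    """
    projected = running_total + cycle_estimate
    if not auto_mode:
        return "ask_user"              # interactive runs always gate
    if projected >= 0.8 * budget_cap:
        return "ask_user"              # budget backstop applies even in auto mode
    return "proceed"
```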
Step 2a — Enumerate angles (see DIMENSIONS.md):
For cycle 1:
- Generate 2–4 angles per dimension (8 dimensions × 2–4 = 16–32 angles)
- Generate 2–3 cross-dimensional angles
- Add `premortem` angles from Phase 0e
- Cap total frontier at 30. Priority-order per DIMENSIONS.md §Priority Ordering.
For cycle 2+:
- Start from graveyard (hypotheses rejected in prior cycle) — what dimensions were under-investigated?
- Add CRITICAL-priority angles for any dimension that had zero hypotheses accepted by the judge
- Add CRITICAL-priority angles for any required category still uncovered
Step 2b — Write data files before spawning:
- `deep-debug-{run_id}/known-hypotheses.md` — list of all prior-cycle hypothesis IDs and 1-line titles (so critics don't re-propose)
- `deep-debug-{run_id}/angles/{angle_id}.md` — one file per angle with dimension, question, parent, rationale
- `deep-debug-{run_id}/evidence.md` is referenced by path; critics read it directly
Step 2c — Spawn hypothesis agents (parallel):
- Pop up to 6 highest-priority angles; assign to spec-derived hypothesis agents
- Spawn 1 additional outside-frame hypothesis agent (slot #7) — seeded from the symptom only, not the current hypothesis list or evidence file narrative
- Before each spawn: write `status: "in_progress"` and `spawn_time_iso` to state.json (see the sketch at the end of this step)
- Spawn each Agent in parallel (subagent_type: general-purpose, model: sonnet) with file paths — NOT inline content
- Hypothesis output path: `deep-debug-{run_id}/hypotheses/{angle_id}.md` (content-addressed — coordinator CANNOT overwrite)
- If Agent tool returns error: record `status: "spawn_failed"`, do not record as spawned; retry on next round
Quorum: Round complete if ≥ 4 of 6 spec-derived agents return parseable output within timeout (outside-frame tracked separately — its exit does not affect quorum).
Timeout scaling: 180s base. ×1.5 for cycles 2+ (larger evidence file).
Circuit breaker: ≥ 3 consecutive rounds with any critic failures → halt, log SYSTEM_FAILURE_ROUND, notify at turn boundary.
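A sketch of the state-before-spawn write, the quorum rule, and the timeout scaling described in this step; the state.json layout shown is an assumption, not a confirmed schema from STATE.md.

```python
import json
from datetime import datetime, timezone

def record_spawn(state_path: str, angle_id: str) -> None:
    """Write status + spawn_time_iso BEFORE the Agent tool call (state-before-spawn contract)."""
    with open(state_path) as f:
        state = json.load(f)
    state.setdefault("hypotheses", {})[angle_id] = {
        "status": "in_progress",
        "spawn_time_iso": datetime.now(timezone.utc).isoformat(),
    }
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)

def round_complete(parseable_spec_derived: int) -> bool:
    """Quorum: >= 4 of the 6 spec-derived agents returned parseable output.

    The outside-frame agent (slot #7) is tracked separately and never counts here.
    """
    return parseable_spec_derived >= 4

def cycle_timeout_seconds(cycle: int, base: int = 180) -> float:
    """Timeout scaling: 180s base, x1.5 from cycle 2 onward (larger evidence file)."""
    return base * (1.5 if cycle >= 2 else 1.0)
```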
Step 2d — After agents complete:
- For each completed agent: read hypothesis file, extract hypothesis struct, dedup (see STATE.md Tier 1 + Tier 2), update state.json
- Spec-derived agents may suggest ≤ 1 new angle per cycle (immutable once written)
- Outside-frame agent may suggest ≤ 2 new angles (its exemption — it brings novel dimensions)
- Update coverage tracking: which dimensions / required categories are now covered?
Phase 3: Independent Hypothesis Judge (two-pass blind)
Purpose: Classify every hypothesis into plausibility tiers (leading | plausible | disputed | rejected | deferred) using a judge that cannot see the critic's confidence claim during pass 1. This prevents inflation.
Step 3a — Strip confidence from each hypothesis file:
- Coordinator produces `deep-debug-{run_id}/judge-inputs/batch_{cycle}_{batch_num}.md` (up to 5 hypotheses per batch)
- Stripping is mechanical: remove the **Confidence:** line and the `CONFIDENCE:` field from STRUCTURED_OUTPUT
- Stripping recorded in state.json as `judge_input_stripped: true`
- Original hypothesis files stay immutable
Step 3a-coherence — Cross-hypothesis coherence integrator:
Per _shared/cross-finding-coherence.md, adapted for hypotheses:
- Collect all parseable hypothesis output files from this cycle (post-dedup).
- Write integrator input manifest to `deep-debug-{run_id}/coherence/cycle-{N}-input.md`.
- Spawn Sonnet coherence-integrator agent. Output: `deep-debug-{run_id}/coherence/cycle-{N}-coherence.md`. Timeout: 120s.
- Parse structured annotations. For hypotheses, the key relationships are:
  - `CONTRADICTS` — two hypotheses make incompatible causal claims about the same component. At least one is wrong; the judge should scrutinize both.
  - `PATTERN_MEMBER` — multiple hypotheses from different dimensions point to the same underlying architectural issue. This is early signal for Phase 7 escalation.
  - `SUPERSEDED_BY` — one hypothesis is a strict subset of another (same mechanism, narrower scope). The judge should prefer the broader hypothesis.
- Attach annotations to `hypotheses.{id}.coherence_annotation` in state.json.
- Feed `GAP` lines into the frontier as CRITICAL-priority angles for the next cycle.
- If unparseable/timed out: proceed in degraded mode. Log `COHERENCE_DEGRADED`.
Escalation signal: If the integrator identifies ≥ 3 hypotheses across different dimensions as PATTERN_MEMBER of the same root cause, this is structural evidence that the bug is architectural. Log COHERENCE_ARCHITECTURAL_SIGNAL — this does not trigger Phase 7 directly (fix_attempt_count is still the trigger) but the signal is surfaced in Phase 8 report.
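A sketch of the escalation-signal check, assuming each coherence annotation carries a relation, a cluster id, and a dimension; the exact annotation shape is defined in _shared/cross-finding-coherence.md and may differ.

```python
from collections import defaultdict

def architectural_signal(annotations: list[dict]) -> list[str]:
    """Return cluster ids where >= 3 PATTERN_MEMBER hypotheses span different dimensions.

    Annotation shape assumed here:
    {"hyp_id": ..., "relation": "PATTERN_MEMBER", "cluster": ..., "dimension": ...}
    """
    dims_per_cluster = defaultdict(set)
    for ann in annotations:
        if ann.get("relation") == "PATTERN_MEMBER":
            dims_per_cluster[ann["cluster"]].add(ann["dimension"])
    # Log COHERENCE_ARCHITECTURAL_SIGNAL for these clusters; Phase 7 is still
    # triggered only by fix_attempt_count, and the signal surfaces in Phase 8.
    return [c for c, dims in dims_per_cluster.items() if len(dims) >= 3]
```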
Step 3b — Spawn background batched judge (Haiku):
- Batch ≤ 5 hypotheses per agent (cost optimization)
- Agent receives: batch input file path, evidence file path, known-hypothesis list, and coherence annotations per hypothesis (from Step 3a-coherence; `STANDALONE` if integrator ran in degraded mode)
- Run in background (`run_in_background=true`) — next cycle's critics can start while judges finish
- Timeout: 90s per batch
Step 3c — Two-pass judge protocol (see FORMAT.md §Judge Verdict):
- Pass 1: Judge classifies each hypothesis blind (no critic confidence claim). Applies 5 validation checks: falsifiability / contradiction / premise / evidence-grounding / simplicity (see §Validation Checks below). Writes pass-1 verdict to `deep-debug-{run_id}/judge-verdicts/batch_{cycle}_{batch_num}.md`.
- Pass 2: Coordinator provides the critic's confidence claim. Judge can CONFIRM, UPGRADE, or DOWNGRADE with rationale. Writes pass-2 addendum to the same file.
- Final plausibility is the pass-2 verdict.
Step 3d — Judge integrity check:
- After each cycle: if the judge's `leading`-rate is ≥ 80% on ≥ 5 hypotheses, flag `judge_suspect: true` in state and tag the cycle's leader `[JUDGE_SUSPECT]`. The coordinator does not re-decide — it surfaces the concern.
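A minimal sketch of the integrity check; it only computes the flag, since the coordinator surfaces the concern rather than re-deciding.

```python
def judge_suspect(verdicts: list[str]) -> bool:
    """Flag judge_suspect when the leading-rate is >= 80% over >= 5 hypotheses."""
    if len(verdicts) < 5:
        return False
    leading = sum(1 for v in verdicts if v == "leading")
    return leading / len(verdicts) >= 0.8
```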
Step 3e — Drain pending judges before Phase 4:
- Use `TaskOutput` with `block=true` for each background judge batch
- Apply results to state.json (overwrite preliminary critic plausibility with judge's authoritative verdict)
- If a batch timed out: retain critic-proposed plausibility, set `judge_status: "timed_out"`, log `JUDGE_TIMEOUT: batch_{id}`
Step 3f — Rebuttal round (if ≥ 2 hypotheses at leading or plausible):
- Spawn independent rebuttal agent (Sonnet) — separate from judge
- Input: leader + strongest alternative + evidence file
- Output: `deep-debug-{run_id}/rebuttal-cycle{N}.md`
- See EVIDENCE.md §Rebuttal Round for protocol
- Update plausibility based on rebuttal outcome
Phase 4: Discriminating Probe Execution
Purpose: Collapse uncertainty between the top-2 hypotheses with a concrete experiment. Not "think harder" — a command that produces hard evidence.
Step 4a — Probe design:
- If exactly 1 hypothesis at `leading` AND 0 hypotheses at `plausible` → no probe needed; advance to Phase 5 with the leader
- If ≥ 2 hypotheses at `leading` OR `plausible` → design a probe that distinguishes them
Probe requirements (see EVIDENCE.md §Discriminating Probe):
- Bounded time (≤30 minutes; ideally ≤10)
- Hard evidence output (log line, return value, test result)
- Pre-declared expected result for each hypothesis (post-hoc rationalization destroys probes)
- Reversible or non-destructive (touching production data requires user confirmation even in `--auto`)
Step 4b — Write probe spec:
- `deep-debug-{run_id}/probes/probe_{id}.md` (see FORMAT.md §Discriminating Probe Specification)
- `probe_count_this_cycle += 1` (hard cap: 3 per cycle)
Step 4c — Safety check:
- Does the probe write to production, delete data, or call an irreversible API? → USER CONFIRMATION REQUIRED
- If `--auto` AND probe is destructive → reject the probe; choose a safer alternative or defer the hypothesis
Step 4d — Execute:
- Simple probe (test, query, check): coordinator runs directly
- Complex probe (build artifact, multi-step instrumentation): spawn `evidence-gatherer` agent (Haiku, background) with probe spec as input
- Record raw output verbatim in evidence.md §Probe Log (NO summarization)
Step 4e — Apply verdict:
- Match actual result against pre-declared expectations:
- Matches "expected if A" → A wins; B's status →
falsified_by_probe
- Matches "expected if B" → B wins; A's status →
falsified_by_probe
- Matches neither / matches both → inconclusive → design second probe (up to cap)
- Contradicts both → both to
rejected; cycle ends as hypothesis_space_saturated; re-enter Phase 2 next cycle with updated evidence
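A sketch of the verdict matching, using naive substring comparison purely for illustration; the real acceptance criterion is whatever the probe spec pre-declared, and distinguishing "matches neither" from "contradicts both" belongs to that criterion, not to this sketch.

```python
def apply_probe_verdict(actual: str, expected_a: str, expected_b: str) -> str:
    """Match the raw probe result against the pre-declared expectations (Step 4e)."""
    matches_a = expected_a in actual
    matches_b = expected_b in actual
    if matches_a and not matches_b:
        return "a_wins"            # B's status -> falsified_by_probe
    if matches_b and not matches_a:
        return "b_wins"            # A's status -> falsified_by_probe
    return "inconclusive"          # matches both or neither: no winner from this probe
```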
Step 4f — Winner promotion:
- If a winner emerges → its status becomes `promoted_to_fix`; advance to Phase 5
- If probe cap hit without winner → cycle ends as `hypothesis_space_saturated`; `fix_attempt_count += 1` (counts toward hard stop); advance to Phase 6
Phase 5: Fix + Verify
Purpose: Convert the promoted-to-fix hypothesis into working code with test-first discipline and verified pass.
REQUIRED sub-skills (invoke in order):
- `superpowers:test-driven-development` — for writing the failing test
- `superpowers:verification-before-completion` — for confirming the fix
Step 5a — Write failing test (MANDATORY):
- Test must reproduce the exact symptom from Phase 0
- Run it — it MUST fail before the fix
- If no test framework exists → write a one-off reproduction script that exits non-zero on the bug
- `fix_attempts.{N}.failing_test_written: true` → update state.json
- If you skip this step, the skill's Iron Law is violated — halt, require explicit `--allow-untested` override from user (rarely appropriate)
Step 5b — Implement ONE change:
- Address the hypothesis's identified mechanism
- ONE focused change; no bundling, no "while I'm here" refactoring
- Use the Edit tool with explicit old/new strings
- Write the diff to `deep-debug-{run_id}/fixes/fix_{N}.diff` (use git diff)
Step 5c — Verification:
- Re-run the failing test → MUST pass
- Run full test suite → MUST be clean (no new regressions)
- If any regression → `outcome: "failed"`, revert the change, record in state.json, advance to Phase 6
- Capture both test outputs into evidence.md §Fix Attempts Log
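A sketch of the two-step verification, assuming shell test commands such as pytest invocations; classifying a `partial` outcome (one small, clearly related regression) needs judgment and is outside this sketch.

```python
import subprocess

def verify_fix(repro_test_cmd: list[str], full_suite_cmd: list[str]) -> str:
    """Step 5c: the reproducing test must pass AND the full suite must stay clean.

    The commands are placeholders (e.g. ["pytest", "tests/test_repro.py"]);
    substitute whatever the project actually uses.
    """
    repro = subprocess.run(repro_test_cmd, capture_output=True, text=True)
    if repro.returncode != 0:
        return "failed"            # fix did not make the reproducing test pass
    suite = subprocess.run(full_suite_cmd, capture_output=True, text=True)
    if suite.returncode != 0:
        return "failed"            # new regression: revert and advance to Phase 6
    return "verified"
```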
Step 5d — Defense-in-depth (optional, strongly recommended):
- After a verified fix, add validation at layer boundaries per TECHNIQUES.md §Defense-in-Depth
- This is optional for cycle termination but is a report deliverable — the final report's Defense-in-Depth Suggestions section lists which layers the fix SHOULD get
Step 5e — Record outcome:
- `verified` (test passes + suite clean)
- `partial` (test passes, suite reveals 1 small regression that's clearly related and easy to fix inline → fix it and re-verify)
- `failed` (test still fails, or suite has unrelated regressions)
- `reverted` (change had to be rolled back)
Phase 6: Cycle Termination Check
After Phase 5 outcome recorded:
| Outcome | Action |
|---|---|
| `verified` | Advance to Phase 8 with termination label `Fixed — reproducing test now passes` |
| `partial` (re-verified inline) | Same as `verified` |
| `failed` AND `fix_attempt_count < 3` | Return to Phase 2 next cycle. Critical: the failure is NEW EVIDENCE — record what the fix changed, what didn't happen, what other failures appeared. Cycle 2's hypothesis agents get this in evidence.md |
| `failed` AND `fix_attempt_count >= 3` | MANDATORY Phase 7 escalation. No Fix #4 under the current hypothesis set. |
| `reverted` | Same as `failed` — counts toward `fix_attempt_count` |
| `hypothesis_space_saturated` (from Phase 3 or Phase 4) | `fix_attempt_count += 1` (counts even though no fix was attempted); if still < 3, return to Phase 2; if >= 3, Phase 7 |
Hard stop trigger: cycle >= hard_stop (default 6) → terminate with label Hard stop at cycle {N}; produce a partial report documenting the hypothesis graveyard.
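The termination table as a decision sketch; labels and thresholds mirror the table and the hard-stop rule above, and the string return values are illustrative.

```python
def cycle_termination(outcome: str, fix_attempt_count: int,
                      cycle: int, hard_stop: int = 6) -> str:
    """Phase 6 decision: where the run goes after the cycle's outcome is recorded."""
    if cycle >= hard_stop:
        return "phase_8: hard stop, partial report with hypothesis graveyard"
    if outcome in ("verified", "partial"):
        return "phase_8: Fixed"
    if outcome == "hypothesis_space_saturated":
        fix_attempt_count += 1     # counts even though no fix was attempted
    if fix_attempt_count >= 3:
        return "phase_7: architectural escalation (mandatory, no Fix #4)"
    return "phase_2: failure recorded as new evidence for the next cycle"
```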
Phase 7: Architectural Escalation
MANDATORY when fix_attempt_count >= 3 or hypothesis_space_saturated persists for 2 cycles.
Step 7a — Spawn architect agent (Opus — the one Opus spawn in this skill):
- Input: all prior cycles' hypothesis graveyards, all fix_attempts with outcomes, current evidence file, symptom
- Output: `deep-debug-{run_id}/architectural-question.md` per FORMAT.md §Architectural Question
- Task: "Is this a pattern problem (wrong shared-state model), an invariant problem (assumed-but-violated precondition), or a wrong-abstraction problem (the component being debugged shouldn't exist in this shape)?"
Step 7b — Return to user:
- Print the architectural question to the user at turn boundary
- DO NOT attempt Fix #4 under any hypothesis from the prior 3 attempts
- To continue, the user must re-invoke deep-debug with `--hypothesis-override={chosen-architectural-alternative}` after making the architectural decision
Step 7c — Terminate with label `Architectural escalation required — 3 fix attempts failed across distinct hypotheses`:
- Advance to Phase 8 with the architectural question as the primary deliverable
Phase 8: Final Report
- Do NOT read raw hypothesis files — use coordinator summary + judge verdicts + state.json
- Spawn Sonnet subagent to write `deep-debug-{run_id}/debug-report.md` (see FORMAT.md §Final Debug Report)
- Termination label must be from the 7-label vocabulary (below) — never improvised
- If termination is `Fixed`: include resolved-cause section + defense-in-depth suggestions
- If termination is `Environmental`: include retry/monitoring contract
- If termination is `Architectural escalation required`: include reference to architectural-question.md
- Coverage caveats MUST be prominent when required categories are uncovered
- If coherence integrator identified emergent patterns: include 'Cross-Dimensional Patterns' section showing which hypotheses shared root causes and whether the pattern signal aligned with the final fix hypothesis
Honest Termination Vocabulary (7 labels)
Exactly one applies at Phase 8:
| Label | When |
|---|---|
| `Fixed — reproducing test now passes` | Phase 5 verified: failing test passes + suite clean |
| `Environmental — requires retry/monitoring, not fix` | Investigation confirmed the bug is not in code (flaky infrastructure, upstream service, timing dep) AND a retry/monitoring contract is specified |
| `Architectural escalation required — 3 fix attempts failed across distinct hypotheses` | Phase 7 triggered by `fix_attempt_count >= 3` |
| `Hypothesis space saturated — no plausible hypothesis survives judge` | All dimensions explored, every hypothesis judge-rejected or probe-falsified |
| `Cannot reproduce — investigation blocked` | Phase 0 reproduction failed and user couldn't provide steps |
| `User-stopped at phase N` | User chose to stop at a gate |
| `Hard stop at cycle N` | `cycle >= hard_stop` reached |
Never use a label not in this table. Never write "probably fixed" / "mostly fixed" / "should be fixed" / "no critical bugs remain."
Validation Checks
Every hypothesis that reaches the judge passes these 5 checks. Failing one = disputed (not rejected). The judge applies them in pass 1.
- **Falsifiability check** — Can you construct a concrete scenario where this hypothesis manifests AND a scenario where it does not? If not, the claim is unfalsifiable. Examples of unfalsifiable: "users might be slow sometimes" (no threshold), "there could be a race somewhere" (no specific ordering).
- **Contradiction check** — Does this hypothesis contradict another accepted hypothesis or primary-tier evidence? If yes, at least one is misdiagnosed; flag both for re-examination.
- **Premise check** — Would this hypothesis manifest in practice given the surrounding code/framework's existing guarantees? Examples that FAIL this check: "race between non-IO lines in gevent" (gevent cooperative scheduling prevents this), "SQLAlchemy cross-session mutation" (scoped sessions are isolated unless explicitly shared).
- **Evidence-grounding check** — Does the hypothesis EXPLAIN the observed symptom, or is it consistent with it but not explanatory? Example: "bug is in file X" (consistent with stack trace but doesn't explain WHY the bug manifests) vs. "bug is in file X because parse() returns None for empty input, which the caller at Y doesn't check, causing AttributeError at Z" (explanatory).
- **Simplicity check (Occam)** — If two hypotheses explain the evidence equally, prefer the one with fewer assumptions. A hypothesis that requires "and also X is misconfigured and also Y is out of order" when a simpler hypothesis would explain the same evidence should be down-ranked.
Incident Mode (--incident)
When --incident is passed, deep-debug enters priority mode:
- Skip PR creation — deploy hotfix directly
- Escalate immediately if 3 fix attempts fail (don't loop silently)
- Emit structured incident timeline for post-mortem
Extended Scope (ported from fix skill)
Deep-debug covers not just code bugs but also:
- Model drift — accuracy degradation after data refresh, feature distribution shift
- Data issues — pipeline dropping records, schema mismatches, partition gaps
- Infrastructure — resource exhaustion, auth failures, config drift
Ship Phase (after fix verified)
After verification passes:
- Commit fix + regression test atomically
- Create PR with: symptom description, root cause analysis, fix description, test evidence
- If `--incident`: skip PR, deploy hotfix directly
- If cross-provider model available (`mcp__pal__codereview`): run parallel verification on a non-Claude model for blind-spot diversity
Flags
- `--incident` — priority/hotfix mode: deploy directly, skip PR
- `--no-ship` — stop after verified fix, don't commit/PR
- `--max-iterations=N` — cap hypothesis-fix-verify loop (default: 5)
Golden Rules
- No fixes without a hypothesis that survives independent judge + discriminating probe. The Iron Law. Violating the letter of this process is violating the spirit of debugging.
- Every hypothesis must be falsifiable with a concrete probe. "Could happen" is not a hypothesis; "would happen under X given Y, falsifiable by Z" is.
- Symptom-site ≠ cause-site — trace upstream. The stack trace points where the error surfaces, not where it originates. Use TECHNIQUES.md §Root Cause Tracing for the `data-flow` dimension.
- Evidence strength matters — rank, don't flatten. A tier-1 controlled reproduction outranks ten tier-5 circumstantial clues. See EVIDENCE.md §Evidence Strength Hierarchy.
- One fix at a time. No bundling, no "while I'm here" refactoring. If the fix doesn't work, you can't isolate what changed.
- Test-first for fixes. Invoke `superpowers:test-driven-development`. The failing test is evidence you understand the bug; a passing test after the change is evidence you fixed the right thing.
- Three failed fixes = architectural escalation (MANDATORY). No Fix #4 under the current hypothesis set. Phase 7 is not optional.
- Coordinator orchestrates; never self-classifies. Hypothesis plausibility, evidence tier, rebuttal verdict, probe winner — all delegated to independent agents. Coordinator reads structured output, doesn't produce verdicts.
- Framework guarantees are evidence — but verify, don't assume. "The ORM handles that" / "Gevent prevents that" / "The scoped session isolates that" — cite the specific code or docs that guarantee it, or the hypothesis relying on that guarantee is tier-4 inference, not tier-2 primary.
- Environmental bugs get explicit retry/monitoring contracts — never silent "it'll work next time." The `Environmental` termination label requires a specific retry policy, monitoring, and escalation threshold.
- Disputed hypotheses are documented, never silently dropped. The final report surfaces every hypothesis the judge rejected or disputed, with rationale.
- Judges must be adversarial. 100% `leading`-rate across a cycle = broken judge (`judge_suspect: true`). Expected acceptance is 30–60% of critic-proposed hypotheses.
- Disconfirmation must be attempted. A hypothesis without at least one disconfirmation search is `plausible` at best, never `leading`. Confirmation-only evidence is confirmation bias dressed as investigation.
Anti-Rationalization Counter-Table
These are excuses the coordinator will be tempted to reach for under pressure. Each row is a defensive entry — when you catch yourself thinking the excuse, look at the reality.
| Excuse | Reality |
|---|---|
| "The stack trace points here — this is obviously the bug." | Symptom-site bias. The stack trace shows where the error surfaces, not where the data went wrong. Run root-cause-tracing (Golden Rule 3); the cause is often 3–5 frames upstream. |
| "I know this codebase — I don't need a hypothesis judge." | Overconfidence is the dominant mode of failed debugging. The judge's independence invariant applies whether or not you "know" the code — even experts anchor on the first fit. |
| "One more fix attempt, the pattern is almost there." | At 3+ failed fixes you're debugging the wrong abstraction. Phase 7 escalation is mandatory (Golden Rule 7). No Fix #4. |
| "This race condition is unlikely under normal load." | Unlikely × scale = certain. Apply the premise check: can you construct the scenario where it manifests? If yes, it's a real hypothesis — probe it, don't dismiss it. |
| "The test is flaky, not the code." | Sometimes. But "flaky" usually means race, ordering, or pollution. Invoke flaky-test-diagnoser before terminating as environmental. |
| "The framework prevents this (gevent / session scope / ORM lazy-loading)." | Framework guarantees are evidence — verify them in code, not in your head. Cite file:line of the guarantee. Un-cited framework claims are tier-4 inference, not tier-2 primary. (Golden Rule 9) |
| "Skip the failing test — I'll manually verify." | Manual verification is not a test. Untested fixes don't stick (the change surface gets edited, the bug comes back). Test-first is Golden Rule 6 for a reason. |
| "Multiple fixes at once saves time." | Can't isolate what worked. If symptom returns or a new bug appears, you have no signal on which change caused it. One fix at a time (Golden Rule 5). |
| "The probe would take 20 minutes — just fix it." | Probe cost is bounded (≤30 min). Fix-then-revert cost is unbounded (writing the fix, debugging its own issues, backing it out, re-hypothesizing). Run the probe. |
| "Judge downgraded this but my evidence is strong." | Judge saw the same evidence. If you disagree, use the challenge-token mechanism with a rationale — don't overrule silently. A coordinator that routinely overrides the judge has violated the independence invariant. |
| "It reproduced once — that's enough to commit." | One reproduction on your machine is not "fixed." Run the full test suite AND re-run the failing test multiple times (if concurrency-related). Only Fixed — reproducing test now passes requires both the targeted test AND the full suite. |
| "Previous hypothesis was close — just tweak it." | A changed hypothesis gets a fresh judge pass from scratch. Previous leading verdicts on pre-mutation hypotheses are stale. |
| "I don't need a discriminating probe; I'll look at the code more carefully." | Re-reading code is gathered evidence, not new evidence. Probes collapse uncertainty by producing hard observations (log line, test result, return value). Code review re-confirms what you already believed. (EVIDENCE.md §Discriminating Probe) |
| "The symptom and the recent change are obviously unrelated." | recent-changes is a required dimension precisely because "obviously unrelated" changes are how regressions hide. The judge will demand the linking probe anyway. |
| "This looks environmental — skip to retry." | The Environmental termination label requires completed investigation across all 8 dimensions AND a retry/monitoring contract. "Skip investigation" is not a label. |
| "I found ONE hypothesis that fits — that's the answer." | One hypothesis that fits evidence is one of several that MAY fit. Generate competitors (Phase 2) before accepting. Anchoring on the first fit is the second-most-common failure mode of solo debugging. |
| "The 5 validation checks are overkill for this simple bug." | Simple bugs pass the checks quickly — running them costs nothing. "Skip because simple" is the rationalization that lets unvalidated hypotheses become fixes that don't hold. |
| "I'll update the symptom to match the new understanding." | Symptom is locked at Phase 0 with sha256 verification. Silent rewrite = SYMPTOM_TAMPERED and halts the run. If new understanding changes the symptom, start a new run with a new symptom. |
| "Probes fired, no winner — I'll just pick the better-looking one." | Inconclusive probe ⇒ inconclusive. Either design another probe (up to cap) or accept hypothesis_space_saturated — don't silently pick a loser. |
| "3 failed fixes is bad luck — the next will hold." | 3 failed fixes is a structural signal, not bad luck. Each fix revealed new information; the pattern is consistent: wrong abstraction. Phase 7 escalates. |
When you reach for any of these rationalizations: stop. Look at the reality column. Apply the relevant Golden Rule or validation gate.
Red Flags — STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the failing test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing a fix before judge has classified the hypothesis
- "One more fix attempt" (when
fix_attempt_count >= 2)
- Each failed fix revealed a new problem in a different location
- Judge downgraded the leader but coordinator promoted it anyway
- Skipped the discriminating probe because "I can tell by reading the code"
ALL of these mean: STOP. Return to whatever phase was skipped.
Self-Review Checklist
Before writing the final report, verify:
- The termination label is one of the 7 defined labels, never improvised
- If the label is `Fixed — reproducing test now passes`: the failing test exists, passes with the change, and the full suite is clean
- The symptom statement still matches `symptom_sha256` (no silent rewrite)
- Every judge-rejected or disputed hypothesis appears in the report with rationale
- Coverage caveats are prominent for any uncovered required category
- Hard ceilings were respected: `max_cycles`, `max_probes_per_cycle`, `fix_attempt_count ≤ 3`
Agent Prompt Templates
Hypothesis Agent (Sonnet — one per lane)
You are an adversarial hypothesis generator for a specific debugging angle. Your job is to generate ONE specific, falsifiable hypothesis that explains the observed symptom — with evidence for it AND evidence against it.
Your dimension: {angle.dimension}
Your specific angle: {angle.question}
Cycle: {cycle}
Evidence file: {evidence_file_path}
Read this file to understand the symptom, reproduction, recent changes, and prior boundary instrumentation.
Known hypotheses file: {known_hypotheses_file_path}
Read this file for hypothesis IDs and titles from this and prior cycles. Do NOT re-propose any of these.
Instructions:
1. Read the evidence file carefully. Focus especially on the sections relevant to your dimension.
2. Form ONE specific hypothesis that explains the symptom through the lens of your dimension.
3. Your hypothesis MUST include:
- Mechanism: a 2-4 sentence causal chain. "{cause} happens because {reason}, leading to {symptom}." Be specific about file:line where the mechanism lives.
- Predictions: 2+ observables that WOULD occur if this hypothesis is true (e.g. "running the test with INSTRUMENT=1 will log `projectDir=''`")
- Evidence For: tier-ranked per EVIDENCE.md — file:line, log lines, git history, prior boundary instrumentation
- Evidence Against: REQUIRED — at least one disconfirmation search attempted. If genuinely no contrary evidence found, say so explicitly with "Not yet disconfirmed — searches attempted: {describe}"
- Critical Unknown: ONE sentence naming the fact that would most collapse uncertainty
- Discriminating Probe: ONE concrete experiment (bounded time ≤30 min, hard evidence output, pre-declared expected result for this hypothesis vs. the most obvious alternative)
- Confidence: high | medium | low — based on evidence tier, not feelings
4. Apply the 5 validation checks yourself before filing: falsifiability / contradiction / premise / evidence-grounding / simplicity. If your hypothesis fails any, either fix it or file a minor note instead.
5. Report ≤ 1 new angle you discovered. Genuinely novel — not a rephrasing of this one.
6. Write findings to: {output_path}
7. Use FORMAT specified in FORMAT.md §Hypothesis File
8. Your output MUST include STRUCTURED_OUTPUT_START and STRUCTURED_OUTPUT_END markers
FALSIFIABILITY REQUIREMENT: Every hypothesis must be falsifiable — it must be possible to construct a scenario where it manifests AND a scenario where it does not. "Could be a race" without specific ordering is not a hypothesis — it's a concern. Unfalsifiable concerns go in the Mini-Synthesis as context, not as hypotheses.
FRAMEWORK CONTEXT RULE: If your hypothesis claims a framework behavior (gevent scheduling, SQLAlchemy sessions, Django ORM, asyncio), VERIFY the claim in the actual installed version's code or docs. An unverified framework claim is tier-4 inference, not tier-2 primary. Cite file:line of the guarantee.
DIMENSION DISCIPLINE: Your dimension constrains WHERE you look, not WHAT you conclude. If your lens reveals that the real cause is in a different dimension, say so in the Mini-Synthesis and file the discovered angle. Don't force a hypothesis in your dimension when the evidence points elsewhere.
IMPORTANT — AVOID THESE COMMON HYPOTHESIS MISTAKES:
- Don't re-propose a hypothesis from the known-hypotheses file (re-use of same mechanism under new title counts as re-proposing)
- Don't file a hypothesis you CAN'T falsify — that's a concern, not a hypothesis
- Don't confuse "consistent with evidence" with "explains the evidence" — the 5th validation check (evidence-grounding) catches the difference
- Don't inflate confidence to be persuasive — `CONFIDENCE: high` requires tier-1 or tier-2 evidence PLUS disconfirmation attempted
Outside-Frame Hypothesis Agent (Sonnet — slot #7)
You are an adversarial outside-frame hypothesis generator. Unlike the spec-derived agents that work inside the evidence and dimension framework, your job is to identify hypotheses the evidence narrative might have MISSED.
Symptom: {symptom_verbatim}
(You are NOT given the current evidence file or hypothesis list. Do not ask for them — your value comes from being unconstrained by the framing.)
Question to answer: "What kinds of causes would a careful investigator of this symptom CONSIDER that the current investigation might be overlooking?"
Instructions:
1. Read the symptom and think from first principles — what does a symptom like this typically mean in production systems?
2. Consider: infrastructure causes, operational causes, deploy-pipeline causes, upstream-service causes, data-corruption causes, clock-skew causes, multi-region inconsistency, and any other dimension NOT in the current 8-dimension frame
3. Propose ≤ 2 new hypothesis angles (not full hypotheses — angles that future spec-derived agents should investigate). Each angle: what it investigates + why it might have been missed
4. Propose ≤ 1 full hypothesis yourself — one you think is under-weighted in typical investigations of this symptom
5. Write findings to: {output_path}
6. Use FORMAT specified in FORMAT.md §Hypothesis File
7. Include STRUCTURED_OUTPUT markers
Your new angles do NOT reset the required-category coverage clock (spec-derived agents do that). They supplement the frontier with novel dimensions.
IMPORTANT: You succeed by identifying genuine gaps in how the symptom is being framed. You fail by re-proposing what a spec-derived agent would produce with the standard dimension list.
Hypothesis Judge (Haiku — batched, two-pass blind)
You are an independent hypothesis judge. Your job is to classify debugging hypotheses into plausibility tiers based on evidence and validation checks. You are NOT a rubber-stamp. You are a gatekeeper.
**You succeed by rejecting, disputing, or down-ranking hypotheses that don't meet the bar. You fail by rubber-stamping.**
A 100% `leading`-rate is broken judgment. Expected acceptance in a well-specified debugging run with 6 hypothesis agents is 30-60% at `leading` or `plausible`. A judge that accepts everything is not independent.
**Pass 1 input (blind — confidence stripped):**
- Batch file: {batch_input_path}
- Evidence file: {evidence_file_path}
**Pass 1 instructions:**
For EACH hypothesis in the batch, apply these 5 validation checks AS GENUINE GATEKEEPERS (not box-ticking):
1. **Falsifiability check:** Can you construct both a manifesting scenario and a non-manifesting scenario? If not → disputed.
2. **Contradiction check:** Does this hypothesis contradict another hypothesis in the batch or primary evidence? If yes, at least one is misdiagnosed → flag both.
3. **Premise check:** Given the framework/environment's existing guarantees, can this hypothesis manifest? Example of FAIL: "race between non-IO greenlets in gevent" — gevent's cooperative model prevents this; hypothesis is rejected.
4. **Evidence-grounding check:** Does the hypothesis EXPLAIN the symptom, or is it merely consistent with it? Example of FAIL: "bug is in X because X is on the stack trace" — on-stack is consistent, not explanatory.
5. **Simplicity check:** If two hypotheses explain the same evidence, prefer fewer assumptions (Occam).
Then assign plausibility:
- `leading` — evidence-tier ≤ 2, falsifiable, all 5 checks pass, disconfirmation attempted
- `plausible` — evidence-tier 3, falsifiable, survives validation, disconfirmation not yet attempted
- `disputed` — failed ≥ 1 validation check
- `rejected` — directly contradicted by primary-tier evidence
- `deferred` — probe would exceed budget
Write pass-1 verdict per hypothesis to: {verdict_path}
**Pass 2 input (after pass 1 written):**
- Critic's confidence claims: {confidence_claims_path}
**Pass 2 instructions:**
Review your pass-1 verdict against each critic's confidence claim. You may:
- CONFIRM (agree with pass 1)
- UPGRADE (critic's evidence revealed something you under-weighted — explain what)
- DOWNGRADE (critic's confidence is inflated relative to your independent assessment — explain why)
Write pass-2 addendum to the same verdict file. Final plausibility is your pass-2 conclusion.
Output ONLY this STRUCTURED block (coordinator reads only this):
STRUCTURED_OUTPUT_START
---
HYP_ID: {hypothesis_id}
PLAUSIBILITY: leading|plausible|disputed|rejected|deferred
FALSIFIABLE: true|false
EVIDENCE_TIER: 1|2|3|4|5|6
PASS2_VERDICT: CONFIRM|UPGRADE|DOWNGRADE
---
{repeat for each hypothesis in batch}
STRUCTURED_OUTPUT_END
If any hypothesis input is unparseable → default to `PLAUSIBILITY: disputed` for that hypothesis (never `rejected` on parse error — don't accidentally kill potentially-valid hypotheses).
CALIBRATION: `leading` = strong enough to warrant a fix attempt on its own. `plausible` = competitive with leader but needs a probe. `disputed` = validation failure. `rejected` = falsified. `deferred` = budget-blocked.
If your `leading`-rate across the batch is > 50%, re-read with skepticism — you are probably rubber-stamping.
Evidence-Gatherer (Haiku — for complex Phase 4 probes)
You are an evidence-gatherer executing a pre-specified discriminating probe. Your job is to run the probe AS SPECIFIED and report raw output — not to interpret.
Probe spec: {probe_spec_path}
Read this file. It contains: question, distinguishes, execution method, expected-per-hypothesis, acceptance criterion.
Instructions:
1. Execute the probe using the specified command / tool / environment
2. If the command fails to execute (tool error, timeout, permission): report the execution failure — do NOT substitute training-knowledge answers
3. Capture the raw output verbatim — no summarization, no abbreviation
4. Apply the acceptance criterion from the spec to classify the result
5. Write output to: {result_path}
6. Format:
Probe Execution Result: {probe_id}
Raw output
{verbatim output}
Classification
Matches expected for: {hypothesis_id | neither | both | inconclusive}
Reasoning: {1 sentence connecting raw output to acceptance criterion}
STRUCTURED_OUTPUT_START
PROBE_ID: {probe_id}
WINNER: {hypothesis_id | null}
FALSIFIED: [{list of hypothesis_ids}]
STATUS: completed | inconclusive | execution_failed
STRUCTURED_OUTPUT_END
IMPORTANT: You do NOT get to revise the probe spec. If the spec is flawed, report that in your reasoning — but execute the spec as written. Post-hoc spec rewrite destroys probe integrity.
Rebuttal Agent (Sonnet — runs in Phase 3f)
You are an adversarial rebuttal agent. Two hypotheses survived the judge at `leading` or `plausible`. Your job is to force the leader to defend against the strongest challenge the alternative can make.
Leader hypothesis file: {leader_path}
Alternative hypothesis file: {alternative_path}
Evidence file: {evidence_file_path}
Instructions:
1. Read both hypothesis files and the evidence file
2. Write the strongest possible challenge the alternative can make to the leader (§Challenge section)
- Format: "The leader predicts X; but evidence Y (tier Z) shows X is not observed. Therefore the leader is missing something." OR "The leader's mechanism depends on assumption Z; if Z is false, the leader fails. Evidence W shows Z is actually false because..."
3. Write the leader's best response (§Response section)
- MUST be evidence-based, not assertion-based. Cite tier-ranked evidence.
4. Determine outcome:
- LEADER_HOLDS: response refutes challenge with tier-1 or tier-2 evidence
- LEADER_WEAKENED: response is tier-4 or weaker while challenge is tier-2 or stronger → re-rank
- LEADER_FALSIFIED: challenger's evidence directly contradicts a leader prediction → leader to `rejected`
- INCONCLUSIVE: comparable evidence both sides → advance to discriminating probe
5. Check for convergence/separation:
- Same root mechanism? → merge
- Same next probe? → merge probes only (hypotheses stay separate if other predictions differ)
- Different probes? → keep separate
Write to: {output_path} per FORMAT.md §Rebuttal Round Transcript
STRUCTURED_OUTPUT_START
LEADER: {hyp_id}
ALTERNATIVE: {hyp_id}
OUTCOME: LEADER_HOLDS|LEADER_WEAKENED|LEADER_FALSIFIED|INCONCLUSIVE
NEW_LEADER: {hyp_id}
CONVERGENCE: separate|merged|probe_merged
STRUCTURED_OUTPUT_END
IMPORTANT: Do not default to LEADER_HOLDS. A rebuttal that changes nothing every time is a failed adversarial round — you are probably defending the leader rather than challenging it.
Architect (Opus — Phase 7 only, spawned ONCE per escalation)
You are an architectural advisor. Three fix attempts have failed on the same symptom. Each fix revealed new information; the pattern is consistent. Your job is to identify the PATTERN-LEVEL question the failed fixes reveal — not to propose another fix.
Inputs:
- State file: {state_file_path} — includes all cycles, hypotheses, probes, fix attempts with outcomes
- Evidence file: {evidence_file_path}
- Symptom: {symptom_verbatim}
Instructions:
1. Read the fix history. For each failed fix, note: which hypothesis was promoted, what change was made, what broke (new regression? old symptom re-surfaced in a new location? fix required massive refactoring to implement cleanly?)
2. Pattern-diagnose:
- **Pattern problem:** state is shared across N call sites without coordination → each fix changes one site, breaks another
- **Invariant problem:** an assumed precondition is violated by a specific call path → each fix adds a check, a new path bypasses it
- **Wrong-abstraction problem:** the component being debugged shouldn't exist in its current shape; fixes try to force correctness onto an abstraction that can't express it
3. Name the architectural question in ≤ 3 sentences. Not "where's the bug" — "should this shared state exist at all?" / "is this abstraction the right shape?" / "does this component's responsibility match what callers expect?"
4. Propose 2-3 concrete architectural alternatives. For each: change description, which failed fixes this would resolve, refactoring scope sketch, risks.
5. Recommend one — or if tradeoffs are balanced, name the axes the user should weigh.
Write to: {output_path} per FORMAT.md §Architectural Question
IMPORTANT: You are NOT being asked for another fix. You are being asked to diagnose at the pattern level. A recommendation like "try this specific code change" is a Phase 5 answer — out of scope. Phase 7's deliverable is the question and the architectural alternatives.
Post-Completion: Retrospect Scan (always runs)
After the final report (Phase 8), run a lightweight retrospect scan:
- Scan for P1-P3 signals (user corrections, self-corrections, structural failures)
- Scan for P6 signals: check if any QA or judge verdicts reveal defect clusters — same failure class 2+ times, or a hypothesis that should have been tried earlier but was systematically missed. Maps to a behavioral gap, not a one-off miss.
- If P1-P3 or qualifying P6 signals found: invoke `/retrospect` for full enforcement-first analysis
- If none found: skip — no output needed
This creates a self-improving loop: debugging anti-patterns become enforceable rules even when the user doesn't intervene.
Integration with Other Skills
Deep-debug explicitly delegates to these skills at specific phases:
- `superpowers:test-driven-development` — REQUIRED in Phase 5 step 5a for writing the failing test. Do not skip.
- `superpowers:verification-before-completion` — REQUIRED in Phase 5 step 5c before marking a fix verified. "I ran the test manually" is not verification.
- `flaky-test-diagnoser` — INVOKE when Phase 0c reproduction status is `intermittent`. Its bisection / ordering-permutation / timing-analysis workflow produces evidence that deep-debug would otherwise re-build poorly.
- `oh-my-claudecode:trace` — deep-debug absorbed trace's evidence-for/against + discriminating-probe contract into its workflow. If the user already ran /trace and wants to continue with fix + verify, they can pass the trace output as symptom context to deep-debug. Don't run trace separately inside deep-debug.
- `oh-my-claudecode:deep-interview` — OPTIONAL in Phase 7 if the architectural question needs requirements-crystallization before the user can choose an alternative.
- `superpowers:systematic-debugging` — the lighter-weight sibling. If the symptom description suggests an obvious cause, the user should use that skill; deep-debug is parallel, expensive, and thorough, and overkill for simple bugs.
Mode Flags
- `--auto` — skip all interactive gates, use defaults, stop at budget cap. Still requires USER confirmation for destructive probes (Phase 4 safety check) even in auto mode.
- `--hard-mode` — force Phase 7 architectural escalation to run even on the first failed fix. Useful for post-mortems where the goal is "understand the pattern-level issue" not "just fix this one bug." `max_cycles = 1` implicitly in hard-mode.
- `--hypothesis-override=<description>` — in re-runs after Phase 7, the user names the architectural alternative they chose. deep-debug starts with this as the leading hypothesis; other dimensions still run as secondary.
- `--allow-untested` — skip test-first requirement in Phase 5 (RARELY appropriate — e.g. one-off script, legacy code without test framework). Final report prominently flags the fix as untested.
Supplementary Files
- DIMENSIONS.md — 8 hypothesis dimensions with typical angles, required categories, cross-dimensional angles, priority ordering
- EVIDENCE.md — evidence strength hierarchy, disconfirmation rules, rebuttal round, discriminating probe criteria, down-ranking vocabulary
- FORMAT.md — output formats for hypothesis file, evidence file, judge verdict, rebuttal, probe spec, architectural question, final report
- STATE.md — state.json schema, concurrency rules, dedup algorithm, state updates, invariants, recovery logic
- TECHNIQUES.md — ported techniques: root cause tracing, defense-in-depth, condition-based waiting (used across Phases 1, 5, 8)
Execution routing (sagaflow-first — Contract 6)
Sagaflow is the default execution path. The in-session workflow above is the FALLBACK, used only when the sagaflow worker is confirmed unavailable. See _shared/execution-model-contracts.md Contract 6.
Routing sequence (mandatory before any in-session work):
- Run `sagaflow doctor`
- If healthy → launch via sagaflow below. Stop. Do not run in-session.
- If unhealthy → log `SAGAFLOW_UNAVAILABLE`, proceed with in-session fallback.
Launch command:
Bash(
run_in_background=true,
command="sagaflow launch deep-debug --arg symptom='<SYMPTOM>' --arg reproduction='<REPRO>' --arg num_hypotheses=<N> --await"
)
Substitute <SYMPTOM> with a one-line description of the failure, <REPRO> with the exact command to trigger it (empty string if unknown), and <N> with the hypothesis count (3–6, default 4). The workflow writes ~/.sagaflow/runs/<run_id>/debug-report.md with the termination label, leading hypothesis, and proposed fix (or escalation rationale).