| name | ds-review |
| description | This skill should be used when running Phase 4 of the /ds workflow or reviewing data analysis methodology. |
| user-invocable | false |
| disable-model-invocation | true |
| hooks | {"PostToolUse":[{"matcher":"Agent","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-post-subagent-guard.py"}]}],"PreToolUse":[{"matcher":"Agent","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-pre-subagent-clear.py"}]},{"matcher":"Read","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-read-after-subagent-guard.py"}]},{"matcher":"Grep","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-read-after-subagent-guard.py"}]},{"matcher":"Glob","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-read-after-subagent-guard.py"}]},{"matcher":"Write","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-no-main-chat-code-guard.py"}]},{"matcher":"Edit","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-no-main-chat-code-guard.py"}]},{"matcher":"Bash","hooks":[{"type":"command","command":"uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/ds-no-main-chat-code-guard.py"}]},{"matcher":"Agent","hooks":[{"type":"command","command":"GATE_ARTIFACT=.planning/VALIDATION.md GATE_STATUS=validated GATE_DESCRIPTION=\"Output validation\" GATE_REMEDY=\"Run ds-validate first; review is gated until .planning/VALIDATION.md has status: validated. A clean validation is validated. If gaps were found, ds-validate flips status to validated ONLY after the user explicitly accepts them (recorded in an Accepted Gaps section). An undispositioned gaps_found means the user has not yet decided fix-vs-accept — go back to ds-validate's gate.\" GATE_BLOCKED_TOOLS=Agent uv run python3 ${CLAUDE_PLUGIN_ROOT}/hooks/phase-gate-guard.py"}]}]} |
Announce: "Using ds-review (Phase 4) to check methodology and quality."
Context Monitoring
| Level | Remaining Context | Action |
|---|
| Normal | >35% | Proceed normally |
| Warning | 25-35% | Complete current review cycle, then trigger ds-handoff |
| Critical | ≤25% | Immediately trigger ds-handoff — do not start new review cycles |
Review Strategy Choice
After announcing phase, choose review strategy.
Skip this choice when:
- Exploratory analysis (one-off, not for publication)
- Trivial changes (formatting, documentation)
- Internal reporting (low-stakes, quick turnaround)
- Single notebook with < 100 LOC
Otherwise, ask the user:
AskUserQuestion(questions=[{
"question": "How should we review this analysis?",
"header": "Review Strategy",
"options": [
{"label": "Single reviewer (Default)", "description": "Combined review covering methodology, data quality, and reproducibility. Faster, lower overhead."},
{"label": "Parallel review (Research-grade)", "description": "Spawn 3 specialized reviewers (Methodology, Reproducibility, Code quality). Use for publications, high-stakes decisions, or research-grade work. Requires reconciliation."}
],
"multiSelect": false
}])
If Single reviewer: Proceed to The Iron Law of DS Review below (current behavior).
If Parallel review: Skip to Parallel Review (Research-Grade).
Parallel Review (Research-Grade)
Use this section when user chose "Parallel review (Research-grade)" above.
Prerequisite: Requires CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS enabled. If unavailable, fall back to single reviewer.
1. Prerequisites Check
Before spawning reviewers, verify:
.planning/SPEC.md exists - reviewers verify against spec, not assumptions
.planning/PLAN.md exists - reviewers check tasks were completed
.planning/LEARNINGS.md exists - reviewers verify data quality pipeline documented
- Analysis files identified - notebooks/scripts in scope for review
If any prerequisite fails, STOP and return to /ds-implement.
2. When to Use Parallel Review
Use parallel review when:
- Publication-bound work (papers, reports, external sharing)
- High-stakes decisions (business strategy, funding, policy)
- Research-grade analysis (academic standards, peer review)
- Regulatory compliance (audit trail required)
- Complex methodology (multiple statistical methods, model comparisons)
- Large codebases (4+ notebooks, multiple scripts)
Do NOT use when:
- Exploratory analysis (one-off, not for publication)
- Internal reporting (low-stakes, quick answers)
- Simple descriptive stats (counts, means, basic visualizations)
- Overhead exceeds benefit (single notebook, < 100 LOC)
3. Create Team and Spawn Reviewers
Team Creation
TeamCreate(name="Analysis Review", task_description="Parallel analysis review with 3 specialized reviewers")
Press Shift+Tab to enter delegate mode. The lead coordinates reviews, does NOT review analysis directly.
Spawn 3 Reviewers
Each reviewer receives a self-contained prompt from a reference file. Reviewers start with a blank conversation and do NOT auto-load skills. Read the prompt, substitute variables, and paste it in full.
Tool Restrictions: All reviewers (Methodology, Reproducibility, Code Quality) are READ-ONLY with allowed_tools=["Read", "Glob", "Grep", "Bash(read-only)"]. Reviewers read code and data, run verification checks, and return verdicts. They MUST NOT use Write or Edit.
Before spawning, substitute these variables in each prompt:
ANALYSIS_FILES → list of notebooks/scripts in scope (paste actual list)
SPEC_CONTEXT → relevant sections of .planning/SPEC.md (paste inline, do NOT reference file)
PLAN_TASKS → task list from .planning/PLAN.md (paste inline, verify completed)
LEARNINGS_PIPELINE → data quality chain from .planning/LEARNINGS.md (paste inline)
PLUGIN_ROOT → resolved base directory for skill paths (relative to this skill's base directory)
Reviewer prompts (read, substitute variables, send as message):
| Reviewer | Focus | Prompt Source |
|---|
| 1. Methodology | Statistical soundness, assumptions, bias | references/methodology-reviewer.md |
| 2. Reproducibility | Seeds, versions, data traceability | references/reproducibility-reviewer.md |
| 3. Code Quality | Data quality handling, bugs, efficiency | references/code-quality-reviewer.md |
4. Lead Monitoring
While reviewers work, the lead:
- Watches for completion messages from all 3 reviewers
- Does NOT review analysis directly - your job is coordination and reconciliation
- If a reviewer asks a question: Answer it, then broadcast to other reviewers if relevant
- If a reviewer is taking significantly longer than others: Message them for status
- When all 3 reviewers complete: Proceed to reconciliation
5. Reconciliation Protocol (3 Passes)
After ALL reviewers message completion, the lead performs three passes.
This flowchart IS the specification. If the prose below and this diagram disagree, the diagram wins.
3 reviewer findings sets (Methodology, Reproducibility, Code Quality)
│
▼
┌─────────────────────────────────────────┐
│ Pass 1 — DEDUPLICATE │
│ group by file:location + root cause, │
│ merge dups (keep highest confidence) │
└───────────────────┬─────────────────────┘
▼
┌─────────────────────────────────────────┐
│ Pass 2 — PRIORITIZE │
│ rank critical / important / minor │
└───────────────────┬─────────────────────┘
▼
┌─────────────────────────────────────────┐
│ Pass 3 — INTEGRATION CHECK │
│ do findings conflict / interact? │
└───────────────────┬─────────────────────┘
conflict? │
┌──── yes ───────┴────── no ──────┐
▼ ▼
┌──────────────────┐ ┌──────────────────────────┐
│ escalate to user │ │ any critical/important? │
│ with the conflict│ └────────────┬─────────────┘
└──────────────────┘ ┌── yes ──┴── no ──┐
▼ ▼
┌────────────────┐ ┌────────────────┐
│ CHANGES REQ'D →│ │ APPROVED → │
│ /ds-implement │ │ ds-verify │
│ (max 3 cycles) │ └────────────────┘
└────────────────┘
**Pass 1 — Deduplication:**
Multiple reviewers may find the same issue (e.g., missing seed found by both Reproducibility and Code Quality reviewers).
- Read all reviewer findings
- Group by file and location
- Identify duplicates:
- Same file:location
- Same root cause (even if described differently)
- Merge duplicates:
- Keep the highest confidence score
- Combine descriptions if both add value
- Attribute to both reviewers
Example:
Reproducibility found: "notebook.ipynb cell 5 - Random seed not set (Confidence: 85)"
Code Quality found: "notebook.ipynb cell 5 - Stochastic operation unseeded (Confidence: 80)"
→ Merge: "notebook.ipynb cell 5 - Random seed missing for train_test_split (Confidence: 85, found by Reproducibility + Code Quality)"
Pass 2 — Prioritization:
Not all issues are equally important. Rank by:
- Severity × Confidence:
- Critical (90-100 confidence) > Important (80-89)
- Methodology > Reproducibility > Code Quality (when confidence is equal)
- Impact on conclusions:
- Invalidates results > Affects interpretation > Inconvenient
- Correctness > Reproducibility > Style
- Fix effort:
- Quick wins (< 30 min) should be fixed now
- Large refactors (> 2 hours) should be documented as limitations
Create final prioritized list:
1. [CRITICAL] Methodology: Selection bias invalidates results (Confidence: 95)
2. [CRITICAL] Code Quality: Join explosion duplicates rows (Confidence: 90)
3. [IMPORTANT] Reproducibility: Random seed missing (Confidence: 85)
4. [IMPORTANT] Code Quality: High-null column in final data (Confidence: 80)
Pass 3 — Integration Check:
Proposed fixes may conflict with each other or create new problems.
- Read each reviewer's suggested fixes
- Check for conflicts:
- Do two fixes modify the same code?
- Does one fix introduce a problem another reviewer would flag?
- Do fixes require contradictory approaches?
- If conflicts exist:
- Design a unified fix addressing all concerns
- OR: Flag the conflict and ask reviewers for input
Example conflict:
Methodology: "Use stratified sampling to control for confounder"
Code Quality: "Simplify sampling code for readability"
→ Unified: "Use stratified sampling (methodology) with clear variable names and comments (code quality)"
If ANY pass finds conflicts → resolve before reporting final verdict.
6. Final Verdict
After reconciliation, the lead reports:
## Parallel Analysis Review: [Analysis Name]
Reviewed by: Methodology, Reproducibility, Code Quality
### Reconciliation Summary
**Issues found:** X total (Y critical, Z important)
**Duplicates merged:** N
**Conflicts resolved:** M
### Critical Issues (Must Fix)
[Deduplicated, prioritized list from Pass 1 + 2]
### Important Issues (Should Fix)
[Deduplicated, prioritized list from Pass 1 + 2]
### Verdict: APPROVED | CHANGES REQUIRED
[If APPROVED]
All 3 reviewers approved with no issues >= 80 confidence. The analysis meets research-grade standards.
[If CHANGES REQUIRED]
X critical and Y important issues must be addressed. Return to /ds-implement.
After parallel review completes:
If APPROVED: Immediately discover and load the ds-verify skill:
Read ${CLAUDE_SKILL_DIR}/../../skills/ds-verify/SKILL.md and follow its instructions.
If CHANGES REQUIRED: Return to /ds-implement to fix reported issues.
Maximum 3 review cycles. If issues persist after 3 rounds of review → implement → re-review, escalate to the user with a summary of unresolved issues. Do not loop indefinitely.
Analysis Review
Single-pass review combining methodology correctness, data quality handling, and reproducibility checks. Uses confidence-based filtering.
## The Iron Law of DS Review
You MUST only report issues with >= 80% confidence. This is not negotiable.
Before reporting ANY issue, you MUST:
- Verify it's not a false positive
- Verify it impacts results or reproducibility
- Assign a confidence score
- Only report if score >= 80
This applies even when:
- "This methodology looks suspicious"
- "I think this might introduce bias"
- "The approach seems unusual"
- "I would have done it differently"
STOP - If you catch yourself about to report a low-confidence issue, DISCARD IT. You're about to compromise the review's integrity.
## The Iron Law of Re-Review
NO "FIXED" CLAIMS WITHOUT FRESH RE-REVIEW. This is not negotiable.
When review returns CHANGES REQUIRED and the analyst applies fixes, you MUST:
- Re-run the SAME review criteria (methodology, data quality, reproducibility)
- Verify issues are actually resolved (not assumed)
- Check for new issues introduced by fixes (data changes, methodology shifts)
- Only THEN return APPROVED
"I fixed it" without re-reviewing is NOT HELPFUL — unverified fixes ship wrong results to the user.
The Audit-Fix Loop (Max 3 Iterations)
Iteration 1: Review → CHANGES REQUIRED → Fix → Re-Review
↓
Iteration 2: Re-Review → CHANGES REQUIRED → Fix → Re-Review
↓
Iteration 3: Re-Review → CHANGES REQUIRED → Fix → Re-Review
↓
Still issues? → ESCALATE to user
All clean? → APPROVED
Track iterations in .planning/REVIEW_STATE.md:
---
iteration: 1
max_iterations: 3
last_review_date: 2026-03-09
issues_found_count: 5
---
Exit criteria:
- APPROVED: Zero issues >= 80 confidence
- ESCALATE: iteration >= 3 AND issues remain
- CONTINUE: iteration < 3 AND issues remain → loop back
Before returning any verdict, check iteration count:
- READ
.planning/REVIEW_STATE.md (create if missing with iteration: 1)
- If iteration >= 3 and issues remain: ESCALATE (don't return CHANGES REQUIRED)
- If iteration < 3 and issues remain: INCREMENT iteration, return CHANGES REQUIRED
- If no issues: APPROVED
Claiming APPROVED without re-review after fixes is NOT HELPFUL — the user acts on an approval that has no verification behind it.
Re-Review Facts
- An analyst's "I fixed it" is a claim, not a verification — approving on it is rubber-stamping, and the approval carries no evidence.
- Spot-checks of fixed code miss downstream impacts; a "minor" analysis change can break conclusions several steps away. Re-review runs the full criteria, not a diff of the fix.
- A 20-minute re-review is cheap against a retracted result — skipping it to deliver faster is anti-helpful on its own terms.
Shared Enforcement
Load shared ds constraints before reviewing.
Auto-load all constraints matching applies-to: ds-review:
!uv run python3 ${CLAUDE_SKILL_DIR}/../../scripts/load-constraints.py ds-review
You MUST have these constraints loaded before proceeding. No claiming you "remember" them.
Review Focus Areas
Spec Compliance
Master Dataset Consistency
Parameter Transparency
Data Quality Handling
Independent Verification (MANDATORY)
**Do NOT trust the analyst's claims about data quality. Run these checks yourself.**
The analyst may have reported "no duplicates" without actually checking, or "handled missing values" by silently dropping rows. You MUST run independent verification.
Load shared check definitions first. Read ${CLAUDE_SKILL_DIR}/../../skills/ds-implement/references/ds-checks.md and follow its instructions.
Run checks DQ1-DQ5, M1 from the shared definitions. This ensures ds-review and ds-fix use identical checks.
Post-subagent boundary (C5): After any review Task agent returns, do NOT read project source code or data files to "double-check." Read the agent's report only. If issues found, re-dispatch.
Dispatch a Task agent to run these checks on the final analysis data:
for col in df.columns:
if df[col].nunique() <= 1:
print(f"WARNING: {col} is constant or empty ({df[col].nunique()} unique values)")
null_pct = df.isnull().mean()
high_null = null_pct[null_pct > 0.5]
if len(high_null) > 0:
print(f"WARNING: Columns >50% null still in data:\n{high_null}")
key_cols = [...]
dupes = df.duplicated(subset=key_cols, keep=False)
if dupes.sum() > 0:
print(f"WARNING: {dupes.sum()} duplicate rows on {key_cols}")
print(df[dupes].head())
print(f"Final row count: {len(df)}")
for col in df.select_dtypes(include='object').columns:
n_unique = df[col].nunique()
if n_unique > 0.9 * len(df):
print(f"WARNING: {col} has near-unique cardinality ({n_unique}/{len(df)}) — likely an ID, not a category")
if n_unique == len(df):
print(f"INFO: {col} is fully unique — confirm this is a key, not a category used in groupby")
If ANY check produces a WARNING, this is a high-confidence issue (>=80). Report it.
Methodology Appropriateness
Reproducibility
Output Quality
Confidence Scoring
Rate each potential issue from 0-100:
| Score | Meaning |
|---|
| 0 | False positive or style preference |
| 25 | Might be real, methodology is unusual but valid |
| 50 | Real issue but minor impact on conclusions |
| 75 | Verified issue, impacts result interpretation |
| 100 | Certain error that invalidates conclusions |
CRITICAL: You MUST only report issues with confidence >= 80. If you report below this threshold, you're misrepresenting your certainty.
Common DS Issues to Check
Data Leakage
- Training data contains information from future
- Test data used in feature engineering
- Target variable used directly or indirectly in features
Selection Bias
- Filtering introduced systematic bias
- Survivorship bias in longitudinal data (analyzing only surviving entities — e.g., only companies that didn't delist)
- Non-random sampling not addressed
Join Explosion
- Many-to-many joins silently multiplying rows
- Detection: compare
COUNT(*) before and after join — any increase signals duplication
SELECT 'before' AS stage, COUNT(*) FROM a
UNION ALL
SELECT 'after', COUNT(*) FROM a JOIN b ON a.key = b.key;
- Always check join key uniqueness:
SELECT key, COUNT(*) FROM b GROUP BY key HAVING COUNT(*) > 1
Incomplete Period Comparison
- Comparing a partial current period (e.g., this month so far) to a full prior period
- Metrics will always look lower for the incomplete period — normalize by days elapsed or filter to complete periods only
Denominator Shifting
- Rate or ratio denominators change between periods, making rates incomparable
- Example: "conversion rate dropped" but actually the denominator (total visitors) grew while numerator stayed flat
- Always report both numerator and denominator, not just the ratio
Average of Averages
- Averaging pre-computed group averages produces incorrect results when group sizes differ
- Must compute weighted average or aggregate from raw data
- Example: avg(store_avg_price) ≠ avg(price) across all items
Timezone Mismatches
- Different data sources using different timezones (UTC vs local vs server time)
- Symptoms: off-by-one day counts, missing hours around DST transitions, events appearing at impossible times
- Always document timezone assumptions per source and convert to a single timezone early in the pipeline
Simpson's Paradox
- Aggregate trend reverses when data is segmented by a confounding variable
- Example: treatment appears better overall but worse in every subgroup because of unequal group sizes
- When reporting aggregate results, always check if the trend holds within key segments
Statistical Errors
- Multiple testing without correction
- p-hacking or selective reporting
- Correlation interpreted as causation
- Inadequate sample size for claimed precision
Reproducibility Failures
- Random operations without seeds
- Undocumented data preprocessing
- Hard-coded paths or environment dependencies
- Missing package versions
Required Output Structure
## Analysis Review: [Analysis Name]
Reviewing: [files/notebooks being reviewed]
### Critical Issues (Confidence >= 90)
#### [Issue Title] (Confidence: XX)
**Location:** `file/path.py:line` or `notebook.ipynb cell N`
**Problem:** Clear description of the issue
**Impact:** How this affects results/conclusions
**Requirement:** [REQ-ID from SPEC.md if this issue relates to a specific requirement, or "general"]
**Fix:**
```python
# Specific fix
Important Issues (Confidence 80-89)
[Same format as Critical Issues]
Data Quality Checklist
| Check | Status | Notes |
|---|
| Missing values | PASS/FAIL | [details] |
| Duplicates | PASS/FAIL | [details] |
| Outliers | PASS/FAIL | [details] |
| Type correctness | PASS/FAIL | [details] |
Methodology Checklist
| Check | Status | Notes |
|---|
| Appropriate for data | PASS/FAIL | [details] |
| Assumptions checked | PASS/FAIL | [details] |
| Sample size adequate | PASS/FAIL | [details] |
Reproducibility Checklist
| Check | Status | Notes |
|---|
| Seeds set | PASS/FAIL | [details] |
| Versions documented | PASS/FAIL | [details] |
| Data versioned | PASS/FAIL | [details] |
Summary
Verdict: APPROVED | CHANGES REQUIRED
[If APPROVED]
The analysis meets quality standards. No methodology issues with confidence >= 80 detected.
[If CHANGES REQUIRED]
X critical issues and Y important issues must be addressed before proceeding.
## Agent Invocation
Spawn a Task agent to review the analysis:
Task(subagent_type="general-purpose",
allowed_tools=["Read", "Glob", "Grep", "Bash(read-only)"],
prompt="""
Review analysis against .planning/SPEC.md.
Execute TWO-PASS review:
PASS 1 - Independent Data Quality Verification (RUN CODE):
- Load the final analysis data
- Check for empty/constant columns (nunique <= 1)
- Check for high-null columns (>50% null)
- Check for duplicate rows on key columns
- Verify row count matches LEARNINGS.md chain
- Check cardinality of categorical columns
Report any WARNING as confidence >= 80.
PASS 2 - Methodology and Compliance Review (READ CODE):
- Spec compliance - verify all SPEC.md objectives addressed
- Master dataset consistency - confirm every exhibit derives from a declared master (none re-pulls raw sources); exhibits sharing a master tie out (matching N/counts); the dataset-construction mermaid diagram matches the code that ran. Mismatched counts across exhibits that should share a sample = confidence >= 80.
- Parameter transparency - grep for inline numeric literals in filter/clip/winsorize/date-comparison code; confirm all come from the named config location and match the Filters & Parameters table; confirm principled(✓) params cite a source/validation and every convenience(⚠) param has an EXECUTED disposition (robustness panel / verified-redundant / display-only). Stray magic numbers, or ⚠ params with unexecuted dispositions, = confidence >= 80.
- Data quality handling - confirm issues from PLAN.md were resolved
- Methodology - verify appropriate methods, assumptions checked
- Reproducibility - confirm seeds, versions, documentation
Confidence score each issue (0-100).
Report only issues with >= 80 confidence.
Return structured output per /ds-review format.
""")
### Review Facts
- Reading the analyst's notebook or LEARNINGS.md and thinking "looks reasonable" is not verification — only executed code that independently checks the results is. An approval without it is a rubber stamp the user mistakes for review.
- Output-first verification during implementation catches per-step issues, not cumulative ones — the final state must be checked independently, or the review verifies nothing the implementer didn't already.
- The verification checks take seconds; a skipped 30-second check has let a join explosion through that took 3 days to debug. Reading instead of running is anti-efficient.
- Issues reported below 80 confidence are noise that drowns the real findings — flagging them to appear thorough makes the review less useful, not more.
- "Minor data issues won't affect conclusions" is a magnitude claim made without measuring the magnitude — quantify the impact, then decide.
## Delete & Restart: Fundamental Methodology Failures
<EXTREMELY-IMPORTANT>
**If methodology is fundamentally flawed, DELETE the implementation and return to ds-plan. No patching.**
A methodology is fundamentally flawed when:
- **Wrong statistical approach** (e.g., linear regression on non-linear data, parametric test on non-normal data without justification)
- **Wrong data source** (e.g., using quarterly data when daily is required, wrong table entirely)
- **Missing critical variable** (e.g., no control for a known confounder, omitted variable bias)
- **Wrong unit of analysis** (e.g., analyzing at firm-level when the question is about transactions)
**Methodology-flaw facts:** tweaking parameters around a wrong statistical method is p-hacking, not fixing; a known confounder omitted from the main analysis invalidates it — adding it as a "robustness check" leaves the headline result wrong; and patching a flawed foundation produces a patched flawed analysis. All three cases mean DELETE and replan, not iterate.
**When you identify a fundamental flaw:**
1. Document the flaw in LEARNINGS.md (what's wrong and why it can't be patched)
2. Report to user: "Methodology is fundamentally flawed: [specific reason]. Returning to ds-plan."
3. Return to ds-plan (not ds-implement) — the plan itself needs rethinking
**Patching a broken methodology to avoid rework is NOT HELPFUL — the user deserves correct analysis, not fast wrong analysis.**
</EXTREMELY-IMPORTANT>
## Visual Diagnostics for Review Decision Points
When presenting review findings to the user (especially at CHANGES REQUIRED verdicts), generate diagnostic plots to support the decision:
| Review Finding | Diagnostic to Generate |
|---------------|----------------------|
| Join explosion detected | Row count waterfall (before/after each join) |
| Selection bias suspected | Distribution comparison (included vs excluded populations) |
| Missing value impact | Missingness heatmap (columns x time periods) |
| Outlier influence | Coefficient sensitivity plot (with/without outliers) |
| Reproducibility failure | Side-by-side run comparison (key metrics from Run 1 vs Run 2) |
**Format:** Inline plots in notebooks, or saved to `scratch/diagnostics/` for script-based workflows. Present alongside the review verdict.
**When to generate:** Only at `decision` checkpoints where the user must choose between accepting or fixing. Do not generate plots for clean review passes (no decision needed).
**Observe → record → offer (learn-by-doing):** After each review decision checkpoint, append one line to `.planning/LEARNINGS.md` recording **which diagnostic the user actually looked at** to make the call (e.g. `review-view: row-count waterfall (join explosion)` or `review-view: read verdict summary only — no plot`). Do NOT build visualizations speculatively. After the **same** view is requested 3+ times across reviews, offer to bundle a script in `skills/ds-review/scripts/` that generates it automatically. Until then, the table above is an offer menu, not a mandate.
## Quality Standards
- **You must NOT report methodology preferences not backed by statistical principles.** Your opinion about how code should be written is not a review issue.
- **You must treat alternative valid approaches as non-issues (confidence = 0).** If the approach works correctly, don't report it.
- Ensure each reported issue is immediately actionable
- **If you're unsure, rate it below 80 confidence.** Uncertainty is not a reason to report—it's a reason to investigate more.
- Focus on what affects conclusions, not style. **STOP if you catch yourself criticizing coding style—that's not your role here.**
## Gate: Exit Review Loop
**Checkpoint type:** human-verify (review scores are machine-verifiable)
Before claiming review is complete (APPROVED or ESCALATE):
-
IDENTIFY → What proves the review verdict is valid?
- APPROVED: Zero issues >= 80 confidence
- ESCALATE: iteration >= 3 AND issues remain
-
RUN → Check .planning/REVIEW_STATE.md for iteration count
Read review output for issue count
-
READ → Examine both:
- Review output (issues list)
- REVIEW_STATE.md (iteration number)
-
VERIFY → Verdict matches state:
- APPROVED only if 0 issues
- ESCALATE only if iteration >= 3
- CHANGES REQUIRED only if iteration < 3
-
CLAIM → Only after steps 1-4 pass, return verdict
**If iteration >= 3 and you're returning CHANGES REQUIRED instead of ESCALATE, you're being anti-helpful — the user needs to know when the loop isn't converging.**
## Phase Complete
After review completes, handle verdict-specific transitions:
### If APPROVED (no issues >= 80 confidence)
Mark review complete in `.planning/REVIEW_STATE.md`:
```yaml
---
status: APPROVED
iteration: [N]
max_iterations: 3
last_review_date: [date]
issues_found_count: 0
verdict: APPROVED
---
status: APPROVED is the structural gate field — ds-verify declares a PreToolUse phase-gate-guard.py hook that blocks Agent dispatch until .planning/REVIEW_STATE.md shows status: APPROVED. While the review loop is unresolved (CHANGES_REQUIRED / ESCALATE), status is NOT APPROVED, so verification is structurally blocked.
Immediately discover and load ds-verify:
Read ${CLAUDE_SKILL_DIR}/../../skills/ds-verify/SKILL.md and follow its instructions.
If CHANGES REQUIRED (issues >= 80 confidence found, iteration < 3)
Update .planning/REVIEW_STATE.md:
---
status: CHANGES_REQUIRED
iteration: [N+1]
max_iterations: 3
last_review_date: [date]
issues_found_count: [count]
verdict: CHANGES_REQUIRED
---
Return to /ds-implement with specific issues. Analyst MUST re-invoke /ds-review after fixes.
Critical: When analyst returns claiming "fixed", you MUST re-run the FULL review. No shortcuts.
If ESCALATE (iteration >= 3, issues remain)
Update .planning/REVIEW_STATE.md:
---
status: ESCALATE
iteration: 3
max_iterations: 3
last_review_date: [date]
issues_found_count: [count]
verdict: ESCALATE
---
Report to user:
Review Loop Escalation (3 iterations completed)
After 3 fix-review cycles, [N] issues remain:
[List issues]
Options:
1. Accept current state and document limitations
2. Extend review (manual approval for iteration 4+)
3. Rethink methodology (return to /ds-plan)
Which option do you prefer?
Workflow Continuity After Review
| Verdict | Next Action | Iteration Counter |
|---|
| APPROVED | Invoke /ds-verify immediately | Reset to 1 for next analysis |
| CHANGES REQUIRED | Return to /ds-implement, analyst fixes then re-invokes /ds-review | Increment |
| ESCALATE | Ask user for direction | Keep at max |
Do NOT pause between review completion and next action. The workflow is sequential.