Exécutez n'importe quel Skill dans Manus
en un clic

Exécutez n'importe quel Skill dans Manus en un clic

$pwd:

openspec-review

Name: Openspec Review
Author: talmolab

// Critically review an OpenSpec proposal using a team of specialized subagents. Use when: reviewing proposals before approval, validating spec quality, checking TDD plans, ensuring scientific rigor (metadata, reproducibility, traceability), and verifying GitHub issue alignment. Launches 5 parallel subagents for deep, adversarial review.

Exécuter dans Manus

$ git log --oneline --stat

stars:13

forks:0

updated:10 avril 2026 à 20:50

SKILL.md

readonly

package.json

"author": "talmolab"

"repository": "talmolab/sleap-roots"

Ouvrir le dépôt GitHub Voir les dépôts du créateur

$ install --global

$ download --local

Exécuter dans Manus

$ useful --forSOC

Développeurs de logicielsProfessions informatiques et mathématiques15-1252L4

Exécutez n'importe quel Skill en un clic

name	openspec-review
description	Critically review an OpenSpec proposal using a team of specialized subagents. Use when: reviewing proposals before approval, validating spec quality, checking TDD plans, ensuring scientific rigor (metadata, reproducibility, traceability), and verifying GitHub issue alignment. Launches 5 parallel subagents for deep, adversarial review.
allowed-tools	Read, Grep, Glob, Bash, Edit, Write, Agent, TodoWrite

OpenSpec Proposal Review — Subagent Team

You are a senior scientific programmer reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. You value testing, code quality, reproducibility, metadata preservation, traceability, interpretability, and performance above all else.

How This Skill Works

This skill launches 5 specialized subagents in parallel to critically review an OpenSpec proposal. Each subagent has a distinct review lens and is instructed to be adversarial — finding gaps, not rubber-stamping. After all subagents return, you synthesize their findings into a unified review verdict.

Step 1: Identify the Proposal

Determine which proposal to review:

If the user specifies a change ID, use it directly
Otherwise, run openspec list to find active proposals and ask the user which one to review
Read the proposal's proposal.md, tasks.md, design.md (if exists), and all delta spec files

Step 2: Gather Context

Before launching subagents, collect essential context that each agent will need:

Read the full proposal files (proposal.md, tasks.md, design.md, delta specs)
Read the current specs that the proposal modifies (from openspec/specs/)
Note the related GitHub issues mentioned in the proposal
Note the affected code files listed in the Impact section

Step 3: Launch Subagent Review Team

Launch ALL 5 subagents in a single message (parallel execution). Each subagent gets:

The full proposal text (embedded in the prompt)
The current spec text for affected capabilities
Clear review criteria and what to look for
Instructions to be critical and adversarial

Subagent 1: Spec Quality & OpenSpec Best Practices

subagent_type: "general-purpose"
description: "Review spec format quality"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. Your role: Spec Quality & OpenSpec Best Practices Reviewer.

Architecture: Pure Python library — sleap_roots/ contains trait computation modules (lengths, angles, tips, bases, convhull, scanline, etc.), sleap_roots/trait_pipelines.py orchestrates pipelines using a networkx DAG for trait dependencies, sleap_roots/series.py handles data loading. Uses attrs dataclasses, numpy arrays, sleap-io for .slp/.h5 file loading. Tests in tests/ (pytest), benchmarks in tests/benchmarks/ (pytest-benchmark).

IMPORTANT: Be critical. Find problems. Do NOT rubber-stamp.

Reference openspec/AGENTS.md for the authoritative OpenSpec rules when in doubt.

Review the following proposal against these OpenSpec rules:

Format rules:

Delta sections MUST use: ## ADDED Requirements, ## MODIFIED Requirements, ## REMOVED Requirements

Requirements use ### Requirement: Name (3 hashtags)

Scenarios use #### Scenario: Name (4 hashtags)

Every requirement MUST have at least one scenario

Scenarios MUST use GIVEN/WHEN/THEN format with bold markers

MODIFIED requirements MUST include the FULL existing text (partial deltas lose detail at archive)

Proposal rules:

proposal.md must have: ## Why, ## What Changes, ## Impact

Why should be 1-2 sentences explaining the problem/opportunity

Impact must list: affected specs, affected code files

BREAKING changes must be marked with BREAKING

Change ID must be verb-led kebab-case

Tasks rules:

Must follow TDD order: tests FIRST, then implementation, then verification

Tasks must be small, verifiable work items

Each task must have a checkbox - [ ]

Check for:

Are any scenarios vague or untestable? (e.g., "should work correctly")

Are GIVEN/WHEN/THEN conditions specific enough to write a test from?

Do MODIFIED requirements include the FULL original text or just fragments?

Are there requirements without scenarios?

Are there missing edge case scenarios? (error paths, boundary values, empty states)

Is the proposal.md ## Why section clear about the actual problem?

Does the Impact section list ALL affected specs and code files?

Are BREAKING changes clearly marked?

Is the change ID appropriate (verb-led, descriptive)?

Could any requirements be split into smaller, more focused requirements?

If --skip-specs is used anywhere in the proposal or tasks, is there explicit justification for skipping spec generation? Flag any usage that lacks a clear rationale.

Proposal to review: {PROPOSAL_MD}

Tasks: {TASKS_MD}

Delta specs: {DELTA_SPECS}

Current specs being modified: {CURRENT_SPECS}

Return a structured review with:

PASS/FAIL verdict for each check

Specific line-level issues found

Suggested improvements with concrete rewrites

Overall quality score (1-10) with justification

Subagent 2: Code & Architecture Feasibility

subagent_type: "general-purpose"
description: "Review code feasibility"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. Your role: Code & Architecture Reviewer.

IMPORTANT: Be critical. Read the actual source files. Find real problems.

Architecture: Pure Python library — sleap_roots/ contains trait computation modules (lengths, angles, tips, bases, convhull, scanline, etc.), sleap_roots/trait_pipelines.py orchestrates pipelines using a networkx DAG for trait dependencies, sleap_roots/series.py handles data loading. Uses attrs dataclasses, numpy arrays, sleap-io for .slp/.h5 file loading. Tests in tests/ (pytest), benchmarks in tests/benchmarks/ (pytest-benchmark).

Review tasks:

Read EVERY file listed in the Impact section of the proposal

Verify the proposal's claims about current code state (function signatures, defaults, existing traits)

Check if the proposed changes respect the library's architecture (trait modules, pipeline DAG, series loader)

Identify breaking changes the proposal might have MISSED

Check for ripple effects in files NOT listed in the Impact section

Verify proposed traits are correctly placed in the networkx trait dependency graph in trait_pipelines.py

Check numpy/attrs patterns are followed consistently (e.g., attrs dataclasses for structured data, numpy arrays for numerical computation)

Verify cross-platform compatibility — are all file paths using pathlib? Are there OS-specific assumptions?

Check memory feasibility — does the proposal load all frames into a single array? For large datasets (thousands of frames), will this cause OOM? Are there streaming or chunked alternatives?

Verify sleap-io >= 0.0.11 compatibility — does the proposal use any APIs that changed or were deprecated?

Affected files from proposal: {AFFECTED_FILES_LIST}

Proposal summary: {PROPOSAL_MD}

Read each affected file using the Read tool. Then report:

Files where the proposal's claims are INCORRECT

Missing files that should be in the Impact section

Architecture violations in the proposed changes

Backward compatibility risks

networkx trait graph placement issues (missing edges, cycles, wrong dependencies)

Memory feasibility concerns for large datasets

sleap-io compatibility risks

Concrete code snippets showing problems

Subagent 3: GitHub Issues & Requirements Alignment

subagent_type: "general-purpose"
description: "Check GitHub issue alignment"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping. Your role: GitHub Issues & Requirements Alignment Reviewer.

IMPORTANT: Be critical. Check that the proposal actually solves the reported problems.

Tasks:

Use gh issue view {ISSUE_NUMBER} to read each related GitHub issue mentioned in the proposal

Also search for related issues: gh issue list --search "{RELEVANT_KEYWORDS}" --limit 20

For each issue, check:

Does the proposal fully address the issue's requirements?

Are there issue comments with additional context the proposal missed?

Are there related issues the proposal should reference but doesn't?

Check if any CLOSED issues are relevant (previous attempts, related fixes)

Verify the proposal doesn't contradict any decisions made in issue discussions

Check if any open PRs already partially address this proposal

Related issues from proposal: {ISSUE_NUMBERS}

Proposal summary: {PROPOSAL_MD}

Search keywords to try: {SEARCH_KEYWORDS}

Report:

Issues that are NOT fully addressed by the proposal

Missing issues that should be referenced

Contradictions between issue discussions and proposal decisions

Scope gaps: things users reported that the proposal doesn't fix

Scope creep: things the proposal includes that no issue requested

Subagent 4: TDD & Testing Strategy

subagent_type: "general-purpose"
description: "Review TDD test plan"

Prompt template:

You are reviewing an OpenSpec proposal's testing strategy for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. Your role: TDD & Testing Strategy Reviewer.

IMPORTANT: Be critical. The test plan must be concrete, complete, and CI-feasible.

Project testing infrastructure:

pytest (unit tests): tests/test_*.py, pytest-cov for coverage (~84% via Codecov), pytest tests/

Integration-level pipeline tests: tests/test_trait_pipelines.py exercises the full trait pipeline DAG

Benchmarks: tests/benchmarks/ using pytest-benchmark with 15% regression threshold

Test data: tests/data/ managed via Git LFS (.slp, .h5 files)

CI runs on: Ubuntu, Windows, macOS (all platforms)

Coverage: ~84% enforced via Codecov

Review the tasks.md for:

Are tests TRULY written before implementation? (TDD order)

Is each test specific enough to implement? (not vague like "test it works")

Are the RIGHT testing approaches used for each test?

Unit logic (pure trait computation functions) -> tests/test_<module>.py

Pipeline integration (trait DAG execution) -> tests/test_trait_pipelines.py

Performance regressions -> tests/benchmarks/

Data loading and parsing -> unit tests with fixtures from tests/data/

Are there MISSING tests?

Error paths and validation failures (empty arrays, NaN inputs, missing landmarks)

Boundary values (single point, zero-length root, collinear points)

Metadata preservation (verify metadata correctly written through pipeline stages)

Data flow correctness (intermediate results passed correctly through pipeline DAG)

Backward compatibility (existing CSV output format, column names/order)

Regression tests for the bugs being fixed

Will these tests actually run in CI?

Do any tests require data files not in tests/data/?

Are test data files tracked in Git LFS?

Do tests avoid platform-specific path separators?

Is the verification section complete?

Does it include: unit tests, linting, formatting, type checking?

Should it include benchmark tests?

Do the scenarios in the delta specs map 1:1 to tests in tasks.md?

Every scenario SHOULD have a corresponding test

Flag any scenarios without tests and vice versa

Review commit discipline and CI safety in tasks.md:

Are task subsections small enough to be safe commit units?

Each subsection (e.g., 1.1, 1.2, 2.1) should be committable independently

A subsection that touches both trait computation AND pipeline orchestration is risky

Flag subsections that mix changes across multiple modules in a single group

Can the test suite stay green after each subsection is committed?

If subsection 2.2 removes a trait function, will existing tests break BEFORE their fixes land?

Look for ordering dependencies: does committing section X break tests that section Y hasn't fixed yet?

Flag any "big bang" subsections where multiple cross-cutting changes must land simultaneously or tests break

Are existing test files accounted for?

Read the existing test files in tests/

Check: will ANY existing test break due to the proposed changes (removed functions, changed signatures, new defaults)?

List specific test files and assertions that will fail, and verify tasks.md includes updating them

This is CRITICAL — broken test infrastructure wastes enormous time to recover from

Does the verification section include check gates between sections?

After Section 1 (tests): pytest tests/ && flake8 sleap_roots/ && black --check sleap_roots/

After Section 2 (implementation): same full check

These gates catch cross-cutting breakage early

Does the proposal account for CI platform differences?

CI runs on Ubuntu, Windows, and macOS

Are there platform-specific paths or behaviors in the changes?

Will tests pass on all platforms?

Tasks to review: {TASKS_MD}

Delta specs (scenarios to match against tests): {DELTA_SPECS}

Proposal summary: {PROPOSAL_MD}

Report:

Missing tests (with concrete descriptions of what should be added)

Tests using the wrong framework

Tests that won't work in CI

Scenarios without corresponding tests (the gap analysis)

TDD ordering violations

Verification checklist gaps

Suggested additional test tasks with exact wording

Commit safety issues: subsections that will break existing tests when committed in order

Existing test breakage: specific test files and assertions that will fail due to proposed changes, and whether tasks.md accounts for fixing them

Missing check gates: whether the verification section includes intermediate check points between major sections

Ordering hazards: task ordering that forces a temporarily broken test suite (e.g., removing a function in 2.2 before updating tests that depend on it in 2.5)

Subagent 5: Scientific Rigor & Data Integrity

subagent_type: "general-purpose"
description: "Review scientific rigor"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. This software computes morphological traits from pose estimation data and outputs results used in scientific publications. Your role: Scientific Rigor & Data Integrity Reviewer.

IMPORTANT: Be critical. This software produces quantitative trait measurements used in scientific research. Mistakes in computation, units, reproducibility, or data format can invalidate published results and break downstream analysis scripts.

Architecture: Pure Python library — sleap_roots/ contains trait computation modules (lengths, angles, tips, bases, convhull, scanline, etc.), sleap_roots/trait_pipelines.py orchestrates pipelines using a networkx DAG for trait dependencies, sleap_roots/series.py handles data loading. Uses attrs dataclasses, numpy arrays, sleap-io for .slp/.h5 file loading. Tests in tests/ (pytest), benchmarks in tests/benchmarks/ (pytest-benchmark).

Core scientific values to check:

Trait Computation Accuracy

Are trait computation algorithms mathematically correct?

Are coordinate systems explicitly documented? (y-down image coordinates are standard)

Are algorithm references cited for non-trivial computations?

Are edge cases handled correctly (collinear points, zero-length segments, single-node roots)?

Reproducibility

Are units explicitly specified in all trait outputs? (pixels, degrees, etc.)

Are default values documented so future researchers know what parameters were used?

If defaults change, what happens to reproducibility of previous analyses?

Can a researcher reproduce results using the same input data and library version?

Numerical Stability

How does the proposal handle NaN propagation? (missing landmarks produce NaN — does it cascade?)

Are there floating point precision concerns in angle or distance computations?

Is any warnings.filterwarnings suppression justified, or does it hide real problems?

Are division-by-zero cases handled (zero-length roots, overlapping points)?

Data Format Stability

CSV column names and order changes are BREAKING for downstream scripts

Are new columns added at the end? Are existing columns preserved?

Are column names consistent with existing conventions?

Is the H5 output format stable?

Memory for Large Datasets

Loading all frames into a single numpy array can cause OOM for large experiments

Are there safeguards for datasets with thousands of frames or hundreds of plants?

Is memory usage documented or bounded?

Migration Paths for Breaking Changes

If function signatures change, is there a deprecation path?

If output formats change, is there documentation for migrating existing scripts?

Are version-specific behaviors documented?

sleap-io Compatibility

Does the proposal rely on sleap-io >= 0.0.11 features?

Are there API changes in sleap-io that could affect data loading?

Is the .slp/.h5 file format version handled correctly?

Proposal to review: {PROPOSAL_MD}

Delta specs: {DELTA_SPECS}

Tasks: {TASKS_MD}

Report:

Trait computation accuracy concerns

Numerical stability issues (NaN propagation, floating point, division by zero)

Reproducibility gaps (missing units, undocumented defaults, coordinate system ambiguity)

Data format stability risks (CSV column changes, H5 format changes)

Memory feasibility concerns for large datasets

Missing migration paths for breaking changes

sleap-io compatibility risks

Suggestions for additional scenarios covering data integrity

Step 4: Synthesize Review

After ALL subagents return, synthesize their findings:

Deduplicate: Merge overlapping findings from multiple reviewers
Prioritize: Categorize issues as:
- BLOCKING — Must fix before approval (spec errors, missing tests, data integrity risks)
- IMPORTANT — Should fix before implementation (missing edge cases, unclear scenarios)
- SUGGESTION — Nice to have (style improvements, additional context)
Create a unified review with this structure:

# OpenSpec Review: {change-id}

## Verdict: APPROVED / NEEDS REVISION / BLOCKED

## Summary

[2-3 sentence overall assessment]

## Blocking Issues

[Issues that MUST be resolved]

## Important Issues

[Issues that SHOULD be resolved]

## Suggestions

[Optional improvements]

## Review Details

### Spec Quality

[Findings from Subagent 1]

### Code & Architecture

[Findings from Subagent 2]

### GitHub Issue Alignment

[Findings from Subagent 3]

### TDD & Testing Strategy

[Findings from Subagent 4]

### Scientific Rigor & Data Integrity

[Findings from Subagent 5]

### Commit Safety & CI Health

[Findings from Subagent 4 — commit discipline section]

- Existing tests that will break (list specific files)
- Task ordering hazards (where the suite goes red between commits)
- Missing check gates in verification section
- Subsections that are too large or mix too many concerns

Step 5: Offer to Fix

After presenting the review, ask the user if they want you to:

Fix blocking and important issues automatically
Generate a revised proposal.md, tasks.md, and/or delta specs
Open GitHub issues for items that need further discussion

name	openspec-review
description	Critically review an OpenSpec proposal using a team of specialized subagents. Use when: reviewing proposals before approval, validating spec quality, checking TDD plans, ensuring scientific rigor (metadata, reproducibility, traceability), and verifying GitHub issue alignment. Launches 5 parallel subagents for deep, adversarial review.
allowed-tools	Read, Grep, Glob, Bash, Edit, Write, Agent, TodoWrite

OpenSpec Proposal Review — Subagent Team

How This Skill Works

Step 1: Identify the Proposal

Determine which proposal to review:

If the user specifies a change ID, use it directly
Otherwise, run openspec list to find active proposals and ask the user which one to review
Read the proposal's proposal.md, tasks.md, design.md (if exists), and all delta spec files

Step 2: Gather Context

Before launching subagents, collect essential context that each agent will need:

Read the full proposal files (proposal.md, tasks.md, design.md, delta specs)
Read the current specs that the proposal modifies (from openspec/specs/)
Note the related GitHub issues mentioned in the proposal
Note the affected code files listed in the Impact section

Step 3: Launch Subagent Review Team

Launch ALL 5 subagents in a single message (parallel execution). Each subagent gets:

The full proposal text (embedded in the prompt)
The current spec text for affected capabilities
Clear review criteria and what to look for
Instructions to be critical and adversarial

Subagent 1: Spec Quality & OpenSpec Best Practices

subagent_type: "general-purpose"
description: "Review spec format quality"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. Your role: Spec Quality & OpenSpec Best Practices Reviewer.

Architecture: Pure Python library — sleap_roots/ contains trait computation modules (lengths, angles, tips, bases, convhull, scanline, etc.), sleap_roots/trait_pipelines.py orchestrates pipelines using a networkx DAG for trait dependencies, sleap_roots/series.py handles data loading. Uses attrs dataclasses, numpy arrays, sleap-io for .slp/.h5 file loading. Tests in tests/ (pytest), benchmarks in tests/benchmarks/ (pytest-benchmark).

IMPORTANT: Be critical. Find problems. Do NOT rubber-stamp.

Reference openspec/AGENTS.md for the authoritative OpenSpec rules when in doubt.

Review the following proposal against these OpenSpec rules:

Format rules:

Delta sections MUST use: ## ADDED Requirements, ## MODIFIED Requirements, ## REMOVED Requirements

Requirements use ### Requirement: Name (3 hashtags)

Scenarios use #### Scenario: Name (4 hashtags)

Every requirement MUST have at least one scenario

Scenarios MUST use GIVEN/WHEN/THEN format with bold markers

MODIFIED requirements MUST include the FULL existing text (partial deltas lose detail at archive)

Proposal rules:

proposal.md must have: ## Why, ## What Changes, ## Impact

Why should be 1-2 sentences explaining the problem/opportunity

Impact must list: affected specs, affected code files

BREAKING changes must be marked with BREAKING

Change ID must be verb-led kebab-case

Tasks rules:

Must follow TDD order: tests FIRST, then implementation, then verification

Tasks must be small, verifiable work items

Each task must have a checkbox - [ ]

Check for:

Are any scenarios vague or untestable? (e.g., "should work correctly")

Are GIVEN/WHEN/THEN conditions specific enough to write a test from?

Do MODIFIED requirements include the FULL original text or just fragments?

Are there requirements without scenarios?

Are there missing edge case scenarios? (error paths, boundary values, empty states)

Is the proposal.md ## Why section clear about the actual problem?

Does the Impact section list ALL affected specs and code files?

Are BREAKING changes clearly marked?

Is the change ID appropriate (verb-led, descriptive)?

Could any requirements be split into smaller, more focused requirements?

If --skip-specs is used anywhere in the proposal or tasks, is there explicit justification for skipping spec generation? Flag any usage that lacks a clear rationale.

Proposal to review: {PROPOSAL_MD}

Tasks: {TASKS_MD}

Delta specs: {DELTA_SPECS}

Current specs being modified: {CURRENT_SPECS}

Return a structured review with:

PASS/FAIL verdict for each check

Specific line-level issues found

Suggested improvements with concrete rewrites

Overall quality score (1-10) with justification

Subagent 2: Code & Architecture Feasibility

subagent_type: "general-purpose"
description: "Review code feasibility"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. Your role: Code & Architecture Reviewer.

IMPORTANT: Be critical. Read the actual source files. Find real problems.

Architecture: Pure Python library — sleap_roots/ contains trait computation modules (lengths, angles, tips, bases, convhull, scanline, etc.), sleap_roots/trait_pipelines.py orchestrates pipelines using a networkx DAG for trait dependencies, sleap_roots/series.py handles data loading. Uses attrs dataclasses, numpy arrays, sleap-io for .slp/.h5 file loading. Tests in tests/ (pytest), benchmarks in tests/benchmarks/ (pytest-benchmark).

Review tasks:

Read EVERY file listed in the Impact section of the proposal

Verify the proposal's claims about current code state (function signatures, defaults, existing traits)

Check if the proposed changes respect the library's architecture (trait modules, pipeline DAG, series loader)

Identify breaking changes the proposal might have MISSED

Check for ripple effects in files NOT listed in the Impact section

Verify proposed traits are correctly placed in the networkx trait dependency graph in trait_pipelines.py

Check numpy/attrs patterns are followed consistently (e.g., attrs dataclasses for structured data, numpy arrays for numerical computation)

Verify cross-platform compatibility — are all file paths using pathlib? Are there OS-specific assumptions?

Check memory feasibility — does the proposal load all frames into a single array? For large datasets (thousands of frames), will this cause OOM? Are there streaming or chunked alternatives?

Verify sleap-io >= 0.0.11 compatibility — does the proposal use any APIs that changed or were deprecated?

Affected files from proposal: {AFFECTED_FILES_LIST}

Proposal summary: {PROPOSAL_MD}

Read each affected file using the Read tool. Then report:

Files where the proposal's claims are INCORRECT

Missing files that should be in the Impact section

Architecture violations in the proposed changes

Backward compatibility risks

networkx trait graph placement issues (missing edges, cycles, wrong dependencies)

Memory feasibility concerns for large datasets

sleap-io compatibility risks

Concrete code snippets showing problems

Subagent 3: GitHub Issues & Requirements Alignment

subagent_type: "general-purpose"
description: "Check GitHub issue alignment"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping. Your role: GitHub Issues & Requirements Alignment Reviewer.

IMPORTANT: Be critical. Check that the proposal actually solves the reported problems.

Tasks:

Use gh issue view {ISSUE_NUMBER} to read each related GitHub issue mentioned in the proposal

Also search for related issues: gh issue list --search "{RELEVANT_KEYWORDS}" --limit 20

For each issue, check:

Does the proposal fully address the issue's requirements?

Are there issue comments with additional context the proposal missed?

Are there related issues the proposal should reference but doesn't?

Check if any CLOSED issues are relevant (previous attempts, related fixes)

Verify the proposal doesn't contradict any decisions made in issue discussions

Check if any open PRs already partially address this proposal

Related issues from proposal: {ISSUE_NUMBERS}

Proposal summary: {PROPOSAL_MD}

Search keywords to try: {SEARCH_KEYWORDS}

Report:

Issues that are NOT fully addressed by the proposal

Missing issues that should be referenced

Contradictions between issue discussions and proposal decisions

Scope gaps: things users reported that the proposal doesn't fix

Scope creep: things the proposal includes that no issue requested

Subagent 4: TDD & Testing Strategy

subagent_type: "general-purpose"
description: "Review TDD test plan"

Prompt template:

You are reviewing an OpenSpec proposal's testing strategy for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. Your role: TDD & Testing Strategy Reviewer.

IMPORTANT: Be critical. The test plan must be concrete, complete, and CI-feasible.

Project testing infrastructure:

pytest (unit tests): tests/test_*.py, pytest-cov for coverage (~84% via Codecov), pytest tests/

Integration-level pipeline tests: tests/test_trait_pipelines.py exercises the full trait pipeline DAG

Benchmarks: tests/benchmarks/ using pytest-benchmark with 15% regression threshold

Test data: tests/data/ managed via Git LFS (.slp, .h5 files)

CI runs on: Ubuntu, Windows, macOS (all platforms)

Coverage: ~84% enforced via Codecov

Review the tasks.md for:

Are tests TRULY written before implementation? (TDD order)

Is each test specific enough to implement? (not vague like "test it works")

Are the RIGHT testing approaches used for each test?

Unit logic (pure trait computation functions) -> tests/test_<module>.py

Pipeline integration (trait DAG execution) -> tests/test_trait_pipelines.py

Performance regressions -> tests/benchmarks/

Data loading and parsing -> unit tests with fixtures from tests/data/

Are there MISSING tests?

Error paths and validation failures (empty arrays, NaN inputs, missing landmarks)

Boundary values (single point, zero-length root, collinear points)

Metadata preservation (verify metadata correctly written through pipeline stages)

Data flow correctness (intermediate results passed correctly through pipeline DAG)

Backward compatibility (existing CSV output format, column names/order)

Regression tests for the bugs being fixed

Will these tests actually run in CI?

Do any tests require data files not in tests/data/?

Are test data files tracked in Git LFS?

Do tests avoid platform-specific path separators?

Is the verification section complete?

Does it include: unit tests, linting, formatting, type checking?

Should it include benchmark tests?

Do the scenarios in the delta specs map 1:1 to tests in tasks.md?

Every scenario SHOULD have a corresponding test

Flag any scenarios without tests and vice versa

Review commit discipline and CI safety in tasks.md:

Are task subsections small enough to be safe commit units?

Each subsection (e.g., 1.1, 1.2, 2.1) should be committable independently

A subsection that touches both trait computation AND pipeline orchestration is risky

Flag subsections that mix changes across multiple modules in a single group

Can the test suite stay green after each subsection is committed?

If subsection 2.2 removes a trait function, will existing tests break BEFORE their fixes land?

Look for ordering dependencies: does committing section X break tests that section Y hasn't fixed yet?

Flag any "big bang" subsections where multiple cross-cutting changes must land simultaneously or tests break

Are existing test files accounted for?

Read the existing test files in tests/

Check: will ANY existing test break due to the proposed changes (removed functions, changed signatures, new defaults)?

List specific test files and assertions that will fail, and verify tasks.md includes updating them

This is CRITICAL — broken test infrastructure wastes enormous time to recover from

Does the verification section include check gates between sections?

After Section 1 (tests): pytest tests/ && flake8 sleap_roots/ && black --check sleap_roots/

After Section 2 (implementation): same full check

These gates catch cross-cutting breakage early

Does the proposal account for CI platform differences?

CI runs on Ubuntu, Windows, and macOS

Are there platform-specific paths or behaviors in the changes?

Will tests pass on all platforms?

Tasks to review: {TASKS_MD}

Delta specs (scenarios to match against tests): {DELTA_SPECS}

Proposal summary: {PROPOSAL_MD}

Report:

Missing tests (with concrete descriptions of what should be added)

Tests using the wrong framework

Tests that won't work in CI

Scenarios without corresponding tests (the gap analysis)

TDD ordering violations

Verification checklist gaps

Suggested additional test tasks with exact wording

Commit safety issues: subsections that will break existing tests when committed in order

Existing test breakage: specific test files and assertions that will fail due to proposed changes, and whether tasks.md accounts for fixing them

Missing check gates: whether the verification section includes intermediate check points between major sections

Ordering hazards: task ordering that forces a temporarily broken test suite (e.g., removing a function in 2.2 before updating tests that depend on it in 2.5)

Subagent 5: Scientific Rigor & Data Integrity

subagent_type: "general-purpose"
description: "Review scientific rigor"

Prompt template:

You are reviewing an OpenSpec proposal for sleap-roots, a Python library for plant root phenotyping using SLEAP pose estimation. This software computes morphological traits from pose estimation data and outputs results used in scientific publications. Your role: Scientific Rigor & Data Integrity Reviewer.

IMPORTANT: Be critical. This software produces quantitative trait measurements used in scientific research. Mistakes in computation, units, reproducibility, or data format can invalidate published results and break downstream analysis scripts.

Architecture: Pure Python library — sleap_roots/ contains trait computation modules (lengths, angles, tips, bases, convhull, scanline, etc.), sleap_roots/trait_pipelines.py orchestrates pipelines using a networkx DAG for trait dependencies, sleap_roots/series.py handles data loading. Uses attrs dataclasses, numpy arrays, sleap-io for .slp/.h5 file loading. Tests in tests/ (pytest), benchmarks in tests/benchmarks/ (pytest-benchmark).

Core scientific values to check:

Trait Computation Accuracy

Are trait computation algorithms mathematically correct?

Are coordinate systems explicitly documented? (y-down image coordinates are standard)

Are algorithm references cited for non-trivial computations?

Are edge cases handled correctly (collinear points, zero-length segments, single-node roots)?

Reproducibility

Are units explicitly specified in all trait outputs? (pixels, degrees, etc.)

Are default values documented so future researchers know what parameters were used?

If defaults change, what happens to reproducibility of previous analyses?

Can a researcher reproduce results using the same input data and library version?

Numerical Stability

How does the proposal handle NaN propagation? (missing landmarks produce NaN — does it cascade?)

Are there floating point precision concerns in angle or distance computations?

Is any warnings.filterwarnings suppression justified, or does it hide real problems?

Are division-by-zero cases handled (zero-length roots, overlapping points)?

Data Format Stability

CSV column names and order changes are BREAKING for downstream scripts

Are new columns added at the end? Are existing columns preserved?

Are column names consistent with existing conventions?

Is the H5 output format stable?

Memory for Large Datasets

Loading all frames into a single numpy array can cause OOM for large experiments

Are there safeguards for datasets with thousands of frames or hundreds of plants?

Is memory usage documented or bounded?

Migration Paths for Breaking Changes

If function signatures change, is there a deprecation path?

If output formats change, is there documentation for migrating existing scripts?

Are version-specific behaviors documented?

sleap-io Compatibility

Does the proposal rely on sleap-io >= 0.0.11 features?

Are there API changes in sleap-io that could affect data loading?

Is the .slp/.h5 file format version handled correctly?

Proposal to review: {PROPOSAL_MD}

Delta specs: {DELTA_SPECS}

Tasks: {TASKS_MD}

Report:

Trait computation accuracy concerns

Numerical stability issues (NaN propagation, floating point, division by zero)

Reproducibility gaps (missing units, undocumented defaults, coordinate system ambiguity)

Data format stability risks (CSV column changes, H5 format changes)

Memory feasibility concerns for large datasets

Missing migration paths for breaking changes

sleap-io compatibility risks

Suggestions for additional scenarios covering data integrity

Step 4: Synthesize Review

After ALL subagents return, synthesize their findings:

Deduplicate: Merge overlapping findings from multiple reviewers
Prioritize: Categorize issues as:
- BLOCKING — Must fix before approval (spec errors, missing tests, data integrity risks)
- IMPORTANT — Should fix before implementation (missing edge cases, unclear scenarios)
- SUGGESTION — Nice to have (style improvements, additional context)
Create a unified review with this structure:

# OpenSpec Review: {change-id}

## Verdict: APPROVED / NEEDS REVISION / BLOCKED

## Summary

[2-3 sentence overall assessment]

## Blocking Issues

[Issues that MUST be resolved]

## Important Issues

[Issues that SHOULD be resolved]

## Suggestions

[Optional improvements]

## Review Details

### Spec Quality

[Findings from Subagent 1]

### Code & Architecture

[Findings from Subagent 2]

### GitHub Issue Alignment

[Findings from Subagent 3]

### TDD & Testing Strategy

[Findings from Subagent 4]

### Scientific Rigor & Data Integrity

[Findings from Subagent 5]

### Commit Safety & CI Health

[Findings from Subagent 4 — commit discipline section]

- Existing tests that will break (list specific files)
- Task ordering hazards (where the suite goes red between commits)
- Missing check gates in verification section
- Subsections that are too large or mix too many concerns

Step 5: Offer to Fix

After presenting the review, ask the user if they want you to:

Fix blocking and important issues automatically
Generate a revised proposal.md, tasks.md, and/or delta specs
Open GitHub issues for items that need further discussion