| name | test |
| description | Prove it works. Multi-dimensional quality validation across functional, non-functional, security, DevOps, DX, and observability. Run after /build. |
| argument-hint | component-or-failure-description |
| allowed-tools | Task, Skill, Read, Glob, Grep, Bash(*) |
| user-invocable | true |
@CLAUDE.md
Test: $ARGUMENTS
If $ARGUMENTS is empty, test the current branch diff against the base branch.
Step 0: Classify PR Type
Detect the base branch from gh pr view --json baseRefName or fall back to main. Run git diff origin/<base-branch> --name-only and classify changed files:
| Type | Patterns | Gates to Run |
|---|
| CODE | *.py, *.ps1, *.ts, *.js, *.cs | All 6 gates |
| WORKFLOW | *.yml in .github/workflows/ | Gates 1, 3, 4 |
| CONFIG | *.json, *.yaml (non-workflow) | Gates 3, 4 |
| DOCS | *.md, *.txt, *.rst | Gate 5 only |
| MIXED | Combination | Apply per-file rules |
Print: PR TYPE: [type]. Running gates: [list].
Skip non-applicable gates. Do not waste agent invocations on irrelevant dimensions.
Gate 1: Functional Testing
Invoke Skill(skill="code-qualities-assessment") for quality baseline.
Task(subagent_type="qa"): You are a senior QA engineer. Your job is to catch issues that will cause production incidents. Be skeptical. Cite specific file:line evidence for every finding. Evaluate:
- Unit coverage - Each method in isolation, dependencies injected. Every new function has at least 1 test.
- Integration coverage - Contracts between components verified. Cross-module boundaries exercised.
- Acceptance coverage - Each requirement has a passing test. Map to acceptance criteria from /spec output.
- Edge cases - Null/empty/boundary values, invalid types, concurrent access where applicable.
- Error paths - Every catch/error branch tested. No silent swallowing. Resources cleaned up on failure.
- Regression risk - High-risk areas (auth, data persistence, payments) require full coverage regardless of change size.
Output: VERDICT: PASS|WARN|CRITICAL_FAIL with findings array.
Gate 2: Non-Functional Testing
Task(subagent_type="analyst"): You are a performance and reliability engineer. Focus on failure modes, not the happy path. Use measurable criteria, not subjective judgments. Evaluate:
- Performance - No N+1 queries, no O(n*m) in hot paths, no blocking calls in async context.
- Scalability - Will this bottleneck under load? Connection pooling, caching strategy, pagination.
- Reliability - Retry logic, circuit breakers, graceful degradation. Failure modes tested.
- Complexity - Cyclomatic complexity <=10. Methods <=60 lines. No deep nesting.
- Maintainability - Readability, naming clarity, consistency with existing patterns.
Output: VERDICT: PASS|WARN|CRITICAL_FAIL with findings array.
Gate 3: Security Testing
Invoke Skill(skill="security-scan") for CWE pattern detection.
Task(subagent_type="security"): You are a security auditor performing OWASP Top 10 review. Assume every input is malicious. Reference CWE numbers for every finding. Evaluate:
- Injection - Shell (CWE-78), XSS (CWE-79), SQL (CWE-89). No string interpolation in queries.
- Authentication - Session handling, credential storage, token validation.
- Secrets - No hardcoded API keys, passwords, tokens in diff. Secrets via environment only.
- Input validation - All user-facing inputs validated. LLM output treated as untrusted.
- Dependencies - New packages reviewed for known vulnerabilities. Versions pinned.
Output: VERDICT: PASS|WARN|CRITICAL_FAIL with findings array including CWE references.
Gate 4: DevOps Testing
Task(subagent_type="devops"): You are a build and release engineer. Focus on pipeline safety, reproducibility, and supply chain security. Evaluate:
- Pipeline impact - Do changes affect CI/CD? Are workflow files valid YAML?
- Actions security - Pinned to SHA? Permissions scoped minimally? No secrets in logs?
- Shell quality - Input sanitization, exit code handling, error propagation.
- Build reproducibility - Deterministic builds, locked dependencies, no floating versions.
- Artifact integrity - Correct upload/download, retention policy, no sensitive data in artifacts.
Output: VERDICT: PASS|WARN|CRITICAL_FAIL with findings array.
Gate 5: Developer Experience (DX)
Invoke Skill(skill="orphan-ref-validator"). Reject the gate on VERDICT: CRITICAL_FAIL or VERDICT: ERROR; VERDICT: WARN is non-blocking and surfaces in the test summary. This mirrors /build Mandatory Exit Gate 4 (per .claude/commands/build.md:56) so a reference to a deleted skill or a missing script path is caught at /test as well as at /build. To diagnose a failure, re-run the skill with --output human; each finding shows path:line plus a one-line recommendation. Manifest count drift is owned by the canonical build/scripts/validate_marketplace_counts.py (which the skill's COUNT_CLAIM_RE mirrors but does not duplicate emission); pass --enforce-counts only when you want single-plugin count_claim emission directly from the skill. The skill invocation is platform-agnostic; each platform mirror runs its own copy of scan.py. If pre-existing drift outside the PR's scope blocks the gate, fix it in the same PR (the directives at <!-- orphan-ref-ignore --> and <!-- orphan-ref-ignore-file --> are documented in the skill's SKILL.md).
Task(subagent_type="critic"): You are a developer advocate reviewing from the consumer perspective. Would a new contributor understand this code? Would the API frustrate or delight? Evaluate:
- API ergonomics - Consumer perspective. Are signatures intuitive? Error messages helpful?
- Documentation - Is changed behavior documented? Are code comments accurate (not stale)?
- Debuggability - Can a developer diagnose failures from logs alone? Stack traces preserved?
- Onboarding - Would a new contributor understand this code? Are conventions followed?
- Tooling - Does this work with existing linters, formatters, IDE support?
Output: VERDICT: PASS|WARN|CRITICAL_FAIL with findings array.
Gate 6: Observability and Monitoring
Task(subagent_type="architect"): You are an SRE reviewing production readiness. If this code fails at 3am, can oncall diagnose it without reading the source? Evaluate:
- Logging - Are meaningful events logged? Structured logging with correlation IDs?
- Metrics - Are SLIs defined for new features? Latency, error rate, throughput tracked?
- Alerting - Would failures trigger alerts? Are thresholds appropriate?
- Tracing - Are distributed traces propagated? Span context preserved across boundaries?
- Health checks - New services have liveness/readiness probes? Degradation detectable?
Output: VERDICT: PASS|WARN|CRITICAL_FAIL with findings array.
Principles
- Testability is design feedback: Hard to test means poor encapsulation, tight coupling, Law of Demeter violation, weak cohesion, or procedural code.
- Tests are proof: A passing test is evidence. A missing test is a gap in knowledge.
- Hypothesis-driven debugging: When a test fails, form a hypothesis before changing code. Verify the hypothesis. Then fix.
- Defense in depth: Assume the happy path works. Focus on failure modes.
Process
- Identify what changed (git diff against base branch)
- Classify PR type (Step 0). Skip non-applicable gates.
- Run applicable gates sequentially. Each gate dispatches its own agent.
- If any gate produces CRITICAL_FAIL: continue remaining gates (findings are additive). Mark overall verdict as CRITICAL_FAIL immediately.
- For test failures: hypothesis, verify, fix (never change code without understanding why)
- Invoke Skill(skill="quality-grades") to synthesize gate verdicts into overall quality score.
Output
Each gate MUST produce a verdict line and findings array:
GATE: [name]
VERDICT: PASS|WARN|CRITICAL_FAIL
FINDINGS:
- [SEVERITY] (file:line) description: recommendation
Synthesize into overall report:
| Gate | Verdict | Findings | Evidence |
|---|
| Functional | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| Non-Functional | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| Security | PASS/WARN/CRITICAL_FAIL | Count | CWE references |
| DevOps | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| DX | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
| Observability | PASS/WARN/CRITICAL_FAIL | Count | file:line citations |
Overall verdict: CRITICAL_FAIL if any gate fails. WARN if any gate warns. PASS if all gates pass.