testing-quality

Name: Testing Quality
Author: joshsymonds

Use when bugs keep slipping through despite high test coverage, when suspecting tests are giving false confidence, before a major refactor that will depend on the existing test suite, or when coverage metrics don't match incident rates. User phrases like "do these tests actually catch bugs?", "is this suite any good?", "why didn't the tests catch this?".

stars: 0

forks: 0

updated: May 31, 2026 at 01:02

Same repository

brainstorming.md

from "joshsymonds/gambit"

Use when user has a new feature idea, rough concept, or unexplored approach. Include when planning before code, breaking a design into tasks, creating an implementation plan, laying out tasks and dependencies, exploring architectural options, or requirements are vague. User phrases like "I want to build X", "should we do this", "let's think through Y", "explore approaches", "break this into tasks", "make an implementation plan". Do NOT use for executing existing plans, fixing bugs, refactoring, or when requirements and an epic already exist.

2026-05-31

debugging.md

from "joshsymonds/gambit"

Use when a test is failing, when a bug is reported, when behavior is unexpected or intermittent, when a build or integration step fails, or when a flaky test keeps resurfacing. Especially when "the fix seems obvious", when multiple previous fixes haven't stuck, or when under time pressure to ship.

2026-05-31

executing-plans.md

from "joshsymonds/gambit"

Use when an epic Task exists and subtasks are ready to implement, when resuming work after a previous checkpoint, when iteratively building a feature, or when implementation has revealed unexpected work that needs a new task. User phrases like "continue the plan", "next task", "resume where we left off", "pick up the epic".

2026-05-31

review.md

from "joshsymonds/gambit"

Use after all tasks in an epic complete, after refactoring verifies, or before merging to main. Triggers when independent validation is needed that code meets requirements, has no security gaps, passes quality standards, and has no performance regressions. User phrases like "review this", "is this ready to merge", "validate the implementation".

2026-05-31

task-refinement.md

from "joshsymonds/gambit"

Use when a task plan has just been created and needs review before execution, when brainstorming just handed off, when unsure whether a junior could execute without questions, or when you see placeholder text, vague success criteria, or missing edge cases. User phrases like "review these tasks", "are these ready?", "before we start", "catch any gaps". Do NOT use when implementation is already in progress or for creating plans from scratch.

2026-05-31

using-gambit.md

from "joshsymonds/gambit"

Use at the start of every session before any response or action. Also invoke whenever uncertain which gambit skill applies, when about to implement / debug / refactor / test / plan / brainstorm, or when a user request could match any gambit skill even at 1% probability.

2026-05-31

package.json

"author": "joshsymonds"

"repository": "joshsymonds/gambit"


name	testing-quality
description	Use when bugs keep slipping through despite high test coverage, when suspecting tests are giving false confidence, before a major refactor that will depend on the existing test suite, or when coverage metrics don't match incident rates. User phrases like "do these tests actually catch bugs?", "is this suite any good?", "why didn't the tests catch this?".
user_invokable	true

Testing Quality Analysis

Overview

Audit test suites for real effectiveness, not vanity metrics. Identify tests that provide false confidence, find missing corner cases, and create Tasks for the improvements.

Core principle: Tests must catch bugs, not inflate coverage metrics. Coverage measures execution, not assertion quality.

Announce at start: "I'm using gambit:testing-quality to audit these tests with SRE-level scrutiny."

Rigidity Level

MEDIUM FREEDOM — Follow analysis phases exactly. RED/YELLOW/GREEN criteria are rigid. Corner case discovery adapts to the codebase.

Quick Reference

Phase	Action	Output
1	Inventory all test files	Test catalog
2	Read production code	Context for analysis
3	Categorize (skeptical default)	RED/YELLOW/GREEN per test
4	Self-review all classifications	Validated categories
5	Justify every RED/YELLOW line by line	Written justifications
6	Discover missing corner cases	Gap analysis
7	Prioritize by business impact	Priority matrix
8	Create Tasks for improvements	Tracked improvement plan

Iron Law: Read production code BEFORE categorizing ANY test.

CRITICAL MINDSET: Assume tests were written by junior engineers optimizing for coverage metrics. A test is RED or YELLOW until proven GREEN.

When to Use

Production bugs appear despite high test coverage
Suspecting coverage gaming or tautological tests
Before major refactoring (ensure tests catch regressions)
Onboarding to unfamiliar codebase (assess test quality)
Planning test improvement initiatives

Don't use when:

Writing new tests → use gambit:test-driven-development
Just need to run tests → use test-runner agent

The Process

Phase 1: Test Inventory

Create a complete catalog of the tests to analyze. Use Glob and Grep to find all test files and count tests per module. Adapt file patterns to the language.

Phase 2: Read Production Code

MANDATORY before categorizing ANY test.

For each test file:

Read the production code the test claims to exercise
Understand what the production code actually does
Trace the test's call path to verify it reaches production code

Why: Without reading production code, you WILL miscategorize tests as GREEN when they're YELLOW or RED. Junior engineers commonly create test utilities and test THOSE instead of production code, or set up mocks that determine test outcomes.

Phase 3: Categorize Each Test (Skeptical Default)

Assume every test is RED or YELLOW until you have concrete evidence it's GREEN.

For EACH test, answer these four questions:

What bug would this catch? (Can't name one → RED)
Does it exercise PRODUCTION code or a mock/test utility? (Mock determines outcome → RED)
Could production break while test passes? (Yes → YELLOW or RED)
Is there a meaningful assertion on PRODUCTION output? (Only != nil checks, or assertions on test fixtures → weak)

RED — Must Remove or Replace

Tests that pass by definition or test mocks instead of production code:

Tautological: Asserts something guaranteed by the type system or compiler
Mock-testing: Mock determines the test outcome — test verifies what the mock returns, not what production does
Line hitters: Execute code without meaningful assertions (just "no crash")
Evergreen/Liar: Always pass regardless of production behavior (swallowed exceptions, bypassed logic)

See REFERENCE.md for detailed code examples of each RED pattern.
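For orientation, here is one compact illustration of each RED pattern in Python. These are not the examples from REFERENCE.md; `apply_discount` is a hypothetical production function defined inline so the sketch is self-contained.

```python
from unittest.mock import Mock


# Hypothetical production code, defined inline for the sketch.
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_tautological():
    # RED: the result is always a float; no arithmetic bug can fail this.
    assert isinstance(apply_discount(100.0, 10), float)


def test_mock_testing():
    # RED: the mock decides the outcome; apply_discount is never called.
    pricing = Mock()
    pricing.apply_discount.return_value = 90.0
    assert pricing.apply_discount(100.0, 10) == 90.0


def test_line_hitter():
    # RED: executes production code but asserts nothing beyond "no crash".
    apply_discount(100.0, 10)


def test_evergreen():
    # RED: the broad except swallows every failure, including the assertion.
    try:
        assert apply_discount(100.0, 10) == 12345.0
    except Exception:
        pass
```

All four pass today, and all four would keep passing if `apply_discount` returned the wrong amount, which is the property that makes them RED.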

YELLOW — Must Strengthen

Tests with real value but significant gaps:

Happy path only: Tests valid input, misses edge cases
Weak assertions: != nil or > 0 when exact values are available
Partial coverage: Tests success but not failure paths

See REFERENCE.md for detailed code examples of each YELLOW pattern.
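The difference between a YELLOW test and its strengthened form can be sketched as follows. `parse_port` is a hypothetical production function, and the `> 0` check plays the role that `!= nil` plays in the list above.

```python
# Hypothetical production code, defined inline for the sketch.
def parse_port(value: str) -> int:
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def test_parse_port_yellow():
    # YELLOW: happy path only, weak assertion. An implementation that
    # returned 1 for every input would still pass.
    assert parse_port("8080") > 0


def test_parse_port_strengthened():
    # Exact value on the success path.
    assert parse_port("8080") == 8080
    # Boundary values.
    assert parse_port("1") == 1
    assert parse_port("65535") == 65535
    # Failure paths: out of range, negative, non-numeric, empty.
    for bad in ("0", "65536", "-1", "http", ""):
        try:
            parse_port(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")
```

The YELLOW test is worth keeping as a starting point because it does reach production code; the fix is to tighten the assertion and add the paths it skips.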

GREEN — Exceptional Quality Required

GREEN is the EXCEPTION, not the rule. A test is GREEN only if ALL four conditions are true:

Exercises actual PRODUCTION code (not mocks, not test utilities)
Has precise assertions (exact values, not != nil)
Would fail if production breaks (name the specific bug)
Tests behavior, not implementation (survives valid refactoring)

Before marking ANY test GREEN, you MUST state:

"This test exercises [specific production code path]"
"It would catch [specific bug] because [reason]"
"The assertion verifies [exact production behavior], not a test fixture"

If you cannot fill in those blanks, the test is YELLOW at best.
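A sketch of a test that could plausibly earn GREEN, with the three required statements written as comments. `split_bill` is a hypothetical production function; whether a real test qualifies still depends on reading the real production code.

```python
# Hypothetical production code, defined inline for the sketch.
def split_bill(total_cents: int, people: int) -> list[int]:
    """Split a bill so the shares sum exactly to the total."""
    if people < 1:
        raise ValueError("people must be at least 1")
    base, remainder = divmod(total_cents, people)
    # The first `remainder` people each pay one extra cent.
    return [base + 1 if i < remainder else base for i in range(people)]


def test_split_bill_distributes_remainder():
    # "This test exercises the remainder branch of split_bill."
    # "It would catch a lost-cent bug because 1000 is not divisible by 3."
    # "The assertion verifies the exact shares production returns,
    #  not a test fixture."
    assert split_bill(1000, 3) == [334, 333, 333]
```

An implementation that dropped the remainder would return `[333, 333, 333]` and fail the assertion, so the named bug is one this test demonstrably catches.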

Phase 4: Self-Review

Before finalizing ANY categorization, verify:

For each GREEN test:

Did I read the PRODUCTION code this test exercises?
Does the test call PRODUCTION code or a test utility/mock?
Can I name the SPECIFIC BUG this test would catch?
If production broke, would this test DEFINITELY fail?
Am I being too generous because the test "looks reasonable"?

For each YELLOW test:

Should this actually be RED? Is there ANY bug-catching value?
Is the weakness fundamental (tests a mock) or fixable (weak assertion)?

If you have ANY doubt about a GREEN, downgrade to YELLOW.

Phase 5: Line-by-Line Justification

MANDATORY for every RED or YELLOW classification.

This forces verification that your classification is correct by explaining exactly WHY the test is problematic.

Required format:

### [Test Name] - RED/YELLOW

**Test code (file:lines):**
- Line X: `code` - [what this line does]
- Line Y: `assertion` - [what this asserts]

**Production code it claims to test (file:lines):**
- [Brief description of what production code does]

**Why RED/YELLOW:**
- [Specific reason with line references]
- [What bug could slip through despite this test passing]

If you cannot write this justification, you haven't done the analysis properly.

Phase 6: Corner Case Discovery

For each module, identify missing corner case tests across these categories:

Input validation: Empty values, boundary values, unicode, injection, malformed data
State: Uninitialized, already closed, concurrent access, re-entrant calls
Integration: Timeouts, partial responses, rate limiting, service errors

See REFERENCE.md for the complete corner case tables with specific examples and recommended test names.
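For the input validation category, a table-driven test is one way to turn the list above into concrete cases. The sketch assumes a hypothetical `normalize_username` function and a made-up 32-character limit; the state and integration categories need different techniques (fakes, timeouts, concurrency harnesses) and are not covered here.

```python
import unicodedata

MAX_LEN = 32  # assumed limit for the sketch


# Hypothetical production code, defined inline for the sketch.
def normalize_username(raw: str) -> str:
    name = unicodedata.normalize("NFKC", raw).strip().lower()
    if not name:
        raise ValueError("empty username")
    if len(name) > MAX_LEN:
        raise ValueError("username too long")
    if not all(ch.isalnum() or ch in "_-" for ch in name):
        raise ValueError("invalid character")
    return name


VALID_CASES = [
    ("Alice", "alice"),                             # baseline
    ("  bob  ", "bob"),                             # surrounding whitespace
    ("a" * MAX_LEN, "a" * MAX_LEN),                 # boundary: exactly at the limit
    ("\uff2a\uff4f\uff53\u00e9", "jos\u00e9"),      # unicode: full-width letters fold under NFKC
]

INVALID_CASES = [
    "",                       # empty
    "   ",                    # whitespace only
    "a" * (MAX_LEN + 1),      # boundary: one past the limit
    "robert'); DROP TABLE",   # injection-shaped input
    "name\x00",               # malformed: embedded NUL
]


def test_valid_usernames():
    for raw, expected in VALID_CASES:
        assert normalize_username(raw) == expected, raw


def test_invalid_usernames():
    for raw in INVALID_CASES:
        try:
            normalize_username(raw)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {raw!r}")
```

Each row names the corner case it covers, which makes a missing category visible at a glance during review.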

Phase 7: Prioritize by Business Impact

Priority	Criteria	Action Timeline
P0 - Critical	Auth, payments, data integrity	This sprint
P1 - High	Core business logic, user-facing	Next sprint
P2 - Medium	Internal tools, admin features	Backlog
P3 - Low	Utilities, non-critical paths	As time permits

Phase 8: Create Tasks for Improvements

Create an epic Task for test quality improvement, then subtasks for each action group (remove RED tests, strengthen YELLOW tests, add missing corner cases).

Each subtask must be:

Scoped: one focused sitting (~15-45 min)
Explicit: File paths and line numbers specified
Testable: At least 3 success criteria

Set dependencies so removal happens before additions.

See REFERENCE.md for epic and subtask templates.
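The subtask rules and the dependency ordering can be checked mechanically. The sketch below is a plain data model, not the API of any Task tool; `Subtask`, `problems`, and `execution_order` are hypothetical names, and the ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Subtask:
    name: str
    files: list[str]                  # explicit paths, ideally with line ranges
    success_criteria: list[str]
    depends_on: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the rule violations for this subtask (empty if none)."""
        issues = []
        if not self.files:
            issues.append("no file paths specified")
        if len(self.success_criteria) < 3:
            issues.append("fewer than 3 success criteria")
        return issues


def execution_order(tasks: list[Subtask]) -> list[str]:
    """Order subtasks so that every dependency comes before its dependents."""
    graph = {task.name: set(task.depends_on) for task in tasks}
    return list(TopologicalSorter(graph).static_order())
```

With `strengthen-yellow` depending on `remove-red` and `add-corner-cases` depending on `strengthen-yellow`, the order comes out as removal first and additions last. `TopologicalSorter` raises `CycleError` if the dependencies form a loop.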

Output Format

Present results as a structured report. See REFERENCE.md for the complete output template.

Executive summary table:

Metric	Count	%
Total tests analyzed	N	100%
RED (remove/replace)	N	X%
YELLOW (strengthen)	N	X%
GREEN (keep)	N	X%
Missing corner cases	N	-

Overall Assessment: CRITICAL / NEEDS WORK / ACCEPTABLE / GOOD
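The counts and percentages in the summary table follow directly from the per-test classifications. A minimal sketch, assuming the classifications are held in a dict from test name to category; `summary_table` is a hypothetical helper, and it does not compute the overall assessment, since the thresholds for that are not defined here.

```python
from collections import Counter

CATEGORIES = ("RED", "YELLOW", "GREEN")
LABELS = {
    "RED": "RED (remove/replace)",
    "YELLOW": "YELLOW (strengthen)",
    "GREEN": "GREEN (keep)",
}


def summary_table(classifications: dict[str, str], missing_corner_cases: int) -> str:
    """Render the executive summary as tab-separated rows."""
    unknown = set(classifications.values()) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(classifications.values())
    total = len(classifications)
    lines = ["Metric\tCount\t%", f"Total tests analyzed\t{total}\t100%"]
    for category in CATEGORIES:
        share = round(100 * counts[category] / total) if total else 0
        lines.append(f"{LABELS[category]}\t{counts[category]}\t{share}%")
    lines.append(f"Missing corner cases\t{missing_corner_cases}\t-")
    return "\n".join(lines)
```

Percentages are rounded independently, so they may not sum to exactly 100%.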

Critical Rules

Read production code FIRST — before categorizing ANY test
Skeptical default — RED/YELLOW until proven GREEN
Justify every GREEN — name the production path, the bug, and the assertion
Justify every RED/YELLOW — line-by-line with file references
Self-review before finalizing — challenge every GREEN classification
Create actionable Tasks — don't just report, create tracked improvement plan

Common Rationalizations

Excuse	Reality
"Test looks reasonable"	Looking reasonable ≠ catching bugs. Read production code.
"High coverage = good tests"	Coverage measures execution, not assertion quality
"Mock is necessary here"	Mock is fine, but assert on production behavior, not mock returns
"Test exercises the function"	Calling a function without meaningful assertions is a line hitter
"It would catch obvious bugs"	Name the specific bug. If you can't, it's YELLOW at best.
"Too many tests to justify each"	Unjustified classifications are wrong classifications

Anti-patterns

Don't:

Mark tests GREEN because they "look reasonable" (verify call paths)
Trust test names and comments (code doesn't lie, comments do)
Give benefit of the doubt (skeptical default, always)
Rush categorization (read production code FIRST)
Mark YELLOW when it's actually RED (mock determines outcome → RED)
Skip corner case analysis ("existing tests are enough")

Do:

Read production code before categorizing ANY test
Trace call paths to verify production code is exercised
Apply skeptical default (RED/YELLOW until proven GREEN)
Complete self-review checklist for all GREEN classifications
Create actionable Tasks for improvements

Verification Checklist

Analysis Quality (MANDATORY):

Read production code for EVERY test before categorizing
Traced call paths to verify tests exercise production, not mocks/utilities
Applied skeptical default (assumed RED/YELLOW, required proof for GREEN)
Completed self-review checklist for ALL GREEN tests
Each GREEN test has explicit justification (production path + bug + assertion)
Each RED/YELLOW has line-by-line justification

Per module:

All tests categorized (RED/YELLOW/GREEN)
RED tests have specific removal/replacement actions
YELLOW tests have specific strengthening actions
Corner cases identified (input, state, integration)
Priority assigned (P0/P1/P2/P3)

Task Integration:

Created epic for test quality improvement
Created subtasks for each category (remove, strengthen, add)
Set task dependencies

Integration

Called by:

User via /gambit:testing-quality
Before major refactoring efforts
When coverage is high but bugs slip through

Creates:

Tasks for removing RED tests
Tasks for strengthening YELLOW tests
Tasks for adding missing corner cases

Workflow:

gambit:testing-quality → Analyze → Create improvement Tasks
gambit:executing-plans → Implement improvements with TDD
gambit:verification → Verify improvements complete
