deflake
// Finds flaky tests on the main branch by analyzing GitHub Actions failures, ranks them by frequency, and enters parallel plan mode to design deflake strategies. Use when you want to find and fix the flakiest tests.
| name | deflake |
| description | Finds flaky tests on the main branch by analyzing GitHub Actions failures, ranks them by frequency, and enters parallel plan mode to design deflake strategies. Use when you want to find and fix the flakiest tests. |
Discovers, ranks, and plans fixes for flaky tests by analyzing GitHub Actions failures on main.
/deflake # Full analysis: discover, rank, and plan fixes
/deflake --report # Report only: show flake rankings without planning fixes
/deflake --top N # Analyze and plan fixes for the top N flakes (default: 3)
Run the collection script. It handles all deterministic data collection and aggregation. If CI log formats change over time, update the script directly.
python3 .claude/skills/deflake/collect-flakes.py
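The aggregation step might look roughly like the sketch below. This is not the contents of `collect-flakes.py`. The `rank_flakes` function and the `failed_tests` field are hypothetical names chosen for illustration, and the real script's data model may differ. The sketch counts each test at most once per run and sorts by failure count, which is one way to arrive at a "failures / total runs" ranking.

```python
from collections import Counter


def rank_flakes(runs):
    """Rank failing tests by the number of runs in which they failed.

    `runs` is a hypothetical list of dicts, one per CI run, each with a
    "failed_tests" list of test names (an assumption for this sketch).
    Returns a list of (test_name, failures, total_runs) tuples.
    """
    total = len(runs)
    counts = Counter()
    for run in runs:
        # Count a test at most once per run, even if it is listed repeatedly.
        counts.update(set(run["failed_tests"]))
    # Most frequent first; ties are broken by name so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, failures, total) for name, failures in ranked]


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input and output.
    sample_runs = [
        {"failed_tests": ["test_a", "test_b"]},
        {"failed_tests": ["test_a"]},
        {"failed_tests": []},
    ]
    for rank, (name, failures, total) in enumerate(rank_flakes(sample_runs), start=1):
        print(f"{rank}. {name}: {failures}/{total}")
```

Counting per run rather than per job keeps a test that fails in several jobs of the same run from being over-weighted. Whether that is the right choice depends on how the CI matrix is set up.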
The script outputs three sections:
Read the script output and use it directly for the report. The LLM's only job in this phase is to categorize each entry as a flake, real bug, or infra issue:
- … `git log` for related fixes
- … `[INFRA]` by the script, or failures with mode connection refused / infra

Present the script output as a formatted report. Add a categorization (flake / real bug / infra) to each entry. Example format:
## Flake Report — main branch
**Period**: 2026-04-01 to 2026-04-10
**Runs analyzed**: 23 total, 8 failed (35% failure rate)
### Top Flaky Tests
| Rank | Test | Job | Failures | Failure Mode |
|------|------|-----|----------|--------------|
| 1 | Workload lifecycle ... [It] should track ... | E2E (api-workloads) | 5/23 | timeout (120s) |
| 2 | ... | ... | ... | ... |
### Real Bugs (not flakes)
- [Test name] — Introduced by [commit], fixed by [commit/PR]
### Infra Failures
- [N] runs failed due to [description]
If the user passed --report, stop here. Otherwise continue to Phase 3.
For the top N flakes (default 3), launch parallel agents to investigate each one simultaneously.
For each flake, spawn an Agent (subagent_type: general-purpose) that:
- … `_test.go` files alongside prod code, in `e2e/`, in `acceptance_test` files, or elsewhere. For each test found, document what it covers, what level it operates at (unit/integration/E2E), and whether it's stable or also flaky.

IMPORTANT: Launch all agents in a single message so they run in parallel.
Wait for all agents to complete, then consolidate findings.
For each flake, present a high-level plan with alternatives considered:
### Flake #N: [Test Name]
**Root cause**: [one-sentence explanation]
**Failure logs**: [links to 2-3 example runs]
**Options considered**:
1. [Option A] — [why it was rejected or chosen]
2. [Option B] — [why it was rejected or chosen]
3. [Option C] — [why it was rejected or chosen]
**Recommended approach**: [which option and why it's the best fit]
- [High-level description of the changes]
**Confidence**: High / Medium / Low
**Risk**: [What could go wrong with this approach]
Present all plans and wait for user feedback. The user may choose a different option, combine approaches, or ask for more investigation. Do NOT enter plan mode or start implementing until the user approves the approach for each flake.
Once the user approves approaches, enter plan mode to design the detailed implementation. The plan should:
These principles guide all fix proposals. Prefer simplifying code and tests over adding complexity.
- … `Eventually` with 1s poll interval)
- … `Eventually`/`Consistently` matchers, not arbitrary `time.Sleep`
- … `time.Sleep()` to "fix" timing issues
- … `[Flaky]` or `Skip` without fixing them
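The preference for polling matchers over fixed sleeps can be illustrated with a small language-neutral sketch. The Python helper below is a hypothetical analog, not Gomega's API. It assumes a condition that can be re-evaluated cheaply, and it polls that condition at a fixed interval until a deadline, instead of sleeping for one arbitrary duration and hoping the system has settled.

```python
import time


def eventually(condition, timeout=10.0, interval=1.0):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration only. Returns True as soon as the
    condition holds, or False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep one interval, but never past the deadline.
        remaining = deadline - time.monotonic()
        time.sleep(min(interval, max(0.0, remaining)))


if __name__ == "__main__":
    start = time.monotonic()

    def ready():
        # Stand-in for "the resource reached the expected state".
        return time.monotonic() - start > 0.2

    # Returns shortly after the condition becomes true, not after a fixed wait.
    assert eventually(ready, timeout=2.0, interval=0.05)
```

A fixed sleep is either too short on a slow CI runner, which makes the test flaky, or too long on a fast one, which makes it slow. Polling with a generous timeout avoids both, at the cost of re-evaluating the condition several times.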