flaky-test-catcher
// Detects broken and flaky acceptance tests from recent CI failures on main and opens structured GitHub issues for automated remediation. Follow this skill strictly when analyzing CI failures.
| name | flaky-test-catcher |
| description | Detects broken and flaky acceptance tests from recent CI failures on main and opens structured GitHub issues for automated remediation. Follow this skill strictly when analyzing CI failures. |
This skill defines the end-to-end protocol for detecting broken and flaky acceptance tests by analyzing recent CI failures on main, then opening structured GitHub issues so that automated remediation workflows can address the root causes.
You must follow this protocol strictly. Do not improvise or skip steps.
The workflow pre-activation step has already computed all run-level data. Do not re-query GitHub for run lists or issue counts. Use the values injected into your prompt:
| Variable | Meaning |
|---|---|
| `failed_run_ids` | JSON array of run IDs of failed `test.yml` runs on `main` in the last 3 days |
| `total_run_count` | Count of completed runs with meaningful conclusions (`success`, `failure`, `timed_out`, `neutral`, `action_required`); cancelled runs are excluded |
| `open_issues` | Current count of open `flaky-test` issues |
| `issue_slots_available` | How many new issues you may create (max 3) |
Parse failed_run_ids as a JSON array immediately. Example: ["12345678","87654321"].
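As a rough sketch of the parsing step, the array can be split into one run ID per line with standard text tools. The function name `parse_run_ids` and the shell variable are hypothetical, and the approach assumes a flat array of numeric ID strings like the example above; a real JSON parser is the safer choice for anything more complex.

```shell
# Hypothetical variable; the real value is injected into the prompt.
failed_run_ids='["12345678","87654321"]'

# Split a flat JSON array of ID strings into one ID per line.
# Assumes IDs contain only digits, so stripping brackets, quotes, and
# whitespace is safe. Anything that is not purely numeric is dropped.
parse_run_ids() {
  printf '%s' "$1" | tr -d '[]" \n\t' | tr ',' '\n' | grep -E '^[0-9]+$'
}

parse_run_ids "$failed_run_ids"
```

Each printed line can then be substituted for `{run_id}` in the per-run requests.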
For each run ID in failed_run_ids:
gh api /repos/{owner}/{repo}/actions/runs/{run_id}/jobs?per_page=100
Replace {owner} and {repo} with the values from the repository's GitHub remote (visible in git remote get-url origin).
From the returned jobs array, keep only jobs where both conditions hold:
- `conclusion == "failure"`
- the job belongs to `Matrix Acceptance Test` (this is the job group that runs acceptance tests)

Ignore infrastructure jobs (e.g. lint, build, generate); they do not produce `--- FAIL:` lines.
The GitHub API log endpoint returns a ZIP archive. Use the gh CLI to stream plain-text logs for a job:
gh run view --job {job_id} --log | grep '^--- FAIL:'
Log size warning: Job logs can be very large (10 MB+). Do not load the full log into context. Instead:
- Pipe the log through `grep` so only matching lines are retained.
- Extract only lines that match the `--- FAIL:` pattern (see §4).
- Use `grep -B3 -A3` or similar when you also need surrounding context for the "Sample Failure Output" issue section.
- Keep only a short excerpt (the lines around one `--- FAIL:` line) for the "Sample Failure Output" issue section.

If a job list response has 100 items, there may be more: check the `Link` header for a `next` page URL and repeat the request. In practice, a single run rarely has more than 100 jobs.
`--- FAIL:` extraction pattern

Scan each log for lines that match this exact pattern:
^--- FAIL: TestName (timing)
Examples of matching lines:
--- FAIL: TestAccResourceAgentConfiguration_alternateEnvironment (12.34s)
--- FAIL: TestAccSomeResource_basic (0.45s)
Rules:
- The line must start with `--- FAIL: ` (three hyphens, a space, `FAIL:`, a space).
- The test name follows: `TestAcc...` with an optional underscore-separated sub-name.
- The timing follows in parentheses, e.g. `(12.34s)`.
- Ignore `FAIL` lines without the `---` prefix; those are package-level failure markers, not individual test failures.

Collect all extracted test names across all runs and all jobs. A test may appear multiple times (once per run where it failed), so track counts.
Same-run deduplication: If the same test name appears in multiple failing jobs within a single run (e.g. multiple shards both failing the same test), count it only once for that run. Deduplication is by run ID, not job ID — use a set per run when accumulating test names.
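The extraction and same-run deduplication rules can be sketched as below. The `run_id<TAB>log line` input format and the function names `extract_failures` and `count_failures` are assumptions made for illustration, and the sketch assumes each log line has already been stripped of any job, step, or timestamp prefix.

```shell
# Read "run_id<TAB>log line" records on stdin (a hypothetical intermediate
# format) and print each unique "run_id<TAB>test_name" pair once.
extract_failures() {
  awk -F '\t' '
    # Keep only individual test failures: "--- FAIL: Name (1.23s)".
    $2 ~ /^--- FAIL: [^ ]+ \([0-9.]+s\)$/ {
      split($2, parts, " ")      # parts[3] holds the test name
      key = $1 "\t" parts[3]
      if (!(key in seen)) {      # same-run deduplication, keyed by run ID
        seen[key] = 1
        print key
      }
    }
  '
}

# Count the distinct failing runs per test: prints "count test_name".
count_failures() {
  extract_failures | cut -f2 | sort | uniq -c | awk '{ print $1, $2 }'
}

# Two shards of run 101 fail the same test; run 102 fails it once more.
printf '%s\t%s\n' \
  101 '--- FAIL: TestAccSomeResource_basic (0.45s)' \
  101 '--- FAIL: TestAccSomeResource_basic (0.51s)' \
  102 '--- FAIL: TestAccSomeResource_basic (0.40s)' \
  102 'FAIL example/pkg 1.2s' \
  | count_failures
# prints: 2 TestAccSomeResource_basic
```

Keying the `seen` map on the run ID rather than the job ID is what keeps two failing shards of one run from counting twice.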
For each unique test name:
fail_rate = fail_count / total_run_count
Where:
- `fail_count` = number of distinct run IDs in which this test appeared as `--- FAIL:`
- `total_run_count` = the value from pre-activation context (already excludes cancelled runs)

| Classification | Condition | Action |
|---|---|---|
| Broken | fail_rate == 1.0 (fails in 100% of runs) | Create issue |
| Flaky | fail_rate >= 0.20 and < 1.0 | Create issue |
| Noise | fail_rate < 0.20 | Ignore — do not create issues |
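A minimal sketch of the classification table follows, using integer comparisons so the 20% boundary does not depend on floating-point rounding. The function name `classify` is hypothetical, and treating a zero `total_run_count` as noise is an assumption the text above does not cover.

```shell
# classify FAIL_COUNT TOTAL_RUN_COUNT -> prints broken, flaky, or noise.
classify() {
  awk -v f="$1" -v t="$2" 'BEGIN {
    # Assumption: with no completed runs there is nothing to report.
    if (t <= 0) { print "noise"; exit }
    if (f >= t)                    print "broken"  # fail_rate == 1.0
    else if (f * 100 >= t * 20)    print "flaky"   # 0.20 <= fail_rate < 1.0
    else                           print "noise"   # fail_rate < 0.20
  }'
}

classify 5 5   # prints: broken
classify 1 5   # prints: flaky
classify 1 6   # prints: noise
```

A test that fails in exactly 20% of runs lands in the flaky bucket, matching the `>= 0.20` condition in the table.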
Extract the base test name from each test function name using this rule:
Take the substring from the beginning up to (but not including) the first underscore `_`.
Pattern: TestAcc[^_]+
Examples:
| Full test name | Base test name |
|---|---|
| `TestAccResourceAgentConfiguration_alternateEnvironment` | `TestAccResourceAgentConfiguration` |
| `TestAccResourceAgentConfiguration_minimal` | `TestAccResourceAgentConfiguration` |
| `TestAccResourceAgentConfiguration` | `TestAccResourceAgentConfiguration` |
| `TestAccSomeResource_basic` | `TestAccSomeResource` |
One issue per base test name. All scenario variants (subtests/suffixes) belonging to the same base name are consolidated into a single issue. List each specific variant inside the issue body.
Fallback for non-TestAcc tests: All acceptance tests in this project follow the TestAcc prefix convention. If a non-TestAcc test name appears in the logs, treat everything up to the first _ (or the full name if no _) as the base name.
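The base-name rule, including the fallback, reduces to cutting the name at its first underscore. A small sketch, with `base_test_name` as a hypothetical helper name:

```shell
# base_test_name FULL_NAME -> prints everything before the first underscore,
# or the full name when it contains no underscore.
base_test_name() {
  # "%%_*" removes the longest suffix that starts with an underscore.
  printf '%s\n' "${1%%_*}"
}

# Consolidate variants into one entry per base name.
for name in \
  TestAccResourceAgentConfiguration_alternateEnvironment \
  TestAccResourceAgentConfiguration_minimal \
  TestAccSomeResource_basic
do
  base_test_name "$name"
done | sort -u
# prints:
# TestAccResourceAgentConfiguration
# TestAccSomeResource
```

Because the expansion leaves underscore-free names untouched, the same helper also covers the non-`TestAcc` fallback.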
For each base test name that will receive an issue, investigate whether any recent commit may already address the failure:
From the failed_run_ids list, identify the oldest run's created_at timestamp. You can get metadata for a single run:
gh api /repos/{owner}/{repo}/actions/runs/{run_id}
Take the minimum created_at across all failed runs.
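One way to take that minimum, sketched under the assumption that every `created_at` value is a UTC ISO 8601 string in the same format (as in `2024-01-15T12:00:00Z`), is a plain byte-wise sort, since such strings order chronologically. The function name `oldest_timestamp` is hypothetical.

```shell
# Read one created_at value per line on stdin and print the earliest.
# Assumes identically formatted UTC timestamps, which sort chronologically
# when compared as plain strings.
oldest_timestamp() {
  # Force byte-wise ordering so the result does not depend on the locale.
  LC_ALL=C sort | head -n 1
}

printf '%s\n' \
  2024-01-16T08:30:00Z \
  2024-01-15T12:00:00Z \
  2024-01-17T00:00:00Z \
  | oldest_timestamp
# prints: 2024-01-15T12:00:00Z
```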
Fetch the commits on `main` since that timestamp:
gh api "/repos/{owner}/{repo}/commits?sha=main&since={timestamp}&per_page=50"
Replace {timestamp} with the ISO 8601 value from step 7.1 (e.g. 2024-01-15T12:00:00Z).
For each commit returned:
a. Commit message relevance: does it reference any of the following?
- the base test name (e.g. `TestAccResourceAgentConfiguration`)
- the resource name (e.g. `agent_configuration` → `AgentConfiguration`)
- keywords such as `fix`, `flaky`, `test`, `revert`

b. Changed file relevance: fetch the full commit detail to get the changed file paths:
gh api /repos/{owner}/{repo}/commits/{sha}
Check if any file in files[].filename matches patterns like:
- `*_test.go` files whose name contains a token from the resource name

If a commit looks like a likely fix, report it in this form:
⚠️ may already be addressed in `{short_sha}` — {one-line message summary}

If no commit qualifies, write:
No recent commits appear to address this failure.
Frame the analysis as "has this been fixed yet?" — not as blame attribution.
Do not suppress issue creation: Even if a fix commit is found, always proceed with creating the issue and include the fix-detection note in the Commit Analysis section. The issue serves as the remediation trigger regardless.
Before creating an issue for a base test name, check whether one already exists:
gh api "/repos/{owner}/{repo}/issues?labels=flaky-test&state=open&per_page=100"
- Match existing issues by `title`.
- The expected title format is `[flaky-test] {BaseTestName}` (the `[flaky-test]` prefix is applied automatically by the create-issue safe output). When checking for duplicates, compare against this full title as it appears in GitHub.
- Do not recompute `issue_slots_available`; use only the value from pre-activation.
- If the response is paginated, follow the `next` link in the `Link` response header and repeat the request for subsequent pages until all open issues are fetched.

Issue title: Pass only `{BaseTestName}` to the create-issue safe output; the `[flaky-test]` prefix is added automatically.
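The title comparison can be sketched as an exact, whole-line match against the open issue titles. The helper name `has_open_issue` and the one-title-per-line input are assumptions for illustration; the titles would come from the issue list fetched above.

```shell
# has_open_issue BASE_NAME: reads open issue titles on stdin, one per line,
# and succeeds only if one of them is exactly "[flaky-test] BASE_NAME".
has_open_issue() {
  # -F: fixed string (brackets are literal), -x: whole line, -q: quiet.
  grep -Fxq -- "[flaky-test] $1"
}

titles='[flaky-test] TestAccSomeResource
[flaky-test] TestAccOther'

if printf '%s\n' "$titles" | has_open_issue TestAccSomeResource; then
  echo "duplicate: skip TestAccSomeResource"
fi
```

Matching the whole line avoids treating `TestAccSome` as a duplicate of `TestAccSomeResource`.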
Every issue you create must contain exactly these 5 sections in this order:
## Broken Tests
List each test function name (including scenario suffix) that failed in 100% of runs:
- ❌ `TestAccResourceFoo_basic` — failed in 5/5 runs
## Flaky Tests
List each test function name that failed in ≥ 20% but < 100% of runs, with the observed rate:
- ⚠️ `TestAccResourceFoo_update` — failed in 3/5 runs (60%)
- ⚠️ `TestAccResourceFoo_import` — failed in 1/5 runs (20%)
## Commit Analysis
{Output from §7. Either a ⚠️ note about a possible fix commit, or the "No recent commits" message. Include commit SHA, message, and affected file paths when relevant.}
## Sample Failure Output
{Short excerpt (5–15 lines) of the actual log output surrounding a `--- FAIL:` line. Include any immediately preceding error messages for context.}
## Affected Stack Versions
{List the Elastic Stack versions / matrix dimension values (e.g. Elasticsearch version, Kibana version) from the failing job names or log metadata. If not determinable, write "Unknown — not present in log output".}
Formatting rules:
- Use ❌ for broken tests (100% fail rate).
- Use ⚠️ for flaky tests (20%–99% fail rate), and always include the fraction and percentage.

Cap enforcement: Before creating each issue, verify that the number of issues created so far in this run has not reached `issue_slots_available`. Stop creating issues once the cap is reached, even if additional base test names remain.
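The cap check can be illustrated with a counter that is tested before each creation. In this sketch the function name `select_within_cap` is hypothetical, and printing the name stands in for the actual create-issue call.

```shell
# select_within_cap CAP: reads candidate base test names on stdin, one per
# line, and prints only as many as the cap allows.
select_within_cap() {
  cap=$1
  created=0
  while IFS= read -r name; do
    # Check the cap before every creation and stop once it is reached.
    [ "$created" -ge "$cap" ] && break
    printf '%s\n' "$name"        # placeholder for the create-issue call
    created=$((created + 1))
  done
}

# With issue_slots_available = 2, the third candidate is left out.
printf '%s\n' TestAccA TestAccB TestAccC | select_within_cap 2
# prints:
# TestAccA
# TestAccB
```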
Call noop with a descriptive explanation (do not create any issues) when any of these conditions holds:
All failures already have open issues — every qualifying base test name matched an existing open flaky-test issue during deduplication; nothing new to open.
All failures are below the 20% threshold — every observed --- FAIL: test has fail_rate < 0.20; there are no broken or flaky tests to report.
No --- FAIL: patterns found — none of the logs for the provided failed_run_ids contained a --- FAIL: line; the CI failures were likely infrastructure failures (network timeouts, setup errors, etc.) rather than test logic failures.
When calling noop, state which condition applied and include basic counts (e.g. "3 failures observed, all below 20% threshold").
1. Parse `failed_run_ids` from pre-activation context.
2. For each run, list the jobs and keep the failed `Matrix Acceptance Test` jobs → fetch and scan logs for `--- FAIL:` lines.
3. Compute fail rates per test and drop noise (`< 20%`).
4. Group by base test name and deduplicate against open `flaky-test` issues.
5. For each remaining base test name (up to `issue_slots_available`): run commit analysis, then create an issue with all 5 required sections.
6. If nothing qualifies, call `noop`.