| name | summarize-test-suite |
| description | Summarize a GitHub Actions test suite run for the mongodb/terraform-provider-mongodbatlas repo — fetches job logs, classifies failures, and produces a Slack summary led by an explicit code-regression verdict. Use when the user shares a workflow run URL or run_id, or when triggered by the test-suite webhook to decide whether on-call should investigate. |
Summarize Test Suite Execution
Inputs
The only input is a run_id (GitHub Actions run identifier). The repo is always mongodb/terraform-provider-mongodbatlas, and the run URL is https://github.com/mongodb/terraform-provider-mongodbatlas/actions/runs/<run_id> — derive both from run_id.
run_id can arrive as a workflow run or job URL the user pastes (extract the numeric ID from …/actions/runs/<run_id> or …/actions/runs/<run_id>/job/<job_id>), as a number passed by the test-suite webhook, or as a value the caller spells out directly.
Output
The output is always Slack-flavoured markdown (mrkdwn) — *bold* (never **bold**), backticks for code, • for bullets, <url|label> for links. The caller is responsible for posting it to Slack; the skill only produces the text.
The skill operates only on the run it is given — do not look at prior runs or other workflows.
Goal
The first character of the output is one of three emojis that telegraph urgency to on-call at a glance:
:red_circle: — at least one code regression. Investigate now.
:yellow_circle: — only infrastructure noise (capacity / timeout / cleanup / API contract). No on-call action.
:green_circle: — all tests passed. No on-call action; this is the daily heartbeat.
Confidence
Decide a confidence level for your verdict and embed it on the second line of the summary (templates below). Always show it — high confidence is useful information for on-call, not just low.
- high — the failure pattern maps cleanly onto the rules in Step 4, the categorisation matches the logs you read, and you are sure about the verdict.
- medium — the verdict is clear, but some category counts are estimated (you didn't enumerate every package's logs) or one failure type didn't fit any category cleanly.
- low — the failure pattern is novel, the logs were incomplete or truncated, you had to make a judgement call between two categories for ≥1 failure, or the rules in this skill don't cover something you saw.
Confidence modulates urgency on the second line: lower confidence on a yellow or green verdict raises the action from "no immediate action" to "please review", because the agent is signalling it might have missed something. Red is always "investigate immediately" regardless of confidence — but the confidence prefix still appears so the on-call knows whether to fully trust the regression call.
When confidence is medium or low, the summary must end with two extra lines:
*Why <confidence> confidence*: <one or two sentences naming the specific ambiguity> — e.g., "saw an INVALID_REGION API error that doesn't match the category 3a indicators", "a flex_cluster package failed with no --- RUN/--- PASS and an unfamiliar unable to start dispatcher line, ambiguous between build error and cleanup", "two failing tests share a deferred-test-helper panic that doesn't match any category 1 example", or "the generic job's logs returned HTTP 404, so its failures could not be enumerated". Be concrete; vague phrases like "complex output" aren't useful.
If this ambiguity recurs, consider updating the skill rules. — verbatim, immediately after the Why line. This is a standing nudge that medium / low confidence often reflects a missing rule rather than a one-off run; the maintainer can use it to decide when to revisit the skill.
Both lines are omitted on high confidence.
Provider repo layout
When you cite "code regressions in ", use the actual Go package path, not the test-job display name. The job name maps to a Go package by dropping underscores, lowercasing, and prefixing with internal/service/ — e.g., advanced_cluster → internal/service/advancedcluster, global_cluster_config → internal/service/globalclusterconfig. Apply this rule for any job not covered by the special cases below.
Special cases (do not apply the simple rule):
network is a super-job covering several packages (networkcontainer, networkpeering, privatelinkendpoint*, privateendpointregionalmode). Cite the specific package the failing test lives in, not "network".
autogen_fast / autogen_slow run the autogenerated resources under internal/serviceapi/.
config_sa_mig runs the same packages as the config group, but only the migration tests, in SA-auth mode.
clean-before / clean-after are cleanup utilities under internal/testutil/clean/, not provider code. Failures in these never count as code regressions.
Well-known manual fixes
Some failing tests have recurring root causes that aren't fully captured by category indicators alone. When a failing test name in the current run matches an entry below, surface the matching manual fix inline in the summary so on-call can act without re-investigating from scratch. Match strictly on test name within this run only — do not consult any other run.
| Test name | Manual fix |
|---|
TestAccNetworkRSNetworkPeering_Azure | Check for existing peering connections from a prior run. |
TestAccEncryptionAtRest_azure_requirePrivateNetworking | Ensure no active encryption-at-rest private-endpoint connections from a prior run. |
When the Known manual fixes section is emitted, append a closing line See team's internal wiki for more details. after the per-test bullets, so on-call knows there's a deeper playbook to consult. Do not include any URL — the wiki is internal and the skill output ships through a public skill file; on-call already knows where to find the team wiki.
Workflow
Step 1: Run context and failed jobs
First, fetch the run metadata so the summary can cite which environment was tested:
gh api repos/mongodb/terraform-provider-mongodbatlas/actions/runs/{run_id} \
--jq '{head_sha, run_number}'
Extract commit: first 7 chars of head_sha.
Then list jobs:
gh api repos/mongodb/terraform-provider-mongodbatlas/actions/runs/{run_id}/jobs --paginate
Derive env and auth from job names (the test-suite run's display_title is always just "Test Suite" and does not embed them):
auth: any test-job name has the matrix prefix <terraform_version>-<provider_version>-<auth> (e.g. 1.14.x-latest-pak / ... or 1.14.x-latest-sa / ...). Extract the third dash-separated token of the prefix (pak or sa).
env: nested called-workflow jobs are named tests-<terraform_version>-<provider_version>-<env> / ... (e.g. tests-1.14.x-latest-dev or tests-1.14.x-latest-qa). Extract the suffix after the last dash of that segment (dev or qa).
If no test jobs exist (only cleanup ran), default env/auth to "n/a".
Split jobs with conclusion: "failure" into two buckets:
- Cleanup jobs: any job whose name starts with
clean-before or clean-after. Treat all failures inside these collectively as category 5 (cleanup) — surface as a single line ("Cleanup: N stuck projects from prior runs"), do not enumerate the TestCleanProjectAndClusters/<id> subtests in the Failing tests list. Cleanup-job failures are leftovers from previous runs, never signals of a code regression in the current run.
- Test jobs: everything else. These go through Steps 2 to 4.
If both buckets are empty, emit the green template (:green_circle:) immediately and return.
Step 2: Fetch failure verdicts strictly
For each test job, fetch the per-test verdicts using a strict filter. Do not grep for Error or generic FAIL — [ERROR] sdk.proto: lines and Error in create/update/delete are emitted by many tests that ultimately PASS:
gh api repos/mongodb/terraform-provider-mongodbatlas/actions/jobs/{job_id}/logs 2>&1 \
| grep -E "(--- FAIL|^FAIL\t|panic:)"
The --- FAIL: TestName line is the source of truth for an individual test failure. Never classify a test as failed without one.
Package-level FAIL\t<package> without any --- FAIL: TestName for that package is not a test failure. Sub-classify it before counting:
- If the job log shows tests started for that package (any
--- RUN or --- PASS lines for it), the failure is post-test teardown (deferred TestMain, cleanup) — category 5 (cleanup).
- If no tests started for that package (no
--- RUN or --- PASS lines, just compile errors like # <package> followed by line:column errors, or cannot find module, command not found, undefined:, dependency-resolution failures, or panic: during go test startup, dependency download, or binary installation), it is a build / tooling error that prevents the runner from executing any tests in that package — category 1 (code regression). Examples: a recent provider commit broke a transitive import, a dependency yanked a version, a test-infra change broke binary discovery, a third-party installer panicked verifying a signature or checksum. These warrant immediate investigation regardless of whether the root cause is internal or upstream — the runner can't verify provider behaviour at all.
In neither case should the package name go in the Failing tests list — instead, surface a build error as its own bullet under code regressions ("Build error in <package>: ") and a cleanup-only package failure as part of the cleanup count.
When gh api .../jobs/{job_id}/logs returns an HTTP error (404, 403, 500) or empty output: do not guess the job's category. The job's tests cannot be verified. Surface it as a single bullet in the summary body — • Logs unavailable: \` (failed after min — logs inaccessible)— and do **not** count its tests toward any category and do **not** list them in the *Failing tests* line. Logs-unavailable jobs do not, by themselves, escalate the verdict to red, but they are still failed jobs: when a logs-unavailable job exists, the floor for the leading emoji is:yellow_circle:(never:green_circle:), even if no category 1–5 indicators were found in any other job. Set confidence to **medium** with this exact text used as the Why medium confidence:line at the end of the summary:logs unavailable for `` (); all other jobs cleanly classified, where isHTTP (e.g.HTTP 404) when the API returned a non-2xx status, or empty log response` when the call succeeded but the body was empty. Rationale: a job that ran tens of minutes and then failed is most likely infra noise, but without logs you cannot confirm; medium confidence asks on-call to glance without triggering a false red alert.
Subtest aggregation: a --- FAIL: Parent/<subname> line is one subtest of Parent. If multiple subtests of the same parent fail in the same job, list Parent once with the count (Parent (×N subtests)), not N separate entries.
Step 3: Get error context for each failed test
For each unique parent test (after subtest aggregation), fetch the surrounding context:
gh api repos/mongodb/terraform-provider-mongodbatlas/actions/jobs/{job_id}/logs 2>&1 \
| grep -B10 "FAIL: TestName"
Step 4: Classify failures
Assign each failing test (or aggregated subtest group) to exactly one category, first match wins:
| Priority | Category | Indicators |
|---|
| 1 | Code regression (high-signal) | Error: Provider produced inconsistent result after apply / provider produced an unexpected new value / This is a bug in the provider / any panic: line (in test execution, in TestMain, during dependency install, or during binary-installer signature/checksum verification) EXCEPT the Go test framework's deadline panic (see exception below) / Plugin did not respond / rpc error: / non-empty plan after a successful apply / assertion failures on attribute values / unexpected plan diffs / build or tooling errors that prevented any tests from running in a package (per the Step 2 sub-classification) / INVALID_ATTRIBUTE from the Atlas API when a test fails (the provider sent a field or value the API rejects; root cause may be provider code, test config, or a new Atlas API validation, but all cases require investigation and a fix) |
| 2 | Cloud capacity | OUT_OF_CAPACITY, NO_CAPACITY, No Capacity |
| 3 | API errors | Generic HTTP 4xx/5xx with API error codes (only when not covered by category 1; raw API errors that are not provider bugs go here); polled-state failures of the form unexpected state '<terminal-fail-state>', wanted target '<success-state>' (emitted by terraform-plugin-sdk's retry helper when an Atlas-side resource enters a terminal failure state during create / update / delete polling — e.g. unexpected state 'FAILED', wanted target 'COMPLETED'). The test got a definitive negative answer from the Atlas backend, not no answer |
| 3a | API contract errors | API rejects a request the test thought was valid: PROVIDER_UNSUPPORTED for combinations that used to work, CANNOT_ASSUME_ROLE, region/cross-region constraint errors. Sub-bucket of category 3, still yellow, but worth surfacing separately so on-call can triage instead of treating it as plain capacity noise |
| 4 | Timeout / flake | timeout while waiting for state, context deadline exceeded, eventual-consistency assertion failures, Go test deadline panics of the form panic: test timed out after <duration> (e.g. panic: test timed out after 5h0m0s) — these are emitted by the go test runner when a test exceeds its -timeout flag, not by code under test; HTTP 404 / *_NOT_FOUND error codes on a non-Create step (Update / PATCH / Read / Delete) when an earlier step for the same test succeeded — the resource exists but propagation is lagging; canonical examples: STREAM_CONNECTION_NOT_FOUND_FOR_NAME, CLUSTER_NOT_FOUND_FOR_NAME. Also includes HTTP 4xx with "does not exist" / "not found" wording referencing another Atlas resource created earlier by the same test, even when the HTTP code is not 404 and the error code is a generic Atlas code rather than *_NOT_FOUND — cross-resource propagation lag within Atlas where the referenced resource exists but the dependent service hasn't seen it yet. Canonical example: HTTP 400 STREAM_PROCESSOR_GENERIC_ERROR with body wording "connection does not exist" when a stream_processor test references a stream_connection created earlier in the same apply via the dependency graph. The prior create is implicit — helpers like acc.ProjectIDExecutionWith* fail loud (require.NoError) and Terraform aborts a failed apply, so reaching this error means the prerequisite was provisioned. Only false-positive to rule out is a name mismatch (typo / wrong reference in the test config); if the name in the error matches a name created earlier in the run, classify as category 4, otherwise category 3. Also includes HTTP 4xx errors with "does not exist" wording from a downstream-cloud provider (GCP / AWS / Azure) referencing an Atlas-created identity (service account, role, IAM principal) — cross-cloud propagation lag where Atlas created the resource but the downstream cloud doesn't see it yet; canonical example: googleapi: Error 400: Service account <atlas-sa>@<project>.iam.gserviceaccount.com does not exist during a google_storage_bucket_iam_member apply |
| 5 | Cleanup | still exists after destroy, CANNOT_CLOSE_GROUP_ACTIVE_ATLAS_CLUSTERS, plus everything in cleanup jobs from Step 1 |
Discipline rule: any "error in the provider" message (priority 1 patterns) is always a code regression. Never reclassify as flake, capacity, or cleanup. The canonical example: provider state-handling that diverges from the API (e.g., a config_server_type-style attribute that resolves to a different value than was applied), which is easy to miss for days because the underlying error gets buried among capacity / timeout failures.
Discipline rule: the leading emoji must match the worst category seen — any category 1 failure → :red_circle:; category 2–5 only → :yellow_circle:; no failures → :green_circle:. Never inflate (e.g. don't return :red_circle: for capacity-only failures even if 22 tests failed) and never deflate (don't return :yellow_circle: if even one provider-inconsistency error exists).
Discipline rule (conflicting indicators within one test): when a single failing test's log lines contain indicators from two different categories, classify by the root-cause indicator, not the downstream symptom. The most common case is a cleanup leftover that produces a downstream state-wait timeout: an overlapping-CIDR error, still exists, CANNOT_CLOSE_GROUP_ACTIVE_ATLAS_CLUSTERS, or DUPLICATE_AZURE_SERVICE_PRINCIPAL (category 5 indicators) alongside timeout while waiting for state or a state-wait deadline (category 4) → category 5. The timeout is an effect of the leftover resource, not a separate failure. Never split a single test across two categories — the root cause wins.
Discipline rule (tests that exercise cleanup / delete-on-timeout behaviour): when a failing test's name encodes the very cleanup behaviour whose failure would normally read as category 5 (e.g. *DeleteOnCreateTimeout*, *createTimeoutWithDelete*, *deleteOnCreate*, *CleanupOnTimeout*), and the DUPLICATE_* / still exists / leftover-resource indicator references a resource the same test created in an earlier step of the current --- RUN <test> execution (not a leftover from a previous test or a previous CI run), escalate to category 1 and flag it in the code-regression section. The cleanup leftover is the regression the test is meant to catch. The root cause may end up being in the provider (the timeout-triggered delete didn't fire) or in the test itself (its own delete-wait helper is too brittle, e.g. discards errors or uses too short a timeout). Both cases need on-call investigation; neither is safely ignorable. Surface the test in the regression bullet with wording like <TestName> — provider's delete-on-create-timeout or its test helper failed: DUPLICATE_CLUSTER_NAME at step N. Guard: if the same-execution evidence is missing (e.g. the failing step is Step 1 with no prior create, or the leftover name doesn't match an earlier apply complete in this run), fall back to category 5 — a hardcoded or non-unique resource name across CI runs would be the more likely explanation.
Root-cause aggregation across jobs: when ≥3 failing tests share an identical root-cause string (the same OUT_OF_CAPACITY, the same HTTP 4xx error code, the same timeout while waiting for state), report once as "N tests, all (in )" rather than enumerating each test's identical short error. Compress noise without losing the count.
Common misclassifications to avoid (these have happened in the past, do not repeat):
- Do not invent categories. The five buckets above (1, 2, 3, 3a, 4, 5) are exhaustive. Never create new ones like "CI infrastructure", "tooling noise", "third-party panic", "Go toolchain", or similar. If you find yourself wanting to do that, the failure already belongs in an existing category — usually category 1 (a panic or build error) or category 5 (cleanup).
- Do not downgrade
panic: based on root cause. A panic: is always category 1, regardless of where the panic originated (provider code, test setup, TestMain, an installer like hc-install, a dependency download, signature/checksum verification, an SDK initialiser). The criterion for category 1 is "tests cannot run or cannot be trusted", not "provider code is at fault". On-call needs to investigate any panic immediately, even when the root cause turns out to be upstream — the action is to revert / pin / patch the offending dependency or test infra. A run with three jobs panicking on the same toolchain issue is :red_circle:, not :yellow_circle:.
- One exception to the panic rule: Go test deadline panics. A line of the form
panic: test timed out after <duration> (e.g. panic: test timed out after 5h0m0s) is not a runtime panic — it is the go test runner enforcing the -timeout flag. It signals "this test/binary exceeded its deadline", which is a category 4 (timeout / flake) concern: the action is to investigate the test's runtime, not to triage a code regression. Classify these as category 4, list them under Timeout in the body, and do not enumerate them in the code-regressions section. The "any panic" rule still applies to every other panic: line.
- Do not downgrade build errors based on root cause. Same principle: if no tests ran in a package because of a compile error, missing binary, or installer panic, that's category 1. The fix may be in test infra rather than provider code, but the urgency is the same: the runner can't verify behaviour.
- Do not reclassify a category 1 finding to make the message look "less alarming". If even one panic or
Provider produced inconsistent exists, the verdict is :red_circle: and the header is CODE REGRESSION DETECTED, full stop. Never deflate.
- Do not classify
INVALID_ATTRIBUTE as category 3a. When a failing test's error context shows INVALID_ATTRIBUTE, the provider sent an attribute value the Atlas API rejects — root cause may be provider code, test config, or a new Atlas API validation, but it's still category 1 and requires investigation. Tests that use ExpectError: regexp.MustCompile("INVALID_ATTRIBUTE") pass cleanly; Step 2's --- FAIL filter already excludes them from classification.
- Do not classify propagation-lag errors as category 3 or 3a. Three shapes count as eventual-consistency flake (category 4), not API contract errors:
- Within Atlas, same resource: a test's Create step succeeded and a later step (Update / PATCH / Read / Delete) returns HTTP 404 or a
*_NOT_FOUND error on the same resource — the resource exists, the cluster hasn't propagated the creation yet.
- Within Atlas, cross-resource: a test creates one Atlas resource (via a helper like
acc.ProjectIDExecutionWith*, via an earlier resource in the same TF apply ordered by dependency, or in an earlier TF step) and a later operation creates a dependent Atlas resource that references it by name. The dependent-resource API returns HTTP 4xx with "does not exist" / "not found" wording about the referenced resource, even though the HTTP code is not 404 and the error code is a generic Atlas code rather than *_NOT_FOUND (e.g. STREAM_PROCESSOR_GENERIC_ERROR with body "connection does not exist"). The referenced resource exists; the dependent service's index hasn't caught up. The prior create is implicit (helpers fail loud, Terraform aborts on failed dependencies, an earlier failed step would have stopped the test). Only false-positive to rule out is a name mismatch — if the name in the error doesn't match any name created earlier in the run, classify as category 3 (real test bug), otherwise category 4.
- Across clouds: a test creates an Atlas resource (e.g. a service account) and a downstream-cloud provider (GCP / AWS / Azure) returns HTTP 4xx with "does not exist" wording referencing that identity in the same apply — the Atlas resource exists, the downstream cloud's IAM hasn't picked it up yet.
Only classify a 404 as category 3 if the initial Create or lookup of an Atlas resource fails — that indicates the endpoint or resource type doesn't exist, not that propagation is lagging.
- Do not classify a
DUPLICATE_* / "still exists" failure inside a delete-on-timeout test as category 5 when the leftover is from the same test execution. Tests whose name encodes the cleanup / delete-on-create-timeout behaviour (*DeleteOnCreateTimeout*, *createTimeoutWithDelete*, *deleteOnCreate*, *CleanupOnTimeout*) exist because the team wants to catch exactly this kind of failure. When the duplicate is between steps of the same --- RUN <test> execution (an earlier step created the resource and the failing step tries to recreate it after the timeout-triggered delete was supposed to clean it up), the leftover is the regression — escalate to category 1. Root cause is either a provider regression (the timeout-triggered delete didn't fire) or a test-helper bug (the test's delete-wait is brittle: too short a budget, errors discarded, magic sleep); both need on-call to look. Cluster-name uniqueness reinforces this — most acceptance helpers generate fresh per-execution names, so a DUPLICATE_CLUSTER_NAME between steps of the same test execution cannot come from another run. The opposite holds too: if the resource name appears to be hardcoded (no random / per-execution suffix) and the duplicate is at Step 1 with no prior create in the log, keep category 5 — the leftover is genuinely from a previous CI run.
- Do not classify polled-state FAILED as category 4 (timeout). The indicator is
unexpected state 'FAILED', not timeout while waiting for state — the test got a definitive answer (the resource entered a terminal failure state during polling), not no answer. It belongs in category 3.
Step 5: Produce output
Build the summary in Slack mrkdwn from the templates below and return it. Never emit GitHub-flavoured markdown — **bold**, # heading, [label](url) will not render correctly. The TL;DR line appears only in the red template; green and yellow have no TL;DR (the header and categorisation say everything).
Template — code regression detected
Use this when at least one failure is category 1.
:red_circle: *Test Suite #<run_number> — CODE REGRESSION DETECTED* (`<commit>` on `<env>`, `<auth>`)
<confidence> confidence — investigate immediately — <run_url|view run>
*<N> code regressions* in <packages>:
• `TestName1` — <one-line root cause, e.g. "Provider produced inconsistent result: .config_server_type was EMBEDDED, now DEDICATED">
• `TestName2` — <one-line root cause>
• Build error in `<package>` — <reason, e.g. "checksum mismatch installing terraform" or "cannot find module ...">
*Other failures*:
• Cloud capacity: <N> tests
• API contract: <N> tests
• Timeout: <N> tests
• Cleanup: <N> tests (incl. clean-before / clean-after)
[only when present:] • Logs unavailable: `<job-name>` (failed after <N> min — logs inaccessible)
[only when ≥1 failing test in this run matches the *Well-known manual fixes* table:]
*Known manual fixes*:
• `TestNameX` — <manual fix from the table>
See team's internal wiki for more details.
*Failing tests* (first 10): `TestName1`, `TestName2`, `TestA`, `TestB`, …, and 7 more
[only when confidence is medium or low:]
*Why <confidence> confidence*: <specific ambiguity in this run>
If this ambiguity recurs, consider updating the skill rules.
*TL;DR*: <one sentence on the suspected regression>. <if any failure took >30 min, mention "longest failure: TestX 75 min">. Check recent atlas-sdk-go bumps and upstream Atlas deployments around <approximate failure window>.
Template — infrastructure noise only
Use this when every failure is category 2–5.
:yellow_circle: *Test Suite #<run_number> — Infrastructure noise only* (`<commit>` on `<env>`, `<auth>`)
<confidence> confidence — <urgency> — <run_url|view run>
<N> tests failed, all classified as flake / capacity / cleanup / API contract:
• Cloud capacity: <N> tests (e.g., all `OUT_OF_CAPACITY` in cloud-<env>)
• API contract: <N> tests (list test names if ≤3 — these are worth a triage glance)
• Timeout: <N> tests
• Cleanup: <N> tests (incl. clean-before / clean-after)
[only when present:] • Logs unavailable: `<job-name>` (failed after <N> min — logs inaccessible)
[only when ≥1 failing test in this run matches the *Well-known manual fixes* table:]
*Known manual fixes*:
• `TestNameX` — <manual fix from the table>
See team's internal wiki for more details.
*Failing tests* (first 10): `TestA`, `TestB`, …, and 4 more
[only when confidence is medium or low:]
*Why <confidence> confidence*: <specific ambiguity in this run>
If this ambiguity recurs, consider updating the skill rules.
Template — all tests passed
Use this when step 1 finds no failed jobs (test or cleanup).
:green_circle: *Test Suite #<run_number> — All tests passed* (`<commit>` on `<env>`, `<auth>`)
<confidence> confidence — <urgency> — <run_url|view run>
[only when confidence is medium or low:]
*Why <confidence> confidence*: <specific ambiguity in this run>
If this ambiguity recurs, consider updating the skill rules.
Short by design — green days don't need a body.
Urgency phrase, by verdict and confidence
Substitute <urgency> in the second line according to this table. The <confidence> placeholder is always one of high / medium / low (lowercase).
| Verdict | high confidence | medium confidence | low confidence |
|---|
| red | investigate immediately | investigate immediately | investigate immediately |
| yellow | no immediate on-call action needed | please review when time allows | please review the run manually |
| green | no on-call action | please verify the run | please review the run manually |
The shape of the second line is therefore always: <confidence> confidence — <urgency> — <run_url|view run>.
Formatting rules
- Use
*text* for bold (Slack mrkdwn), never **text** (markdown).
- Use backticks around test names and code identifiers.
- Use
• for bullet points.
- Never reproduce internal ticket IDs (
HELP-*, CLOUDP-*, etc.) or internal artefact names in the output, even when they appear in logs or anywhere in this skill text. Describe the failure in its own technical terms (e.g. "provider produced inconsistent result for config_server_type").
- Code regressions: enumerate every failing test (after subtest aggregation) with its short error.
- Categories 2 to 5 (infrastructure noise): collapse to a single count line per category. Apply root-cause aggregation when ≥3 tests share the same error string.
- Cap the Failing tests line at the first 10 test names, comma-separated (no bullets), positioned below the summary. If more than 10 failed, suffix with
, and N more where N is the remaining count.
- Use
<url|label> for links.
- Known manual fixes: emit only when at least one failing test name in this run matches the Well-known manual fixes table; one bullet per matching test, with the table's manual fix quoted verbatim, followed by a closing
See team's internal wiki for more details. line. Do not invent manual fixes, and do not include any URL (the wiki reference must stay neutral).
- Length budget is enforced by Step 6 below — see that step for the target, the hard cap, and the drop-priority list.
Step 6: Length check (mandatory before returning)
Measure the exact length of your summary before returning it via Bash:
printf '%s' "<your summary>" | wc -c
Hard cap: 2900 chars. The caller prepends an on-call tag (≈30 chars) and posts the result inside a single Slack section block (Slack-side hard limit: 3000 chars); over-budget messages are rejected by Slack and the workflow step fails — on-call still sees no summary, just a red workflow run.
If wc -c reports more than 2900, compress in this order, re-measuring after each step until the count is under the cap:
- Drop the Why confidence line and the paired
If this ambiguity recurs, … nudge together if confidence is medium (keep both on low — that's where they matter most). The two lines must always be present or absent together; never emit one without the other.
- Drop the Known manual fixes section if confidence is high (the agent already classified cleanly; the fixes are nice-to-have at that point).
- Compress the TL;DR to one short sentence.
- Reduce the Failing tests cap from 10 to 5 (suffix with
, and N more).
- Collapse Other failures bullets onto a single line, e.g.,
*Other failures*: Cloud capacity 5, Timeout 12, Cleanup 3.
Key Learnings
- The
--- PASS / --- FAIL verdict is the source of truth — many ERROR-level lines come from tests that pass.
- Subtests of the form
Parent/<id> are one logical test with N children. Never inflate the failure count by listing them separately.
- Capacity, timeout, and cleanup categories are noisy by nature in cloud-dev; root-cause aggregation across jobs keeps the message scannable while preserving the count.
- Log output is large; always filter with
grep rather than reading raw logs.
- Cleanup-class indicators (CIDR overlap,
still exists, leftover principal) are root causes; a timeout on the same test is their downstream symptom. Classify by root cause (category 5), not symptom.
- When
gh api .../jobs/{id}/logs returns 404 or empty, surface as "Logs unavailable" in the summary body and set confidence to medium — never guess the category.
- HTTP 404 on a secondary step after a successful Create is eventual-consistency flake (category 4). Reserve category 3 for 404s on the initial Create or lookup.
- Some failing tests have well-known recurring root causes — match the failing test name against the Well-known manual fixes table and surface the matching manual fix when applicable.
unexpected state 'FAILED', wanted target '...' from terraform-plugin-sdk's retry helper is a category 3 polled-state failure: Atlas reported a terminal failure during state polling. Don't conflate with category 4 timeouts (no timeout while waiting) or category 1 (no provider bug).
- Cross-cloud propagation lag (Atlas creates an identity, GCP / AWS / Azure returns "does not exist" on it shortly after) is category 4 (eventual-consistency flake), not category 3 — same shape as the within-Atlas 404 case but the HTTP code may be 400 because the downstream cloud rejects the reference rather than the lookup itself.
- Atlas-internal cross-resource propagation lag (e.g. a
stream_processor step referencing a stream_connection created earlier in the same apply and getting HTTP 400 STREAM_PROCESSOR_GENERIC_ERROR "connection X does not exist") is also category 4. The wire shape can be a non-404 HTTP code and a generic Atlas error code. The prior create is implicit (helpers fail loud, Terraform aborts on failed dependencies); the only check needed is name-match between the error and earlier creates — if the names don't match, it's a test misconfig (category 3).
DUPLICATE_* / still exists inside a test whose name encodes cleanup-on-timeout behaviour (*DeleteOnCreateTimeout*, *createTimeoutWithDelete*, *deleteOnCreate*, *CleanupOnTimeout*) is category 1, not category 5, when the leftover came from an earlier step of the same test execution — the leftover is the regression. The root cause may be a provider bug or a brittle test wait, both of which need on-call to look. If the leftover is from a previous CI run (no prior create in the current run's log, or the resource name appears hardcoded), it stays category 5.