Run any Skill in Manus with one click

$pwd:

must-gather-investigation

Name: Must Gather Investigation
Author: scylladb

// Investigate failed e2e tests from Ginkgo JSON reports and must-gather artifacts, systematically analyzing logs, events, and resource states to identify root causes.

Run Skill in Manus

$ git log --oneline --stat

stars:399

forks:199

updated:May 28, 2026 at 11:57

SKILL.md

readonly

related-skills.json

same repository

release-notes-generator.md

from "scylladb/scylla-operator"

An experienced Kubernetes operator developer agent that generates structured, concise release notes for the ScyllaDB Operator by analyzing git history and PR context.

2026-03-19399

package.json

"author": "scylladb"

"repository": "scylladb/scylla-operator"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

Run any Skill with one click

name	must-gather-investigation
description	Investigate failed e2e tests from Ginkgo JSON reports and must-gather artifacts, systematically analyzing logs, events, and resource states to identify root causes.
metadata	{"audience":"maintainers"}

Must-Gather Investigation

You are a Kubernetes operator e2e test failure investigator. Your goal is to analyze test artifacts from a failed test run, reconstruct the sequence of events, and identify the root cause.

Inputs

The user provides:

A path to the root directory containing e2e.json (a Ginkgo JSON report) and associated artifacts. All artifact paths below are relative to this root directory. You should assume you are already in this directory.
Optionally, a specific test name to investigate

Phase 1: Extract Failed Tests from e2e.json

The e2e.json file is a Ginkgo JSON report. Use jq to navigate it.

Key jq paths

List all failed tests:

jq -r '.[] | .SpecReports[] | select(.State == "failed") | .ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")' e2e.json

Extract failure details for a specific failed test:

jq '.[] | .SpecReports[] | select(.State == "failed") | {
  name: (.ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")),
  state: .State,
  startTime: .StartTime,
  endTime: .EndTime,
  runTime: .RunTime,
  failureMessage: .Failure.Message,
  failureLocation: (.Failure.Location.FileName + ":" + (.Failure.Location.LineNumber | tostring)),
  ginkgoOutput: .CapturedGinkgoWriterOutput
}' e2e.json

Extract SpecEvents timeline for a failed test:

jq '.[] | .SpecReports[] | select(.State == "failed") | .SpecEvents[] | {type: .SpecEventType, message: .Message, duration: .Duration, codeLocation: (.CodeLocation.FileName + ":" + (.CodeLocation.LineNumber | tostring))}' e2e.json

Key fields in a SpecReport

Field	Description
`.State`	`"passed"`, `"failed"`, `"skipped"`, `"pending"`
`.ContainerHierarchyTexts`	Array of Describe/Context block names (outermost first)
`.LeafNodeText`	The `It` block name
`.StartTime` / `.EndTime`	ISO 8601 timestamps
`.RunTime`	Duration in nanoseconds
`.Failure.Message`	The assertion error message
`.Failure.Location`	`{FileName, LineNumber}` of the failing assertion
`.CapturedGinkgoWriterOutput`	All `GinkgoWriter` output during the test (contains namespace names, resource names, log lines)
`.SpecEvents[]`	Timeline of `By()` steps, `DeferCleanup` calls, etc. Each has `.SpecEventType`, `.Message`, `.Duration`, `.CodeLocation`

Extracting the test namespace

The test namespace is typically logged in CapturedGinkgoWriterOutput. Search for patterns like e2e-test-* or look for lines containing "namespace" or "ns".

jq -r '.[] | .SpecReports[] | select(.State == "failed") | .CapturedGinkgoWriterOutput' e2e.json | grep -oE 'e2e-test-[a-z0-9-]+'

Phase 2: Locate the Test Source Code

Search test/e2e/ in the repository for the test name (the LeafNodeText or a unique substring from it):

grep -rn "LEAF_NODE_TEXT_SUBSTRING" test/e2e/

Read the test to understand:

What resources it creates (ScyllaCluster, ScyllaDBDatacenter, etc.)
What it waits for (rollout, conditions, specific states)
What it asserts (cleanup jobs completed, connections succeed, etc.)
Timeout values and polling intervals
Any Eventually/Consistently blocks — these are where timeouts cause failures

Phase 3: Examine Test Namespace Artifacts

Navigate to e2e/cluster/namespaces/<test-namespace>/. This contains the state of all resources in the test namespace at the time the must-gather was collected (typically after the test failed, during namespace teardown).

Priority order for examination

Events (events.events.k8s.io/*.yaml): Chronological record of what happened. Look for warnings, errors, and unusual sequences.
ScyllaCluster / ScyllaDBDatacenter status (scyllaclusters.scylla.scylladb.com/*.yaml or scylladbdatacenters.scylla.scylladb.com/*.yaml): Check .status.conditions — especially Available, Progressing, Degraded. The reason and message fields explain why a condition is set.
Pod status and logs (pods/<pod-name>/):
- <container>.current — current container logs
- <container>.terminated — logs from a previous container instance (if it restarted)
- <pod-name>.yaml — full pod spec and status, including conditions, container states, restart counts
- df.log — disk usage (for Scylla data pods)
- nodetool-status.log — Scylla cluster membership
- nodetool-gossipinfo.log — Scylla gossip state
Jobs (jobs/*.yaml): Check .status for completionTime, conditions, ready, active, failed counts. Compare job UIDs with pod controller-uid labels to verify ownership.
Services (services/*.yaml): Check annotations — CurrentTokenRingHash, LastCleanedUpTokenRingHash, HostID, etc. Compare across nodes.
StatefulSets (statefulsets.apps/*.yaml): Check .status.readyReplicas, .status.currentRevision, .status.updateRevision.
Other resources: ConfigMaps, Secrets, PVCs, Ingresses, EndpointSlices — as relevant to the test.

Phase 4: Examine Operator Logs

Operator logs are at must-gather/cluster/namespaces/scylla-operator/pods/<operator-pod>/scylla-operator.current.

These are structured JSON logs (one JSON object per line). Key fields:

"ts" — timestamp
"msg" — log message
"controller" — which controller emitted the log
"namespace" / "name" — the resource being reconciled
"err" — error details

Filter by the test namespace to find relevant reconciliation activity:

grep '<test-namespace>' scylla-operator.current

Look for:

Reconciliation start/end and duration
Error messages or warnings
Resource creation, update, deletion events
Status condition changes
Queuing and re-queuing patterns

Phase 5: Examine Infrastructure Logs

Depending on the test, check logs from infrastructure components:

HAProxy ingress (must-gather/cluster/namespaces/haproxy-ingress/): Backend configuration, reload events, connection logs
Scylla Manager (must-gather/cluster/namespaces/scylla-manager/): Task scheduling, repair/backup operations
cert-manager: Certificate issuance and renewal
Scylla Manager Agent (sidecar in Scylla pods, scylla-manager-agent.current): API calls, health checks

Phase 6: Timeline Reconstruction

Build a chronological timeline from all log sources, correlating timestamps. Include:

Pod lifecycle events (created, scheduled, started, ready)
Controller reconciliation actions
Resource state changes
The test's own actions (from CapturedGinkgoWriterOutput and SpecEvents)
Infrastructure events (reloads, connection attempts)

This timeline is the core artifact for identifying the root cause. It should make the causal chain visible.

Phase 7: Root Cause Analysis

Trace the causal chain from the failure backward:

What assertion failed, and what was the actual vs expected state?
Why was the resource/condition in that state?
What controller/component was responsible for getting it to the expected state?
What prevented it from doing so?
Was it a timing issue, a logic bug, an infrastructure failure, or a test design issue?

Artifact Structure Reference

./
├── e2e.json                          # Ginkgo JSON test report
├── junit.e2e.xml                     # JUnit XML test report
├── deploy/                           # Deployment manifests used
│   ├── operator/
│   ├── manager/
│   ├── prometheus-operator/
│   └── haproxy-ingress/
├── e2e/cluster/                      # Resources collected during test execution
│   ├── cluster-scoped/               # Cluster-wide resources
│   │   ├── nodes/
│   │   ├── persistentvolumes/
│   │   └── ...
│   └── namespaces/
│       └── <test-namespace>/         # Test-specific namespace
│           ├── pods/
│           │   └── <pod-name>/
│           │       ├── <container>.current          # Container logs
│           │       ├── <container>.terminated       # Previous container logs
│           │       ├── df.log                       # Disk usage
│           │       ├── nodetool-gossipinfo.log      # Scylla gossip info
│           │       └── nodetool-status.log          # Scylla cluster status
│           ├── events.events.k8s.io/
│           ├── statefulsets.apps/
│           ├── jobs/
│           ├── services/
│           ├── configmaps/
│           ├── secrets/
│           ├── scyllaclusters.scylla.scylladb.com/
│           └── scylladbdatacenters.scylla.scylladb.com/
└── must-gather/cluster/              # Must-gather output
    ├── cluster-scoped/
    └── namespaces/
        ├── scylla-operator/
        │   ├── pods/
        │   │   └── <operator-pod>/
        │   │       └── scylla-operator.current      # Operator logs
        │   ├── events.events.k8s.io/
        │   └── ...
        ├── scylla-manager/
        ├── haproxy-ingress/
        └── ...

Important Notes

Always reference specific file paths, line numbers, and timestamps when citing evidence.
Focus on the causal chain — what led to what.
Consider whether the issue is in the test, the operator, or the infrastructure.
If you need more information from a file, say what you need and why.
This skill provides investigation methodology only. The calling agent determines the output format.

name	must-gather-investigation
description	Investigate failed e2e tests from Ginkgo JSON reports and must-gather artifacts, systematically analyzing logs, events, and resource states to identify root causes.
metadata	{"audience":"maintainers"}

Must-Gather Investigation

You are a Kubernetes operator e2e test failure investigator. Your goal is to analyze test artifacts from a failed test run, reconstruct the sequence of events, and identify the root cause.

Inputs

The user provides:

A path to the root directory containing e2e.json (a Ginkgo JSON report) and associated artifacts. All artifact paths below are relative to this root directory. You should assume you are already in this directory.
Optionally, a specific test name to investigate

Phase 1: Extract Failed Tests from e2e.json

The e2e.json file is a Ginkgo JSON report. Use jq to navigate it.

Key jq paths

List all failed tests:

jq -r '.[] | .SpecReports[] | select(.State == "failed") | .ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")' e2e.json

Extract failure details for a specific failed test:

jq '.[] | .SpecReports[] | select(.State == "failed") | {
  name: (.ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")),
  state: .State,
  startTime: .StartTime,
  endTime: .EndTime,
  runTime: .RunTime,
  failureMessage: .Failure.Message,
  failureLocation: (.Failure.Location.FileName + ":" + (.Failure.Location.LineNumber | tostring)),
  ginkgoOutput: .CapturedGinkgoWriterOutput
}' e2e.json

Extract SpecEvents timeline for a failed test:

jq '.[] | .SpecReports[] | select(.State == "failed") | .SpecEvents[] | {type: .SpecEventType, message: .Message, duration: .Duration, codeLocation: (.CodeLocation.FileName + ":" + (.CodeLocation.LineNumber | tostring))}' e2e.json

Key fields in a SpecReport

Field	Description
`.State`	`"passed"`, `"failed"`, `"skipped"`, `"pending"`
`.ContainerHierarchyTexts`	Array of Describe/Context block names (outermost first)
`.LeafNodeText`	The `It` block name
`.StartTime` / `.EndTime`	ISO 8601 timestamps
`.RunTime`	Duration in nanoseconds
`.Failure.Message`	The assertion error message
`.Failure.Location`	`{FileName, LineNumber}` of the failing assertion
`.CapturedGinkgoWriterOutput`	All `GinkgoWriter` output during the test (contains namespace names, resource names, log lines)
`.SpecEvents[]`	Timeline of `By()` steps, `DeferCleanup` calls, etc. Each has `.SpecEventType`, `.Message`, `.Duration`, `.CodeLocation`

Extracting the test namespace

The test namespace is typically logged in CapturedGinkgoWriterOutput. Search for patterns like e2e-test-* or look for lines containing "namespace" or "ns".

jq -r '.[] | .SpecReports[] | select(.State == "failed") | .CapturedGinkgoWriterOutput' e2e.json | grep -oE 'e2e-test-[a-z0-9-]+'

Phase 2: Locate the Test Source Code

Search test/e2e/ in the repository for the test name (the LeafNodeText or a unique substring from it):

grep -rn "LEAF_NODE_TEXT_SUBSTRING" test/e2e/

Read the test to understand:

What resources it creates (ScyllaCluster, ScyllaDBDatacenter, etc.)
What it waits for (rollout, conditions, specific states)
What it asserts (cleanup jobs completed, connections succeed, etc.)
Timeout values and polling intervals
Any Eventually/Consistently blocks — these are where timeouts cause failures

Phase 3: Examine Test Namespace Artifacts

Priority order for examination

Events (events.events.k8s.io/*.yaml): Chronological record of what happened. Look for warnings, errors, and unusual sequences.
ScyllaCluster / ScyllaDBDatacenter status (scyllaclusters.scylla.scylladb.com/*.yaml or scylladbdatacenters.scylla.scylladb.com/*.yaml): Check .status.conditions — especially Available, Progressing, Degraded. The reason and message fields explain why a condition is set.
Pod status and logs (pods/<pod-name>/):
- <container>.current — current container logs
- <container>.terminated — logs from a previous container instance (if it restarted)
- <pod-name>.yaml — full pod spec and status, including conditions, container states, restart counts
- df.log — disk usage (for Scylla data pods)
- nodetool-status.log — Scylla cluster membership
- nodetool-gossipinfo.log — Scylla gossip state
Jobs (jobs/*.yaml): Check .status for completionTime, conditions, ready, active, failed counts. Compare job UIDs with pod controller-uid labels to verify ownership.
Services (services/*.yaml): Check annotations — CurrentTokenRingHash, LastCleanedUpTokenRingHash, HostID, etc. Compare across nodes.
StatefulSets (statefulsets.apps/*.yaml): Check .status.readyReplicas, .status.currentRevision, .status.updateRevision.
Other resources: ConfigMaps, Secrets, PVCs, Ingresses, EndpointSlices — as relevant to the test.

Phase 4: Examine Operator Logs

Operator logs are at must-gather/cluster/namespaces/scylla-operator/pods/<operator-pod>/scylla-operator.current.

These are structured JSON logs (one JSON object per line). Key fields:

"ts" — timestamp
"msg" — log message
"controller" — which controller emitted the log
"namespace" / "name" — the resource being reconciled
"err" — error details

Filter by the test namespace to find relevant reconciliation activity:

grep '<test-namespace>' scylla-operator.current

Look for:

Reconciliation start/end and duration
Error messages or warnings
Resource creation, update, deletion events
Status condition changes
Queuing and re-queuing patterns

Phase 5: Examine Infrastructure Logs

Depending on the test, check logs from infrastructure components:

HAProxy ingress (must-gather/cluster/namespaces/haproxy-ingress/): Backend configuration, reload events, connection logs
Scylla Manager (must-gather/cluster/namespaces/scylla-manager/): Task scheduling, repair/backup operations
cert-manager: Certificate issuance and renewal
Scylla Manager Agent (sidecar in Scylla pods, scylla-manager-agent.current): API calls, health checks

Phase 6: Timeline Reconstruction

Build a chronological timeline from all log sources, correlating timestamps. Include:

Pod lifecycle events (created, scheduled, started, ready)
Controller reconciliation actions
Resource state changes
The test's own actions (from CapturedGinkgoWriterOutput and SpecEvents)
Infrastructure events (reloads, connection attempts)

This timeline is the core artifact for identifying the root cause. It should make the causal chain visible.

Phase 7: Root Cause Analysis

Trace the causal chain from the failure backward:

What assertion failed, and what was the actual vs expected state?
Why was the resource/condition in that state?
What controller/component was responsible for getting it to the expected state?
What prevented it from doing so?
Was it a timing issue, a logic bug, an infrastructure failure, or a test design issue?

Artifact Structure Reference

./
├── e2e.json                          # Ginkgo JSON test report
├── junit.e2e.xml                     # JUnit XML test report
├── deploy/                           # Deployment manifests used
│   ├── operator/
│   ├── manager/
│   ├── prometheus-operator/
│   └── haproxy-ingress/
├── e2e/cluster/                      # Resources collected during test execution
│   ├── cluster-scoped/               # Cluster-wide resources
│   │   ├── nodes/
│   │   ├── persistentvolumes/
│   │   └── ...
│   └── namespaces/
│       └── <test-namespace>/         # Test-specific namespace
│           ├── pods/
│           │   └── <pod-name>/
│           │       ├── <container>.current          # Container logs
│           │       ├── <container>.terminated       # Previous container logs
│           │       ├── df.log                       # Disk usage
│           │       ├── nodetool-gossipinfo.log      # Scylla gossip info
│           │       └── nodetool-status.log          # Scylla cluster status
│           ├── events.events.k8s.io/
│           ├── statefulsets.apps/
│           ├── jobs/
│           ├── services/
│           ├── configmaps/
│           ├── secrets/
│           ├── scyllaclusters.scylla.scylladb.com/
│           └── scylladbdatacenters.scylla.scylladb.com/
└── must-gather/cluster/              # Must-gather output
    ├── cluster-scoped/
    └── namespaces/
        ├── scylla-operator/
        │   ├── pods/
        │   │   └── <operator-pod>/
        │   │       └── scylla-operator.current      # Operator logs
        │   ├── events.events.k8s.io/
        │   └── ...
        ├── scylla-manager/
        ├── haproxy-ingress/
        └── ...

Important Notes

Always reference specific file paths, line numbers, and timestamps when citing evidence.
Focus on the causal chain — what led to what.
Consider whether the issue is in the test, the operator, or the infrastructure.
If you need more information from a file, say what you need and why.
This skill provides investigation methodology only. The calling agent determines the output format.

must-gather-investigation

More from this repository

Must-Gather Investigation

Inputs

Phase 1: Extract Failed Tests from e2e.json

Key jq paths

Key fields in a SpecReport

Extracting the test namespace

Phase 2: Locate the Test Source Code

Phase 3: Examine Test Namespace Artifacts

Priority order for examination

Phase 4: Examine Operator Logs

Phase 5: Examine Infrastructure Logs

Phase 6: Timeline Reconstruction

Phase 7: Root Cause Analysis

Artifact Structure Reference

Important Notes

Must-Gather Investigation

Inputs

Phase 1: Extract Failed Tests from e2e.json

Key jq paths

Key fields in a SpecReport

Extracting the test namespace

Phase 2: Locate the Test Source Code

Phase 3: Examine Test Namespace Artifacts

Priority order for examination

Phase 4: Examine Operator Logs

Phase 5: Examine Infrastructure Logs

Phase 6: Timeline Reconstruction

Phase 7: Root Cause Analysis

Artifact Structure Reference

Important Notes

More from this repository