must-gather-investigation
// Investigate failed e2e tests from Ginkgo JSON reports and must-gather artifacts, systematically analyzing logs, events, and resource states to identify root causes.
| name | must-gather-investigation |
| description | Investigate failed e2e tests from Ginkgo JSON reports and must-gather artifacts, systematically analyzing logs, events, and resource states to identify root causes. |
| metadata | {"audience":"maintainers"} |
You are a Kubernetes operator e2e test failure investigator. Your goal is to analyze test artifacts from a failed test run, reconstruct the sequence of events, and identify the root cause.
The user provides a root directory containing e2e.json (a Ginkgo JSON report) and associated artifacts. All artifact paths below are relative to this root directory. You should assume you are already in this directory.

The e2e.json file is a Ginkgo JSON report. Use jq to navigate it.
List all failed tests:
jq -r '.[] | .SpecReports[] | select(.State == "failed") | .ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")' e2e.json
Extract failure details for a specific failed test:
jq '.[] | .SpecReports[] | select(.State == "failed") | {
  name: (.ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")),
  state: .State,
  startTime: .StartTime,
  endTime: .EndTime,
  runTime: .RunTime,
  failureMessage: .Failure.Message,
  failureLocation: (.Failure.Location.FileName + ":" + (.Failure.Location.LineNumber | tostring)),
  ginkgoOutput: .CapturedGinkgoWriterOutput
}' e2e.json
Extract SpecEvents timeline for a failed test:
jq '.[] | .SpecReports[] | select(.State == "failed") | .SpecEvents[] | {type: .SpecEventType, message: .Message, duration: .Duration, codeLocation: (.CodeLocation.FileName + ":" + (.CodeLocation.LineNumber | tostring))}' e2e.json
| Field | Description |
|---|---|
| .State | "passed", "failed", "skipped", "pending" |
| .ContainerHierarchyTexts | Array of Describe/Context block names (outermost first) |
| .LeafNodeText | The It block name |
| .StartTime / .EndTime | ISO 8601 timestamps |
| .RunTime | Duration in nanoseconds |
| .Failure.Message | The assertion error message |
| .Failure.Location | {FileName, LineNumber} of the failing assertion |
| .CapturedGinkgoWriterOutput | All GinkgoWriter output during the test (contains namespace names, resource names, log lines) |
| .SpecEvents[] | Timeline of By() steps, DeferCleanup calls, etc. Each has .SpecEventType, .Message, .Duration, .CodeLocation |
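
Because .RunTime is reported in nanoseconds, converting it to seconds makes it easier to compare against the timeouts used in the test. The following is a minimal sketch, assuming jq is installed; the function name failed_runtimes is hypothetical and not part of the repository.

```shell
# Print "<seconds><TAB><It block name>" for every failed spec in a Ginkgo JSON report.
# Assumption: the report has the same shape as in the jq commands above.
failed_runtimes() {
  report=$1
  jq -r '.[] | .SpecReports[]
         | select(.State == "failed")
         | "\(.RunTime / 1e9)\t\(.LeafNodeText)"' "$report"
}
```

Calling `failed_runtimes e2e.json` prints one tab-separated line per failed test. A runtime that is close to a known Eventually timeout suggests the test failed by waiting rather than by an immediate assertion error.
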
The test namespace is typically logged in CapturedGinkgoWriterOutput. Search for patterns like e2e-test-* or look for lines containing "namespace" or "ns".
jq -r '.[] | .SpecReports[] | select(.State == "failed") | .CapturedGinkgoWriterOutput' e2e.json | grep -oE 'e2e-test-[a-z0-9-]+'
Search test/e2e/ in the repository for the test name (the LeafNodeText or a unique substring from it):
grep -rn "LEAF_NODE_TEXT_SUBSTRING" test/e2e/
Read the test to understand:
- Eventually/Consistently blocks: these are where timeouts cause failures

Navigate to e2e/cluster/namespaces/<test-namespace>/. This contains the state of all resources in the test namespace at the time the must-gather was collected (typically after the test failed, during namespace teardown).
Events (events.events.k8s.io/*.yaml): Chronological record of what happened. Look for warnings, errors, and unusual sequences.
ScyllaCluster / ScyllaDBDatacenter status (scyllaclusters.scylla.scylladb.com/*.yaml or scylladbdatacenters.scylla.scylladb.com/*.yaml): Check .status.conditions — especially Available, Progressing, Degraded. The reason and message fields explain why a condition is set.
Pod status and logs (pods/<pod-name>/):
- <container>.current: current container logs
- <container>.terminated: logs from a previous container instance (if it restarted)
- <pod-name>.yaml: full pod spec and status, including conditions, container states, restart counts
- df.log: disk usage (for Scylla data pods)
- nodetool-status.log: Scylla cluster membership
- nodetool-gossipinfo.log: Scylla gossip state

Jobs (jobs/*.yaml): Check .status for completionTime, conditions, ready, active, failed counts. Compare job UIDs with pod controller-uid labels to verify ownership.
Services (services/*.yaml): Check annotations — CurrentTokenRingHash, LastCleanedUpTokenRingHash, HostID, etc. Compare across nodes.
StatefulSets (statefulsets.apps/*.yaml): Check .status.readyReplicas, .status.currentRevision, .status.updateRevision.
Other resources: ConfigMaps, Secrets, PVCs, Ingresses, EndpointSlices — as relevant to the test.
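
A first pass over the namespace dump can be scripted. The sketch below is illustrative only: the function names warning_events and stale_statefulsets are hypothetical, and it assumes each YAML file holds a single object serialized in plain block style, so that an Event's type, reason and note appear as top-level keys at the start of a line.

```shell
# List Warning events in a namespace dump, followed by their reason and note lines.
# Only the first line of a multi-line note is shown.
warning_events() {
  ns_dir=$1
  grep -l '^type: Warning' "$ns_dir"/events.events.k8s.io/*.yaml 2>/dev/null |
    while IFS= read -r f; do
      printf '%s\n' "$f"
      grep -E '^(reason|note):' "$f" | sed 's/^/  /'
    done
}

# Report StatefulSets whose .status.currentRevision differs from .status.updateRevision,
# which usually means a rollout had not finished when the artifacts were collected.
stale_statefulsets() {
  ns_dir=$1
  for f in "$ns_dir"/statefulsets.apps/*.yaml; do
    [ -f "$f" ] || continue
    awk -v file="$f" '
      $1 == "currentRevision:" { cur = $2 }
      $1 == "updateRevision:"  { upd = $2 }
      END { if (cur != upd) printf "%s: %s -> %s\n", file, cur, upd }
    ' "$f"
  done
}
```

Both functions take the namespace directory as their only argument, for example `warning_events e2e/cluster/namespaces/<test-namespace>`. They only narrow down where to look; the matching files still need to be read in full.
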
Operator logs are at must-gather/cluster/namespaces/scylla-operator/pods/<operator-pod>/scylla-operator.current.
These are structured JSON logs (one JSON object per line). Key fields:
- "ts": timestamp
- "msg": log message
- "controller": which controller emitted the log
- "namespace" / "name": the resource being reconciled
- "err": error details

Filter by the test namespace to find relevant reconciliation activity:
grep '<test-namespace>' scylla-operator.current
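
Since the log is one JSON object per line, jq can filter on the parsed fields instead of on raw text. This is a hedged sketch: operator_logs_for_ns is a hypothetical helper name, and it assumes that "namespace" is a top-level key of each log object. It is therefore stricter than the grep above, which also matches the namespace inside messages and errors.

```shell
# Print "<ts><TAB><controller><TAB><msg><TAB><err>" for log entries of one namespace.
# Lines that are not valid JSON are skipped rather than aborting the run.
operator_logs_for_ns() {
  ns=$1
  log=$2
  jq -rR --arg ns "$ns" '
    fromjson?
    | select(type == "object")
    | select(.namespace == $ns)
    | [.ts, .controller, .msg, (.err // "" | tostring)]
    | @tsv
  ' "$log"
}
```

For example, `operator_logs_for_ns <test-namespace> scylla-operator.current` gives a compact, tab-separated view that is easy to sort or paste into a timeline. If it prints nothing, fall back to the plain grep, because the assumption about the key layout may not hold.
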
Look for:
Depending on the test, check logs from infrastructure components:
- HAProxy Ingress (must-gather/cluster/namespaces/haproxy-ingress/): Backend configuration, reload events, connection logs
- Scylla Manager (must-gather/cluster/namespaces/scylla-manager/): Task scheduling, repair/backup operations
- Scylla Manager Agent (scylla-manager-agent.current): API calls, health checks

Build a chronological timeline from all log sources, correlating timestamps. Include:
- Test steps and output (from CapturedGinkgoWriterOutput and SpecEvents)

This timeline is the core artifact for identifying the root cause. It should make the causal chain visible.
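
Once the relevant lines from each source have been reduced to text that starts with a timestamp, merging them is mechanical. The sketch below uses a hypothetical helper named merge_timeline and assumes every input line begins with a timestamp in the same format, time zone and precision (for example RFC 3339 in UTC), so that a plain byte-wise sort orders the lines chronologically. Timestamps with mixed precision or mixed time zones need to be normalized first.

```shell
# Merge timestamp-prefixed lines from several files into one chronological stream.
# Output columns: <timestamp> TAB <source file> TAB <original line>
merge_timeline() {
  for f in "$@"; do
    awk -v src="$f" 'NF { printf "%s\t%s\t%s\n", $1, src, $0 }' "$f"
  done | LC_ALL=C sort -s -t "$(printf '\t')" -k1,1
}
```

The source column keeps track of which component emitted each line, which helps when reading the causal chain across the test output, the operator and the Scylla pods.
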
Trace the causal chain from the failure backward:
./
├── e2e.json  # Ginkgo JSON test report
├── junit.e2e.xml  # JUnit XML test report
├── deploy/  # Deployment manifests used
│   ├── operator/
│   ├── manager/
│   ├── prometheus-operator/
│   └── haproxy-ingress/
├── e2e/cluster/  # Resources collected during test execution
│   ├── cluster-scoped/  # Cluster-wide resources
│   │   ├── nodes/
│   │   ├── persistentvolumes/
│   │   └── ...
│   └── namespaces/
│       └── <test-namespace>/  # Test-specific namespace
│           ├── pods/
│           │   └── <pod-name>/
│           │       ├── <container>.current  # Container logs
│           │       ├── <container>.terminated  # Previous container logs
│           │       ├── df.log  # Disk usage
│           │       ├── nodetool-gossipinfo.log  # Scylla gossip info
│           │       └── nodetool-status.log  # Scylla cluster status
│           ├── events.events.k8s.io/
│           ├── statefulsets.apps/
│           ├── jobs/
│           ├── services/
│           ├── configmaps/
│           ├── secrets/
│           ├── scyllaclusters.scylla.scylladb.com/
│           └── scylladbdatacenters.scylla.scylladb.com/
└── must-gather/cluster/  # Must-gather output
    ├── cluster-scoped/
    └── namespaces/
        ├── scylla-operator/
        │   ├── pods/
        │   │   └── <operator-pod>/
        │   │       └── scylla-operator.current  # Operator logs
        │   ├── events.events.k8s.io/
        │   └── ...
        ├── scylla-manager/
        ├── haproxy-ingress/
        └── ...