diagnose-sunbeam
// Use when diagnosing failed Sunbeam CI runs, analyzing sosreport tarballs, juju status files, sunbeam CLI logs, or investigating multi-node OpenStack deployment failures on Canonical K8s
| name | diagnose-sunbeam |
| description | Use when diagnosing failed Sunbeam CI runs, analyzing sosreport tarballs, juju status files, sunbeam CLI logs, or investigating multi-node OpenStack deployment failures on Canonical K8s |
Prove less, qualify more. Symptoms are not causes. A diagnosis must separate what the artifacts show from what you infer. Name a root cause only when direct evidence supports it. When the artifacts prove only the failure surface, say so.
Every diagnosis MUST follow these steps in order. Do not skip steps or reorder them.
1. Extract archives
2. Identify deployment phase from artifacts
3. Collect observed facts with file:line references
4. Run false-negative checklist (MANDATORY before declaring real failure)
5. Identify the immediate failure surface (what directly failed)
6. Evaluate candidate mechanisms (what might have caused it)
7. Check counter-evidence for each candidate
8. Classify each claim: Confirmed / Supported / Speculative
9. Write DIAGNOSTICS.md
When artifacts conflict, higher-ranked evidence wins:
| Rank | Source | What it proves |
|---|---|---|
| 1 | Remote CLI completion log (sosreport-*/logs/sunbeam-*.log ending in ResultType.COMPLETED) | The remote operation succeeded, regardless of what the CI transport reports |
| 2 | Juju status + cluster list at collection time | Actual deployment state at a known timestamp |
| 3 | CI output log (generated-sunbeam-output.log) exit codes and stderr | What the CI runner observed — may not reflect remote reality |
| 4 | Juju debug logs | Agent lifecycle events — useful for sequencing, not for proving external causes |
| 5 | Pod logs, sosreport system logs | Supporting detail — confirms or weakens a hypothesis |
Key rule: A CI exit code 255 with "Broken pipe" is a Rank 3 observation. A remote CLI log showing ResultType.COMPLETED is Rank 1. Rank 1 overrides Rank 3. Always check before concluding the operation failed.
Partial log caveat: ResultType.COMPLETED on the last step of a CLI log confirms full completion. ResultType.COMPLETED on an intermediate step only confirms that step succeeded. Check which step is last before declaring the entire operation succeeded.
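The partial-log caveat can be checked mechanically. The following is a minimal sketch, assuming result tokens appear in the CLI log as literal `ResultType.NAME` strings, as the examples in this document suggest. The function names `last_result_entry` and `log_completed` are hypothetical helpers, not part of the sunbeam CLI.

```shell
# Print "LINE:ResultType.NAME" for the final ResultType token in a log.
# Assumption: tokens are written literally as ResultType.<UPPERCASE_NAME>.
last_result_entry() {
    # -n prefixes each match with its line number, -o prints only the match.
    grep -no 'ResultType\.[A-Z_]*' "$1" 2>/dev/null | tail -n 1
}

# Succeed only when the final ResultType token in the log is COMPLETED.
log_completed() {
    case "$(last_result_entry "$1")" in
        *:ResultType.COMPLETED) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (paths follow the sosreport layout used in this document):
#   last_result_entry sosreport-*/home/ubuntu/snap/openstack/common/logs/sunbeam-*.log
```

If the reported line number is far from the end of the file, read the remaining lines by hand before treating the whole operation as complete.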
| Observation | Why it's a red herring | What to do instead |
|---|---|---|
| Interactive prompt in SSH stdout (Configure endpoint services? [y/n]) | Often appears in the output buffer BEFORE the real work ran; does not prove the prompt caused the hang | Check remote CLI logs for what actually executed |
| Post-failure SSH host key change (REMOTE HOST IDENTIFICATION HAS CHANGED) | Proves the host identity changed AFTER the failure was recorded; does not prove MAAS re-provisioned DURING the operation | State: "host key changed post-failure; cause of the original failure is not established by this artifact" |
| No route to host after an SSH session dies | Proves the node was unreachable at cleanup time; does not prove when or why it became unreachable | Report it as a post-failure observation, not a cause |
| Downstream blocked/waiting units | A unit in blocked after another unit failed is a consequence, not a cause | Trace back to the first unit that left active/idle |
| generated-monitor.log showing "No new models found" on single-node | Normal for single-node: the controller is on the remote LXD, not visible to the CI monitor | Ignore for single-node topology |
cd /path/to/run-directory
# Extract sosreport tarballs (mknod errors are expected and harmless)
for f in generated-sunbeam-sosreport-*.tar.xz; do
    tar -xf "$f" 2>/dev/null
done
# Extract pod log tarballs
for f in generated-sunbeam-pods_*_logs.tgz; do
    tar -xzf "$f" 2>/dev/null
done
Determine how far the deployment got:
| Artifacts Present | Phase |
|---|---|
| No juju status files, no sunbeam CLI logs in sosreports | Pre-bootstrap |
| Juju status files exist, only bootstrap node has CLI logs | Bootstrap (joins never started) |
| Juju status files exist, multiple nodes have CLI logs | Join phase |
| Validation logs exist (generated-sunbeam-validation_*.log) | Post-deploy / test |
| Plugin enable output visible in generated-sunbeam-output.log | Plugin enable |
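The table above can be approximated with a file-presence check. This is a rough sketch under stated assumptions: the function name `classify_phase` is hypothetical, the precedence order (validation logs first) is my own choice since the table does not rank its rows, and the plugin-enable row is left out because it depends on reading generated-sunbeam-output.log rather than on which files exist.

```shell
# Guess the deployment phase from which artifacts exist in a run directory.
# Run this after the archives have been extracted.
classify_phase() {
    run_dir=${1:-.}

    # Count sosreport directories holding at least one sunbeam CLI log.
    nodes_with_logs=0
    for sos in "$run_dir"/sosreport-*/; do
        if ls "$sos"home/ubuntu/snap/openstack/common/logs/sunbeam-*.log >/dev/null 2>&1; then
            nodes_with_logs=$((nodes_with_logs + 1))
        fi
    done

    has_status=no
    if ls "$run_dir"/generated-sunbeam-juju_status_*.txt >/dev/null 2>&1; then
        has_status=yes
    fi

    has_validation=no
    if ls "$run_dir"/generated-sunbeam-validation_*.log >/dev/null 2>&1; then
        has_validation=yes
    fi

    if [ "$has_validation" = yes ]; then
        echo "post-deploy"
    elif [ "$has_status" = yes ] && [ "$nodes_with_logs" -gt 1 ]; then
        echo "join"
    elif [ "$has_status" = yes ] && [ "$nodes_with_logs" -eq 1 ]; then
        echo "bootstrap"
    elif [ "$has_status" = no ] && [ "$nodes_with_logs" -eq 0 ]; then
        echo "pre-bootstrap"
    else
        # Combination not covered by the table; inspect by hand.
        echo "unclear"
    fi
}
```

Treat the result as a starting point only; an "unclear" answer, or a failed plugin enable, still needs the output log read directly.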
Quick artifact inventory:
# Which key files exist?
ls generated-sunbeam-output.log generated-sunbeam-juju_status_*.txt \
    generated-sunbeam-validation_*.log generated-foundation.log 2>/dev/null
# Count CLI logs per node (most logs = likely bootstrap node)
for sos in sosreport-*/; do
    hostname=$(echo "$sos" | sed 's/sosreport-\(.*\)-20[0-9]*-.*/\1/')
    count=$(ls "$sos/home/ubuntu/snap/openstack/common/logs"/*.log 2>/dev/null | wc -l)
    echo "$hostname: $count CLI logs"
done
Topology: Check config-sunbeam.yaml or the run directory name for SKU:
- single_node: one node, LXD-backed controller (monitor "No new models" is normal)
- external_juju: pre-existing controller, generated-foundation.log covers MAAS layer

Read the primary artifacts for the identified phase. Record only what the file directly shows, with file:line references.
For every phase — always check these first:
# SSH transport failures
grep -n "client_loop: send disconnect: Broken pipe" generated-sunbeam-output.log
grep -n "Connection to .* closed by remote host" generated-sunbeam-output.log
grep -n "REMOTE HOST IDENTIFICATION HAS CHANGED" generated-sunbeam-output.log generated-github-runner-run.log
Phase-specific checks:
- generated-sunbeam-output.log: check for wipefs errors, CalledProcessError, ERROR: No external connectivity
- generated-foundation.log: check for MAAS API errors (ServerError: 500, 400 Bad Request); does the error body name a specific issue or is it generic? Also check for pg_dump failures and Python tracebacks
- generated-github-runner-run.log: check for ssh-keyscan failures (timeout vs connection refused?), Testflinger provisioning errors, and "Mirror sync in progress" (if multiple nodes fail the same curtin step simultaneously, a mirror issue is a candidate)
- sosreport-*/sos_commands/block/lsblk: does the node have the expected disks, or only an OS drive?
- generated-sunbeam-output.log: exit codes, stdout/stderr of bootstrap command
- sosreport-<bootstrap>/home/ubuntu/snap/openstack/common/logs/sunbeam-*.log: search for the final ResultType entry
- pexpect.exceptions.TIMEOUT: juju add-machine connectivity issue
- generated-sunbeam-output.log: per-node join exit codes, stdout, stderr
- ResultType entry
- generated-sunbeam-juju_status_openstack-machines.txt: unit states, scale
- generated-sunbeam-sunbeam_cluster_list.txt: which nodes are present
- generated-sunbeam-output.log: which command failed, exit code, timeout messages
- generated-sunbeam-juju_status_openstack.txt: unit states
- generated-sunbeam-validation_*.log: test results, HTTP errors
- Pod logs (generated/sunbeam/logs-openstack-*.txt): service-level errors

You MUST run this checklist before declaring any failure real. This is not optional.
| Check | How | If true |
|---|---|---|
| CI stderr shows client_loop: send disconnect: Broken pipe? | Grep output.log | SSH transport died; remote may have succeeded |
| Remote CLI log ends with ResultType.COMPLETED? | Read last 20 lines of sosreport sunbeam-*.log for the failing node | Remote operation succeeded; CI failure is a false negative |
| sunbeam_cluster_list.txt shows all expected nodes? | Compare node count to config-nodes.yaml | Cluster formed successfully |
| All juju units active/idle? | Scan both juju status files | Deployment is healthy |
| All K8s pods Running with 0 restarts? | Check kubectl_get_pod.txt | K8s layer is healthy |
If all five checks pass: the failure is a false negative. State this directly. Do not speculate about what "might have" gone wrong.
If the remote CLI log is not available (no sosreport, or sosreport from wrong node): state that the false-negative check is inconclusive, not that the operation failed.
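Three of the five checks lend themselves to a quick script. The sketch below covers the Broken pipe grep, the last-20-lines check, and the pod table scan; the cluster list and juju status checks are left to manual review because their file layouts are not described here. All function names are hypothetical, and `unhealthy_pods` assumes the pod file is plain `kubectl get pod` table output with a header row containing STATUS and RESTARTS columns.

```shell
# Check 1: did the CI log record an SSH transport failure?
ci_saw_broken_pipe() {
    grep -q 'client_loop: send disconnect: Broken pipe' "$1" 2>/dev/null
}

# Check 2: does ResultType.COMPLETED appear in the last 20 lines of the
# remote CLI log? The partial-log caveat still applies to the result.
remote_log_completed() {
    tail -n 20 "$1" 2>/dev/null | grep -q 'ResultType\.COMPLETED'
}

# Check 5: print every pod row that is not Running or has restarted.
# Column positions are taken from the header row, so an optional leading
# NAMESPACE column does not matter.
unhealthy_pods() {
    awk '
        !have_header {
            for (i = 1; i <= NF; i++) {
                if ($i == "STATUS") s = i
                if ($i == "RESTARTS") r = i
            }
            if (s && r) have_header = 1
            next
        }
        $s == "STATUS" { next }    # skip repeated header rows
        NF >= r && ($s != "Running" || $r != "0") { print }
    ' "$1"
}
```

Empty output from `unhealthy_pods` means check 5 passed only if the file had a recognisable header; an empty or unparsable file also prints nothing, so confirm the file is non-empty first.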
State what directly failed using only the facts collected in Step 2. This is the narrowest true statement you can make.
Examples of good failure surface statements:
- "Broken pipe for the SSH session running sunbeam cluster join on node X at timestamp T."
- "wipefs -a /dev/disk/by-dname/disk1 on node X returned No such file or directory."
- "sunbeam enable loadbalancer exited with wait timed out after 900s at timestamp T."

Examples of over-claiming (avoid these):
For each plausible explanation, apply this template:
Candidate: [mechanism name]
Suggests it: [what symptom points here]
Would confirm it: [what artifact would prove this]
Would contradict it: [what artifact would disprove this]
Artifact check result: [what you actually found]
Status: Confirmed / Supported / Speculative
Status definitions:
- Confirmed: direct evidence in the artifacts (for example, ResultType.COMPLETED proving the operation succeeded; lsblk shows no disk1 proving the disk is missing)

Counter-evidence requirement: For every candidate you evaluate, you MUST check for at least one artifact that would weaken the claim. If you skip this, the diagnosis is incomplete.
Missing remote evidence rule: If no sosreport exists for the failing node (node was down at collection time, or sosreport is from a different host like a Juju controller LXD), then no claim about what happened on that node can be Confirmed. You may list candidates as Supported or Speculative, but the diagnosis MUST state: "Remote-side evidence unavailable for node X; mechanism not established."
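Whether remote-side evidence exists for a node can be checked before any mechanism is discussed. This is a minimal sketch, assuming extracted sosreport directories are named `sosreport-<hostname>-...` as in the inventory loop earlier in this document. The function names are hypothetical, and a hostname that is a prefix of another hostname (for example `node` and `node-2`) would need a stricter match.

```shell
# Succeed if at least one sunbeam CLI log exists in a sosreport for the node.
# $1 = node hostname, $2 = run directory (defaults to the current directory).
has_remote_evidence() {
    ls "${2:-.}"/sosreport-"$1"-*/home/ubuntu/snap/openstack/common/logs/sunbeam-*.log \
        >/dev/null 2>&1
}

# Print the sentence the diagnosis must carry when evidence is missing.
remote_evidence_note() {
    if has_remote_evidence "$1" "${2:-.}"; then
        echo "Remote CLI logs found for node $1."
    else
        echo "Remote-side evidence unavailable for node $1; mechanism not established."
    fi
}
```

The second message matches the wording required by the rule above, so it can be pasted into the diagnosis as-is.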
These are patterns observed in prior CI failures. Use them as hypotheses to test, not conclusions to assume. Each requires the evidence listed under "Confirms it" before you can claim it.
Broken Pipe (false negative candidate)
- client_loop: send disconnect: Broken pipe in stderr
- ResultType.COMPLETED
- ERROR/Exception, or no remote CLI log exists

PTY Allocation Failure (false negative candidate)
- Pseudo-terminal will not be allocated in stderr, empty stdout, exit 255

Missing disk
- wipefs: error: /dev/disk/by-dname/disk1: No such file or directory
- lsblk in sosreport shows no secondary disk on that node
- lsblk shows the disk exists (maybe a symlink issue instead)

MAAS API failure
- ServerError: 500 or 400 Bad Request in generated-foundation.log
- maasserver_routable_pairs does not exist

VM boot timeout
- ssh-keyscan retries exhausted in generated-github-runner-run.log

Agent loss after model migration
- unknown/lost state
- QUIESCE + agent.conf left unchanged or invalid entity name or password for that specific unit

Terraform state lock contention
- Error acquiring the state lock in CLI logs

Timeout near-miss
- wait timed out after N in output.log

Traefik hook backlog after TLS
- maintenance after sunbeam enable tls
- ingress-relation-changed hook executions on that unit spanning the timeout window

MySQL connection exhaustion
- MySQL Error (1040) in any pod log

Validation test failures
- validation_*.log

Node became unreachable mid-run
- Connection to ... closed by remote host followed by No route to host
- uptime shows reboot during the run, OR sosreport shows fresh image (snap list = only snapd+core, missing juju dir)

| File | Contents | Priority |
|---|---|---|
| generated-sunbeam-output.log | CI execution log (SSH commands, exit codes, stdout/stderr per node) | 1 |
| sosreport-*/home/ubuntu/snap/openstack/common/logs/sunbeam-*.log | Sunbeam CLI logs per node (remote-side execution; outranks CI transport) | 1 |
| generated-sunbeam-juju_status_openstack.txt | K8s model: OpenStack service unit states | 2 |
| generated-sunbeam-juju_status_openstack-machines.txt | Machine model: bare-metal unit states | 2 |
| generated-sunbeam-sunbeam_cluster_list.txt | Cluster membership | 2 |
| generated-foundation.log | MAAS/Terraform infra provisioning | 2 (pre-bootstrap) |
| generated-sunbeam-validation_*.log | Validation test results | 2 (post-deploy) |
| generated-sunbeam-juju_debug_log_openstack-machines.txt | Machine model agent logs | 3 |
| generated-sunbeam-juju_debug_log_openstack.txt | K8s model agent logs | 3 |
| generated/sunbeam/logs-openstack-*.txt | Extracted pod logs per service | 3 (post-deploy) |
| generated-github-runner-run.log | CI orchestrator, Python tracebacks | 4 |
| generated-sunbeam-show_units_openstack.txt | Unit relation data | 4 |
| generated-sunbeam-juju_debug_log_controller.txt | Controller agent logs | 4 |
| generated-monitor.log | Deployment monitor ("No new models" is normal on single-node) | 5 |
| generated-sunbeam-manifest.yaml | Manifest (database topology, channel) | 5 |
| generated-sunbeam-kubectl_get_pod.txt | K8s pod status | 5 |
| config-sunbeam.yaml | Deployment config (osd_devices, roles) | 5 |
| generated-lastlines.txt | Last CI output lines + tracebacks | 5 |
| sosreport-*/sos_commands/block/lsblk | Block devices per node | On-demand |
| sosreport-*/sos_commands/snap/snap_list.txt | Installed snaps (detect fresh image) | On-demand |
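The "detect fresh image" use of snap_list.txt can be sketched as a filter over the snap names. This assumes the file holds plain `snap list` output (a header row starting with `Name`, then one snap per row) and treats any `core`, `coreNN`, or `snapd` entry as a base snap. The function name `only_base_snaps` is hypothetical, and the sketch does not cover the "missing juju dir" part of the pattern.

```shell
# Succeed when snap_list.txt lists only snapd and core* snaps, which the
# "Node became unreachable mid-run" pattern treats as a sign of a fresh image.
only_base_snaps() {
    awk '
        NR == 1 && $1 == "Name" { next }              # skip the header row
        NF == 0 { next }                              # skip blank lines
        $1 !~ /^(snapd|core[0-9]*)$/ { extra = 1 }    # any other snap present
        { rows++ }
        END { exit (rows > 0 && !extra) ? 0 : 1 }
    ' "$1" 2>/dev/null
}

# Example:
#   only_base_snaps sosreport-<node>/sos_commands/snap/snap_list.txt \
#       && echo "only base snaps installed"
```

A match is supporting evidence only; pair it with the uptime check before claiming the node was re-imaged during the run.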
Produce a DIAGNOSTICS.md in the CI run directory:
# Diagnostics: <run-id>
## Verdict
**FALSE NEGATIVE / REAL FAILURE / INCONCLUSIVE** — one sentence stating what is proven.
## Failure Phase
Which phase: pre-bootstrap / bootstrap / join / plugin-enable / validation
## Immediate Failure Surface
What directly failed, stated narrowly with file:line references. No mechanism claims here.
## Observed Facts
Bullet list of what the artifacts show, each with a file:line reference.
Include facts that support AND facts that weaken any candidate explanation.
## Candidate Causes
### 1. [Name] — Status: CONFIRMED / SUPPORTED / SPECULATIVE
What suggests it, what confirms it, what was checked as counter-evidence.
**Evidence for:**
- `file:line` — what it shows
**Evidence against / not found:**
- What was checked and what it showed (or "not available")
### 2. ...
## What Is Not Established
Explicitly list plausible-sounding claims that the artifacts do NOT prove.
Example: "The artifacts do not establish whether the node was re-provisioned by MAAS
or crashed for another reason."
## Juju Status Summary
Units in error/blocked/lost, or "All healthy", or "N/A".
## Key Log Files Examined
Table of files checked with one-line findings.
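To start each report from the same skeleton, the section headings above can be written out by a small helper. This is a sketch only; the function name `init_diagnostics` and its arguments are hypothetical, and it writes just the headings, leaving every section body to the analyst.

```shell
# Write an empty DIAGNOSTICS.md skeleton into a run directory.
# $1 = run directory, $2 = run id used in the title.
init_diagnostics() {
    out="$1/DIAGNOSTICS.md"
    # Refuse to overwrite an existing diagnosis.
    if [ -e "$out" ]; then
        echo "already exists: $out" >&2
        return 1
    fi
    {
        printf '# Diagnostics: %s\n\n' "$2"
        for section in \
            "Verdict" \
            "Failure Phase" \
            "Immediate Failure Surface" \
            "Observed Facts" \
            "Candidate Causes" \
            "What Is Not Established" \
            "Juju Status Summary" \
            "Key Log Files Examined"
        do
            printf '## %s\n\n' "$section"
        done
    } > "$out"
}
```

Calling `init_diagnostics /path/to/run-directory <run-id>` a second time fails rather than replacing a report that is already in progress.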
When diagnosing multiple CI runs, dispatch one subagent per run directory. Each agent follows the full algorithm independently.