| name | cost-investigation |
| description | Use when a cloud bill already spiked and nobody knows why — storage or egress jumped, a specific bucket's cost ballooned, a cost-anomaly alert fired, or finance is asking "what changed this week?" Queries the billing export by service/bucket/region, diffs the suspect window against a prior baseline, and ranks the top movers. Insists on accounting for the 24-48h billing lag before declaring a trend, and points the investigation at the usual culprit: cross-region egress. Not for cost forecasting/estimation, general spend optimization, pricing questions, budget-alert setup, or latency "expense" — only for diagnosing an unexpected spike that already happened. |
Cloud Cost Spike Investigation
Overview
A bill spike is almost always one dimension moving, hidden inside an aggregate that
only shows the total going up. This skill turns "the bill jumped" into "bucket
X in region Y egressed 4 TB cross-region starting Tuesday" by querying the
billing export, diffing the suspect window against a matched baseline window, and
ranking what actually moved. It is read-only: it explains the spike and does not
change anything.
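As a rough illustration of the diff-and-rank idea, the sketch below sums cost per (service, bucket, region) in each window and sorts the groups by absolute delta. It is not the code in scripts/cost_query.py. The row fields (date, service, bucket, region, cost), the function name rank_movers, and the sample values are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def rank_movers(rows, suspect, baseline, top=5):
    """Rank (service, bucket, region) groups by absolute cost delta.

    rows: dicts with hypothetical keys date, service, bucket, region, cost.
    suspect, baseline: (start, end) date tuples, both ends inclusive.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [baseline, suspect]
    for row in rows:
        day = date.fromisoformat(row["date"])
        key = (row["service"], row["bucket"], row["region"])
        if baseline[0] <= day <= baseline[1]:
            totals[key][0] += row["cost"]
        elif suspect[0] <= day <= suspect[1]:
            totals[key][1] += row["cost"]
        # Rows outside both windows are ignored.
    movers = [(key, after - before) for key, (before, after) in totals.items()]
    movers.sort(key=lambda item: abs(item[1]), reverse=True)
    return movers[:top]


# Made-up sample rows, for illustration only.
rows = [
    {"date": "2026-06-16", "service": "egress", "bucket": "logs", "region": "us-east", "cost": 10.0},
    {"date": "2026-06-19", "service": "egress", "bucket": "logs", "region": "us-east", "cost": 250.0},
    {"date": "2026-06-16", "service": "storage", "bucket": "logs", "region": "us-east", "cost": 40.0},
    {"date": "2026-06-19", "service": "storage", "bucket": "logs", "region": "us-east", "cost": 41.0},
]
movers = rank_movers(
    rows,
    suspect=(date(2026, 6, 18), date(2026, 6, 20)),
    baseline=(date(2026, 6, 15), date(2026, 6, 17)),
)
for key, delta in movers:
    print(key, f"{delta:+.2f}")
```

Sorting by absolute delta rather than percentage keeps small line items with large relative swings from crowding out the group that actually moved the bill.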
When to Use
Use this skill when you see symptoms like:
- The monthly/weekly cloud bill is up sharply with no obvious release behind it.
- Storage or egress line items jumped and you need to localize the source.
- Finance or a cost dashboard flags an anomaly and asks "what changed?"
- A specific bucket, project, or region's cost ballooned.
- Someone asks "why is our egress bill so high this week?"
Do NOT use this skill when:
- You are reclaiming idle/orphaned GPU resources: use gpu-node-orphans.
- The apparent change is confined to the most recent 1-2 days; that data is
likely still incomplete because of billing lag, so it is not yet a real trend
(see Gotchas).
- There is no billing export available to query — get the export enabled first.
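The billing-lag exclusion above can be checked mechanically before any diffing starts. The sketch below is one possible guard, assuming a fixed two-day lag (the upper end of the 24-48h range). The names LAG_DAYS, settled_cutoff, and window_is_settled are hypothetical and are not part of scripts/cost_query.py.

```python
from datetime import date, timedelta

# Assumption: treat the most recent 2 calendar days (today and yesterday)
# as incomplete, matching the 24-48h lag described in this skill.
LAG_DAYS = 2


def settled_cutoff(today, lag_days=LAG_DAYS):
    """Return the last day whose billing data should be complete."""
    return today - timedelta(days=lag_days)


def window_is_settled(window_end, today, lag_days=LAG_DAYS):
    """True if the window ends on or before the settled cutoff."""
    return window_end <= settled_cutoff(today, lag_days)


today = date(2026, 6, 22)  # made-up "current" date for the example
print(settled_cutoff(today))                       # last trustworthy day
print(window_is_settled(date(2026, 6, 20), today))  # window ends on the cutoff
print(window_is_settled(date(2026, 6, 21), today))  # window includes yesterday
```

If a window fails this check, the safer move is to pull its end date back to the cutoff or wait for the export to catch up, rather than interpreting the partial days.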
Investigation flow
- Pick windows. Choose a suspect window (where the spike appears) and a
baseline window of the same length immediately before it. Equal-length windows
make the diff meaningful.
- Diff and rank. Run scripts/cost_query.py to group cost by service, bucket,
and region, diff suspect vs baseline, and rank the top movers by absolute delta.
python scripts/cost_query.py --input export.csv --window 3
python scripts/cost_query.py --input export.json \
--suspect 2026-06-18:2026-06-20 --baseline 2026-06-15:2026-06-17 \
--service-filter egress storage
- Localize. Read the top movers. The biggest absolute delta is your lead.
Cross-region egress and a single hot bucket are the two most common findings.
- Explain, don't just total. A spike in egress with flat storage points to a
backfill/replication job, not new data at rest. Separate per-request cost from
storage cost before concluding.
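The window-picking step in the flow above comes down to simple date arithmetic. The sketch below derives an equal-length baseline that ends the day before the suspect window starts. It assumes the START:END strings shown in the example command have inclusive endpoints. The helper names parse_window and matched_baseline are hypothetical and are not taken from scripts/cost_query.py.

```python
from datetime import date, timedelta


def parse_window(text):
    """Parse 'YYYY-MM-DD:YYYY-MM-DD' into (start, end), both inclusive."""
    start_text, end_text = text.split(":")
    start, end = date.fromisoformat(start_text), date.fromisoformat(end_text)
    if end < start:
        raise ValueError(f"window ends before it starts: {text}")
    return start, end


def matched_baseline(suspect):
    """Return the equal-length window immediately before the suspect window."""
    start, end = suspect
    length = (end - start) + timedelta(days=1)  # inclusive day count
    return start - length, start - timedelta(days=1)


suspect = parse_window("2026-06-18:2026-06-20")
baseline = matched_baseline(suspect)
print(baseline[0].isoformat(), baseline[1].isoformat())
```

For the suspect window used in the example command, this yields 2026-06-15 through 2026-06-17, the same baseline passed there by hand.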
Gotchas
- ALWAYS account for the 24-48h billing lag. Billing exports land a day or two
late, so today and yesterday read artificially low. Never declare a "drop" near
the edge of the data, and never use an incomplete final day as a baseline — it
will manufacture a fake spike or hide a real one.
- ALWAYS check cross-region egress first. It is the most common cause of a
surprise bill. Same-region traffic is often free or cheap; cross-region and
internet egress are not. Group by region pair before anything else.
- ALWAYS separate per-request cost from storage cost. A bucket can spike on
request volume (millions of GETs) while storage stays flat — that is a workload
change, not data growth. Treating them as one number hides the actual cause.
- ALWAYS consider an invisible backfill job. A one-off migration or replication
backfill can drive enormous egress for a few days and then vanish, leaving a spike
with no standing footprint. If egress jumped but storage and request baselines
look normal afterward, look for a job that ran in the window.
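Grouping by region pair, as the cross-region gotcha above recommends, might look like the following sketch. Real billing exports differ in how they record source and destination, so the fields cost_type, src_region, dst_region, and cost are assumptions, as are the function names and sample rows.

```python
from collections import Counter


def egress_by_region_pair(rows):
    """Sum egress cost per (src_region, dst_region), largest first.

    rows: dicts with hypothetical keys cost_type, src_region, dst_region, cost.
    Non-egress rows (storage, requests) are skipped so they stay separate.
    """
    totals = Counter()
    for row in rows:
        if row["cost_type"] != "egress":
            continue
        totals[(row["src_region"], row["dst_region"])] += row["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def is_cross_region(pair):
    """True when source and destination regions differ."""
    return pair[0] != pair[1]


# Made-up sample rows, for illustration only.
rows = [
    {"cost_type": "egress", "src_region": "us-east", "dst_region": "us-east", "cost": 5.0},
    {"cost_type": "egress", "src_region": "us-east", "dst_region": "eu-west", "cost": 120.0},
    {"cost_type": "egress", "src_region": "us-east", "dst_region": "eu-west", "cost": 80.0},
    {"cost_type": "storage", "src_region": "us-east", "dst_region": "us-east", "cost": 40.0},
]
pairs = egress_by_region_pair(rows)
for pair, cost in pairs:
    label = "cross-region" if is_cross_region(pair) else "same-region"
    print(pair, f"{cost:.2f}", label)
```

Filtering to egress rows before summing also keeps storage and per-request cost out of the total, in line with the third gotcha.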
Files
- SKILL.md: this file; the investigation flow and cost gotchas.
- scripts/cost_query.py: queries a billing export by service/bucket/region, diffs
the suspect window against a baseline, and ranks the top movers. Read-only.