| name | keeper-stress-analysis |
| description | Analyze ClickHouse Keeper stress-test results from play.clickhouse.com / keeper_stress_tests data warehouse. Use whenever the user asks about Keeper performance, validates Keeper PRs against stress dashboards, investigates regressions or improvements in Keeper nightlies, asks about specific date windows / SHAs / PR-sets in Keeper stress tests, wants per-PR or window-vs-window comparisons, asks "did this PR break Keeper", asks "what changed in Keeper between dates", or wants a summary report of Keeper stress runs. Triggers on terms like "keeper stress", "keeper PR", "keeper p99", "keeper memory", "keeper rps", "keeper nightly", "keeper-stress-tests", "keeper validation", "keeper regression", or any question referencing the keeper-stress Grafana dashboard. ALWAYS prefer this skill over re-deriving the workflow from scratch — it captures hard-learned lessons about cgroup-vs-Keeper memory, bench-harness confounds, noise floors, and per-PR attribution limits. |
| argument-hint | ["<date-window>|<pr-list>|<question>"] |
| disable-model-invocation | false |
| allowed-tools | Bash(curl:*), Bash(python3:*), Bash(awk:*), Bash(mkdir:*), Bash(ls:*), Bash(wc:*), Bash(grep:*), Bash(sort:*), Bash(cat:*), Bash(gh:*), Bash(realpath:*), Bash(cp:*), Bash(chmod:*), Bash(sed:*), Read, Write, Edit, Glob, Grep |
Keeper Stress-Test Analysis Skill
Analyse ClickHouse Keeper stress-test results from keeper_stress_tests.keeper_metrics_ts on play.clickhouse.com (the same data warehouse the Grafana keeper-stress-run-details dashboard reads from). The skill captures a tested end-to-end workflow plus hard-earned methodology lessons — use it instead of re-deriving the analysis each time.
When to use
This skill triggers on these kinds of requests:
- "Validate these Keeper PRs against the stress tests"
- "What changed in Keeper between 2026-04-01 and 2026-05-01?"
- "Did PR #X cause any regression in Keeper stress?"
- "Why did p99 spike on date Y?"
- "Give me a summary report of Keeper stress runs"
- "How are the Keeper PRs performing on the dashboard?"
The skill does NOT touch any other CI data — it's specific to the Keeper stress framework.
Skill home + working dir convention
- Skill home: the directory containing this SKILL.md. May be either
- user-level:
~/.claude/skills/keeper-stress-analysis/, or
- project-level:
<repo>/.claude/skills/keeper-stress-analysis/.
Both work; scripts/rebuild.sh resolves the home from its own location.
- Working dir (default):
tmp/keeper_stress_skill/ under the user's current directory
- The orchestrator
scripts/rebuild.sh accepts a working-dir argument as $1, a lower-bound TS-filter as $2, and an optional upper-bound TS-filter as $3 (default 9999-12-31 = unbounded)
Workflow — five phases
Phase 1 — Capture intent
Parse the user's request into one of these shapes:
| Request shape | Indicators | Pipeline to run |
|---|
| Date-range window | "between A and B", "since X", "last N weeks" | Cumulative-gains pipeline (Method B in references/methodology.md) |
| PR set | List of PR numbers, "validate these PRs", "33 PRs" | Per-PR + per-nightly pipeline |
| Single PR drill-down | One PR number, "did #X cause", "regression from #X" | Per-PR card with adjacent-nightly + PR-branch isolation |
| Free-form analytical | "Why did metric M change on date D?" | Time-series check + cross-reference references/known_confounds.md |
If the user hasn't given a window, ask (date range OR PR set OR specific question). Default window: from 2026-03-25 (when the current framework began) to today.
Phase 2 — Pull staging data
Always run all 6 SQL queries first into <work_dir>/staging/:
SKILL_HOME="$(find ~/.claude/skills .claude/skills -maxdepth 2 -type d -name keeper-stress-analysis 2>/dev/null | head -1)"
"$SKILL_HOME/scripts/rebuild.sh" tmp/keeper_stress_skill 2026-03-25
rebuild.sh takes three optional args: $1 work dir, $2 lower-bound timestamp filter (default 2026-03-25, when the current keeper-stress framework went live), and $3 upper-bound timestamp filter (default 9999-12-31, i.e. unbounded — same behaviour as a single-arg invocation). The bounds are injected into the SQL via {{TS_FILTER}} / {{TS_FILTER_END}} placeholders and exported to the Python pipeline as KEEPER_SKILL_THRESHOLD / KEEPER_SKILL_THRESHOLD_END (read by build_pr_nightly_map.py and compute_deltas.py for the in-window vs out-of-window split). The upper bound is exclusive (ts < $3), so for "between A and B" inclusive on B, pass B+1 day. To analyse a different window, pass new values — no source edits needed.
For a closed date-range analysis like "what changed between 2026-04-01 and 2026-05-01?", invoke:
"$SKILL_HOME/scripts/rebuild.sh" tmp/keeper_stress_skill 2026-04-01 2026-05-01
The rebuild.sh script:
- Copies
queries/*.sql and scripts/*.py into the work dir.
- Runs each query against
https://play.clickhouse.com/?user=play via curl --data-urlencode.
- Substitutes
{{TS_FILTER}} placeholder if present in the SQL.
- Builds
merged_metrics.tsv (one row per scenario × backend × commit; ~95+ columns covering bench, prom, mntr, container metrics).
- If
<work_dir>/../pr_meta.tsv exists, builds the per-PR pipeline too.
- Builds
cumulative_gains.tsv and cumulative_gains_summary.tsv.
The 6 staging files dropped under staging/:
bench_summary.tsv — bench-side summary (rps, p99, errors, ops, mem)
prom_rates.tsv — Keeper prom counters as rates per node
prom_gauges.tsv — Keeper prom gauges + cumulative-failure counters
mntr.tsv — ZK 4LW mntr outputs
container.tsv — cgroup CPU + memory
pr_branches.tsv — PR-branch smoke stress runs (3 scenarios per branch)
Phase 3 — Build derived tables
Pick the right script(s) based on Phase 1's intent:
All Python steps are dispatched through keeper_stress.py:
| Intent | Run |
|---|
| Date-range window | python3 keeper_stress.py cumulative |
| PR set | python3 keeper_stress.py prmap + deltas + prmetrics (requires pr_meta.tsv) |
| PR-branch isolation | python3 keeper_stress.py prisol (reads PR list + branch from pr_to_nightly.tsv) |
| Two-commit Δ | python3 keeper_stress.py diff <shaA> <shaB> (8-char prefix or full; reads merged_metrics.tsv) |
| Noise calibration | python3 keeper_stress.py noise (per-(scenario, backend, metric) median / stddev / cv / p95 across the window) |
| Free-form | none — query merged_metrics.tsv directly with awk/python |
The individual step files (build_metrics_table.py, build_pr_nightly_map.py, etc.) are still importable modules; keeper_stress.py dispatches to their main() entrypoints.
The per-PR markdown matrix is not generated by a script — Claude composes it directly from per_pr_metrics_long.tsv (long-form per-PR Δ rows) plus per_pr_summary.tsv (cumulative window numbers), using examples/sample_outputs/PR_PERF_TABLE.md as the structural template. Verdicts are Claude's judgment based on the data + Phase 5 caveats — there is no canonical lookup table.
For PR-set work, the user needs to provide a pr_meta.tsv mapping PR number → title, mergedAt, mergeCommit, base, headRefName. If absent, generate it via gh:
{
printf 'pr\ttitle\tmergedAt\tmergeCommit\tbase\theadRefName\n'
for pr in <numbers>
do
out=$(gh pr view "$pr" --repo ClickHouse/ClickHouse \
--json title,mergedAt,mergeCommit,baseRefName,headRefName \
-q '[.title,.mergedAt,.mergeCommit.oid,.baseRefName,.headRefName] | @tsv' 2>/dev/null)
printf '%s\t%s\n' "$pr" "$out"
done
} > tmp/keeper_stress_skill/../pr_meta.tsv
(The pr_meta.tsv lives one level above the work dir so all scripts can find it. headRefName is required for build_pr_branch_isolated.py — without it, that step silently produces no rows.)
Phase 4 — Generate output
Pick the deliverable that matches the request:
| User wants | Where to look |
|---|
| Summary report | references/report_templates.md (full / tight / one-liner monospace templates with placeholders) |
| Per-PR Markdown table | examples/sample_outputs/PR_PERF_TABLE.md (canonical example: data-backed per-PR table with co-merge attribution and noise-floor caveats) |
| Cumulative-gains write-up | Build from cumulative_gains_summary.tsv using references/report_templates.md formatting |
| Per-PR mover matrix / progress attribution | Build from per_pr_metrics_long.tsv (produced by build_per_pr_metrics_tsv.py) and per_pr_summary.tsv (from compute_deltas.py) |
| Full validation report | Compose from the per-PR table + cumulative-gains + caveats sections; mirror the structure of examples/sample_outputs/PR_PERF_TABLE.md |
Cross-reference the canonical example when filling templates. Never invent prose without a backing data source.
Free-form recipes
Common questions that don't need a new pipeline — just awk over an existing TSV:
awk -F'\t' '$1=="<PR>" && $3 ~ /-fault\[/' tmp/keeper_stress_skill/per_pr_scenario_deltas.tsv
awk -F'\t' '$1=="<PR>" && $3=="prod-mix-no-fault[default]"' tmp/keeper_stress_skill/per_pr_scenario_deltas.tsv
awk -F'\t' 'NR==1 || $1=="<PR>"' tmp/keeper_stress_skill/per_pr_summary.tsv
awk -F'\t' 'NR==1 {for(i=1;i<=NF;i++) if($i=="<METRIC>") c=i; next}
$1=="<scenario>" && $2=="<backend>" && $c!="" {print $5, $4, $c}' \
tmp/keeper_stress_skill/merged_metrics.tsv | sort
These three patterns cover the bulk of "drill into one slice of the data" requests; reach for the pipeline only when the slice doesn't already exist as a column or row in merged_metrics.tsv / per_pr_*.tsv.
Phase 5 — Apply learned-the-hard-way checks
Before quoting any number, run these checks:
Memory check (THE most common trap)
If a memory delta > 5 % is reported, separately query:
container_memory_bytes (the cgroup peak — sensitive to bench page cache)
KeeperApproximateDataSize (Keeper's own state report)
If the cgroup moved but KeeperApproximateDataSize did NOT, the delta is bench-side, not Keeper-side. See references/known_confounds.md for PR #100670 example.
awk -F'\t' '
NR==1 {next}
$1==SCENARIO && $2==BACKEND {
date=$5; gsub(/ .*/, "", date)
printf "%s sha=%s KeeperApproxDataSize=%5.2fGB container_peak=%5.2fGB\n",
date, $4, $7/1e9, $72+0
}' merged_metrics.tsv | sort
Step-change check
If a metric changes as a single-day step across multiple unrelated scenarios, it's almost certainly bench-side. Cross-reference references/known_confounds.md:
#100670 "keeper-bench: go faster" landed 2026-04-04 — affects read-heavy memory + multi-write error_pct
#101801 "keeper-bench: more features" landed 2026-04-11 — affects rocks-side write-multi memory
Noise-floor check
The single-nightly Δ noise floor is ±3-5 % on rps/p99. The typos PR #102739 (which cannot affect Keeper performance) shows ±5 % rps Δ via PR-branch isolation — that's the floor. Never claim sub-3 % per-PR effects without an isolation method.
The PR-branch isolation pool widens to ±1 ISO week if the same-week pool has fewer than 2 entries (using datetime arithmetic so year/W01/W52-53 boundaries are handled). Same-branch runs are excluded from the pool so prior WIP commits don't bias the median toward the PR's own change. Full method in references/methodology.md Method C and the docstring at the top of scripts/build_pr_branch_isolated.py.
CPU spike check
container_cpu_usage_usec rates can spike to spurious 18-38 cores from counter discontinuities. Always use p95_cpu_cores, never max_cpu_cores. See references/metric_glossary.md.
Server-side failure check
Always verify these four counters are ZERO across the entire window:
KeeperCommitsFailed
KeeperSnapshotCreationsFailed
KeeperSnapshotApplysFailed
KeeperRequestRejectedDueToSoftMemoryLimitCount
Any non-zero value overrides any positive verdict.
There are two scopes to check, with different sources:
- Per-PR (post-merge only):
compute_deltas.py already counts these in
the first nightly that includes the PR and emits server_failures_post
in per_pr_summary.tsv. A non-zero value flips the PR's verdict to
regression(server-failure).
- Across the entire window: run the
awk recipe below against
staging/prom_gauges.tsv to confirm zero across every master nightly in
the window. This is the gate; the per-PR check is the per-PR symptom.
awk -F'\t' '
NR>1 && $4 ~ /(CommitsFailed|SnapshotCreationsFailed|SnapshotApplysFailed|RejectedSoftMemoryLimit)/ && $5+0 > 0 {
print
}' staging/prom_gauges.tsv
Co-merge contamination check
When the user provides a PR list and you compute master adjacent-nightly Δs, the same Δ is jointly attributable to all PRs that landed in the same nightly window. Always include a co_merged column in per-PR tables. Never credit a single PR for joint-window deltas at >5 % effect size.
Reference files (load on demand)
When you need deeper guidance, read these into context:
references/methodology.md — comparison-method choice (adjacent-nightly vs median-of-3 vs PR-branch isolation), significance bands, environment-offset correction.
references/known_confounds.md — catalog of bench-harness PRs that move dashboard metrics; updated as new ones are observed.
references/metric_glossary.md — what every column in keeper_metrics_ts measures, and which ones to NOT use (e.g. max_cpu_cores, raw container_memory_bytes for "Keeper memory").
references/report_templates.md — three monospace templates (full / tight / one-liner) with placeholder format.
Verifying the analysis is correct
Spot-check three known data points (these are all baked into examples/sample_outputs/):
-
Master e02b59d7 (2026-04-02) on write-multi-no-fault[default] must show errors=0 (pre-bench-jump). Master 18dfe15a (2026-04-04) same scenario must show errors≈325k, error_pct≈3.67. If divergent, the bench-summary query is wrong.
-
All four hard-failure counters across all master nightlies since 2026-03-25 must be zero. If any non-zero, either the data is corrupt or there's a real failure to report.
-
740b4a5 (keeper-object-based-snapshots branch) prod-mix-no-fault[default] must show rps=5,764, read_p99=545 ms, write_p99=535 ms, errors=0. This was the canonical #99651 validation point.
Examples
Example 1 — single PR drill-down
User: "Did PR #99651 cause any Keeper regression?"
Process:
- Fetch PR meta:
gh pr view 99651 --repo ClickHouse/ClickHouse --json title,mergedAt,mergeCommit
- Run
rebuild.sh tmp/keeper_stress_skill 2026-03-25
- Filter merged_metrics for the pre-merge nightly (
fdf46ee1) vs post-merge nightly (e02b59d7) on prod-mix-no-fault[default] and write-multi-no-fault[default]
- Apply Phase 5 memory check: pull both
container_memory_bytes and KeeperApproximateDataSize. The prod-mix peak_mem 2.92→2.72 GB (-6.9%) shows up on cgroup but KeeperApproximateDataSize is flat → conclude this is bench-side noise OR snapshot-timing artifact, not real Keeper improvement.
- Confirm
KeeperSnapshotApplysFailed=0 across 18 follow-on nightlies.
- Output: per-PR card with the verdict "clean — no regression; the prod-mix peak_mem drop is a single-nightly cgroup artifact, not a Keeper-state reduction".
Example 2 — date-range window
User: "What changed in Keeper between 2026-04-01 and 2026-05-01?"
Process:
- Run
rebuild.sh tmp/keeper_stress_skill 2026-04-01 2026-05-01 — both bounds are required for a closed window; the third arg pins the upper bound (ts < 2026-05-01) so newer nightlies don't drift into the result.
- Run
build_cumulative_gains.py — produces cumulative_gains_summary.tsv with median-of-3 vs median-of-3 deltas.
- Apply Phase 5 checks — flag the bench-harness changes from
known_confounds.md that landed in this window. For 2026-04-01 → 2026-05-01 both #100670 (2026-04-04) and #101801 (2026-04-11) are in-window, so call them out as confounds for any read-heavy memory or rocks-side write-multi memory deltas.
- Output: a cumulative-gains write-up built from
cumulative_gains_summary.tsv using references/report_templates.md formatting, with conservative deltas + caveats (always include the bench-harness confound notes from references/known_confounds.md if any of those PR dates fall in the window).
Example 3 — Summary report
User: "Give me a summary of these PRs: ..."
Process:
- Build
pr_meta.tsv from the PR list using gh.
- Run full pipeline (rebuild.sh + per-PR scripts).
- Categorize PRs by intent (perf cohort by code path, correctness, tooling, refactor, net-zero).
- Fill in
references/report_templates.md "full" template.
- Apply Phase 5 caveats — if any in-window bench-harness changes, mention by PR number.
Output discipline
When the user is asking for analysis (not a templated report), produce:
- Headline finding — 1-2 sentences. State the verdict directly.
- Backing table — every claim has a specific scenario+metric+number.
- Caveats — note any noise-floor, co-merge, or bench-harness limitations.
Never produce confident per-PR percentages below 5 % effect size without explicit isolation evidence.
When the user has been pushing for rigor, default to the conservative method (median-of-3 + PR-branch isolation) and report ranges, not point estimates.
Editing this skill
If you change anything in scripts/_common.py — particularly classify, iso_week, CLASSIFY_BANDS, or HEADLINE_METRICS — run the unit-test harness before pushing:
cd <skill_home>/scripts && python3 -m unittest tests.test_common -v
The 29 cases gate the per-metric significance bands, the ISO-year-boundary widening, the SHA-prefix matcher, the KEEPER_SKILL_THRESHOLD_END default, and the unmerged-PR guard. If they fail, methodology and code have drifted apart — fix one to match the other (the references/methodology.md rubric is the binding contract).