| name | cohort-compare |
| description | Use when someone wants to compare two cohorts' retention or conversion, asks whether a difference between segments is "real" or "significant", wants retention curves for an A/B group or a launch vs control, or says one cohort "looks better" and needs the delta flagged with a p-value. Reach for this whenever the question is two-group comparison plus significance — and especially before eyeballing two percentages and declaring a winner, which ignores sample size and observation-window mismatch. |
cohort-compare
Overview
This skill compares two cohorts on retention or conversion and tells you whether the difference is statistically meaningful, not just numerically different. It gives Claude scripts/cohorts.py, which provides build_cohort, retention_curve, and compare. compare runs a two-proportion z-test (chi_square is available for the 2×2 contingency form) and returns the delta, the p-value, and a confidence interval.
The job is to stop "Cohort A is 42% and Cohort B is 39%, so A is better" from shipping as a conclusion. With 80 users per arm, that 3-point gap is noise. This skill makes the significance call explicit and surfaces the traps (small samples, survivorship, mismatched windows, segment drift) that make naïve comparisons wrong.
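The 80-users-per-arm claim can be checked by hand. The sketch below is an illustration of the arithmetic only, not the code in scripts/cohorts.py: the function name two_proportion_z and the counts (34/80 vs 31/80, the nearest whole-user match to 42% vs 39%) are assumptions made for the example. It uses a pooled standard error for the test and an unpooled (Wald) interval for the delta, which is one common convention.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x_a, n_a, x_b, n_b, alpha=0.05):
    """Pooled two-sided two-proportion z-test plus an unpooled Wald CI for the delta."""
    p_a, p_b = x_a / n_a, x_b / n_b
    delta = p_a - p_b
    # Pooled standard error: used for the test, under H0 that both rates are equal.
    pooled = (x_a + x_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(delta / se_pooled)))
    # Unpooled standard error: used for the confidence interval around the delta.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return delta, p_value, (delta - margin, delta + margin)


# Hypothetical counts: 34/80 (42.5%) vs 31/80 (38.75%).
delta, p_value, (ci_low, ci_high) = two_proportion_z(34, 80, 31, 80)
print(f"n=80:     delta={delta:+.3f}  p={p_value:.2f}  CI=({ci_low:+.3f}, {ci_high:+.3f})")

# The same rates with 10,000 users per arm.
_, p_large, _ = two_proportion_z(4250, 10_000, 3875, 10_000)
print(f"n=10000:  p={p_large:.2g}")
```

At 80 users per arm the p-value is far above 0.05 and the interval spans zero; at 10,000 per arm the same gap is clearly detectable. In real work, call compare() rather than this sketch.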
When to Use
Use this skill when the request is to:
- Compare retention between two cohorts (e.g. May signups vs June signups, treatment vs control).
- Compare a single conversion rate between two segments and decide if the delta is real.
- Produce a retention curve (D1/D7/D30…) for one or both cohorts.
- Sanity-check a claim that "segment X converts better than Y."
- Decide whether a launch moved a metric or whether the change is within noise.
Do NOT use this for: building a single funnel or figuring out which events/tables to join — that is funnel-query (this skill consumes cohorts of users, it does not discover the event→step mapping). Do NOT use it for infra/ops dashboards — that is grafana. This skill stops at the comparison + significance verdict; it does not run multi-arm experiment analysis or causal inference.
Cohorts and where their definitions live
A cohort is a set of canonical user_ids plus a per-user cohort_ts (the moment they entered, e.g. signup time) used to age each user for retention. Resolve users to the canonical user_id the same way funnel-query does — retention fragments badly if a user is split across anonymous ids.
Segment definitions are not free-text. They live in the segment registry (analytics.segment_definitions, keyed by segment_id, each row carrying the SQL predicate and a version). When you compare "power users" vs "casual users", pull the cohort from a registered segment_id and record its version in your output, so the comparison is reproducible and you can detect if the definition changed under you (see Gotchas: segment-definition drift). Do not paraphrase a segment inline.
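Recording the version is what makes drift detectable later. A minimal sketch of that check, assuming a cohort keeps a (segment_id, version) pair and that the registry's current versions can be read into a dict; SegmentRef, drifted_segments, the integer versions, and the values shown are all hypothetical and not part of scripts/cohorts.py:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentRef:
    """Hypothetical record of the registry row a cohort was built from."""
    segment_id: str
    version: int  # assumed integer; the real registry may version differently


def drifted_segments(recorded, current_versions):
    """Return segment_ids whose registry version no longer matches the recorded one."""
    return [
        ref.segment_id
        for ref in recorded
        if current_versions.get(ref.segment_id) != ref.version
    ]


# Versions recorded when the cohorts were built (made-up values).
recorded = [SegmentRef("seg_power_users", 3), SegmentRef("seg_casual_users", 1)]
# Versions the registry reports today (made-up values).
current_versions = {"seg_power_users": 4, "seg_casual_users": 1}

print(drifted_segments(recorded, current_versions))  # ['seg_power_users']
```

Any segment this returns should be rebuilt before its cohort is compared against a freshly built one.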
The library
scripts/cohorts.py exposes:
- build_cohort(segment_id=None, *, user_ids=None, start, end, version=None): materializes a cohort of canonical user_ids with their cohort_ts, either from a registered segment_id (pinned to a version) or from an explicit list. Records the segment version used.
- retention_curve(cohort, horizons=(1, 7, 14, 30)): for each horizon day, the fraction of the cohort still active, computed over every user old enough to have reached that horizon. Churned users stay in the denominator (no survivorship inflation) and users too young for the horizon are left out. Returns counts and rates per horizon.
- compare(a, b, *, metric, horizon=None): compares cohort A vs B on a conversion or a retention horizon. Runs a two-proportion z-test and returns Delta(p_value, ci_low, ci_high, significant, ...). Use chi_square(...) for the 2×2 contingency form when you prefer it. Without a continuity correction the chi-square statistic is the square of the pooled z statistic, so the two-sided p-values match; with a correction they differ at small N and converge as N grows.
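The denominator rule behind retention_curve can be sketched as follows. This is an illustration under stated assumptions, not the library's code: retention_at is a hypothetical helper, users are modeled as (cohort_ts, last_active_ts) pairs, and "still active at day N" is taken to mean the last activity fell N or more days after entry, which is only one of several retention definitions.

```python
from datetime import datetime, timedelta


def retention_at(users, horizon_days, as_of):
    """Retention at one horizon over users old enough to have reached it.

    `users` is a list of (cohort_ts, last_active_ts) pairs; last_active_ts may be None.
    Returns (retained, eligible, rate).
    """
    horizon = timedelta(days=horizon_days)
    # Denominator: everyone who entered at least `horizon_days` ago, churned or not.
    eligible = [(start, last) for start, last in users if as_of - start >= horizon]
    retained = sum(
        1 for start, last in eligible if last is not None and last - start >= horizon
    )
    rate = retained / len(eligible) if eligible else None
    return retained, len(eligible), rate


as_of = datetime(2026, 7, 1)
users = [
    (datetime(2026, 5, 1), datetime(2026, 6, 20)),   # 61 days old, active for 50 days
    (datetime(2026, 5, 10), datetime(2026, 5, 12)),  # 52 days old, churned after 2 days
    (datetime(2026, 5, 20), None),                   # 42 days old, never came back
    (datetime(2026, 6, 25), datetime(2026, 6, 30)),  # 6 days old, too young for D7 and D30
]

for day in (1, 7, 30):
    print(day, retention_at(users, day, as_of))
```

At D30 the denominator is the three users who have aged 30 days, including the two who churned, so the rate is 1/3. Filtering to active users first would leave one user and report 100%.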
Claude composes these. Example: "did June signups retain better than May at D7?"
```python
from scripts.cohorts import build_cohort, retention_curve, compare

# One calendar month of signups per cohort.
may = build_cohort(start="2026-05-01", end="2026-06-01")
june = build_cohort(start="2026-06-01", end="2026-07-01")

# Show the full curves first, then test the single horizon the question asks about.
print(retention_curve(may))
print(retention_curve(june))
verdict = compare(may, june, metric="retention", horizon=7)
print(verdict)
```
For two registered segments on a conversion metric:
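```python
from scripts.cohorts import build_cohort, compare

# Same window for both segments; pass version=... to pin a specific registry version.
power = build_cohort(segment_id="seg_power_users", start="2026-06-01", end="2026-06-23")
casual = build_cohort(segment_id="seg_casual_users", start="2026-06-01", end="2026-06-23")
print(compare(power, casual, metric="paid_conversion"))
```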
Don't railroad to compare() when the right move is to show the whole curve — sometimes the answer is "they diverge only after D14," which a single-horizon test hides.
Gotchas
ALWAYS treat these as real, observed failure modes — each has turned noise into a "result" that someone acted on.
- Small samples make almost any delta look like a win: let the p-value, not the point estimate, decide. A 3–5 point retention gap on ~100 users per arm is routinely non-significant. compare() returns significant and a confidence interval precisely so you don't report a direction the data can't support. If significant is False, say "no detectable difference," not "A is slightly better." Report the CI; "+3pp (95% CI −6 to +12)" is honest, "+3pp" is not.
- Survivorship bias: retention denominators must include users who could have churned, not just the ones who stuck around. Computing D30 retention over only the users who are still active is circular: it will read ~100%. retention_curve denominators are the full cohort that is old enough to have reached the horizon. Never filter the cohort to "active users" before measuring retention.
- Mismatched observation windows make comparisons meaningless. If Cohort A has had 60 days to retain and Cohort B only 20, their D30 numbers aren't comparable: B's recent users haven't had the chance yet. Only compare horizons both cohorts have fully aged into. retention_curve excludes users younger than each horizon; when you compare, make sure both cohorts are fully aged at that horizon, or the younger cohort is measured on only its earliest members (a smaller, possibly unrepresentative sample). Prefer comparing cohorts of equal maturity, or cap the horizon to the younger cohort's age.
- Segment-definition drift silently changes what you're comparing. "Power users" today may be a different SQL predicate than last month if someone edited the registry. Always build cohorts from a pinned segment_id + version and print the version in your output. If you compare a cohort built last week (old version) against one built today (new version), the delta may be definitional, not behavioral. When in doubt, rebuild both at the same version.
- A two-proportion z-test assumes independent users and a real binary outcome. Don't apply it to averages (e.g. mean sessions per user); those need a t-test or Mann-Whitney test, not compare(metric=...). Don't double-count a user who appears in both cohorts; cohorts should be disjoint for a clean comparison. If a user is in both (e.g. overlapping segments), decide on an assignment rule before testing.
- Multiple horizons = multiple tests; one of them "winning" at p<0.05 can easily happen by chance. If you test D1, D7, D14, and D30 and only D14 is significant, be skeptical: four independent tests at α = 0.05 would produce at least one false positive about 19% of the time (1 − 0.95⁴), not 5%. Either pre-register the horizon you care about, or correct for multiple comparisons before celebrating a single significant cell.
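The multiple-horizons gotcha can be made concrete. The sketch below computes the family-wise error rate for independent tests and applies a Holm-Bonferroni step-down correction. Holm is one standard choice picked here for illustration; this document does not prescribe a correction method, the helper names are hypothetical, and the p-values are made up. Retention horizons on the same cohort are correlated, so the independence figure is an approximation.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: map each test name to whether it survives correction."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (0-indexed rank) is compared against alpha / (m - rank);
        # once one test fails, every larger p-value fails too.
        still_rejecting = still_rejecting and p <= alpha / (m - rank)
        decisions[name] = still_rejecting
    return decisions


# Hypothetical p-values for one cohort pair tested at four horizons.
p_by_horizon = {"D1": 0.41, "D7": 0.22, "D14": 0.03, "D30": 0.18}

print(round(family_wise_error_rate(4), 3))  # 0.185
print(holm_reject(p_by_horizon))
```

Here D14's p = 0.03 would have to clear 0.05 / 4 = 0.0125 to count, so no horizon survives the correction.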
Files
scripts/cohorts.py — build_cohort, retention_curve, compare (two-proportion z-test), chi_square (2×2 contingency), and the Delta result type with p-value + confidence interval. Compose these; do not reimplement the significance math inline.