cohort-compare

Stars: 0

Forks: 0

Updated: June 24, 2026, 03:26

Use when someone wants to compare two cohorts' retention or conversion, asks whether a difference between segments is "real" or "significant", wants retention curves for an A/B group or a launch vs control, or says one cohort "looks better" and needs the delta flagged with a p-value. Reach for this whenever the question is two-group comparison plus significance — and especially before eyeballing two percentages and declaring a winner, which ignores sample size and observation-window mismatch.


Source: az9713/skill-best-practices (GitHub repository)

SKILL.md

name

cohort-compare

description

cohort-compare

Overview

This skill compares two cohorts on retention or conversion and tells you whether the difference is statistically meaningful, not just numerically different. It hands Claude scripts/cohorts.py — build_cohort, retention_curve, and a significance test (compare runs a two-proportion z-test, with chi-square available for the contingency form) returning the delta, the p-value, and a confidence interval.

The job is to stop "Cohort A is 42% and Cohort B is 39%, so A is better" from shipping as a conclusion. With 80 users per arm, that 3-point gap is noise. This skill makes the significance call explicit and surfaces the traps (small samples, survivorship, mismatched windows, segment drift) that make naïve comparisons wrong.
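To make the 80-users-per-arm claim concrete, here is a minimal worked sketch of a pooled two-proportion z-test in plain Python. It is illustrative arithmetic only, not the code in scripts/cohorts.py; the function name `two_proportion_z` and the counts (34 and 31 retained out of 80, roughly 42% and 39%) are assumptions chosen to match the example above. In real work, call compare() rather than reimplementing the math.

```python
import math

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test (hypothetical helper, for illustration).

    Returns (delta, z, two-sided p-value, 95% CI for delta).
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    delta = p_a - p_b
    # The test statistic uses the pooled rate under the null of no difference.
    pooled = (x_a + x_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = delta / se_pooled
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # The interval for the difference uses the unpooled standard error.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (delta - 1.96 * se_unpooled, delta + 1.96 * se_unpooled)
    return delta, z, p_value, ci

# Assumed counts: 34/80 (about 42%) vs 31/80 (about 39%).
delta, z, p_value, (ci_low, ci_high) = two_proportion_z(34, 80, 31, 80)
print(f"delta={delta:+.3f}  z={z:.2f}  p={p_value:.2f}  "
      f"95% CI=({ci_low:+.3f}, {ci_high:+.3f})")
```

With these counts the p-value lands far above 0.05 and the interval comfortably straddles zero, which is the sense in which the gap is noise at this sample size.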

When to Use

Use this skill when the request is to:

Compare retention between two cohorts (e.g. May signups vs June signups, treatment vs control).
Compare a single conversion rate between two segments and decide if the delta is real.
Produce a retention curve (D1/D7/D30…) for one or both cohorts.
Sanity-check a claim that "segment X converts better than Y."
Decide whether a launch moved a metric or whether the change is within noise.

Do NOT use this for: building a single funnel or figuring out which events/tables to join — that is funnel-query (this skill consumes cohorts of users, it does not discover the event→step mapping). Do NOT use it for infra/ops dashboards — that is grafana. This skill stops at the comparison + significance verdict; it does not run multi-arm experiment analysis or causal inference.

Cohorts and where their definitions live

A cohort is a set of canonical user_ids plus a per-user cohort_ts (the moment they entered, e.g. signup time) used to age each user for retention. Resolve users to the canonical user_id the same way funnel-query does — retention fragments badly if a user is split across anonymous ids.

Segment definitions are not free-text. They live in the segment registry (analytics.segment_definitions, keyed by segment_id, each row carrying the SQL predicate and a version). When you compare "power users" vs "casual users", pull the cohort from a registered segment_id and record its version in your output, so the comparison is reproducible and you can detect if the definition changed under you (see Gotchas: segment-definition drift). Do not paraphrase a segment inline.
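The cohort shape and the version-pinning rule described above can be sketched as a small record plus a pre-comparison check. Everything here is hypothetical: `CohortSketch`, `comparability_problems`, and the integer version type are assumptions for illustration, not the actual return type or API of scripts/cohorts.py.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

@dataclass(frozen=True)
class CohortSketch:
    """Hypothetical stand-in for a cohort: canonical user_id -> cohort_ts."""
    members: Dict[str, datetime]
    segment_id: Optional[str] = None
    segment_version: Optional[int] = None  # version type is an assumption

def comparability_problems(a: CohortSketch, b: CohortSketch) -> List[str]:
    """List reasons two cohorts should not be compared as-is."""
    problems = []
    overlap = a.members.keys() & b.members.keys()
    if overlap:
        problems.append(f"{len(overlap)} user(s) appear in both cohorts")
    for name, cohort in (("A", a), ("B", b)):
        if cohort.segment_id is not None and cohort.segment_version is None:
            problems.append(f"cohort {name} is not pinned to a segment version")
    # Same segment built at different versions: the delta may be definitional.
    if (a.segment_id is not None and a.segment_id == b.segment_id
            and a.segment_version != b.segment_version):
        problems.append(
            f"{a.segment_id} built at versions "
            f"{a.segment_version} and {b.segment_version}"
        )
    return problems

ts = datetime(2026, 6, 1)
last_week = CohortSketch({"u1": ts, "u2": ts}, "seg_power_users", 3)
today = CohortSketch({"u2": ts, "u3": ts}, "seg_power_users", 4)
problems = comparability_problems(last_week, today)
for problem in problems:
    print(problem)
```

Running a check like this before testing keeps the version and overlap questions out of the interpretation step, where they are easy to forget.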

The library

scripts/cohorts.py exposes:

build_cohort(segment_id=None, *, user_ids=None, start, end, version=None) — materialize a cohort of canonical user_ids with their cohort_ts, either from a registered segment_id (pinned to a version) or from an explicit list. Records the segment version used.
retention_curve(cohort, horizons=(1, 7, 14, 30)) — for each horizon day, the fraction of the cohort still active, counting only users old enough to have had the chance (no survivorship inflation). Returns counts and rates per horizon.
compare(a, b, *, metric, horizon=None) — compare cohort A vs B on a conversion or a retention horizon. Runs a two-proportion z-test, returns Delta(p_value, ci_low, ci_high, significant, ...). Use chi_square(...) for the 2×2 contingency form when you prefer it; both agree at large N.

Claude composes these. Example: "did June signups retain better than May at D7?"

from scripts.cohorts import build_cohort, retention_curve, compare

may  = build_cohort(start="2026-05-01", end="2026-06-01")
june = build_cohort(start="2026-06-01", end="2026-07-01")

# Show the curves for context, then test the specific horizon.
print(retention_curve(may))
print(retention_curve(june))

verdict = compare(may, june, metric="retention", horizon=7)
print(verdict)  # Delta(rate_a=..., rate_b=..., p_value=..., significant=...)

For two registered segments on a conversion metric:

power  = build_cohort(segment_id="seg_power_users",  start="2026-06-01", end="2026-06-23")
casual = build_cohort(segment_id="seg_casual_users", start="2026-06-01", end="2026-06-23")
print(compare(power, casual, metric="paid_conversion"))

Don't railroad to compare() when the right move is to show the whole curve — sometimes the answer is "they diverge only after D14," which a single-horizon test hides.
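When showing a whole curve, it helps to see how the denominators behind each horizon are formed. The following is a minimal sketch of age-aware retention, assuming "retained at day h" means "active on or after day h" and that activity is available as a set of dates per user. The function name `retention_curve_sketch` and its inputs are hypothetical; the real retention_curve may define activity differently.

```python
from datetime import date, timedelta

def retention_curve_sketch(cohort, active_days, as_of, horizons=(1, 7, 14, 30)):
    """Age-aware retention (hypothetical helper, for illustration).

    cohort:      {user_id: cohort date}
    active_days: {user_id: set of dates on which the user was active}
    Returns {horizon: (retained, eligible, rate)}.
    """
    curve = {}
    for h in horizons:
        # Only users old enough to have reached day h enter the denominator.
        eligible = [u for u, start in cohort.items()
                    if start + timedelta(days=h) <= as_of]
        # Assumption: retained means any activity on or after day h.
        retained = [u for u in eligible
                    if any(d >= cohort[u] + timedelta(days=h)
                           for d in active_days.get(u, ()))]
        rate = len(retained) / len(eligible) if eligible else None
        curve[h] = (len(retained), len(eligible), rate)
    return curve

cohort = {"u1": date(2026, 5, 1), "u2": date(2026, 5, 10), "u3": date(2026, 6, 20)}
active_days = {
    "u1": {date(2026, 6, 5)},
    "u2": {date(2026, 5, 12)},
    "u3": {date(2026, 6, 21)},
}
curve = retention_curve_sketch(cohort, active_days, as_of=date(2026, 6, 24))
for horizon, (retained, eligible, rate) in curve.items():
    print(f"D{horizon}: {retained}/{eligible} = {rate}")
```

Note how u3, who joined four days before the as-of date, counts toward D1 but drops out of the D7, D14, and D30 denominators instead of being scored as churned.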

Gotchas

ALWAYS treat these as real, observed failure modes — each has turned noise into a "result" that someone acted on.

Small samples make almost any delta look like a win — let the p-value, not the point estimate, decide. A 3–5 point retention gap on ~100 users per arm is routinely non-significant. compare() returns significant and a confidence interval precisely so you don't report a direction the data can't support. If significant is False, say "no detectable difference," not "A is slightly better." Report the CI; "+3pp (95% CI −6 to +12)" is honest, "+3pp" is not.
Survivorship bias: retention denominators must include users who could have churned, not just the ones who stuck around. Computing D30 retention over only the users who are still active is circular — it'll read ~100%. retention_curve denominators are the full cohort that is old enough to have reached the horizon. Never filter the cohort to "active users" before measuring retention.
Mismatched observation windows make comparisons meaningless. If Cohort A has had 60 days to retain and Cohort B only 20, their D30 numbers aren't comparable — B's recent users haven't had the chance yet. Only compare horizons both cohorts have fully aged into. retention_curve excludes users younger than each horizon; when you compare, make sure both cohorts are fully aged at that horizon, or the younger cohort is under-counted. Prefer comparing cohorts of equal maturity, or cap the horizon to the younger cohort's age.
Segment-definition drift silently changes what you're comparing. "Power users" today may be a different SQL predicate than last month if someone edited the registry. Always build cohorts from a pinned segment_id + version and print the version in your output. If you compare a cohort built last week (old version) against one built today (new version), the delta may be definitional, not behavioral. When in doubt, rebuild both at the same version.
A two-proportion z-test assumes independent users and a real binary outcome. Don't apply it to averages (e.g. mean sessions per user) — that needs a t-test/Mann-Whitney, not compare(metric=...). Don't double-count a user who appears in both cohorts; cohorts should be disjoint for a clean comparison. If a user is in both (e.g. overlapping segments), decide on an assignment rule before testing.
Multiple horizons = multiple tests; one of them "winning" at p<0.05 is not surprising. If you test D1, D7, D14, and D30 and only D14 is significant, be skeptical: with four tests at α = 0.05, the chance of at least one false positive would be about 19% if the tests were independent (1 − 0.95^4), nearly four times the nominal 5%. Either pre-register the horizon you care about, or correct for multiple comparisons before celebrating a single significant cell.
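The multiple-horizons gotcha can be made concrete with a short sketch: the family-wise error rate for several independent tests, and a Holm-Bonferroni step-down correction. The helper names and the p-values below are made up for illustration; nothing here is part of scripts/cohorts.py, and the source does not say which correction, if any, the library applies.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def holm_significant(p_values, alpha=0.05):
    """Holm-Bonferroni step-down. Returns {label: significant?}."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (label, p) in enumerate(ordered):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        if still_rejecting and p <= alpha / (m - rank):
            decisions[label] = True
        else:
            # Once one test fails, every larger p-value fails too.
            still_rejecting = False
            decisions[label] = False
    return decisions

fwer = familywise_error_rate(4)
print(f"family-wise error rate for 4 tests: {fwer:.3f}")

# Made-up p-values: only D14 clears 0.05 before correction.
p_by_horizon = {"D1": 0.41, "D7": 0.22, "D14": 0.03, "D30": 0.35}
decisions = holm_significant(p_by_horizon)
print(decisions)
```

In this made-up case D14's p-value of 0.03 does not clear the corrected threshold of 0.05/4, so no horizon survives correction.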

Files

scripts/cohorts.py — build_cohort, retention_curve, compare (two-proportion z-test), chi_square (2×2 contingency), and the Delta result type with p-value + confidence interval. Compose these; do not reimplement the significance math inline.
