| name | funnel-query |
| description | Use when someone asks which events to join to see a funnel like signup → activation → paid, where the canonical user_id lives, how to count conversions between two product events, or why their event counts look inflated or don't match between steps. Reach for this whenever a question involves the raw event stream, joining events to users, deduping events, or measuring step-to-step conversion — especially before hand-writing SQL against the event tables, which almost always gets the user_id join wrong. |
funnel-query
Overview
This skill answers "which events do I join, and on what key, to measure a funnel" and hands Claude a small Python library (scripts/events.py) for fetching events and composing funnels on the fly. The single most important fact it encodes: raw events do not carry the canonical user_id. They carry an anonymous_id. The canonical user_id lives in identity.users, and you must join through it or your funnel silently fragments one user into many.
The library is credentials-aware (it reads warehouse creds from the environment) and exposes composable primitives — fetch_events, join_on_canonical_user_id, funnel — so Claude writes a short script per question rather than reconstructing boilerplate each time.
When to Use
Use this skill when the request is to:
- Build a funnel across product events (e.g.
signup_completed → workspace_activated → subscription_started).
- Figure out which event names map to a product step, and which table holds them.
- Measure step-to-step conversion or drop-off between two events.
- Resolve "my counts are inflated / don't add up / a user shows up twice" on the event stream.
- Attribute pre-login (anonymous) activity to the user who eventually signed up.
Do NOT use this for: aggregate retention curves or comparing two cohorts with significance testing — that is cohort-compare. Do NOT use it for ops/infra dashboards or "is the pipeline down" questions — that is grafana. This skill stops at producing funnel/conversion numbers from the event stream; it does not do statistics or alerting.
Canonical schema (the part Claude gets wrong by default)
- Raw events live in
analytics.events (one append-only row per event). Each row has event_id (unique per emission), event_name, anonymous_id, an optional user_id that is NULL for pre-login events, ts (UTC), and a JSON properties blob.
- The identity map lives in
identity.users. It maps anonymous_id → user_id (the canonical, stable id) and is the ONLY trustworthy source of user_id. Never trust events.user_id directly; always resolve through this table.
- A single human can have many
anonymous_ids (multiple devices/browsers) all pointing at one user_id. Counting by anonymous_id over-counts users; counting by raw events.user_id drops everyone pre-login.
So every funnel step is: fetch events by name → resolve to canonical user_id via identity.users → dedupe → count distinct users reaching each step.
The library
scripts/events.py exposes:
fetch_events(name, start, end) — pull all events of one event_name in a UTC time window. Returns a DataFrame with event_id, anonymous_id, raw user_id, ts, properties.
join_on_canonical_user_id(df) — join a fetched frame against identity.users and attach the canonical user_id. This is where anonymous pre-login rows get stitched to the eventual account. Drops rows that never resolve (truly anonymous, never-signed-up traffic).
funnel(steps) — given an ordered list of step specs, fetch + resolve + dedupe each step and return the ordered-conversion table (users reaching each step, step-to-step %, overall %). Enforces step ordering by timestamp so a later step only counts if it happened after the earlier one for that user.
Claude composes these. Example for "signup → activation → paid, last 30 days":
from scripts.events import funnel
result = funnel([
{"name": "signup_completed", "label": "Signup"},
{"name": "workspace_activated", "label": "Activation"},
{"name": "subscription_started", "label": "Paid"},
], start="2026-05-24", end="2026-06-23")
print(result.to_string())
For an ad-hoc two-event conversion you can drop to the primitives instead of funnel:
from scripts.events import fetch_events, join_on_canonical_user_id
a = join_on_canonical_user_id(fetch_events("signup_completed", "2026-06-01", "2026-06-23"))
b = join_on_canonical_user_id(fetch_events("workspace_activated", "2026-06-01", "2026-06-23"))
converted = b[b.user_id.isin(a.user_id)].user_id.nunique() / a.user_id.nunique()
Don't railroad to funnel() when the question is irregular (non-linear paths, "did X ever happen after Y", multi-touch) — compose the primitives directly.
Gotchas
ALWAYS treat these as real, observed failure modes — each has produced a wrong funnel number that shipped to someone before.
-
Raw events carry anonymous_id, not the canonical user_id. analytics.events.user_id is NULL for every pre-login event and is not guaranteed stable even when present. Joining funnels on it splits one user across signup (anonymous) and paid (logged-in) and tanks your conversion rate. Always pass each fetched frame through join_on_canonical_user_id before counting. If a step's user count looks suspiciously low, this is the first thing to check.
-
Dedupe on event_id, not on (user, event_name). The pipeline is at-least-once: the same emission can land twice with the same event_id. fetch_events does NOT pre-dedupe (so you can audit raw volume). Deduping on event_id removes pipeline duplicates while preserving genuinely repeated actions; deduping on (user, event_name) would also collapse a user who legitimately did the action twice. funnel() dedupes on event_id for you — if you hand-roll, do the same.
-
All timestamps in the warehouse are UTC; report windows are often local. events.ts is UTC. If a stakeholder asks for "yesterday" they mean their local day. Convert the window to UTC before passing start/end, and state the timezone in your answer. A naïve local-vs-UTC mismatch shifts every count by up to a day and is the most common cause of "the dashboard and your number disagree."
-
Late-arriving events make recent windows under-count. Events can land hours (occasionally a day) after they occurred, especially from mobile/offline clients. A funnel for "today" or the last few hours will under-report later steps and overstate drop-off. For trustworthy conversion, end the window at least 24–48h before now, or explicitly caveat that the tail is still filling in.
-
Funnel steps must be time-ordered per user. A user counts at step N only if their step-N event happened after their step-(N−1) event. funnel() enforces this; if you compose primitives with a plain isin, you'll wrongly credit users who did the "later" step first (e.g. reactivated before re-signup). Order matters whenever the events aren't naturally monotonic.
-
An event name is not always one product step. Some steps are an OR of several event names (e.g. activation = workspace_activated OR first_query_run), and some event names are emitted in multiple contexts. Confirm the name→step mapping with properties filters rather than assuming the name is canonical. Pass a where predicate into the step spec when a name needs narrowing.
Files
scripts/events.py — credentials-aware helper library: fetch_events, join_on_canonical_user_id, funnel, plus the identity.users resolution logic. Compose these into per-question scripts; do not re-implement the user-id join inline.