By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Hyper-practical, zero-fluff guide for real projects & certifications (PL-300, Google Data Analytics, etc.)
Cohort Analysis is the process of grouping users (or entities) by a shared characteristic (e.g., sign-up month, first purchase date) and tracking their behavior over time. A Retention Table quantifies how many users from each cohort return in subsequent periods (e.g., "30% of January sign-ups were still active in March").
Why it matters in production:- Without it, you’re flying blind on user engagement. You might see "total active users" growing, but if retention is plummeting, your growth is unsustainable (like filling a leaky bucket).- With it, you can: - Identify which acquisition channels bring loyal users (not just one-time visitors). - Measure the impact of product changes (e.g., "Did our onboarding redesign improve Day-7 retention?"). - Predict churn and allocate resources (e.g., "Cohorts from Q2 2023 have 20% lower retention—let’s investigate").
Real-world scenario:You’re a data analyst at a SaaS company. The CEO asks, "Are our recent marketing campaigns bringing in users who stick around, or are we just burning cash on one-time signups?" A retention table answers this in seconds. Without it, you’d waste weeks building ad-hoc queries or (worse) rely on gut feelings.
? Cohort Definition: A group of users who share a common event (e.g., first app launch, subscription start) within a specific time period (e.g., "January 2024 sign-ups"). Production insight: Cohorts are not static. If you define them too broadly (e.g., "all 2023 users"), you’ll miss critical trends. Aim for granularity (e.g., weekly or monthly cohorts).
? Retention Period Definition: The time window after the cohort’s initial event (e.g., "Day 1," "Week 2," "Month 3"). Production insight: Retention periods must align with your business cycle. For a food delivery app, "Day 7" retention matters more than "Month 6." For a B2B tool, "Month 12" is critical.
? Retention Rate Definition: The percentage of a cohort that returns in a given retention period (e.g., "45% of January sign-ups were active in February"). Production insight: Retention rates decline over time—this is normal. The key is spotting abnormal drops (e.g., a 30% drop in Week 2 retention after a UI change).
? First-Touch vs. Last-Touch Attribution Definition:
? Active User Definition Definition: The criteria for counting a user as "active" (e.g., "logged in," "made a purchase," "used a feature"). Production insight: This must match your business goals. If you define "active" as "logged in," but your product is a meditation app, you’ll overcount users who open the app but don’t use it.
? Rolling vs. Fixed Retention Windows Definition:
? Survival Analysis (Bonus) Definition: A statistical method to estimate the probability of a user "surviving" (remaining active) over time. Production insight: Use this for predictive retention (e.g., "What % of this cohort will churn by Month 6?"). Requires Python/R (e.g., lifelines library) but is gold for executive reporting.
lifelines
Prerequisites:- A SQL database with user activity data (e.g., user_id, event_date, event_type).- Basic SQL skills (joins, window functions, date arithmetic).- Example schema: sql -- Table: user_events user_id | event_date | event_type --------+-------------+------------ 101 | 2024-01-15 | signup 101 | 2024-01-16 | login 102 | 2024-01-20 | signup 101 | 2024-02-01 | purchase
user_id
event_date
event_type
sql -- Table: user_events user_id | event_date | event_type --------+-------------+------------ 101 | 2024-01-15 | signup 101 | 2024-01-16 | login 102 | 2024-01-20 | signup 101 | 2024-02-01 | purchase
WITH user_cohorts AS ( SELECT user_id, DATE_TRUNC('month', MIN(event_date)) AS cohort_month FROM user_events WHERE event_type = 'signup' GROUP BY user_id )
Output:
user_id | cohort_month --------+------------- 101 | 2024-01-01 102 | 2024-01-01
WITH user_activity AS ( SELECT u.user_id, u.cohort_month, DATE_TRUNC('month', e.event_date) AS activity_month, -- Calculate months since cohort (0 = cohort month, 1 = next month, etc.) EXTRACT(MONTH FROM e.event_date) - EXTRACT(MONTH FROM u.cohort_month) AS period FROM user_cohorts u JOIN user_events e ON u.user_id = e.user_id WHERE e.event_type IN ('login', 'purchase') )
user_id | cohort_month | activity_month | period --------+--------------+----------------+------- 101 | 2024-01-01 | 2024-01-01 | 0 101 | 2024-01-01 | 2024-02-01 | 1 102 | 2024-01-01 | 2024-01-01 | 0
WITH retention_counts AS ( SELECT cohort_month, period, COUNT(DISTINCT user_id) AS active_users FROM user_activity GROUP BY cohort_month, period )
cohort_month | period | active_users -------------+--------+------------- 2024-01-01 | 0 | 2 2024-01-01 | 1 | 1
WITH cohort_sizes AS ( SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size FROM user_cohorts GROUP BY cohort_month ) SELECT r.cohort_month, r.period, r.active_users, c.cohort_size, ROUND(100.0 * r.active_users / c.cohort_size, 2) AS retention_rate FROM retention_counts r JOIN cohort_sizes c ON r.cohort_month = c.cohort_month ORDER BY r.cohort_month, r.period;
Final Output:
cohort_month | period | active_users | cohort_size | retention_rate -------------+--------+--------------+-------------+--------------- 2024-01-01 | 0 | 2 | 2 | 100.00 2024-01-01 | 1 | 1 | 2 | 50.00
matplotlib
seaborn
# Load the retention table (from SQL output) df = pd.read_csv("retention_table.csv")
# Pivot for heatmap heatmap_data = df.pivot(index="cohort_month", columns="period", values="retention_rate")
# Plot plt.figure(figsize=(10, 6)) sns.heatmap(heatmap_data, annot=True, fmt=".1f", cmap="YlGnBu") plt.title("Retention Rate by Cohort and Period (%)") plt.show() ``` - Expected output: A heatmap where darker colors = higher retention.
user_events
user_cohorts
retention_counts
retention_table_v2.sql
sql -- Active user = login OR purchase -- Cohort = first signup month -- Retention period = months since signup
cohort_size
Typical question patterns:1. "Which SQL function is best for assigning users to cohorts?" - Trap: GROUP BY vs. DATE_TRUNC. - Answer: DATE_TRUNC('month', event_date) to group by month.
GROUP BY
DATE_TRUNC
DATE_TRUNC('month', event_date)
Answer: (Active users in Month 3) / (Cohort size) * 100.
(Active users in Month 3) / (Cohort size) * 100
"Why might retention rates drop in December?"
Answer: Seasonality (e.g., users are busy with holidays). Compare to previous Decembers.
"What’s the difference between first-touch and last-touch attribution?"
Key trap distinctions:- Fixed vs. rolling windows: Fixed windows compare cohorts to the same calendar periods (e.g., "January vs. February"). Rolling windows compare each cohort to its own timeline (e.g., "Day 7 for January vs. Day 7 for February"). Always use rolling for trend analysis. - Active user definition: "Logged in" ≠ "engaged." Define "active" based on business impact (e.g., "used a paid feature").
Challenge:Using the user_events table below, write a SQL query to calculate weekly retention rates for the January 2024 cohort. Define "active" as "logged in at least once in the week."
-- user_events table user_id | event_date | event_type --------+-------------+------------ 101 | 2024-01-01 | signup 101 | 2024-01-02 | login 101 | 2024-01-08 | login 102 | 2024-01-05 | signup 102 | 2024-01-06 | login 103 | 2024-01-10 | signup
Solution:
WITH user_cohorts AS ( SELECT user_id, DATE_TRUNC('week', MIN(event_date)) AS cohort_week FROM user_events WHERE event_type = 'signup' GROUP BY user_id ), user_activity AS ( SELECT u.user_id, u.cohort_week, DATE_TRUNC('week', e.event_date) AS activity_week, EXTRACT(WEEK FROM e.event_date) - EXTRACT(WEEK FROM u.cohort_week) AS period FROM user_cohorts u JOIN user_events e ON u.user_id = e.user_id WHERE e.event_type = 'login' ), retention_counts AS ( SELECT cohort_week, period, COUNT(DISTINCT user_id) AS active_users FROM user_activity GROUP BY cohort_week, period ), cohort_sizes AS ( SELECT cohort_week, COUNT(DISTINCT user_id) AS cohort_size FROM user_cohorts GROUP BY cohort_week ) SELECT r.cohort_week, r.period, r.active_users, c.cohort_size, ROUND(100.0 * r.active_users / c.cohort_size, 2) AS retention_rate FROM retention_counts r JOIN cohort_sizes c ON r.cohort_week = c.cohort_week ORDER BY r.cohort_week, r.period;
Why it works:- DATE_TRUNC('week', ...) groups events by week.- EXTRACT(WEEK FROM ...) calculates the period (0 = cohort week, 1 = next week, etc.).- The final query joins active users to cohort sizes to compute retention rates.
DATE_TRUNC('week', ...)
EXTRACT(WEEK FROM ...)
DATE_TRUNC('month', MIN(event_date)) AS cohort_month
EXTRACT(MONTH FROM e.event_date) - EXTRACT(MONTH FROM u.cohort_month) AS period
COUNT(DISTINCT user_id) AS active_users
ROUND(100.0 * active_users / cohort_size, 2)
HAVING COUNT(DISTINCT user_id) >= 50
AT TIME ZONE 'UTC'
sns.heatmap(df.pivot(index="cohort", columns="period", values="retention_rate"))
EXTRACT
Join 4M+ learners. Unlock unlimited quizzes, wrong-answer tracking, flashcards + reminders, study guides, and 1-on-1 challenges.