Fatskills
Practice. Master. Repeat.
Study Guide: TECH **SQL for Data Analysis: Cohort Analysis with Retention Tables**
Source: https://www.fatskills.com/introdution-to-engineering/chapter/tech-sql-for-data-analysis-cohort-analysis-with-retention-tables

TECH **SQL for Data Analysis: Cohort Analysis with Retention Tables**

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~10 min read

SQL for Data Analysis: Cohort Analysis with Retention Tables

Hyper-practical, zero-fluff guide for real projects & certifications (PL-300, Google Data Analytics, etc.)


1. What This Is & Why It Matters

Cohort Analysis is the process of grouping users (or entities) by a shared characteristic (e.g., sign-up month, first purchase date) and tracking their behavior over time. A Retention Table quantifies how many users from each cohort return in subsequent periods (e.g., "30% of January sign-ups were still active in March").

Why it matters in production:
- Without it, you’re flying blind on user engagement. You might see "total active users" growing, but if retention is plummeting, your growth is unsustainable (like filling a leaky bucket).
- With it, you can: - Identify which acquisition channels bring loyal users (not just one-time visitors).
- Measure the impact of product changes (e.g., "Did our onboarding redesign improve Day-7 retention?").
- Predict churn and allocate resources (e.g., "Cohorts from Q2 2023 have 20% lower retention—let’s investigate").

Real-world scenario:
You’re a data analyst at a SaaS company. The CEO asks, "Are our recent marketing campaigns bringing in users who stick around, or are we just burning cash on one-time signups?" A retention table answers this in seconds. Without it, you’d waste weeks building ad-hoc queries or (worse) rely on gut feelings.


2. Core Concepts & Components

  • ? Cohort
    Definition: A group of users who share a common event (e.g., first app launch, subscription start) within a specific time period (e.g., "January 2024 sign-ups").
    Production insight: Cohorts are not static. If you define them too broadly (e.g., "all 2023 users"), you’ll miss critical trends. Aim for granularity (e.g., weekly or monthly cohorts).

  • ? Retention Period
    Definition: The time window after the cohort’s initial event (e.g., "Day 1," "Week 2," "Month 3").
    Production insight: Retention periods must align with your business cycle. For a food delivery app, "Day 7" retention matters more than "Month 6." For a B2B tool, "Month 12" is critical.

  • ? Retention Rate
    Definition: The percentage of a cohort that returns in a given retention period (e.g., "45% of January sign-ups were active in February").
    Production insight: Retention rates decline over time—this is normal. The key is spotting abnormal drops (e.g., a 30% drop in Week 2 retention after a UI change).

  • ? First-Touch vs. Last-Touch Attribution
    Definition:


    • First-touch: Credit the cohort’s initial event (e.g., "first app open").
    • Last-touch: Credit the most recent event (e.g., "last purchase").
      Production insight: First-touch is standard for retention tables. Last-touch is useful for revenue analysis (e.g., "Which cohort drove the most revenue in Month 3?").
  • ? Active User Definition
    Definition: The criteria for counting a user as "active" (e.g., "logged in," "made a purchase," "used a feature").
    Production insight: This must match your business goals. If you define "active" as "logged in," but your product is a meditation app, you’ll overcount users who open the app but don’t use it.

  • ? Rolling vs. Fixed Retention Windows
    Definition:


    • Fixed: Compare all cohorts to the same calendar periods (e.g., "January sign-ups in January vs. February").
    • Rolling: Compare each cohort to its own timeline (e.g., "Day 7 for January sign-ups vs. Day 7 for February sign-ups").
      Production insight: Rolling windows are far more useful for spotting trends (e.g., "Is Day-7 retention improving over time?").
  • ? Survival Analysis (Bonus)
    Definition: A statistical method to estimate the probability of a user "surviving" (remaining active) over time.
    Production insight: Use this for predictive retention (e.g., "What % of this cohort will churn by Month 6?"). Requires Python/R (e.g., lifelines library) but is gold for executive reporting.


3. Step-by-Step: Build a Retention Table in SQL

Prerequisites:
- A SQL database with user activity data (e.g., user_id, event_date, event_type).
- Basic SQL skills (joins, window functions, date arithmetic).
- Example schema: sql -- Table: user_events user_id | event_date | event_type --------+-------------+------------ 101 | 2024-01-15 | signup 101 | 2024-01-16 | login 102 | 2024-01-20 | signup 101 | 2024-02-01 | purchase

Step 1: Define Your Cohort and Active User Criteria

  • Cohort: Users who signed up in January 2024.
  • Active user: Logged in or made a purchase.

Step 2: Assign Each User to a Cohort

WITH user_cohorts AS (
  SELECT
user_id,
DATE_TRUNC('month', MIN(event_date)) AS cohort_month FROM user_events WHERE event_type = 'signup' GROUP BY user_id )

Output:


user_id | cohort_month
--------+-------------
101     | 2024-01-01
102     | 2024-01-01

Step 3: Calculate Retention Periods for Each User

WITH user_activity AS (
  SELECT
u.user_id,
u.cohort_month,
DATE_TRUNC('month', e.event_date) AS activity_month,
-- Calculate months since cohort (0 = cohort month, 1 = next month, etc.)
EXTRACT(MONTH FROM e.event_date) - EXTRACT(MONTH FROM u.cohort_month) AS period FROM user_cohorts u JOIN user_events e ON u.user_id = e.user_id WHERE e.event_type IN ('login', 'purchase') )

Output:


user_id | cohort_month | activity_month | period
--------+--------------+----------------+-------
101     | 2024-01-01   | 2024-01-01     | 0
101     | 2024-01-01   | 2024-02-01     | 1
102     | 2024-01-01   | 2024-01-01     | 0

Step 4: Count Active Users per Cohort per Period

WITH retention_counts AS (
  SELECT
cohort_month,
period,
COUNT(DISTINCT user_id) AS active_users FROM user_activity GROUP BY cohort_month, period )

Output:


cohort_month | period | active_users
-------------+--------+-------------
2024-01-01   | 0      | 2
2024-01-01   | 1      | 1

Step 5: Calculate Retention Rates

WITH cohort_sizes AS (
  SELECT
cohort_month,
COUNT(DISTINCT user_id) AS cohort_size FROM user_cohorts GROUP BY cohort_month ) SELECT r.cohort_month, r.period, r.active_users, c.cohort_size, ROUND(100.0 * r.active_users / c.cohort_size, 2) AS retention_rate FROM retention_counts r JOIN cohort_sizes c ON r.cohort_month = c.cohort_month ORDER BY r.cohort_month, r.period;

Final Output:


cohort_month | period | active_users | cohort_size | retention_rate
-------------+--------+--------------+-------------+---------------
2024-01-01   | 0      | 2            | 2           | 100.00
2024-01-01   | 1      | 1            | 2           | 50.00

Step 6: Visualize the Retention Table

  • Tool: Use Excel, Tableau, or Python (matplotlib/seaborn).
  • Example Python code:
    ```python import pandas as pd import seaborn as sns import matplotlib.pyplot as plt

# Load the retention table (from SQL output) df = pd.read_csv("retention_table.csv")

# Pivot for heatmap heatmap_data = df.pivot(index="cohort_month", columns="period", values="retention_rate")

# Plot plt.figure(figsize=(10, 6)) sns.heatmap(heatmap_data, annot=True, fmt=".1f", cmap="YlGnBu") plt.title("Retention Rate by Cohort and Period (%)") plt.show() ``` - Expected output: A heatmap where darker colors = higher retention.


4. ? Production-Ready Best Practices


? Security

  • Mask PII: If your user_events table includes emails or names, use user_id only in retention tables. Never expose raw PII in dashboards.
  • Access control: Restrict retention table access to analysts and executives. Use row-level security (RLS) in tools like Power BI or Snowflake.

? Cost Optimization

  • Materialize intermediate tables: For large datasets, pre-compute user_cohorts and retention_counts as tables (not CTEs) to avoid recomputing.
  • Partition by date: If your user_events table is huge, partition it by event_date to speed up queries.

? Reliability & Maintainability

  • Automate updates: Schedule retention table refreshes (e.g., daily at 2 AM) using Airflow, dbt, or cron.
  • Version control: Store SQL queries in Git (e.g., retention_table_v2.sql). Tag versions when business logic changes (e.g., "active user definition updated").
  • Document assumptions: Add a comment at the top of your query: sql -- Active user = login OR purchase -- Cohort = first signup month -- Retention period = months since signup

? Observability

  • Monitor query performance: If retention queries take >5 minutes, optimize with indexes or materialized views.
  • Alert on anomalies: Set up alerts (e.g., Slack + Metabase) for retention drops >10% week-over-week.
  • Log changes: Track when the "active user" definition changes (e.g., "2024-03-01: Added 'feature_used' to active criteria").


5. ⚠️ Common Mistakes & Traps

Mistake Symptom Fix/Prevention
Using calendar months for periods Retention rates fluctuate wildly (e.g., 31-day months vs. 28-day months). Use rolling periods (e.g., "7 days after signup") instead of calendar months.
Ignoring cohort size Tiny cohorts (e.g., 5 users) skew results. Filter out cohorts with <50 users or add a cohort_size column to flag small groups.
Defining "active" too loosely Retention rates look artificially high. Align "active" with business goals (e.g., "made a purchase" > "logged in").
Not accounting for time zones Users in different time zones split across cohorts. Standardize all dates to UTC or a single time zone.
Comparing fixed vs. rolling windows Misleading trends (e.g., "January cohort had lower retention because February is shorter"). Always use rolling windows for trend analysis.


6. ? Exam/Certification Focus

Typical question patterns:
1. "Which SQL function is best for assigning users to cohorts?"
- Trap: GROUP BY vs. DATE_TRUNC.
- Answer: DATE_TRUNC('month', event_date) to group by month.


  1. "How do you calculate retention rate for Month 3?"
  2. Trap: Forgetting to divide by the cohort size.
  3. Answer: (Active users in Month 3) / (Cohort size) * 100.

  4. "Why might retention rates drop in December?"

  5. Trap: Assuming it’s a product issue.
  6. Answer: Seasonality (e.g., users are busy with holidays). Compare to previous Decembers.

  7. "What’s the difference between first-touch and last-touch attribution?"

  8. Trap: Confusing them with marketing attribution.
  9. Answer:
    • First-touch: Tracks the cohort’s initial event (e.g., signup).
    • Last-touch: Tracks the most recent event (e.g., last purchase).

Key trap distinctions:
- Fixed vs. rolling windows: Fixed windows compare cohorts to the same calendar periods (e.g., "January vs. February"). Rolling windows compare each cohort to its own timeline (e.g., "Day 7 for January vs. Day 7 for February"). Always use rolling for trend analysis. - Active user definition: "Logged in" ≠ "engaged." Define "active" based on business impact (e.g., "used a paid feature").


7. ? Hands-On Challenge

Challenge:
Using the user_events table below, write a SQL query to calculate weekly retention rates for the January 2024 cohort. Define "active" as "logged in at least once in the week."


-- user_events table
user_id | event_date  | event_type
--------+-------------+------------
101     | 2024-01-01  | signup
101     | 2024-01-02  | login
101     | 2024-01-08  | login
102     | 2024-01-05  | signup
102     | 2024-01-06  | login
103     | 2024-01-10  | signup

Solution:


WITH user_cohorts AS (
  SELECT
user_id,
DATE_TRUNC('week', MIN(event_date)) AS cohort_week FROM user_events WHERE event_type = 'signup' GROUP BY user_id ), user_activity AS ( SELECT
u.user_id,
u.cohort_week,
DATE_TRUNC('week', e.event_date) AS activity_week,
EXTRACT(WEEK FROM e.event_date) - EXTRACT(WEEK FROM u.cohort_week) AS period FROM user_cohorts u JOIN user_events e ON u.user_id = e.user_id WHERE e.event_type = 'login' ), retention_counts AS ( SELECT
cohort_week,
period,
COUNT(DISTINCT user_id) AS active_users FROM user_activity GROUP BY cohort_week, period ), cohort_sizes AS ( SELECT
cohort_week,
COUNT(DISTINCT user_id) AS cohort_size FROM user_cohorts GROUP BY cohort_week ) SELECT r.cohort_week, r.period, r.active_users, c.cohort_size, ROUND(100.0 * r.active_users / c.cohort_size, 2) AS retention_rate FROM retention_counts r JOIN cohort_sizes c ON r.cohort_week = c.cohort_week ORDER BY r.cohort_week, r.period;

Why it works:
- DATE_TRUNC('week', ...) groups events by week.
- EXTRACT(WEEK FROM ...) calculates the period (0 = cohort week, 1 = next week, etc.).
- The final query joins active users to cohort sizes to compute retention rates.


8. ? Rapid-Reference Crib Sheet

Task SQL Snippet
Assign users to cohorts DATE_TRUNC('month', MIN(event_date)) AS cohort_month
Calculate rolling periods EXTRACT(MONTH FROM e.event_date) - EXTRACT(MONTH FROM u.cohort_month) AS period
Count active users per period COUNT(DISTINCT user_id) AS active_users
Calculate retention rate ROUND(100.0 * active_users / cohort_size, 2)
Filter small cohorts HAVING COUNT(DISTINCT user_id) >= 50
⚠️ Time zone trap Always use AT TIME ZONE 'UTC' or a single time zone.
⚠️ Active user trap Define "active" before writing the query (e.g., "login OR purchase").
Heatmap visualization (Python) sns.heatmap(df.pivot(index="cohort", columns="period", values="retention_rate"))


9. ? Where to Go Next

  1. Official docs:
  2. PostgreSQL Date Functions (for DATE_TRUNC, EXTRACT).
  3. dbt Retention Analysis (for automation).
  4. Tutorials:
  5. Mode Analytics: Cohort Analysis in SQL (interactive examples).
  6. Towards Data Science: Retention Tables (advanced use cases).
  7. Books:
  8. Lean Analytics (Alistair Croll & Benjamin Yoskovitz) – Chapter 6 on retention.
  9. SQL for Data Analysis (O’Reilly) – Chapter 8 on time-series analysis.


ADVERTISEMENT