Fatskills
Practice. Master. Repeat.
Study Guide: TECH **SQL for Data Analysis: Cohort Analysis with Retention Tables**
Source: https://www.fatskills.com/introdution-to-engineering/chapter/tech-sql-for-data-analysis-cohort-analysis-with-retention-tables

TECH SQL for Data Analysis: Cohort Analysis with Retention Tables

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~10 min read

SQL for Data Analysis: Cohort Analysis with Retention Tables

Hyper-practical, zero-fluff guide for real projects & certifications (PL-300, Google Data Analytics, etc.)

1. What This Is & Why It Matters

Cohort Analysis is the process of grouping users (or entities) by a shared characteristic (e.g., sign-up month, first purchase date) and tracking their behavior over time. A Retention Table quantifies how many users from each cohort return in subsequent periods (e.g., "30% of January sign-ups were still active in March").

Why it matters in production:
- Without it, you’re flying blind on user engagement. You might see "total active users" growing, but if retention is plummeting, your growth is unsustainable (like filling a leaky bucket).
- With it, you can: - Identify which acquisition channels bring loyal users (not just one-time visitors).
- Measure the impact of product changes (e.g., "Did our onboarding redesign improve Day-7 retention?").
- Predict churn and allocate resources (e.g., "Cohorts from Q2 2023 have 20% lower retention—let’s investigate").

Real-world scenario:
You’re a data analyst at a SaaS company. The CEO asks, "Are our recent marketing campaigns bringing in users who stick around, or are we just burning cash on one-time signups?" A retention table answers this in seconds. Without it, you’d waste weeks building ad-hoc queries or (worse) rely on gut feelings.

2. Core Concepts & Components

? Cohort
Definition: A group of users who share a common event (e.g., first app launch, subscription start) within a specific time period (e.g., "January 2024 sign-ups").
Production insight: Cohorts are not static. If you define them too broadly (e.g., "all 2023 users"), you’ll miss critical trends. Aim for granularity (e.g., weekly or monthly cohorts).
? Retention Period
Definition: The time window after the cohort’s initial event (e.g., "Day 1," "Week 2," "Month 3").
Production insight: Retention periods must align with your business cycle. For a food delivery app, "Day 7" retention matters more than "Month 6." For a B2B tool, "Month 12" is critical.
? Retention Rate
Definition: The percentage of a cohort that returns in a given retention period (e.g., "45% of January sign-ups were active in February").
Production insight: Retention rates decline over time—this is normal. The key is spotting abnormal drops (e.g., a 30% drop in Week 2 retention after a UI change).
? First-Touch vs. Last-Touch Attribution
Definition:
- First-touch: Credit the cohort’s initial event (e.g., "first app open").
- Last-touch: Credit the most recent event (e.g., "last purchase").
  Production insight: First-touch is standard for retention tables. Last-touch is useful for revenue analysis (e.g., "Which cohort drove the most revenue in Month 3?").
? Active User Definition
Definition: The criteria for counting a user as "active" (e.g., "logged in," "made a purchase," "used a feature").
Production insight: This must match your business goals. If you define "active" as "logged in," but your product is a meditation app, you’ll overcount users who open the app but don’t use it.
? Rolling vs. Fixed Retention Windows
Definition:
- Fixed: Compare all cohorts to the same calendar periods (e.g., "January sign-ups in January vs. February").
- Rolling: Compare each cohort to its own timeline (e.g., "Day 7 for January sign-ups vs. Day 7 for February sign-ups").
  Production insight: Rolling windows are far more useful for spotting trends (e.g., "Is Day-7 retention improving over time?").
? Survival Analysis (Bonus)
Definition: A statistical method to estimate the probability of a user "surviving" (remaining active) over time.
Production insight: Use this for predictive retention (e.g., "What % of this cohort will churn by Month 6?"). Requires Python/R (e.g., lifelines library) but is gold for executive reporting.

3. Step-by-Step: Build a Retention Table in SQL

Step 1: Define Your Cohort and Active User Criteria

Cohort: Users who signed up in January 2024.
Active user: Logged in or made a purchase.

Step 2: Assign Each User to a Cohort

WITH user_cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('month', MIN(event_date)) AS cohort_month
  FROM user_events
  WHERE event_type = 'signup'
  GROUP BY user_id
)

Output:

user_id | cohort_month
--------+-------------
101     | 2024-01-01
102     | 2024-01-01

Step 3: Calculate Retention Periods for Each User

WITH user_activity AS (
  SELECT
    u.user_id,
    u.cohort_month,
    DATE_TRUNC('month', e.event_date) AS activity_month,
    -- Calculate months since cohort (0 = cohort month, 1 = next month, etc.)
    EXTRACT(MONTH FROM e.event_date) - EXTRACT(MONTH FROM u.cohort_month) AS period
  FROM user_cohorts u
  JOIN user_events e ON u.user_id = e.user_id
  WHERE e.event_type IN ('login', 'purchase')
)

Output:

user_id | cohort_month | activity_month | period
--------+--------------+----------------+-------
101     | 2024-01-01   | 2024-01-01     | 0
101     | 2024-01-01   | 2024-02-01     | 1
102     | 2024-01-01   | 2024-01-01     | 0

Step 4: Count Active Users per Cohort per Period

WITH retention_counts AS (
  SELECT
    cohort_month,
    period,
    COUNT(DISTINCT user_id) AS active_users
  FROM user_activity
  GROUP BY cohort_month, period
)

Output:

cohort_month | period | active_users
-------------+--------+-------------
2024-01-01   | 0      | 2
2024-01-01   | 1      | 1

Step 5: Calculate Retention Rates

WITH cohort_sizes AS (
  SELECT
    cohort_month,
    COUNT(DISTINCT user_id) AS cohort_size
  FROM user_cohorts
  GROUP BY cohort_month
)
SELECT
  r.cohort_month,
  r.period,
  r.active_users,
  c.cohort_size,
  ROUND(100.0 * r.active_users / c.cohort_size, 2) AS retention_rate
FROM retention_counts r
JOIN cohort_sizes c ON r.cohort_month = c.cohort_month
ORDER BY r.cohort_month, r.period;

Final Output:

cohort_month | period | active_users | cohort_size | retention_rate
-------------+--------+--------------+-------------+---------------
2024-01-01   | 0      | 2            | 2           | 100.00
2024-01-01   | 1      | 1            | 2           | 50.00

Step 6: Visualize the Retention Table

Tool: Use Excel, Tableau, or Python (matplotlib/seaborn).
Example Python code:
```python import pandas as pd import seaborn as sns import matplotlib.pyplot as plt

# Load the retention table (from SQL output) df = pd.read_csv("retention_table.csv")

# Pivot for heatmap heatmap_data = df.pivot(index="cohort_month", columns="period", values="retention_rate")

# Plot plt.figure(figsize=(10, 6)) sns.heatmap(heatmap_data, annot=True, fmt=".1f", cmap="YlGnBu") plt.title("Retention Rate by Cohort and Period (%)") plt.show() ``` - Expected output: A heatmap where darker colors = higher retention.

4. ? Production-Ready Best Practices

? Security

Mask PII: If your user_events table includes emails or names, use user_id only in retention tables. Never expose raw PII in dashboards.
Access control: Restrict retention table access to analysts and executives. Use row-level security (RLS) in tools like Power BI or Snowflake.

? Cost Optimization

Materialize intermediate tables: For large datasets, pre-compute user_cohorts and retention_counts as tables (not CTEs) to avoid recomputing.
Partition by date: If your user_events table is huge, partition it by event_date to speed up queries.

? Reliability & Maintainability

Automate updates: Schedule retention table refreshes (e.g., daily at 2 AM) using Airflow, dbt, or cron.
Version control: Store SQL queries in Git (e.g., retention_table_v2.sql). Tag versions when business logic changes (e.g., "active user definition updated").
Document assumptions: Add a comment at the top of your query: sql -- Active user = login OR purchase -- Cohort = first signup month -- Retention period = months since signup

? Observability

Monitor query performance: If retention queries take >5 minutes, optimize with indexes or materialized views.
Alert on anomalies: Set up alerts (e.g., Slack + Metabase) for retention drops >10% week-over-week.
Log changes: Track when the "active user" definition changes (e.g., "2024-03-01: Added 'feature_used' to active criteria").

5. ⚠️ Common Mistakes & Traps

Mistake	Symptom	Fix/Prevention
Using calendar months for periods	Retention rates fluctuate wildly (e.g., 31-day months vs. 28-day months).	Use rolling periods (e.g., "7 days after signup") instead of calendar months.
Ignoring cohort size	Tiny cohorts (e.g., 5 users) skew results.	Filter out cohorts with <50 users or add a `cohort_size` column to flag small groups.
Defining "active" too loosely	Retention rates look artificially high.	Align "active" with business goals (e.g., "made a purchase" > "logged in").
Not accounting for time zones	Users in different time zones split across cohorts.	Standardize all dates to UTC or a single time zone.
Comparing fixed vs. rolling windows	Misleading trends (e.g., "January cohort had lower retention because February is shorter").	Always use rolling windows for trend analysis.

6. ? Exam/Certification Focus

Typical question patterns:
1. "Which SQL function is best for assigning users to cohorts?"
- Trap: GROUP BY vs. DATE_TRUNC.
- Answer: DATE_TRUNC('month', event_date) to group by month.

"How do you calculate retention rate for Month 3?"
Trap: Forgetting to divide by the cohort size.
Answer: (Active users in Month 3) / (Cohort size) * 100.
"Why might retention rates drop in December?"
Trap: Assuming it’s a product issue.
Answer: Seasonality (e.g., users are busy with holidays). Compare to previous Decembers.
"What’s the difference between first-touch and last-touch attribution?"
Trap: Confusing them with marketing attribution.
Answer:
- First-touch: Tracks the cohort’s initial event (e.g., signup).
- Last-touch: Tracks the most recent event (e.g., last purchase).

Key trap distinctions:
- Fixed vs. rolling windows: Fixed windows compare cohorts to the same calendar periods (e.g., "January vs. February"). Rolling windows compare each cohort to its own timeline (e.g., "Day 7 for January vs. Day 7 for February"). Always use rolling for trend analysis. - Active user definition: "Logged in" ≠ "engaged." Define "active" based on business impact (e.g., "used a paid feature").

7. ? Hands-On Challenge

Challenge:
Using the user_events table below, write a SQL query to calculate weekly retention rates for the January 2024 cohort. Define "active" as "logged in at least once in the week."

-- user_events table
user_id | event_date  | event_type
--------+-------------+------------
101     | 2024-01-01  | signup
101     | 2024-01-02  | login
101     | 2024-01-08  | login
102     | 2024-01-05  | signup
102     | 2024-01-06  | login
103     | 2024-01-10  | signup

Solution:

WITH user_cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('week', MIN(event_date)) AS cohort_week
  FROM user_events
  WHERE event_type = 'signup'
  GROUP BY user_id
),
user_activity AS (
  SELECT
    u.user_id,
    u.cohort_week,
    DATE_TRUNC('week', e.event_date) AS activity_week,
    EXTRACT(WEEK FROM e.event_date) - EXTRACT(WEEK FROM u.cohort_week) AS period
  FROM user_cohorts u
  JOIN user_events e ON u.user_id = e.user_id
  WHERE e.event_type = 'login'
),
retention_counts AS (
  SELECT
    cohort_week,
    period,
    COUNT(DISTINCT user_id) AS active_users
  FROM user_activity
  GROUP BY cohort_week, period
),
cohort_sizes AS (
  SELECT
    cohort_week,
    COUNT(DISTINCT user_id) AS cohort_size
  FROM user_cohorts
  GROUP BY cohort_week
)
SELECT
  r.cohort_week,
  r.period,
  r.active_users,
  c.cohort_size,
  ROUND(100.0 * r.active_users / c.cohort_size, 2) AS retention_rate
FROM retention_counts r
JOIN cohort_sizes c ON r.cohort_week = c.cohort_week
ORDER BY r.cohort_week, r.period;

Why it works:
- DATE_TRUNC('week', ...) groups events by week.
- EXTRACT(WEEK FROM ...) calculates the period (0 = cohort week, 1 = next week, etc.).
- The final query joins active users to cohort sizes to compute retention rates.

8. ? Rapid-Reference Crib Sheet

Task	SQL Snippet
Assign users to cohorts	`DATE_TRUNC('month', MIN(event_date)) AS cohort_month`
Calculate rolling periods	`EXTRACT(MONTH FROM e.event_date) - EXTRACT(MONTH FROM u.cohort_month) AS period`
Count active users per period	`COUNT(DISTINCT user_id) AS active_users`
Calculate retention rate	`ROUND(100.0 * active_users / cohort_size, 2)`
Filter small cohorts	`HAVING COUNT(DISTINCT user_id) >= 50`
⚠️ Time zone trap	Always use `AT TIME ZONE 'UTC'` or a single time zone.
⚠️ Active user trap	Define "active" before writing the query (e.g., "login OR purchase").
Heatmap visualization (Python)	`sns.heatmap(df.pivot(index="cohort", columns="period", values="retention_rate"))`

9. ? Where to Go Next

Official docs:
PostgreSQL Date Functions (for DATE_TRUNC, EXTRACT).
dbt Retention Analysis (for automation).
Tutorials:
Mode Analytics: Cohort Analysis in SQL (interactive examples).
Towards Data Science: Retention Tables (advanced use cases).
Books:
Lean Analytics (Alistair Croll & Benjamin Yoskovitz) – Chapter 6 on retention.
SQL for Data Analysis (O’Reilly) – Chapter 8 on time-series analysis.

⚡ Recently practiced quizzes in this class

Data Analytics Practice Test Big Data & Analytics NASSCOM Certification Practice Test PySpark Practice Test Questions Basic Data Analytics and Visualization Practice Test (Tableau) Data Science Glossary Data Analysis with Python Data Science Exam #1 Data Analytics and Visualization Practice Test Pega Certified System Architect (PCSA) Study Guide Data Science Basics / Data Scientist Toolbox

➡️ Next Study Guide

TECH SQL for Data Analysis: Cohort Analysis with Retention Tables

SQL for Data Analysis: Cohort Analysis with Retention Tables

1. What This Is & Why It Matters

2. Core Concepts & Components

3. Step-by-Step: Build a Retention Table in SQL

Step 1: Define Your Cohort and Active User Criteria

Step 2: Assign Each User to a Cohort

Step 3: Calculate Retention Periods for Each User

Step 4: Count Active Users per Cohort per Period

Step 5: Calculate Retention Rates

Step 6: Visualize the Retention Table

4. ? Production-Ready Best Practices

? Security

? Cost Optimization

? Reliability & Maintainability

? Observability

5. ⚠️ Common Mistakes & Traps

6. ? Exam/Certification Focus

7. ? Hands-On Challenge

8. ? Rapid-Reference Crib Sheet

9. ? Where to Go Next

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | OSHA Basics Quiz | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

TECH **SQL for Data Analysis: Cohort Analysis with Retention Tables**

SQL for Data Analysis: Cohort Analysis with Retention Tables

1. What This Is & Why It Matters

2. Core Concepts & Components

3. Step-by-Step: Build a Retention Table in SQL

Step 1: Define Your Cohort and Active User Criteria

Step 2: Assign Each User to a Cohort

Step 3: Calculate Retention Periods for Each User

Step 4: Count Active Users per Cohort per Period

Step 5: Calculate Retention Rates

Step 6: Visualize the Retention Table

4. ? Production-Ready Best Practices

? Security

? Cost Optimization

? Reliability & Maintainability

? Observability

5. ⚠️ Common Mistakes & Traps

6. ? Exam/Certification Focus

7. ? Hands-On Challenge

8. ? Rapid-Reference Crib Sheet

9. ? Where to Go Next

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | OSHA Basics Quiz | What Should We Know? Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

TECH SQL for Data Analysis: Cohort Analysis with Retention Tables

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | OSHA Basics Quiz | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com