Fatskills
Practice. Master. Repeat.
Study Guide: TECH **SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide**
Source: https://www.fatskills.com/introdution-to-engineering/chapter/tech-sql-for-data-analysis-distinct-countdistinct-zero-fluff-study-guide

TECH **SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide**

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~9 min read

SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide


1. What This Is & Why It Matters

You’re analyzing customer data for an e-commerce company. Your boss asks: - "How many unique customers made a purchase last month?" - "Are there duplicate product IDs in our inventory table?" - "What’s the count of distinct shipping cities in our orders?"

If you don’t know how to answer these fast and accurately, you’ll either: - Waste time manually deduplicating data in Excel (error-prone).
- Misreport metrics (e.g., counting the same customer twice).
- Slow down queries by fetching unnecessary rows.

DISTINCT and COUNT(DISTINCT) are your deduplication superpowers in SQL. They let you: ✅ Count unique values (not just rows).
Remove duplicates from query results.
Optimize performance by reducing data scanned.

Real-world scenario:
You inherit a messy sales database where the same customer appears multiple times due to tracking errors. Your job is to: 1. Find the true number of unique customers.
2. Identify duplicate product entries.
3. Report distinct shipping regions for logistics planning.

If you don’t use DISTINCT or COUNT(DISTINCT), your reports will be wrong, and your stakeholders will lose trust in your data.


2. Core Concepts & Components


1. DISTINCT

  • Definition: Returns unique values from a column or set of columns (removes duplicates).
  • Production insight: If you don’t use DISTINCT when needed, you’ll overcount (e.g., counting the same customer multiple times in a sales report).

2. COUNT(DISTINCT column)

  • Definition: Counts the number of unique values in a column (ignores duplicates).
  • Production insight: If you use COUNT(column) instead of COUNT(DISTINCT column), you’ll misreport metrics (e.g., counting duplicate orders as unique).

3. DISTINCT vs. GROUP BY

  • DISTINCT → Removes duplicates from the result set.
  • GROUP BY → Groups rows by a column and applies aggregations (e.g., SUM, AVG).
  • Production insight: DISTINCT is faster for simple deduplication, while GROUP BY is better for aggregating data.

4. DISTINCT on Multiple Columns

  • Definition: Returns unique combinations of values across multiple columns.
  • Example: SELECT DISTINCT customer_id, product_id FROM orders → Finds unique customer-product pairs.
  • Production insight: Useful for identifying duplicate transactions (e.g., same customer buying the same product multiple times).

5. Performance Impact of DISTINCT

  • Definition: DISTINCT requires sorting and deduplication, which can slow down queries on large tables.
  • Production insight: If your table has millions of rows, DISTINCT can kill performance. Consider:
  • Indexing the column(s) you’re deduplicating.
  • Pre-aggregating data in a data warehouse (e.g., using GROUP BY instead).

6. COUNT(*) vs. COUNT(column) vs. COUNT(DISTINCT column)

Function Counts Ignores NULLs? Use Case
COUNT(*) All rows ❌ No Total row count
COUNT(column) Non-NULL values in column ✅ Yes Count of non-NULL entries
COUNT(DISTINCT column) Unique non-NULL values in column ✅ Yes Count of unique entries
  • Production insight: If you use COUNT(*) when you need COUNT(DISTINCT), you’ll overcount duplicates.

7. DISTINCT with JOIN

  • Definition: When joining tables, DISTINCT removes duplicates after the join.
  • Example:
    sql SELECT DISTINCT c.customer_id FROM customers c JOIN orders o ON c.customer_id = o.customer_id;
  • Production insight: If you don’t use DISTINCT after a JOIN, you might count the same customer multiple times (e.g., if they placed multiple orders).

8. DISTINCT in Subqueries

  • Definition: You can use DISTINCT inside a subquery to filter unique values.
  • Example:
    sql SELECT customer_id FROM orders WHERE product_id IN (SELECT DISTINCT product_id FROM best_sellers);
  • Production insight: Useful for filtering based on unique criteria (e.g., "Show me customers who bought any of our top 10 products").


3. Step-by-Step Hands-On Section


Prerequisites

  • A SQL database (PostgreSQL, MySQL, BigQuery, Snowflake, etc.).
  • A sample dataset (we’ll use an orders table with duplicates).

Task: Analyze Unique Customers, Products, and Shipping Locations

Step 1: Create a Sample Table with Duplicates

-- Create a sample orders table with duplicates
CREATE TABLE orders (
order_id INT,
customer_id INT,
product_id INT,
order_date DATE,
shipping_city VARCHAR(50) ); -- Insert sample data (with duplicates) INSERT INTO orders VALUES (1, 101, 501, '2023-10-01', 'New York'), (2, 102, 502, '2023-10-01', 'Chicago'), (3, 101, 501, '2023-10-02', 'New York'), -- Duplicate customer & product (4, 103, 503, '2023-10-03', 'Los Angeles'), (5, 102, 504, '2023-10-04', 'Chicago'), -- Duplicate customer (6, 104, 505, '2023-10-05', 'Seattle');

Step 2: Find Unique Customers

-- Count unique customers
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;

Expected Output:


unique_customers
----------------
4

Why? Even though customer_id=101 appears twice, COUNT(DISTINCT) counts it only once.


Step 3: List All Unique Shipping Cities

-- Get distinct shipping cities
SELECT DISTINCT shipping_city
FROM orders;

Expected Output:


shipping_city
-------------
New York
Chicago
Los Angeles
Seattle

Step 4: Find Duplicate Customer-Product Pairs

-- Find customers who bought the same product more than once
SELECT customer_id, product_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id, product_id
HAVING COUNT(*) > 1;

Expected Output:


customer_id | product_id | order_count
------------+------------+------------
101         | 501        | 2

Why? This identifies repeat purchases of the same product by the same customer.


Step 5: Count Unique Products per Customer

-- Count distinct products bought by each customer
SELECT
customer_id,
COUNT(DISTINCT product_id) AS unique_products_bought FROM orders GROUP BY customer_id;

Expected Output:


customer_id | unique_products_bought
------------+-----------------------
101         | 1
102         | 2
103         | 1
104         | 1

Why? This helps identify customer purchase diversity (e.g., do they buy multiple products or just one?).


Step 6: Optimize a Slow DISTINCT Query

Problem: Your DISTINCT query is slow on a large table.
Solution: Add an index on the column you’re deduplicating.


-- Create an index on customer_id (if it doesn't exist)
CREATE INDEX idx_customer_id ON orders(customer_id);

-- Now run the DISTINCT query (should be faster)
SELECT DISTINCT customer_id FROM orders;

Verification:
- Check query execution time before/after indexing.
- In PostgreSQL, use EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders; to see the improvement.


4. ? Production-Ready Best Practices


Performance

Index columns used in DISTINCT → Speeds up deduplication.
Use GROUP BY instead of DISTINCT for aggregations → More efficient for large datasets.
Avoid DISTINCT * → Scans the entire table; specify columns instead.
Pre-aggregate data in a data warehouse → If you frequently need unique counts, materialize them in a summary table.

Accuracy

Always use COUNT(DISTINCT) when counting unique valuesCOUNT(column) will overcount duplicates.
Check for NULLsCOUNT(DISTINCT) ignores NULLs; if you need to count them, use COUNT(*) or COUNT(COALESCE(column, 'default')).
Validate results with GROUP BY → If COUNT(DISTINCT) seems off, cross-check with:
sql
SELECT column, COUNT(*) FROM table GROUP BY column;

Readability

Use column aliases for claritySELECT COUNT(DISTINCT customer_id) AS unique_customers.
Comment complex DISTINCT logic → Explain why deduplication is needed.
Avoid DISTINCT as a "just in case" fix → If you’re using it to hide duplicates, fix the data source instead.

Cost Optimization (Cloud Data Warehouses)

In BigQuery/Snowflake, COUNT(DISTINCT) is expensive → Use approximate functions (APPROX_COUNT_DISTINCT) for large datasets.
Partition tables by date → If querying recent data, limit scans with WHERE date > '2023-10-01'.


5. ⚠️ Common Mistakes & Traps

Mistake Symptom Fix/Prevention
Using COUNT(column) instead of COUNT(DISTINCT column) Overcounting duplicates (e.g., counting the same customer multiple times). Always use COUNT(DISTINCT) when counting unique values.
Forgetting DISTINCT after a JOIN Duplicate rows in results (e.g., same customer appearing multiple times). Add DISTINCT after joins if you only need unique values.
Using DISTINCT on a large table without indexing Slow query performance (full table scan). Index the column(s) used in DISTINCT.
Assuming DISTINCT removes all duplicates If duplicates exist across multiple columns, DISTINCT on one column won’t catch them. Use DISTINCT on all relevant columns or GROUP BY.
Using DISTINCT when GROUP BY is better Inefficient queries (e.g., SELECT DISTINCT customer_id, SUM(amount) FROM orders is invalid). Use GROUP BY for aggregations.


6. ? Exam/Certification Focus


Typical Question Patterns

  1. "What’s the difference between COUNT(*) and COUNT(DISTINCT column)?"
  2. Answer: COUNT(*) counts all rows; COUNT(DISTINCT column) counts unique non-NULL values.

  3. "How would you find duplicate customer IDs in a table?"

  4. Answer:
    sql
    SELECT customer_id, COUNT(*)
    FROM customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1;

  5. "Which is faster: DISTINCT or GROUP BY for deduplication?"

  6. Answer: DISTINCT is usually faster for simple deduplication, but GROUP BY is better for aggregations.

  7. "What happens if you use DISTINCT on a column with NULLs?"

  8. Answer: DISTINCT treats NULLs as a single value (e.g., SELECT DISTINCT NULL, NULL returns one row).

⚠️ Trap Distinctions

Concept Trap Correct Approach
COUNT(DISTINCT) Forgetting that it ignores NULLs. Use COUNT(*) or COUNT(COALESCE(column, 'default')) if NULLs matter.
DISTINCT vs. GROUP BY Using DISTINCT for aggregations (e.g., SELECT DISTINCT customer_id, SUM(amount)). Use GROUP BY for aggregations.
DISTINCT on multiple columns Assuming DISTINCT column1 removes duplicates in column2. Use DISTINCT column1, column2 to deduplicate combinations.

Scenario-Based Question

"You need to count the number of unique visitors to a website per day. The visits table has visitor_id and visit_date. Which query is correct?"

Incorrect:


SELECT visit_date, COUNT(visitor_id) AS unique_visitors
FROM visits
GROUP BY visit_date;

(Counts all visits, not unique visitors.)

Correct:


SELECT visit_date, COUNT(DISTINCT visitor_id) AS unique_visitors
FROM visits
GROUP BY visit_date;

(Counts unique visitors per day.)


7. ? Hands-On Challenge (with Solution)


Challenge

You have a logins table with user_id and login_time. Write a query to: 1. Count the total number of logins.
2. Count the number of unique users who logged in.

Solution:


-- Total logins
SELECT COUNT(*) AS total_logins FROM logins;

-- Unique users
SELECT COUNT(DISTINCT user_id) AS unique_users FROM logins;

Why it works:
- COUNT(*) counts all rows (logins).
- COUNT(DISTINCT user_id) counts unique users (ignoring duplicates).


8. ? Rapid-Reference Crib Sheet

Command Purpose Example
SELECT DISTINCT column Remove duplicates from a column. SELECT DISTINCT city FROM users;
SELECT DISTINCT col1, col2 Remove duplicate combinations. SELECT DISTINCT user_id, product_id FROM orders;
COUNT(DISTINCT column) Count unique values in a column. SELECT COUNT(DISTINCT user_id) FROM logins;
COUNT(*) Count all rows. SELECT COUNT(*) FROM orders;
COUNT(column) Count non-NULL values in a column. SELECT COUNT(email) FROM users;
GROUP BY column Group rows by a column (for aggregations). SELECT user_id, COUNT(*) FROM orders GROUP BY user_id;
HAVING COUNT(*) > 1 Filter groups with duplicates. SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 1;
⚠️ DISTINCT is not a function SELECT DISTINCT (column) is the same as SELECT DISTINCT column. Avoid parentheses.
⚠️ COUNT(DISTINCT) ignores NULLs If you need to count NULLs, use COUNT(*) or COUNT(COALESCE(column, 'default')).


9. ? Where to Go Next

  1. PostgreSQL DISTINCT Documentation – Official docs on DISTINCT behavior.
  2. BigQuery COUNT(DISTINCT) Optimization – Learn about APPROX_COUNT_DISTINCT for large datasets.
  3. SQLZoo DISTINCT Tutorial – Interactive exercises.
  4. "SQL Performance Explained" (Book) – Covers indexing and query optimization (including DISTINCT).

Final Thought

DISTINCT and COUNT(DISTINCT) are simple but powerful tools. Master them, and you’ll: ✔ Avoid embarrassing data mistakes (e.g., overcounting customers).
Write faster queries (by indexing and optimizing).
Impress stakeholders with accurate, deduplicated reports.

Now go apply this to your own data—find a table with duplicates and practice! ?



ADVERTISEMENT