Fatskills
Practice. Master. Repeat.
Study Guide: TECH **SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide**
Source: https://www.fatskills.com/introdution-to-engineering/chapter/tech-sql-for-data-analysis-distinct-countdistinct-zero-fluff-study-guide

TECH SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~9 min read

SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide

1. What This Is & Why It Matters

You’re analyzing customer data for an e-commerce company. Your boss asks: - "How many unique customers made a purchase last month?" - "Are there duplicate product IDs in our inventory table?" - "What’s the count of distinct shipping cities in our orders?"

If you don’t know how to answer these fast and accurately, you’ll either: - Waste time manually deduplicating data in Excel (error-prone).
- Misreport metrics (e.g., counting the same customer twice).
- Slow down queries by fetching unnecessary rows.

DISTINCT and COUNT(DISTINCT) are your deduplication superpowers in SQL. They let you: ✅ Count unique values (not just rows).
✅ Remove duplicates from query results.
✅ Optimize performance by reducing data scanned.

Real-world scenario:
You inherit a messy sales database where the same customer appears multiple times due to tracking errors. Your job is to: 1. Find the true number of unique customers.
2. Identify duplicate product entries.
3. Report distinct shipping regions for logistics planning.

If you don’t use DISTINCT or COUNT(DISTINCT), your reports will be wrong, and your stakeholders will lose trust in your data.

2. Core Concepts & Components

1. `DISTINCT`

Definition: Returns unique values from a column or set of columns (removes duplicates).
Production insight: If you don’t use DISTINCT when needed, you’ll overcount (e.g., counting the same customer multiple times in a sales report).

2. `COUNT(DISTINCT column)`

Definition: Counts the number of unique values in a column (ignores duplicates).
Production insight: If you use COUNT(column) instead of COUNT(DISTINCT column), you’ll misreport metrics (e.g., counting duplicate orders as unique).

3. `DISTINCT` vs. `GROUP BY`

DISTINCT → Removes duplicates from the result set.
GROUP BY → Groups rows by a column and applies aggregations (e.g., SUM, AVG).
Production insight: DISTINCT is faster for simple deduplication, while GROUP BY is better for aggregating data.

4. `DISTINCT` on Multiple Columns

Definition: Returns unique combinations of values across multiple columns.
Example: SELECT DISTINCT customer_id, product_id FROM orders → Finds unique customer-product pairs.
Production insight: Useful for identifying duplicate transactions (e.g., same customer buying the same product multiple times).

5. Performance Impact of `DISTINCT`

Definition: DISTINCT requires sorting and deduplication, which can slow down queries on large tables.
Production insight: If your table has millions of rows, DISTINCT can kill performance. Consider:
Indexing the column(s) you’re deduplicating.
Pre-aggregating data in a data warehouse (e.g., using GROUP BY instead).

**6. `COUNT(*)` vs. `COUNT(column)` vs. `COUNT(DISTINCT column)`**

Function	Counts	Ignores NULLs?	Use Case
`COUNT(*)`	All rows	❌ No	Total row count
`COUNT(column)`	Non-NULL values in `column`	✅ Yes	Count of non-NULL entries
`COUNT(DISTINCT column)`	Unique non-NULL values in `column`	✅ Yes	Count of unique entries

Production insight: If you use COUNT(*) when you need COUNT(DISTINCT), you’ll overcount duplicates.

7. `DISTINCT` with `JOIN`

Definition: When joining tables, DISTINCT removes duplicates after the join.
Example:
sql SELECT DISTINCT c.customer_id FROM customers c JOIN orders o ON c.customer_id = o.customer_id;
Production insight: If you don’t use DISTINCT after a JOIN, you might count the same customer multiple times (e.g., if they placed multiple orders).

8. `DISTINCT` in Subqueries

Definition: You can use DISTINCT inside a subquery to filter unique values.
Example:
sql SELECT customer_id FROM orders WHERE product_id IN (SELECT DISTINCT product_id FROM best_sellers);
Production insight: Useful for filtering based on unique criteria (e.g., "Show me customers who bought any of our top 10 products").

3. Step-by-Step Hands-On Section

Prerequisites

A SQL database (PostgreSQL, MySQL, BigQuery, Snowflake, etc.).
A sample dataset (we’ll use an orders table with duplicates).

Task: Analyze Unique Customers, Products, and Shipping Locations

Step 1: Create a Sample Table with Duplicates

-- Create a sample orders table with duplicates
CREATE TABLE orders (
    order_id INT,
    customer_id INT,
    product_id INT,
    order_date DATE,
    shipping_city VARCHAR(50)
);

-- Insert sample data (with duplicates)
INSERT INTO orders VALUES
(1, 101, 501, '2023-10-01', 'New York'),
(2, 102, 502, '2023-10-01', 'Chicago'),
(3, 101, 501, '2023-10-02', 'New York'),  -- Duplicate customer & product
(4, 103, 503, '2023-10-03', 'Los Angeles'),
(5, 102, 504, '2023-10-04', 'Chicago'),   -- Duplicate customer
(6, 104, 505, '2023-10-05', 'Seattle');

Step 2: Find Unique Customers

-- Count unique customers
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;

Expected Output:

unique_customers
----------------
4

Why? Even though customer_id=101 appears twice, COUNT(DISTINCT) counts it only once.

Step 3: List All Unique Shipping Cities

-- Get distinct shipping cities
SELECT DISTINCT shipping_city
FROM orders;

Expected Output:

shipping_city
-------------
New York
Chicago
Los Angeles
Seattle

Step 4: Find Duplicate Customer-Product Pairs

-- Find customers who bought the same product more than once
SELECT customer_id, product_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id, product_id
HAVING COUNT(*) > 1;

Expected Output:

customer_id | product_id | order_count
------------+------------+------------
101         | 501        | 2

Why? This identifies repeat purchases of the same product by the same customer.

Step 5: Count Unique Products per Customer

-- Count distinct products bought by each customer
SELECT
    customer_id,
    COUNT(DISTINCT product_id) AS unique_products_bought
FROM orders
GROUP BY customer_id;

Expected Output:

customer_id | unique_products_bought
------------+-----------------------
101         | 1
102         | 2
103         | 1
104         | 1

Why? This helps identify customer purchase diversity (e.g., do they buy multiple products or just one?).

Step 6: Optimize a Slow `DISTINCT` Query

Problem: Your DISTINCT query is slow on a large table.
Solution: Add an index on the column you’re deduplicating.

-- Create an index on customer_id (if it doesn't exist)
CREATE INDEX idx_customer_id ON orders(customer_id);

-- Now run the DISTINCT query (should be faster)
SELECT DISTINCT customer_id FROM orders;

Verification:
- Check query execution time before/after indexing.
- In PostgreSQL, use EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders; to see the improvement.

4. ? Production-Ready Best Practices

Performance

✅ Index columns used in DISTINCT → Speeds up deduplication.
✅ Use GROUP BY instead of DISTINCT for aggregations → More efficient for large datasets.
✅ Avoid DISTINCT * → Scans the entire table; specify columns instead.
✅ Pre-aggregate data in a data warehouse → If you frequently need unique counts, materialize them in a summary table.

Accuracy

✅ Always use COUNT(DISTINCT) when counting unique values → COUNT(column) will overcount duplicates.
✅ Check for NULLs → COUNT(DISTINCT) ignores NULLs; if you need to count them, use COUNT(*) or COUNT(COALESCE(column, 'default')).
✅ Validate results with GROUP BY → If COUNT(DISTINCT) seems off, cross-check with:
sql SELECT column, COUNT(*) FROM table GROUP BY column;

Readability

✅ Use column aliases for clarity → SELECT COUNT(DISTINCT customer_id) AS unique_customers.
✅ Comment complex DISTINCT logic → Explain why deduplication is needed.
✅ Avoid DISTINCT as a "just in case" fix → If you’re using it to hide duplicates, fix the data source instead.

Cost Optimization (Cloud Data Warehouses)

✅ In BigQuery/Snowflake, COUNT(DISTINCT) is expensive → Use approximate functions (APPROX_COUNT_DISTINCT) for large datasets.
✅ Partition tables by date → If querying recent data, limit scans with WHERE date > '2023-10-01'.

5. ⚠️ Common Mistakes & Traps

Mistake	Symptom	Fix/Prevention
Using `COUNT(column)` instead of `COUNT(DISTINCT column)`	Overcounting duplicates (e.g., counting the same customer multiple times).	Always use `COUNT(DISTINCT)` when counting unique values.
Forgetting `DISTINCT` after a `JOIN`	Duplicate rows in results (e.g., same customer appearing multiple times).	Add `DISTINCT` after joins if you only need unique values.
Using `DISTINCT` on a large table without indexing	Slow query performance (full table scan).	Index the column(s) used in `DISTINCT`.
Assuming `DISTINCT` removes all duplicates	If duplicates exist across multiple columns, `DISTINCT` on one column won’t catch them.	Use `DISTINCT` on all relevant columns or `GROUP BY`.
Using `DISTINCT` when `GROUP BY` is better	Inefficient queries (e.g., `SELECT DISTINCT customer_id, SUM(amount) FROM orders` is invalid).	Use `GROUP BY` for aggregations.

6. ? Exam/Certification Focus

Typical Question Patterns

"What’s the difference between COUNT(*) and COUNT(DISTINCT column)?"
Answer: COUNT(*) counts all rows; COUNT(DISTINCT column) counts unique non-NULL values.
"How would you find duplicate customer IDs in a table?"
Answer:
sql SELECT customer_id, COUNT(*) FROM customers GROUP BY customer_id HAVING COUNT(*) > 1;
"Which is faster: DISTINCT or GROUP BY for deduplication?"
Answer: DISTINCT is usually faster for simple deduplication, but GROUP BY is better for aggregations.
"What happens if you use DISTINCT on a column with NULLs?"
Answer: DISTINCT treats NULLs as a single value (e.g., SELECT DISTINCT NULL, NULL returns one row).

⚠️ Trap Distinctions

Concept	Trap	Correct Approach
`COUNT(DISTINCT)`	Forgetting that it ignores NULLs.	Use `COUNT(*)` or `COUNT(COALESCE(column, 'default'))` if NULLs matter.
`DISTINCT` vs. `GROUP BY`	Using `DISTINCT` for aggregations (e.g., `SELECT DISTINCT customer_id, SUM(amount)`).	Use `GROUP BY` for aggregations.
`DISTINCT` on multiple columns	Assuming `DISTINCT column1` removes duplicates in `column2`.	Use `DISTINCT column1, column2` to deduplicate combinations.

Scenario-Based Question

"You need to count the number of unique visitors to a website per day. The visits table has visitor_id and visit_date. Which query is correct?"

❌ Incorrect:

SELECT visit_date, COUNT(visitor_id) AS unique_visitors
FROM visits
GROUP BY visit_date;

(Counts all visits, not unique visitors.)

✅ Correct:

SELECT visit_date, COUNT(DISTINCT visitor_id) AS unique_visitors
FROM visits
GROUP BY visit_date;

(Counts unique visitors per day.)

7. ? Hands-On Challenge (with Solution)

Challenge

You have a logins table with user_id and login_time. Write a query to: 1. Count the total number of logins.
2. Count the number of unique users who logged in.

Solution:

-- Total logins
SELECT COUNT(*) AS total_logins FROM logins;

-- Unique users
SELECT COUNT(DISTINCT user_id) AS unique_users FROM logins;

Why it works:
- COUNT(*) counts all rows (logins).
- COUNT(DISTINCT user_id) counts unique users (ignoring duplicates).

8. ? Rapid-Reference Crib Sheet

Command	Purpose	Example
`SELECT DISTINCT column`	Remove duplicates from a column.	`SELECT DISTINCT city FROM users;`
`SELECT DISTINCT col1, col2`	Remove duplicate combinations.	`SELECT DISTINCT user_id, product_id FROM orders;`
`COUNT(DISTINCT column)`	Count unique values in a column.	`SELECT COUNT(DISTINCT user_id) FROM logins;`
`COUNT(*)`	Count all rows.	`SELECT COUNT(*) FROM orders;`
`COUNT(column)`	Count non-NULL values in a column.	`SELECT COUNT(email) FROM users;`
`GROUP BY column`	Group rows by a column (for aggregations).	`SELECT user_id, COUNT(*) FROM orders GROUP BY user_id;`
`HAVING COUNT(*) > 1`	Filter groups with duplicates.	`SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 1;`
⚠️ `DISTINCT` is not a function	`SELECT DISTINCT (column)` is the same as `SELECT DISTINCT column`.	Avoid parentheses.
⚠️ `COUNT(DISTINCT)` ignores NULLs	If you need to count NULLs, use `COUNT(*)` or `COUNT(COALESCE(column, 'default'))`.

9. ? Where to Go Next

PostgreSQL DISTINCT Documentation – Official docs on DISTINCT behavior.
BigQuery COUNT(DISTINCT) Optimization – Learn about APPROX_COUNT_DISTINCT for large datasets.
SQLZoo DISTINCT Tutorial – Interactive exercises.
"SQL Performance Explained" (Book) – Covers indexing and query optimization (including DISTINCT).

Final Thought

DISTINCT and COUNT(DISTINCT) are simple but powerful tools. Master them, and you’ll: ✔ Avoid embarrassing data mistakes (e.g., overcounting customers).
✔ Write faster queries (by indexing and optimizing).
✔ Impress stakeholders with accurate, deduplicated reports.

Now go apply this to your own data—find a table with duplicates and practice! ?

⚡ Recently practiced quizzes in this class

Data Analytics Practice Test Big Data & Analytics NASSCOM Certification Practice Test PySpark Practice Test Questions Basic Data Analytics and Visualization Practice Test (Tableau) Data Science Glossary Data Analysis with Python Data Science Exam #1 Data Analytics and Visualization Practice Test Pega Certified System Architect (PCSA) Study Guide Data Science Basics / Data Scientist Toolbox

➡️ Next Study Guide

TECH SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide

SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide

1. What This Is & Why It Matters

2. Core Concepts & Components

1. `DISTINCT`

2. `COUNT(DISTINCT column)`

3. `DISTINCT` vs. `GROUP BY`

4. `DISTINCT` on Multiple Columns

5. Performance Impact of `DISTINCT`

**6. `COUNT(*)` vs. `COUNT(column)` vs. `COUNT(DISTINCT column)`**

7. `DISTINCT` with `JOIN`

8. `DISTINCT` in Subqueries

3. Step-by-Step Hands-On Section

Prerequisites

Task: Analyze Unique Customers, Products, and Shipping Locations

Step 1: Create a Sample Table with Duplicates

Step 2: Find Unique Customers

Step 3: List All Unique Shipping Cities

Step 4: Find Duplicate Customer-Product Pairs

Step 5: Count Unique Products per Customer

Step 6: Optimize a Slow `DISTINCT` Query

4. ? Production-Ready Best Practices

Performance

Accuracy

Readability

Cost Optimization (Cloud Data Warehouses)

5. ⚠️ Common Mistakes & Traps

6. ? Exam/Certification Focus

Typical Question Patterns

⚠️ Trap Distinctions

Scenario-Based Question

7. ? Hands-On Challenge (with Solution)

Challenge

8. ? Rapid-Reference Crib Sheet

9. ? Where to Go Next

Final Thought

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | OSHA Basics Quiz | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

TECH **SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide**

SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide

1. What This Is & Why It Matters

2. Core Concepts & Components

1. DISTINCT

2. COUNT(DISTINCT column)

3. DISTINCT vs. GROUP BY

4. DISTINCT on Multiple Columns

5. Performance Impact of DISTINCT

6. COUNT(*) vs. COUNT(column) vs. COUNT(DISTINCT column)

7. DISTINCT with JOIN

8. DISTINCT in Subqueries

3. Step-by-Step Hands-On Section

Prerequisites

Task: Analyze Unique Customers, Products, and Shipping Locations

Step 1: Create a Sample Table with Duplicates

Step 2: Find Unique Customers

Step 3: List All Unique Shipping Cities

Step 4: Find Duplicate Customer-Product Pairs

Step 5: Count Unique Products per Customer

Step 6: Optimize a Slow DISTINCT Query

4. ? Production-Ready Best Practices

Performance

Accuracy

Readability

Cost Optimization (Cloud Data Warehouses)

5. ⚠️ Common Mistakes & Traps

6. ? Exam/Certification Focus

Typical Question Patterns

⚠️ Trap Distinctions

Scenario-Based Question

7. ? Hands-On Challenge (with Solution)

Challenge

8. ? Rapid-Reference Crib Sheet

9. ? Where to Go Next

Final Thought

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | OSHA Basics Quiz | What Should We Know? Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

TECH SQL for Data Analysis: DISTINCT & COUNT(DISTINCT) – Zero-Fluff Study Guide

1. `DISTINCT`

2. `COUNT(DISTINCT column)`

3. `DISTINCT` vs. `GROUP BY`

4. `DISTINCT` on Multiple Columns

5. Performance Impact of `DISTINCT`

**6. `COUNT(*)` vs. `COUNT(column)` vs. `COUNT(DISTINCT column)`**

7. `DISTINCT` with `JOIN`

8. `DISTINCT` in Subqueries

Step 6: Optimize a Slow `DISTINCT` Query

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | OSHA Basics Quiz | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com