By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
You’re analyzing customer data for an e-commerce company. Your boss asks: - "How many unique customers made a purchase last month?" - "Are there duplicate product IDs in our inventory table?" - "What’s the count of distinct shipping cities in our orders?"
If you don’t know how to answer these fast and accurately, you’ll either: - Waste time manually deduplicating data in Excel (error-prone).- Misreport metrics (e.g., counting the same customer twice).- Slow down queries by fetching unnecessary rows.
DISTINCT and COUNT(DISTINCT) are your deduplication superpowers in SQL. They let you: ✅ Count unique values (not just rows).✅ Remove duplicates from query results.✅ Optimize performance by reducing data scanned.
Real-world scenario:You inherit a messy sales database where the same customer appears multiple times due to tracking errors. Your job is to: 1. Find the true number of unique customers.2. Identify duplicate product entries.3. Report distinct shipping regions for logistics planning.
If you don’t use DISTINCT or COUNT(DISTINCT), your reports will be wrong, and your stakeholders will lose trust in your data.
DISTINCT
COUNT(DISTINCT)
COUNT(DISTINCT column)
COUNT(column)
GROUP BY
SUM
AVG
SELECT DISTINCT customer_id, product_id FROM orders
COUNT(*)
column
JOIN
sql SELECT DISTINCT c.customer_id FROM customers c JOIN orders o ON c.customer_id = o.customer_id;
sql SELECT customer_id FROM orders WHERE product_id IN (SELECT DISTINCT product_id FROM best_sellers);
orders
-- Create a sample orders table with duplicates CREATE TABLE orders ( order_id INT, customer_id INT, product_id INT, order_date DATE, shipping_city VARCHAR(50) ); -- Insert sample data (with duplicates) INSERT INTO orders VALUES (1, 101, 501, '2023-10-01', 'New York'), (2, 102, 502, '2023-10-01', 'Chicago'), (3, 101, 501, '2023-10-02', 'New York'), -- Duplicate customer & product (4, 103, 503, '2023-10-03', 'Los Angeles'), (5, 102, 504, '2023-10-04', 'Chicago'), -- Duplicate customer (6, 104, 505, '2023-10-05', 'Seattle');
-- Count unique customers SELECT COUNT(DISTINCT customer_id) AS unique_customers FROM orders;
Expected Output:
unique_customers ---------------- 4
Why? Even though customer_id=101 appears twice, COUNT(DISTINCT) counts it only once.
customer_id=101
-- Get distinct shipping cities SELECT DISTINCT shipping_city FROM orders;
shipping_city ------------- New York Chicago Los Angeles Seattle
-- Find customers who bought the same product more than once SELECT customer_id, product_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id, product_id HAVING COUNT(*) > 1;
customer_id | product_id | order_count ------------+------------+------------ 101 | 501 | 2
Why? This identifies repeat purchases of the same product by the same customer.
-- Count distinct products bought by each customer SELECT customer_id, COUNT(DISTINCT product_id) AS unique_products_bought FROM orders GROUP BY customer_id;
customer_id | unique_products_bought ------------+----------------------- 101 | 1 102 | 2 103 | 1 104 | 1
Why? This helps identify customer purchase diversity (e.g., do they buy multiple products or just one?).
Problem: Your DISTINCT query is slow on a large table.Solution: Add an index on the column you’re deduplicating.
-- Create an index on customer_id (if it doesn't exist) CREATE INDEX idx_customer_id ON orders(customer_id); -- Now run the DISTINCT query (should be faster) SELECT DISTINCT customer_id FROM orders;
Verification:- Check query execution time before/after indexing.- In PostgreSQL, use EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders; to see the improvement.
EXPLAIN ANALYZE SELECT DISTINCT customer_id FROM orders;
✅ Index columns used in DISTINCT → Speeds up deduplication.✅ Use GROUP BY instead of DISTINCT for aggregations → More efficient for large datasets.✅ Avoid DISTINCT * → Scans the entire table; specify columns instead.✅ Pre-aggregate data in a data warehouse → If you frequently need unique counts, materialize them in a summary table.
DISTINCT *
✅ Always use COUNT(DISTINCT) when counting unique values → COUNT(column) will overcount duplicates.✅ Check for NULLs → COUNT(DISTINCT) ignores NULLs; if you need to count them, use COUNT(*) or COUNT(COALESCE(column, 'default')).✅ Validate results with GROUP BY → If COUNT(DISTINCT) seems off, cross-check with: sql SELECT column, COUNT(*) FROM table GROUP BY column;
COUNT(COALESCE(column, 'default'))
sql SELECT column, COUNT(*) FROM table GROUP BY column;
✅ Use column aliases for clarity → SELECT COUNT(DISTINCT customer_id) AS unique_customers.✅ Comment complex DISTINCT logic → Explain why deduplication is needed.✅ Avoid DISTINCT as a "just in case" fix → If you’re using it to hide duplicates, fix the data source instead.
SELECT COUNT(DISTINCT customer_id) AS unique_customers
✅ In BigQuery/Snowflake, COUNT(DISTINCT) is expensive → Use approximate functions (APPROX_COUNT_DISTINCT) for large datasets.✅ Partition tables by date → If querying recent data, limit scans with WHERE date > '2023-10-01'.
APPROX_COUNT_DISTINCT
WHERE date > '2023-10-01'
SELECT DISTINCT customer_id, SUM(amount) FROM orders
Answer: COUNT(*) counts all rows; COUNT(DISTINCT column) counts unique non-NULL values.
"How would you find duplicate customer IDs in a table?"
Answer: sql SELECT customer_id, COUNT(*) FROM customers GROUP BY customer_id HAVING COUNT(*) > 1;
sql SELECT customer_id, COUNT(*) FROM customers GROUP BY customer_id HAVING COUNT(*) > 1;
"Which is faster: DISTINCT or GROUP BY for deduplication?"
Answer: DISTINCT is usually faster for simple deduplication, but GROUP BY is better for aggregations.
"What happens if you use DISTINCT on a column with NULLs?"
SELECT DISTINCT NULL, NULL
SELECT DISTINCT customer_id, SUM(amount)
DISTINCT column1
column2
DISTINCT column1, column2
"You need to count the number of unique visitors to a website per day. The visits table has visitor_id and visit_date. Which query is correct?"
visits
visitor_id
visit_date
❌ Incorrect:
SELECT visit_date, COUNT(visitor_id) AS unique_visitors FROM visits GROUP BY visit_date;
(Counts all visits, not unique visitors.)
✅ Correct:
SELECT visit_date, COUNT(DISTINCT visitor_id) AS unique_visitors FROM visits GROUP BY visit_date;
(Counts unique visitors per day.)
You have a logins table with user_id and login_time. Write a query to: 1. Count the total number of logins.2. Count the number of unique users who logged in.
logins
user_id
login_time
Solution:
-- Total logins SELECT COUNT(*) AS total_logins FROM logins; -- Unique users SELECT COUNT(DISTINCT user_id) AS unique_users FROM logins;
Why it works:- COUNT(*) counts all rows (logins).- COUNT(DISTINCT user_id) counts unique users (ignoring duplicates).
COUNT(DISTINCT user_id)
SELECT DISTINCT column
SELECT DISTINCT city FROM users;
SELECT DISTINCT col1, col2
SELECT DISTINCT user_id, product_id FROM orders;
SELECT COUNT(DISTINCT user_id) FROM logins;
SELECT COUNT(*) FROM orders;
SELECT COUNT(email) FROM users;
GROUP BY column
SELECT user_id, COUNT(*) FROM orders GROUP BY user_id;
HAVING COUNT(*) > 1
SELECT user_id FROM orders GROUP BY user_id HAVING COUNT(*) > 1;
SELECT DISTINCT (column)
DISTINCT and COUNT(DISTINCT) are simple but powerful tools. Master them, and you’ll: ✔ Avoid embarrassing data mistakes (e.g., overcounting customers).✔ Write faster queries (by indexing and optimizing).✔ Impress stakeholders with accurate, deduplicated reports.
Now go apply this to your own data—find a table with duplicates and practice! ?
Join 4M+ learners. Unlock unlimited quizzes, wrong-answer tracking, flashcards + reminders, study guides, and 1-on-1 challenges.