By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Window Functions, Complex Joins, Query Optimization
As an FDE, you’ll often land in environments where the customer’s data is messy, undocumented, or locked behind security constraints (e.g., air-gapped networks, classified systems). SQL is your Swiss Army knife—whether you’re debugging a failed ML pipeline during a disaster response, optimizing a query that’s grinding a customer’s production database to a halt, or stitching together disparate datasets to answer a high-stakes question (e.g., "Why did our supply chain model fail during the hurricane?").Example: You’re on-site at a defense contractor when their logistics dashboard crashes during a live exercise. The customer’s DBA is on vacation, and the only way to fix it is to rewrite a 10-minute query that’s now timing out. You need to diagnose the bottleneck, refactor with window functions, and push a hotfix—all while the customer watches over your shoulder.
Window Functions (OVER()) Perform calculations across a set of rows related to the current row (e.g., running totals, rankings) without collapsing the result set like GROUP BY. Critical for time-series analysis, cohort tracking, or anomaly detection in constrained environments (e.g., "Show me the 30-day rolling average of equipment failures by base").
OVER()
GROUP BY
Common Table Expressions (CTEs, WITH clauses) Temporary result sets you can reference within a query. Useful for breaking down complex logic into readable steps (e.g., "First filter to active users, then join to their purchase history"). Field tip: CTEs are often faster than subqueries in PostgreSQL and Redshift.
WITH
Self-Joins Joining a table to itself (e.g., "Find all employees who share the same manager"). Common in hierarchical data (org charts, supply chains) or when comparing rows (e.g., "Show me all pairs of sensors that reported anomalies within 5 minutes of each other").
LATERAL Joins A subquery in the FROM clause that can reference columns from preceding tables (e.g., "For each customer, fetch their top 3 orders by value"). Tools: PostgreSQL, Redshift, and BigQuery support this; MySQL does not.
FROM
Query Execution Plan (EXPLAIN) The database’s "recipe" for running your query. Field command: sql EXPLAIN ANALYZE SELECT * FROM large_table WHERE slow_condition; Look for Seq Scan (bad) vs. Index Scan (good), and high-cost operations (e.g., Hash Join on a 100M-row table).
EXPLAIN
sql EXPLAIN ANALYZE SELECT * FROM large_table WHERE slow_condition;
Seq Scan
Index Scan
Hash Join
Indexing Strategies
WHERE date > '2023-01-01'
CREATE INDEX idx_name ON table (col1, col2)
Partial indexes: Indexes on a subset of data (e.g., CREATE INDEX idx_active_users ON users (id) WHERE is_active = true). Saves space and speeds up queries.
CREATE INDEX idx_active_users ON users (id) WHERE is_active = true
Materialized Views Pre-computed query results stored as tables. Use case: When a complex query (e.g., "Monthly revenue by region") runs too slowly, materialize it and refresh nightly. Tools: PostgreSQL (CREATE MATERIALIZED VIEW), Redshift (CREATE TABLE AS).
CREATE MATERIALIZED VIEW
CREATE TABLE AS
Partitioning Splitting large tables into smaller, manageable chunks (e.g., by date). Field example: A 500M-row logs table partitioned by date lets you query only the last 7 days instead of scanning everything. Tools: PostgreSQL (PARTITION BY), BigQuery (automatic partitioning).
date
PARTITION BY
CTEs vs. Temp Tables
Temp tables: Persist for the session. Field use case: When you need to reuse a result set across multiple queries (e.g., "First clean the data, then run 3 different analyses").
NULL Handling NULL ≠ 0 or ''. Field trap: WHERE column = NULL is always false; use WHERE column IS NULL. For aggregations, COUNT(column) ignores NULL, but COUNT(*) doesn’t.
NULL
0
''
WHERE column = NULL
WHERE column IS NULL
COUNT(column)
COUNT(*)
Query Optimization Workflow
EXPLAIN ANALYZE
SELECT *
WHERE
JOIN
ORDER BY
Test: Validate in the customer’s environment—their data distribution may differ from your lab.
Air-Gapped SQL Tools
sql -- Load a CSV and query it directly SELECT * FROM 'data.csv';
pg_dump
pg_restore
Scenario: The customer’s dashboard is timing out. They say, "The query used to work, but now it’s slow." Actions:1. SSH into the bastion host (or RDP into their jump box) and connect to the database: bash psql -h customer-db.internal -U readonly_user -d analytics 2. Reproduce the issue: Run the query with EXPLAIN ANALYZE: sql EXPLAIN ANALYZE SELECT * FROM large_table WHERE complex_condition; - Look for Seq Scan (full table scan) or Hash Join on large tables. - Note the actual time and rows—are they way off from estimated? 3. Check for locks: If the query is hanging, run: sql SELECT * FROM pg_locks WHERE relation = 'large_table'::regclass; - If another process is locking the table, you’ll need to kill it or wait.
bash psql -h customer-db.internal -U readonly_user -d analytics
sql EXPLAIN ANALYZE SELECT * FROM large_table WHERE complex_condition;
actual time
rows
estimated
sql SELECT * FROM pg_locks WHERE relation = 'large_table'::regclass;
Goal: Reduce runtime from 10 minutes to <30 seconds.Actions:1. Break it into CTEs for readability and debugging: sql WITH filtered_data AS ( SELECT * FROM large_table WHERE date > '2023-01-01' ), joined_data AS ( SELECT f.*, d.dimension_value FROM filtered_data f JOIN dimensions d ON f.dimension_id = d.id ) SELECT * FROM joined_data ORDER BY date DESC LIMIT 1000; 2. Replace SELECT * with explicit columns: sql SELECT f.id, f.date, d.dimension_value -- Only what you need 3. Add a window function to avoid a self-join: ```sql -- Before: Slow self-join to get "previous value" SELECT a.id, a.value, b.value AS prev_value FROM data a LEFT JOIN data b ON a.id = b.id + 1;
sql WITH filtered_data AS ( SELECT * FROM large_table WHERE date > '2023-01-01' ), joined_data AS ( SELECT f.*, d.dimension_value FROM filtered_data f JOIN dimensions d ON f.dimension_id = d.id ) SELECT * FROM joined_data ORDER BY date DESC LIMIT 1000;
sql SELECT f.id, f.date, d.dimension_value -- Only what you need
-- After: Fast window function SELECT id, value, LAG(value) OVER (ORDER BY id) AS prev_value FROM data; 4. Optimize joins: - Ensure join columns are indexed. - Replace `LEFT JOIN` with `INNER JOIN` if you don’t need `NULL` rows. - Use `EXISTS` instead of `JOIN` for filtering:sql -- Before: Slow SELECT * FROM users u JOIN orders o ON u.id = o.user_id WHERE o.status = 'completed';
4. Optimize joins: - Ensure join columns are indexed. - Replace `LEFT JOIN` with `INNER JOIN` if you don’t need `NULL` rows. - Use `EXISTS` instead of `JOIN` for filtering:
-- After: Faster SELECT * FROM users u WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.status = 'completed' ); ```
Actions:1. Run the new query with EXPLAIN ANALYZE and compare the actual time to the old query.2. Check the results match (e.g., COUNT(*) should be the same): ```sql -- Old query SELECT COUNT(*) FROM (old_query) t;
-- New query SELECT COUNT(*) FROM (new_query) t; 3. Test with the customer’s data volume:sql -- Simulate a large dataset EXPLAIN ANALYZE SELECT * FROM new_query LIMIT 100000; ```
3. Test with the customer’s data volume:
Actions:1. Backup the original query (e.g., save it to a file): bash pg_dump -t original_query_table -f backup.sql 2. Update the dashboard’s SQL (e.g., edit the query in Metabase, Tableau, or the customer’s custom app).3. Monitor for regressions: sql -- Check for long-running queries SELECT pid, query, now() - query_start AS duration FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '1 minute';
bash pg_dump -t original_query_table -f backup.sql
sql -- Check for long-running queries SELECT pid, query, now() - query_start AS duration FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '1 minute';
Actions:1. Write a 1-pager with: - The old query vs. new query. - Performance metrics (before/after). - Any assumptions (e.g., "This assumes the date column is indexed").2. Add a comment in the code (e.g., in Metabase or the dashboard’s SQL editor): sql -- OPTIMIZED 2023-10-01: Replaced self-join with LAG() window function. -- Performance: 10m → 15s. Assumes `date` is indexed. 3. Schedule a follow-up to check for new bottlenecks (e.g., "Let’s review this in 2 weeks to see if usage patterns have changed").
sql -- OPTIMIZED 2023-10-01: Replaced self-join with LAG() window function. -- Performance: 10m → 15s. Assumes `date` is indexed.
INSERT
UPDATE
OFFSET
WHERE id > last_seen_id LIMIT 100
Interviewer: "The customer’s CEO is in the room and demands a report on last quarter’s sales by region—now. The query is timing out. What do you do?" Answer:1. Ask for the business question (not the SQL). "What decision are they trying to make?" (e.g., "Are they looking for underperforming regions to cut costs?").2. Narrow the scope: "Can we start with just North America?" (Reduces data volume.) 3. Use a materialized view or temp table: "I’ll pre-aggregate the data for the last 3 quarters and cache it." 4. Deliver a partial answer fast: "Here’s the top 5 regions by revenue—let me refine the rest."
Why this works: Shows you prioritize business impact over technical perfection.
War Story: You deploy a query to production, and it runs 100x slower than in staging. The customer is furious.Root Cause: Staging had 10K rows; production has 100M. The query uses a CROSS JOIN that worked fine in dev but explodes in prod.Lesson:- Always test with production-scale data (or a sample that matches the distribution).- Add query timeouts in production (e.g., SET statement_timeout = '30s' in PostgreSQL).- Log slow queries (e.g., PostgreSQL’s log_min_duration_statement).
CROSS JOIN
SET statement_timeout = '30s'
log_min_duration_statement
Interviewer: "The customer’s DBA says they can’t add indexes because it’ll break their replication. How do you optimize the query?" Answer:1. Rewrite to avoid full scans: Use EXISTS instead of JOIN, or filter early with CTEs.2. Partition the table: If it’s not already partitioned, ask if you can add a date-based partition.3. Pre-aggregate data: Materialize a daily summary table and query that instead.4. Use a read replica: If the main DB is locked down, ask if you can query a replica with indexes.
EXISTS
Why this works: Shows you can work within constraints (a core FDE skill).
Answer: Check if the WHERE clause’s column is indexed. If not, add an index (or rewrite the query to use an existing index).Explanation: Seq Scan means the database is scanning every row—indexes are the fastest way to fix this.
Answer: Use a window function with COUNT over a date range:
COUNT
WITH user_purchases AS ( SELECT user_id, purchase_date, COUNT(*) OVER ( PARTITION BY user_id ORDER BY purchase_date RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW ) AS purchases_last_30_days, COUNT(*) OVER ( PARTITION BY user_id ORDER BY purchase_date RANGE BETWEEN INTERVAL '60 days' PRECEDING AND INTERVAL '30 days' PRECEDING ) AS purchases_prev_30_days FROM purchases ) SELECT DISTINCT user_id FROM user_purchases WHERE purchases_last_30_days > 0 AND purchases_prev_30_days > 0;
Explanation: Window functions avoid expensive self-joins and are easier to optimize.
Answer: DuckDB or SQLite. Both are lightweight, embeddable, and can query CSV files directly without a server.Explanation: DuckDB is faster for analytical queries, while SQLite is more widely supported.
ROW_NUMBER()
RANK()
DENSE_RANK()
LAG(col, 1)
LEAD(col, 1)
SUM(col) OVER (PARTITION BY group ORDER BY sort): Running total.
SUM(col) OVER (PARTITION BY group ORDER BY sort)
Indexing rules:
⚠️ Avoid indexing columns with low cardinality (e.g., gender).
gender
Query optimization commands:
EXPLAIN ANALYZE SELECT ...
EXPLAIN FORMAT=JSON SELECT ...
BigQuery: EXPLAIN ANALYZE SELECT ...
Partitioning syntax: ```sql -- PostgreSQL CREATE TABLE logs ( id SERIAL, date TIMESTAMP, message TEXT ) PARTITION BY RANGE (date);
-- Create a partition for each month CREATE TABLE logs_2023_10 PARTITION OF logs FOR VALUES FROM ('2023-10-01') TO ('2023-11-01'); ```
-- Refresh it REFRESH MATERIALIZED VIEW mv_daily_sales; ```
5432
3306
1433
⚠️ Always confirm with netstat -tulnp or ss -tulnp on the customer’s server.
netstat -tulnp
ss -tulnp
Air-gapped data transfer:
scp
rsync
--progress
split
Databases: pg_dump/pg_restore or mysqldump.
mysqldump
Field traps:
IS NOT DISTINCT FROM
⚠️ Timezones matter! Always use TIMESTAMP WITH TIME ZONE or AT TIME ZONE.
TIMESTAMP WITH TIME ZONE
AT TIME ZONE
Quick fixes for slow queries:
LIMIT 1000
Pre-aggregate data with a materialized view.
Acronyms to know:
Join 4M+ learners. Unlock unlimited quizzes, wrong-answer tracking, flashcards + reminders, study guides, and 1-on-1 challenges.