Fatskills
Practice. Master. Repeat.
Study Guide: Forward Deployed Engineer 101: SQL for Data Analysis (Window Functions, Complex Joins, Query Optimization)
Source: https://www.fatskills.com/forward-deployed-engineer-fde/chapter/forward-deployed-engineer-sql-for-data-analysis-window-functions-complex-joins-query-optimization

Forward Deployed Engineer 101: SQL for Data Analysis (Window Functions, Complex Joins, Query Optimization)

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~11 min read

SQL for Data Analysis (Window Functions, Complex Joins, Query Optimization)



SQL for Data Analysis: Field-Ready Study Guide for Forward Deployed Engineers (FDEs)

Window Functions, Complex Joins, Query Optimization

What This Is

As an FDE, you’ll often land in environments where the customer’s data is messy, undocumented, or locked behind security constraints (e.g., air-gapped networks, classified systems). SQL is your Swiss Army knife—whether you’re debugging a failed ML pipeline during a disaster response, optimizing a query that’s grinding a customer’s production database to a halt, or stitching together disparate datasets to answer a high-stakes question (e.g., "Why did our supply chain model fail during the hurricane?").
Example: You’re on-site at a defense contractor when their logistics dashboard crashes during a live exercise. The customer’s DBA is on vacation, and the only way to fix it is to rewrite a 10-minute query that’s now timing out. You need to diagnose the bottleneck, refactor with window functions, and push a hotfix—all while the customer watches over your shoulder.


Key Terms & Concepts

  • Window Functions (OVER())
    Perform calculations across a set of rows related to the current row (e.g., running totals, rankings) without collapsing the result set like GROUP BY. Critical for time-series analysis, cohort tracking, or anomaly detection in constrained environments (e.g., "Show me the 30-day rolling average of equipment failures by base").

  • Common Table Expressions (CTEs, WITH clauses)
    Temporary result sets you can reference within a query. Useful for breaking down complex logic into readable steps (e.g., "First filter to active users, then join to their purchase history"). Field tip: CTEs are often faster than subqueries in PostgreSQL and Redshift.

  • Self-Joins
    Joining a table to itself (e.g., "Find all employees who share the same manager"). Common in hierarchical data (org charts, supply chains) or when comparing rows (e.g., "Show me all pairs of sensors that reported anomalies within 5 minutes of each other").

  • LATERAL Joins
    A subquery in the FROM clause that can reference columns from preceding tables (e.g., "For each customer, fetch their top 3 orders by value"). Tools: PostgreSQL, Redshift, and BigQuery support this; MySQL does not.

  • Query Execution Plan (EXPLAIN)
    The database’s "recipe" for running your query. Field command:
    sql EXPLAIN ANALYZE SELECT * FROM large_table WHERE slow_condition; Look for Seq Scan (bad) vs. Index Scan (good), and high-cost operations (e.g., Hash Join on a 100M-row table).

  • Indexing Strategies

  • B-tree indexes: Default for equality/range queries (e.g., WHERE date > '2023-01-01').
  • Composite indexes: Multi-column indexes (e.g., CREATE INDEX idx_name ON table (col1, col2)). Field trap: Order matters! Put the most selective column first.
  • Partial indexes: Indexes on a subset of data (e.g., CREATE INDEX idx_active_users ON users (id) WHERE is_active = true). Saves space and speeds up queries.

  • Materialized Views
    Pre-computed query results stored as tables. Use case: When a complex query (e.g., "Monthly revenue by region") runs too slowly, materialize it and refresh nightly. Tools: PostgreSQL (CREATE MATERIALIZED VIEW), Redshift (CREATE TABLE AS).

  • Partitioning
    Splitting large tables into smaller, manageable chunks (e.g., by date). Field example: A 500M-row logs table partitioned by date lets you query only the last 7 days instead of scanning everything. Tools: PostgreSQL (PARTITION BY), BigQuery (automatic partitioning).

  • CTEs vs. Temp Tables

  • CTEs: Ephemeral, scoped to a single query. Good for readability.
  • Temp tables: Persist for the session. Field use case: When you need to reuse a result set across multiple queries (e.g., "First clean the data, then run 3 different analyses").

  • NULL Handling
    NULL0 or ''. Field trap: WHERE column = NULL is always false; use WHERE column IS NULL. For aggregations, COUNT(column) ignores NULL, but COUNT(*) doesn’t.

  • Query Optimization Workflow

  • Profile: Use EXPLAIN ANALYZE to find bottlenecks.
  • Simplify: Remove unnecessary joins, subqueries, or SELECT *.
  • Index: Add indexes for high-cardinality columns in WHERE, JOIN, or ORDER BY.
  • Cache: Materialize or pre-aggregate results if the query runs often.
  • Test: Validate in the customer’s environment—their data distribution may differ from your lab.

  • Air-Gapped SQL Tools

  • SQLite: Embeddable, zero-config, great for offline analysis (e.g., "Analyze sensor data on a disconnected laptop in the field").
  • DuckDB: In-process OLAP database, supports Parquet/CSV, and runs on a Raspberry Pi. Field command:
    sql
    -- Load a CSV and query it directly
    SELECT * FROM 'data.csv';
  • PostgreSQL: The "Swiss Army knife" of SQL databases. Field tip: Use pg_dump/pg_restore to move data between air-gapped networks.


Step-by-Step / Field Process


1. Diagnose the Problem (5-Minute Triage)

Scenario: The customer’s dashboard is timing out. They say, "The query used to work, but now it’s slow." Actions:
1. SSH into the bastion host (or RDP into their jump box) and connect to the database:
bash
psql -h customer-db.internal -U readonly_user -d analytics
2. Reproduce the issue: Run the query with EXPLAIN ANALYZE:
sql
EXPLAIN ANALYZE SELECT * FROM large_table WHERE complex_condition;

- Look for Seq Scan (full table scan) or Hash Join on large tables.
- Note the actual time and rows—are they way off from estimated? 3. Check for locks: If the query is hanging, run:
sql
SELECT * FROM pg_locks WHERE relation = 'large_table'::regclass;

- If another process is locking the table, you’ll need to kill it or wait.

2. Rewrite the Query (30-Minute Fix)

Goal: Reduce runtime from 10 minutes to <30 seconds.
Actions:
1. Break it into CTEs for readability and debugging:
sql
WITH filtered_data AS (
SELECT * FROM large_table WHERE date > '2023-01-01'
),
joined_data AS (
SELECT f.*, d.dimension_value
FROM filtered_data f
JOIN dimensions d ON f.dimension_id = d.id
)
SELECT * FROM joined_data ORDER BY date DESC LIMIT 1000;
2. Replace SELECT * with explicit columns:
sql
SELECT f.id, f.date, d.dimension_value -- Only what you need
3. Add a window function to avoid a self-join:
```sql
-- Before: Slow self-join to get "previous value"
SELECT a.id, a.value, b.value AS prev_value
FROM data a
LEFT JOIN data b ON a.id = b.id + 1;

-- After: Fast window function
SELECT
id,
value,
LAG(value) OVER (ORDER BY id) AS prev_value
FROM data;
4. Optimize joins:
- Ensure join columns are indexed.
- Replace `LEFT JOIN` with `INNER JOIN` if you don’t need `NULL` rows.
- Use `EXISTS` instead of `JOIN` for filtering:
sql
-- Before: Slow
SELECT * FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed';


 -- After: Faster
 SELECT * FROM users u
 WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.user_id = u.id AND o.status = 'completed' ); ```

3. Validate the Fix (10-Minute Test)

Actions:
1. Run the new query with EXPLAIN ANALYZE and compare the actual time to the old query.
2. Check the results match (e.g., COUNT(*) should be the same):
```sql
-- Old query
SELECT COUNT(*) FROM (old_query) t;

-- New query
SELECT COUNT(*) FROM (new_query) t;
3. Test with the customer’s data volume:sql
-- Simulate a large dataset
EXPLAIN ANALYZE SELECT * FROM new_query LIMIT 100000;
```

4. Deploy the Fix (5-Minute Push)

Actions:
1. Backup the original query (e.g., save it to a file):
bash
pg_dump -t original_query_table -f backup.sql
2. Update the dashboard’s SQL (e.g., edit the query in Metabase, Tableau, or the customer’s custom app).
3. Monitor for regressions:
sql
-- Check for long-running queries
SELECT pid, query, now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '1 minute';

5. Document for the Customer (15-Minute Handoff)

Actions:
1. Write a 1-pager with:
- The old query vs. new query.
- Performance metrics (before/after).
- Any assumptions (e.g., "This assumes the date column is indexed").
2. Add a comment in the code (e.g., in Metabase or the dashboard’s SQL editor):
sql
-- OPTIMIZED 2023-10-01: Replaced self-join with LAG() window function.
-- Performance: 10m → 15s. Assumes `date` is indexed.
3. Schedule a follow-up to check for new bottlenecks (e.g., "Let’s review this in 2 weeks to see if usage patterns have changed").


Common Mistakes

Mistake Correction Why
Overusing SELECT * Explicitly list columns. Reduces I/O, memory usage, and network transfer (critical in air-gapped or low-bandwidth environments).
Ignoring EXPLAIN Always run EXPLAIN ANALYZE before optimizing. The database’s optimizer is smarter than you—let it show you the bottlenecks.
Assuming indexes are always good Drop unused indexes (they slow down INSERT/UPDATE). Indexes add write overhead. Field trap: In high-write environments (e.g., IoT sensors), too many indexes can kill performance.
Not testing with real data Test in the customer’s environment, not your lab. Their data distribution (e.g., skew, NULL rates) may differ wildly from your synthetic data.
Using OFFSET for pagination Use keyset pagination (e.g., WHERE id > last_seen_id LIMIT 100). OFFSET scans all skipped rows, making it slow for large datasets. Keyset pagination is O(1).


FDE Interview / War Story Insights


1. The "We Need This Yesterday" Query

Interviewer: "The customer’s CEO is in the room and demands a report on last quarter’s sales by region—now. The query is timing out. What do you do?" Answer:
1. Ask for the business question (not the SQL). "What decision are they trying to make?" (e.g., "Are they looking for underperforming regions to cut costs?").
2. Narrow the scope: "Can we start with just North America?" (Reduces data volume.) 3. Use a materialized view or temp table: "I’ll pre-aggregate the data for the last 3 quarters and cache it." 4. Deliver a partial answer fast: "Here’s the top 5 regions by revenue—let me refine the rest."

Why this works: Shows you prioritize business impact over technical perfection.

2. The "This Query Worked in Dev" Problem

War Story: You deploy a query to production, and it runs 100x slower than in staging. The customer is furious.
Root Cause: Staging had 10K rows; production has 100M. The query uses a CROSS JOIN that worked fine in dev but explodes in prod.
Lesson:
- Always test with production-scale data (or a sample that matches the distribution).
- Add query timeouts in production (e.g., SET statement_timeout = '30s' in PostgreSQL).
- Log slow queries (e.g., PostgreSQL’s log_min_duration_statement).

3. The "We Don’t Have Indexes" Constraint

Interviewer: "The customer’s DBA says they can’t add indexes because it’ll break their replication. How do you optimize the query?" Answer:
1. Rewrite to avoid full scans: Use EXISTS instead of JOIN, or filter early with CTEs.
2. Partition the table: If it’s not already partitioned, ask if you can add a date-based partition.
3. Pre-aggregate data: Materialize a daily summary table and query that instead.
4. Use a read replica: If the main DB is locked down, ask if you can query a replica with indexes.

Why this works: Shows you can work within constraints (a core FDE skill).


Quick Check Questions


1. You’re debugging a query that’s timing out in production. The EXPLAIN output shows a Seq Scan on a 50M-row table. What’s your first step?

Answer: Check if the WHERE clause’s column is indexed. If not, add an index (or rewrite the query to use an existing index).
Explanation: Seq Scan means the database is scanning every row—indexes are the fastest way to fix this.

2. A customer asks you to find "all users who made a purchase in the last 30 days and also made a purchase in the 30 days before that." How do you write this query efficiently?

Answer: Use a window function with COUNT over a date range:


WITH user_purchases AS (
  SELECT
user_id,
purchase_date,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY purchase_date
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
) AS purchases_last_30_days,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY purchase_date
RANGE BETWEEN INTERVAL '60 days' PRECEDING AND INTERVAL '30 days' PRECEDING
) AS purchases_prev_30_days FROM purchases ) SELECT DISTINCT user_id FROM user_purchases WHERE purchases_last_30_days > 0 AND purchases_prev_30_days > 0;

Explanation: Window functions avoid expensive self-joins and are easier to optimize.

3. You’re on-site at a classified facility, and the customer’s database is air-gapped. You need to analyze a 10GB CSV file. What tool do you use?

Answer: DuckDB or SQLite. Both are lightweight, embeddable, and can query CSV files directly without a server.
Explanation: DuckDB is faster for analytical queries, while SQLite is more widely supported.


Last-Minute Cram Sheet

  1. Window functions cheat sheet:
  2. ROW_NUMBER(): Unique rank (1, 2, 3).
  3. RANK(): Ties get same rank (1, 1, 3).
  4. DENSE_RANK(): No gaps (1, 1, 2).
  5. LAG(col, 1): Previous row’s value.
  6. LEAD(col, 1): Next row’s value.
  7. SUM(col) OVER (PARTITION BY group ORDER BY sort): Running total.

  8. Indexing rules:

  9. Index columns in WHERE, JOIN, and ORDER BY.
  10. Composite indexes: Order by selectivity (most selective first).
  11. ⚠️ Avoid indexing columns with low cardinality (e.g., gender).

  12. Query optimization commands:

  13. PostgreSQL: EXPLAIN ANALYZE SELECT ...
  14. MySQL: EXPLAIN FORMAT=JSON SELECT ...
  15. BigQuery: EXPLAIN ANALYZE SELECT ...

  16. Partitioning syntax:
    ```sql
    -- PostgreSQL
    CREATE TABLE logs (
    id SERIAL,
    date TIMESTAMP,
    message TEXT
    ) PARTITION BY RANGE (date);

-- Create a partition for each month
CREATE TABLE logs_2023_10 PARTITION OF logs
FOR VALUES FROM ('2023-10-01') TO ('2023-11-01');
```


  1. Materialized view syntax:
    ```sql
    -- PostgreSQL
    CREATE MATERIALIZED VIEW mv_daily_sales AS
    SELECT date, SUM(amount) FROM sales GROUP BY date;

-- Refresh it
REFRESH MATERIALIZED VIEW mv_daily_sales;
```


  1. Common port numbers:
  2. PostgreSQL: 5432
  3. MySQL: 3306
  4. SQL Server: 1433
  5. ⚠️ Always confirm with netstat -tulnp or ss -tulnp on the customer’s server.

  6. Air-gapped data transfer:

  7. Small files: scp or USB drive.
  8. Large files: rsync with --progress or split into chunks.
  9. Databases: pg_dump/pg_restore or mysqldump.

  10. Field traps:

  11. ⚠️ NULLNULL in joins (use IS NOT DISTINCT FROM in PostgreSQL).
  12. ⚠️ COUNT(column) ignores NULL; COUNT(*) doesn’t.
  13. ⚠️ OFFSET is slow for large datasets—use keyset pagination.
  14. ⚠️ Timezones matter! Always use TIMESTAMP WITH TIME ZONE or AT TIME ZONE.

  15. Quick fixes for slow queries:

  16. Add LIMIT 1000 to test.
  17. Replace SELECT * with explicit columns.
  18. Use EXISTS instead of JOIN for filtering.
  19. Pre-aggregate data with a materialized view.

  20. Acronyms to know:


    • ACO: Authority to Operate (security approval for software).
    • ATO: Approval to Operate (similar to ACO, but often for systems).
    • IAM: Identity and Access Management (e.g., AWS IAM, Okta).
    • PII: Personally Identifiable Information (e.g., SSN, email).
    • PHI: Protected Health Information (HIPAA-regulated data).


ADVERTISEMENT