By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
As an FDE, you’ll often work with massive, mission-critical datasets—whether it’s processing satellite imagery for a defense intel feed, optimizing a real-time fraud detection pipeline for a bank, or sharding a disaster response database across edge devices with limited connectivity. Unlike lab environments, you’ll face hard constraints: air-gapped networks, strict security policies, unpredictable query patterns, and zero tolerance for downtime. This guide covers how to design, deploy, and debug large-scale data systems in the field, where "good enough" isn’t enough—it has to work now, under fire.
Field Example:You’re deployed to a classified site to optimize a 10TB PostgreSQL database powering a real-time threat detection system. The customer reports 5-minute query latencies during peak ops, but you can’t replicate the issue in your staging environment. The database is monolithic, running on a single VM with no indexing, and the customer refuses to allow schema changes during business hours. Your job: diagnose, shard, and tune the system without downtime, while documenting every change for an upcoming ATO (Authorization to Operate) audit.
user_id % 10
Tools: PostgreSQL (Citus), MongoDB (sharding), Vitess (MySQL), custom Python/Go sharding layers.
Partitioning: Dividing a table within a single database (e.g., by date, region) to improve query performance. Example: Partitioning a logs table by date to speed up time-range queries.
logs
date
Tools: PostgreSQL (declarative partitioning), MySQL (partitioning), BigQuery (partitioned tables).
Vertical vs. Horizontal Scaling:
Horizontal: More machines (e.g., adding read replicas, sharding). Field reality: Harder to implement but scales indefinitely.
Read Replicas: Copies of a database that handle read-only queries to offload the primary. Example: Using PostgreSQL streaming replication to scale analytics queries.
Tools: AWS RDS read replicas, PostgreSQL logical replication, Debezium (CDC).
Change Data Capture (CDC): Streaming database changes (inserts/updates/deletes) to other systems in real time. Example: Using Debezium to sync a PostgreSQL database to Kafka for downstream analytics.
Tools: Debezium, AWS DMS, PostgreSQL logical decoding.
Indexing Strategies:
user_id = 123
Field trap: Over-indexing slows down writes. Always test with real query patterns.
Query Optimization:
sql EXPLAIN ANALYZE SELECT * FROM users WHERE last_login > '2023-01-01';
Common Table Expressions (CTEs): Break complex queries into readable steps (but can be slow if misused).
Connection Pooling: Reusing database connections to avoid overhead. Example: PgBouncer for PostgreSQL, HikariCP for Java apps.
Field reality: Critical in high-latency environments (e.g., satellite links).
Batch vs. Streaming:
Field decision: Streaming is harder to debug but often required for mission-critical systems.
Data Skew: Uneven distribution of data across shards/partitions (e.g., 90% of queries hit one shard). Example: A users table where 80% of traffic is for US users.
users
Fix: Re-shard (e.g., user_id % 100 instead of % 10), or use consistent hashing.
user_id % 100
% 10
Cold vs. Hot Data:
Field trap: Customers often insist "all data is hot." Push back with metrics.
ACID vs. BASE:
Goal: Reproduce the issue and measure baseline performance.Actions:- SSH into the bastion host (or VPN into the customer’s network). bash ssh -J bastion-user@bastion-host customer-user@db-host - Check system metrics (CPU, RAM, disk I/O, network). bash top -c # Check CPU/RAM iostat -x 1 # Check disk I/O netstat -tulnp # Check network connections - Tail the database logs (PostgreSQL example): bash tail -f /var/log/postgresql/postgresql-14-main.log | grep "duration" - Reproduce the slow query (get the exact query from logs or customer): sql -- Run with EXPLAIN ANALYZE to see the execution plan EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345 AND timestamp > '2023-01-01'; - Check for locks (blocking queries): sql SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_statement, blocking_activity.query AS blocking_statement FROM pg_catalog.pg_locks blocked_locks JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED;
bash ssh -J bastion-user@bastion-host customer-user@db-host
bash top -c # Check CPU/RAM iostat -x 1 # Check disk I/O netstat -tulnp # Check network connections
bash tail -f /var/log/postgresql/postgresql-14-main.log | grep "duration"
sql -- Run with EXPLAIN ANALYZE to see the execution plan EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345 AND timestamp > '2023-01-01';
sql SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_statement, blocking_activity.query AS blocking_statement FROM pg_catalog.pg_locks blocked_locks JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED;
Goal: Align on a solution that meets technical, security, and operational constraints.Actions:- Present 2-3 options (e.g., "We can add an index, shard the table, or add read replicas").- Explain trade-offs (e.g., "Sharding will require downtime, but indexing won’t").- Get approval in writing (email or ticket) to avoid scope creep.- Document the change for the ATO (Authorization to Operate) process.
Goal: Deploy changes with minimal risk.Actions:- For indexing: sql -- Create an index (concurrently to avoid locks) CREATE INDEX CONCURRENTLY idx_events_user_timestamp ON events(user_id, timestamp); - For partitioning (PostgreSQL example): ```sql -- Step 1: Create a partitioned table CREATE TABLE events_partitioned ( id SERIAL, user_id INT, timestamp TIMESTAMPTZ, data JSONB ) PARTITION BY RANGE (timestamp);
sql -- Create an index (concurrently to avoid locks) CREATE INDEX CONCURRENTLY idx_events_user_timestamp ON events(user_id, timestamp);
-- Step 2: Create monthly partitions CREATE TABLE events_2023_01 PARTITION OF events_partitioned FOR VALUES FROM ('2023-01-01') TO ('2023-02-01'); CREATE TABLE events_2023_02 PARTITION OF events_partitioned FOR VALUES FROM ('2023-02-01') TO ('2023-03-01'); -- ... etc.
-- Step 3: Migrate data (in batches to avoid locks) INSERT INTO events_partitioned SELECT * FROM events WHERE timestamp >= '2023-01-01'; - For sharding (Python example with SQLAlchemy):python from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker
- For sharding (Python example with SQLAlchemy):
# Shard by user_id % 10 def get_shard(user_id): shard_id = user_id % 10 return create_engine(f"postgresql://user:pass@shard-{shard_id}:5432/db")
# Usage engine = get_shard(12345) Session = sessionmaker(bind=engine) session = Session() ``` - For read replicas: - Set up PostgreSQL streaming replication (or use AWS RDS read replicas). - Update application code to route read queries to replicas.
Goal: Confirm the change worked and didn’t break anything.Actions:- Re-run the slow query and compare execution time. sql EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345 AND timestamp > '2023-01-01'; - Check for regressions (e.g., other queries slowed down).- Monitor for 24-48 hours (some issues only appear under load).- Update dashboards (Grafana, Datadog) to track performance.
sql EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345 AND timestamp > '2023-01-01';
Goal: Ensure the change is reproducible, auditable, and maintainable.Actions:- Write a runbook (e.g., "How to add a new partition").- Update the ATO documentation (e.g., "Added index idx_events_user_timestamp to improve query performance").- Train the customer’s team (e.g., "Here’s how to check for locks").
idx_events_user_timestamp
CREATE INDEX CONCURRENTLY
SELECT pg_size_pretty(pg_total_relation_size('events'));
user_id
timestamp
user_id + timestamp
max_connections = 100
scoped_session
region
Answer: Ask for their approved alternatives (e.g., "Do you allow Docker Swarm, Nomad, or bare-metal deployments?").Why: Never assume you can use your preferred tools—always check the customer’s approved software list (ASL) first.
EXPLAIN ANALYZE
Answer: Add an index on the filtered column(s) (e.g., CREATE INDEX CONCURRENTLY idx_events_user_id ON events(user_id);).Why: Sequential scans are slow on large tables—indexes speed up lookups by orders of magnitude.
CREATE INDEX CONCURRENTLY idx_events_user_id ON events(user_id);
Answer: The current month’s partition is on slow storage (e.g., HDD instead of SSD), or the partition is too large.Why: Partitioning helps with old data, but hot data still needs fast storage and proper indexing.
Partitioning = within one machine (faster queries).
PostgreSQL Index Types:
B-tree
Hash
GIN
GiST: Geospatial, custom types.
GiST
Key Commands:
EXPLAIN ANALYZE [query]
pg_dump
pg_restore
iostat -x 1: Check disk I/O.
iostat -x 1
Ports to Know:
5432
3306
6379
Kafka: 9092
9092
Field Traps:
⚠️ Hot data ≠ all data—push back when customers say "everything is critical."
BASE = Eventual consistency (DynamoDB, Cassandra).
Connection Pooling Tools:
Python: SQLAlchemy connection pooling
CDC Tools:
PostgreSQL logical decoding
Streaming = Kafka, Flink, Kinesis
Quick Fixes for Slow Queries:
Join 4M+ learners. Unlock unlimited quizzes, wrong-answer tracking, flashcards + reminders, study guides, and 1-on-1 challenges.