Fatskills
Practice. Master. Repeat.
Study Guide: Forward Deployed Engineer 101: Working with Large‑Scale Data (Sharding, Partitioning, Performance Tuning)
Source: https://www.fatskills.com/forward-deployed-engineer-fde/chapter/forward-deployed-engineer-working-with-largescale-data-sharding-partitioning-performance-tuning

Forward Deployed Engineer 101: Working with Large‑Scale Data (Sharding, Partitioning, Performance Tuning)

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~12 min read

Working with Large‑Scale Data (Sharding, Partitioning, Performance Tuning)


Forward Deployed Engineer (FDE) Study Guide: Working with Large-Scale Data (Sharding, Partitioning, Performance Tuning)


What This Is

As an FDE, you’ll often work with massive, mission-critical datasets—whether it’s processing satellite imagery for a defense intel feed, optimizing a real-time fraud detection pipeline for a bank, or sharding a disaster response database across edge devices with limited connectivity. Unlike lab environments, you’ll face hard constraints: air-gapped networks, strict security policies, unpredictable query patterns, and zero tolerance for downtime. This guide covers how to design, deploy, and debug large-scale data systems in the field, where "good enough" isn’t enough—it has to work now, under fire.

Field Example:
You’re deployed to a classified site to optimize a 10TB PostgreSQL database powering a real-time threat detection system. The customer reports 5-minute query latencies during peak ops, but you can’t replicate the issue in your staging environment. The database is monolithic, running on a single VM with no indexing, and the customer refuses to allow schema changes during business hours. Your job: diagnose, shard, and tune the system without downtime, while documenting every change for an upcoming ATO (Authorization to Operate) audit.


Key Terms & Concepts

  • Sharding: Splitting a database into horizontal partitions (shards) across multiple machines to distribute load. Example: Sharding a user table by user_id % 10 to spread 1B users across 10 servers.
  • Tools: PostgreSQL (Citus), MongoDB (sharding), Vitess (MySQL), custom Python/Go sharding layers.

  • Partitioning: Dividing a table within a single database (e.g., by date, region) to improve query performance. Example: Partitioning a logs table by date to speed up time-range queries.

  • Tools: PostgreSQL (declarative partitioning), MySQL (partitioning), BigQuery (partitioned tables).

  • Vertical vs. Horizontal Scaling:

  • Vertical: Bigger machine (e.g., upgrading from 16GB → 64GB RAM). Field trap: Often the first "easy" fix, but hits limits fast.
  • Horizontal: More machines (e.g., adding read replicas, sharding). Field reality: Harder to implement but scales indefinitely.

  • Read Replicas: Copies of a database that handle read-only queries to offload the primary. Example: Using PostgreSQL streaming replication to scale analytics queries.

  • Tools: AWS RDS read replicas, PostgreSQL logical replication, Debezium (CDC).

  • Change Data Capture (CDC): Streaming database changes (inserts/updates/deletes) to other systems in real time. Example: Using Debezium to sync a PostgreSQL database to Kafka for downstream analytics.

  • Tools: Debezium, AWS DMS, PostgreSQL logical decoding.

  • Indexing Strategies:

  • B-tree: Default for most databases (good for equality/range queries).
  • Hash: Fast for exact matches (e.g., user_id = 123), but useless for ranges.
  • GIN/GiST: For complex data (JSON, geospatial, full-text search).
  • Field trap: Over-indexing slows down writes. Always test with real query patterns.

  • Query Optimization:

  • EXPLAIN ANALYZE: PostgreSQL command to debug slow queries. Example:
    sql
    EXPLAIN ANALYZE SELECT * FROM users WHERE last_login > '2023-01-01';
  • Materialized Views: Pre-computed query results (trade storage for speed).
  • Common Table Expressions (CTEs): Break complex queries into readable steps (but can be slow if misused).

  • Connection Pooling: Reusing database connections to avoid overhead. Example: PgBouncer for PostgreSQL, HikariCP for Java apps.

  • Field reality: Critical in high-latency environments (e.g., satellite links).

  • Batch vs. Streaming:

  • Batch: Process data in chunks (e.g., nightly ETL jobs). Tools: Airflow, Spark, AWS Glue.
  • Streaming: Process data in real time (e.g., fraud detection). Tools: Kafka, Flink, Kinesis.
  • Field decision: Streaming is harder to debug but often required for mission-critical systems.

  • Data Skew: Uneven distribution of data across shards/partitions (e.g., 90% of queries hit one shard). Example: A users table where 80% of traffic is for US users.

  • Fix: Re-shard (e.g., user_id % 100 instead of % 10), or use consistent hashing.

  • Cold vs. Hot Data:

  • Hot: Frequently accessed (e.g., recent logs). Store in: Fast SSDs, in-memory caches (Redis).
  • Cold: Rarely accessed (e.g., 5-year-old audit logs). Store in: S3, Glacier, or tape.
  • Field trap: Customers often insist "all data is hot." Push back with metrics.

  • ACID vs. BASE:

  • ACID: Strong consistency (PostgreSQL, MySQL). Use when: Financial transactions, mission-critical ops.
  • BASE: Eventual consistency (DynamoDB, Cassandra). Use when: High availability, global scale.
  • Field decision: ACID is safer but slower; BASE is faster but requires careful design.


Step-by-Step / Field Process


1. Diagnose the Problem (Before You Touch Anything)

Goal: Reproduce the issue and measure baseline performance.
Actions:
- SSH into the bastion host (or VPN into the customer’s network).
bash ssh -J bastion-user@bastion-host customer-user@db-host - Check system metrics (CPU, RAM, disk I/O, network).
bash top -c # Check CPU/RAM iostat -x 1 # Check disk I/O netstat -tulnp # Check network connections - Tail the database logs (PostgreSQL example): bash tail -f /var/log/postgresql/postgresql-14-main.log | grep "duration" - Reproduce the slow query (get the exact query from logs or customer): sql -- Run with EXPLAIN ANALYZE to see the execution plan EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345 AND timestamp > '2023-01-01'; - Check for locks (blocking queries): sql SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement FROM pg_catalog.pg_locks blocked_locks JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED;

2. Propose a Fix (With Customer Buy-In)

Goal: Align on a solution that meets technical, security, and operational constraints.
Actions:
- Present 2-3 options (e.g., "We can add an index, shard the table, or add read replicas").
- Explain trade-offs (e.g., "Sharding will require downtime, but indexing won’t").
- Get approval in writing (email or ticket) to avoid scope creep.
- Document the change for the ATO (Authorization to Operate) process.

3. Implement the Fix (Safely)

Goal: Deploy changes with minimal risk.
Actions:
- For indexing:
sql -- Create an index (concurrently to avoid locks) CREATE INDEX CONCURRENTLY idx_events_user_timestamp ON events(user_id, timestamp); - For partitioning (PostgreSQL example):
```sql -- Step 1: Create a partitioned table CREATE TABLE events_partitioned (
id SERIAL,
user_id INT,
timestamp TIMESTAMPTZ,
data JSONB ) PARTITION BY RANGE (timestamp);

-- Step 2: Create monthly partitions CREATE TABLE events_2023_01 PARTITION OF events_partitioned
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01'); CREATE TABLE events_2023_02 PARTITION OF events_partitioned
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01'); -- ... etc.

-- Step 3: Migrate data (in batches to avoid locks) INSERT INTO events_partitioned SELECT * FROM events WHERE timestamp >= '2023-01-01'; - For sharding (Python example with SQLAlchemy):python from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker

# Shard by user_id % 10 def get_shard(user_id):
shard_id = user_id % 10
return create_engine(f"postgresql://user:pass@shard-{shard_id}:5432/db")

# Usage engine = get_shard(12345) Session = sessionmaker(bind=engine) session = Session() ``` - For read replicas:
- Set up PostgreSQL streaming replication (or use AWS RDS read replicas).
- Update application code to route read queries to replicas.

4. Validate the Fix

Goal: Confirm the change worked and didn’t break anything.
Actions:
- Re-run the slow query and compare execution time.
sql EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345 AND timestamp > '2023-01-01'; - Check for regressions (e.g., other queries slowed down).
- Monitor for 24-48 hours (some issues only appear under load).
- Update dashboards (Grafana, Datadog) to track performance.

5. Document for the Customer & ATO

Goal: Ensure the change is reproducible, auditable, and maintainable.
Actions:
- Write a runbook (e.g., "How to add a new partition").
- Update the ATO documentation (e.g., "Added index idx_events_user_timestamp to improve query performance").
- Train the customer’s team (e.g., "Here’s how to check for locks").


Common Mistakes


Mistake 1: Adding Indexes Without Testing

  • What happens: You add an index to speed up a query, but it slows down writes or bloats the database.
  • Correction:
  • Always test with realistic data volumes (not just a small dev dataset).
  • Use CREATE INDEX CONCURRENTLY to avoid locking the table.
  • Monitor disk usage (SELECT pg_size_pretty(pg_total_relation_size('events'));).

Mistake 2: Sharding Without a Key

  • What happens: You shard by user_id, but 80% of queries are for timestamp ranges, causing hotspots.
  • Correction:
  • Analyze query patterns first (e.g., "Do most queries filter by user_id or timestamp?).
  • Use composite shard keys if needed (e.g., user_id + timestamp).
  • Consider consistent hashing if data is skewed.

Mistake 3: Ignoring Connection Overhead

  • What happens: Your app opens a new database connection for every request, causing high latency in high-traffic environments.
  • Correction:
  • Use connection pooling (PgBouncer, HikariCP).
  • Tune pool size (e.g., max_connections = 100 in PostgreSQL).
  • Reuse connections in your app (e.g., SQLAlchemy scoped_session).

Mistake 4: Partitioning Without a Maintenance Plan

  • What happens: You partition by date, but no one remembers to add new partitions, causing queries to fail.
  • Correction:
  • Automate partition creation (e.g., a cron job or Airflow DAG).
  • Document the process (e.g., "Run this script on the 1st of every month").
  • Set up alerts (e.g., "Partition for next month is missing").

Mistake 5: Assuming "It Works in Staging"

  • What happens: Your fix works in staging, but the customer’s firewall blocks port 5432, or their disk is full.
  • Correction:
  • Test in the exact customer environment (or a replica).
  • Check for hidden constraints (e.g., "No outbound internet access," "Must use FIPS-compliant encryption").
  • Have a rollback plan (e.g., "If this fails, revert to the old index").


FDE Interview / War Story Insights


1. "How Would You Shard a Database for a Global App?"

  • What they’re testing: Can you think about real-world constraints (latency, compliance, cost)?
  • How to answer:
  • Start with query patterns: "Do most queries filter by user_id or region?"
  • Consider compliance: "GDPR requires EU data to stay in the EU, so we’d shard by region."
  • Plan for failure: "We’d use a consistent hashing algorithm so adding/removing shards doesn’t require a full reshard."
  • Mention trade-offs: "Sharding by user_id is simple but can cause hotspots if some users are more active."

2. "The Customer Wants to Query 10 Years of Data in Real Time. How Do You Design This?"

  • What they’re testing: Can you push back on unrealistic requirements while offering alternatives?
  • How to answer:
  • Clarify the ask: "Do they need all 10 years, or just the last 30 days with occasional access to older data?"
  • Propose a tiered approach:
    • Hot data (last 30 days): PostgreSQL on SSDs.
    • Warm data (30 days - 2 years): S3 + Athena (cheaper, slower).
    • Cold data (2+ years): Glacier (very slow, very cheap).
  • Use materialized views for common aggregations.
  • Mention caching: "We’d cache frequent queries in Redis."

3. "You’re On-Site and the Database Is Down. The Customer Blames Your Last Change. What Do You Do?"

  • What they’re testing: Can you stay calm under pressure and debug systematically?
  • How to answer:
  • First, verify the issue: "Let me check the logs and metrics to confirm it’s related to my change."
  • Roll back if needed: "If it’s my change, I’ll revert it immediately and investigate offline."
  • Check for external factors: "Is the network down? Is the disk full? Did another team deploy something?"
  • Communicate clearly: "I’ll update you every 15 minutes with progress."
  • Post-mortem: "After we’re stable, I’ll write a root-cause analysis and propose a fix."

4. "How Do You Handle a Customer Who Insists on a Feature That Will Break the System?"

  • What they’re testing: Can you say no diplomatically while keeping the customer happy?
  • How to answer:
  • Acknowledge their concern: "I understand why this feature is important for your team."
  • Explain the risk: "Adding this query will slow down the entire system, which could impact mission-critical ops."
  • Offer alternatives: "Instead of querying all 10 years of data, we could pre-aggregate the results nightly."
  • Escalate if needed: "If this is a hard requirement, let’s discuss with the product team to prioritize it in the next sprint."
  • Document the decision: "I’ll send an email summarizing the risks and alternatives."


Quick Check Questions


1. You’re Deploying a Sharded Database to a Classified Network with No Internet Access. The Customer’s Security Team Rejects Your Plan to Use Kubernetes for Orchestration. What’s Your First Step?

Answer: Ask for their approved alternatives (e.g., "Do you allow Docker Swarm, Nomad, or bare-metal deployments?").
Why: Never assume you can use your preferred tools—always check the customer’s approved software list (ASL) first.

2. A Customer’s PostgreSQL Database Is Slow. EXPLAIN ANALYZE Shows a Sequential Scan on a 100M-Row Table. What’s the Fastest Fix?

Answer: Add an index on the filtered column(s) (e.g., CREATE INDEX CONCURRENTLY idx_events_user_id ON events(user_id);).
Why: Sequential scans are slow on large tables—indexes speed up lookups by orders of magnitude.

3. You’re Partitioning a Table by Date, but Queries for the Current Month Are Still Slow. What’s Likely Wrong?

Answer: The current month’s partition is on slow storage (e.g., HDD instead of SSD), or the partition is too large.
Why: Partitioning helps with old data, but hot data still needs fast storage and proper indexing.


Last-Minute Cram Sheet

  1. Sharding vs. Partitioning:
  2. Sharding = across machines (horizontal scaling).
  3. Partitioning = within one machine (faster queries).

  4. PostgreSQL Index Types:

  5. B-tree: Default (good for equality/range).
  6. Hash: Fast for exact matches (no ranges).
  7. GIN: JSON, full-text search.
  8. GiST: Geospatial, custom types.

  9. Key Commands:

  10. EXPLAIN ANALYZE [query]: Debug slow queries.
  11. CREATE INDEX CONCURRENTLY: Add index without locking.
  12. pg_dump / pg_restore: Backup/restore PostgreSQL.
  13. iostat -x 1: Check disk I/O.

  14. Ports to Know:

  15. PostgreSQL: 5432
  16. MySQL: 3306
  17. Redis: 6379
  18. Kafka: 9092

  19. Field Traps:

  20. ⚠️ Always test in the customer’s environment—what works in staging will break behind their firewall.
  21. ⚠️ Never assume you can use cloud services (e.g., AWS, GCP) in classified networks.
  22. ⚠️ Document every change for the ATO—unapproved changes can delay deployments for weeks.
  23. ⚠️ Hot data ≠ all data—push back when customers say "everything is critical."

  24. ACID vs. BASE:

  25. ACID = Strong consistency (PostgreSQL, MySQL).
  26. BASE = Eventual consistency (DynamoDB, Cassandra).

  27. Connection Pooling Tools:

  28. PostgreSQL: PgBouncer
  29. Java: HikariCP
  30. Python: SQLAlchemy connection pooling

  31. CDC Tools:

  32. Debezium (Kafka + PostgreSQL/MySQL)
  33. AWS DMS (Database Migration Service)
  34. PostgreSQL logical decoding

  35. Batch vs. Streaming:

  36. Batch = Airflow, Spark, AWS Glue
  37. Streaming = Kafka, Flink, Kinesis

  38. Quick Fixes for Slow Queries:


    • Add an index.
    • Partition the table.
    • Use a materialized view.
    • Add read replicas.


ADVERTISEMENT