Fatskills
Practice. Master. Repeat.
Study Guide: Forward Deployed Engineer 101: Data Pipelines (ETL/ELT, Spark, Airflow – Moving Customer Data into the Platform)
Source: https://www.fatskills.com/forward-deployed-engineer-fde/chapter/forward-deployed-engineer-data-pipelines-etlelt-spark-airflow-moving-customer-data-into-the-platform

Forward Deployed Engineer 101: Data Pipelines (ETL/ELT, Spark, Airflow – Moving Customer Data into the Platform)

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~11 min read

Data Pipelines (ETL/ELT, Spark, Airflow – Moving Customer Data into the Platform)



Field-Ready Study Guide: Data Pipelines (ETL/ELT, Spark, Airflow)

Moving Customer Data into the Platform

What This Is

As an FDE, you’ll spend 30-50% of your time building, debugging, or rescuing data pipelines—often under tight deadlines, in constrained environments (air-gapped, classified, or legacy systems), and with incomplete requirements. These pipelines are the lifeblood of customer operations: feeding ML models, dashboards, or real-time alerts. Example: During a disaster response mission, you’re handed a 500GB CSV dump from a partner agency, a 24-hour deadline, and a requirement to "make it work" in a classified network with no internet access. Your pipeline must clean, deduplicate, and load the data into a PostgreSQL instance running on a single overloaded server—while ensuring no PII leaks. This guide covers the field-tested approach to designing, deploying, and troubleshooting pipelines in high-stakes environments.


Key Terms & Concepts

  • ETL vs ELT:
  • ETL (Extract-Transform-Load): Transform data before loading (e.g., Spark on-prem, then write to a database). Used when compute is cheaper than storage (e.g., legacy systems).
  • ELT (Extract-Load-Transform): Load raw data first, then transform (e.g., Snowflake, BigQuery). Used in cloud environments where storage is cheap and compute is elastic.
  • FDE Tip: In air-gapped environments, ETL is often the only option (no cloud-scale storage).

  • Spark (PySpark/Scala):

  • Distributed data processing framework. FDEs use it for:


    • Cleaning messy customer data (e.g., fixing encoding issues, handling nulls).
    • Running ML models at scale (e.g., pyspark.ml for fraud detection).
    • Field Tool: spark-submit --master yarn --deploy-mode cluster (for on-prem Hadoop clusters).
  • Airflow:

  • Workflow orchestration tool. FDEs use it for:
    • Scheduling pipelines (e.g., "Run this Spark job every 6 hours").
    • Retrying failed tasks (e.g., retries=3 in DAGs).
    • Field Tool: airflow dags test <dag_id> <execution_date> (test a DAG without scheduling).
  • ⚠️ Airflow in air-gapped environments: Requires offline installation of Python packages (e.g., pip download + pip install --no-index).

  • Data Lineage:

  • Tracking where data comes from, how it’s transformed, and where it goes. Critical for:
    • Debugging (e.g., "Why is this dashboard wrong?" → trace back to the source).
    • Compliance (e.g., "Show me all transformations applied to this PII field").
  • FDE Tool: OpenLineage (open-source) or custom logging in Spark (df.write.mode("overwrite").saveAsTable("audit_log")).

  • Idempotency:

  • Running the same pipeline multiple times produces the same result. Non-negotiable for:
    • Disaster recovery (e.g., "Rerun the pipeline after a server crash").
    • Testing (e.g., "Did my fix break anything?").
  • FDE Pattern: Use MERGE (SQL) or df.write.mode("overwrite") (Spark) instead of INSERT.

  • Schema Evolution:

  • Handling changes to data structure over time (e.g., new columns, type changes). FDEs must:


    • Design pipelines to be backward-compatible (e.g., ignore new columns, default missing values).
    • Field Tool: spark.read.option("mergeSchema", "true") (for Parquet files).
  • Bastion Host:

  • A jump server used to access internal networks. FDEs use it to:


    • SSH into customer environments (ssh -J bastion-user@bastion-host internal-user@internal-host).
    • Debug pipelines running in restricted networks.
  • Data Contracts:

  • A formal agreement between data producers and consumers (e.g., "This API returns JSON with fields id, timestamp, value"). FDEs enforce this with:


    • Schema validation (e.g., pydantic in Python, Great Expectations).
    • Field Example: A customer’s "clean" CSV has hidden nulls in critical columns—your pipeline fails silently until you add df.dropna(subset=["required_field"]).
    • ⚠️ Always validate at the source (e.g., spark.read.csv(..., enforceSchema=True)).
  • Change Data Capture (CDC):

  • Tracking and propagating changes in source data (e.g., "Only process rows updated since the last run"). FDEs use it for:


    • Incremental loads (e.g., WHERE last_updated > '2023-01-01').
    • Field Tool: Debezium (for databases) or custom watermarks in Spark (df.filter(df.timestamp > last_run_time)).
  • Airflow XCom:

  • Cross-communication between tasks in a DAG (e.g., passing a filename from one task to another). FDEs use it for:


    • Dynamic workflows (e.g., "Run this Spark job only if the previous task succeeded").
    • Field Snippet: ```python # Push a value task_instance.xcom_push(key="output_path", value="/data/processed")

    # Pull a value output_path = task_instance.xcom_pull(key="output_path") ```

  • Spark Tuning (for FDEs):

  • spark.executor.memory: Set to ~70% of available RAM per executor (e.g., --executor-memory 8G).
  • spark.default.parallelism: Defaults to 200; set to 2-3x the number of cores (e.g., --conf spark.default.parallelism=600).
  • spark.sql.shuffle.partitions: Defaults to 200; increase for large datasets (e.g., --conf spark.sql.shuffle.partitions=1000).
  • ⚠️ Always check spark-ui (port 4040) for bottlenecks (e.g., skewed partitions, slow tasks).

  • Data Quality Gates:

  • Automated checks to prevent bad data from propagating (e.g., "Fail the pipeline if >5% of rows have nulls in user_id"). FDEs implement this with:
    • Great Expectations (for SQL/Spark).
    • Custom Spark checks (e.g., df.filter(df.user_id.isNull()).count() > 0 → raise error).
    • Field Example: A customer’s "production" pipeline silently drops 20% of rows due to a NULL in a key field—your gate catches it before it corrupts the dashboard.


Step-by-Step / Field Process


1. Discovery: "What Are We Actually Building?"

Actions:
- Ask vs Infer:
- Ask: "We need a pipeline to load this CSV into our database." - Infer: The CSV is 500GB, updated hourly, has 30% nulls in critical fields, and the database is a single-node PostgreSQL instance with 16GB RAM.
- Map the data flow:
- Source → Ingest → Transform → Load → Consume.
- Field Tool: Draw a whiteboard diagram (even if it’s messy). Example:
[Partner FTP] → [Spark (clean)] → [PostgreSQL] → [Dashboard] - Identify constraints:
- Network: Air-gapped? Firewall rules? - Compute: Can we use Spark? Or is it a single server? - Security: PII? Classified data? Encryption requirements? - Field Snippet: Check disk space on the target server:
bash
df -h /data # Is there enough space for the raw + processed data?

2. Design: "How Do We Build This Without Breaking Things?"

Actions:
- Choose ETL vs ELT:
- ETL: If the target system is weak (e.g., PostgreSQL on a single server).
- ELT: If the target is cloud-scale (e.g., Snowflake).
- Pick tools:
- Airflow: For scheduling/retries.
- Spark: For large-scale transforms (or pandas if the data fits in memory).
- PostgreSQL: For small-scale transforms (e.g., CREATE TABLE AS SELECT ...).
- Design for idempotency:
- Use MERGE (SQL) or df.write.mode("overwrite") (Spark).
- Field Snippet: Idempotent Spark write:
python
df.write.mode("overwrite").parquet("/data/processed")
- Plan for failure:
- Retries (Airflow: retries=3).
- Alerts (e.g., Slack webhook on failure).
- Field Tool: Airflow Slack alert:
python
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator
slack_alert = SlackWebhookOperator(
task_id="slack_alert",
slack_webhook_conn_id="slack_conn",
message="Pipeline failed! Check logs.",
)

3. Build: "Write the Code (But Assume It’ll Break)"

Actions:
- Start small:
- Build a minimal pipeline (e.g., read 100 rows, write to a test table).
- Field Snippet: Test Spark locally:
bash
spark-submit --master local[4] test_pipeline.py
- Add data quality gates:
- Field Snippet: Check for nulls in Spark:
python
null_count = df.filter(df.user_id.isNull()).count()
if null_count > 0:
raise ValueError(f"Found {null_count} nulls in user_id!")
- Log everything:
- Field Snippet: Log Spark job progress:
python
spark.sparkContext.setLogLevel("INFO")
- Test in the customer environment:
- Field Command: Copy a sample of customer data to your dev environment:
bash
scp -r user@customer-server:/data/sample.csv ./test_data/

4. Deploy: "Get It Running in the Wild"

Actions:
- Airflow DAG deployment:
- Field Command: Copy DAG to Airflow server:
bash
scp my_dag.py user@airflow-server:/airflow/dags/
- Field Check: Verify the DAG is loaded:
bash
airflow dags list | grep my_dag
- Spark deployment:
- Field Command: Submit to YARN (on-prem Hadoop):
bash
spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 8G \
--conf spark.dynamicAllocation.enabled=true \
my_pipeline.py
- Monitor:
- Field Tool: Check Spark UI (port 4040) or Airflow UI (port 8080).
- Field Snippet: Tail Airflow logs:
bash
airflow tasks logs my_dag my_task 2023-01-01

5. Debug: "Why Is This Broken?"

Actions:
- Reproduce the issue:
- Field Command: Run the pipeline manually:
bash
python my_pipeline.py --input /data/raw --output /data/processed
- Check logs:
- Field Command: Tail Spark logs:
bash
yarn logs -applicationId <app_id> | grep -i "error"
- Validate data:
- Field Snippet: Check row counts:
python
print(f"Input rows: {df_input.count()}, Output rows: {df_output.count()}")
- Hotfix:
- Field Example: Customer’s data has a new column—update the schema:
python
df = spark.read.option("mergeSchema", "true").parquet("/data/raw")

6. Handoff: "Make Sure the Customer Can Run This Without You"

Actions:
- Document:
- Field Template: Include:
- Input/output paths.
- Dependencies (e.g., "Requires Spark 3.2").
- Failure modes (e.g., "If the pipeline fails, check /var/log/spark").
- Train:
- Field Tip: Record a 5-minute Loom video walking through the pipeline.
- Monitor:
- Field Tool: Set up a cron job to check pipeline health:
bash
0 * * * * /usr/bin/python3 /scripts/check_pipeline.py


Common Mistakes

Mistake Correction Why
Assuming the customer’s data is clean. Always validate at the source (e.g., df.printSchema(), df.describe()). 90% of pipeline failures are due to "dirty" data (nulls, wrong types, encoding issues).
Not testing in the customer environment. Deploy a minimal pipeline first (e.g., read 10 rows, write to a test table). What works in your lab will break behind their firewall (e.g., missing dependencies, network timeouts).
Ignoring idempotency. Use MERGE (SQL) or df.write.mode("overwrite") (Spark). If the pipeline fails halfway, you need to rerun it without duplicating data.
Over-engineering for "scale." Start with the simplest tool (e.g., pandas for <10GB, Spark for >100GB). The customer’s "big data" is often a 5GB CSV. Don’t waste time on Spark if pandas suffices.
Not logging enough. Log everything (e.g., row counts, timestamps, errors). When the pipeline fails at 3 AM, you’ll need logs to debug.


FDE Interview / War Story Insights


Interview Questions They’ll Ask

  1. "You’re building a pipeline for a classified network with no internet access. How do you handle dependencies?"
  2. Answer: Use offline dependency management:


    • pip download + pip install --no-index (Python).
    • Pre-download Spark/Hadoop binaries and transfer via USB.
    • Why: Air-gapped environments require manual dependency resolution.
  3. "The customer’s data has a new column, but your pipeline fails. How do you handle schema evolution?"

  4. Answer: Use schema merging (e.g., spark.read.option("mergeSchema", "true")) and backward-compatible transforms (e.g., ignore new columns, default missing values).
  5. Why: Schema changes are inevitable—design for them upfront.

  6. "The pipeline runs fine in staging but fails in production. What’s your first step?"

  7. Answer: Reproduce the issue in production (e.g., run a minimal test job). Check:
    • Data differences (e.g., staging has 100 rows, production has 10M).
    • Environment differences (e.g., staging has Spark 3.2, production has 2.4).
    • Why: "Works in staging" is a classic FDE trap—always test in production (with safeguards).

War Stories (How to Frame Your Experience)

  • "The customer’s ‘clean’ data had 30% nulls in a critical field. How did you handle it?"
  • Answer: Added a data quality gate (e.g., df.filter(df.user_id.isNull()).count() > 0 → raise error) and worked with the customer to backfill missing data.
  • FDE Insight: Always validate assumptions—customers often don’t know their data is dirty.

  • "You’re on site and the customer demands a last-minute feature that violates the original scope. How do you respond?"

  • Answer: Push back politely but firmly:


    • "This change will delay the go-live by 2 weeks. Can we prioritize the original scope and add this as a Phase 2?"
    • FDE Insight: Scope creep kills projects—protect the timeline.
  • "The pipeline failed at 2 AM, and the customer’s team is panicking. What do you do?"

  • Answer: Triage first, fix later:
    1. Check logs (yarn logs -applicationId <app_id>).
    2. Reproduce the issue (e.g., run a minimal test job).
    3. Deploy a hotfix (e.g., spark-submit --conf spark.sql.shuffle.partitions=1000).
    4. FDE Insight: Stay calm—customers panic when you panic.


Quick Check Questions

  1. You’re deploying a Spark pipeline to an air-gapped environment. The customer’s server has no internet access. How do you install PySpark?
  2. Answer: Pre-download PySpark and its dependencies (e.g., pip download pyspark), transfer via USB, then install offline (pip install --no-index --find-links=/path/to/wheelhouse pyspark).
  3. Explanation: Air-gapped environments require manual dependency management.

  4. Your Airflow DAG fails with "Task instance was not found." What’s the most likely cause?

  5. Answer: The DAG file wasn’t copied to the Airflow server’s dags/ folder, or the DAG ID doesn’t match the filename.
  6. Explanation: Airflow loads DAGs from the dags/ folder—if the file isn’t there, the DAG won’t appear.

  7. The customer’s pipeline runs slowly in production. The Spark UI shows 90% of tasks finish in 10 seconds, but 10% take 10 minutes. What’s the issue?

  8. Answer: Data skew—a few partitions have way more data than others. Fix with repartition() or salting.
  9. Explanation: Skewed data causes straggler tasks, which bottleneck the entire job.

Last-Minute Cram Sheet

  1. Spark:
  2. spark-submit --master yarn --deploy-mode cluster (for on-prem Hadoop).
  3. spark.sql.shuffle.partitions=1000 (for large datasets).
  4. df.write.mode("overwrite") (idempotent writes).
  5. ⚠️ Always check spark-ui (port 4040) for bottlenecks.

  6. Airflow:

  7. airflow dags test <dag_id> <execution_date> (test without scheduling).
  8. airflow tasks logs <dag_id> <task_id> <execution_date> (view logs).
  9. retries=3 (add to tasks for automatic retries).
  10. ⚠️ Airflow in air-gapped environments requires offline Python packages.

  11. Data Quality:

  12. df.filter(df.user_id.isNull()).count() > 0 → raise error (null check).
  13. spark.read.option("mergeSchema", "true") (handle schema changes).
  14. Great Expectations (for automated data validation).

  15. Field Commands:

  16. df -h /data (check disk space).
  17. ssh -J bastion-user@bastion-host internal-user@internal-host (jump server access).
  18. yarn logs -applicationId <app_id> | grep -i "error" (check Spark logs).

  19. Ports:

  20. Spark UI: 4040.
  21. Airflow UI: 8080.
  22. PostgreSQL: 5432.

  23. Acronyms:

  24. ACO: Authority to Operate (security approval).
  25. ATO: Approval to Operate (same as ACO).
  26. IAM: Identity and Access Management (AWS/GCP permissions).
  27. PII: Personally Identifiable Information (e.g., SSN, email).

  28. Field Traps:

  29. ⚠️ Always test in the exact customer environment—what works in your lab will break behind their firewall.
  30. ⚠️ Assume the data is dirty—validate at the source.
  31. ⚠️ Idempotency is non-negotiable—design for reruns.
  32. ⚠️ Log everything—you’ll need it at 3 AM.


ADVERTISEMENT