By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Moving Customer Data into the Platform
As an FDE, you’ll spend 30-50% of your time building, debugging, or rescuing data pipelines—often under tight deadlines, in constrained environments (air-gapped, classified, or legacy systems), and with incomplete requirements. These pipelines are the lifeblood of customer operations: feeding ML models, dashboards, or real-time alerts. Example: During a disaster response mission, you’re handed a 500GB CSV dump from a partner agency, a 24-hour deadline, and a requirement to "make it work" in a classified network with no internet access. Your pipeline must clean, deduplicate, and load the data into a PostgreSQL instance running on a single overloaded server—while ensuring no PII leaks. This guide covers the field-tested approach to designing, deploying, and troubleshooting pipelines in high-stakes environments.
FDE Tip: In air-gapped environments, ETL is often the only option (no cloud-scale storage).
Spark (PySpark/Scala):
Distributed data processing framework. FDEs use it for:
pyspark.ml
spark-submit --master yarn --deploy-mode cluster
Airflow:
retries=3
airflow dags test <dag_id> <execution_date>
⚠️ Airflow in air-gapped environments: Requires offline installation of Python packages (e.g., pip download + pip install --no-index).
pip download
pip install --no-index
Data Lineage:
FDE Tool: OpenLineage (open-source) or custom logging in Spark (df.write.mode("overwrite").saveAsTable("audit_log")).
OpenLineage
df.write.mode("overwrite").saveAsTable("audit_log")
Idempotency:
FDE Pattern: Use MERGE (SQL) or df.write.mode("overwrite") (Spark) instead of INSERT.
MERGE
df.write.mode("overwrite")
INSERT
Schema Evolution:
Handling changes to data structure over time (e.g., new columns, type changes). FDEs must:
spark.read.option("mergeSchema", "true")
Bastion Host:
A jump server used to access internal networks. FDEs use it to:
ssh -J bastion-user@bastion-host internal-user@internal-host
Data Contracts:
A formal agreement between data producers and consumers (e.g., "This API returns JSON with fields id, timestamp, value"). FDEs enforce this with:
id
timestamp
value
pydantic
Great Expectations
df.dropna(subset=["required_field"])
spark.read.csv(..., enforceSchema=True)
Change Data Capture (CDC):
Tracking and propagating changes in source data (e.g., "Only process rows updated since the last run"). FDEs use it for:
WHERE last_updated > '2023-01-01'
Debezium
df.filter(df.timestamp > last_run_time)
Airflow XCom:
Cross-communication between tasks in a DAG (e.g., passing a filename from one task to another). FDEs use it for:
# Pull a value output_path = task_instance.xcom_pull(key="output_path") ```
Spark Tuning (for FDEs):
spark.executor.memory
--executor-memory 8G
spark.default.parallelism
200
2-3x the number of cores
--conf spark.default.parallelism=600
spark.sql.shuffle.partitions
--conf spark.sql.shuffle.partitions=1000
⚠️ Always check spark-ui (port 4040) for bottlenecks (e.g., skewed partitions, slow tasks).
spark-ui
Data Quality Gates:
user_id
df.filter(df.user_id.isNull()).count() > 0
NULL
Actions:- Ask vs Infer: - Ask: "We need a pipeline to load this CSV into our database." - Infer: The CSV is 500GB, updated hourly, has 30% nulls in critical fields, and the database is a single-node PostgreSQL instance with 16GB RAM.- Map the data flow: - Source → Ingest → Transform → Load → Consume. - Field Tool: Draw a whiteboard diagram (even if it’s messy). Example: [Partner FTP] → [Spark (clean)] → [PostgreSQL] → [Dashboard] - Identify constraints: - Network: Air-gapped? Firewall rules? - Compute: Can we use Spark? Or is it a single server? - Security: PII? Classified data? Encryption requirements? - Field Snippet: Check disk space on the target server: bash df -h /data # Is there enough space for the raw + processed data?
[Partner FTP] → [Spark (clean)] → [PostgreSQL] → [Dashboard]
bash df -h /data # Is there enough space for the raw + processed data?
Actions:- Choose ETL vs ELT: - ETL: If the target system is weak (e.g., PostgreSQL on a single server). - ELT: If the target is cloud-scale (e.g., Snowflake).- Pick tools: - Airflow: For scheduling/retries. - Spark: For large-scale transforms (or pandas if the data fits in memory). - PostgreSQL: For small-scale transforms (e.g., CREATE TABLE AS SELECT ...).- Design for idempotency: - Use MERGE (SQL) or df.write.mode("overwrite") (Spark). - Field Snippet: Idempotent Spark write: python df.write.mode("overwrite").parquet("/data/processed") - Plan for failure: - Retries (Airflow: retries=3). - Alerts (e.g., Slack webhook on failure). - Field Tool: Airflow Slack alert: python from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator slack_alert = SlackWebhookOperator( task_id="slack_alert", slack_webhook_conn_id="slack_conn", message="Pipeline failed! Check logs.", )
pandas
CREATE TABLE AS SELECT ...
python df.write.mode("overwrite").parquet("/data/processed")
python from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator slack_alert = SlackWebhookOperator( task_id="slack_alert", slack_webhook_conn_id="slack_conn", message="Pipeline failed! Check logs.", )
Actions:- Start small: - Build a minimal pipeline (e.g., read 100 rows, write to a test table). - Field Snippet: Test Spark locally: bash spark-submit --master local[4] test_pipeline.py - Add data quality gates: - Field Snippet: Check for nulls in Spark: python null_count = df.filter(df.user_id.isNull()).count() if null_count > 0: raise ValueError(f"Found {null_count} nulls in user_id!") - Log everything: - Field Snippet: Log Spark job progress: python spark.sparkContext.setLogLevel("INFO") - Test in the customer environment: - Field Command: Copy a sample of customer data to your dev environment: bash scp -r user@customer-server:/data/sample.csv ./test_data/
bash spark-submit --master local[4] test_pipeline.py
python null_count = df.filter(df.user_id.isNull()).count() if null_count > 0: raise ValueError(f"Found {null_count} nulls in user_id!")
python spark.sparkContext.setLogLevel("INFO")
bash scp -r user@customer-server:/data/sample.csv ./test_data/
Actions:- Airflow DAG deployment: - Field Command: Copy DAG to Airflow server: bash scp my_dag.py user@airflow-server:/airflow/dags/ - Field Check: Verify the DAG is loaded: bash airflow dags list | grep my_dag - Spark deployment: - Field Command: Submit to YARN (on-prem Hadoop): bash spark-submit \ --master yarn \ --deploy-mode cluster \ --executor-memory 8G \ --conf spark.dynamicAllocation.enabled=true \ my_pipeline.py - Monitor: - Field Tool: Check Spark UI (port 4040) or Airflow UI (port 8080). - Field Snippet: Tail Airflow logs: bash airflow tasks logs my_dag my_task 2023-01-01
bash scp my_dag.py user@airflow-server:/airflow/dags/
bash airflow dags list | grep my_dag
bash spark-submit \ --master yarn \ --deploy-mode cluster \ --executor-memory 8G \ --conf spark.dynamicAllocation.enabled=true \ my_pipeline.py
bash airflow tasks logs my_dag my_task 2023-01-01
Actions:- Reproduce the issue: - Field Command: Run the pipeline manually: bash python my_pipeline.py --input /data/raw --output /data/processed - Check logs: - Field Command: Tail Spark logs: bash yarn logs -applicationId <app_id> | grep -i "error" - Validate data: - Field Snippet: Check row counts: python print(f"Input rows: {df_input.count()}, Output rows: {df_output.count()}") - Hotfix: - Field Example: Customer’s data has a new column—update the schema: python df = spark.read.option("mergeSchema", "true").parquet("/data/raw")
bash python my_pipeline.py --input /data/raw --output /data/processed
bash yarn logs -applicationId <app_id> | grep -i "error"
python print(f"Input rows: {df_input.count()}, Output rows: {df_output.count()}")
python df = spark.read.option("mergeSchema", "true").parquet("/data/raw")
Actions:- Document: - Field Template: Include: - Input/output paths. - Dependencies (e.g., "Requires Spark 3.2"). - Failure modes (e.g., "If the pipeline fails, check /var/log/spark").- Train: - Field Tip: Record a 5-minute Loom video walking through the pipeline.- Monitor: - Field Tool: Set up a cron job to check pipeline health: bash 0 * * * * /usr/bin/python3 /scripts/check_pipeline.py
/var/log/spark
bash 0 * * * * /usr/bin/python3 /scripts/check_pipeline.py
df.printSchema()
df.describe()
Answer: Use offline dependency management:
"The customer’s data has a new column, but your pipeline fails. How do you handle schema evolution?"
Why: Schema changes are inevitable—design for them upfront.
"The pipeline runs fine in staging but fails in production. What’s your first step?"
df.filter(df.user_id.isNull()).count() > 0 → raise error
FDE Insight: Always validate assumptions—customers often don’t know their data is dirty.
"You’re on site and the customer demands a last-minute feature that violates the original scope. How do you respond?"
Answer: Push back politely but firmly:
"The pipeline failed at 2 AM, and the customer’s team is panicking. What do you do?"
yarn logs -applicationId <app_id>
spark-submit --conf spark.sql.shuffle.partitions=1000
pip download pyspark
pip install --no-index --find-links=/path/to/wheelhouse pyspark
Explanation: Air-gapped environments require manual dependency management.
Your Airflow DAG fails with "Task instance was not found." What’s the most likely cause?
dags/
Explanation: Airflow loads DAGs from the dags/ folder—if the file isn’t there, the DAG won’t appear.
The customer’s pipeline runs slowly in production. The Spark UI shows 90% of tasks finish in 10 seconds, but 10% take 10 minutes. What’s the issue?
repartition()
salting
spark.sql.shuffle.partitions=1000
⚠️ Always check spark-ui (port 4040) for bottlenecks.
airflow tasks logs <dag_id> <task_id> <execution_date>
⚠️ Airflow in air-gapped environments requires offline Python packages.
Data Quality:
Great Expectations (for automated data validation).
Field Commands:
df -h /data
yarn logs -applicationId <app_id> | grep -i "error" (check Spark logs).
yarn logs -applicationId <app_id> | grep -i "error"
Ports:
4040
8080
PostgreSQL: 5432.
5432
Acronyms:
PII: Personally Identifiable Information (e.g., SSN, email).
Field Traps:
Join 4M+ learners. Unlock unlimited quizzes, wrong-answer tracking, flashcards + reminders, study guides, and 1-on-1 challenges.