Fatskills
Practice. Master. Repeat.
Study Guide: Forward Deployed Engineer 101: Data Transformation and Cleaning (Python Pandas, PySpark)
Source: https://www.fatskills.com/forward-deployed-engineer-fde/chapter/forward-deployed-engineer-data-transformation-and-cleaning-python-pandas-pyspark

Forward Deployed Engineer 101: Data Transformation and Cleaning (Python Pandas, PySpark)

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~12 min read

Data Transformation and Cleaning (Python Pandas, PySpark)

Field-Ready Study Guide: Data Transformation & Cleaning (Python Pandas, PySpark)

For Forward Deployed Engineers (FDEs) who need to ship production-grade data pipelines in chaotic, constrained environments.

What This Is

Data transformation and cleaning is the unseen 80% of an FDE’s job—turning messy, real-world data into something your models, dashboards, or APIs can actually use. Unlike lab environments, field deployments have no room for "it works on my machine"—you’ll deal with: - Air-gapped networks where you can’t pip install or pull Docker images.
- Classified or sensitive data that can’t leave the customer’s enclave (no cloud uploads, no debugging in public repos).
- Last-minute schema changes (e.g., a disaster response team suddenly needs to ingest drone footage metadata in a new format).
- Performance constraints (e.g., a 10TB dataset must be processed on a single VM with 16GB RAM).

Field Example:
You’re deployed to a military logistics hub during a hurricane relief mission. The customer’s supply chain system spits out CSV files with inconsistent date formats, missing GPS coordinates, and duplicate entries—but your ML model for predicting delivery delays fails silently if the data isn’t cleaned. You have 4 hours to: 1. Write a PySpark job to deduplicate and normalize the data (no internet → must use pre-approved dependencies).
2. Deploy it to a Kubernetes cluster behind a classified firewall (no Helm charts → manual kubectl apply).
3. Validate the output on-site with the customer’s SMEs (Subject Matter Experts) before the next supply convoy leaves.

Key Terms & Concepts

Pandas vs. PySpark:
Pandas = Single-machine, in-memory (good for <10GB data, quick EDA, or air-gapped environments where Spark isn’t installed).
PySpark = Distributed, fault-tolerant (for >100GB data, on-prem clusters, or when you need to scale horizontally).
FDE Rule: Always ask: "Can this run on the customer’s hardware?" (e.g., Pandas on a Raspberry Pi for edge deployments).
Schema Enforcement:
Pandas: df = pd.read_csv(..., dtype={"column": "int32"}) (fail fast if data doesn’t match).
PySpark: .schema(schema) or .option("enforceSchema", "true") (critical for production pipelines—prevents silent type coercion).
Dirty Data Patterns:
Deduplication: df.drop_duplicates() (Pandas) or .dropDuplicates() (PySpark).
Missing Data: .fillna() (impute) vs. .dropna() (remove). FDE Trap: Never assume missing data is random—ask the customer: "Is this missing because the sensor failed, or because the event didn’t happen?"
Outliers: Use IQR (Interquartile Range) or Z-score filtering. Field Tip: In defense/intel, outliers might be signals (e.g., a drone’s erratic flight path = potential threat).
Performance Hacks:
Pandas: Use .astype("category") for low-cardinality strings (e.g., country codes).
PySpark: Partition data by a key (e.g., df.repartition("date")) to avoid skew. FDE Rule: Always check .explain() before running a job—shuffles are expensive.
Air-Gapped Dependencies:
Pandas: Pre-download .whl files and install via pip install --no-index --find-links=./wheels pandas.
PySpark: Use a pre-approved Spark distribution (e.g., customer-provided .tar.gz with Hadoop binaries).
Validation & Testing:
Great Expectations (Python) or Deequ (PySpark) for data quality checks. FDE Use Case: Run validation before and after transformation to catch silent failures (e.g., a column that should never be null suddenly has 10% missing values).
Golden Dataset: A small, manually verified subset of data to test transformations. Field Tip: Always ask the customer: "Can you give me 10 rows of ‘good’ data?"
Deployment Constraints:
No Internet: Use offline package managers (e.g., conda-pack for Python, spark-submit --packages with local JARs).
No Root Access: Install Python packages in ~/.local or use --user flag.
No Docker: Use singularity (common in HPC/defense) or static binaries (e.g., PyInstaller for Python scripts).
Ask vs. Infer (Data Edition):
Ask: "The customer says they need timestamps in UTC."
Infer: The data shows timestamps in local time with no timezone info—you’ll need to reverse-engineer the timezone from GPS coordinates or metadata.

Step-by-Step / Field Process

1. Discovery: Understand the Data & Constraints

Action: SSH into the customer’s bastion host (or VPN into their network) and inspect the raw data.
bash # Check file size and format ls -lh /data/raw/ head -n 5 /data/raw/supply_logs.csv # First 5 rows file /data/raw/supply_logs.csv # Check encoding (e.g., UTF-8 vs. ISO-8859-1)
Ask the Customer:
"What’s the source of truth for this data?" (e.g., a legacy Oracle DB, a sensor network, manual spreadsheets).
"What’s the SLA for this pipeline?" (e.g., "Must process 1TB in <1 hour" vs. "Can run overnight").
"Are there security restrictions?" (e.g., PII redaction, no cloud uploads, no external APIs).
Infer:
Run a quick Pandas script to profile the data:
python import pandas as pd df = pd.read_csv("/data/raw/supply_logs.csv", nrows=1000) print(df.dtypes) # Check column types print(df.isna().sum()) # Count missing values print(df.describe()) # Numeric stats print(df["category"].value_counts()) # Categorical distribution

2. Design the Transformation Pipeline

Choose the Tool:
<10GB data, single machine? → Pandas (faster iteration, easier debugging).
>100GB data, distributed cluster? → PySpark (scalable, fault-tolerant).
Write the Transformation Logic:
Pandas Example (Dedupe + Date Parsing):
```python
import pandas as pd
from datetime import datetime
Read with schema enforcement
df = pd.read_csv(
"/data/raw/supply_logs.csv",
dtype={"item_id": "str", "quantity": "int32"},
parse_dates=["timestamp"],
date_parser=lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M") # Handle inconsistent formats )
Deduplicate (keep first occurrence)
df = df.drop_duplicates(subset=["item_id", "timestamp"], keep="first")
Fill missing GPS with a default (e.g., depot location)
df["latitude"] = df["latitude"].fillna(38.8977) # Default to DC coordinates df["longitude"] = df["longitude"].fillna(-77.0365)
Write output
df.to_parquet("/data/clean/supply_logs.parquet", index=False) - PySpark Example (Same Logic, Distributed):python from pyspark.sql import SparkSession from pyspark.sql.functions import col, to_timestamp from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
spark = SparkSession.builder.appName("SupplyClean").getOrCreate()
Define schema (critical for production!)
schema = StructType([
StructField("item_id", StringType()),
StructField("quantity", IntegerType()),
StructField("timestamp", StringType()), # Will parse later
StructField("latitude", DoubleType()),
StructField("longitude", DoubleType()) ])
Read with schema enforcement
df = spark.read.csv(
"/data/raw/supply_logs.csv",
schema=schema,
header=True,
mode="DROPMALFORMED" # Fail on bad rows )
Parse timestamp (handle multiple formats)
df = df.withColumn(
"timestamp",
to_timestamp(col("timestamp"), "MM/dd/yyyy HH:mm") )
Deduplicate
df = df.dropDuplicates(["item_id", "timestamp"])
Fill missing GPS
df = df.fillna({"latitude": 38.8977, "longitude": -77.0365})
Write as Parquet (columnar, efficient for analytics)
df.write.parquet("/data/clean/supply_logs.parquet", mode="overwrite") - Validation: - Use Great Expectations to define checks:python import great_expectations as ge context = ge.get_context() validator = context.sources.pandas_default.read_parquet("/data/clean/supply_logs.parquet")
Define expectations
validator.expect_column_values_to_not_be_null("item_id") validator.expect_column_values_to_be_between("quantity", min_value=0, max_value=1000) validator.expect_column_values_to_match_regex("item_id", r"^[A-Z]{2}-\d{4}$")
Run validation
validation_result = validator.validate() assert validation_result["success"], "Data validation failed!" ```

3. Deploy to the Customer’s Environment

Air-Gapped Deployment:
Step 1: Package dependencies offline.
```bash
# For Pandas
pip download pandas numpy pyarrow -d ./wheels
For PySpark (pre-download JARs)
wget https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.3.0/spark-core_2.12-3.3.0.jar -P ./jars - Step 2: Transfer files to the customer’s network (e.g., via sneakernet—USB drive, DVD, or secure file transfer). - Step 3: Install dependencies locally.bash
Pandas
pip install --no-index --find-links=./wheels pandas
PySpark (if not pre-installed)
export SPARK_HOME=/opt/spark export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH - Run the Job: - Pandas: `python clean_data.py` - PySpark: `spark-submit --master yarn --deploy-mode cluster clean_data.py` - Monitor & Debug: - Check logs:bash
For Spark on YARN
yarn logs -applicationId
For local runs
tail -f /var/log/spark/spark.log - Reproduce Errors: If the job fails, pull a sample of the raw data and test locally:bash head -n 1000 /data/raw/supply_logs.csv > /tmp/sample.csv scp /tmp/sample.csv your-laptop:~/ # If allowed ```

4. Validate with the Customer

Action: Sit with the SME (Subject Matter Expert) and review the output.
"Does this item_id format match what you expect?" (e.g., "AB-1234" vs. "1234-AB").
"Are these missing GPS coordinates acceptable, or should we flag them for review?"
"Does the deduplication logic preserve the correct records?" (e.g., "Keep the first scan" vs. "Keep the most recent").
Fallback Plan: If the customer rejects the output:
Quick Fix: Write a patch script (e.g., fix_gps.py) to address the issue.
Long-Term Fix: Update the pipeline and add a test case to prevent regression.

5. Document & Hand Off

Write a Runbook:
How to run the pipeline (commands, dependencies, expected runtime).
How to debug common failures (e.g., "If the job fails with OutOfMemoryError, reduce spark.executor.memory to 4G").
Contact info for the FDE who built it (you!).
Example Runbook Snippet:
markdown # Supply Chain Data Cleaning Pipeline ## How to Runbash spark-submit --master yarn --executor-memory 8G --conf spark.dynamicAllocation.enabled=false clean_data.py ```

## Common Failures | Error | Cause | Fix | |-------|-------|-----| | java.lang.OutOfMemoryError | Not enough executor memory | Reduce --executor-memory to 4G | | AnalysisException: Cannot resolve column | Schema mismatch | Check /data/raw/supply_logs.csv for new columns | ```

Common Mistakes

Mistake	Correction	Why
Assuming the data is clean	Always profile first (`df.describe()`, `df.isna().sum()`).	Real-world data is never clean. Skipping this step leads to silent failures (e.g., a model trained on null values).
Using `inferSchema` in PySpark	Explicitly define the schema.	`inferSchema` is slow and can silently coerce data types (e.g., treating "123-456" as a number).
Not testing in the customer’s environment	Reproduce the exact environment (OS, Python version, dependencies).	A script that works on your Mac will break on the customer’s RHEL 7 server with Python 3.6.
Hardcoding paths	Use environment variables or config files.	`/home/yourname/data/` won’t exist on the customer’s machine. Use `os.getenv("DATA_DIR", "/default/path")`.
Ignoring performance constraints	Test with realistic data sizes (not just a sample).	A Pandas script that runs in 10s on 100 rows might take 10 hours on 10M rows.

FDE Interview / War Story Insights

Interview Questions They’ll Ask

"You’re given a 1TB CSV file with 100 columns, but your PySpark job fails with OutOfMemoryError. What do you do?"
Answer: Check for data skew (e.g., one partition has 90% of the data). Repartition the data (df.repartition(200)), increase executor memory (--executor-memory 8G), or use salting for skewed joins.
Why? Interviewers want to see if you debug systematically (check logs, profile data, adjust configs).
"The customer’s data has inconsistent date formats (e.g., MM/DD/YYYY and DD-MM-YYYY). How do you handle this?"
Answer: Use multiple date parsers (e.g., pd.to_datetime(df["date"], format="mixed") in Pandas or to_timestamp with multiple formats in PySpark). Validate with the customer: "Which format is the source of truth?"
Why? Shows you think about edge cases and communicate with stakeholders.
"You’re deploying to an air-gapped network with no internet. How do you install dependencies?"
Answer: Pre-download .whl files, .jar files, and verify hashes (e.g., sha256sum). Use --no-index with pip or --packages with spark-submit pointing to local files.
Why? Tests your operational awareness (no internet = no pip install).

War Stories (How to Frame Your Experience)

The Silent Schema Change:
Story: "During a go-live for a defense customer, our PySpark job started failing with AnalysisException. Turns out the source database added a new column overnight, but the schema wasn’t updated. We had to patch the pipeline in 30 minutes while the customer watched."
Lesson: Always enforce schemas and monitor for schema drift (e.g., Great Expectations).
The ‘It Works on My Laptop’ Disaster:
Story: "A data scientist built a Pandas script that ran in 5 minutes on their Mac. On the customer’s 32-core server with 128GB RAM, it took 12 hours because they used .apply() instead of vectorized operations."
Lesson: Test with realistic data sizes and profile performance (%timeit in Jupyter, Spark UI for PySpark).
The ‘We Need This Yesterday’ Escalation:
Story: "A disaster response team needed a real-time dashboard for supply chain data. The raw data had duplicate entries (same shipment scanned multiple times). We wrote a quick deduplication script in Pandas, validated it with the team lead, and deployed it in 2 hours—saving the mission."
Lesson: Prioritize speed over perfection in crises, but document the technical debt for later.

Quick Check Questions

You’re given a CSV file with 500M rows and 200 columns. The customer’s server has 32GB RAM. Should you use Pandas or PySpark?
Answer: PySpark. Pandas would OOM (Out of Memory) on a dataset this size. PySpark can distribute the load across the cluster.
Why? Pandas loads the entire dataset into memory; PySpark processes it in chunks.
Your PySpark job fails with java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat. What’s the first thing you check?
Answer: Verify that the Spark version matches the Parquet JARs. This error usually means dependency mismatch (e.g., Spark 3.3 with Parquet JARs for Spark 2.4).
Why? Air-gapped environments often have outdated or mismatched dependencies.
The customer’s data has a column status with values ["active", "inactive", "pending", "PENDING", "ACTIVE"]. How do you standardize this in Pandas?
Answer: df["status"] = df["status"].str.lower().str.strip() (lowercase + trim whitespace).
Why? Case sensitivity and whitespace are common sources of bugs in real-world data.

Last-Minute Cram Sheet

Pandas Cheat Sheet:
df.dtypes → Check column types.
df.isna().sum() → Count missing values.
df.astype("category") → Optimize low-cardinality strings.
pd.read_csv(..., parse_dates=["date"]) → Parse dates on read.
df.to_parquet("output.parquet", index=False) → Faster than CSV for analytics.
PySpark Cheat Sheet:
.schema(schema) → Enforce schema (avoid inferSchema).
.repartition(100) → Fix data skew.
.explain() → Check the execution plan (avoid expensive shuffles).
spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.0 → Add dependencies (if internet is available).
df.write.parquet("output/", mode="overwrite") → Write output (columnar, efficient).
Field Traps:
⚠️ Never assume the customer’s data is in UTF-8 (use file to check encoding).
⚠️ Always test with the customer’s exact Python version (e.g., Python 3.6 vs. 3.9).
⚠️ Parquet is not always faster—for small datasets (<1GB), CSV might be better (no schema overhead).
⚠️ Spark UI (port 4040) is your best friend—check for skew, task duration, and failures.
Acronyms:
ACO: Authority to Operate (security approval for software).
ATO: Approval to Operate (similar to ACO, often used in DoD).
IAM: Identity and Access Management (e.g., AWS IAM, Active Directory).
SME: Subject Matter Expert (the customer’s domain expert).
SLA: Service Level Agreement (e.g., "Pipeline must run in <1 hour").
Deployment Checklist:
[ ] Dependencies pre-downloaded and verified (hashes match).
[ ] Schema enforced (no inferSchema).
[ ] Data validated (Great Expectations/Deequ).
[ ] Performance tested with realistic data size.
[ ] Runbook written (how to run, debug, and contact support).

➡️ Next Study Guide

Forward Deployed Engineer 101: Data Transformation and Cleaning (Python Pandas, PySpark)

Data Transformation and Cleaning (Python Pandas, PySpark)

Field-Ready Study Guide: Data Transformation & Cleaning (Python Pandas, PySpark)

What This Is

Key Terms & Concepts

Step-by-Step / Field Process

1. Discovery: Understand the Data & Constraints

2. Design the Transformation Pipeline

Read with schema enforcement

Deduplicate (keep first occurrence)

Fill missing GPS with a default (e.g., depot location)

Write output

Define schema (critical for production!)

Read with schema enforcement

Parse timestamp (handle multiple formats)

Deduplicate

Fill missing GPS

Write as Parquet (columnar, efficient for analytics)

Define expectations

Run validation

3. Deploy to the Customer’s Environment

For PySpark (pre-download JARs)

Pandas

PySpark (if not pre-installed)

For Spark on YARN

For local runs

4. Validate with the Customer

5. Document & Hand Off

Common Mistakes

FDE Interview / War Story Insights

Interview Questions They’ll Ask

War Stories (How to Frame Your Experience)

Quick Check Questions

Last-Minute Cram Sheet

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

Forward Deployed Engineer 101: Data Transformation and Cleaning (Python Pandas, PySpark)

Data Transformation and Cleaning (Python Pandas, PySpark)

Field-Ready Study Guide: Data Transformation & Cleaning (Python Pandas, PySpark)

What This Is

Key Terms & Concepts

Step-by-Step / Field Process

1. Discovery: Understand the Data & Constraints

2. Design the Transformation Pipeline

Read with schema enforcement

Deduplicate (keep first occurrence)

Fill missing GPS with a default (e.g., depot location)

Write output

Define schema (critical for production!)

Read with schema enforcement

Parse timestamp (handle multiple formats)

Deduplicate

Fill missing GPS

Write as Parquet (columnar, efficient for analytics)

Define expectations

Run validation

3. Deploy to the Customer’s Environment

For PySpark (pre-download JARs)

Pandas

PySpark (if not pre-installed)

For Spark on YARN

For local runs

4. Validate with the Customer

5. Document & Hand Off

Common Mistakes

FDE Interview / War Story Insights

Interview Questions They’ll Ask

War Stories (How to Frame Your Experience)

Quick Check Questions

Last-Minute Cram Sheet

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | What Should We Know? Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com