Fatskills
Practice. Master. Repeat.
Study Guide: Forward Deployed Engineer 101: Data Transformation and Cleaning (Python Pandas, PySpark)
Source: https://www.fatskills.com/forward-deployed-engineer-fde/chapter/forward-deployed-engineer-data-transformation-and-cleaning-python-pandas-pyspark

Forward Deployed Engineer 101: Data Transformation and Cleaning (Python Pandas, PySpark)

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~12 min read

Data Transformation and Cleaning (Python Pandas, PySpark)


Field-Ready Study Guide: Data Transformation & Cleaning (Python Pandas, PySpark)

For Forward Deployed Engineers (FDEs) who need to ship production-grade data pipelines in chaotic, constrained environments.


What This Is

Data transformation and cleaning is the unseen 80% of an FDE’s job—turning messy, real-world data into something your models, dashboards, or APIs can actually use. Unlike lab environments, field deployments have no room for "it works on my machine"—you’ll deal with: - Air-gapped networks where you can’t pip install or pull Docker images.
- Classified or sensitive data that can’t leave the customer’s enclave (no cloud uploads, no debugging in public repos).
- Last-minute schema changes (e.g., a disaster response team suddenly needs to ingest drone footage metadata in a new format).
- Performance constraints (e.g., a 10TB dataset must be processed on a single VM with 16GB RAM).

Field Example:
You’re deployed to a military logistics hub during a hurricane relief mission. The customer’s supply chain system spits out CSV files with inconsistent date formats, missing GPS coordinates, and duplicate entries—but your ML model for predicting delivery delays fails silently if the data isn’t cleaned. You have 4 hours to: 1. Write a PySpark job to deduplicate and normalize the data (no internet → must use pre-approved dependencies).
2. Deploy it to a Kubernetes cluster behind a classified firewall (no Helm charts → manual kubectl apply).
3. Validate the output on-site with the customer’s SMEs (Subject Matter Experts) before the next supply convoy leaves.


Key Terms & Concepts

  • Pandas vs. PySpark:
  • Pandas = Single-machine, in-memory (good for <10GB data, quick EDA, or air-gapped environments where Spark isn’t installed).
  • PySpark = Distributed, fault-tolerant (for >100GB data, on-prem clusters, or when you need to scale horizontally).
  • FDE Rule: Always ask: "Can this run on the customer’s hardware?" (e.g., Pandas on a Raspberry Pi for edge deployments).

  • Schema Enforcement:

  • Pandas: df = pd.read_csv(..., dtype={"column": "int32"}) (fail fast if data doesn’t match).
  • PySpark: .schema(schema) or .option("enforceSchema", "true") (critical for production pipelines—prevents silent type coercion).

  • Dirty Data Patterns:

  • Deduplication: df.drop_duplicates() (Pandas) or .dropDuplicates() (PySpark).
  • Missing Data: .fillna() (impute) vs. .dropna() (remove). FDE Trap: Never assume missing data is random—ask the customer: "Is this missing because the sensor failed, or because the event didn’t happen?"
  • Outliers: Use IQR (Interquartile Range) or Z-score filtering. Field Tip: In defense/intel, outliers might be signals (e.g., a drone’s erratic flight path = potential threat).

  • Performance Hacks:

  • Pandas: Use .astype("category") for low-cardinality strings (e.g., country codes).
  • PySpark: Partition data by a key (e.g., df.repartition("date")) to avoid skew. FDE Rule: Always check .explain() before running a job—shuffles are expensive.

  • Air-Gapped Dependencies:

  • Pandas: Pre-download .whl files and install via pip install --no-index --find-links=./wheels pandas.
  • PySpark: Use a pre-approved Spark distribution (e.g., customer-provided .tar.gz with Hadoop binaries).

  • Validation & Testing:

  • Great Expectations (Python) or Deequ (PySpark) for data quality checks. FDE Use Case: Run validation before and after transformation to catch silent failures (e.g., a column that should never be null suddenly has 10% missing values).
  • Golden Dataset: A small, manually verified subset of data to test transformations. Field Tip: Always ask the customer: "Can you give me 10 rows of ‘good’ data?"

  • Deployment Constraints:

  • No Internet: Use offline package managers (e.g., conda-pack for Python, spark-submit --packages with local JARs).
  • No Root Access: Install Python packages in ~/.local or use --user flag.
  • No Docker: Use singularity (common in HPC/defense) or static binaries (e.g., PyInstaller for Python scripts).

  • Ask vs. Infer (Data Edition):

  • Ask: "The customer says they need timestamps in UTC."
  • Infer: The data shows timestamps in local time with no timezone info—you’ll need to reverse-engineer the timezone from GPS coordinates or metadata.


Step-by-Step / Field Process


1. Discovery: Understand the Data & Constraints

  • Action: SSH into the customer’s bastion host (or VPN into their network) and inspect the raw data.
    bash # Check file size and format ls -lh /data/raw/ head -n 5 /data/raw/supply_logs.csv # First 5 rows file /data/raw/supply_logs.csv # Check encoding (e.g., UTF-8 vs. ISO-8859-1)
  • Ask the Customer:
  • "What’s the source of truth for this data?" (e.g., a legacy Oracle DB, a sensor network, manual spreadsheets).
  • "What’s the SLA for this pipeline?" (e.g., "Must process 1TB in <1 hour" vs. "Can run overnight").
  • "Are there security restrictions?" (e.g., PII redaction, no cloud uploads, no external APIs).
  • Infer:
  • Run a quick Pandas script to profile the data:
    python
    import pandas as pd
    df = pd.read_csv("/data/raw/supply_logs.csv", nrows=1000)
    print(df.dtypes) # Check column types
    print(df.isna().sum()) # Count missing values
    print(df.describe()) # Numeric stats
    print(df["category"].value_counts()) # Categorical distribution

2. Design the Transformation Pipeline

  • Choose the Tool:
  • <10GB data, single machine? → Pandas (faster iteration, easier debugging).
  • >100GB data, distributed cluster? → PySpark (scalable, fault-tolerant).
  • Write the Transformation Logic:
  • Pandas Example (Dedupe + Date Parsing):
    ```python
    import pandas as pd
    from datetime import datetime

    Read with schema enforcement

    df = pd.read_csv(
    "/data/raw/supply_logs.csv",
    dtype={"item_id": "str", "quantity": "int32"},
    parse_dates=["timestamp"],
    date_parser=lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M") # Handle inconsistent formats )

    Deduplicate (keep first occurrence)

    df = df.drop_duplicates(subset=["item_id", "timestamp"], keep="first")

    Fill missing GPS with a default (e.g., depot location)

    df["latitude"] = df["latitude"].fillna(38.8977) # Default to DC coordinates df["longitude"] = df["longitude"].fillna(-77.0365)

    Write output

    df.to_parquet("/data/clean/supply_logs.parquet", index=False) - PySpark Example (Same Logic, Distributed):python from pyspark.sql import SparkSession from pyspark.sql.functions import col, to_timestamp from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

    spark = SparkSession.builder.appName("SupplyClean").getOrCreate()

    Define schema (critical for production!)

    schema = StructType([
    StructField("item_id", StringType()),
    StructField("quantity", IntegerType()),
    StructField("timestamp", StringType()), # Will parse later
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()) ])

    Read with schema enforcement

    df = spark.read.csv(
    "/data/raw/supply_logs.csv",
    schema=schema,
    header=True,
    mode="DROPMALFORMED" # Fail on bad rows )

    Parse timestamp (handle multiple formats)

    df = df.withColumn(
    "timestamp",
    to_timestamp(col("timestamp"), "MM/dd/yyyy HH:mm") )

    Deduplicate

    df = df.dropDuplicates(["item_id", "timestamp"])

    Fill missing GPS

    df = df.fillna({"latitude": 38.8977, "longitude": -77.0365})

    Write as Parquet (columnar, efficient for analytics)

    df.write.parquet("/data/clean/supply_logs.parquet", mode="overwrite") - Validation: - Use Great Expectations to define checks:python import great_expectations as ge context = ge.get_context() validator = context.sources.pandas_default.read_parquet("/data/clean/supply_logs.parquet")

    Define expectations

    validator.expect_column_values_to_not_be_null("item_id") validator.expect_column_values_to_be_between("quantity", min_value=0, max_value=1000) validator.expect_column_values_to_match_regex("item_id", r"^[A-Z]{2}-\d{4}$")

    Run validation

    validation_result = validator.validate() assert validation_result["success"], "Data validation failed!" ```

3. Deploy to the Customer’s Environment

  • Air-Gapped Deployment:
  • Step 1: Package dependencies offline.
    ```bash
    # For Pandas
    pip download pandas numpy pyarrow -d ./wheels

    For PySpark (pre-download JARs)

    wget https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.3.0/spark-core_2.12-3.3.0.jar -P ./jars - Step 2: Transfer files to the customer’s network (e.g., via sneakernet—USB drive, DVD, or secure file transfer).
    - Step 3: Install dependencies locally.
    bash

    Pandas

    pip install --no-index --find-links=./wheels pandas

    PySpark (if not pre-installed)

    export SPARK_HOME=/opt/spark export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH - Run the Job: - Pandas: `python clean_data.py` - PySpark: `spark-submit --master yarn --deploy-mode cluster clean_data.py` - Monitor & Debug: - Check logs:bash

    For Spark on YARN

    yarn logs -applicationId

    For local runs

    tail -f /var/log/spark/spark.log - Reproduce Errors: If the job fails, pull a sample of the raw data and test locally:bash head -n 1000 /data/raw/supply_logs.csv > /tmp/sample.csv scp /tmp/sample.csv your-laptop:~/ # If allowed ```

4. Validate with the Customer

  • Action: Sit with the SME (Subject Matter Expert) and review the output.
  • "Does this item_id format match what you expect?" (e.g., "AB-1234" vs. "1234-AB").
  • "Are these missing GPS coordinates acceptable, or should we flag them for review?"
  • "Does the deduplication logic preserve the correct records?" (e.g., "Keep the first scan" vs. "Keep the most recent").
  • Fallback Plan: If the customer rejects the output:
  • Quick Fix: Write a patch script (e.g., fix_gps.py) to address the issue.
  • Long-Term Fix: Update the pipeline and add a test case to prevent regression.

5. Document & Hand Off

  • Write a Runbook:
  • How to run the pipeline (commands, dependencies, expected runtime).
  • How to debug common failures (e.g., "If the job fails with OutOfMemoryError, reduce spark.executor.memory to 4G").
  • Contact info for the FDE who built it (you!).
  • Example Runbook Snippet:
    markdown # Supply Chain Data Cleaning Pipeline ## How to Runbash spark-submit --master yarn --executor-memory 8G --conf spark.dynamicAllocation.enabled=false clean_data.py ```

## Common Failures | Error | Cause | Fix | |-------|-------|-----| | java.lang.OutOfMemoryError | Not enough executor memory | Reduce --executor-memory to 4G | | AnalysisException: Cannot resolve column | Schema mismatch | Check /data/raw/supply_logs.csv for new columns | ```


Common Mistakes

Mistake Correction Why
Assuming the data is clean Always profile first (df.describe(), df.isna().sum()). Real-world data is never clean. Skipping this step leads to silent failures (e.g., a model trained on null values).
Using inferSchema in PySpark Explicitly define the schema. inferSchema is slow and can silently coerce data types (e.g., treating "123-456" as a number).
Not testing in the customer’s environment Reproduce the exact environment (OS, Python version, dependencies). A script that works on your Mac will break on the customer’s RHEL 7 server with Python 3.6.
Hardcoding paths Use environment variables or config files. /home/yourname/data/ won’t exist on the customer’s machine. Use os.getenv("DATA_DIR", "/default/path").
Ignoring performance constraints Test with realistic data sizes (not just a sample). A Pandas script that runs in 10s on 100 rows might take 10 hours on 10M rows.


FDE Interview / War Story Insights


Interview Questions They’ll Ask

  1. "You’re given a 1TB CSV file with 100 columns, but your PySpark job fails with OutOfMemoryError. What do you do?"
  2. Answer: Check for data skew (e.g., one partition has 90% of the data). Repartition the data (df.repartition(200)), increase executor memory (--executor-memory 8G), or use salting for skewed joins.
  3. Why? Interviewers want to see if you debug systematically (check logs, profile data, adjust configs).

  4. "The customer’s data has inconsistent date formats (e.g., MM/DD/YYYY and DD-MM-YYYY). How do you handle this?"

  5. Answer: Use multiple date parsers (e.g., pd.to_datetime(df["date"], format="mixed") in Pandas or to_timestamp with multiple formats in PySpark). Validate with the customer: "Which format is the source of truth?"
  6. Why? Shows you think about edge cases and communicate with stakeholders.

  7. "You’re deploying to an air-gapped network with no internet. How do you install dependencies?"

  8. Answer: Pre-download .whl files, .jar files, and verify hashes (e.g., sha256sum). Use --no-index with pip or --packages with spark-submit pointing to local files.
  9. Why? Tests your operational awareness (no internet = no pip install).

War Stories (How to Frame Your Experience)

  1. The Silent Schema Change:
  2. Story: "During a go-live for a defense customer, our PySpark job started failing with AnalysisException. Turns out the source database added a new column overnight, but the schema wasn’t updated. We had to patch the pipeline in 30 minutes while the customer watched."
  3. Lesson: Always enforce schemas and monitor for schema drift (e.g., Great Expectations).

  4. The ‘It Works on My Laptop’ Disaster:

  5. Story: "A data scientist built a Pandas script that ran in 5 minutes on their Mac. On the customer’s 32-core server with 128GB RAM, it took 12 hours because they used .apply() instead of vectorized operations."
  6. Lesson: Test with realistic data sizes and profile performance (%timeit in Jupyter, Spark UI for PySpark).

  7. The ‘We Need This Yesterday’ Escalation:

  8. Story: "A disaster response team needed a real-time dashboard for supply chain data. The raw data had duplicate entries (same shipment scanned multiple times). We wrote a quick deduplication script in Pandas, validated it with the team lead, and deployed it in 2 hours—saving the mission."
  9. Lesson: Prioritize speed over perfection in crises, but document the technical debt for later.

Quick Check Questions

  1. You’re given a CSV file with 500M rows and 200 columns. The customer’s server has 32GB RAM. Should you use Pandas or PySpark?
  2. Answer: PySpark. Pandas would OOM (Out of Memory) on a dataset this size. PySpark can distribute the load across the cluster.
  3. Why? Pandas loads the entire dataset into memory; PySpark processes it in chunks.

  4. Your PySpark job fails with java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat. What’s the first thing you check?

  5. Answer: Verify that the Spark version matches the Parquet JARs. This error usually means dependency mismatch (e.g., Spark 3.3 with Parquet JARs for Spark 2.4).
  6. Why? Air-gapped environments often have outdated or mismatched dependencies.

  7. The customer’s data has a column status with values ["active", "inactive", "pending", "PENDING", "ACTIVE"]. How do you standardize this in Pandas?

  8. Answer: df["status"] = df["status"].str.lower().str.strip() (lowercase + trim whitespace).
  9. Why? Case sensitivity and whitespace are common sources of bugs in real-world data.

Last-Minute Cram Sheet

  1. Pandas Cheat Sheet:
  2. df.dtypes → Check column types.
  3. df.isna().sum() → Count missing values.
  4. df.astype("category") → Optimize low-cardinality strings.
  5. pd.read_csv(..., parse_dates=["date"]) → Parse dates on read.
  6. df.to_parquet("output.parquet", index=False) → Faster than CSV for analytics.

  7. PySpark Cheat Sheet:

  8. .schema(schema) → Enforce schema (avoid inferSchema).
  9. .repartition(100) → Fix data skew.
  10. .explain() → Check the execution plan (avoid expensive shuffles).
  11. spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.0 → Add dependencies (if internet is available).
  12. df.write.parquet("output/", mode="overwrite") → Write output (columnar, efficient).

  13. Field Traps:

  14. ⚠️ Never assume the customer’s data is in UTF-8 (use file to check encoding).
  15. ⚠️ Always test with the customer’s exact Python version (e.g., Python 3.6 vs. 3.9).
  16. ⚠️ Parquet is not always faster—for small datasets (<1GB), CSV might be better (no schema overhead).
  17. ⚠️ Spark UI (port 4040) is your best friend—check for skew, task duration, and failures.

  18. Acronyms:

  19. ACO: Authority to Operate (security approval for software).
  20. ATO: Approval to Operate (similar to ACO, often used in DoD).
  21. IAM: Identity and Access Management (e.g., AWS IAM, Active Directory).
  22. SME: Subject Matter Expert (the customer’s domain expert).
  23. SLA: Service Level Agreement (e.g., "Pipeline must run in <1 hour").

  24. Deployment Checklist:

  25. [ ] Dependencies pre-downloaded and verified (hashes match).
  26. [ ] Schema enforced (no inferSchema).
  27. [ ] Data validated (Great Expectations/Deequ).
  28. [ ] Performance tested with realistic data size.
  29. [ ] Runbook written (how to run, debug, and contact support).


ADVERTISEMENT