Fatskills
Practice. Master. Repeat.
Study Guide: Data Science and Machine Learning 101: Programming and Data Engineering Working with Big Data Spark Hadoop Distributed Computing Basics
Source: https://www.fatskills.com/introdution-to-engineering/chapter/data-science-and-machine-learning-data-science-and-machine-learning-programming-and-data-engineering-working-with-big-data-spark-hadoop-distributed-computing-basics

Data Science and Machine Learning 101: Programming and Data Engineering Working with Big Data Spark Hadoop Distributed Computing Basics

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~5 min read

What This Is

Working with big data means processing datasets that are too large, too fast‑changing, or too complex for a single laptop’s memory or CPU. In a data‑science pipeline you move the data into a distributed engine (Spark, Hadoop MapReduce, or Dask), run the same pandas‑style transformations at scale, and then train models that can consume millions of rows or terabytes of logs. Example: a telecom company streams 10 GB / hour of call‑detail records to predict churn in near‑real‑time; Spark lets the data‑scientist build features and fit a gradient‑boosted tree without ever materialising the whole table on one machine.


Key Terms & Formulas

  • Spark RDD (Resilient Distributed Dataset) – immutable, partitioned collection of records; the low‑level API that underpins DataFrames.
  • DataFrame API – Spark’s columnar, SQL‑like interface (spark.read.parquet(...), df.select(...)). Works like pandas but executes lazily across the cluster.
  • Lazy Evaluation – transformations (map, filter, groupBy) build a DAG; the actual computation runs only when an action (show, collect, write) is called.
  • Shuffle – data movement required for operations like groupByKey or join; the main source of latency and memory pressure.
  • Partitioning / Coalescedf.repartition(n) splits data into n partitions; coalesce(m) reduces partitions without full shuffle. Use to balance parallelism vs. overhead.
  • MapReduce – classic Hadoop paradigm: Map emits (key, value) pairs; Reduce aggregates all values for each key. Formula: output_key = Reduce(Map(input_key, input_value)).
  • Spark MLlib PipelinePipeline(stages=[assembler, scaler, estimator]); mirrors scikit‑learn’s Pipeline but runs distributed.
  • Amdahl’s Law – theoretical speed‑up: S = 1 / ((1‑P) + P/N), where P is the parallelisable portion and N the number of workers. Shows diminishing returns after a point.
  • Broadcast Variable – read‑only data (e.g., a lookup table) sent once to each executor; avoids repeated shuffles.
  • Checkpointing – persisting an RDD/DataFrame to stable storage (HDFS, S3) to truncate the lineage DAG and recover from failures.
  • Spark Structured Streaming – micro‑batch engine; each batch is a DataFrame, enabling the same code for batch & streaming.
  • YARN / Mesos / Kubernetes – cluster managers that allocate containers/slots for Spark executors.


Step‑by‑Step / Process Flow

  1. Spin up a Spark session
    python
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
    .appName("ChurnPredict") \
    .config("spark.executor.memory","4g") \
    .getOrCreate()

  2. Load data lazily (Parquet, ORC, CSV, Kafka)
    python
    df = spark.read.parquet("s3://telco/cdrs/2024/*")
    # or streaming source
    # df = spark.readStream.format("kafka")...

  3. Feature engineering at scale

  4. df = df.withColumn("duration_min", df["duration_sec"]/60)
  5. df = df.groupBy("customer_id").agg(F.sum("duration_min").alias("total_min"))
  6. Use VectorAssembler + StandardScaler for ML‑ready vectors.

  7. Train a distributed model (e.g., Gradient‑Boosted Trees)
    python
    from pyspark.ml.classification import GBTClassifier
    gbt = GBTClassifier(labelCol="churn", featuresCol="features", maxIter=50)
    model = gbt.fit(train_df)

  8. Evaluate & tune

  9. pred = model.transform(test_df)
  10. evaluator = BinaryClassificationEvaluator(labelCol="churn", metricName="areaUnderROC")
  11. Use CrossValidator with a ParamGridBuilder to search maxDepth, maxIter.

  12. Persist & serve

  13. model.write().overwrite().save("s3://models/churn_gbt")
  14. Load in a Spark‑UDF or export to ONNX for downstream services.

Common Mistakes

Mistake Correction
Collecting the whole DataFrame to the driver (df.collect()) Keep data distributed; use df.take(n) for a quick peek or write results to storage.
Using groupByKey instead of reduceByKey reduceByKey performs local aggregation before shuffle, cutting network traffic dramatically.
Setting too many partitions (e.g., one per row) Over‑partitioning inflates scheduler overhead; aim for 2–4 × CPU cores per executor.
Neglecting broadcast joins for small tables Broadcast the small lookup (broadcast(small_df)) to avoid a costly shuffle join.
Forgetting to cache intermediate results Cache (df.cache()) when the same DataFrame is reused multiple times (e.g., in iterative ML).


Data Science Interview / Practical Insights

  1. Explain the difference between Spark’s DataFrame and Hadoop MapReduce.
    Key points: DataFrames are declarative, columnar, and lazily optimized (Catalyst); MapReduce is a low‑level, eager, key‑value API requiring explicit shuffles.

  2. When would you choose repartition vs. coalesce?
    Answer: Use repartition to increase parallelism (adds a shuffle) and coalesce to decrease partitions without a shuffle, typically before writing output.

  3. How does Amdahl’s Law guide the number of executors you request?
    Answer: It shows diminishing returns; after ~80 % parallelizable work, adding more nodes yields marginal speed‑up, so you balance cost vs. gain.

  4. What is a “shuffle spill” and how do you mitigate it?
    Answer: When intermediate data exceeds executor memory, Spark writes to disk (spills). Mitigate by increasing spark.memory.fraction, tuning spark.sql.shuffle.partitions, and using broadcast for small tables.


Quick Check Questions

  1. Scenario: Your Spark job spends 70 % of its time in shuffle stages and fails with “OutOfMemoryError”.
    Answer: Broadcast the small side of the join or increase spark.sql.shuffle.partitions and spark.memory.storageFraction.
    Why: Reducing shuffle size and avoiding large in‑memory buffers prevents spills and OOM.

  2. Scenario: You need to train a model on a 500 GB CSV that updates daily.
    Answer: Load the CSV as a Spark DataFrame, cache the parsed schema, and use MLlib’s incremental algorithms (e.g., LinearRegression with elasticNetParam).
    Why: Spark reads the file in parallel; incremental training avoids re‑reading the whole dataset each day.

  3. Scenario: After a groupBy you notice the job runs slower on a 10‑node cluster than on 4 nodes.
    Answer: Likely too many partitions causing overhead; call df.repartition(optimal_num) or coalesce to reduce partitions.


Last‑Minute Cram Sheet (10 one‑liners)

  1. ⚡ Spark lazily builds a DAG; actions trigger execution.
  2. RDD = immutable, partitioned collection; DataFrame = optimized, columnar view on top of RDD.
  3. Shuffle = network‑heavy; avoid with reduceByKey, broadcast, or pre‑aggregation.
  4. Amdahl’s Law: S = 1 / ((1‑P) + P/N) – parallelism caps speed‑up.
  5. repartition(n) adds a shuffle; coalesce(m) reduces partitions without shuffle.
  6. Broadcast variable = send‑once, read‑only data to all executors.
  7. MLlib Pipeline ≈ scikit‑learn Pipeline but runs distributed.
  8. Checkpointing = truncate lineage; essential for long iterative jobs.
  9. Structured Streaming = micro‑batch DataFrames → same code for batch & streaming.
  10. ⚠️ Never call collect() on a TB‑scale DataFrame; it blows driver memory.


ADVERTISEMENT