By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Working with big data means processing datasets that are too large, too fast-changing, or too complex for a single laptop's memory or CPU. In a data-science pipeline you move the data into a distributed engine (Spark, Hadoop MapReduce, or Dask), run pandas-style transformations at scale, and then train models on millions of rows or terabytes of logs. Example: a telecom company streams 10 GB per hour of call-detail records to predict churn in near-real-time; Spark lets the data scientist build features and fit a gradient-boosted tree without ever materialising the whole table on one machine.
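Key APIs and formulas to recognise:
Reading and selecting: spark.read.parquet(...), df.select(...)
Transformations (lazy): map, filter, groupBy
Actions (trigger execution): show, collect, write
Operations that shuffle data: groupByKey, join
Partition control: df.repartition(n), coalesce(m)
MapReduce model: records are (key, value) pairs; output_key = Reduce(Map(input_key, input_value))
ML pipelines: Pipeline(stages=[assembler, scaler, estimator])
Amdahl's Law: S = 1 / ((1 - P) + P/N), where S is the speed-up, P the parallelizable fraction of the work, and N the number of processors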
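The (key, value) model and the Reduce(Map(...)) expression listed above can be illustrated with a word count. This is a minimal single-machine sketch in plain Python, not Hadoop or Spark code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(doc_id, text):
    """Map: emit one (word, 1) pair per word in the input value."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single output value."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents
counts = word_count({"d1": "spark spark hadoop", "d2": "hadoop dask"})
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network, which is why the shuffle step dominates the cost of many jobs.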
Spin up a Spark session

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ChurnPredict") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
```
Load data lazily (Parquet, ORC, CSV, Kafka)

```python
df = spark.read.parquet("s3://telco/cdrs/2024/*")
# or streaming source
# df = spark.readStream.format("kafka")...
```
Feature engineering at scale
df = df.withColumn("duration_min", df["duration_sec"]/60)
df = df.groupBy("customer_id").agg(F.sum("duration_min").alias("total_min"))  # needs: from pyspark.sql import functions as F
Use VectorAssembler + StandardScaler for ML‑ready vectors.
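To make the assemble-then-scale step concrete, here is a plain-Python sketch of what the two stages compute. It is a stand-in, not the Spark API: `assemble` and `scale_to_unit_std` are hypothetical helpers, the `n_calls` column is invented for the example, and the scaling assumes the behaviour of dividing each feature by its sample standard deviation without centering (which, to my understanding, matches StandardScaler's defaults).

```python
from statistics import stdev


def assemble(rows, input_cols):
    """Combine the chosen columns of each row into one feature list."""
    return [[float(row[c]) for c in input_cols] for row in rows]


def scale_to_unit_std(vectors):
    """Divide each feature by its sample standard deviation (no centering)."""
    columns = list(zip(*vectors))
    stds = [stdev(col) for col in columns]
    # A constant column has std 0; map it to 0.0 instead of dividing by zero.
    return [[x / s if s else 0.0 for x, s in zip(vec, stds)] for vec in vectors]


# Hypothetical per-customer aggregates
rows = [
    {"total_min": 10.0, "n_calls": 1},
    {"total_min": 20.0, "n_calls": 3},
    {"total_min": 30.0, "n_calls": 5},
]
features = scale_to_unit_std(assemble(rows, ["total_min", "n_calls"]))
```

Scaling matters most for models that are sensitive to feature magnitude; tree ensembles such as gradient-boosted trees are largely indifferent to it.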
Train a distributed model (e.g., Gradient‑Boosted Trees)

```python
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(labelCol="churn", featuresCol="features", maxIter=50)
model = gbt.fit(train_df)
```
Evaluate & tune
pred = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="churn", metricName="areaUnderROC")  # needs: from pyspark.ml.evaluation import BinaryClassificationEvaluator
Use CrossValidator with a ParamGridBuilder to search maxDepth, maxIter.
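A grid search over maxDepth and maxIter multiplies quickly, so it helps to count the model fits before launching one. The sketch below enumerates a grid in plain Python in the spirit of ParamGridBuilder; `build_param_grid`, the candidate values, and the fold count are assumptions chosen for illustration.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into one dict per parameter combination."""
    names = list(grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(grid[name] for name in names))
    ]


# Hypothetical candidate values for the two hyperparameters
param_grid = build_param_grid({"maxDepth": [3, 5], "maxIter": [20, 50, 100]})

num_folds = 3  # assumed number of cross-validation folds
total_fits = len(param_grid) * num_folds  # fits during the search itself
```

With 6 combinations and 3 folds the search trains 18 models (plus, typically, one final refit of the best combination), which is why grids on large data are usually kept small.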
Persist & serve
model.write().overwrite().save("s3://models/churn_gbt")
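Calls to know for tuning: df.collect() (brings every row to the driver), df.take(n) (returns only the first n rows), reduceByKey (combines values per key before the shuffle), broadcast(small_df) (ships a small table to every executor for a join), df.cache() (keeps a reused DataFrame in memory).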
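The guide names both groupByKey and reduceByKey; the practical difference is how many records cross the shuffle. The following plain-Python simulation is a sketch under simplified assumptions (partitions are lists, the "shuffle" is a flat list) and the function names are hypothetical.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """reduceByKey-style: combine inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Two hypothetical partitions of (key, value) records
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]
grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)
```

Both approaches give the same totals, but the pre-combined version shuffles 4 records instead of 6; on real data with many repeated keys the gap is far larger.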
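Explain the difference between Spark's DataFrame and Hadoop MapReduce. Key points: DataFrames are declarative, organized into named columns, and lazily evaluated, so the Catalyst optimizer can plan the whole query before anything runs; MapReduce is a low-level key-value API in which you hand-code each map and reduce function, every job shuffles between the two phases, and results are written to disk between jobs.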
When would you choose repartition vs. coalesce? Answer: Use repartition to increase parallelism (adds a shuffle) and coalesce to decrease partitions without a shuffle, typically before writing output.
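The repartition vs. coalesce trade-off can be pictured with a toy model in plain Python. This is only a sketch: real Spark chooses which partitions to merge based on data locality, while the hypothetical `coalesce` below merges by index, and the hypothetical `repartition` deals rows round-robin to mimic a full shuffle.

```python
def coalesce(partitions, m):
    """Merge whole partitions into m groups; no row is redistributed individually."""
    merged = [[] for _ in range(m)]
    for i, part in enumerate(partitions):
        merged[i % m].extend(part)
    return merged


def repartition(partitions, n):
    """Full shuffle: deal every row round-robin across n new partitions."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# A skewed layout: one large partition and three tiny ones
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced = coalesce(skewed, 2)
repartitioned = repartition(skewed, 2)

coalesced_sizes = [len(p) for p in coalesced]
repartitioned_sizes = [len(p) for p in repartitioned]
```

In this toy run coalescing leaves the skew in place (sizes 7 and 2) because it only glues partitions together, while the full shuffle evens them out (sizes 5 and 4) at the cost of moving every row.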
How does Amdahl's Law guide the number of executors you request? Answer: It shows diminishing returns. If 80 % of the work is parallelizable, the speed-up can never exceed 1 / (1 - 0.8) = 5x no matter how many executors you add, so you balance cost vs. gain.
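The formula S = 1 / ((1 - P) + P/N) given earlier in this guide is easy to evaluate directly. The helper below is a small sketch; `amdahl_speedup` and the executor counts tried are illustrative choices, not part of any Spark API.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Theoretical speed-up S = 1 / ((1 - P) + P / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)


# Speed-up for a job that is 80 % parallelizable at several cluster sizes
speedups = {n: amdahl_speedup(0.8, n) for n in (1, 4, 10, 100)}
for n, s in speedups.items():
    print(f"{n:>3} workers: {s:.2f}x")
```

Going from 4 to 10 workers raises the speed-up from 2.5x to about 3.6x, and even 100 workers stay below the 5x ceiling set by the serial 20 %.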
What is a “shuffle spill” and how do you mitigate it? Answer: When intermediate data exceeds executor memory, Spark writes to disk (spills). Mitigate by increasing spark.memory.fraction, tuning spark.sql.shuffle.partitions, and using broadcast for small tables.
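One way to pick a value for spark.sql.shuffle.partitions is to aim for a fixed amount of data per partition. The sketch below assumes a target of roughly 128 MiB per partition, which is a common rule of thumb rather than an official setting; `suggest_shuffle_partitions` and the 64 GiB shuffle size are hypothetical.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024**2,
                               min_partitions=1):
    """Number of partitions so each holds about target_partition_bytes."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# Hypothetical job that shuffles 64 GiB
suggested = suggest_shuffle_partitions(64 * 1024**3)
```

The result could then be applied with `spark.conf.set("spark.sql.shuffle.partitions", suggested)`; smaller partitions are less likely to spill, but too many of them add scheduling overhead.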
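Scenario: Your Spark job spends 70 % of its time in shuffle stages and fails with “OutOfMemoryError”. Answer: Broadcast the small side of the join, or increase spark.sql.shuffle.partitions so each shuffle partition is smaller, and give execution more memory (raise spark.executor.memory or spark.memory.fraction). Note that raising spark.memory.storageFraction reserves more of the unified pool for cached data and leaves less for shuffles. Why: Reducing shuffle size and avoiding large in‑memory buffers prevents spills and OOM.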
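Broadcasting the small side of a join works because each worker can look up matches locally instead of shuffling the large table. The plain-Python sketch below imitates that idea as a hash join; it assumes the small table has unique keys, and `broadcast_hash_join` plus the sample rows are invented for illustration.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each large partition against an in-memory copy of the small table."""
    # The dict plays the role of the broadcast copy every worker would hold.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for part in large_partitions:  # partitions are processed independently
        for row in part:
            match = lookup.get(row[key])
            if match is not None:  # inner join: drop rows without a match
                joined.append({**row, **match})
    return joined


# Hypothetical large table (two partitions) and small dimension table
calls = [
    [{"customer_id": 1, "total_min": 12.5}, {"customer_id": 2, "total_min": 3.0}],
    [{"customer_id": 3, "total_min": 40.0}],
]
plans = [
    {"customer_id": 1, "plan": "prepaid"},
    {"customer_id": 3, "plan": "postpaid"},
]
joined = broadcast_hash_join(calls, plans, "customer_id")
```

No row of the large table leaves its partition, which is exactly the shuffle the scenario above is trying to avoid; the approach only works while the small table fits comfortably in each executor's memory.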
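Scenario: You need to train a model on a 500 GB CSV that updates daily. Answer: Load the CSV as a Spark DataFrame with an explicit schema (so Spark skips a schema-inference pass) and, where the model allows it, train incrementally on each day's new data, for example with MLlib's streaming algorithms such as StreamingLinearRegressionWithSGD. Note that LinearRegression with elasticNetParam is a batch estimator; elasticNetParam only sets the L1/L2 regularization mix. Why: Spark reads the file in parallel; incremental training avoids re‑reading the whole dataset each day.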
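Incremental training means updating an existing model with only the newest batch instead of refitting on everything. The following is a hand-rolled gradient-descent sketch in plain Python, not MLlib code; `sgd_update`, the learning rate, the epoch count, and the two daily batches (generated from y = 2x + 1) are all assumptions for illustration.

```python
def sgd_update(weight, bias, batch, learning_rate=0.1, epochs=500):
    """Continue fitting y ~ weight * x + bias using one new batch only."""
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in batch:
            error = (weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        # Step along the mean squared-error gradient of this batch
        weight -= learning_rate * grad_w / len(batch)
        bias -= learning_rate * grad_b / len(batch)
    return weight, bias


# Hypothetical daily batches of (x, y) pairs drawn from y = 2x + 1
daily_batches = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],  # day 1
    [(3.0, 7.0), (1.5, 4.0), (0.5, 2.0)],  # day 2
]

weight, bias = 0.0, 0.0
for batch in daily_batches:
    # Only the new day's rows are read; the model state carries over.
    weight, bias = sgd_update(weight, bias, batch)
```

The model state (here just two numbers) is all that persists between days, so each update costs time proportional to the new batch rather than to the full 500 GB history.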
Scenario: After a groupBy you notice the job runs slower on a 10‑node cluster than on 4 nodes. Answer: Likely too many partitions causing overhead; call df.repartition(optimal_num) or coalesce to reduce partitions.