Fatskills
Practice. Master. Repeat.
Study Guide: Cloud ML - Google Cloud Professional Machine Learning Engineer: Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)
Source: https://www.fatskills.com/hesi/chapter/cloud-ml-cert-gcp-ml-data-preparation-architecture-dataflow-dataproc-pubsub

Cloud ML - Google Cloud Professional Machine Learning Engineer: Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

⏱️ ~7 min read

GCP_ML – Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)

Google Cloud Professional Machine Learning Engineer – Data Preparation Architecture Study Guide

Topic: Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)

What This Is

Data preparation is the backbone of any ML pipeline—garbage in, garbage out. In GCP, Dataflow (serverless stream/batch processing), Dataproc (managed Spark/Hadoop), and Pub/Sub (real-time messaging) handle ingestion, cleaning, transformation, and feature engineering at scale. Real-world scenario: A fintech company processes 10M+ daily transactions for fraud detection, requiring low-latency feature computation (e.g., rolling averages of user spending) and batch training on historical data. The pipeline must handle schema evolution, late-arriving data, and exactly-once processing while keeping costs predictable.

Key Terms & Services

Cloud Dataflow:
GCP’s serverless stream/batch processing service (Apache Beam under the hood). Best for ETL, feature engineering, and real-time analytics with auto-scaling and exactly-once processing. Use when you need unified batch/stream processing without managing clusters.
Cloud Dataproc:
GCP’s managed Spark/Hadoop service. Best for large-scale batch processing (e.g., training data prep) or custom PySpark jobs where you need full control over the runtime (e.g., installing libraries). Not serverless—you pay for cluster uptime.
Pub/Sub:
GCP’s real-time messaging service (like Kafka). Best for decoupling producers/consumers (e.g., IoT devices → fraud detection pipeline) and fan-out patterns (e.g., one event triggers multiple ML models). Guarantees at-least-once delivery (idempotency required).
BigQuery:
GCP’s serverless data warehouse. Best for SQL-based feature engineering, training data extraction, and ad-hoc analysis. Integrates with Vertex AI Feature Store for low-latency feature serving.
Vertex AI Feature Store:
GCP’s managed feature repository for storing, sharing, and serving ML features. Reduces training-serving skew by ensuring the same features are used in both phases. Supports online/offline feature retrieval.
Data Fusion:
GCP’s drag-and-drop ETL tool (like Informatica). Best for no-code/low-code pipelines (e.g., business analysts moving data from Salesforce to BigQuery). Not for ML-specific transformations (use Dataflow/Dataproc instead).
Apache Beam:
Open-source unified batch/stream processing model (used by Dataflow). Key concepts:
PCollection: Distributed dataset (like an RDD in Spark).
PTransform: Operations (e.g., ParDo, GroupByKey).
Windowing: Splits unbounded data into finite chunks (e.g., fixed, sliding, or session windows).
Spark on Dataproc:
RDDs (Resilient Distributed Datasets): Immutable, fault-tolerant collections.
DataFrames: Schema-aware, optimized for SQL-like operations.
Structured Streaming: Micro-batch processing for real-time (but higher latency than Dataflow).
Schema Evolution:
Handling changes in data structure (e.g., new columns) without breaking pipelines. BigQuery and Dataflow support schema auto-detection; Dataproc requires manual handling.
Exactly-Once Processing:
Ensures each event is processed once and only once (critical for financial transactions). Dataflow supports this natively; Pub/Sub + Dataflow requires idempotent sinks (e.g., BigQuery with WRITE_TRUNCATE).
Watermarks:
Mechanism to track late-arriving data in streaming pipelines. Dataflow uses watermarks to decide when to close a window (e.g., "wait 5 minutes for late transactions").
Cost Model:
Dataflow: Pay per vCPU-hour, memory, and shuffle data (expensive for large joins).
Dataproc: Pay for cluster uptime (cheaper for long-running batch jobs).
Pub/Sub: Pay per message and GB transferred.

Step-by-Step / Process Flow

Scenario: Real-Time Fraud Detection Pipeline

Goal: Ingest transactions → compute rolling features (e.g., "user spent $X in last 5 mins") → train a model → serve predictions in <100ms.

Ingest Data with Pub/Sub
Create a Pub/Sub topic (transactions) and subscription (fraud-detection-sub).
Configure IoT devices/ATMs to publish transactions to the topic.
Why Pub/Sub? Decouples producers (devices) from consumers (Dataflow) and handles spikes in traffic.
Stream Processing with Dataflow
Write an Apache Beam pipeline (Python/Java) to:
- Read from Pub/Sub (beam.io.ReadFromPubSub).
- Apply windowing (e.g., 5-minute sliding windows for rolling averages).
- Compute features (e.g., user_id, transaction_count_5min, avg_amount_5min).
- Write to BigQuery (for training) and Vertex AI Feature Store (for serving).
Deploy with:
bash gcloud dataflow jobs run fraud-features \ --gcs-location=gs://dataflow-templates/latest/Python \ --parameters=input_subscription=projects/PROJECT/subscriptions/fraud-detection-sub,output_table=PROJECT:dataset.transactions
Key settings:
- --enableStreamingEngine (reduces cost by offloading shuffle to GCP).
- --numWorkers=10 (auto-scales up to 100 if needed).
Batch Training Data Prep with Dataproc
For historical training data, use Dataproc to:
- Read raw data from Cloud Storage (e.g., gs://fraud-data/raw/).
- Run a PySpark job to:
- Clean data (handle missing values, outliers).
- Join with reference tables (e.g., user_metadata).
- Write to BigQuery or TFRecords for training.
Submit job:
bash gcloud dataproc jobs submit pyspark \ --cluster=fraud-cluster \ --region=us-central1 \ --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar \ -- gs://fraud-data/scripts/preprocess.py
Why Dataproc? Cheaper for long-running batch jobs (vs. Dataflow’s per-vCPU pricing).
Feature Serving with Vertex AI Feature Store
Create a feature group in Vertex AI:
```python
from google.cloud import aiplatform
aiplatform.init(project=PROJECT, location=REGION) feature_group = aiplatform.FeatureGroup.create(
name="fraud_features",
entity_type="user_id",
online_store_fixed_node_count=1 # For low-latency serving ) - Ingest features from Dataflow (streaming) or Dataproc (batch). - Serve features in real-time via:python features = feature_group.read(
entity_id="user123",
feature_selector=["transaction_count_5min", "avg_amount_5min"] ) ```
Monitor and Optimize
Dataflow: Monitor system lag (streaming) and autoscaling in Cloud Console.
Dataproc: Use Cloud Monitoring to track Spark job metrics (e.g., shuffle read size).
Pub/Sub: Set alerts for backlog size (indicates consumer lag).

Common Mistakes

Mistake	Correction
Using Dataproc for real-time processing	Dataproc (Spark Streaming) has higher latency than Dataflow. Use Dataflow for <1s latency requirements.
Not handling late data in Dataflow	Forgetting to set allowed lateness or watermarks leads to dropped events. Use `Window.into(FixedWindows.of(5, TimeUnit.MINUTES)).withAllowedLateness(Duration.standardMinutes(1))`.
Over-provisioning Dataflow workers	Dataflow auto-scales, but shuffle-heavy jobs (e.g., large joins) can spike costs. Use Streaming Engine to reduce shuffle overhead.
Storing raw data in BigQuery for streaming	BigQuery is not a time-series database. For high-frequency streaming (e.g., IoT), use Pub/Sub → Dataflow → Bigtable or Firestore.
Ignoring schema evolution	If a new column is added, Dataflow auto-detects it, but Dataproc jobs may fail. Use BigQuery’s schema auto-detection or Avro/Parquet for schema flexibility.

Certification Exam Insights

Service Selection Traps:
Dataflow vs. Dataproc:
- Use Dataflow for serverless, unified batch/stream (e.g., real-time feature engineering).
- Use Dataproc for long-running batch jobs (e.g., training data prep) or custom Spark jobs.
- Exam trap: "Which service for a 1-hour batch job?" → Dataproc (cheaper than Dataflow for short batch jobs).
Streaming vs. Batch Costs:
Streaming Dataflow is more expensive than batch (per-vCPU pricing). For cost-sensitive pipelines, use Pub/Sub → Cloud Functions → BigQuery (if latency allows).
Feature Store Integration:
Vertex AI Feature Store is not a data warehouse. Use it for low-latency feature serving, not for storing raw data.
Exam trap: "Where to store raw transaction data?" → BigQuery/Cloud Storage, not Feature Store.
Exactly-Once Processing:
Dataflow + Pub/Sub requires idempotent sinks (e.g., BigQuery with WRITE_TRUNCATE). Dataproc + Spark Streaming does not guarantee exactly-once.

Quick Check Questions

A retail company wants to compute "customer lifetime value" (CLV) daily using 5 years of historical data. The pipeline must run in <2 hours and scale to 100M+ records. Which GCP service is the best fit?
Answer: Dataproc (cheaper for long-running batch jobs; Dataflow would be overkill for daily batch).
A gaming company needs to detect cheating in real-time (latency <100ms) by analyzing player actions. The pipeline must handle 10K events/sec and compute rolling features (e.g., "actions per minute"). Which GCP architecture should they use?
Answer: Pub/Sub → Dataflow (streaming) → Vertex AI Feature Store (Dataflow for low-latency processing; Feature Store for serving).
A data scientist notices that their model’s online predictions are worse than offline metrics. The training data comes from BigQuery, while real-time features are computed in a separate pipeline. What’s the most likely cause, and how can they fix it?
Answer: Training-serving skew (features computed differently in training vs. serving). Fix by using Vertex AI Feature Store to ensure the same features are used in both phases.

Last-Minute Cram Sheet

Dataflow = serverless batch/stream processing (Apache Beam). Use for ETL, feature engineering, real-time analytics.
Dataproc = managed Spark/Hadoop. Use for batch jobs, custom PySpark, or long-running transformations.
Pub/Sub = real-time messaging (like Kafka). Use for decoupling producers/consumers (e.g., IoT → ML pipeline).
Vertex AI Feature Store = low-latency feature serving. Reduces training-serving skew.
BigQuery = serverless data warehouse. Use for SQL-based feature engineering and training data extraction.
Dataflow cost drivers: vCPU-hour, memory, shuffle data. Use Streaming Engine to reduce shuffle costs.
Dataproc cost drivers: cluster uptime. Use preemptible VMs for cost savings (but jobs may fail).
⚠️ Dataflow streaming is not free – batch is cheaper. For cost-sensitive pipelines, use Pub/Sub → Cloud Functions → BigQuery.
⚠️ Dataproc is not for real-time – Spark Streaming has higher latency than Dataflow.
⚠️ Schema evolution – Dataflow auto-detects new columns; Dataproc jobs may fail if schema changes. Use Avro/Parquet for flexibility.

⚡ Recently practiced quizzes in this class

Machine Learning Test Machine Learning: Recommendation Systems Questions Machine Learning 101 Practice Test: Linear Regression Machine Learning Basics Knowledge Test Machine Learning 101 Practice Test: Fundamental Theorem of PAC Learning Machine Learning 101 Practice Test: Kernels And Kernel Trick Machine Learning 101 Practice Test: K-Nearest Neighbor Algorithm and Nearest Neighbor Analysis Machine Learning 101 Practice Test: Neural Networks in Machine Learning Machine Learning 101 Practice Test: Decision Trees Machine Learning 101 Practice Test: Version Spaces, Find-S Algorithm And Candidate Elimination Algorithm

➡️ Next Study Guide

Cloud ML - Google Cloud Professional Machine Learning Engineer: Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)

GCP_ML – Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)

Google Cloud Professional Machine Learning Engineer – Data Preparation Architecture Study Guide

What This Is

Key Terms & Services

Step-by-Step / Process Flow

Scenario: Real-Time Fraud Detection Pipeline

Common Mistakes

Certification Exam Insights

Quick Check Questions

Last-Minute Cram Sheet

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

Cloud ML - Google Cloud Professional Machine Learning Engineer: Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)

GCP_ML – Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)

Google Cloud Professional Machine Learning Engineer – Data Preparation Architecture Study Guide

What This Is

Key Terms & Services

Step-by-Step / Process Flow

Scenario: Real-Time Fraud Detection Pipeline

Common Mistakes

Certification Exam Insights

Quick Check Questions

Last-Minute Cram Sheet

❤ If you liked Fatskills, consider supporting us by checking out The Life Manuals You Never Got.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | What Should We Know? Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.

About | Explore | User Guide | Topics | Subjects | Doubt Solver | Career Aptitude Test | Answers | Free Tools | What Should We Know?
Privacy | Terms |

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com