Study Guide: Cloud ML - Google Cloud Professional Machine Learning Engineer: Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)
Source: https://www.fatskills.com/hesi/chapter/cloud-ml-cert-gcp-ml-data-preparation-architecture-dataflow-dataproc-pubsub

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What This Is

Data preparation is the backbone of any ML pipeline: garbage in, garbage out. In GCP, Dataflow (serverless stream/batch processing), Dataproc (managed Spark/Hadoop), and Pub/Sub (real-time messaging) handle ingestion, cleaning, transformation, and feature engineering at scale.

Real-world scenario: A fintech company processes 10M+ daily transactions for fraud detection. It needs low-latency feature computation (e.g., rolling averages of user spending) for serving, plus batch training on historical data. The pipeline must handle schema evolution, late-arriving data, and exactly-once processing while keeping costs predictable.


Key Terms & Services

  • Cloud Dataflow:
    GCP’s serverless stream/batch processing service (Apache Beam under the hood). Best for ETL, feature engineering, and real-time analytics with auto-scaling and exactly-once processing. Use when you need unified batch/stream processing without managing clusters.

  • Cloud Dataproc:
    GCP’s managed Spark/Hadoop service. Best for large-scale batch processing (e.g., training data prep) or custom PySpark jobs where you need full control over the runtime (e.g., installing libraries). Standard Dataproc clusters are not serverless: you pay for cluster uptime, whether or not a job is running. (Dataproc Serverless for Spark is a separate option.)

  • Pub/Sub:
    GCP’s real-time messaging service (like Kafka). Best for decoupling producers/consumers (e.g., IoT devices → fraud detection pipeline) and fan-out patterns (e.g., one event triggers multiple ML models). Delivery is at-least-once by default, so the same message can arrive more than once and consumers should be idempotent.

  • BigQuery:
    GCP’s serverless data warehouse. Best for SQL-based feature engineering, training data extraction, and ad-hoc analysis. Integrates with Vertex AI Feature Store for low-latency feature serving.

  • Vertex AI Feature Store:
    GCP’s managed feature repository for storing, sharing, and serving ML features. Reduces training-serving skew by ensuring the same features are used in both phases. Supports online/offline feature retrieval.

  • Data Fusion:
    GCP’s drag-and-drop ETL tool (like Informatica). Best for no-code/low-code pipelines (e.g., business analysts moving data from Salesforce to BigQuery). Not for ML-specific transformations (use Dataflow/Dataproc instead).

  • Apache Beam:
    Open-source unified batch/stream processing model (used by Dataflow). Key concepts:
    • PCollection: Distributed dataset (like an RDD in Spark).
    • PTransform: Operations (e.g., ParDo, GroupByKey).
    • Windowing: Splits unbounded data into finite chunks (e.g., fixed, sliding, or session windows).

  • Spark on Dataproc:
    • RDDs (Resilient Distributed Datasets): Immutable, fault-tolerant collections.
    • DataFrames: Schema-aware, optimized for SQL-like operations.
    • Structured Streaming: Micro-batch processing for real-time (but higher latency than Dataflow).

  • Schema Evolution:
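    Handling changes in data structure (e.g., new columns) without breaking pipelines. BigQuery load jobs can auto-detect schemas (CSV/JSON) and accept added columns when schema updates are allowed. Dataflow and Dataproc pipelines do not adapt on their own: handle new or missing fields explicitly, and prefer self-describing formats such as Avro or Parquet.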

  • Exactly-Once Processing:
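    Ensures each event affects the result once and only once (critical for financial transactions). Dataflow provides exactly-once processing inside the pipeline, but duplicates published upstream and retried sink writes can still produce duplicates, so end-to-end correctness also needs idempotent or deduplicating writes (e.g., upserts keyed by transaction ID; in batch, BigQuery WRITE_TRUNCATE makes a rerun overwrite the table rather than append duplicates).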

  • Watermarks:
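    The pipeline's estimate of how far event time has progressed. Dataflow uses the watermark to decide when a window is complete; data that arrives after the watermark passes the end of its window is late, and allowed lateness sets how long late data is still accepted (e.g., "wait 5 minutes for late transactions").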

  • Cost Model:
    • Dataflow: Pay per vCPU-hour, memory, and shuffle data (expensive for large joins).
    • Dataproc: Pay for cluster uptime (cheaper for long-running batch jobs).
    • Pub/Sub: Pay by data volume published and delivered, plus storage for retained messages.
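
To make the windowing and watermark terms above concrete, here is a minimal plain-Python sketch. It is not the Beam API: the `Event` class, the constants, and the way a watermark is paired with each event are assumptions made for illustration. It assigns events to 5-minute fixed windows by event time and drops any event whose window has already expired, meaning the watermark has moved past the window end plus the allowed lateness.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 300     # 5-minute fixed windows (illustrative value)
ALLOWED_LATENESS = 60    # keep accepting data for 1 minute after a window ends


@dataclass
class Event:
    user_id: str
    amount: float
    event_time: int      # seconds, when the transaction actually happened


def window_start(event_time: int) -> int:
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def accept(event: Event, watermark: int) -> bool:
    """True if the event's window has not expired at the given watermark."""
    window_end = window_start(event.event_time) + WINDOW_SECONDS
    return watermark <= window_end + ALLOWED_LATENESS


def aggregate(events_with_watermarks):
    """Sum amounts per (user, window), setting aside events that are too late."""
    totals, dropped = {}, []
    for event, watermark in events_with_watermarks:
        if not accept(event, watermark):
            dropped.append(event)
            continue
        key = (event.user_id, window_start(event.event_time))
        totals[key] = totals.get(key, 0.0) + event.amount
    return totals, dropped


stream = [
    (Event("u1", 20.0, 10), 15),    # on time
    (Event("u1", 30.0, 290), 320),  # late, but inside allowed lateness
    (Event("u1", 99.0, 100), 400),  # window [0, 300) expired at watermark 360
    (Event("u1", 5.0, 310), 400),   # belongs to the next window [300, 600)
]
totals, dropped = aggregate(stream)
print(totals)
print(dropped)
```

A real streaming runner tracks the watermark itself; passing it in alongside each event here only keeps the sketch deterministic.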


Step-by-Step / Process Flow


Scenario: Real-Time Fraud Detection Pipeline

Goal: Ingest transactions → compute rolling features (e.g., "user spent $X in last 5 mins") → train a model → serve predictions in <100ms.


  1. Ingest Data with Pub/Sub
    • Create a Pub/Sub topic (transactions) and subscription (fraud-detection-sub).
    • Configure IoT devices/ATMs to publish transactions to the topic.
    • Why Pub/Sub? It decouples producers (devices) from consumers (Dataflow) and absorbs traffic spikes.
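
Because Pub/Sub delivery is at-least-once by default, a consumer of the transactions topic may see the same message twice. The following is a minimal plain-Python sketch of an idempotent consumer, not Pub/Sub client code: the `transaction_id` key and the in-memory set are assumptions, and a real pipeline would keep seen IDs in a durable store.

```python
class IdempotentConsumer:
    """Applies each transaction once, even if the message is redelivered."""

    def __init__(self):
        self.seen_ids = set()   # assumption: in-memory; use a durable store in practice
        self.processed = []

    def handle(self, message: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        txn_id = message["transaction_id"]   # hypothetical business key
        if txn_id in self.seen_ids:
            return False
        self.seen_ids.add(txn_id)
        self.processed.append(message)
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"transaction_id": "t1", "amount": 25.0},
    {"transaction_id": "t2", "amount": 40.0},
    {"transaction_id": "t1", "amount": 25.0},   # redelivery of t1
]
results = [consumer.handle(m) for m in deliveries]
print(results, len(consumer.processed))
```

Keying on a business identifier rather than a delivery-level ID also covers the case where the producer itself publishes the same transaction twice.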

  2. Stream Processing with Dataflow
    • Write an Apache Beam pipeline (Python/Java) to:
      • Read from Pub/Sub (beam.io.ReadFromPubSub).
      • Apply windowing (e.g., 5-minute sliding windows for rolling averages).
      • Compute features (e.g., user_id, transaction_count_5min, avg_amount_5min).
      • Write to BigQuery (for training) and Vertex AI Feature Store (for serving).
    • Deploy with the command below. --gcs-location must point to a template staged from your own pipeline; the bucket path shown is a placeholder.
      gcloud dataflow jobs run fraud-features \
        --region=us-central1 \
        --gcs-location=gs://YOUR_BUCKET/templates/fraud-features \
        --parameters=input_subscription=projects/PROJECT/subscriptions/fraud-detection-sub,output_table=PROJECT:dataset.transactions
    • Key settings:
      • --enableStreamingEngine (Java; --enable_streaming_engine in Python) moves streaming shuffle and state off the worker VMs into the Dataflow service, which usually lowers worker cost.
      • --numWorkers=10 sets the initial worker count. Autoscaling is capped by --maxNumWorkers (e.g., 100), not by --numWorkers.
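
The rolling features named in this step (transaction_count_5min, avg_amount_5min) can be sketched in plain Python to show the logic the pipeline has to implement. This is an illustration under assumptions, not Beam code: it computes a trailing 5-minute window per transaction and expects input sorted by event time, whereas a Beam pipeline would typically use sliding windows and tolerate out-of-order data.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300   # trailing 5 minutes


def rolling_features(transactions):
    """Yield per-transaction features over each user's trailing 5 minutes.

    transactions: iterable of (user_id, event_time_seconds, amount),
    assumed to be sorted by event_time.
    """
    recent = defaultdict(deque)   # user_id -> deque of (event_time, amount)
    for user_id, event_time, amount in transactions:
        window = recent[user_id]
        window.append((event_time, amount))
        # Evict transactions that have fallen out of the trailing window.
        while window[0][0] <= event_time - WINDOW_SECONDS:
            window.popleft()
        total = sum(a for _, a in window)
        yield {
            "user_id": user_id,
            "event_time": event_time,
            "transaction_count_5min": len(window),
            "avg_amount_5min": total / len(window),
        }


txns = [
    ("u1", 0, 10.0),
    ("u1", 120, 30.0),
    ("u2", 130, 5.0),
    ("u1", 400, 20.0),   # the transaction at t=0 is now outside the window
]
features = list(rolling_features(txns))
for row in features:
    print(row)
```

The per-user deque keeps memory proportional to the number of transactions inside the window, which is the same state a streaming runner has to hold per key.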
  3. Batch Training Data Prep with Dataproc
    • For historical training data, use Dataproc to:
      • Read raw data from Cloud Storage (e.g., gs://fraud-data/raw/).
      • Run a PySpark job to:
        • Clean data (handle missing values, outliers).
        • Join with reference tables (e.g., user_metadata).
        • Write to BigQuery or TFRecords for training.
    • Submit the job. The script path is a positional argument; anything after a bare -- is passed to the script as its own arguments.
      gcloud dataproc jobs submit pyspark gs://fraud-data/scripts/preprocess.py \
        --cluster=fraud-cluster \
        --region=us-central1 \
        --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
    • Why Dataproc? Cheaper for long-running batch jobs (vs. Dataflow’s per-vCPU pricing).
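
The cleaning and join logic a preprocess.py job would carry out can be sketched in plain Python. This is an illustration only, not PySpark: the field names, the median fill, the clipping threshold, and the inner-join behaviour are all assumptions chosen to keep the example small.

```python
from statistics import median


def clean_and_join(transactions, user_metadata, max_amount=10_000.0):
    """Fill missing amounts, clip outliers, and join user attributes."""
    known = [t["amount"] for t in transactions if t["amount"] is not None]
    fill_value = median(known) if known else 0.0
    rows = []
    for t in transactions:
        # Missing values: replace with the median of the known amounts.
        amount = t["amount"] if t["amount"] is not None else fill_value
        # Outliers: clip to a fixed ceiling (hypothetical threshold).
        amount = min(amount, max_amount)
        meta = user_metadata.get(t["user_id"])
        if meta is None:
            continue   # inner join: skip transactions for unknown users
        rows.append({"user_id": t["user_id"], "amount": amount, **meta})
    return rows


raw = [
    {"user_id": "u1", "amount": 50.0},
    {"user_id": "u1", "amount": None},        # missing value
    {"user_id": "u2", "amount": 250_000.0},   # outlier
    {"user_id": "u3", "amount": 70.0},        # no metadata for u3
]
metadata = {"u1": {"country": "US"}, "u2": {"country": "DE"}}
for row in clean_and_join(raw, metadata):
    print(row)
```

In Spark the same steps would map onto DataFrame operations such as filling nulls, capping a column, and joining on user_id.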

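  4. Feature Serving with Vertex AI Feature Store
    • Create a featurestore, an entity type, and its features. This sketch assumes the legacy Feature Store classes in the google-cloud-aiplatform SDK; PROJECT and REGION are placeholders.
    ```python
    from google.cloud import aiplatform

    aiplatform.init(project=PROJECT, location=REGION)

    featurestore = aiplatform.Featurestore.create(
        featurestore_id="fraud_features",
        online_store_fixed_node_count=1,  # For low-latency serving
    )
    users = featurestore.create_entity_type(entity_type_id="user_id")
    users.create_feature(feature_id="transaction_count_5min", value_type="INT64")
    users.create_feature(feature_id="avg_amount_5min", value_type="DOUBLE")
    ```
    • Ingest features from Dataflow (streaming) or Dataproc (batch).
    • Serve features in real-time via:
    ```python
    # Online read for one entity; returns the latest feature values.
    features = users.read(
        entity_ids=["user123"],
        feature_ids=["transaction_count_5min", "avg_amount_5min"],
    )
    ```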

  5. Monitor and Optimize
    • Dataflow: Monitor system lag (streaming) and autoscaling in Cloud Console.
    • Dataproc: Use Cloud Monitoring to track Spark job metrics (e.g., shuffle read size).
    • Pub/Sub: Set alerts for backlog size (indicates consumer lag).


Common Mistakes

| Mistake | Correction |
| --- | --- |
| Using Dataproc for real-time processing | Spark Structured Streaming on Dataproc runs in micro-batches and has higher latency than Dataflow. Use Dataflow for <1s latency requirements. |
| Not handling late data in Dataflow | Beam's default allowed lateness is zero, so data that arrives after the watermark passes the end of its window is dropped. Set it explicitly, e.g. in Java: Window.<T>into(FixedWindows.of(Duration.standardMinutes(5))).withAllowedLateness(Duration.standardMinutes(1)). |
| Over-provisioning Dataflow workers | Dataflow auto-scales, but shuffle-heavy jobs (e.g., large joins) can spike costs. Cap autoscaling with --maxNumWorkers and use Streaming Engine to reduce shuffle overhead. |
| Storing raw data in BigQuery for streaming | BigQuery is not a time-series database. For high-frequency streaming (e.g., IoT), use Pub/Sub → Dataflow → Bigtable or Firestore. |
| Ignoring schema evolution | A new column can break both Dataflow and Dataproc jobs if the code assumes a fixed schema. Handle unknown and missing fields explicitly, allow field additions in BigQuery, and use Avro/Parquet for schema flexibility. |
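
One way to keep a pipeline from breaking when a column is added or missing is to normalize every record against the fields the pipeline knows about. The sketch below is plain Python with a hypothetical target schema (`EXPECTED_FIELDS` and its defaults are assumptions, not part of any GCP API): known fields get defaults, and unknown fields are set aside instead of causing a failure.

```python
EXPECTED_FIELDS = {          # hypothetical target schema with default values
    "transaction_id": None,
    "user_id": None,
    "amount": 0.0,
}


def normalize(record: dict):
    """Split a record into known fields (with defaults) and unknown extras."""
    row = {name: record.get(name, default)
           for name, default in EXPECTED_FIELDS.items()}
    extras = {k: v for k, v in record.items() if k not in EXPECTED_FIELDS}
    return row, extras


old_record = {"transaction_id": "t1", "user_id": "u1"}            # amount missing
new_record = {"transaction_id": "t2", "user_id": "u2",
              "amount": 12.5, "merchant_category": "grocery"}     # new column

print(normalize(old_record))
print(normalize(new_record))
```

The extras can be logged or routed to a side output so that a schema change is noticed and the target table is updated deliberately rather than silently.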


Certification Exam Insights

  1. Service Selection Traps (Dataflow vs. Dataproc):
    • Use Dataflow for serverless, unified batch/stream (e.g., real-time feature engineering).
    • Use Dataproc for long-running batch jobs (e.g., training data prep) or custom Spark jobs.
    • Exam trap: "Which service for a 1-hour batch job?" → Dataproc if the job is existing Spark/Hadoop code, ideally on an ephemeral cluster that is deleted when the job finishes, so you pay only for that hour. For a new pipeline with no Spark dependency, Dataflow avoids cluster management altogether.

  2. Streaming vs. Batch Costs:
    • Streaming Dataflow is more expensive than batch (per-vCPU pricing, and workers run continuously). For cost-sensitive pipelines, use Pub/Sub → Cloud Functions → BigQuery (if latency allows).

  3. Feature Store Integration:
    • Vertex AI Feature Store is not a data warehouse. Use it for low-latency feature serving, not for storing raw data.
    • Exam trap: "Where to store raw transaction data?" → BigQuery/Cloud Storage, not Feature Store.

  4. Exactly-Once Processing:
    • Dataflow + Pub/Sub gives exactly-once processing inside the pipeline, but end-to-end results still depend on idempotent or deduplicating sinks (e.g., keyed upserts; in batch, BigQuery WRITE_TRUNCATE so a rerun overwrites instead of appending). Spark Structured Streaming on Dataproc can reach the same end-to-end guarantee only with replayable sources, checkpointing, and idempotent sinks.
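
The cost trade-offs above come down to simple arithmetic over each service's billing drivers. The sketch below shows the shape of that comparison. Every rate in it is made up for illustration and is not actual GCP pricing; the function names and parameters are assumptions as well.

```python
def dataflow_job_cost(vcpu_hours, memory_gb_hours, shuffle_gb,
                      vcpu_rate, memory_rate, shuffle_rate):
    """Dataflow bills for the resources a job actually consumes."""
    return (vcpu_hours * vcpu_rate
            + memory_gb_hours * memory_rate
            + shuffle_gb * shuffle_rate)


def dataproc_cluster_cost(cluster_hours, total_vcpus,
                          vm_rate_per_vcpu_hour, service_fee_per_vcpu_hour):
    """Dataproc bills for cluster uptime: VM cost plus a per-vCPU service fee."""
    return cluster_hours * total_vcpus * (vm_rate_per_vcpu_hour
                                          + service_fee_per_vcpu_hour)


# Made-up rates, for illustration only. Look up current pricing before deciding.
RATES = dict(vcpu_rate=0.06, memory_rate=0.004, shuffle_rate=0.01)

# A shuffle-heavy 1-hour job on 32 vCPUs with 128 GB of memory.
dataflow = dataflow_job_cost(vcpu_hours=32, memory_gb_hours=128,
                             shuffle_gb=500, **RATES)
# The same hour on an ephemeral 32-vCPU cluster that is deleted afterwards...
ephemeral = dataproc_cluster_cost(1, 32, 0.04, 0.01)
# ...versus a cluster left running all day for that one job.
always_on = dataproc_cluster_cost(24, 32, 0.04, 0.01)

print(round(dataflow, 3), round(ephemeral, 2), round(always_on, 2))
```

With these placeholder numbers the shuffle term dominates the Dataflow bill, and the Dataproc result depends almost entirely on whether the cluster is deleted when the job finishes.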

Quick Check Questions

  1. A retail company wants to compute "customer lifetime value" (CLV) daily using 5 years of historical data. The pipeline must run in <2 hours and scale to 100M+ records. Which GCP service is the best fit?
     Answer: Dataproc (cheaper for long-running batch jobs; Dataflow would be overkill for daily batch).

  2. A gaming company needs to detect cheating in real-time (latency <100ms) by analyzing player actions. The pipeline must handle 10K events/sec and compute rolling features (e.g., "actions per minute"). Which GCP architecture should they use?
     Answer: Pub/Sub → Dataflow (streaming) → Vertex AI Feature Store (Dataflow for low-latency processing; Feature Store for serving).

  3. A data scientist notices that their model’s online predictions are worse than offline metrics. The training data comes from BigQuery, while real-time features are computed in a separate pipeline. What’s the most likely cause, and how can they fix it?
     Answer: Training-serving skew (features computed differently in training vs. serving). Fix by using Vertex AI Feature Store to ensure the same features are used in both phases.
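
The training-serving skew in the last question usually comes from two separate implementations of the same feature. A feature store addresses this at the storage layer; the same idea at the code level is a single feature function shared by both paths. The sketch below is a plain-Python illustration with hypothetical function names.

```python
def compute_features(amounts):
    """Single source of truth for feature logic, shared offline and online."""
    count = len(amounts)
    return {
        "transaction_count_5min": count,
        "avg_amount_5min": sum(amounts) / count if count else 0.0,
    }


def build_training_row(history, label):
    """Offline path: features plus the label used for training."""
    return {**compute_features(history), "label": label}


def build_serving_request(recent_amounts):
    """Online path: the same features, computed by the same function."""
    return compute_features(recent_amounts)


amounts = [10.0, 30.0, 20.0]
training_row = build_training_row(amounts, label=1)
serving_features = build_serving_request(amounts)
print(training_row)
print(serving_features)
```

Because both paths call `compute_features`, a change to the feature definition cannot reach training without also reaching serving.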

Last-Minute Cram Sheet

  1. Dataflow = serverless batch/stream processing (Apache Beam). Use for ETL, feature engineering, real-time analytics.
  2. Dataproc = managed Spark/Hadoop. Use for batch jobs, custom PySpark, or long-running transformations.
  3. Pub/Sub = real-time messaging (like Kafka). Use for decoupling producers/consumers (e.g., IoT → ML pipeline).
  4. Vertex AI Feature Store = low-latency feature serving. Reduces training-serving skew.
  5. BigQuery = serverless data warehouse. Use for SQL-based feature engineering and training data extraction.
  6. Dataflow cost drivers: vCPU-hour, memory, shuffle data. Use Streaming Engine to reduce shuffle costs.
  7. Dataproc cost drivers: cluster uptime. Use preemptible VMs for cost savings (but jobs may fail).
  8. ⚠️ Dataflow streaming is not free – batch is cheaper. For cost-sensitive pipelines, use Pub/Sub → Cloud Functions → BigQuery.
  9. ⚠️ Dataproc is not for real-time – Spark Streaming has higher latency than Dataflow.
  10. ⚠️ Schema evolution – neither Dataflow nor Dataproc adapts to new columns on its own; jobs may fail if the schema changes. Handle unknown fields explicitly and use Avro/Parquet for flexibility.