By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Topic: Data Preparation Architecture (Dataflow, Dataproc, Pub/Sub)
Data preparation is the backbone of any ML pipeline—garbage in, garbage out. In GCP, Dataflow (serverless stream/batch processing), Dataproc (managed Spark/Hadoop), and Pub/Sub (real-time messaging) handle ingestion, cleaning, transformation, and feature engineering at scale. Real-world scenario: A fintech company processes 10M+ daily transactions for fraud detection, requiring low-latency feature computation (e.g., rolling averages of user spending) and batch training on historical data. The pipeline must handle schema evolution, late-arriving data, and exactly-once processing while keeping costs predictable.
Cloud Dataflow: GCP’s serverless stream/batch processing service (Apache Beam under the hood). Best for ETL, feature engineering, and real-time analytics with auto-scaling and exactly-once processing. Use when you need unified batch/stream processing without managing clusters.
Cloud Dataproc: GCP’s managed Spark/Hadoop service. Best for large-scale batch processing (e.g., training data prep) or custom PySpark jobs where you need full control over the runtime (e.g., installing libraries). Not serverless—you pay for cluster uptime.
Pub/Sub: GCP’s real-time messaging service (like Kafka). Best for decoupling producers/consumers (e.g., IoT devices → fraud detection pipeline) and fan-out patterns (e.g., one event triggers multiple ML models). Guarantees at-least-once delivery (idempotency required).
BigQuery: GCP’s serverless data warehouse. Best for SQL-based feature engineering, training data extraction, and ad-hoc analysis. Integrates with Vertex AI Feature Store for low-latency feature serving.
Vertex AI Feature Store: GCP’s managed feature repository for storing, sharing, and serving ML features. Reduces training-serving skew by ensuring the same features are used in both phases. Supports online/offline feature retrieval.
Data Fusion: GCP’s drag-and-drop ETL tool (like Informatica). Best for no-code/low-code pipelines (e.g., business analysts moving data from Salesforce to BigQuery). Not for ML-specific transformations (use Dataflow/Dataproc instead).
Apache Beam: Open-source unified batch/stream processing model (used by Dataflow). Key concepts:
ParDo
GroupByKey
Windowing: Splits unbounded data into finite chunks (e.g., fixed, sliding, or session windows).
Spark on Dataproc:
Structured Streaming: Micro-batch processing for real-time (but higher latency than Dataflow).
Schema Evolution: Handling changes in data structure (e.g., new columns) without breaking pipelines. BigQuery and Dataflow support schema auto-detection; Dataproc requires manual handling.
Exactly-Once Processing: Ensures each event is processed once and only once (critical for financial transactions). Dataflow supports this natively; Pub/Sub + Dataflow requires idempotent sinks (e.g., BigQuery with WRITE_TRUNCATE).
WRITE_TRUNCATE
Watermarks: Mechanism to track late-arriving data in streaming pipelines. Dataflow uses watermarks to decide when to close a window (e.g., "wait 5 minutes for late transactions").
Cost Model:
Goal: Ingest transactions → compute rolling features (e.g., "user spent $X in last 5 mins") → train a model → serve predictions in <100ms.
transactions
fraud-detection-sub
Why Pub/Sub? Decouples producers (devices) from consumers (Dataflow) and handles spikes in traffic.
Stream Processing with Dataflow
beam.io.ReadFromPubSub
user_id
transaction_count_5min
avg_amount_5min
bash gcloud dataflow jobs run fraud-features \ --gcs-location=gs://dataflow-templates/latest/Python \ --parameters=input_subscription=projects/PROJECT/subscriptions/fraud-detection-sub,output_table=PROJECT:dataset.transactions
Key settings:
--enableStreamingEngine
--numWorkers=10
Batch Training Data Prep with Dataproc
gs://fraud-data/raw/
user_metadata
bash gcloud dataproc jobs submit pyspark \ --cluster=fraud-cluster \ --region=us-central1 \ --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar \ -- gs://fraud-data/scripts/preprocess.py
Why Dataproc? Cheaper for long-running batch jobs (vs. Dataflow’s per-vCPU pricing).
Feature Serving with Vertex AI Feature Store
Create a feature group in Vertex AI: ```python from google.cloud import aiplatform
aiplatform.init(project=PROJECT, location=REGION) feature_group = aiplatform.FeatureGroup.create( name="fraud_features", entity_type="user_id", online_store_fixed_node_count=1 # For low-latency serving ) - Ingest features from Dataflow (streaming) or Dataproc (batch). - Serve features in real-time via:python features = feature_group.read( entity_id="user123", feature_selector=["transaction_count_5min", "avg_amount_5min"] ) ```
- Ingest features from Dataflow (streaming) or Dataproc (batch). - Serve features in real-time via:
Monitor and Optimize
shuffle read size
Window.into(FixedWindows.of(5, TimeUnit.MINUTES)).withAllowedLateness(Duration.standardMinutes(1))
Dataflow vs. Dataproc:
Streaming vs. Batch Costs:
Streaming Dataflow is more expensive than batch (per-vCPU pricing). For cost-sensitive pipelines, use Pub/Sub → Cloud Functions → BigQuery (if latency allows).
Feature Store Integration:
Exam trap: "Where to store raw transaction data?" → BigQuery/Cloud Storage, not Feature Store.
Exactly-Once Processing:
Answer: Dataproc (cheaper for long-running batch jobs; Dataflow would be overkill for daily batch).
A gaming company needs to detect cheating in real-time (latency <100ms) by analyzing player actions. The pipeline must handle 10K events/sec and compute rolling features (e.g., "actions per minute"). Which GCP architecture should they use?
Answer: Pub/Sub → Dataflow (streaming) → Vertex AI Feature Store (Dataflow for low-latency processing; Feature Store for serving).
A data scientist notices that their model’s online predictions are worse than offline metrics. The training data comes from BigQuery, while real-time features are computed in a separate pipeline. What’s the most likely cause, and how can they fix it?
Join 4M+ learners. Unlock unlimited quizzes, wrong-answer tracking, flashcards + reminders, study guides, and 1-on-1 challenges.