Fatskills
Practice. Master. Repeat.
Study Guide: Cloud ML - Google Cloud Professional Machine Learning Engineer: MLOps and CI/CD (Vertex Pipelines, Kubeflow, Cloud Build)
Source: https://www.fatskills.com/hesi/chapter/cloud-ml-cert-gcp-ml-mlops-and-cicd-vertex-pipelines-kubeflow-cloud-build

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Google Cloud Professional Machine Learning Engineer – MLOps & CI/CD Study Guide

Topic: Vertex Pipelines, Kubeflow, Cloud Build


What This Is

MLOps (Machine Learning Operations) is the practice of automating and scaling ML workflows, from data prep through training, deployment, and monitoring, while keeping them reproducible, governed, and covered by CI/CD (Continuous Integration/Continuous Deployment). On GCP, Vertex AI Pipelines (a managed runner for Kubeflow Pipelines and TFX) and Cloud Build are the backbone for orchestrating ML workflows, while self-managed Kubeflow (open source) provides a portable alternative for hybrid/multi-cloud setups.

Real-world scenario: A fintech company trains a fraud-detection model nightly on fresh transaction data. They use Vertex Pipelines to automate:
1. Data validation (BigQuery → Dataflow for cleaning),
2. Feature engineering (Vertex AI Feature Store),
3. Model training (Vertex AI Training),
4. A/B testing (Vertex AI Endpoints),
5. Rollback if drift exceeds 5% (Vertex AI Model Monitoring).

Cloud Build triggers the pipeline on Git pushes, ensuring code changes are tested before deployment.
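
The rollback rule in step 5 of the scenario can be pictured as a small decision function. This is only a sketch: the function name, the model IDs, and the 10% candidate share are illustrative assumptions. The one detail borrowed from Vertex AI is that an endpoint's traffic split maps deployed-model IDs to percentages that sum to 100.

```python
from typing import Dict

DRIFT_THRESHOLD = 0.05  # the 5% rollback threshold from the scenario


def choose_traffic_split(
    drift_score: float,
    stable_model_id: str,
    candidate_model_id: str,
    candidate_share: int = 10,  # assumed A/B share for the new model
) -> Dict[str, int]:
    """Return a traffic split (percentages summing to 100) for an endpoint.

    If drift exceeds the threshold, all traffic goes back to the stable
    model. Otherwise the candidate keeps its A/B share.
    """
    if not 0 <= candidate_share <= 100:
        raise ValueError("candidate_share must be between 0 and 100")
    if drift_score > DRIFT_THRESHOLD:
        return {stable_model_id: 100, candidate_model_id: 0}
    return {
        stable_model_id: 100 - candidate_share,
        candidate_model_id: candidate_share,
    }


# Hypothetical deployed-model IDs, for illustration only.
print(choose_traffic_split(0.08, "model-v2", "model-v3"))  # rollback
print(choose_traffic_split(0.02, "model-v2", "model-v3"))  # keep A/B test
```

In a real pipeline the drift score would come from Model Monitoring and the returned mapping would be applied to the endpoint; both of those steps are left out here.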


Key Terms & Services

  • Vertex AI Pipelines: GCP’s managed service for running Kubeflow Pipelines (KFP) or TensorFlow Extended (TFX) workflows. Handles scheduling, artifact tracking, and execution on serverless infrastructure, so there is no Google Kubernetes Engine (GKE) cluster for you to manage. Best for production-grade MLOps with minimal DevOps overhead.

  • Kubeflow Pipelines (KFP): Open-source framework for building portable ML pipelines (works on GKE, AWS EKS, or on-prem). Uses Python SDK to define pipelines as DAGs (Directed Acyclic Graphs) of components. Ideal for hybrid/multi-cloud or teams needing customization.

  • Cloud Build: GCP’s serverless CI/CD service for automating builds, tests, and deployments. Triggers pipelines on Git events (e.g., git push to Cloud Source Repositories) or schedules. Supports Docker, Terraform, and custom scripts.

  • Vertex AI Components: Reusable building blocks for pipelines (e.g., Vertex AI Training, Vertex AI Hyperparameter Tuning, Vertex AI Batch Prediction), shipped as the Google Cloud Pipeline Components SDK (google-cloud-pipeline-components). Each component is a containerized step with inputs/outputs tracked in Vertex ML Metadata.

  • Artifact Registry: GCP’s managed container registry for storing Docker images (e.g., custom training containers). Replaces Container Registry (deprecated). Critical for reproducible builds in pipelines.

  • Vertex ML Metadata: GCP’s lineage tracking service for ML artifacts (datasets, models, metrics). Automatically logs inputs/outputs of pipeline steps. Enables auditability and reproducibility.

  • TFX (TensorFlow Extended): Google’s open-source end-to-end ML platform for production pipelines. Integrates with Vertex Pipelines for orchestration. Best for TensorFlow-centric workflows (e.g., TF Serving, TFX libraries like ExampleGen).

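  • Kubeflow on GKE: Self-managed Kubeflow deployment on GKE. Offers more control than Vertex Pipelines but requires DevOps expertise (e.g., managing Istio, Katib for HPO). Useful when a framework (e.g., PyTorch, XGBoost) needs Kubeflow operators or cluster-level customization; Vertex Pipelines can also run these frameworks through custom containers.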

  • Cloud Scheduler: GCP’s cron service for triggering pipelines on a schedule (e.g., nightly retraining). Works with Cloud Functions or Cloud Build to start Vertex Pipelines.

  • Vertex AI Feature Store: Managed feature repository for serving pre-computed features to training/inference. Reduces training-serving skew and feature duplication. Integrates with BigQuery and Dataflow.

  • CI/CD for ML: The practice of automating ML workflows (testing, training, deployment) using tools like Cloud Build, GitHub Actions, or GitLab CI. Key steps:
    • Code → Test (unit tests, data validation),
    • Build → Train (containerize, run pipeline),
    • Deploy → Monitor (A/B test, rollback if needed).

  • Pipeline Triggers: Mechanisms to start pipelines automatically:
    • Git events (e.g., git push to main branch),
    • Schedule (e.g., daily at 2 AM via Cloud Scheduler),
    • Data changes (e.g., new files in Cloud Storage).
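
As a rough illustration of the data-change trigger, the sketch below turns a Cloud Storage "object finalized" event into pipeline parameters and submits a run. It assumes a background-style Cloud Function whose event dict carries `bucket` and `name`. The function names, `PROJECT_ID`, `REGION`, and the `gs://BUCKET/pipeline.json` template path are placeholders, not values from this guide.

```python
from typing import Any, Dict


def parameters_from_gcs_event(event: Dict[str, Any]) -> Dict[str, str]:
    """Turn a Cloud Storage 'object finalized' event into pipeline parameters."""
    bucket, name = event["bucket"], event["name"]
    return {"data_path": f"gs://{bucket}/{name}"}


def on_new_file(event: Dict[str, Any], context: Any = None) -> None:
    """Cloud Functions entry point (background-function style signature)."""
    # Imported lazily so the pure helper above can be used without the SDK.
    from google.cloud import aiplatform

    aiplatform.init(project="PROJECT_ID", location="REGION")  # placeholders
    job = aiplatform.PipelineJob(
        display_name="fraud-detection",
        template_path="gs://BUCKET/pipeline.json",  # placeholder: compiled pipeline
        parameter_values=parameters_from_gcs_event(event),
    )
    job.submit()  # returns without waiting for the run to finish


# The helper can be exercised locally with a fake event.
print(parameters_from_gcs_event({"bucket": "raw-data", "name": "tx/day1.csv"}))
```

Keeping the event-to-parameters mapping in its own function means it can be unit-tested without deploying the function or installing the Vertex AI SDK.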

Step-by-Step / Process Flow

1. Design Your ML Pipeline (DAG)

  • Action: Sketch your pipeline as a DAG (e.g., data → preprocess → train → evaluate → deploy).
  • Tools:
    • Use Kubeflow Pipelines SDK (Python) to define components.
    • For TFX, use tfx.orchestration.pipeline.Pipeline.
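  • Example:

```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output


@dsl.component
def preprocess(data: Input[Dataset], output: Output[Dataset]):
    # Preprocessing logic here
    pass


@dsl.component
def train(data: Input[Dataset], model: Output[Model]):
    # Training logic here (stub so the pipeline below has a train step)
    pass


@dsl.pipeline
def fraud_detection_pipeline():
    data = dsl.importer(...)
    # Pass each task's output artifact, not the task object itself
    processed_data = preprocess(data=data.output)
    model = train(data=processed_data.output)
```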
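
To make the DAG idea concrete without any pipeline SDK, here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). The step names mirror the example flow above. The `has_cycle` helper is a hypothetical name added for illustration. An orchestrator does something similar when it works out which step can run next.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Set

# Each step maps to the set of steps it depends on (its upstream tasks).
dependencies: Dict[str, Set[str]] = {
    "preprocess": {"import_data"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# A valid execution order runs every step after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
# ['import_data', 'preprocess', 'train', 'evaluate', 'deploy']


def has_cycle(deps: Dict[str, Set[str]]) -> bool:
    """Return True if the dependency graph is not acyclic (so not a valid DAG)."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError:
        return True
    return False


print(has_cycle(dependencies))                # False
print(has_cycle({"a": {"b"}, "b": {"a"}}))    # True
```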

2. Containerize Components

  • Action: Package each pipeline step as a Docker container.
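  • Tools:
    • Cloud Build to build images (e.g., gcloud builds submit --tag REGION-docker.pkg.dev/PROJECT_ID/REPO/preprocess:v1, where REGION and REPO stand for your Artifact Registry location and repository).
    • Store images in Artifact Registry.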
  • Key Rule: Use lightweight base images (e.g., python:3.9-slim) to reduce build time.
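
For orientation, the code inside a component container is often just a small script that reads inputs from command-line flags and writes outputs to a path. The sketch below is one possible shape for a preprocess step. The flag names, the CSV format, and the "drop rows with empty fields" rule are assumptions made for illustration, not part of any Vertex AI contract.

```python
import argparse
import csv
from pathlib import Path
from typing import List, Optional


def clean_rows(rows: List[List[str]]) -> List[List[str]]:
    """Drop blank rows and rows with an empty field (stand-in for real logic)."""
    return [row for row in rows if row and all(field.strip() for field in row)]


def main(argv: Optional[List[str]] = None) -> None:
    """Entry point the container would run, e.g. via its ENTRYPOINT."""
    parser = argparse.ArgumentParser(description="Preprocess step (sketch)")
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args(argv)

    with open(args.input_path, newline="") as src:
        rows = list(csv.reader(src))

    # Create the output directory if the orchestrator has not done so.
    Path(args.output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output_path, "w", newline="") as dst:
        csv.writer(dst).writerows(clean_rows(rows))
```

Because the script uses only the standard library, it would run on a slim base image with no extra packages, which is in line with the Key Rule above.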

3. Deploy Pipeline to Vertex AI

  • Action: Compile the pipeline and submit to Vertex AI Pipelines.
  • Tools:
    • Compile with kfp.compiler.Compiler().compile(pipeline_func, 'pipeline.json').
    • Submit via Vertex AI SDK:

```python
from google.cloud import aiplatform

# PROJECT_ID and REGION are placeholders for your own project and location
aiplatform.init(project=PROJECT_ID, location=REGION)
job = aiplatform.PipelineJob(
    display_name="fraud-detection",
    template_path="pipeline.json",
    parameter_values={"param1": "value1"},
)
job.run()
```
  • Key Rule: Use parameterized pipelines (e.g., data_path, model_version) for reusability.
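
One way to picture the reusability this rule is after: the same compiled template is submitted several times, and only the parameter values change. The sketch below builds the keyword arguments for `aiplatform.PipelineJob` for two runs. The helper name, the bucket paths, and the version strings are hypothetical, and the required-parameter check is an added convenience rather than SDK behavior.

```python
from typing import Any, Dict

# The two parameters named in the Key Rule above.
REQUIRED_PARAMS = ("data_path", "model_version")


def build_job_kwargs(display_name: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for aiplatform.PipelineJob from one parameter set."""
    missing = [p for p in REQUIRED_PARAMS if p not in params]
    if missing:
        raise ValueError(f"missing pipeline parameters: {missing}")
    return {
        "display_name": display_name,
        "template_path": "pipeline.json",  # same compiled template for every run
        "parameter_values": dict(params),
    }


# Two runs of one template; paths and versions are placeholders.
staging = build_job_kwargs(
    "fraud-detection-staging",
    {"data_path": "gs://BUCKET/staging/tx.csv", "model_version": "v3-rc1"},
)
prod = build_job_kwargs(
    "fraud-detection-prod",
    {"data_path": "gs://BUCKET/prod/tx.csv", "model_version": "v3"},
)
# With the SDK installed and initialized: aiplatform.PipelineJob(**staging).run()
```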

4. Set Up CI/CD with Cloud Build

  • Action: Automate pipeline execution on Git events.
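  • Tools:
    • Cloud Build YAML (cloudbuild.yaml):

```yaml
steps:
  # Build and push the component image. REGION and REPO are placeholders
  # for your Artifact Registry location and repository.
  - name: 'gcr.io/cloud-builders/gcloud'
    args: ['builds', 'submit', '--tag', 'REGION-docker.pkg.dev/$PROJECT_ID/REPO/preprocess:v1']
  # Install the SDKs and submit the pipeline in the same step: each step
  # runs in a fresh container, so packages installed in one step are not
  # available in the next.
  - name: 'python:3.9'
    entrypoint: 'bash'
    args: ['-c', 'pip install kfp google-cloud-aiplatform && python pipeline_deploy.py']  # Script to submit pipeline
```

  • Trigger setup:
    • Go to Cloud Build → Triggers → Create Trigger.
    • Link to GitHub/GitLab repo and branch (e.g., main).
    • Set trigger type (e.g., push to main).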
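
The guide does not show pipeline_deploy.py, so the following is only a guess at what such a script might contain. It assumes the build step passes `PROJECT_ID`, `REGION`, and `PIPELINE_ROOT` as environment variables (those names are assumptions) and that `pipeline.json` has already been compiled into the workspace.

```python
import os
from typing import Dict, Mapping

REQUIRED_ENV = ("PROJECT_ID", "REGION", "PIPELINE_ROOT")  # assumed variable names


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read deployment settings from the environment, failing fast if any is missing."""
    missing = [key for key in REQUIRED_ENV if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_ENV}


def deploy(config: Dict[str, str]) -> None:
    """Submit the compiled pipeline and wait for it to finish."""
    # Imported lazily so load_config can be tested without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=config["PROJECT_ID"], location=config["REGION"])
    job = aiplatform.PipelineJob(
        display_name="fraud-detection",
        template_path="pipeline.json",
        pipeline_root=config["PIPELINE_ROOT"],  # GCS location for run artifacts
        parameter_values={"param1": "value1"},
    )
    job.run()  # waits for completion, so a failed run should fail the build step
```

A real script would end with `deploy(load_config())`; that call is left out here so the helpers can be imported and tested on their own.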

5. Monitor and Debug

  • Action: Track pipeline runs in Vertex AI Pipelines UI.
  • Tools:
    • Vertex ML Metadata for lineage (e.g., "Which dataset trained Model v3?").
    • Cloud Logging for component logs.
    • Vertex AI Model Monitoring for drift detection.
  • Key Rule: Set up alerts (e.g., Slack notifications via Cloud Functions) for failed runs.
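
A failed-run alert can be as small as the sketch below, which a Cloud Function might call after reading a run's state. It assumes Slack's incoming-webhook format (a JSON body with a `text` field) and the Vertex AI pipeline state names `PIPELINE_STATE_FAILED` and `PIPELINE_STATE_CANCELLED`. The function names and the choice to alert on cancellations are illustrative.

```python
import json
from typing import Optional
from urllib import request

# Terminal states that should raise an alert (cancellation included by choice).
ALERT_STATES = {"PIPELINE_STATE_FAILED", "PIPELINE_STATE_CANCELLED"}


def build_alert(display_name: str, state: str) -> Optional[dict]:
    """Return a Slack message payload for a failed run, or None if no alert is needed."""
    if state not in ALERT_STATES:
        return None
    return {"text": f"Pipeline run '{display_name}' ended in state {state}"}


def send_alert(webhook_url: str, payload: dict) -> None:
    """POST the payload to a Slack incoming-webhook URL."""
    req = request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=10) as resp:
        resp.read()


print(build_alert("fraud-detection", "PIPELINE_STATE_FAILED"))
print(build_alert("fraud-detection", "PIPELINE_STATE_SUCCEEDED"))  # None
```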

Common Mistakes

Mistake 1: Using Kubeflow on GKE Instead of Vertex Pipelines for Simple Workflows

  • Correction: Use Vertex Pipelines for managed orchestration (no GKE cluster to manage). Reserve Kubeflow on GKE for:
    • Custom ML frameworks (e.g., PyTorch, JAX),
    • Advanced use cases (e.g., Katib for HPO; KServe, formerly KFServing, for inference).

Mistake 2: Hardcoding Paths in Pipeline Components

  • Correction: Use pipeline parameters (e.g., data_path: str) and pass them at runtime. Example:

```python
@dsl.pipeline
def pipeline(data_path: str):
    data = dsl.importer(artifact_uri=data_path, ...)
```

  • Why? Hardcoded paths break reproducibility and CI/CD.

Mistake 3: Skipping Artifact Tracking

  • Correction: Always log artifacts (datasets, models, metrics) to Vertex ML Metadata. Example:

```python
from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output


@dsl.component
def train(data: Input[Dataset], model: Output[Model]):
    # Training logic
    model.metadata["accuracy"] = 0.95  # Log metrics
```

  • Why? Without lineage, debugging failures or auditing compliance is impossible.

Mistake 4: Not Using Cloud Build for CI/CD

  • Correction: Automate pipeline execution with Cloud Build (not manual gcloud commands). Example:
    • Trigger on git push to main.
    • Run tests, build containers, and submit pipeline in one workflow.
  • Why? Manual deployments are error-prone and not scalable.
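
The "run tests" part of that workflow might look like the sketch below: a small data-validation function plus unit tests that a build step could run with `python -m unittest` before anything is built or submitted. The `amount` field and the validation rules are invented for illustration.

```python
import unittest
from typing import Dict, List


def validate_transactions(rows: List[Dict[str, object]]) -> List[str]:
    """Return a list of problems; an empty list means the batch is valid."""
    problems = []
    for i, row in enumerate(rows):
        if "amount" not in row:
            problems.append(f"row {i}: missing 'amount'")
        elif not isinstance(row["amount"], (int, float)) or row["amount"] < 0:
            problems.append(f"row {i}: invalid amount {row['amount']!r}")
    return problems


class ValidateTransactionsTest(unittest.TestCase):
    def test_clean_batch_passes(self):
        self.assertEqual(validate_transactions([{"amount": 12.5}]), [])

    def test_negative_amount_is_reported(self):
        self.assertEqual(len(validate_transactions([{"amount": -1}])), 1)

    def test_missing_field_is_reported(self):
        self.assertEqual(validate_transactions([{}]), ["row 0: missing 'amount'"])
```

If any test fails, the test command exits with a non-zero status, which stops the build before the container and pipeline steps run.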

Mistake 5: Ignoring Pipeline Caching

  • Correction: Enable caching in Vertex Pipelines to skip unchanged steps. Example:

```python
job = aiplatform.PipelineJob(..., enable_caching=True)
```

  • Why? Caching reduces costs and speeds up iterative development.

Certification Exam Insights

1. Vertex Pipelines vs. Kubeflow on GKE

  • Trap: The exam tests when to use Vertex Pipelines (managed) vs. Kubeflow on GKE (self-managed).
    • Vertex Pipelines: Best for GCP-native, low-DevOps teams. Supports TFX, KFP, and custom containers.
    • Kubeflow on GKE: Best for multi-cloud, custom frameworks, or advanced HPO (Katib).

2. CI/CD Triggers

  • Trap: Know the 3 ways to trigger pipelines:
    • Git events (Cloud Build + GitHub/GitLab),
    • Schedules (Cloud Scheduler),
    • Data changes (Cloud Storage triggers via Cloud Functions).
  • Key Rule: For production, use Cloud Build (not manual triggers).

3. Artifact Registry vs. Container Registry

  • Trap: Container Registry is deprecated; use Artifact Registry for pipeline images.
  • Key Rule: Artifact Registry supports Docker, Maven, npm, and Python packages (Container Registry only does Docker).

4. TFX vs. Kubeflow Pipelines

  • Trap: TFX is TensorFlow-centric (e.g., ExampleGen, Trainer), while KFP is framework-agnostic.
  • Key Rule: Use TFX if your pipeline is 100% TensorFlow; use KFP for PyTorch, XGBoost, or custom code.

Quick Check Questions

Question 1

A retail company wants to retrain its recommendation model daily using fresh user clickstream data. The pipeline must run automatically on a schedule, log all artifacts, and support rollback if model performance degrades. Which GCP services should they use?

Answer: Vertex AI Pipelines + Cloud Scheduler + Vertex ML Metadata + Vertex AI Model Monitoring.

  • Why? Vertex Pipelines orchestrates the workflow, Cloud Scheduler triggers it daily, ML Metadata tracks artifacts, and Model Monitoring detects drift.

Question 2

A data scientist wants to test a new preprocessing step in their pipeline without affecting production. They use GitHub for version control. What’s the most efficient way to implement this?

Answer: Cloud Build trigger on a feature branch + Vertex Pipelines with parameterized inputs.

  • Why? Cloud Build can run the pipeline on a feature branch, and parameters allow testing without modifying production code.

Question 3

A team is migrating from AWS to GCP. Their current pipeline uses SageMaker Pipelines and ECR. Which GCP services should they use for equivalent functionality?

Answer: Vertex AI Pipelines + Artifact Registry.

  • Why? Vertex Pipelines replaces SageMaker Pipelines, and Artifact Registry replaces ECR.


Last-Minute Cram Sheet

  1. Vertex Pipelines = Managed Kubeflow Pipelines (KFP) on GCP. Use for production MLOps.
  2. Kubeflow on GKE = Self-managed Kubeflow. Use for custom frameworks or multi-cloud.
  3. Cloud Build = GCP’s CI/CD service. Triggers pipelines on Git events or schedules.
  4. Artifact Registry = Replaces Container Registry. Stores Docker images for pipelines.
  5. Vertex ML Metadata = Tracks lineage (datasets → models → metrics).
  6. TFX = TensorFlow Extended. Use for TF-centric pipelines (e.g., ExampleGen).
  7. KFP SDK = Python library to define pipelines as DAGs. Compile with kfp.compiler.Compiler().
  8. Pipeline caching = Set enable_caching=True to skip unchanged steps. Disable for debugging.
  9. Cloud Scheduler = Cron for pipelines. Use with Cloud Functions or Cloud Build.
  10. Trap: Vertex Pipelines does not run Katib. For HPO on Vertex AI, use Vertex AI Hyperparameter Tuning; use Kubeflow on GKE only if you specifically need Katib.