Fatskills
Practice. Master. Repeat.
Study Guide: Cloud ML - Google Cloud Professional Machine Learning Engineer: Distributed Training (GPUs, TPUs, Reduction Server)
Source: https://www.fatskills.com/hesi/chapter/cloud-ml-cert-gcp-ml-distributed-training-gpus-tpus-reduction-server

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What This Is

Distributed training splits a model's training workload across multiple accelerators or machines, both to shorten training time and to handle models too large for a single device. It matters most for deep learning models (e.g., LLMs, vision transformers) trained on massive datasets, where single-GPU training could take weeks. Illustrative scenario: a fintech company needs to train a fraud detection model on 10 TB of transaction logs. If a single GPU would take 30 days, a multi-node GPU job with Reduction Server (or a TPU pod slice) might cut that to about 2 days while keeping costs predictable.


Key Terms & Services

  • Cloud TPU (Tensor Processing Unit): Google’s custom ASIC designed for high-speed matrix operations (e.g., v5e for cost-efficient training, v4 for large-scale LLMs). Best for TensorFlow/PyTorch workloads with XLA (Accelerated Linear Algebra) compilation.
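  • Cloud GPU (A100, L4, T4): NVIDIA GPUs on GCP (e.g., A100 for high-memory training, L4 and T4 for inference and lighter training). More flexible than TPUs, but on raw Compute Engine you provision and scale the VMs yourself (e.g., gcloud compute instances create --machine-type=a2-highgpu-1g for an A100, or --accelerator=type=nvidia-tesla-t4,count=1 on an N1 VM).
  • Reduction Server: A Vertex AI Training feature that speeds up all-reduce (the gradient-synchronization step in data-parallel training) by routing gradients through dedicated CPU-only reducer nodes instead of a GPU ring. It applies to multi-node GPU jobs that use NCCL, which covers TensorFlow and PyTorch, and is not used for TPU jobs.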
  • Vertex AI Training: GCP’s managed service for distributed training (supports TPUs, GPUs, and custom containers). Handles job scheduling, scaling, and logging.
  • Horovod: Open-source framework for distributed training (works with TensorFlow, PyTorch, Keras). Uses ring-allreduce for gradient synchronization.
  • Data Parallelism: Splits batches across workers (each worker trains on a subset of data). Most common pattern for distributed training.
  • Model Parallelism: Splits the model across workers (e.g., one worker handles layers 1–10, another 11–20). Used for very large models (e.g., LLMs > 10B parameters).
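  • Synchronous Training: All workers average gradients at every step, so each step waits for the slowest worker (slower per step but stable). This is what tf.distribute.MirroredStrategy, PyTorch DistributedDataParallel, and Horovod do.
  • Asynchronous Training: Workers push updates to shared parameters independently (e.g., a parameter-server setup). Steps never block on stragglers, but stale gradients can hurt convergence, so it is rare in production.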
  • XLA (Accelerated Linear Algebra): Compiler that optimizes TensorFlow/PyTorch code for TPUs/GPUs. Required for TPU training.
  • GKE (Google Kubernetes Engine) for Training: Run distributed training on Kubernetes (e.g., PyTorch on GKE with GPU autoscaling). More control but higher ops overhead.
  • Preemptible VMs: Cheaper (up to 80% discount) but can be reclaimed by GCP at any time and run for at most 24 hours; Spot VMs are the newer equivalent. Use for fault-tolerant workloads (e.g., checkpointed training, hyperparameter tuning).
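
To make the data-parallel and synchronous-training entries concrete, here is a minimal pure-Python sketch on a toy one-weight model. The helper names (gradient, split_batch, all_reduce_mean) are illustrative and not part of any framework; the point is only that averaging per-worker gradients over equal shards reproduces the single-device gradient.

```python
# Synchronous data parallelism on a toy linear model y = w * x (pure Python).
# All names here are hypothetical helpers for illustration only.

def gradient(w, batch):
    """Mean-squared-error gradient dL/dw, averaged over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def split_batch(batch, num_workers):
    """Give each worker an equal, contiguous shard of the global batch."""
    shard = len(batch) // num_workers
    return [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]

def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

global_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
w = 0.0

shards = split_batch(global_batch, num_workers=2)
local_grads = [gradient(w, shard) for shard in shards]   # computed in parallel
synced_grads = all_reduce_mean(local_grads)              # one sync per step
single_device_grad = gradient(w, global_batch)

print(local_grads, synced_grads, single_device_grad)
```

Because every worker applies the same averaged gradient, all model replicas stay identical after each step, which is the property synchronous training relies on.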

Step-by-Step / Process Flow

1. Choose Your Accelerator (GPU vs. TPU)

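  • Use TPUs if:
    • Training TensorFlow/PyTorch models with XLA (e.g., tf.distribute.TPUStrategy).
    • Need cost-efficient scaling (the guide's rule of thumb is that TPUs are roughly 50% cheaper than GPUs for large jobs; check current pricing for your workload).
    • Example (TPU VM; v4 capacity is offered in us-central2-b, and the runtime version should be replaced with a currently supported one): gcloud compute tpus tpu-vm create my-tpu --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-tf-2.12
  • Use GPUs if:
    • Training non-XLA frameworks (e.g., Hugging Face Transformers without XLA).
    • Need flexibility (e.g., custom CUDA kernels or ops that XLA does not support).
    • Example (A100 GPUs come bundled with A2 machine types; a2-highgpu-4g includes 4): gcloud compute instances create my-gpu-vm --zone=us-central1-a --machine-type=a2-highgpu-4g --maintenance-policy=TERMINATE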
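
The selection rules above can be written down as a tiny decision helper. This is only a sketch of the guide's rule of thumb; choose_accelerator and its two flags are hypothetical names, not a GCP API, and real decisions also depend on price, quota, and model size.

```python
def choose_accelerator(uses_xla: bool, needs_custom_cuda: bool) -> str:
    """Encode the guide's rule of thumb (hypothetical helper, not a GCP API)."""
    if needs_custom_cuda:
        return "GPU"  # custom CUDA kernels only run on NVIDIA GPUs
    if uses_xla:
        return "TPU"  # XLA-compiled TensorFlow/PyTorch maps well onto TPUs
    return "GPU"      # default to the more flexible option

print(choose_accelerator(uses_xla=True, needs_custom_cuda=False))
print(choose_accelerator(uses_xla=True, needs_custom_cuda=True))
```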

2. Set Up Distributed Training

Option A: Vertex AI Training (Managed)

  1. Package your training code (e.g., trainer/task.py using tf.distribute.MultiWorkerMirroredStrategy or torch.nn.parallel.DistributedDataParallel; tf.distribute.MirroredStrategy only covers the GPUs on a single machine).
  2. Define a config.yaml (specify workerPoolSpecs with machineType, acceleratorType, acceleratorCount).
  3. Submit the job:
       gcloud ai custom-jobs create \
         --region=us-central1 \
         --display-name=my-distributed-job \
         --config=config.yaml
  4. Monitor in Vertex AI Dashboard (logs, metrics, and failure handling).
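
Inside a multi-replica job, each replica usually needs to know its own role. For TensorFlow jobs this is conventionally read from the TF_CONFIG environment variable, a JSON document with "cluster" and "task" keys. The sketch below parses it with the standard library only; describe_task is a hypothetical helper, and the host names in the example are made up.

```python
import json
import os

def describe_task(environ=os.environ):
    """Return (task_type, task_index, num_training_replicas) from TF_CONFIG."""
    raw = environ.get("TF_CONFIG")
    if raw is None:
        return ("chief", 0, 1)  # single-machine run: no cluster description
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    # Count only replicas that take part in training steps.
    num_replicas = sum(
        len(hosts) for role, hosts in cluster.items() if role in ("chief", "worker")
    )
    return (task.get("type", "chief"), int(task.get("index", 0)), num_replicas)

# Made-up example of what the second worker in a 3-replica job might see.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {"chief": ["host0:2222"], "worker": ["host1:2222", "host2:2222"]},
        "task": {"type": "worker", "index": 1},
    })
}
task_type, task_index, num_replicas = describe_task(example_env)
is_chief = task_type == "chief"  # typically only the chief writes checkpoints
print(task_type, task_index, num_replicas, is_chief)
```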

Option B: GKE (Self-Managed)

  1. Create a GKE cluster with GPU nodes (A100 GPUs require an A2 machine type, and the accelerator name is nvidia-tesla-a100):
       gcloud container clusters create my-gke-cluster \
         --zone=us-central1-a \
         --machine-type=a2-highgpu-4g \
         --accelerator=type=nvidia-tesla-a100,count=4 \
         --num-nodes=4
  2. Deploy a PyTorch training job (using kubectl and a YAML manifest with torch.distributed).
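  3. Set the number of workers in the job manifest and let the cluster autoscaler add or remove GPU nodes. HorizontalPodAutoscaler is designed for stateless serving workloads; a synchronous training job has a fixed world size at launch unless you use an elastic launcher.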
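
In a torch.distributed job, each pod typically learns its position from the RANK and WORLD_SIZE environment variables and then reads a disjoint slice of the dataset. The sketch below shows that sharding idea with the standard library only; read_rank and shard_indices are hypothetical helpers, and unlike PyTorch's DistributedSampler this version does not shuffle or pad uneven shards.

```python
def read_rank(environ):
    """Read this process's rank and the total process count (defaults: 0 of 1)."""
    return int(environ.get("RANK", "0")), int(environ.get("WORLD_SIZE", "1"))

def shard_indices(num_samples, rank, world_size):
    """Strided split: worker r takes samples r, r + world_size, r + 2*world_size, ..."""
    return list(range(rank, num_samples, world_size))

# Made-up environment for the second of four workers.
rank, world_size = read_rank({"RANK": "1", "WORLD_SIZE": "4"})
my_indices = shard_indices(10, rank, world_size)
all_shards = [shard_indices(10, r, world_size) for r in range(world_size)]
print(my_indices)
print(all_shards)
```

Every sample lands in exactly one shard, so the workers together see the whole dataset once per epoch without overlap.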

3. Optimize with Reduction Server

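  • Enable Reduction Server in Vertex AI Training by adding a third, CPU-only worker pool that runs Google's Reduction Server container. There is no enable flag on the GPU pool; the training image must have the Reduction Server NCCL plugin installed, and the image URI below should be checked against the current Vertex AI documentation: ```yaml
workerPoolSpecs:
  - machineSpec:            # pool 0: primary GPU worker (exactly one replica)
      machineType: a2-highgpu-4g
      acceleratorType: NVIDIA_TESLA_A100
      acceleratorCount: 4
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/my-project/my-trainer:latest
  - machineSpec:            # pool 1: remaining GPU workers
      machineType: a2-highgpu-4g
      acceleratorType: NVIDIA_TESLA_A100
      acceleratorCount: 4
    replicaCount: 3
    containerSpec:
      imageUri: gcr.io/my-project/my-trainer:latest
  - machineSpec:            # pool 2: CPU-only reducers
      machineType: n1-highcpu-16
    replicaCount: 4
    containerSpec:
      imageUri: us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest
```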
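  • Reduction Server is offered only through Vertex AI custom training; gcloud compute instances create has no Reduction Server flag. On self-managed clusters (Compute Engine or GKE), rely on NCCL's built-in all-reduce or Horovod instead.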
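
To see why all-reduce can become the bottleneck, it helps to simulate the ring algorithm that Horovod and NCCL commonly use and count the traffic. The sketch below is a simplified model, not NCCL or Reduction Server code: each worker's gradient is split into one chunk per worker, and the last two functions encode the commonly cited estimate that a ring sends about 2(N-1)/N of the gradient size per worker while a reducer-based design sends the gradient once.

```python
def ring_all_reduce(worker_grads):
    """Sum gradients across workers on a simulated ring; returns (data, chunks_sent)."""
    n = len(worker_grads)
    data = [list(g) for g in worker_grads]
    if any(len(g) != n for g in data):
        raise ValueError("this sketch expects exactly one chunk per worker")
    chunks_sent = 0
    # Phase 1, reduce-scatter: after n-1 steps worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        outgoing = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(outgoing):
            data[(r + 1) % n][chunk] += value
            chunks_sent += 1
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(outgoing):
            data[(r + 1) % n][chunk] = value
            chunks_sent += 1
    return data, chunks_sent

def ring_bytes_sent_per_worker(num_workers, gradient_bytes):
    """Each worker sends 2 * (N - 1) chunks of size gradient_bytes / N."""
    return 2 * (num_workers - 1) / num_workers * gradient_bytes

def reducer_bytes_sent_per_worker(gradient_bytes):
    """Simplified model: each worker sends its gradient to the reducers once."""
    return gradient_bytes

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced, chunks_sent = ring_all_reduce(grads)
print(reduced, chunks_sent)
print(ring_bytes_sent_per_worker(8, 1000), reducer_bytes_sent_per_worker(1000))
```

The ring also needs 2(N-1) sequential steps, so its latency grows with the number of workers, whereas a reducer-based exchange takes a fixed number of hops.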

4. Monitor & Debug

  • Vertex AI Metrics: Track per-replica GPU utilization and network bytes sent and received, alongside step time (step time usually has to be logged by your own training code).
  • Cloud Logging: Filter for worker-0 vs. worker-1 to debug stragglers.
  • TPU Profiler: Use tf.profiler to identify bottlenecks (e.g., data loading vs. compute).
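
Once per-worker step times are in the logs, spotting a straggler is a small calculation. The sketch below assumes you have already extracted step times per worker into a dictionary; find_stragglers and the 1.5x threshold are illustrative choices, not a Vertex AI feature.

```python
from statistics import median

def find_stragglers(step_times, threshold=1.5):
    """Return workers whose mean step time exceeds threshold x the median worker."""
    means = {worker: sum(times) / len(times) for worker, times in step_times.items()}
    typical = median(means.values())
    return sorted(worker for worker, mean in means.items() if mean > threshold * typical)

# Made-up step times in seconds, as might be parsed from Cloud Logging.
step_times = {
    "worker-0": [0.5, 0.5],
    "worker-1": [0.5, 0.5],
    "worker-2": [1.0, 1.0],  # twice as slow as its peers
    "worker-3": [0.5, 0.5],
}
stragglers = find_stragglers(step_times)
print(stragglers)
```

In synchronous training a single slow worker sets the pace for the whole job, so a flagged worker is worth checking for input-pipeline or hardware problems.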

Common Mistakes

| Mistake | Correction |
| --- | --- |
| Treating XLA as optional on TPUs | TPU code always runs through XLA. Wrap the training step in tf.function and launch it with strategy.run; dynamic shapes and unsupported ops trigger recompilation or fallbacks and can slow training severely. |
| Ignoring data pipeline bottlenecks | Accelerators often sit idle waiting for input. Use tf.data.Dataset with num_parallel_calls and prefetch. |
| Mixing GPUs and TPUs in one job | GCP does not support hybrid GPU/TPU training. Pick one accelerator type per job. |
| Not using Reduction Server for >4 GPUs | On multi-node GPU jobs, all-reduce can become the bottleneck. The guide's rule of thumb is to enable Reduction Server for jobs with more than 4 GPUs. |
| Assuming preemptible VMs are always cheaper | Preemptible VMs can be reclaimed mid-training, and lost work costs money. Use them only for checkpointed or stateless workloads. |
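
The checkpointing advice for preemptible VMs comes down to a save-and-resume loop. The sketch below uses JSON files and a placeholder "training" update so it runs with the standard library alone; in a real job the checkpoint would be written to Cloud Storage with the framework's own checkpoint API, and all function names here are hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, weights):
    """Write to a temp file first, then rename, so a kill never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return (step, weights); start from scratch if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, [0.0]
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]

def train(path, total_steps, checkpoint_every, stop_after=None):
    """Resume from the latest checkpoint; stop_after simulates a preemption."""
    step, weights = load_checkpoint(path)
    while step < total_steps:
        weights = [w + 1.0 for w in weights]  # placeholder for a real update
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, weights)
        if stop_after is not None and step == stop_after:
            return step  # the VM was reclaimed here
    return step

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    stopped_at = train(ckpt, total_steps=10, checkpoint_every=2, stop_after=5)
    after_preemption = load_checkpoint(ckpt)  # step 5 was lost, step 4 survives
    finished_at = train(ckpt, total_steps=10, checkpoint_every=2)
    final_state = load_checkpoint(ckpt)

print(stopped_at, after_preemption, finished_at, final_state)
```

The checkpoint interval is the trade-off: frequent saves cost I/O time, while infrequent saves mean more repeated work after each preemption.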

Certification Exam Insights

  1. TPU vs. GPU Selection:
    • TPUs are tested for cost efficiency (e.g., "Which accelerator minimizes cost for a 100B-parameter LLM?" Answer: a TPU v4 pod slice such as v4-128).
    • GPUs are tested for flexibility (e.g., "Which accelerator supports CUDA custom kernels?" Answer: A100).

  2. Reduction Server Trap:
    • The exam may ask: "When should you enable Reduction Server?"
    • Answer: For multi-node GPU jobs where gradient all-reduce is the bottleneck (the guide's rule of thumb is more than 4 workers). It is not used with TPUs and brings little benefit on small GPU clusters.

  3. Vertex AI vs. GKE:
    • Vertex AI Training is for managed jobs (less ops overhead).
    • GKE is for custom setups (e.g., Kubeflow, Ray).

  4. Data Parallelism vs. Model Parallelism:
    • Data parallelism is the default (splits batches).
    • Model parallelism is for huge models (e.g., "A 50B-parameter model won't fit on a single GPU"). PyTorch has no torch.nn.parallel.DistributedModelParallel class; in practice you shard the model with tools such as PyTorch FSDP (torch.distributed.fsdp.FullyShardedDataParallel), DeepSpeed, or Megatron-LM style tensor and pipeline parallelism.
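
A quick back-of-the-envelope calculation shows why a 50B-parameter model cannot be trained on one GPU. The sketch below assumes the common estimate of 16 bytes per parameter for mixed-precision Adam training (2 for half-precision weights, 2 for gradients, 12 for the FP32 master weights and two optimizer moments) and ignores activations, so real requirements are higher. The function names are illustrative.

```python
import math

def training_memory_gb(num_params, bytes_per_param=16):
    """Approximate memory for weights, gradients, and Adam state (no activations)."""
    return num_params * bytes_per_param / 1e9

def min_devices(num_params, device_memory_gb, bytes_per_param=16):
    """Smallest device count whose combined memory could hold the training state."""
    return math.ceil(training_memory_gb(num_params, bytes_per_param) / device_memory_gb)

params_50b = 50_000_000_000
print(training_memory_gb(params_50b))                    # full training state in GB
print(training_memory_gb(params_50b, bytes_per_param=2)) # half-precision weights only
print(min_devices(params_50b, device_memory_gb=40))      # A100 40 GB
print(min_devices(params_50b, device_memory_gb=80))      # A100 80 GB
```

Even the half-precision weights alone exceed a single 80 GB device, which is why the state has to be sharded or the layers split across accelerators.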

Quick Check Questions

  1. A team is training a ResNet-50 model on 10M images. They want the fastest training time with minimal code changes. Which GCP accelerator should they use?
    • Answer: Cloud TPU v4 (strong throughput for XLA-compiled TensorFlow/PyTorch, and tf.distribute.TPUStrategy keeps code changes small).

  2. A company is running a PyTorch distributed training job on 8 A100 GPUs. Training is slow due to gradient synchronization. What should they enable?
    • Answer: Reduction Server (optimizes all-reduce operations for GPU clusters).

  3. A startup wants to train a 10B-parameter LLM on a tight budget. Which GCP setup is most cost-effective?
    • Answer: Preemptible TPU v5e capacity with regular checkpointing (TPUs are cost-efficient for LLMs, and preemptible pricing lowers the bill further).

Last-Minute Cram Sheet

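  1. TPU code always runs through XLA: wrap the training step in tf.function and launch it with strategy.run. Dynamic shapes and unsupported ops are the usual performance killers.
  2. Reduction Server is for multi-node GPU jobs using NCCL (rule of thumb: >4 workers). Not used with TPUs.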
  3. Vertex AI Training supports TPUs/GPUs but not hybrid (pick one).
  4. Preemptible VMs are 80% cheaper but can be terminated. Use for checkpointed jobs.
  5. Data parallelism = split batches. Model parallelism = split layers (for huge models).
  6. A100 GPUs are best for flexibility (CUDA, mixed precision).
  7. TPU v4 is best for large-scale LLMs (cost-efficient).
  8. GKE is for custom setups (Kubeflow, Ray). Vertex AI is for managed jobs.
  9. All-reduce bottleneck? Enable Reduction Server or use Horovod.
  10. TPUs don’t support all PyTorch ops (check TPU compatibility list).