Fatskills
Practice. Master. Repeat.
Study Guide: Cloud ML - Google Cloud Professional Machine Learning Engineer: Logging, Monitoring, and Alerting (Cloud Logging, Cloud Monitoring, Vertex Explainable AI)
Source: https://www.fatskills.com/machine-learning-101/chapter/cloud-ml-cert-gcp-ml-logging-monitoring-and-alerting-cloud-logging-cloud-monitoring-vertex-explainable-ai

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What This Is

Logging, monitoring, and alerting are critical for maintaining reliability, performance, and explainability in ML pipelines. Without them, models can silently degrade, drift, or produce biased predictions, leading to costly failures. For example, in a real-time fraud detection system, you need to:
  • Log inference requests (who, when, what input, what prediction).
  • Monitor latency, error rates, and feature drift (e.g., sudden spikes in transaction amounts).
  • Alert when model confidence drops below a threshold (e.g., fraud probability < 0.7).
  • Explain why a transaction was flagged (e.g., "high transaction amount + unusual location") to comply with regulations like GDPR.

Google Cloud provides Cloud Logging (centralized logs), Cloud Monitoring (metrics, dashboards, alerts), and Vertex Explainable AI (model interpretability) to solve these challenges.


Key Terms & Services

Google Cloud Services

  • Cloud Logging: GCP’s centralized log management service. Stores, searches, and analyzes logs from Compute Engine, GKE, Cloud Functions, Vertex AI, and custom apps. Best for debugging, auditing, and compliance.
  • Cloud Monitoring: GCP’s observability platform. Collects metrics, logs, and traces from GCP and hybrid environments. Best for dashboards, SLOs, and alerting.
  • Vertex AI Model Monitoring: Tracks training-serving skew and prediction drift in the features of deployed models. Alerts when serving data deviates from the training data or from earlier serving data.
  • Vertex Explainable AI: Provides feature attributions (via sampled Shapley, integrated gradients, or XRAI) to explain model predictions. Critical for regulated industries (finance, healthcare).
  • Cloud Audit Logs: Tracks who did what, and when, in GCP (e.g., "User X deployed a model to Vertex AI at 2:30 PM"). Required for compliance (SOC 2, HIPAA).
  • Cloud Trace: Distributed tracing for latency analysis (e.g., "Why is my Vertex AI endpoint slow?").
  • Cloud Profiler: CPU and heap profiling for performance bottlenecks in ML services (e.g., custom prediction containers).

General ML Concepts

  • Feature Drift: When input data distribution changes over time (e.g., user behavior shifts post-pandemic). Detected via KL divergence, JS divergence, or population stability index (PSI).
  • Prediction Drift: When model outputs change unexpectedly (e.g., fraud detection model starts flagging too many false positives).
  • Bias Detection: Monitoring for disparate impact (e.g., loan approvals favoring one demographic). Vertex Explainable AI helps identify biased features.
  • SLOs (Service Level Objectives): Targets for model performance (e.g., "99% of predictions must return in <200ms"). Cloud Monitoring tracks compliance with these targets and can alert when they are at risk.
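The drift measures named above can be made concrete with a small calculation. The following is a minimal sketch, assuming the feature has already been bucketed into the same histogram bins for the baseline and the current window; the bin counts and the helper names (`psi`, `js_divergence`) are illustrative and not part of any Google Cloud API.

```python
import math

def _proportions(counts, eps=1e-6):
    """Turn raw bin counts into proportions, flooring each at eps to avoid log(0)."""
    total = sum(counts)
    floored = [max(c / total, eps) for c in counts]
    norm = sum(floored)
    return [p / norm for p in floored]

def psi(baseline_counts, current_counts):
    """Population stability index over matching histogram bins."""
    p = _proportions(baseline_counts)
    q = _proportions(current_counts)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

def _kl(a, b):
    """Kullback-Leibler divergence KL(a || b) using the natural log."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def js_divergence(baseline_counts, current_counts):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    p = _proportions(baseline_counts)
    q = _proportions(current_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

# Hypothetical histograms of transaction amounts (same four bins in both).
baseline = [50, 30, 15, 5]   # training-time distribution
current = [20, 25, 30, 25]   # recent serving traffic

print(f"PSI = {psi(baseline, current):.3f}")
print(f"JS  = {js_divergence(baseline, current):.3f}")
```

With these made-up counts the PSI lands above the 0.25 level that this guide uses as an example alert threshold, while identical histograms give a PSI of zero.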

Step-by-Step / Process Flow

1. Set Up Logging for an ML Pipeline

Scenario: You’re deploying a Vertex AI endpoint for a recommendation model. You need to log:
  • Inference requests (user ID, input features, timestamp).
  • Prediction responses (recommended items, confidence scores).
  • Errors (e.g., "Feature X missing").

Steps:
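1. Enable prediction logging on the Vertex AI endpoint:
  • Access logging records each request (timestamp, latency) in Cloud Logging. Enable it when you deploy the model to the endpoint.
  • Request-response logging stores a sample of request and response payloads in a BigQuery table. Enable it in the endpoint settings or via the SDK.
2. Custom Logs via Python SDK:
```python
from datetime import datetime, timezone

from google.cloud import logging

logging_client = logging.Client()
logger = logging_client.logger("vertex_ai_recommendations")

def predict(request, model):
    # `model` is a placeholder for whatever object serves predictions.
    # Run inference first so the prediction exists before it is logged.
    prediction = model.predict(request.features)
    logger.log_struct({
        "user_id": request.user_id,
        "features": request.features,
        "prediction": prediction,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return prediction
```
3. Query Logs in Cloud Logging:
  • Filter with: `resource.type="aiplatform.googleapis.com/Endpoint" jsonPayload.method="predict"`.
  • Export to BigQuery for long-term analysis.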
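Log queries like the one in step 3 are often assembled in code, for example to restrict results to a time window. Here is a minimal sketch, assuming field values contain no double quotes; `build_log_filter` is a hypothetical helper, not part of the Cloud Logging client library.

```python
from datetime import datetime, timezone

def build_log_filter(resource_type, since, extra=None):
    """Join Cloud Logging query clauses with AND (values are assumed to contain no quotes)."""
    # Cloud Logging compares timestamps written in RFC 3339 form.
    since_utc = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    clauses = [
        f'resource.type="{resource_type}"',
        f'timestamp>="{since_utc}"',
    ]
    for field, value in (extra or {}).items():
        clauses.append(f'{field}="{value}"')
    return " AND ".join(clauses)

since = datetime(2024, 1, 1, tzinfo=timezone.utc)
query = build_log_filter(
    "aiplatform.googleapis.com/Endpoint",
    since,
    {"jsonPayload.method": "predict"},
)
print(query)
```

The resulting string can be pasted into the Logs Explorer or passed as the `filter_` argument of the `list_entries` method on the Python logging client.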


2. Monitor Model Performance & Drift

Scenario: Your fraud detection model’s precision drops from 95% to 80%. You need to detect this automatically.

Steps:
1. Enable Vertex AI Model Monitoring:
  • In Vertex AI > Model Monitoring, create a monitoring job.
  • Select:
    • Objective: "Prediction drift" or "Feature skew".
    • Baseline: Training data (for skew) or a recent time window of serving data (for drift).
    • Schedule: Hourly/daily.
2. Set Alerting Policies in Cloud Monitoring:
  • Go to Cloud Monitoring > Alerting > Create Policy.
  • Condition: metric.type="aiplatform.googleapis.com/model_monitoring/drift" > threshold (e.g., PSI > 0.25).
  • Notification: Email/Slack/PagerDuty.
3. Visualize in Dashboards:
  • Create a Cloud Monitoring dashboard with:
    • Latency (p99).
    • Error rate.
    • Drift metrics (PSI, KL divergence).
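To make the dashboard quantities concrete, the sketch below computes p99 latency, error rate, and compliance with a latency target from a batch of request records. It assumes a hypothetical record shape (`latency_ms`, `status`) and uses the nearest-rank percentile definition. Cloud Monitoring computes these aggregates itself, so this only illustrates what the numbers mean.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(requests, latency_slo_ms=200.0, target=0.99):
    """Summarize a batch of request records against a latency SLO."""
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if r["status"] >= 500)
    within = sum(1 for ms in latencies if ms < latency_slo_ms)
    compliance = within / len(requests)
    return {
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),
        "slo_compliance": compliance,
        "slo_met": compliance >= target,
    }

# Hypothetical sample: 98 fast requests, one slow request, one server error.
sample = [{"latency_ms": 80.0, "status": 200}] * 98
sample += [
    {"latency_ms": 450.0, "status": 200},
    {"latency_ms": 120.0, "status": 500},
]
summary = summarize(sample)
print(summary)
```

The default values mirror the guide's example SLO of 99% of predictions returning in under 200 ms. Both the threshold and the target are parameters, so other objectives can be checked the same way.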


3. Explain Model Predictions with Vertex Explainable AI

Scenario: A bank’s loan approval model rejects an applicant. The applicant requests an explanation (GDPR "right to explanation").

Steps:
1. Configure Explainability for the Model:
  • When you upload or deploy the model, set the explanation method, e.g., explanation_method="integrated-gradients" (or "sampled-shapley").
  • Deploy the model with explanations enabled.
2. Request Explanations at Inference:
```python
from google.cloud import aiplatform

endpoint = aiplatform.Endpoint("projects/PROJECT/locations/REGION/endpoints/ENDPOINT_ID")

# input_data is one instance in the format the deployed model expects.
response = endpoint.explain(instances=[input_data])
print(response.explanations[0].attributions)  # Feature importance
```
3. Log Explanations for Compliance:
  • Store explanations in Cloud Logging or BigQuery for audits.
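One way to store explanations for audits is to reduce each one to a compact, ranked record before writing it out. The sketch below assumes the attributions have already been flattened into a plain feature-to-score dictionary. That shape, the field names, and `build_audit_record` are illustrative assumptions, not the format Vertex AI returns.

```python
import json
from datetime import datetime, timezone

def build_audit_record(request_id, prediction, attributions, top_k=3):
    """Rank features by absolute attribution and keep the top_k for the audit trail."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return {
        "request_id": request_id,
        "prediction": prediction,
        "top_features": [
            {"feature": name, "attribution": score} for name, score in ranked[:top_k]
        ],
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical attributions for a rejected loan application.
attributions = {
    "income": -0.12,
    "debt_ratio": 0.41,
    "age": 0.03,
    "late_payments": 0.27,
}
record = build_audit_record("req-001", "rejected", attributions)
print(json.dumps(record, indent=2))
```

Because the record is a plain JSON-serializable dictionary, it could be written with the same `log_struct` call used for custom logs earlier in this guide, or loaded into a BigQuery table.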


Common Mistakes

| Mistake | Correction |
|---|---|
| Assuming Cloud Logging is enabled by default for Vertex AI. | Vertex AI does not log requests/responses by default. You must enable it in the endpoint settings or via SDK. |
| Monitoring only latency/errors, not drift. | Model performance can degrade without errors. Always monitor feature drift, prediction drift, and bias. |
| Using Cloud Monitoring for logs (or vice versa). | Cloud Monitoring = metrics/dashboards. Cloud Logging = logs. They’re complementary, not interchangeable. |
| Not setting up alerts for drift. | Drift can go unnoticed for weeks. Set automated alerts in Cloud Monitoring. |
| Ignoring Vertex Explainable AI’s cost. | Explanations are billed at prediction rates but take longer to compute than plain predictions, so heavy use increases serving cost. Use sparingly for compliance. |

Certification Exam Insights

What the Exam Tests

  1. Service Selection Traps:
    • "When to use Cloud Logging vs. Cloud Monitoring?"
      • Logging: Debugging, auditing, compliance (e.g., "Who called this endpoint?").
      • Monitoring: Metrics, dashboards, alerts (e.g., "Is latency > 200ms?").
    • "Vertex AI Model Monitoring vs. Vertex Explainable AI?"
      • Model Monitoring: Detects drift/skew (operational).
      • Explainable AI: Explains individual predictions (compliance).

  2. Key Constraints:
    • Vertex AI Model Monitoring only works for structured data (not images/text).
    • Cloud Logging retention: 30 days by default (extend with Log Buckets or export to BigQuery).
    • Explainable AI supports tabular data and images (not text).

  3. "Which Service?" Scenarios:
    • Need to debug a failed Vertex AI training job? → Cloud Logging (filter for aiplatform.googleapis.com/TrainingJob).
    • Need to alert on high prediction latency? → Cloud Monitoring (create an alert policy).
    • Need to explain a loan rejection to a customer? → Vertex Explainable AI.

Quick Check Questions

Question 1

A fintech company’s fraud detection model is deployed on Vertex AI. They need to comply with GDPR and provide explanations for rejected transactions. Which GCP service should they use?
Answer: Vertex Explainable AI (provides feature attributions for individual predictions).
Why not Vertex AI Model Monitoring? That detects drift; it does not produce explanations.

Question 2

A data scientist notices that their Vertex AI endpoint’s latency has increased from 100ms to 500ms. They want to identify the bottleneck. Which two GCP services should they use?
Answer: Cloud Trace (distributed tracing) + Cloud Profiler (CPU/heap analysis).
Why not Cloud Logging? Logs won’t show latency breakdowns.

Question 3

A retail company’s recommendation model is experiencing feature drift (user behavior changed post-holiday season). They want to detect this automatically and alert the ML team. Which GCP service should they configure?
Answer: Vertex AI Model Monitoring (tracks drift/skew) + Cloud Monitoring alerts.
Why not Cloud Logging? Logs won’t calculate drift metrics.


Last-Minute Cram Sheet

  1. Cloud Logging = logs (debugging, auditing). Cloud Monitoring = metrics/dashboards/alerts.
  2. Vertex AI Model Monitoring detects drift/skew (operational). Vertex Explainable AI explains predictions (compliance).
  3. Enable request/response logging in Vertex AI endpoints (not on by default).
  4. Explainable AI is billed at prediction rates, but explanations take longer to compute, so heavy use raises serving cost.
  5. Cloud Logging retention: 30 days default. Extend with Log Buckets or export to BigQuery.
  6. Vertex AI Model Monitoring only works for structured data (not images/text).
  7. Cloud Audit Logs track who did what (e.g., "User X deployed a model").
  8. SLOs (e.g., "99% of predictions <200ms") are defined and tracked in Cloud Monitoring, which can alert when they are at risk.
  9. Don’t confuse Cloud Trace (latency) with Cloud Profiler (CPU/heap).
  10. Don’t confuse Vertex AI Model Monitoring with Vertex Explainable AI (drift vs. explanations).