Fatskills
Practice. Master. Repeat.
Study Guide: Cloud ML - Azure AI Engineer Associate (Exam AI-102): Monitor and Diagnose (Application Insights, Log Analytics, Azure Monitor)
Source: https://www.fatskills.com/machine-learning-101/chapter/cloud-ml-cert-azure-ai-monitor-and-diagnose-application-insights-log-analytics-azure-monitor

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What This Is

Monitoring and diagnostics in Azure AI/ML ensure your models, endpoints, and pipelines run reliably, scale efficiently, and meet SLAs. Imagine deploying a real-time fraud detection model behind an Azure Machine Learning (AML) managed endpoint—without monitoring, you won’t know if latency spikes are due to cold starts, throttling, or data drift. Application Insights tracks live metrics (latency, failures), Log Analytics stores and queries logs, and Azure Monitor provides alerts and dashboards. Together, they form the observability stack for AI systems, helping you debug failures, optimize costs, and meet compliance requirements.


Key Terms & Services

  • Azure Monitor: The umbrella service for collecting, analyzing, and acting on telemetry (metrics, logs, traces) from Azure resources. Think of it as the "nervous system" of your cloud environment.
    • Best for: Centralized monitoring, alerting, and visualization (e.g., tracking AML endpoint latency or AKS pod failures).

  • Application Insights: A feature of Azure Monitor that provides APM (Application Performance Monitoring) for live apps, including ML endpoints, APIs, and web services.
    • Best for: Tracking request rates, failure rates, dependency calls (e.g., to a database or external API), and custom metrics (e.g., model prediction confidence).
    • Key feature: Live Metrics Stream (real-time telemetry) and Smart Detection (automated anomaly alerts).

  • Log Analytics: A log storage and query engine (part of Azure Monitor) that ingests logs from Azure resources, VMs, containers, and custom apps.
    • Best for: Storing and querying logs (e.g., AML pipeline runs, AKS pod logs, or custom Python logs from a model training script).
    • Query language: Kusto Query Language (KQL), essential for the AI-102 exam.

  • Azure Monitor Alerts: Rules that trigger actions (e.g., emails, webhooks, or Azure Functions) when a condition is met (e.g., "AML endpoint latency > 500ms for 5 minutes").
    • Best for: Proactive incident response (e.g., scaling up an AKS cluster when CPU > 80%).

  • Azure Monitor Workbooks: Interactive dashboards that combine metrics, logs, and visualizations (e.g., a workbook showing AML endpoint performance + data drift metrics).
    • Best for: Custom dashboards for stakeholders (e.g., business teams monitoring model accuracy over time).

  • Azure Monitor Metrics Explorer: A tool to visualize and analyze metrics (e.g., AML endpoint request count, AKS CPU usage).
    • Best for: Ad-hoc troubleshooting (e.g., "Why did my endpoint latency spike at 2 PM?").

  • Diagnostic Settings (Azure Monitor Logs): Per-resource configuration that routes platform logs and metrics to a destination (e.g., a Log Analytics workspace, Storage Account, or Event Hub).
    • Best for: Routing AML pipeline logs to Log Analytics for long-term retention.

  • Azure Monitor for Containers (now called Container insights): Specialized monitoring for AKS (Azure Kubernetes Service), including pod logs, node metrics, and cluster health.
    • Best for: Debugging AML endpoints deployed on AKS (e.g., "Why is my pod crashing?").

  • Azure Monitor Autoscale: Automatically scales supported resources (e.g., Virtual Machine Scale Sets, App Service plans, AML managed online endpoint deployments) based on metrics (e.g., CPU, memory, or custom metrics like "requests per second") or a schedule. AKS node pools scale through the cluster autoscaler instead.
    • Best for: Cost optimization (e.g., scaling in an AML online deployment during off-peak hours).

  • Azure Event Grid: A pub/sub service for event-driven architectures (e.g., triggering an alert when an AML pipeline fails).
    • Best for: Real-time notifications (e.g., "Notify Slack when model training completes").

  • Azure Data Explorer (ADX): A big data analytics platform (similar to Log Analytics but optimized for high-volume, low-latency queries).
    • Best for: Advanced log analysis (e.g., querying terabytes of AML pipeline logs for trends).
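
The "custom Python logs" and "custom metrics (e.g., model prediction confidence)" mentioned above can be illustrated with a minimal sketch that uses only the standard library. It is not the Azure SDK: it only shows one way a scoring script might emit one JSON object per prediction, which a log collector could then forward to Log Analytics. The logger name, the `custom_dimensions` key, and the field names are assumptions chosen for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        payload.update(getattr(record, "custom_dimensions", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("fraud_model")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_prediction(logger: logging.Logger, request_id: str, confidence: float) -> None:
    """Log one prediction with its confidence as a structured field."""
    logger.info(
        "prediction served",
        extra={"custom_dimensions": {"request_id": request_id, "confidence": confidence}},
    )


buffer = io.StringIO()  # stands in for stdout or a log file
log = build_logger(buffer)
log_prediction(log, "req-001", 0.87)
print(buffer.getvalue().strip())
```

Keeping the confidence as its own field, rather than embedding it in the message text, is what makes it filterable and chartable once the logs are queried.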

Step-by-Step / Process Flow

Scenario: Monitoring an AML Managed Endpoint for a Fraud Detection Model

  1. Enable Application Insights for the AML Endpoint
    • In AML Studio, navigate to your managed endpoint → Monitoring → Enable Application Insights.
    • Why? This automatically instruments the endpoint with APM (latency, failures, dependencies).

  2. Configure Diagnostic Settings to Log Analytics
    • Go to Azure Monitor → Diagnostic settings → select your AML workspace → add a setting to send logs to Log Analytics.
    • What to log:
      • AmlComputeClusterEvent (compute usage)
      • AmlRunStatusChangedEvent (pipeline runs)
      • AmlDataStoreEvent (data access)

  3. Set Up Alerts for Anomalies
    • In Azure Monitor → Alerts → New alert rule:
      • Scope: AML endpoint (e.g., fraud-detection-endpoint).
      • Condition: Failed Requests > 5 in 5 minutes.
      • Action: Email the responsible team and trigger an Azure Logic App to restart the endpoint.

  4. Create a Workbook for Stakeholders
    • In Azure Monitor → Workbooks → New:
      • Add a metrics chart (e.g., "Endpoint Latency Over Time").
      • Add a KQL query (e.g., requests | where success == false | summarize count() by bin(timestamp, 1h)).
      • Share with the fraud team via a direct link.

  5. Query Logs for Debugging
    • Open Log Analytics (the Logs blade of the Application Insights resource) and run a KQL query to find failed requests:

        requests
        | where cloud_RoleName == "fraud-detection-endpoint"
        | where success == false
        | project timestamp, operation_Name, resultCode, duration
        | order by timestamp desc

    • Note: When you query the Log Analytics workspace directly, workspace-based Application Insights stores this data in the AppRequests table, which uses different column names (e.g., TimeGenerated, AppRoleName, ResultCode).
    • Why? This helps identify if failures are due to model errors (e.g., 500 Internal Server Error) or throttling (e.g., 429 Too Many Requests).

  6. Autoscale the Endpoint Based on Traffic
    • In AML Studio → Endpoints → select your endpoint → Autoscale:
      • Default: 1 instance.
      • Scale out: Add 1 instance when requests per second > 100.
      • Scale in: Remove 1 instance when requests per second < 50.
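
The autoscale rule in the last step can be sketched as a small simulation. This is only an illustration of the threshold logic (scale out above 100 requests per second, scale in below 50, never below 1 instance), not how Azure Monitor autoscale is implemented. The maximum instance count, the traffic samples, and the absence of a cooldown period are assumptions that the walkthrough does not specify.

```python
from dataclasses import dataclass


@dataclass
class AutoscaleRule:
    min_instances: int = 1        # the walkthrough's default instance count
    max_instances: int = 5        # assumed upper bound, not from the walkthrough
    scale_out_rps: float = 100.0  # add 1 instance above this rate
    scale_in_rps: float = 50.0    # remove 1 instance below this rate

    def next_count(self, current: int, rps: float) -> int:
        """Return the instance count after evaluating one traffic sample."""
        if rps > self.scale_out_rps:
            current += 1
        elif rps < self.scale_in_rps:
            current -= 1
        # Clamp so the deployment never leaves the configured bounds.
        return max(self.min_instances, min(self.max_instances, current))


def simulate(rule: AutoscaleRule, traffic: list[float], start: int = 1) -> list[int]:
    """Apply the rule to each sample in turn and record the instance count."""
    counts = []
    current = start
    for rps in traffic:
        current = rule.next_count(current, rps)
        counts.append(current)
    return counts


# Hypothetical requests-per-second samples, one per evaluation interval.
traffic = [40, 120, 150, 80, 45, 30, 20]
history = simulate(AutoscaleRule(), traffic)
print(history)
```

The gap between the two thresholds (50 and 100) is what keeps the instance count stable at moderate load; with a single threshold, traffic hovering around it would cause constant scaling in and out.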

Common Mistakes

  • Mistake: Assuming Application Insights is only for web apps.
    • Correction: Application Insights works for any HTTP-based service, including AML endpoints, AKS pods, and Azure Functions. Enable it for all ML endpoints.

  • Mistake: Not enabling diagnostic settings for AML.
    • Correction: AML logs (e.g., pipeline runs, compute usage) won’t appear in Log Analytics unless you explicitly configure diagnostic settings.

  • Mistake: Using Log Analytics for real-time alerts.
    • Correction: Logs arrive in Log Analytics with ingestion latency (typically a few minutes), and log-based alert rules run on a schedule, so they lag behind the event. For near-real-time alerts, use Application Insights metrics (e.g., requests/failure rate).

  • Mistake: Forgetting to set up autoscale for AML endpoints.
    • Correction: AML endpoints don’t autoscale by default. Without autoscale, you’ll either overpay for idle instances or throttle users during traffic spikes.

  • Mistake: Querying logs without KQL knowledge.
    • Correction: The AI-102 exam heavily tests KQL. Learn basic queries like where, summarize, project, and join. Example: requests | where duration > 1000 | count.
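
For readers who know Python better than KQL, the basic operators named above can be mirrored on an in-memory list. This is a rough sketch of what `where`, `count`, `summarize ... by bin()`, and `project` compute, not how Log Analytics executes them. The sample rows are invented; the column names follow the ones used in this guide's queries.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request telemetry rows.
rows = [
    {"timestamp": datetime(2024, 1, 1, 14, 5), "duration": 1250.0, "success": False},
    {"timestamp": datetime(2024, 1, 1, 14, 40), "duration": 310.0, "success": True},
    {"timestamp": datetime(2024, 1, 1, 15, 10), "duration": 2200.0, "success": False},
    {"timestamp": datetime(2024, 1, 1, 15, 55), "duration": 95.0, "success": False},
]


def hour_bin(ts: datetime) -> datetime:
    """Round a timestamp down to the hour, like bin(timestamp, 1h)."""
    return ts.replace(minute=0, second=0, microsecond=0)


# requests | where duration > 1000 | count
slow_count = sum(1 for r in rows if r["duration"] > 1000)

# requests | where success == false | summarize count() by bin(timestamp, 1h)
failures_per_hour = Counter(hour_bin(r["timestamp"]) for r in rows if not r["success"])

# requests | project timestamp, duration
projected = [{"timestamp": r["timestamp"], "duration": r["duration"]} for r in rows]

print(slow_count)
for bucket, count in sorted(failures_per_hour.items()):
    print(bucket.isoformat(), count)
```

Reading a KQL pipeline left to right as filter, then group, then select maps directly onto the three statements above.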

Certification Exam Insights

  1. Service Selection Traps
    • Application Insights vs. Log Analytics:
      • Use Application Insights for real-time APM (latency, failures, dependencies).
      • Use Log Analytics for long-term log storage and querying (e.g., AML pipeline logs).
    • Azure Monitor Alerts vs. AML Alerts:
      • Azure Monitor Alerts are for infrastructure metrics (e.g., AKS CPU, AML endpoint latency).
      • AML Alerts are for ML-specific events (e.g., model drift, data drift).

  2. Key Constraints
    • Log Analytics retention: Default is 30 days (can be extended to 2 years for a fee).
    • Application Insights sampling: Default behavior depends on the SDK. The ASP.NET and ASP.NET Core SDKs enable adaptive sampling by default, which throttles telemetry toward a target rate (5 items per second per host by default) rather than keeping a fixed percentage of requests. The rate is adjustable.
    • KQL query limits: Log Analytics queries time out after 10 minutes (optimize with summarize and where clauses).

  3. Tricky Scenarios
    • "Which service to use for debugging a failed AML pipeline?"
      • Answer: Log Analytics (query AmlComputeClusterEvent and AmlRunStatusChangedEvent).
    • "How to monitor AML endpoint latency in real time?"
      • Answer: Application Insights Live Metrics Stream.
    • "Which service triggers an action when AML model accuracy drops?"
      • Answer: Azure Monitor Alerts (condition: accuracy < 0.9 → action: Logic App to retrain the model).

  4. Cost Optimization
    • Log Analytics: Pay per GB ingested and GB stored. Use log filtering (e.g., exclude verbose logs) to reduce costs.
    • Application Insights: Pay per GB of data ingested. Use sampling to reduce volume.
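
Because both services bill by ingested volume, the effect of sampling on cost can be estimated with simple arithmetic. The sketch below assumes a fixed-rate sample in which the given fraction of telemetry is kept. The event volume, average event size, and price per GB are placeholders for illustration, not Azure list prices.

```python
def monthly_ingestion_gb(
    events_per_day: int,
    avg_event_kb: float,
    sampling_rate: float = 1.0,
    days: int = 30,
) -> float:
    """Estimate ingested GB per month when `sampling_rate` of events is kept."""
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in the range (0, 1]")
    total_kb = events_per_day * avg_event_kb * sampling_rate * days
    return total_kb / (1024 * 1024)  # KB -> GB


def monthly_cost(ingested_gb: float, price_per_gb: float) -> float:
    """Multiply volume by a per-GB price supplied by the caller."""
    return ingested_gb * price_per_gb


# Hypothetical workload: 2 million telemetry items per day at about 1 KB each.
full_gb = monthly_ingestion_gb(2_000_000, 1.0)
sampled_gb = monthly_ingestion_gb(2_000_000, 1.0, sampling_rate=0.25)

PLACEHOLDER_PRICE_PER_GB = 2.50  # illustrative only; check current pricing

print(f"unsampled: {full_gb:.2f} GB, cost {monthly_cost(full_gb, PLACEHOLDER_PRICE_PER_GB):.2f}")
print(f"sampled:   {sampled_gb:.2f} GB, cost {monthly_cost(sampled_gb, PLACEHOLDER_PRICE_PER_GB):.2f}")
```

The estimate scales linearly with the sampling rate, so keeping 25% of telemetry cuts ingestion to a quarter. The trade-off is less detail available when debugging rare failures.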

Quick Check Questions

  1. A data scientist notices that their AML endpoint is returning 500 Internal Server Error for some requests. They need to debug the issue quickly. Which service should they use first?
    • Answer: Application Insights (check the Failures blade for error details and stack traces).
    • Why? Application Insights provides real-time APM for endpoints, including error traces.

  2. An ML engineer wants to set up an alert that triggers when an AML pipeline run fails. Which service should they configure?
    • Answer: Azure Monitor Alerts (condition: AML pipeline status = "Failed").
    • Why? AML pipeline logs are sent to Log Analytics, and alerts are configured in Azure Monitor.

  3. A team needs to store AML pipeline logs for 1 year for compliance. Which service should they use, and what setting must they configure?
    • Answer: Log Analytics with retention set to 365 days.
    • Why? Log Analytics supports long-term retention (up to 2 years), while Application Insights is optimized for short-term APM.
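
Alert questions like the second one come down to how a threshold condition is evaluated over a time window. The sketch below models the condition used earlier in this guide ("Failed Requests > 5 in 5 minutes") as a sliding window over individual requests. Azure Monitor itself evaluates aggregated metrics at a configured frequency, so this per-request version is a simplification, and the class name and timestamps are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureAlert:
    """Signal when more than `threshold` failures fall inside `window`."""

    def __init__(self, threshold: int = 5, window: timedelta = timedelta(minutes=5)):
        self.threshold = threshold
        self.window = window
        self._failures: deque = deque()  # timestamps of recent failures

    def record(self, timestamp: datetime, success: bool) -> bool:
        """Record one request; return True if the alert condition holds."""
        if not success:
            self._failures.append(timestamp)
        # Drop failures that are a full window old or older.
        cutoff = timestamp - self.window
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) > self.threshold


alert = FailureAlert()
start = datetime(2024, 1, 1, 14, 0)

# Six failures, 30 seconds apart: the sixth one crosses the threshold.
fired = [alert.record(start + timedelta(seconds=30 * i), False) for i in range(6)]
print(fired)

# A successful request ten minutes later: the old failures have aged out.
recovered = alert.record(start + timedelta(minutes=10), True)
print(recovered)
```

In a real rule, the returned flag would correspond to the alert firing and invoking its action group (an email, a webhook, or a Logic App).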

Last-Minute Cram Sheet

  1. Application Insights = APM for live apps (latency, failures, dependencies).
  2. Log Analytics = Log storage + KQL queries (AML pipeline logs, AKS pod logs).
  3. Azure Monitor = Umbrella service (alerts, workbooks, metrics).
  4. KQL is required for Log Analytics queries (e.g., requests | where success == false).
  5. Diagnostic settings must be configured to send AML logs to Log Analytics.
  6. Autoscale AML endpoints to handle traffic spikes (default: 1 instance).
  7. Application Insights sampling reduces costs (default behavior depends on the SDK; adaptive sampling targets a telemetry rate, not a fixed percentage).
  8. Log Analytics retention = 30 days (extendable to 2 years).
  9. Application Insights ≠ Log Analytics (use the right tool for the job).
  10. AML endpoints don’t autoscale by default (configure it manually).