Study Guide: Cloud ML - Azure AI Engineer Associate (Exam AI-102): Azure AI Vision (OCR, Spatial Analysis)
Source: https://www.fatskills.com/hesi/chapter/cloud-ml-cert-azure-ai-azure-ai-vision-ocr-spatial-analysis

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Azure AI Vision (OCR, Spatial Analysis) – AI-102 Exam-Ready Study Guide

What This Is

Azure AI Vision is a suite of pre-built computer vision APIs that extract text (OCR), analyze spatial relationships (people/object detection in video), and interpret visual content (image tagging, object detection). It’s critical in ML pipelines where unstructured image/video data must be converted into structured insights—e.g., automating invoice processing (OCR), monitoring retail foot traffic (Spatial Analysis), or validating IDs in banking. Unlike custom-trained models (e.g., Azure Custom Vision), these APIs require no training data and deploy in minutes, making them ideal for rapid prototyping or low-code solutions.


Key Terms & Services

  • Azure AI Vision (Computer Vision API): Microsoft’s managed service for image analysis (tagging, object detection, OCR, facial analysis). Best for pre-built models with no training required. Supports batch and real-time processing.

  • Read API (OCR): A specialized OCR endpoint within Azure AI Vision optimized for printed and handwritten text (e.g., receipts, forms, PDFs). Handles multi-language and complex layouts (tables, mixed fonts).

  • Spatial Analysis: A video analytics feature that detects people/objects in space (e.g., crowd density, social distancing, queue length). Uses RTSP streams (security cameras) or pre-recorded videos. Outputs JSON events (e.g., "Person entered zone A at 10:02 AM").

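  • Azure Video Analyzer (AVA): A hybrid service (cloud + edge) for real-time video processing (e.g., object tracking, anomaly detection). Often paired with Spatial Analysis for IoT scenarios (e.g., smart cities). Note that Microsoft has since retired Azure Video Analyzer, so treat it as a legacy option; Spatial Analysis itself ships as an Azure AI Vision container that runs on IoT Edge.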

  • Azure Form Recognizer (now named Azure AI Document Intelligence): A document-focused OCR service with pre-built models for invoices, receipts, IDs, and business cards. Better than Read API for structured forms (e.g., extracting line items from a receipt).

  • Azure Cognitive Search (now named Azure AI Search): A search engine that indexes OCR output (e.g., PDFs, images) for full-text search. Often used with AI Vision to enable searchable archives (e.g., legal documents).

  • Azure IoT Edge: Deploys AI Vision models to edge devices (e.g., cameras, drones) for low-latency processing without cloud dependency. Critical for offline scenarios (e.g., oil rigs, ships).

  • Bounding Box: A rectangle (x,y coordinates) marking the location of detected text/objects in an image. Returned by OCR and Spatial Analysis for downstream processing (e.g., cropping, redacting).

  • Confidence Score: A 0–1 probability indicating how certain the model is about a prediction (e.g., "92% confidence this is the word 'Invoice'"). Used to filter low-quality results in production.

  • RTSP (Real-Time Streaming Protocol): A video streaming protocol used by cameras to send live feeds to Spatial Analysis or Video Analyzer. Requires on-premises or edge deployment for real-time processing.

  • Batch vs. Real-Time Processing:
    • Batch: Process large volumes of images/videos in one scheduled run (e.g., nightly invoice processing). Uses Azure Storage plus asynchronous Read operations, submitted one file per request.
    • Real-Time: Process individual frames/streams as they arrive (e.g., live retail analytics). Uses REST API calls or IoT Edge.
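To make the bounding box and confidence score terms concrete, here is a minimal sketch that splits OCR words into accepted and rejected sets. The sample data is made up and only modeled on the shape of a v3.2 Read result page (lines containing words, each with an 8-number polygon and a confidence); the helper names `to_rect` and `split_by_confidence` and the 0.8 threshold are illustrative assumptions, not part of any Azure SDK.

```python
from typing import Dict, List, Tuple

# Made-up sample, shaped like one page of a v3.2 Read result (an assumption;
# check the field names against the API version you actually call).
SAMPLE_PAGE = {
    "page": 1,
    "lines": [
        {
            "text": "Invoice #: 12345",
            "boundingBox": [10, 10, 200, 10, 200, 40, 10, 40],
            "words": [
                {"text": "Invoice", "boundingBox": [10, 10, 90, 10, 90, 40, 10, 40], "confidence": 0.92},
                {"text": "#:", "boundingBox": [95, 10, 115, 10, 115, 40, 95, 40], "confidence": 0.55},
                {"text": "12345", "boundingBox": [120, 10, 200, 10, 200, 40, 120, 40], "confidence": 0.98},
            ],
        }
    ],
}


def to_rect(polygon: List[float]) -> Tuple[float, float, float, float]:
    """Collapse an 8-number polygon (x1, y1, ..., x4, y4) into (left, top, right, bottom)."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def split_by_confidence(page: Dict, threshold: float = 0.8) -> Tuple[List[Dict], List[Dict]]:
    """Return (accepted, rejected) words, each with text, rectangle, and confidence."""
    accepted: List[Dict] = []
    rejected: List[Dict] = []
    for line in page.get("lines", []):
        for word in line.get("words", []):
            entry = {
                "text": word["text"],
                "rect": to_rect(word["boundingBox"]),
                "confidence": word["confidence"],
            }
            # Words below the threshold are kept aside for manual review.
            (accepted if word["confidence"] >= threshold else rejected).append(entry)
    return accepted, rejected


if __name__ == "__main__":
    ok, review = split_by_confidence(SAMPLE_PAGE)
    print("accepted:", [w["text"] for w in ok])
    print("needs review:", [w["text"] for w in review])
```

Keeping the rejected words, rather than discarding them, lets a downstream step route them to manual review and use the rectangle to crop the region a reviewer needs to see.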

Step-by-Step / Process Flow

1. OCR with Azure AI Vision (Read API)

Scenario: Extract text from 10,000 scanned invoices stored in Azure Blob Storage.

  1. Create an Azure AI Vision resource
    • In the Azure Portal, create a Cognitive Services resource (select "Computer Vision").
    • Note the endpoint (e.g., https://<your-resource>.cognitiveservices.azure.com/) and API key.

  2. Upload images to Azure Blob Storage
    • Create a Blob Storage container (e.g., invoices-input).
    • Upload PDFs/images (supported formats: JPEG, PNG, PDF, TIFF).

  3. Call the Read API
    • Submit (async):
      • Send POST /vision/v3.2/read/analyze with either a JSON body containing the image URL (e.g., a Blob Storage SAS URL) or the raw file bytes (Content-Type: application/octet-stream).
      • The service replies 202 Accepted and returns the result URL in the Operation-Location response header.
    • Retrieve:
      • Poll GET /vision/v3.2/read/analyzeResults/{operationId} until status is succeeded or failed.
      • The final JSON contains text, bounding boxes, and confidence scores.
    • Near-real-time use:
      • The v3.2 Read operation is always asynchronous, so interactive apps submit one image and poll at short intervals.
      • For a single synchronous call on one image, the newer Image Analysis 4.0 API offers a Read feature.

  4. Process the OCR output
    • Parse the JSON to extract:
      • Detected text (e.g., "Invoice #: 12345").
      • Bounding boxes (for cropping/redaction).
      • Confidence scores (filter low-confidence results).
    • Store results in Azure SQL Database or Cosmos DB for downstream use.

  5. Automate with Azure Functions
    • Trigger a Function on new Blob uploads to call the Read API.
    • Use Durable Functions for async batch processing.
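The submit-then-poll flow in step 3 can be sketched with only the Python standard library. This is a rough outline under stated assumptions, not an official client: the function names (`build_submit_request`, `submit`, `fetch_json`, `poll_result`), the one-second interval, and the attempt limit are all illustrative choices, and the endpoint, key, and SAS URL are placeholders you would supply. The poller takes the fetch step as a callable so it can be exercised without a network connection.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

READ_PATH = "/vision/v3.2/read/analyze"
KEY_HEADER = "Ocp-Apim-Subscription-Key"


def build_submit_request(endpoint: str, key: str, image_url: str) -> urllib.request.Request:
    """Build the POST request that asks the Read API to analyze one image URL."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint.rstrip("/") + READ_PATH,
        data=body,
        headers={KEY_HEADER: key, "Content-Type": "application/json"},
        method="POST",
    )


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the polling URL from the Operation-Location header."""
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def fetch_json(url: str, key: str) -> Dict:
    """GET the operation status/result as a dict."""
    request = urllib.request.Request(url, headers={KEY_HEADER: key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def poll_result(
    fetch: Callable[[], Dict],
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch() until the operation reports succeeded or failed."""
    for _ in range(max_attempts):
        result = fetch()
        if result.get("status") in ("succeeded", "failed"):
            return result
        sleep(interval_s)
    raise TimeoutError("Read operation did not finish within the attempt limit")
```

A typical call sequence would be `url = submit(build_submit_request(endpoint, key, sas_url))` followed by `poll_result(lambda: fetch_json(url, key))`. Inside an Azure Function, the same two calls would sit in the blob-trigger handler, with Durable Functions taking over the polling loop for large batches.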

2. Spatial Analysis for Retail Foot Traffic

Scenario: Count customers entering a store and measure queue wait times using security cameras.

  1. Set up Azure Video Analyzer (AVA)
    • Deploy an AVA resource in the Azure Portal.
    • Configure an IoT Edge device (e.g., NVIDIA Jetson) to run the AVA module near the camera.

  2. Connect the camera feed
    • Configure the camera to stream via RTSP (e.g., rtsp://<camera-ip>/stream).
    • Register the camera in AVA and assign a topology (e.g., "RetailAnalytics").

  3. Define spatial zones and rules
    • Use the AVA portal to draw:
      • Entry/exit zones (e.g., "Front Door").
      • Queue lines (e.g., "Checkout Line").
    • Set rules (e.g., "Alert if >5 people in queue for >10 minutes").

  4. Deploy the Spatial Analysis model
    • Push the AVA module to the IoT Edge device.
    • The model processes video locally (no cloud dependency) and sends JSON events to Azure Event Hubs.

  5. Visualize insights in Power BI
    • Connect Event Hubs to Azure Stream Analytics to aggregate data.
    • Build a Power BI dashboard showing:
      • Peak foot traffic hours.
      • Average queue wait time.
      • Heatmaps of customer movement.
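The aggregation and alert rule described in the steps above can be sketched in plain Python. The event layout used here (`zone`, `type`, `timestamp` keys, and `(timestamp, people_count)` queue samples) is a hypothetical simplification invented for illustration; real Spatial Analysis events have a richer schema, and in production this logic would more likely live in a Stream Analytics query.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple


def entries_per_hour(events: Iterable[Dict], zone: str) -> Dict[str, int]:
    """Count 'enter' events for one zone, bucketed by hour (hypothetical event schema)."""
    counts: Counter = Counter()
    for event in events:
        if event["zone"] == zone and event["type"] == "enter":
            ts = datetime.fromisoformat(event["timestamp"])
            counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)


def queue_alert_start(
    samples: Iterable[Tuple[str, int]],
    max_people: int = 5,
    max_duration: timedelta = timedelta(minutes=10),
) -> Optional[datetime]:
    """Return when the queue first exceeded max_people if it then stayed above
    that level for longer than max_duration; otherwise return None.

    samples must be (ISO timestamp, people_count) pairs sorted by time.
    """
    over_since: Optional[datetime] = None
    for ts_text, people in samples:
        ts = datetime.fromisoformat(ts_text)
        if people > max_people:
            if over_since is None:
                over_since = ts  # queue just crossed the limit
            if ts - over_since > max_duration:
                return over_since
        else:
            over_since = None  # queue dropped back, reset the timer
    return None


if __name__ == "__main__":
    demo = [("2024-05-01T10:00:00", 6), ("2024-05-01T10:06:00", 7), ("2024-05-01T10:11:00", 6)]
    print(queue_alert_start(demo))
```

The hourly counts feed a "peak foot traffic hours" chart directly, and the alert function mirrors the example rule "Alert if >5 people in queue for >10 minutes".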

Common Mistakes

Mistake: Using Read API for structured forms (e.g., invoices).
Correction: Use Azure Form Recognizer instead; it is optimized for key-value pairs (e.g., "Total: $100") and tables. Read API is better for unstructured text (e.g., books, signs).

Mistake: Assuming Spatial Analysis works with cloud-only video.
Correction: Spatial Analysis requires RTSP streams or pre-recorded videos processed at the edge (IoT Edge). It cannot analyze videos stored in Blob Storage directly.

Mistake: Ignoring confidence scores in OCR output.
Correction: Always filter results with low confidence (e.g., <80%) to avoid errors in downstream systems (e.g., billing). Use Azure Functions to auto-reject low-confidence extractions.

Mistake: Deploying AI Vision models to edge without IoT Edge.
Correction: For offline/low-latency scenarios, use IoT Edge to deploy the model. The cloud API is for online-only processing.

Mistake: Mixing up Azure AI Vision and Custom Vision.
Correction:
  • AI Vision: Pre-built models (no training needed).
  • Custom Vision: Train your own models (e.g., detect custom objects like "defective widgets").

Certification Exam Insights

  1. Service Selection Traps
    • OCR: Know when to use Read API (general text) vs. Form Recognizer (structured documents).
    • Video Analytics: Spatial Analysis (people/object tracking) vs. Video Analyzer (custom models + edge deployment).
    • Search: Azure Cognitive Search (index OCR output) vs. AI Vision (extract text).

  2. Key Constraints
    • Read API:
      • File size limit depends on the pricing tier: 4 MB on the free tier, up to 500 MB on the paid tier.
      • Each request takes one image or document (multi-page PDF/TIFF allowed), so a large batch means one asynchronous request per file.
    • Spatial Analysis:
      • Requires RTSP (not HTTP video streams).
      • IoT Edge is mandatory for real-time processing.

  3. Tricky Scenarios
    • "Which service for handwritten notes in a PDF?" → Read API (the simplest fit for general handwritten text; Form Recognizer is aimed at structured fields).
    • "Which service for real-time queue monitoring in a store?" → Spatial Analysis + IoT Edge.
    • "Which service to search OCR’d documents?" → Azure Cognitive Search.

  4. Cost Optimization
    • Batch processing is cheaper than real-time (pay per 1,000 transactions).
    • IoT Edge reduces cloud costs for video analytics (process locally, send only events to cloud).
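As a memory aid, the service selection rules in this section can be written as a small lookup. This is only a study sketch: the function name `pick_service` and the task labels are made up here, and the returned strings simply restate the guide's recommendations.

```python
def pick_service(task: str, structured: bool = False) -> str:
    """Map an exam-style scenario to the service this guide recommends.

    task is one of the illustrative labels: "ocr", "video", "search", "custom-model".
    structured only matters for OCR: True means forms, invoices, receipts, IDs.
    """
    if task == "ocr":
        # Structured documents go to Form Recognizer, free-form text to Read API.
        return "Azure Form Recognizer" if structured else "Azure AI Vision Read API"
    choices = {
        "video": "Spatial Analysis + IoT Edge",
        "search": "Azure Cognitive Search",
        "custom-model": "Azure Custom Vision",
    }
    if task not in choices:
        raise ValueError(f"unknown task: {task}")
    return choices[task]


if __name__ == "__main__":
    print(pick_service("ocr", structured=True))
    print(pick_service("video"))
```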

Quick Check Questions

  1. A retail chain wants to analyze customer movement in stores using existing security cameras. The solution must work offline during internet outages. Which Azure service should they use?
    • Answer: Azure Video Analyzer (AVA) + IoT Edge
    • Explanation: Spatial Analysis requires edge deployment for offline/real-time processing. IoT Edge runs the model locally on the camera’s network.

  2. A company needs to extract line items from 50,000 PDF invoices stored in Azure Blob Storage. Which service offers the highest accuracy for this task?
    • Answer: Azure Form Recognizer
    • Explanation: Form Recognizer is optimized for structured documents (invoices, receipts) and extracts key-value pairs (e.g., "Item: Laptop, Price: $999").

  3. A developer is building a mobile app that scans handwritten notes in real-time. Which Azure service should they call from the app?
    • Answer: Azure AI Vision (Read API)
    • Explanation: The Read API supports handwritten text and offers real-time REST endpoints for mobile apps.

Last-Minute Cram Sheet

  1. Azure AI Vision = Pre-built image/video APIs (OCR, object detection, facial analysis).
  2. Read API = Best for general OCR (printed/handwritten text, PDFs, images).
  3. Form Recognizer = Best for structured documents (invoices, receipts, IDs).
  4. Spatial Analysis = Detects people/objects in video (crowds, queues, social distancing).
  5. Spatial Analysis requires RTSP + IoT Edge for real-time processing.
  6. Read API file size limit: 4 MB on the free tier, up to 500 MB on the paid tier. The v3.2 Read operation is always asynchronous (submit, then poll).
  7. Confidence scores (0–1) filter low-quality OCR results. Always check!
  8. Azure Cognitive Search indexes OCR output for full-text search.
  9. IoT Edge deploys AI Vision models to edge devices (cameras, drones).
  10. Custom Vision = Train your own models; AI Vision = Use pre-built models. Don’t mix them up!