Study Guide: Cloud ML - Azure AI Engineer Associate (Exam AI-102): Azure Cognitive Search – Indexing, Skillsets, AI Enrichment, Knowledge Store
Source: https://www.fatskills.com/hesi/chapter/cloud-ml-cert-azure-ai-azure-cognitive-search-indexing-skillsets-ai-enrichment-knowledge-store

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Exam-Ready Study Guide for AI-102


What This Is

Azure Cognitive Search is a fully managed cloud search service that enables AI-powered information retrieval from structured and unstructured data (PDFs, images, databases, etc.). It’s critical in ML pipelines where semantic search, document processing, and knowledge extraction are needed, such as:

  • Enterprise document search (e.g., legal contracts, medical records, customer support tickets).
  • AI-enriched knowledge bases (e.g., extracting entities, key phrases, and relationships from invoices or research papers).
  • Hybrid search (combining keyword and vector search for RAG applications).
  • Knowledge mining (e.g., building a chatbot that answers questions from internal company documents).

Real-world scenario: A healthcare provider wants to extract patient diagnoses, medications, and lab results from unstructured clinical notes (PDFs, scanned forms) and make them searchable for doctors. They use Azure Cognitive Search with AI enrichment (OCR, entity recognition, key phrase extraction) to index the documents, then expose the results via a secure API for a custom EHR dashboard.


Key Terms & Services

  • Azure Cognitive Search (ACS): Microsoft’s managed search-as-a-service for full-text, vector, and hybrid search, renamed Azure AI Search in late 2023 (this guide keeps the older name). Best for AI-enriched document processing (unlike Azure AI Document Intelligence, which focuses on structured data extraction from forms).

  • Index: A searchable data structure (like a database table) that stores documents (JSON objects) with fields (e.g., title, content, entities). Supports filtering, sorting, and faceting.

  • Indexer: A crawler that automatically extracts data from a data source (Blob Storage, SQL DB, Cosmos DB) and populates an index. Can run on a schedule or be triggered manually.

  • Skillset: A pipeline of AI enrichments (e.g., OCR, entity recognition, translation) applied to unstructured data during indexing. Uses prebuilt skills (e.g., EntityRecognition, KeyPhraseExtraction) or custom skills (Azure Functions, ML models).

  • AI Enrichment: The process of applying AI models (via Cognitive Services or custom ML) to extract structured data from unstructured content (e.g., text, images). Example: Using Azure Form Recognizer to pull tables from PDFs.

  • Knowledge Store: Persistent storage in an Azure Storage account (Blob Storage and Table Storage) where enriched data (from skillsets) is projected (saved) for downstream analytics (e.g., Power BI, Synapse). Unlike an index, which is optimized for search, a knowledge store is for long-term storage and analysis.

  • Vector Search: Uses embeddings (from Azure OpenAI, Hugging Face, or custom models) to match on meaning rather than exact keywords (e.g., "Find documents about 'heart disease' even if they don’t contain the exact phrase"). Not the same feature as semantic search below.

  • Semantic Search (semantic ranker): A query-time feature in ACS that uses Microsoft-hosted deep learning models to rerank the top keyword or hybrid results and to return captions and answers. It is enabled on the search service itself and does not require Azure OpenAI or embeddings; only vector search needs embeddings.

  • Cognitive Services (Azure AI Services): Prebuilt AI models (e.g., Text Analytics, Computer Vision, Translator) used in skillsets for enrichment. Example: the Sentiment skill for classifying document tone.

  • Custom Skills: User-defined functions (Azure Functions, Logic Apps, or ML models) that extend skillsets (e.g., calling a custom NER model hosted in Azure ML).

  • Projection: The process of saving enriched data from a skillset into a knowledge store (e.g., tables, objects, or files). Example: Storing extracted entities in Azure Table Storage for analytics.

  • Data Source: The origin of data (e.g., Blob Storage, SQL DB, Cosmos DB) that an indexer crawls to populate an index.


Step-by-Step / Process Flow

How to Build an AI-Enriched Search Pipeline in Azure Cognitive Search

1. Set Up Data Source & Index

  • Create a storage account (Blob Storage, SQL DB, or Cosmos DB) and upload documents (PDFs, images, JSON).
  • Define an index in Azure Cognitive Search with fields (e.g., id, content, entities, language).
  • Example schema (JSON):

    {
      "name": "clinical-notes-index",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true },
        { "name": "content", "type": "Edm.String", "searchable": true },
        { "name": "entities", "type": "Collection(Edm.String)", "filterable": true },
        { "name": "language", "type": "Edm.String", "filterable": true }
      ]
    }
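The schema above can also be assembled in code before it is sent to the service. The following is a minimal sketch, not an SDK call: `build_index` is a hypothetical helper that only checks the one rule every index must satisfy (exactly one key field) and serializes the definition to JSON.

```python
import json

def build_index(name, fields):
    """Return an index definition dict; reject schemas without exactly one key field."""
    keys = [f["name"] for f in fields if f.get("key")]
    if len(keys) != 1:
        raise ValueError(f"index needs exactly one key field, found {keys}")
    return {"name": name, "fields": fields}

# Same fields as the guide's example schema.
clinical_notes_index = build_index(
    "clinical-notes-index",
    [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "entities", "type": "Collection(Edm.String)", "filterable": True},
        {"name": "language", "type": "Edm.String", "filterable": True},
    ],
)

# JSON body that would be sent when creating the index.
index_json = json.dumps(clinical_notes_index, indent=2)
```

Validating the key field locally is a design convenience: it surfaces a schema mistake before any request reaches the service.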

2. Create a Skillset (AI Enrichment Pipeline)

  • Define a skillset with prebuilt skills (e.g., OCR, entity recognition) or custom skills (Azure Functions).
  • Example skillset (extracts entities and key phrases, JSON):

    {
      "name": "clinical-notes-skillset",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
          "context": "/document",
          "inputs": [ { "name": "text", "source": "/document/content" } ],
          "outputs": [ { "name": "entities", "targetName": "entities" } ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
          "context": "/document",
          "inputs": [ { "name": "text", "source": "/document/content" } ],
          "outputs": [ { "name": "keyPhrases", "targetName": "keyPhrases" } ]
        }
      ]
    }
  • Optional: Add custom skills (e.g., call an Azure ML endpoint for custom NER).
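Because every text skill in the example follows the same shape (type, context, one input, one output), a skillset can be sketched with a small builder. The helper names `text_skill`, `build_skillset`, and `enrichment_paths` are hypothetical, the skill type strings are copied from the guide's example, and the 30-skill cap is the limit this guide quotes later.

```python
MAX_SKILLS = 30  # per-skillset limit quoted in this guide

def text_skill(odata_type, output_name, source="/document/content"):
    """Build one text skill that reads `source` and writes a single output."""
    return {
        "@odata.type": odata_type,
        "context": "/document",
        "inputs": [{"name": "text", "source": source}],
        "outputs": [{"name": output_name, "targetName": output_name}],
    }

def build_skillset(name, skills):
    """Assemble a skillset definition, enforcing the skill-count limit."""
    if len(skills) > MAX_SKILLS:
        raise ValueError(f"{len(skills)} skills exceeds the limit of {MAX_SKILLS}")
    return {"name": name, "skills": list(skills)}

def enrichment_paths(skillset):
    """List the enrichment-tree paths the skills write to."""
    return [
        f'{skill["context"]}/{out["targetName"]}'
        for skill in skillset["skills"]
        for out in skill["outputs"]
    ]

skillset = build_skillset(
    "clinical-notes-skillset",
    [
        text_skill("#Microsoft.Skills.Text.EntityRecognitionSkill", "entities"),
        text_skill("#Microsoft.Skills.Text.KeyPhraseExtractionSkill", "keyPhrases"),
    ],
)
paths = enrichment_paths(skillset)
```

The paths returned by `enrichment_paths` are the ones an indexer's output field mappings or a knowledge store projection would reference.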

3. Configure Indexer & Knowledge Store

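  • Create an indexer that:
    • Connects to the data source (e.g., Blob Storage).
    • Applies the skillset (AI enrichment).
    • Maps enriched outputs to index fields (outputFieldMappings).
  • Define the knowledge store projections (Blob/Table Storage) in the skillset, not the indexer: the knowledgeStore property is part of the skillset definition, and it is populated when the indexer runs.
  • Example indexer config (JSON):

    {
      "name": "clinical-notes-indexer",
      "dataSourceName": "clinical-notes-blob",
      "targetIndexName": "clinical-notes-index",
      "skillsetName": "clinical-notes-skillset",
      "outputFieldMappings": [
        { "sourceFieldName": "/document/entities", "targetFieldName": "entities" }
      ]
    }

  • Example knowledgeStore property, added to the skillset definition (JSON). Each table projection needs a source path into the enrichment tree; the path here assumes the entities output from step 2:

    "knowledgeStore": {
      "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=...",
      "projections": [
        {
          "tables": [
            { "tableName": "entities", "generatedKeyName": "entityId", "source": "/document/entities/*" }
          ],
          "objects": [],
          "files": []
        }
      ]
    }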
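An indexer definition is mostly a set of name references plus optional scheduling and field mappings, so it can be sketched as a plain dictionary. `build_indexer` is a hypothetical helper; the two-hour interval (an ISO 8601 duration) and the mapping from `/document/entities` to the `entities` field are assumptions chosen to match the earlier examples.

```python
def build_indexer(name, data_source, index, skillset, interval="PT2H", mappings=None):
    """Build an indexer definition that ties a data source, skillset, and index together.

    `mappings` maps enrichment-tree paths to index field names.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "schedule": {"interval": interval},  # ISO 8601 duration
        "outputFieldMappings": [
            {"sourceFieldName": src, "targetFieldName": dst}
            for src, dst in (mappings or {}).items()
        ],
    }

indexer = build_indexer(
    "clinical-notes-indexer",
    "clinical-notes-blob",
    "clinical-notes-index",
    "clinical-notes-skillset",
    mappings={"/document/entities": "entities"},
)
```

Without an output field mapping, enriched values stay in the enrichment tree and never reach the index, which is a common reason an `entities` field comes back empty.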

4. Run & Monitor the Indexer

  • Trigger the indexer (manually or on a schedule).
  • Monitor progress in the Azure Portal (check for errors in indexer execution history).
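  • Query the index using the Search Explorer or REST API. The api-version parameter is required, and the request must carry an api-key header or a bearer token:

    GET https://[service-name].search.windows.net/indexes/[index-name]/docs?api-version=2023-11-01&search=heart%20disease&$select=id,content,entities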
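The query URL can be built with the standard library so that the search text is encoded correctly. This is a sketch only: `build_search_url` is a hypothetical helper, `my-service` is a placeholder service name, and the `2023-11-01` API version is assumed from the vector query example later in this guide.

```python
from urllib.parse import quote, urlencode

def build_search_url(service, index, search, select=None, api_version="2023-11-01"):
    """Return the GET URL for a simple keyword query against an index."""
    params = {"api-version": api_version, "search": search}
    if select:
        params["$select"] = ",".join(select)
    # quote (not quote_plus) encodes spaces as %20; "$" and "," are left readable.
    query = urlencode(params, quote_via=quote, safe="$,")
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"

url = build_search_url(
    "my-service",
    "clinical-notes-index",
    "heart disease",
    select=["id", "content", "entities"],
)
```

No request is sent here; the URL would still need an authentication header before the service accepts it.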

5. Enable Vector Search (Optional)

  • Generate embeddings (using Azure OpenAI or a custom model).
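  • Add a vector field to the index (JSON, api-version 2023-11-01):

    {
      "name": "embedding",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 1536,
      "vectorSearchProfile": "vector-profile"
    }

  • Configure vector search in the index definition. In api-version 2023-11-01 the field references a profile, which in turn names an algorithm configuration (the preview-era property names vectorSearchConfiguration and algorithmConfigurations do not apply to this version):

    "vectorSearch": {
      "algorithms": [
        {
          "name": "vector-config",
          "kind": "hnsw",
          "hnswParameters": { "m": 4, "efConstruction": 400, "efSearch": 500, "metric": "cosine" }
        }
      ],
      "profiles": [
        { "name": "vector-profile", "algorithm": "vector-config" }
      ]
    }

  • Query using vectors (e.g., for RAG applications). The query vector must have the same number of dimensions as the field (1536 here):

    POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2023-11-01

    {
      "vectorQueries": [
        { "kind": "vector", "vector": [0.1, 0.2, ...], "fields": "embedding", "k": 5 }
      ]
    }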
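A dimension mismatch between the query vector and the index field is an easy mistake to make, so it can be worth checking before the request is sent. The sketch below assumes the `vectorQueries` request shape of api-version 2023-11-01; `build_vector_query` is a hypothetical helper and the zero vector stands in for a real embedding.

```python
def build_vector_query(embedding, field="embedding", k=5, expected_dims=1536):
    """Build a vector search request body, rejecting vectors of the wrong length."""
    if len(embedding) != expected_dims:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, field expects {expected_dims}"
        )
    return {
        "vectorQueries": [
            {"kind": "vector", "vector": list(embedding), "fields": field, "k": k}
        ]
    }

# Placeholder vector; a real one would come from an embedding model.
body = build_vector_query([0.0] * 1536)
```

In a RAG application the same embedding model must produce both the indexed vectors and the query vector, otherwise the similarity scores are not meaningful.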

6. Expose Search via API & Frontend

  • Secure the search service (Azure AD, API keys, or private endpoints).
  • Build a frontend (e.g., React app, Power Apps) that queries the search API.
  • Optimize relevance with:
    • Scoring profiles (boost certain fields).
    • Semantic search (the semantic ranker reranks the top results; it does not require Azure OpenAI).
    • Synonym maps (e.g., "heart attack" = "myocardial infarction").
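Of the relevance options above, synonym maps are the simplest to show in code: a synonym map is a named resource whose rules are written in Solr format, one rule per line, with equivalent terms separated by commas. `build_synonym_map` and the map name `medical-synonyms` are hypothetical names used for illustration.

```python
def build_synonym_map(name, groups):
    """Build a synonym map definition from groups of equivalent terms."""
    rules = [", ".join(terms) for terms in groups]
    return {"name": name, "format": "solr", "synonyms": "\n".join(rules)}

synonym_map = build_synonym_map(
    "medical-synonyms",
    [
        ["heart attack", "myocardial infarction"],
        ["high blood pressure", "hypertension"],
    ],
)
```

The map only takes effect once a searchable field in the index references it by name.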

Common Mistakes

Mistake 1: Confusing Azure Cognitive Search with Azure AI Document Intelligence

  • Mistake: Using Azure AI Document Intelligence (formerly Form Recognizer) for full-text search or semantic search.
  • Correction:
  • Azure AI Document Intelligence is for structured data extraction (e.g., invoices, receipts, forms).
  • Azure Cognitive Search is for searching unstructured content (e.g., PDFs, emails, images) with AI enrichment.
  • Rule of thumb:
    • Need tables, key-value pairs, or form fields? → Document Intelligence.
    • Need full-text search, semantic search, or knowledge mining? → Cognitive Search.

Mistake 2: Not Using a Knowledge Store for Downstream Analytics

  • Mistake: Only storing enriched data in the search index and not in a knowledge store.
  • Correction:
  • Search indexes are optimized for search, not analytics.
  • Knowledge stores (Blob/Table Storage) allow Power BI, Synapse, or custom apps to analyze enriched data.
  • Example: Store extracted entities in Azure Table Storage for a Power BI dashboard.

Mistake 3: Overlooking Indexer Throttling & Performance

  • Mistake: Running a large indexer job without partitioning or scheduling.
  • Correction:
  • Indexers have limits (e.g., 10,000 documents per batch, 120 minutes runtime).
  • Solutions:
    • Split large datasets into smaller batches.
    • Use incremental indexing (track lastModified timestamps).
    • Schedule indexers during off-peak hours.
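The first two solutions can be illustrated with a small sketch for the case where documents are pushed to the index by your own code. Built-in indexers handle change tracking for supported sources themselves; this only shows the idea. The document shape, the `lastModified` field, and the helper names are assumptions.

```python
from datetime import datetime, timezone

def changed_since(docs, high_water_mark):
    """Keep only documents modified after the last successful run."""
    return [d for d in docs if d["lastModified"] > high_water_mark]

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

docs = [
    {"id": "1", "lastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "2", "lastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"id": "3", "lastModified": datetime(2024, 4, 1, tzinfo=timezone.utc)},
]
last_run = datetime(2024, 2, 1, tzinfo=timezone.utc)

pending = changed_since(docs, last_run)          # documents 2 and 3
upload_batches = list(batches(pending, size=1))  # two batches of one document
```

After a successful run, the high-water mark would be advanced to the latest `lastModified` value that was processed.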

Mistake 4: Using Vector Search Without Proper Embeddings

  • Mistake: Enabling vector search but not generating high-quality embeddings.
  • Correction:
  • Vector search requires embeddings (from Azure OpenAI, Hugging Face, or custom models).
  • Bad embeddings = bad search results.
  • Best practices:
    • Use Azure OpenAI’s text-embedding-ada-002 for general-purpose embeddings.
    • Fine-tune embeddings for domain-specific data (e.g., medical, legal).
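Why embedding quality matters becomes clearer once the ranking step is written out: vector search orders documents by a similarity metric (cosine, in this guide's configuration), so documents are only found if the model places related texts close together. The two-dimensional vectors below are invented toy values, not real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, docs, k):
    """Return the k document names most similar to the query vector."""
    ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
    return ranked[:k]

# Toy vectors: the first axis loosely means "cardiology", the second "finance".
docs = {
    "heart disease": [0.9, 0.1],
    "cardiac arrest": [0.8, 0.3],
    "tax law": [0.0, 1.0],
}
results = top_k([1.0, 0.0], docs, k=2)
```

With these values the two cardiology documents rank first. A model that placed "tax law" near the query would return it instead, and no index setting could correct that.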

Mistake 5: Ignoring Security & Compliance

  • Mistake: Exposing search APIs publicly without authentication or encryption.
  • Correction:
  • Secure the search service with:
    • Azure AD authentication (for enterprise apps).
    • API keys (for testing, but rotate them).
    • Private endpoints (for VNet isolation).
  • Encrypt data at rest (Azure Storage encryption) and in transit (HTTPS).
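The two request-level authentication options differ only in the header they send: key-based access uses an `api-key` header, while Azure AD access uses a bearer token in the `Authorization` header. A minimal sketch, with `build_auth_headers` as a hypothetical helper and placeholder credential values:

```python
def build_auth_headers(api_key=None, bearer_token=None):
    """Return request headers for exactly one authentication method."""
    if (api_key is None) == (bearer_token is None):
        raise ValueError("provide exactly one of api_key or bearer_token")
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        headers["api-key"] = api_key  # key-based access
    else:
        headers["Authorization"] = f"Bearer {bearer_token}"  # token-based access
    return headers

key_headers = build_auth_headers(api_key="<query-key>")
token_headers = build_auth_headers(bearer_token="<access-token>")
```

Keeping the two methods mutually exclusive in code makes it harder to ship an admin key to a client by accident alongside token-based access.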

Certification Exam Insights

1. "Which Service?" Traps

  • Azure Cognitive Search vs. Azure AI Document Intelligence:
  • Exam trick: A question asks for searching unstructured documents (e.g., research papers) but lists Document Intelligence as an option.
  • Correct answer: Azure Cognitive Search (Document Intelligence is for forms, not search).

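  • Azure Cognitive Search vs. Azure AI Search (Semantic Search):

  • Azure AI Search is the current name of Azure Cognitive Search (renamed in late 2023). They are the same service, so either name may appear in exam material.
  • Semantic search (the semantic ranker) is a feature of that service, not a separate service.
  • Exam trap: A question asks which service provides semantic search. The answer is the search service itself (Azure Cognitive Search / Azure AI Search), not Azure OpenAI or a separate product.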

  • Knowledge Store vs. Index:

  • Index = Optimized for search (fast queries, but not for analytics).
  • Knowledge Store = Optimized for storage & analytics (Power BI, Synapse).
  • Exam trick: A question asks where to store enriched data for a Power BI dashboard—the answer is Knowledge Store, not the index.

2. Key Constraints & Limits

  • Indexer limits:
  • Max 10,000 documents per batch.
  • Max 120 minutes runtime per indexer run.
  • Solution: Use incremental indexing or split data into smaller batches.

  • Vector search limits (as of 2024):

  • Max 1 million vectors per index.
  • Max 2,000 dimensions per vector.
  • Exam trap: A question asks about scaling vector search—know that partitioning is required for large datasets.

  • Skillset limits:

  • Max 30 skills per skillset.
  • Prebuilt skills (e.g., OCR, entity recognition) have rate limits (e.g., 1,000 calls/minute).
  • Solution: Use custom skills (Azure Functions) for high-volume processing.

3. AI Enrichment & Skillset Questions

  • Common exam scenario:
  • "A company wants to extract entities and key phrases from PDFs and store them in a database for analytics. Which Azure services should they use?"
  • Answer:

    1. Azure Cognitive Search (for indexing & enrichment).
    2. Azure Blob Storage (data source).
    3. Azure Table Storage (knowledge store for analytics).
    4. Cognitive Services (prebuilt skills for entity/key phrase extraction).
  • Tricky question:

  • "A healthcare app needs to search for patient records using natural language queries (e.g., 'Show me patients with diabetes'). Which feature should they enable?"
  • Answer: Semantic search (the semantic ranker; it works on text queries and does not require Azure OpenAI embeddings).

4. Cost Optimization Questions

  • Exam trap: A question asks about reducing costs for a large-scale search solution.
  • Correct answers:
  • Use incremental indexing (avoid re-processing unchanged data).
  • Partition indexes (avoid hitting limits).
  • Use free-tier Cognitive Services (for low-volume enrichment).
  • Avoid over-provisioning (start with S1 tier, scale up if needed).

Quick Check Questions

Question 1

A legal firm wants to extract key clauses, dates, and parties from 50,000 contracts stored in Blob Storage and make them searchable via a web app. They also need to analyze extracted data in Power BI. Which Azure services should they use?

Answer:
  • Correct: Azure Cognitive Search (indexing + AI enrichment) + Azure Blob Storage (data source) + Azure Table Storage (knowledge store for Power BI) + Cognitive Services (prebuilt skills for entity extraction).
  • Incorrect: Azure AI Document Intelligence (this is for forms, not full-text search).

Question 2

A retail company wants to implement semantic search for product recommendations (e.g., "Find me a red dress under $50"). They already use Azure OpenAI for embeddings. What’s the minimum Azure service they need?

Answer:
  • Correct: Azure Cognitive Search (supports vector search with Azure OpenAI embeddings).
  • Incorrect: A separate "semantic search" service. None exists: Azure AI Search is the newer name for Azure Cognitive Search, and semantic search is one of its features.

Question 3

A data engineer notices that their indexer fails after processing 10,000 documents. What’s the most likely cause, and how should they fix it?

Answer:
  • Cause: Indexer batch limit (max 10,000 documents per run).
  • Fix: Split data into smaller batches or use incremental indexing (track lastModified timestamps).


Last-Minute Cram Sheet

  1. Azure Cognitive Search = AI-powered search + enrichment (not for forms—use Document Intelligence for that).
  2. Index = search-optimized, Knowledge Store = analytics-optimized (Blob/Table Storage).
  3. Skillset = AI enrichment pipeline (prebuilt skills + custom skills).
  4. Vector search requires embeddings (Azure OpenAI, Hugging Face, or custom models).
  5. Semantic search = deep learning-powered reranking of results (semantic ranker; does not require Azure OpenAI).
  6. Indexer limits: 10K docs/batch, 120 min runtime → use incremental indexing.
  7. Knowledge Store projections: Tables (rows in Table Storage, for analytics), Objects (JSON documents in Blob Storage), Files (binary image files in Blob Storage).
  8. Security: Azure AD, API keys, private endpoints (never expose publicly!).
  9. Trap: Document Intelligence ≠ Search (it’s for forms, not full-text search).
  10. Trap: Semantic search is a feature of Cognitive Search, not a standalone service.