Study Guide: Cloud ML - Google Cloud Professional Machine Learning Engineer: Generative AI and LLMs (Vertex AI Model Garden, PaLM, Gemini, Prompt Design, Fine-Tuning, RAG)
Source: https://www.fatskills.com/machine-learning-101/chapter/cloud-ml-cert-gcp-ml-generative-ai-and-llms-vertex-ai-model-garden-palm-gemini-prompt-design-finetuning-rag

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

Google Cloud Professional Machine Learning Engineer – Study Guide: Generative AI & LLMs

(Vertex AI Model Garden, PaLM, Gemini, Prompt Design, Fine-Tuning, RAG)


What This Is

Generative AI and large language models (LLMs) are transforming how businesses automate content creation, customer support, and decision-making. In Google Cloud, Vertex AI Model Garden provides a curated catalog of foundation models (like PaLM 2 and Gemini), while Vertex AI offers tools for prompt design, fine-tuning, and retrieval-augmented generation (RAG). A real-world scenario: A retail company uses Gemini to generate product descriptions, fine-tunes PaLM 2 on internal support logs for a chatbot, and implements RAG with Vertex AI Vector Search to answer customer queries using proprietary documentation—all while ensuring low latency and cost efficiency.


Key Terms & Services

  • Vertex AI Model Garden: GCP’s marketplace for pre-trained foundation models (e.g., PaLM 2, Gemini, Imagen, Codey). Lets you deploy, fine-tune, or use models via API without managing infrastructure.
  • PaLM 2: Google’s text-based LLM (successor to PaLM) optimized for reasoning, multilingual tasks, and code generation. Available in several variants (e.g., text-bison for text generation, chat-bison for multi-turn chat, text-unicorn for more complex tasks).
  • Gemini: Google’s multimodal LLM (text + images + audio) designed for enterprise use cases (e.g., document analysis, video summarization). Supports 1M+ token context windows (Gemini 1.5 Pro).
  • Prompt Design: Crafting input text to guide an LLM’s output (e.g., zero-shot, few-shot, chain-of-thought). Critical for reducing hallucinations and improving accuracy.
  • Fine-Tuning (Vertex AI): Adapting a foundation model to a specific task (e.g., customer support, legal document analysis) using custom datasets. Reduces inference costs vs. few-shot prompting.
  • Retrieval-Augmented Generation (RAG): Combines LLM generation with external knowledge retrieval (e.g., from a vector database) to improve factual accuracy. Uses Vertex AI Vector Search for low-latency lookups.
  • Vertex AI Vector Search: GCP’s managed vector database for semantic search (e.g., finding similar documents, products, or images). Powers RAG and recommendation systems.
  • Vertex AI Studio: Web UI for prompt experimentation, model evaluation, and deployment without writing code. Supports A/B testing and safety filters.
  • Grounding (in RAG): Ensuring LLM responses are factually accurate by retrieving relevant documents before generation. Reduces hallucinations in enterprise use cases.
  • Safety Attributes (Vertex AI): Configurable filters (e.g., toxicity, bias, harmful content) applied to LLM outputs. Adjustable per use case (e.g., stricter for customer-facing apps).
  • Token Limits: LLMs process text in tokens (≈4 characters each in English). PaLM 2 supports 32K tokens, Gemini 1.5 Pro supports 1M+ tokens. Exceeding the limit truncates the input or causes the request to fail.
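  • Cost Model: GCP charges for input and output volume (e.g., $0.0005/1K tokens for PaLM 2; treat this figure as illustrative and check current pricing, since some models are billed per 1,000 characters rather than per token) and for fine-tuning compute (per-hour accelerator costs). RAG adds vector search costs.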
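
The three prompt styles named under Prompt Design can be sketched as plain string templates. This is a minimal illustration, not a Vertex AI API: the function names and the exact wording of each template are assumptions made for this example.

```python
def zero_shot(task, text):
    """Zero-shot: an instruction and the input, with no worked examples."""
    return f"{task}\n\nInput: {text}\nOutput:"


def few_shot(task, examples, text):
    """Few-shot: the instruction, then (input, output) pairs, then the real input."""
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"


def chain_of_thought(question):
    """Chain-of-thought: ask the model to reason step by step before answering."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, "
        "then give the final answer on its own line."
    )


# Hypothetical support-ticket classification prompt built from two examples.
prompt = few_shot(
    "Classify the support ticket as 'billing' or 'technical'.",
    [("I was charged twice.", "billing"), ("The app crashes on login.", "technical")],
    "My invoice shows the wrong amount.",
)
print(prompt)
```

Note that the few-shot prompt grows with every example added, which is why it costs more input tokens per request than the zero-shot form.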

Step-by-Step / Process Flow

1. Deploying a Foundation Model (e.g., PaLM 2 for Chat)

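  1. Select Model: In Vertex AI Model Garden, choose text-bison@002 (PaLM 2 for text), chat-bison (PaLM 2 for multi-turn chat), or gemini-1.5-pro (multimodal).
  2. Deploy to Endpoint:
     • Navigate to Vertex AI > Endpoints.
     • Click Create Endpoint, select the model, and configure the machine type (e.g., n1-standard-4 for low traffic, a2-highgpu-1g for high throughput).
     • Note: Google's own models such as text-bison and Gemini are also served through managed publisher endpoints and can be called without creating an endpoint; this step mainly applies to models you deploy yourself, such as open models from Model Garden.
  3. Test with Vertex AI Studio:
     • Use the Playground to experiment with prompts (e.g., "Summarize this support ticket: [text]").
     • Adjust temperature (creativity) and top-k/top-p (randomness).
  4. Integrate via API (Python):

       from google.cloud import aiplatform

       # Reference the deployed endpoint by its full resource name.
       endpoint = aiplatform.Endpoint("projects/PROJECT/locations/us-central1/endpoints/ENDPOINT_ID")
       # Send one instance; the expected instance schema depends on the deployed model.
       response = endpoint.predict(instances=[{"prompt": "Explain RAG in simple terms."}])
       print(response.predictions)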
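
To make the temperature and top-k/top-p settings concrete, here is a small pure-Python sketch of what they do to a next-token distribution. It is an illustration under simplifying assumptions: the logits are made up, the function names are hypothetical, and the exact order in which a hosted model applies these filters may differ.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p(probs, k, p):
    """Keep the k most likely tokens, then the shortest prefix whose mass reaches p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]
    kept, mass = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        mass += prob
        if mass >= p:
            break
    # Renormalize so the surviving candidates sum to 1.
    return {index: prob / mass for index, prob in kept}


logits = [2.0, 1.0, 0.5, 0.1]            # made-up scores for four candidate tokens
cold = apply_temperature(logits, 0.2)    # near-deterministic
hot = apply_temperature(logits, 1.5)     # flatter, more varied sampling
candidates = top_k_top_p(hot, k=3, p=0.8)
print(cold, hot, candidates)
```

Lower temperature concentrates probability on the top token, while smaller k or p shrinks the pool of tokens the sampler may choose from.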

2. Fine-Tuning PaLM 2 for a Custom Task

  1. Prepare Dataset:
     • Format as JSONL (one example per line):

       {"input_text": "How do I reset my password?", "output_text": "Go to settings > account > reset password."}

     • Upload to Google Cloud Storage (GCS).
  2. Start Fine-Tuning Job:
     • In Vertex AI > Training, select Custom Training.
     • Choose PaLM 2 as the base model and specify the GCS dataset path.
     • Set hyperparameters (e.g., learning_rate=0.0001, epochs=3).
  3. Evaluate & Deploy:
     • Monitor training in Vertex AI > Model Registry.
     • Deploy the fine-tuned model to an endpoint (same as Section 1).
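
Before uploading a tuning dataset, it can help to check locally that every line is valid JSON with both fields present. The sketch below assumes the input_text/output_text schema shown in the dataset step; the helper names are hypothetical and nothing here calls a Google Cloud API.

```python
import json

REQUIRED_KEYS = {"input_text", "output_text"}


def to_jsonl(examples):
    """Serialize a list of dicts to JSONL: one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"


def validate_jsonl(text):
    """Return (line_number, problem) pairs for malformed or incomplete rows."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - set(row)
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
        elif not all(isinstance(row[k], str) and row[k].strip() for k in REQUIRED_KEYS):
            problems.append((number, "empty or non-string field"))
    return problems


examples = [
    {
        "input_text": "How do I reset my password?",
        "output_text": "Go to settings > account > reset password.",
    },
]
dataset = to_jsonl(examples)
print(validate_jsonl(dataset))
```

An empty result means the file is structurally sound; it says nothing about whether the examples are good training data.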

3. Building a RAG System with Vertex AI Vector Search

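  1. Chunk & Embed Documents:
     • Use the Vertex AI Text Embeddings API to convert documents into vectors (Python):

       from google.cloud import aiplatform
       from google.protobuf import json_format
       from google.protobuf.struct_pb2 import Value

       # Publisher models are called through the regional API endpoint.
       client = aiplatform.gapic.PredictionServiceClient(
           client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
       )
       # The low-level client expects each instance as a protobuf Value.
       instance = json_format.ParseDict({"content": "Your document text here"}, Value())
       response = client.predict(
           endpoint="projects/PROJECT/locations/us-central1/publishers/google/models/textembedding-gecko",
           instances=[instance],
       )

     • Store vectors in Vertex AI Vector Search (create an index).
  2. Retrieve Relevant Context:
     • For a user query, generate an embedding and search the index. Here get_embedding is a placeholder helper wrapping the embedding call above, and index_endpoint is an aiplatform.MatchingEngineIndexEndpoint with the index deployed:

       query_embedding = get_embedding("How do I return a product?")
       results = index_endpoint.find_neighbors(
           deployed_index_id="DEPLOYED_INDEX_ID",
           queries=[query_embedding],
           num_neighbors=3,
       )

  3. Generate Response with Grounding:
     • Pass retrieved documents + query to PaLM 2/Gemini:

       prompt = f"Answer the question using these documents: {retrieved_docs}\nQuestion: {query}"
       response = endpoint.predict(instances=[{"prompt": prompt}])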
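
The chunk, retrieve, and ground flow above can be tried end to end without any cloud resources. The sketch below is a toy stand-in, not the Vertex AI implementation: the bag-of-words embed function replaces a real embeddings model, the brute-force cosine ranking replaces Vertex AI Vector Search, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: word counts. A real system would call an embeddings model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=3):
    """Brute-force nearest neighbors; a vector database does this approximately at scale."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


def grounded_prompt(query, docs):
    """Build a prompt that restricts the model to the retrieved documents."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer the question using only these documents. "
        "If they do not contain the answer, say so.\n"
        f"Documents:\n{context}\nQuestion: {query}"
    )


knowledge_base = [
    "To return a product, open Orders and select Start a return within 30 days.",
    "Shipping is free for orders over 50 dollars.",
    "Reset your password under Settings, Account, Reset password.",
]
question = "How do I return a product?"
print(grounded_prompt(question, retrieve(question, knowledge_base, k=1)))
```

Swapping embed for a real embeddings call and retrieve for a vector index query keeps the same overall shape while moving the heavy lifting to managed services.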

Common Mistakes

Mistake | Correction
Using few-shot prompting for high-volume tasks | Fine-tune instead. Few-shot prompts repeat the examples in every request, so they can cost 10–100x more per query.
Ignoring token limits | Truncate or chunk inputs, or use Gemini 1.5 Pro for long documents. PaLM 2’s 32K limit is easy to exceed.
Deploying LLMs without safety filters | Enable Vertex AI’s safety attributes (e.g., block harmful content) to avoid compliance risks.
Storing embeddings in BigQuery instead of Vector Search | BigQuery is not optimized for low-latency vector lookups. Use Vertex AI Vector Search for RAG.
Fine-tuning with small datasets (<1K examples) | Fine-tuning requires thousands of examples to outperform few-shot. Use prompt engineering for small datasets.
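
The cost argument behind the first mistake is simple arithmetic, sketched below. The token estimate uses this guide's rule of thumb of about 4 characters per token, the price is the illustrative $0.0005 per 1K tokens figure from the Cost Model entry, and the prompts, query volume, and function names are hypothetical. Fine-tuning and serving costs for the tuned model are deliberately left out.

```python
def estimate_tokens(text):
    """Rough heuristic: about 4 characters per token in English."""
    return max(1, round(len(text) / 4))


def monthly_input_cost(prompt, queries_per_month, price_per_1k_tokens):
    """Input-token cost of sending the same-sized prompt on every query."""
    return estimate_tokens(prompt) * queries_per_month * price_per_1k_tokens / 1000


PRICE_PER_1K = 0.0005        # illustrative USD figure, not current pricing
QUERIES = 1_000_000          # hypothetical monthly volume

question = "Customer: How do I reset my password?\nAgent:"
# A few-shot prompt carries 20 worked examples on every single request.
example = "Customer: Where is my order?\nAgent: Check Orders > Track shipment.\n\n"
few_shot_prompt = example * 20 + question
# A fine-tuned model has learned the pattern, so only the question is sent.
tuned_prompt = question

few_shot_cost = monthly_input_cost(few_shot_prompt, QUERIES, PRICE_PER_1K)
tuned_cost = monthly_input_cost(tuned_prompt, QUERIES, PRICE_PER_1K)
print(f"few-shot: ${few_shot_cost:.2f}  tuned: ${tuned_cost:.2f}  "
      f"ratio: {few_shot_cost / tuned_cost:.1f}x")
```

The ratio depends almost entirely on how many examples the few-shot prompt carries, so the break-even point against the one-time tuning cost shifts with query volume.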

Certification Exam Insights

  1. Service Selection Traps:
     • Vertex AI Model Garden vs. Custom Training: Use Model Garden for pre-trained models (e.g., PaLM 2, Gemini). Use Custom Training only for custom architectures (e.g., PyTorch/TensorFlow models).
     • Vertex AI Vector Search vs. BigQuery ML: Vector Search is for semantic search (RAG), while BigQuery ML is for structured data (e.g., SQL-based predictions).
     • Gemini vs. PaLM 2: Gemini is multimodal (text + images), while PaLM 2 is text-only. Choose based on input type.

  2. Key Constraints:
     • Fine-tuning costs: GCP charges per-hour accelerator costs (rates vary by accelerator type and region, so check current pricing). Fine-tuning a model can cost $100–$1,000+.
     • Latency: RAG adds ~100–300ms for vector search. Optimize with index sharding or approximate nearest neighbor (ANN).
     • Data Privacy: Fine-tuning datasets must be stored in GCS (not local files). Use VPC Service Controls (VPC-SC) for sensitive data.

  3. Tricky Scenarios:
     • "Which service for low-latency RAG?" → Vertex AI Vector Search (not BigQuery or Cloud SQL).
     • "How to reduce LLM hallucinations?" → RAG + grounding (not just prompt engineering).
     • "Best model for code generation?" → Codey (PaLM 2-based) or Gemini (if multimodal).

Quick Check Questions

  1. A healthcare company needs to analyze patient records (text + images) to generate summaries. Which GCP service should they use?
     • Answer: Gemini (multimodal, supports text + images).
     • Why: PaLM 2 is text-only, while Gemini handles both modalities.

  2. A startup wants to build a chatbot for customer support but has only 500 labeled examples. Should they fine-tune PaLM 2 or use few-shot prompting?
     • Answer: Few-shot prompting.
     • Why: Fine-tuning requires thousands of examples to outperform few-shot.

  3. A retail company wants to implement RAG for product recommendations. Which GCP service should they use for vector search?
     • Answer: Vertex AI Vector Search.
     • Why: Optimized for low-latency similarity search (BigQuery is too slow for RAG).

Last-Minute Cram Sheet

  1. Vertex AI Model Garden = GCP’s marketplace for PaLM 2, Gemini, Imagen, Codey.
  2. PaLM 2 = Text-only LLM (use text-bison for text generation, chat-bison for chat, text-unicorn for complex tasks).
  3. Gemini = Multimodal LLM (text + images + audio), supports 1M+ token context.
  4. Fine-tuning requires thousands of examples (use few-shot for small datasets).
  5. RAG = LLM + Vertex AI Vector Search (not BigQuery).
  6. Token limits: PaLM 2 = 32K, Gemini 1.5 Pro = 1M+.
  7. Cost model: Pay per input/output tokens + fine-tuning GPU hours.
  8. Safety filters = Enable in Vertex AI Studio to block harmful content.
  9. Fine-tuning ≠ prompt engineering: fine-tuning is for custom tasks, prompt engineering is for quick adjustments.
  10. Vertex AI Vector Search ≠ BigQuery: Vector Search is for semantic search, BigQuery is for structured data.