Fatskills
Practice. Master. Repeat.
Study Guide: Cloud ML - AWS Certified Machine Learning Engineer – Associate (MLA-C01): Data Ingestion (Kinesis, S3, Glue, DMS, DataSync)
Source: https://www.fatskills.com/hesi/chapter/cloud-ml-cert-aws-ml-data-ingestion-kinesis-s3-glue-dms-datasync


By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.




What This Is

Data ingestion is the first step in any ML pipeline: getting raw data from sources (databases, IoT devices, logs, APIs) into a format and location where it can be cleaned, transformed, and used for training or inference.

Real-world scenario: A fintech company needs to detect fraud in real time as transactions stream in from mobile apps. They use Amazon Kinesis Data Streams to ingest transactions, Kinesis Data Firehose to batch and store them in S3, and AWS Glue to catalog and transform the data for downstream ML models (e.g., SageMaker fraud detection). Without reliable ingestion, the ML model is starved of data or, worse, trains on stale or inconsistent data.


Key Terms & Services

  • Amazon Kinesis Data Streams: AWS’s scalable, real-time data streaming service. Best for high-throughput, low-latency ingestion (e.g., IoT telemetry, clickstreams). Producers (apps, devices) push records to shards; consumers (e.g., Lambda functions or applications on EC2) process them in real time and can call a SageMaker endpoint for inference.

  • Amazon Kinesis Data Firehose (renamed Amazon Data Firehose in 2024): Fully managed service to load streaming data into S3, Redshift, OpenSearch, or HTTP endpoints. Automatically batches and compresses data, and can transform it with a Lambda function, before delivery. Ideal for near-real-time analytics (e.g., log aggregation) but not for sub-second latency (use Kinesis Data Streams instead).

  • Amazon S3 (Simple Storage Service): AWS’s object storage for raw and processed data. ML context: Primary storage for training datasets, model artifacts, and batch inference outputs. Supports versioning, lifecycle policies (e.g., move old data to Glacier), and event triggers (e.g., invoke Lambda on new uploads).

  • AWS Glue: Serverless ETL (Extract, Transform, Load) service. Glue Crawlers scan data in S3/RDS and populate the Glue Data Catalog (schema + metadata). Glue Jobs (Python/Scala scripts) transform data for ML (e.g., normalize features, handle missing values). Used mostly for batch processing, although Glue streaming ETL jobs can also read from Kinesis or Kafka.

  • AWS Database Migration Service (DMS): Migrates data from on-premises databases (Oracle, SQL Server) or other clouds to AWS (RDS, Redshift, S3). Supports homogeneous (same DB engine) and heterogeneous (different engines) migrations. ML use case: Move historical transaction data from an on-prem SQL Server to S3 for training a churn prediction model.

  • AWS DataSync: Automates and accelerates data transfers between on-premises storage and AWS (S3, EFS, FSx). Uses a DataSync agent (deployed as a virtual machine near the source storage) to sync files over the internet or Direct Connect. ML use case: Move large datasets (e.g., medical images) from a hospital’s on-prem NAS to S3 for training a computer vision model.

  • Amazon MSK (Managed Streaming for Apache Kafka): Fully managed Apache Kafka service. Alternative to Kinesis for teams already using Kafka. Key difference: MSK requires more manual tuning (e.g., partition management) but offers better compatibility with Kafka tools (e.g., Kafka Connect, ksqlDB).

  • S3 Event Notifications: Send events to Lambda, SQS, SNS, or EventBridge when objects are created, deleted, or restored in S3. ML use case: Automatically kick off a Glue ETL job or SageMaker training job (through a Lambda function or an EventBridge rule) when new labeled data lands in an S3 bucket.

  • Glue Data Catalog: Central metadata repository for AWS Glue. Stores table definitions (schema, partitions, location) for S3, RDS, and Redshift. ML context: Enables SageMaker Feature Store and Athena to query data without manual schema management.

  • Partitioning (S3/Glue): Organizing data in S3 by keys (e.g., s3://bucket/year=2023/month=01/day=15/) to improve query performance. ML best practice: Partition training data by date or region to speed up Athena queries or SageMaker training jobs.

  • Data Wrangler (SageMaker Data Wrangler): Not for ingestion! A SageMaker tool for interactive data cleaning and feature engineering (e.g., handling missing values, encoding categorical variables). Use Glue or Kinesis for ingestion, then Data Wrangler for prep.


Step-by-Step / Process Flow

Scenario: Build a real-time fraud detection pipeline.

Goal: Ingest transaction data from mobile apps → process in real time → store in S3 for batch training → catalog for ML.

  1. Set up Kinesis Data Streams
    • Create a Kinesis Data Stream with enough shards to handle peak throughput (each shard accepts up to 1 MB/s or 1,000 records/s of writes).
    • Configure producers (mobile apps) to send transactions to the stream using the Kinesis Producer Library (KPL) or AWS SDK.
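
As an illustration of the producer side, the sketch below batches transactions for the PutRecords API, which accepts at most 500 records per call. The `user_id` field and the function names are assumptions made for this example; `kinesis_client` is expected to be a `boto3.client("kinesis")` object, and a real producer would retry failed entries rather than only count them.

```python
import json
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_kinesis_records(transactions: Iterable[dict]) -> Iterator[dict]:
    """Convert transaction dicts into PutRecords entries.

    The partition key decides which shard a record lands on. Using the
    (hypothetical) user_id field keeps each user's events on one shard.
    """
    for txn in transactions:
        yield {
            "Data": json.dumps(txn).encode("utf-8"),
            "PartitionKey": str(txn["user_id"]),
        }


def chunked(records: Iterable[dict], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[dict]]:
    """Yield lists of at most `size` records."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_transactions(kinesis_client, stream_name: str, transactions: Iterable[dict]) -> int:
    """Send transactions in batches and return the number of failed records."""
    failed = 0
    for batch in chunked(to_kinesis_records(transactions)):
        response = kinesis_client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

Keying on a per-user value preserves per-user ordering, but a few very active users can create hot shards; a higher-cardinality key spreads load more evenly.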

  2. Process and Store with Kinesis Data Firehose
    • Create a Kinesis Data Firehose delivery stream with the Kinesis Data Stream as the source.
    • Set the destination to S3 (e.g., s3://fraud-data/raw/).
    • Enable Lambda transformation (e.g., filter out test transactions, mask PII).
    • Configure buffer hints (e.g., 5 MB or 300 seconds) to balance latency and cost.
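
The Lambda transformation mentioned above could look roughly like the sketch below. Firehose passes base64-encoded records and expects each one back with its `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and base64 `data`. The `is_test` and `card_number` fields are hypothetical stand-ins for whatever the transaction schema actually contains.

```python
import base64
import json


def mask_card(card_number: str) -> str:
    """Replace all but the last four characters with asterisks."""
    return "*" * max(len(card_number) - 4, 0) + card_number[-4:]


def lambda_handler(event, context):
    """Firehose data-transformation handler: drop test rows, mask card numbers."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            txn = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Undecodable or non-JSON payload: hand it back marked as failed.
            output.append({"recordId": record_id, "result": "ProcessingFailed",
                           "data": record["data"]})
            continue

        if txn.get("is_test"):  # hypothetical flag marking test transactions
            output.append({"recordId": record_id, "result": "Dropped",
                           "data": record["data"]})
            continue

        if "card_number" in txn:  # hypothetical PII field
            txn["card_number"] = mask_card(str(txn["card_number"]))

        # A trailing newline keeps one JSON object per line in the S3 output.
        payload = (json.dumps(txn) + "\n").encode("utf-8")
        output.append({"recordId": record_id, "result": "Ok",
                       "data": base64.b64encode(payload).decode("utf-8")})
    return {"records": output}
```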

  3. Catalog Data with AWS Glue
    • Run a Glue Crawler on the S3 bucket to auto-detect schema and create a table in the Glue Data Catalog.
    • Define partitions (e.g., by year/month/day) to optimize queries.
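
To make the partition layout concrete, here is a small sketch that builds a Hive-style `year=/month=/day=` prefix from an event timestamp. The function name is an assumption for this example, and treating naive timestamps as UTC is a design choice of the sketch, not an AWS rule.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, event_time: datetime) -> str:
    """Return a Hive-style date partition prefix under `base`.

    Timestamps are normalized to UTC so that all writers agree on the day
    boundary; naive datetimes are assumed to already be in UTC.
    """
    if event_time.tzinfo is None:
        event_time = event_time.replace(tzinfo=timezone.utc)
    t = event_time.astimezone(timezone.utc)
    return f"{base.rstrip('/')}/year={t:%Y}/month={t:%m}/day={t:%d}/"


print(partition_prefix("s3://fraud-data/raw", datetime(2023, 1, 15, tzinfo=timezone.utc)))
# s3://fraud-data/raw/year=2023/month=01/day=15/
```

Crawlers and Athena read `key=value` path segments as partition columns, so queries filtered on `year`, `month`, and `day` only scan the matching prefixes.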

  4. Transform Data for ML (Glue ETL)
    • Create a Glue Job (Python/Scala) to:
      • Join transaction data with user profiles (from RDS).
      • Normalize features (e.g., scale transaction amounts).
      • Write output to a new S3 location (e.g., s3://fraud-data/processed/).
    • Schedule the job to run daily (or trigger via S3 Event Notifications).
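
The join and normalization logic can be illustrated in plain Python, independent of Glue. This is only a sketch of the transformation itself: a real Glue job would express the same steps with Spark DataFrames or DynamicFrames, and the `user_id`, `amount`, and `country` fields are assumptions for the example.

```python
from typing import Dict, List


def join_and_scale(transactions: List[dict], profiles: Dict[int, dict]) -> List[dict]:
    """Left-join transactions to user profiles and min-max scale the amount.

    Adds `amount_scaled` in [0, 1]; if every amount is identical the scaled
    value is 0.0 to avoid dividing by zero.
    """
    if not transactions:
        return []
    amounts = [t["amount"] for t in transactions]
    low, high = min(amounts), max(amounts)
    span = high - low

    rows = []
    for txn in transactions:
        profile = profiles.get(txn["user_id"], {})  # missing profile -> no extra columns
        scaled = (txn["amount"] - low) / span if span else 0.0
        rows.append({**txn, **profile, "amount_scaled": scaled})
    return rows


transactions = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 1, "amount": 30.0},
]
profiles = {1: {"country": "US"}, 2: {"country": "DE"}}
for row in join_and_scale(transactions, profiles):
    print(row)
```

In production the minimum and maximum should come from the training set and be reused at inference time, so that both sides apply the same scaling.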

  5. Train Model with SageMaker
    • Use the Glue Data Catalog to query processed data in Athena or SageMaker Feature Store.
    • Train a fraud detection model (e.g., XGBoost) using SageMaker Training Jobs.

Common Mistakes

  • Mistake: Using Kinesis Data Streams for batch processing. Correction: Kinesis Data Streams is for real-time processing. For batch, use Kinesis Data Firehose (to S3) or Glue.
  • Mistake: Not partitioning S3 data. Correction: Without partitioning (e.g., by date), Athena/Glue queries scan all files, increasing cost and latency. Partition by year/month/day.
  • Mistake: Assuming Glue Crawlers handle transformations. Correction: Crawlers only catalog data; they don’t clean or transform it. Use Glue Jobs or Data Wrangler for ETL.
  • Mistake: Treating DMS as a one-time migration tool only. Correction: DMS supports full load plus ongoing replication (CDC) and can deliver changes to S3, Kinesis Data Streams, or Kafka. It is not a general-purpose event streaming service; application events still belong in Kinesis Data Streams or MSK.
  • Mistake: Ignoring S3 lifecycle policies. Correction: Storing raw data indefinitely in S3 Standard is expensive. Move old data to S3 Glacier or Glacier Deep Archive after 30–90 days.
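
For the lifecycle point, a minimal sketch of a rule set in the shape boto3's `put_bucket_lifecycle_configuration` expects. The rule ID, the `raw/` prefix, and the 90/365-day thresholds are illustrative assumptions, not recommendations from the guide.

```python
def build_lifecycle_rules(prefix: str = "raw/",
                          glacier_after_days: int = 90,
                          deep_archive_after_days: int = 365) -> dict:
    """Build a lifecycle configuration that archives objects under `prefix`."""
    if deep_archive_after_days <= glacier_after_days:
        raise ValueError("Deep Archive transition must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-raw-data",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                    {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket: str, config: dict) -> None:
    """Apply the configuration; s3_client is expected to be boto3.client("s3")."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```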

Certification Exam Insights

  1. Service Selection Traps
    • Kinesis Data Streams vs. Firehose: Use Data Streams for custom real-time processing (e.g., fraud detection with Lambda). Use Firehose for automated, buffered delivery to S3/Redshift (e.g., log aggregation).
    • Glue vs. EMR: Glue is serverless and suits most scheduled ETL jobs (batch job runs stop at a configurable timeout, 48 hours by default). EMR gives cluster-level control for long-running or complex jobs (e.g., Spark MLlib training).
    • DMS vs. DataSync: DMS migrates databases (e.g., Oracle → RDS). DataSync syncs files (e.g., on-prem NAS → S3).

  2. Key Constraints
    • Kinesis Data Streams: 1 MB/s write per shard (or 1,000 records/s). Exam trap: Candidates forget to calculate shard count for throughput.
    • Glue Crawlers: Built-in classifiers recognize common formats (CSV, JSON, Parquet, Avro, ORC). Unusual layouts, such as JSON records wrapped in a top-level array, may need a custom classifier.
    • S3 Event Notifications: Fire only for the event types you configure (e.g., object created, removed, restored) and deliver to Lambda, SQS, SNS, or EventBridge. They cannot start a Glue job directly; route through Lambda or an EventBridge rule.

  3. Tricky Scenarios
    • "Which service to ingest IoT sensor data?" Kinesis Data Streams (real-time) or IoT Core + Kinesis (if devices use MQTT).
    • "How to migrate 10TB of on-prem data to S3 for ML?" DataSync (online transfer, typically faster than ad hoc CLI copies), or a Snowball device when the dataset is much larger or network bandwidth is limited.
    • "How to catalog data in S3 for SageMaker?" Glue Crawler (auto-detects schema) + Glue Data Catalog (metadata store).

Quick Check Questions

  1. A retail company wants to analyze customer clickstream data in real time to personalize recommendations. The data arrives at 5,000 records per second. Which AWS service should they use for ingestion?
    • Answer: Kinesis Data Streams (handles high-throughput real-time data; Firehose is for buffered delivery to destinations).
    • Why? Kinesis Data Streams supports custom processing (e.g., Lambda) and scales with shards. At 1,000 records/s per shard, 5,000 records/s needs at least 5 shards.
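
The shard arithmetic behind this answer can be sketched as follows, using the per-shard write limits quoted in this guide (1 MB/s or 1,000 records/s, whichever is hit first). The average record sizes in the examples are assumptions, since the question does not state one.

```python
import math

RECORDS_PER_SHARD = 1000   # write limit per shard, records per second
KB_PER_SHARD = 1024        # write limit per shard, 1 MB per second


def required_shards(records_per_second: float, avg_record_kb: float) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    by_count = records_per_second / RECORDS_PER_SHARD
    by_bytes = records_per_second * avg_record_kb / KB_PER_SHARD
    return max(1, math.ceil(max(by_count, by_bytes)))


print(required_shards(5000, 0.5))  # 5: the record-count limit dominates
print(required_shards(5000, 2))    # 10: the 1 MB/s limit dominates
```

Small records are bounded by the record-count limit, large records by the byte limit, so both need checking before choosing a shard count.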

  2. A data scientist needs to train a model on historical sales data stored in an on-premises SQL Server database. The database is 5TB and updates daily. Which AWS service should they use to migrate the data to S3 for training?
    • Answer: AWS DMS (Database Migration Service) (supports heterogeneous migrations and CDC for ongoing updates).
    • Why? DMS can replicate the SQL Server to S3 in Parquet format, with minimal downtime.
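
As a rough illustration of that setup, the sketch below builds the keyword arguments for boto3's DMS `create_endpoint` (an S3 target writing Parquet) and `create_replication_task` (full load followed by CDC). All identifiers, the `dbo` schema, and the folder name are hypothetical, and a real migration also needs a source endpoint and a replication instance.

```python
import json


def s3_target_endpoint_args(bucket: str, role_arn: str, folder: str = "sales") -> dict:
    """Arguments for dms_client.create_endpoint(**args): S3 target, Parquet output."""
    return {
        "EndpointIdentifier": "sales-s3-target",  # hypothetical name
        "EndpointType": "target",
        "EngineName": "s3",
        "S3Settings": {
            "BucketName": bucket,
            "BucketFolder": folder,
            "ServiceAccessRoleArn": role_arn,
            "DataFormat": "parquet",
        },
    }


def replication_task_args(source_arn: str, target_arn: str, instance_arn: str) -> dict:
    """Arguments for dms_client.create_replication_task(**args)."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-sales-tables",
                "object-locator": {"schema-name": "dbo", "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": "sales-full-load-and-cdc",  # hypothetical name
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",  # initial copy, then ongoing changes
        "TableMappings": json.dumps(table_mappings),
    }
```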

  3. A team is building a batch ML pipeline. They store raw data in S3 and want to automatically trigger a Glue ETL job when new files arrive. Which AWS feature should they use?
    • Answer: S3 Event Notifications, routed through a Lambda function or an EventBridge rule that starts the Glue job.
    • Why? S3 can notify Lambda, SQS, SNS, or EventBridge when objects are created. It cannot start a Glue job directly, so a small Lambda function (or an EventBridge-triggered Glue workflow) makes the call.
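
One way to wire this up is a Lambda function that receives the S3 notification and calls Glue's `start_job_run`. The sketch below assumes a Glue job named `fraud-etl` and a job argument `--input_path`, both hypothetical; the optional `glue_client` parameter exists only so the handler can be exercised without AWS credentials.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "fraud-etl"  # hypothetical Glue job name


def new_object_uris(event: dict) -> list:
    """Extract s3:// URIs for object-created records in an S3 notification."""
    uris = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        uris.append(f"s3://{bucket}/{key}")
    return uris


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run per newly created object."""
    if glue_client is None:
        import boto3  # imported lazily so the module loads without boto3 installed
        glue_client = boto3.client("glue")

    run_ids = []
    for uri in new_object_uris(event):
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--input_path": uri},  # hypothetical job parameter
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}
```

Starting one run per object is simple but can create many concurrent runs when files arrive in bursts; batching keys or using a Glue workflow trigger are common alternatives.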

Last-Minute Cram Sheet

  1. Kinesis Data Streams: Real-time, custom processing; 1MB/s per shard. Not for batch.
  2. Kinesis Data Firehose: Auto-delivers streams to S3/Redshift; buffers data (5MB or 300s). Not for sub-second latency.
  3. S3: Object storage; use partitioning (year/month/day) for performance. Not a database (no transactions).
  4. Glue Crawler: Auto-detects schema; writes to Glue Data Catalog. Doesn’t transform data.
  5. Glue Jobs: Serverless ETL; Python/Scala. Runs stop at a configurable timeout (48 hours by default for batch jobs); consider EMR for longer or more customized workloads.
  6. DMS: Migrates databases; supports full load plus ongoing CDC. Not a general-purpose event streaming service.
  7. DataSync: Syncs files (on-prem → S3/EFS). Not for databases.
  8. S3 Event Notifications: Deliver configured events (object created, removed, restored) to Lambda, SQS, SNS, or EventBridge. Can’t start a Glue job directly (go through Lambda or EventBridge).
  9. Partitioning: s3://bucket/key=value/ speeds up queries. Avoid too many small files (merge with Glue).
  10. Cost Trap: Kinesis Data Streams charges per shard-hour; Firehose charges per GB ingested. Over-provisioning shards = $$$.