By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
Topic: Data Ingestion (Kinesis, S3, Glue, DMS, DataSync)
Data ingestion is the first step in any ML pipeline: getting raw data from sources (databases, IoT devices, logs, APIs) into a format and location where it can be cleaned, transformed, and used for training or inference. Real-world scenario: A fintech company needs to detect fraud in real time as transactions stream in from mobile apps. They use Amazon Kinesis Data Streams to ingest transactions, Kinesis Data Firehose to batch and store them in S3, and AWS Glue to catalog and transform the data for downstream ML models (e.g., SageMaker fraud detection). Without reliable ingestion, the ML model is starved of data or, worse, trains on stale or inconsistent data.
Amazon Kinesis Data Streams: AWS’s scalable, real-time data streaming service. Best for high-throughput, low-latency ingestion (e.g., IoT telemetry, clickstreams). Producers (apps, devices) push records to shards; consumers (Lambda, EC2, SageMaker) process them in real time.
Amazon Kinesis Data Firehose (renamed Amazon Data Firehose in 2024): Fully managed service to load streaming data into S3, Redshift, OpenSearch, or HTTP endpoints. Automatically batches and compresses data, and can transform it (via Lambda) before delivery. Ideal for near-real-time analytics (e.g., log aggregation) but not for sub-second latency, because delivery is buffered (use Kinesis Data Streams instead).
Amazon S3 (Simple Storage Service): AWS’s object storage for raw and processed data. ML context: Primary storage for training datasets, model artifacts, and batch inference outputs. Supports versioning, lifecycle policies (e.g., move old data to Glacier), and event triggers (e.g., invoke Lambda on new uploads).
AWS Glue: Serverless ETL (Extract, Transform, Load) service. Glue Crawlers scan data in S3/RDS and populate the Glue Data Catalog with inferred schemas and metadata. Glue Jobs (Python/Scala scripts) transform data for ML (e.g., normalize features, handle missing values). Best suited to batch processing; Glue also offers streaming ETL jobs, but for low-latency stream ingestion use Kinesis.
AWS Database Migration Service (DMS): Migrates data from on-premises databases (Oracle, SQL Server) or other clouds to AWS (RDS, Redshift, S3). Supports homogeneous (same DB engine) and heterogeneous (different engines) migrations. ML use case: Move historical transaction data from an on-prem SQL Server to S3 for training a churn prediction model.
AWS DataSync: Automates and accelerates data transfers between on-premises storage and AWS (S3, EFS, FSx). Uses a DataSync agent (deployed as a virtual machine close to the source storage) to sync files over the internet or Direct Connect. ML use case: Move large datasets (e.g., medical images) from a hospital’s on-prem NAS to S3 for training a computer vision model.
Amazon MSK (Amazon Managed Streaming for Apache Kafka): Fully managed Apache Kafka service. Alternative to Kinesis for teams already using Kafka. Key difference: MSK requires more manual tuning (e.g., partition management) but offers better compatibility with Kafka tools (e.g., Kafka Connect, KSQL).
S3 Event Notifications: Send events to Lambda, SQS, SNS, or EventBridge when objects are created, deleted, or restored in S3. ML use case: Automatically kick off a Glue ETL job or SageMaker training job when new labeled data lands in an S3 bucket, typically by having a Lambda function or an EventBridge rule start the job.
Glue Data Catalog: Central metadata repository for AWS Glue. Stores table definitions (schema, partitions, location) for S3, RDS, and Redshift. ML context: Enables SageMaker Feature Store and Athena to query data without manual schema management.
Partitioning (S3/Glue): Organizing data in S3 by keys (e.g., s3://bucket/year=2023/month=01/day=15/) to improve query performance. ML best practice: Partition training data by date or region to speed up Athena queries or SageMaker training jobs.
Data Wrangler (SageMaker Data Wrangler): Not for ingestion! A SageMaker tool for interactive data cleaning and feature engineering (e.g., handling missing values, encoding categorical variables). Use Glue or Kinesis for ingestion, then Data Wrangler for prep.
Goal: Ingest transaction data from mobile apps → process in real time → store in S3 for batch training → catalog for ML.
Configure producers (mobile apps) to send transactions to the stream using the Kinesis Producer Library (KPL) or AWS SDK.
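As a rough illustration of the producer step, the sketch below builds the arguments for the Kinesis `PutRecord` call through the AWS SDK for Python (boto3). The stream name `fraud-transactions`, the `account_id` field, and both helper functions are assumptions made for the example, not part of the original walkthrough.

```python
import json


def build_put_record_request(stream_name, transaction):
    """Build keyword arguments for the Kinesis PutRecord API call."""
    return {
        "StreamName": stream_name,
        # Data must be bytes; JSON is one common encoding, not a requirement.
        "Data": json.dumps(transaction, sort_keys=True).encode("utf-8"),
        # Records sharing a partition key are routed to the same shard,
        # which keeps one account's transactions in order.
        "PartitionKey": str(transaction["account_id"]),
    }


def send_transaction(stream_name, transaction):
    """Send one record; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without it

    kinesis = boto3.client("kinesis")
    return kinesis.put_record(**build_put_record_request(stream_name, transaction))


request = build_put_record_request(
    "fraud-transactions",  # hypothetical stream name
    {"account_id": "acct-42", "amount": 19.99, "currency": "USD"},
)
```

Using the account ID as the partition key is one possible design choice: it preserves per-account ordering, at the risk of a hot shard if a single account is unusually busy.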
Process and Store with Kinesis Data Firehose
s3://fraud-data/raw/
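Configure buffering hints (e.g., 5 MB or 300 seconds, whichever is reached first) to balance latency and cost: larger buffers produce fewer, larger S3 objects but delay delivery.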
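The Firehose step could be expressed roughly as the request below, which is the shape boto3's `create_delivery_stream` accepts for a stream-sourced delivery to S3. The delivery stream name, account ID, ARNs, and the GZIP choice are placeholders assumed for illustration; only the `fraud-data` bucket, the `raw/` prefix, and the 5 MB / 300 second hints come from the steps above.

```python
def build_firehose_request(stream_arn, role_arn, bucket_arn,
                           size_mb=5, interval_s=300):
    """Build keyword arguments for firehose.create_delivery_stream."""
    return {
        "DeliveryStreamName": "fraud-to-s3",  # hypothetical name
        "DeliveryStreamType": "KinesisStreamAsSource",
        "KinesisStreamSourceConfiguration": {
            "KinesisStreamARN": stream_arn,
            "RoleARN": role_arn,
        },
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            "Prefix": "raw/",
            # Firehose flushes when either threshold is reached first.
            "BufferingHints": {
                "SizeInMBs": size_mb,
                "IntervalInSeconds": interval_s,
            },
            "CompressionFormat": "GZIP",  # assumed; other formats exist
        },
    }


firehose_request = build_firehose_request(
    "arn:aws:kinesis:us-east-1:123456789012:stream/fraud-transactions",
    "arn:aws:iam::123456789012:role/firehose-delivery",
    "arn:aws:s3:::fraud-data",
)
# To create the stream (needs boto3 and credentials):
#   boto3.client("firehose").create_delivery_stream(**firehose_request)
```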
Catalog Data with AWS Glue
Define partitions (e.g., by year/month/day) to optimize queries.
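To make the year/month/day layout concrete, here is a small sketch that builds a Hive-style partition prefix and the arguments for a Glue crawler pointed at the raw data. The crawler name, database name, and role ARN are hypothetical; the dictionary follows the shape of boto3's `glue.create_crawler` parameters.

```python
from datetime import date


def partition_prefix(base, day):
    """Return a Hive-style prefix such as .../year=2023/month=01/day=15/."""
    return f"{base.rstrip('/')}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # The crawler infers the schema and registers key=value folders
        # under this path as table partitions.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


prefix = partition_prefix("s3://fraud-data/raw", date(2023, 1, 15))
crawler_request = build_crawler_request(
    "fraud-raw-crawler",                            # hypothetical name
    "arn:aws:iam::123456789012:role/glue-crawler",  # hypothetical role
    "fraud_db",                                     # hypothetical database
    "s3://fraud-data/raw/",
)
```

Zero-padding the month and day keeps prefixes sorted lexicographically, which makes date-range filters simpler to write.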
Transform Data for ML (Glue ETL)
s3://fraud-data/processed/
Schedule the job to run daily (or trigger via S3 Event Notifications).
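For the event-driven option, one common pattern is a small Lambda function that receives the S3 notification and starts the Glue job. The sketch below assumes a job named `fraud-etl` and a job argument called `--input_path`; both are illustrative, and a real job script would need to read that argument itself.

```python
from urllib.parse import unquote_plus

JOB_NAME = "fraud-etl"  # hypothetical Glue job name


def job_runs_for_event(event):
    """Translate an S3 notification event into start_job_run arguments."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        runs.append({
            "JobName": JOB_NAME,
            "Arguments": {"--input_path": f"s3://{bucket}/{key}"},
        })
    return runs


def lambda_handler(event, context):
    """Lambda entry point; needs boto3 (bundled in the Lambda runtime)."""
    import boto3

    glue = boto3.client("glue")
    return [glue.start_job_run(**run)["JobRunId"]
            for run in job_runs_for_event(event)]


sample_event = {"Records": [{"s3": {
    "bucket": {"name": "fraud-data"},
    "object": {"key": "raw/year%3D2023/batch+1.json"},
}}]}
runs = job_runs_for_event(sample_event)
```

Starting one job run per uploaded file is the simplest mapping; for bursts of small files, batching keys or relying on the daily schedule may be the cheaper choice.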
Train Model with SageMaker
DMS vs. DataSync:
Key Constraints
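S3 Event Notifications: Fire only for the event types you subscribe to (e.g., object created, removed, or restored). An overwrite of an existing key arrives as another object-created event, not as a separate "update" event. For richer filtering or more targets, route S3 events through Amazon EventBridge.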
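As an illustration of how a notification rule names the events it listens for, the sketch below builds a configuration in the shape accepted by boto3's `put_bucket_notification_configuration`. It subscribes a Lambda function to object-creation events under one prefix; the function ARN and the prefix are placeholders chosen for the example.

```python
def build_notification_config(lambda_arn, prefix):
    """Build a NotificationConfiguration for object-created events."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": lambda_arn,
            # Only the event types listed here are delivered.
            "Events": ["s3:ObjectCreated:*"],
            # Restrict the rule to keys that start with the prefix.
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
            ]}},
        }]
    }


notification_config = build_notification_config(
    "arn:aws:lambda:us-east-1:123456789012:function:start-etl",  # hypothetical
    "raw/",
)
# To apply it (needs boto3, credentials, and Lambda invoke permission for S3):
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="fraud-data", NotificationConfiguration=notification_config)
```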
Tricky Scenarios
Why? Kinesis Data Streams supports custom processing (e.g., Lambda) and scales with shards.
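To see what "scales with shards" means in numbers, here is a rough sizing sketch. It assumes provisioned-mode write limits of 1 MB/s and 1,000 records/s per shard and ignores read throughput and headroom, so treat the result as a lower bound rather than a recommendation. The function name and the example workloads are made up.

```python
import math

SHARD_WRITE_MB_PER_SEC = 1.0       # per-shard ingest limit (provisioned mode)
SHARD_WRITE_RECORDS_PER_SEC = 1000  # per-shard record limit (provisioned mode)


def shards_needed(mb_per_sec, records_per_sec):
    """Minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # Whichever limit is hit first determines the shard count.
    return max(1, by_bytes, by_records)


print(shards_needed(3.5, 2500))  # byte-bound workload -> 4
print(shards_needed(0.2, 4200))  # record-bound workload -> 5
```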
A data scientist needs to train a model on historical sales data stored in an on-premises SQL Server database. The database is 5TB and updates daily. Which AWS service should they use to migrate the data to S3 for training?
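Why? DMS can replicate the SQL Server database to S3 in Parquet format with minimal downtime, and ongoing replication (change data capture) keeps the S3 copy current with the daily updates.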
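A hedged sketch of the DMS side: the two dictionaries below follow the parameter shapes of boto3's `dms.create_endpoint` and `dms.create_replication_task`. Bucket, folder, identifiers, the `dbo` schema, and all ARNs are assumptions for illustration; a real setup also needs a source endpoint and a replication instance.

```python
import json


def build_s3_target_endpoint(bucket, folder, role_arn):
    """Build keyword arguments for dms.create_endpoint (S3 target)."""
    return {
        "EndpointIdentifier": "sales-s3-target",  # hypothetical
        "EndpointType": "target",
        "EngineName": "s3",
        "S3Settings": {
            "BucketName": bucket,
            "BucketFolder": folder,
            "ServiceAccessRoleArn": role_arn,
            "DataFormat": "parquet",  # columnar output for training jobs
        },
    }


def build_replication_task(source_arn, target_arn, instance_arn):
    """Build keyword arguments for dms.create_replication_task."""
    mappings = {"rules": [{
        "rule-type": "selection", "rule-id": "1", "rule-name": "1",
        "object-locator": {"schema-name": "dbo", "table-name": "%"},
        "rule-action": "include",
    }]}
    return {
        "ReplicationTaskIdentifier": "sales-to-s3",  # hypothetical
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Initial bulk copy, then change data capture for daily updates.
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(mappings),
    }


endpoint = build_s3_target_endpoint(
    "sales-training-data", "raw", "arn:aws:iam::123456789012:role/dms-s3")
task = build_replication_task("arn:src", "arn:tgt", "arn:inst")  # placeholder ARNs
```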
A team is building a batch ML pipeline. They store raw data in S3 and want to automatically trigger a Glue ETL job when new files arrive. Which AWS feature should they use?
s3://bucket/key=value/