AWS ML Engineer Associate (MLA-C01) Study Guide & Cheat Sheet

AWS Certified Machine Learning Engineer - Associate (MLA-C01) Study Guide

Questions	65 (50 scored + 15 unscored)
Time limit	130 minutes
Price	$150 USD
Delivery	Pearson VUE; test center or online proctored
Scoring	Scaled 100-1000; 720 to pass
Validity	3 years
Prerequisites	None required; ~1 year of ML/SageMaker experience recommended
Language	English plus other localized languages

Exam domains

Domain	Weight	What it covers
Data Preparation for ML	28%	Ingest, store, transform, and validate data for ML. Covers S3 storage, Glue/DataBrew and EMR/Spark transforms, Athena queries, streaming ingestion with Kinesis/Firehose/MSK, feature engineering with SageMaker Data Wrangler and Feature Store, labeling with Ground Truth, and governance with Lake Formation. Also covers handling issues such as class imbalance and poor data quality.
ML Model Development	26%	Choose, train, and evaluate models. Includes selecting built-in algorithms or JumpStart models, SageMaker training jobs, Automatic Model Tuning and Autopilot, tracking with Experiments, debugging with Debugger and bias/explainability checks with Clarify, and using metrics (accuracy, precision, recall, F1, AUC, RMSE) while managing overfitting and underfitting.
Deployment and Orchestration of ML Workflows	22%	Deploy models and automate pipelines. Covers real-time, serverless, and asynchronous endpoints, Batch Transform, multi-model endpoints, and auto scaling. Also covers orchestration with SageMaker Pipelines, Step Functions, and EventBridge, CI/CD with CodePipeline, infrastructure as code with CloudFormation/CDK, and container images in Amazon ECR.
ML Solution Monitoring, Maintenance, and Security	24%	Monitor, maintain, and secure ML solutions in production. Includes SageMaker Model Monitor for data and model quality drift, observability with CloudWatch and CloudTrail, cost and performance optimization, and security through IAM, KMS, VPC, Secrets Manager, PrivateLink, and Macie for sensitive data.

Who it’s for: ML engineers and data professionals who build, deploy, and operate machine learning solutions on AWS in a hands-on, associate-level role.

Study & test-day tips

Map every scenario to the AWS-managed service first: prefer SageMaker built-in tooling (Data Wrangler, Autopilot, Pipelines) over custom code when the question rewards lower operational overhead.
Know endpoint trade-offs cold: real-time for low-latency steady traffic, serverless for intermittent/spiky traffic, asynchronous for large payloads or long inference, and Batch Transform for offline scoring of whole datasets.
Match the metric to the problem: precision/recall/F1 and AUC for imbalanced classification, accuracy only for balanced classes, and RMSE/MAE for regression. Watch for questions that penalize false positives vs false negatives.
For class imbalance, recognize techniques like resampling, SMOTE, class weighting, and stratified splits, and pair them with the right evaluation metric rather than raw accuracy.
Use SageMaker Clarify for bias detection both before and after training and for SHAP-based feature attribution/explainability; use Debugger for training-time issues like vanishing gradients or overfitting.
For drift, remember Model Monitor compares live traffic against a baseline for data quality, model quality, bias drift, and feature attribution drift, emitting CloudWatch metrics that can trigger retraining.
Choose ingestion by pattern: Kinesis Data Streams for custom real-time consumers, Data Firehose for near-real-time delivery to S3/Redshift, and MSK for Kafka workloads; use Glue/DataBrew for batch ETL and Athena for serverless SQL on S3.
For MLOps automation, use SageMaker Pipelines for ML-native DAGs, Step Functions for broader AWS service orchestration, EventBridge for event triggers, and CodePipeline for CI/CD of model build and deploy.
Default to least privilege and encryption: scope IAM roles tightly, encrypt with KMS, isolate with VPC and PrivateLink, store credentials in Secrets Manager, and detect sensitive data with Macie.
Reuse features and consolidate hosting: in Feature Store, use the offline store for training and the online store for low-latency lookups; host many models behind one multi-model endpoint to cut cost when individual models are infrequently invoked.
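
The metric-matching tips above can be made concrete with a small sketch. It computes accuracy, precision, recall, and F1 from confusion-matrix counts; the fraud-style numbers are made up for illustration, and the function name is hypothetical rather than part of any AWS SDK.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up data: 1,000 transactions, 10 of them fraudulent.
# A model that never flags fraud still scores 99% accuracy.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A model that catches 8 of the 10 frauds at the cost of 12 false alarms.
useful = classification_metrics(tp=8, fp=12, fn=2, tn=978)

print(always_negative)
print(useful)
```

The second model has slightly lower accuracy but far better recall, which is why accuracy alone is a poor guide on imbalanced classes. Whether precision or recall matters more depends on which error the scenario penalizes.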

Cheat sheet

Data Preparation and Feature Engineering

Amazon S3 as the central data lake; Lake Formation for fine-grained access governance.
AWS Glue and Glue DataBrew for ETL and visual data prep; EMR/Spark for large-scale processing.
Athena for serverless SQL queries directly on S3 data.
SageMaker Data Wrangler for visual feature engineering; Feature Store for online/offline feature reuse.
SageMaker Ground Truth for data labeling and human-in-the-loop annotation.
Handle class imbalance (resampling, SMOTE, class weights) and split data with stratification.
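
As a rough illustration of the last item, the sketch below computes class weights with the common "balanced" heuristic (n_samples / (n_classes * class_count)) and performs a stratified train/test split in plain Python. The helper names and the 90/10 label set are assumptions for illustration; in practice a library or a SageMaker processing step would usually do this.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test example per class; a class with a single
        # example would therefore end up entirely in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 90 + [1] * 10          # made-up 90/10 imbalance
weights = balanced_class_weights(labels)
train_idx, test_idx = stratified_split(labels)

print(weights)
print(len(train_idx), len(test_idx))
```

With a plain random split, a 10% minority class can easily be under-represented or missing in the test set; stratifying keeps the 90/10 ratio in both partitions.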

Streaming and Ingestion

Kinesis Data Streams: real-time, custom consumers, replayable shards.
Amazon Data Firehose: near-real-time delivery to S3, Redshift, OpenSearch with transforms.
Amazon MSK: managed Apache Kafka for high-throughput event streaming.
Glue/DataBrew for scheduled batch ingestion and transformation into the lake.
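
To make the "shards" idea in Kinesis Data Streams more tangible, here is a minimal sketch of how a partition key maps to a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The even split of the hash space and the function names below are assumptions for illustration; real streams can have uneven ranges after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_ranges(num_shards):
    """Evenly divide the hash key space into contiguous, inclusive ranges."""
    width = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * width
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * width - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key fell outside all shard ranges")


ranges = shard_ranges(4)
for key in ["device-1", "device-2", "device-3"]:  # hypothetical keys
    print(key, "-> shard", shard_for_key(key, ranges))
```

Because the mapping is deterministic, all records with the same partition key land on the same shard, which preserves per-key ordering but can create hot shards when a few keys dominate traffic.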

Model Development and Training

SageMaker built-in algorithms (XGBoost, Linear Learner, etc.) and JumpStart pretrained models.
SageMaker training jobs with managed spot training to reduce cost.
Automatic Model Tuning (hyperparameter optimization) and Autopilot for AutoML.
SageMaker Experiments to track runs, parameters, and metrics.
Debugger for training issues; Clarify for bias detection and explainability.
Evaluation metrics: accuracy, precision, recall, F1, AUC for classification; RMSE/MAE for regression.
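
For the regression metrics in the last item, a short sketch shows how RMSE and MAE differ. The data is made up: both prediction sets have the same MAE, but one concentrates all of its error in a single point, which RMSE penalizes more heavily.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: squaring weights large errors more."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


actual = [10.0, 12.0, 14.0, 16.0]
small_errors = [11.0, 13.0, 15.0, 17.0]   # every prediction off by 1
one_outlier = [10.0, 12.0, 14.0, 20.0]    # three exact, one off by 4

print(mae(actual, small_errors), rmse(actual, small_errors))
print(mae(actual, one_outlier), rmse(actual, one_outlier))
```

A rule of thumb that follows: prefer RMSE when large misses are disproportionately costly, and MAE when every unit of error should count equally.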

Deployment and Inference

Real-time endpoints for low-latency, steady traffic with auto scaling.
Serverless endpoints for intermittent traffic with no idle cost.
Asynchronous endpoints for large payloads and long-running inference.
Batch Transform for offline scoring of entire datasets.
Multi-model endpoints to host many models behind one endpoint and cut cost.
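
The trade-offs in this list can be sketched as a small decision helper. This is only a study aid under stated assumptions: the `Workload` fields, the function name, and especially the payload and time thresholds are hypothetical placeholders, not AWS quotas, so check the current SageMaker limits before relying on any number.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_immediate_response: bool
    traffic_is_intermittent: bool
    payload_mb: float
    inference_seconds: float
    scores_whole_dataset: bool = False


def choose_inference_option(w, max_sync_payload_mb=5.0, max_sync_seconds=60.0):
    """Map workload traits to an inference option (thresholds are placeholders)."""
    # Offline scoring of a full dataset with no caller waiting on the result.
    if w.scores_whole_dataset and not w.needs_immediate_response:
        return "batch-transform"
    # Requests too large or too slow for a synchronous call get queued.
    if w.payload_mb > max_sync_payload_mb or w.inference_seconds > max_sync_seconds:
        return "asynchronous"
    # Idle periods make scale-to-zero attractive.
    if w.traffic_is_intermittent:
        return "serverless"
    # Steady, latency-sensitive traffic.
    return "real-time"


print(choose_inference_option(Workload(True, False, 0.1, 0.05)))
print(choose_inference_option(Workload(True, True, 0.1, 0.5)))
print(choose_inference_option(Workload(False, False, 500.0, 600.0)))
print(choose_inference_option(Workload(False, False, 1.0, 1.0, scores_whole_dataset=True)))
```

The order of the checks encodes the priority used in typical exam scenarios: offline scoring first, then payload and duration constraints, then traffic shape.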

Orchestration and MLOps

SageMaker Pipelines for ML-native workflow DAGs; SageMaker Model Registry for versioning and approving models.
Step Functions for orchestrating across AWS services; EventBridge for event triggers.
CodePipeline for CI/CD of build, train, and deploy stages.
CloudFormation and AWS CDK for infrastructure as code.
Amazon ECR to store and version custom training/inference container images.
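
Orchestrators such as SageMaker Pipelines and Step Functions ultimately run steps in dependency order. The toy sketch below models that idea with the standard-library `graphlib` module (Python 3.9+); it does not use the SageMaker SDK, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
pipeline = {
    "preprocess": set(),
    "baseline_stats": {"preprocess"},   # can run alongside training
    "train": {"preprocess"},
    "evaluate": {"train"},
    "register_model": {"evaluate"},
    "deploy": {"register_model", "baseline_stats"},
}


def execution_order(dag):
    """Return one valid order in which the steps could run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Steps with no dependency between them, such as `baseline_stats` and `train` here, may appear in either order, which is where a real orchestrator would run them in parallel. A dependency cycle raises `graphlib.CycleError`.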

Monitoring, Maintenance, and Security

SageMaker Model Monitor for data quality, model quality, and drift detection against a baseline.
CloudWatch for metrics, logs, and alarms; CloudTrail for API audit trails.
IAM least-privilege roles and policies for all ML resources.
KMS encryption, VPC isolation, and PrivateLink for private connectivity.
Secrets Manager for credentials; Macie to discover and protect sensitive data (PII).
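
The idea of drift detection against a baseline can be sketched with a deliberately simple statistic: how far the live mean of a feature has moved, measured in baseline standard deviations. This is an illustrative stand-in, not the computation SageMaker Model Monitor performs; the function names, the threshold of 2.0, and the sample values are all assumptions.

```python
import statistics


def mean_shift_in_std(baseline, live):
    """Distance between live and baseline means, in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has zero variance")
    return abs(statistics.mean(live) - base_mean) / base_std


def has_drifted(baseline, live, threshold=2.0):
    """Flag drift when the mean shift exceeds the (hypothetical) threshold."""
    return mean_shift_in_std(baseline, live) > threshold


baseline_ages = [30, 32, 34, 36, 38]   # feature values captured at training time
similar_live = [31, 33, 35, 37]        # live traffic that matches the baseline
shifted_live = [50, 52, 54, 56]        # live traffic from a different population

print(has_drifted(baseline_ages, similar_live))
print(has_drifted(baseline_ages, shifted_live))
```

In a production setup the equivalent signal would be published as a metric, with an alarm on the threshold starting investigation or retraining.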

Glossary

Amazon S3: Object storage service that serves as the central data lake for ML datasets, models, and artifacts.
AWS Glue: Serverless ETL service for discovering, cataloging, and transforming data at scale.
AWS Glue DataBrew: Visual data preparation tool for cleaning and normalizing data without writing code.
Amazon Athena: Serverless interactive query service that runs SQL directly against data in S3.
Amazon EMR: Managed big data platform running Apache Spark, Hadoop, and related frameworks for large-scale processing.
Amazon Kinesis Data Streams: Real-time streaming data service with replayable shards for custom consumer applications.
Amazon Data Firehose: Near-real-time streaming delivery service that loads data into S3, Redshift, and other destinations.
Amazon MSK: Managed Streaming for Apache Kafka, providing fully managed Kafka clusters for event streaming.
SageMaker Data Wrangler: Visual tool for importing, exploring, and engineering features from data for ML.
SageMaker Feature Store: Repository for storing, sharing, and serving ML features with online and offline stores.
SageMaker Ground Truth: Data labeling service with human and automated workflows to create training datasets.
AWS Lake Formation: Service for building and governing data lakes with fine-grained access control.
SageMaker training job: Managed compute job that trains a model on provided data and outputs model artifacts to S3.
Built-in algorithms: Prebuilt, optimized SageMaker algorithms such as XGBoost and Linear Learner for common ML tasks.
Automatic Model Tuning: SageMaker hyperparameter optimization that searches parameter combinations to maximize an objective metric.
SageMaker Autopilot: AutoML capability that automatically builds, trains, and tunes candidate models while keeping the generated candidates visible for inspection.
SageMaker JumpStart: Hub of pretrained models and solution templates for fast fine-tuning and deployment.
SageMaker Experiments: Capability to organize, track, and compare training runs, parameters, and metrics.
SageMaker Debugger: Tool that captures training tensors to detect issues like overfitting and vanishing gradients.
SageMaker Clarify: Service that detects bias in data and models and explains predictions via feature attribution.
Real-time endpoint: Persistent SageMaker endpoint serving low-latency, synchronous predictions with auto scaling.
Serverless endpoint: SageMaker inference option that scales to zero, ideal for intermittent or spiky traffic.
Asynchronous endpoint: SageMaker endpoint that queues requests for large payloads and long-running inference.
Batch Transform: SageMaker feature for offline, high-throughput inference over an entire dataset.
Multi-model endpoint: Single SageMaker endpoint that dynamically loads and hosts many models to reduce cost.
SageMaker Pipelines: Purpose-built CI/CD service for building, automating, and managing ML workflows.
AWS Step Functions: Serverless orchestration service that coordinates multiple AWS services into workflows.
Amazon EventBridge: Event bus service that triggers ML workflows and actions based on events.
Amazon ECR: Elastic Container Registry for storing and versioning Docker images for training and inference.
SageMaker Model Monitor: Service that monitors deployed models for data quality, model quality, and drift against a baseline.
Model drift: Degradation in model performance over time as live data diverges from the training data distribution.
Amazon Macie: Security service that uses ML to discover and protect sensitive data such as PII in S3.
AWS PrivateLink: Service that provides private connectivity to AWS services without exposing traffic to the public internet.
AWS KMS: Key Management Service for creating and controlling the encryption keys that AWS services use to protect data at rest; encryption in transit is typically handled by TLS.

HOW TO // AI is not affiliated with or endorsed by Amazon Web Services. AWS Certified Machine Learning Engineer – Associate and MLA-C01 are certifications of Amazon.com, Inc. or its affiliates; we reference them descriptively. All questions are original.