Workshop Flow

Incremental Data Processing

In this lab, we will perform incremental data processing on data sets stored in Amazon S3. Source data for the data lake comes from an Amazon Aurora database. We will use AWS Database Migration Service (AWS DMS) to pull the full load and the change data capture (CDC) records from the Aurora cluster (source) into Amazon S3 (target). As files arrive in S3, an AWS Lambda function is triggered; it reads the data from each file and publishes the records to an Amazon MSK topic. A Spark Streaming application then reads the data from the Amazon MSK topic and writes it to a table on S3 in Apache Hudi format.
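To make the Lambda step of this flow more concrete, here is a minimal sketch of the forwarding logic, not the workshop's actual function. It assumes DMS writes headerless CSV files whose columns are supplied by the caller; the column names, the topic name, and the `read_object` and `send` callables are all hypothetical placeholders for the real S3 read and Kafka producer calls.

```python
import csv
import io
import json
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys in S3 event notifications are URL-encoded.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def csv_to_messages(body, fieldnames):
    """Turn the text of one CSV file into JSON-encoded messages, one per row.

    Assumption: the file has no header row, so column names are passed in.
    """
    reader = csv.DictReader(io.StringIO(body), fieldnames=fieldnames)
    return [json.dumps(row).encode("utf-8") for row in reader]


def forward(event, read_object, send, topic, fieldnames):
    """Read every file named in the event and send each row to the topic.

    read_object(bucket, key) -> str and send(topic, message_bytes) are
    hypothetical callables injected by the caller, so the logic can be
    exercised without AWS or Kafka access.
    """
    count = 0
    for bucket, key in s3_objects_from_event(event):
        for message in csv_to_messages(read_object(bucket, key), fieldnames):
            send(topic, message)
            count += 1
    return count
```

In a deployed function, `read_object` would typically wrap an S3 `get_object` call and `send` would wrap a Kafka producer's send method. Passing them in as parameters keeps the parsing and forwarding logic easy to test locally.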