Architecture

Following is the architecture/flow of the data pipeline that you will be working with. It starts with data pulled from an OLTP database such as Amazon Aurora using AWS Database Migration Service (AWS DMS). DMS deposits the data files into the raw tier bucket of the S3 data lake in Parquet format. The DMS task is configured so that both the full-load files and the ongoing change data capture (CDC) files land in the raw tier S3 bucket. The data is then read by Spark running on an Amazon EMR cluster and written as an Apache Hudi dataset to an S3 bucket in the analytics tier of the data lake. Hudi can write the data in two different storage types (Copy on Write and Merge on Read) and exposes the written data in three different views:
a.) A read-optimized view that can be queried using tools such as Presto running on Amazon EMR, or used for machine learning workloads.
b.) A real-time view that can support dashboards.
c.) An incremental view that can be used to populate data warehouses such as Amazon Redshift.
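To make the EMR step concrete, here is a minimal PySpark sketch of the raw-to-analytics hop: it reads the Parquet files that DMS deposited in the raw tier and upserts them into a Hudi table in the analytics tier. The bucket names, table name, and the key, precombine, and partition fields (order_id, updated_at, order_date) are illustrative assumptions, not values from the pipeline above, and the job assumes the Hudi Spark bundle is available on the EMR cluster.

```python
# Minimal sketch: raw-tier Parquet (written by DMS) -> Hudi table in the analytics tier.
# Assumes the Hudi Spark bundle is on the cluster (for example, submitted with
# --jars /usr/lib/hudi/hudi-spark-bundle.jar) and Kryo serialization enabled.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("raw-to-hudi")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Full-load and CDC files deposited by the DMS task (paths are illustrative).
raw_df = spark.read.parquet("s3://my-datalake-raw/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # primary key of the source table
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest record wins on upsert
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",     # or MERGE_ON_READ
    "hoodie.datasource.write.operation": "upsert",
}

(
    raw_df.write
    .format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")  # append preserves the existing Hudi commit timeline
    .save("s3://my-datalake-analytics/orders/")
)
```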
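For the incremental view in c.), a Hudi incremental query returns only the records committed after a given instant, which is the kind of delta you would stage before loading into Amazon Redshift. The sketch below is again illustrative: the paths and the begin instant are placeholders, and the option names assume a Hudi version that supports hoodie.datasource.query.type.

```python
# Minimal sketch of an incremental read from the Hudi table, e.g. to stage
# changed rows for a Redshift COPY or for Redshift Spectrum to query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read").getOrCreate()

incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    # Only commits after this instant time (yyyyMMddHHmmss) are returned.
    "hoodie.datasource.read.begin.instanttime": "20240101000000",
}

changes_df = (
    spark.read
    .format("org.apache.hudi")
    .options(**incremental_options)
    .load("s3://my-datalake-analytics/orders/")
)

# Stage the deltas as Parquet; a downstream load into Redshift can pick them up from here.
changes_df.write.mode("overwrite").parquet("s3://my-datalake-staging/orders_changes/")
```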