Managing Inserts and Upserts in a Serverless Data Lake

Project Overview

Data is ingested from source systems through batch, CDC, or streaming pipelines into the RAW layer in Amazon S3. Once the data is persisted in the RAW data lake on S3, an AWS Glue crawler crawls it and populates the AWS Glue Data Catalog. The RAW data is then pulled into an Amazon EMR cluster and read using Hive and Spark for cleaning and transformation. Apache Hudi, running on Amazon EMR, reads the data through its Spark APIs and performs inserts and upserts on the required data sets. The cleaned and transformed data is persisted back into the Amazon S3 processed and reportable buckets. The reportable data is consumed on demand using Amazon Athena or loaded into Amazon Redshift, where it can be consumed by different users, tools, and applications. The end-to-end data movement, spinning up on-demand Amazon EMR clusters (in the case of batch data), and loading of the data are handled by a workflow orchestration service.
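To make the Hudi step concrete, here is a minimal PySpark sketch of an upsert from the RAW zone into the processed zone. It is illustrative only: the bucket names, the orders table, and the key and precombine fields are hypothetical, not part of the reference architecture.

```python
from pyspark.sql import SparkSession

# Hudi requires Kryo serialization; on Amazon EMR the Hudi jars ship with
# the cluster, otherwise pass them via --jars or --packages at submit time.
spark = (
    SparkSession.builder
    .appName("hudi-upsert-example")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read the cleaned incremental batch from the RAW layer (hypothetical path).
updates_df = spark.read.parquet("s3://my-raw-bucket/orders/incremental/")

hudi_options = {
    "hoodie.table.name": "orders",                              # hypothetical table
    "hoodie.datasource.write.recordkey.field": "order_id",      # record key column
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest record wins on key collision
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",              # insert new keys, update existing ones
}

# Upsert into the processed bucket; Hudi rewrites only the affected file groups.
# On older Hudi releases use .format("org.apache.hudi") instead of "hudi".
(
    updates_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-processed-bucket/orders/")
)
```

The precombine field is what lets Hudi deduplicate late-arriving records: when two rows share the same record key, the one with the larger updated_at value is kept.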
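On the consumption side, the reportable data can be queried on demand through Amazon Athena. A hedged boto3 sketch of that step follows; the region, database, table, and results bucket are hypothetical placeholders.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query against the catalog table built over the reportable bucket.
response = athena.start_query_execution(
    QueryString="SELECT order_id, status FROM orders WHERE order_date = DATE '2023-01-01'",
    QueryExecutionContext={"Database": "reportable_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # the first row returned is the column header row
        print([col.get("VarCharValue") for col in row["Data"]])
```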

Reference architecture diagram: https://d1.awsstatic.com/architecture-diagrams/ArchitectureDiagrams/big-data-inserts-and-upserts-ra.pdf?did=wp_card&trk=wp_card

To learn more about this project, connect with us.