
Optimize the ETL ingestion of input file size on AWS

Project Overview

This pattern shows you how to optimize the ingestion step of the extract, transform, and load (ETL) process for big data and Apache Spark workloads on AWS Glue by optimizing file size before processing your data. Use this pattern to prevent or resolve the small files problem, in which a large number of small files slows down data processing because of per-file overhead rather than the total volume of data. For example, hundreds of files that are only a few hundred kilobytes each can significantly slow down your AWS Glue jobs, because AWS Glue must perform internal list operations on Amazon Simple Storage Service (Amazon S3), and YARN (Yet Another Resource Negotiator) must store a large amount of metadata. To improve data processing speeds, you can use grouping to enable your ETL tasks to read a group of input files into a single in-memory partition; grouping automatically combines smaller files. Alternatively, you can use custom code to add batch logic to your existing files. A minimal sketch of the grouping approach follows.
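The sketch below shows how grouping might be enabled in an AWS Glue PySpark job script by passing the groupFiles and groupSize connection options when reading from Amazon S3. The S3 path, file format, and groupSize value are illustrative assumptions, not part of this project.

```python
# Minimal sketch: enable input file grouping in an AWS Glue (PySpark) job
# so many small S3 objects are read into larger in-memory partitions.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read many small JSON files from S3, grouping them so Glue does not
# create one task per tiny file. Path and groupSize are hypothetical.
dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-bucket/raw/"],  # hypothetical input prefix
        "recurse": True,
        "groupFiles": "inPartition",            # group input files per partition
        "groupSize": "134217728",               # target group size in bytes (~128 MB)
    },
    format="json",
)

# ... transform and load steps would go here ...

job.commit()
```

The groupSize value is specified in bytes; tuning it toward the partition size your downstream transforms expect generally keeps the number of Spark tasks proportional to data volume rather than to file count.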

To learn more about this project, connect with us.
