
Use SageMaker Processing for distributed feature engineering of terabyte-scale ML datasets

Project Overview

Project Detail

Terabyte-scale or larger datasets often consist of a hierarchical folder structure, and the files in the dataset sometimes share interdependencies. For this reason, machine learning (ML) engineers and data scientists must make thoughtful design decisions to prepare such data for model training and inference. This pattern demonstrates how you can use manual macrosharding and microsharding techniques in combination with Amazon SageMaker Processing and virtual CPU (vCPU) parallelization to efficiently scale feature engineering for complex big data ML datasets.

This pattern defines macrosharding as the splitting of data directories across multiple machines for processing, and microsharding as the splitting of data on each machine across multiple processing threads. The pattern demonstrates these techniques by using Amazon SageMaker with sample time-series waveform records from the PhysioNet MIMIC-III dataset. By implementing the techniques in this pattern, you can minimize the processing time and costs of feature engineering while maximizing resource utilization and throughput efficiency. These optimizations rely on distributed SageMaker Processing on Amazon Elastic Compute Cloud (Amazon EC2) instances and vCPUs, and the same approach applies to similar large datasets, regardless of data type.
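As a concrete illustration of macrosharding, the following sketch launches a distributed SageMaker Processing job in which each of several EC2 instances receives its own subset of the dataset. This is a minimal example, not the pattern's exact implementation: the IAM role, container image URI, bucket names, and script name are placeholders, and the built-in ShardedByS3Key distribution stands in here for the pattern's manual splitting of data directories across machines.

# macroshard_launch.py -- illustrative sketch; role, image URI, and S3 paths are placeholders
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

role = "arn:aws:iam::111122223333:role/SageMakerProcessingRole"   # placeholder IAM role

processor = ScriptProcessor(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/feature-eng:latest",  # placeholder image
    command=["python3"],
    role=role,
    instance_count=4,               # macrosharding: the input data is split across 4 instances
    instance_type="ml.m5.4xlarge",  # 16 vCPUs per instance available for microsharding
)

processor.run(
    code="preprocess.py",           # per-instance feature engineering script, sketched below
    inputs=[
        ProcessingInput(
            source="s3://amzn-s3-demo-bucket/mimic-iii-waveforms/",  # placeholder dataset prefix
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",              # each instance gets a distinct subset of keys
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://amzn-s3-demo-bucket/mimic-iii-features/",  # placeholder output prefix
        )
    ],
)

Inside the processing script, microsharding can then fan the instance's local files out across all available vCPUs, for example with Python's multiprocessing module. The feature engineering function below is a placeholder for the actual waveform processing.

# preprocess.py -- runs on each Processing instance; microshards local files across vCPUs
import os
from multiprocessing import Pool, cpu_count

INPUT_DIR = "/opt/ml/processing/input"
OUTPUT_DIR = "/opt/ml/processing/output"

def engineer_features(path):
    # Placeholder: replace with real per-record waveform feature engineering.
    out_path = os.path.join(OUTPUT_DIR, os.path.basename(path) + ".features")
    with open(out_path, "w") as f:
        f.write("features for " + path + "\n")

if __name__ == "__main__":
    # Collect every file that this instance's macroshard delivered to local storage.
    files = [
        os.path.join(root, name)
        for root, _, names in os.walk(INPUT_DIR)
        for name in names
    ]
    # Microsharding: one worker process per vCPU consumes the file list in parallel.
    with Pool(processes=cpu_count()) as pool:
        pool.map(engineer_features, files)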

https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/use-sagemaker-processing-for-distributed-feature-engineering-of-terabyte-scale-ml-datasets.html?did=pg_card&trk=pg_card

To learn more about this project, connect with us.
