
Qubole Open Data Lake Platform on AWS

Project Overview

Project Detail

Run ad hoc, streaming, and machine learning workloads on a data lake using cost-effective Amazon EC2 Spot Instance infrastructure hosting the Apache Spark, Presto, Hive, and Airflow engines, following the AWS Reference Architecture for the Qubole Open Data Lake Platform (linked below).

1. Install and configure the requirements for Amazon EC2 On-Demand and Spot Instances in the Qubole Open Data Lake Platform. Configure AWS Identity and Access Management (IAM) roles and AWS accounts so that the platform can access the customer's compute and storage (see the IAM sketch after this list).

2. Qubole's open data lake platform manages the AWS infrastructure according to workload-driven service level agreements (SLAs) and performance targets without user involvement. Administrators can set up cluster management and configuration for On-Demand, Spot, and Spot Block instances on AWS in the customer's virtual private cloud (VPC) (see the cluster configuration sketch after this list).

3. Qubole's Platform Runtime services include Workload-Aware Autoscaling, Intelligent Spot Management, Automated Cluster Lifecycle Management, and Heterogeneous Cluster Management, which manage the AWS compute automatically to optimize total cost of ownership (TCO) for each workload and its SLA requirements.

4. Qubole uses the AWS Glue Data Catalog as an external Hive metastore, ensuring a single source of truth for all metadata related to the customer's data in Amazon S3. Using the AWS Glue sync agent, QDS clusters can synchronize metadata changes from their Hive metastore to the AWS Glue Data Catalog.

5. Syncing metadata to the AWS Glue Data Catalog allows users to query their data with AWS analytics services such as Amazon Athena and Amazon Redshift (see the Glue and Athena sketch after this list). Users can also retrieve data from Amazon S3 using the native s3n interface (see the PySpark sketch after this list).
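As a hedged illustration of step 1, the boto3 sketch below creates a cross-account IAM role that the platform account could assume to reach the customer's compute and storage. The account ID, external ID, bucket name, role name, and policy scope are placeholders for illustration, not values published by Qubole; the platform's actual required permissions should be taken from its documentation.

```python
# Hedged sketch: cross-account IAM role for platform access (step 1).
# Account ID, external ID, bucket, and role name are placeholders.
import json
import boto3

iam = boto3.client("iam")

TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},  # platform account (placeholder)
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "example-external-id"}},
    }],
}

ACCESS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # storage access to the data lake bucket (placeholder name)
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::example-data-lake",
                         "arn:aws:s3:::example-data-lake/*"],
        },
        {   # compute access so the platform can launch and manage cluster nodes
            "Effect": "Allow",
            "Action": ["ec2:RunInstances", "ec2:TerminateInstances", "ec2:Describe*"],
            "Resource": "*",
        },
    ],
}

role = iam.create_role(
    RoleName="qubole-platform-access",  # illustrative role name
    AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
)
iam.put_role_policy(
    RoleName="qubole-platform-access",
    PolicyName="qubole-compute-and-storage",
    PolicyDocument=json.dumps(ACCESS_POLICY),
)
print(role["Role"]["Arn"])
```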
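Steps 2 and 3 are normally configured through Qubole's cluster settings rather than hand-written code; the sketch below only illustrates the kind of On-Demand and Spot mix an administrator might declare, with Spot fallback for cost optimization. The endpoint path, header, and every field in the payload are assumptions made for illustration, not Qubole's documented API schema; consult the platform documentation for the real cluster API.

```python
# Hedged sketch of a cluster configuration request (steps 2-3).
# Endpoint, header, and payload field names are illustrative assumptions only.
import os
import requests

API_URL = "https://api.qubole.com/api/v2/clusters"  # assumed endpoint path
headers = {
    "X-AUTH-TOKEN": os.environ["QUBOLE_API_TOKEN"],  # account API token
    "Content-Type": "application/json",
}

cluster_config = {
    "cluster_info": {
        "label": ["spark-analytics"],   # illustrative cluster label
        "min_nodes": 2,
        "max_nodes": 20,                # workload-aware autoscaling bounds
        "composition": {                # heterogeneous On-Demand / Spot mix
            "master": {"nodes": [{"type": "ondemand", "percentage": 100}]},
            "autoscaling_nodes": {"nodes": [
                {"type": "ondemand", "percentage": 20},
                {"type": "spot", "percentage": 80,   # Spot share for cost savings
                 "timeout_for_request": 10,
                 "fallback": "ondemand"},            # fall back if Spot capacity is unavailable
            ]},
        },
    }
}

resp = requests.post(API_URL, headers=headers, json=cluster_config, timeout=30)
resp.raise_for_status()
print(resp.json())
```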
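For steps 4 and 5, once table metadata has been synced into the AWS Glue Data Catalog it can be browsed and queried with standard AWS SDK calls. The database name, table name, and S3 result location below are placeholders.

```python
# Sketch: browse Glue Data Catalog metadata and query a synced table with Athena (steps 4-5).
# Database, table, and S3 output location are placeholder names.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# List tables that the Glue sync agent has published from the Hive metastore.
tables = glue.get_tables(DatabaseName="data_lake_db")
for t in tables["TableList"]:
    print(t["Name"], t["StorageDescriptor"]["Location"])

# Run an Athena query against the synced catalog; results land in the given S3 prefix.
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM events WHERE dt = '2024-01-01'",
    QueryExecutionContext={"Database": "data_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(query["QueryExecutionId"])
```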
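The closing sentence of step 5 refers to reading data-lake files directly through Hadoop's S3 connectors. A minimal PySpark sketch of such a read follows; the bucket and path are placeholders, and s3a is used here as the current Hadoop connector, with s3n being the older scheme the diagram mentions.

```python
# Minimal PySpark sketch of reading data-lake files directly from Amazon S3.
# Bucket and path are placeholders; s3a:// is the current Hadoop S3 connector
# (older clusters exposed the same data through s3n://).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read-example").getOrCreate()

df = spark.read.parquet("s3a://example-data-lake/events/dt=2024-01-01/")
df.printSchema()
print(df.count())
```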

https://d1.awsstatic.com/architecture-diagrams/ArchitectureDiagrams/qubole-open-data-lake-platform-aws-ra.pdf?did=wp_card&trk=wp_card

To learn more about this project, connect with us.
