Amazon Web Services (AWS) provides a dizzying array of cloud services, from the well-known Elastic Compute Cloud (EC2) and Simple Storage Service (S3) to platform-as-a-service (PaaS) offerings covering almost every aspect of modern computing.

Specifically, AWS provides a mature big data architecture with services covering the entire data processing pipeline: from ingestion through cleansing and pre-processing, ETL, querying and analysis, to visualization and dashboarding. AWS lets you manage big data at scale without having to set up complex infrastructure or deploy frameworks like Spark or Hadoop yourself.

In this article I’ll cover five Amazon services, each addressing an essential element of the modern data science workflow.

1. Amazon EMR

The Amazon EMR managed cluster platform takes most of the complexity out of running big data frameworks like Apache Hadoop and Spark. You can use it to process and analyze big data on AWS resources, including EC2 instances and low-cost Spot Instances. Amazon EMR also allows you to transform and migrate big data between AWS databases (such as DynamoDB) and data stores (such as S3).

Storage

There are various file systems in the storage layer, with different storage options including:

  • Hadoop Distributed File System (HDFS) — a scalable, distributed file system for Hadoop that stores multiple copies of data across instances in a cluster. This ensures that the data is not lost if one instance fails. HDFS offers ephemeral storage that you can use to cache intermediate results for your workloads.
  • EMR File System (EMRFS) — provides direct access to data stored in Amazon S3, with an interface similar to HDFS. Either S3 or HDFS can serve as your cluster’s file system, but Amazon S3 is typically used for storing input and output data while HDFS stores intermediate results.

Data Processing Frameworks

Data processing frameworks are the engine for processing and analyzing data. Frameworks can run on YARN or manage resources independently. Different frameworks have different capabilities (e.g. batch processing, streaming, interactive analysis, in-memory processing). The framework you choose affects the interfaces and languages used by your applications to interact with the data being processed.

The main open-source frameworks supported by Amazon EMR are:

  • Hadoop MapReduce — a programming framework for distributed computing. You provide Map and Reduce functions and it handles all the logic involved in writing distributed applications. Use the Map function to map the data to intermediate results and the Reduce function to combine them and produce a final output.
  • Apache Spark — a programming model and cluster framework for processing big data. It is a high-performance distributed processing system that uses in-memory caching and execution plans based on directed acyclic graphs (DAGs).

Amazon EMR lets you launch a cluster, develop your distributed processing apps, submit work to the cluster, and view results, all without having to provision hardware or deploy and configure the big data frameworks yourself.
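As a rough sketch of that workflow, the snippet below assembles the parameters you might pass to the EMR `run_job_flow` API to launch a small Spark cluster with a Spot-backed core group and a single spark-submit step. The bucket name, script path, and cluster sizing are placeholder assumptions, and the boto3 call itself is left as a comment.

```python
def build_emr_request(log_bucket: str, script_path: str) -> dict:
    """Build a run_job_flow request for a minimal, self-terminating Spark cluster."""
    return {
        "Name": "demo-spark-cluster",
        "ReleaseLabel": "emr-6.15.0",
        "Applications": [{"Name": "Spark"}],
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Instances": {
            # One master node plus two core nodes; Spot pricing keeps costs down.
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2,
                 "Market": "SPOT"},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the step finishes
        },
        "Steps": [{
            "Name": "Run Spark job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_emr_request("my-data-bucket", "s3://my-data-bucket/jobs/etl.py")
# client = boto3.client("emr")
# response = client.run_job_flow(**request)
```

With `KeepJobFlowAliveWhenNoSteps` set to `False`, the cluster exists only for the duration of the job, which pairs naturally with Spot pricing for batch workloads.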

2. AWS Glue

AWS Glue is an extract, transform and load (ETL) service that facilitates data management. It is fully managed and cost-effective, allowing you to classify, clean, enrich, and transfer data. AWS Glue is serverless and includes a Data Catalog, a scheduler and an ETL engine that generates Scala or Python code automatically.

AWS Glue handles semi-structured data, providing dynamic frames that can be used in ETL scripts. Dynamic frames are a form of data abstraction, which you can use to arrange your data. They offer schema flexibility and advanced transformations and are compatible with Spark dataframes.

The AWS Glue console lets you discover data sources, transform data, and monitor ETL processes. You can also access Glue from AWS services or other applications using the AWS Glue API.

You specify the ETL tasks you want AWS Glue to perform to move data from the source to the target. You can set up jobs to respond to a trigger you define, or you can run them on demand. To transform your data, you can provide a script via the console or API or use the script auto-generated by AWS Glue. You can define crawlers to scan sources in a data store and populate the Data Catalog with metadata.
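To sketch the crawler-and-job workflow described above, the snippet below builds the parameters for a crawler that populates the Data Catalog from an S3 prefix, and for an on-demand job run. The bucket, role ARN, database, and job names are all hypothetical placeholders; the boto3 calls that would consume these dictionaries are shown as comments.

```python
def build_crawler_request(name: str, role_arn: str, s3_path: str,
                          database: str) -> dict:
    """Parameters for glue.create_crawler: scan s3_path into `database`."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No Schedule key: the crawler runs on demand. A cron-style
        # Schedule string could be added here to run it periodically.
    }

def build_job_run_request(job_name: str, source_table: str) -> dict:
    """Parameters for glue.start_job_run, passing an argument to the ETL script."""
    return {
        "JobName": job_name,
        "Arguments": {"--source_table": source_table},  # read by the ETL script
    }

crawler = build_crawler_request("raw-data-crawler",
                                "arn:aws:iam::123456789012:role/GlueRole",
                                "s3://my-data-bucket/raw/", "analytics_db")
# glue = boto3.client("glue")
# glue.create_crawler(**crawler)
# glue.start_crawler(Name=crawler["Name"])
# glue.start_job_run(**build_job_run_request("nightly-etl", "analytics_db.raw"))
```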

3. Amazon SageMaker

This fully managed MLOps solution allows you to build and train machine learning (ML) models and easily deploy them directly to a production environment. You can use a Jupyter notebook instance to easily access data sources without having to manage servers.

SageMaker offers built-in ML algorithms optimized for big data in distributed environments, and lets you bring your own custom algorithms. Use the SageMaker Console or SageMaker Studio to deploy your model into a scalable, secure environment. As with most Amazon services, costs for data training and hosting are calculated according to actual usage, and there are no upfront or minimum fees.

To train your model, you create a training job that includes details such as:

  • The URL of the S3 bucket where the training data is stored
  • Where to store the output
  • Compute resources (ML compute instances)
  • The Amazon Elastic Container Registry (ECR) path of the training code. This can point to one of the built-in algorithms or to your own custom Python code.
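The training-job details above can be sketched as a request for SageMaker’s `create_training_job` API. The role ARN, image URI, and S3 paths below are placeholder assumptions, and the boto3 call itself is left as a comment.

```python
def build_training_job_request(job_name: str, image_uri: str,
                               input_s3: str, output_s3: str) -> dict:
    """Parameters for sagemaker.create_training_job, mirroring the bullets above."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # ECR path: built-in algorithm or custom code
            "TrainingInputMode": "File",
        },
        "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3,          # where the training data is stored
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3},  # where to store output
        "ResourceConfig": {                 # ML compute instances
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

request = build_training_job_request(
    "demo-training-job",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # placeholder
    "s3://my-data-bucket/train/",
    "s3://my-data-bucket/models/")
# sm = boto3.client("sagemaker")
# sm.create_training_job(**request)
```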

Finally, once a training job is running, you can monitor the training process and debug issues with your training data, parameters, and model code using SageMaker Debugger.

4. Amazon Kinesis Video Streams

Much of the content being created and managed by organizations is transitioning to video, creating a need to process and analyze video content. Amazon Kinesis Video Streams is a fully managed service for streaming live video to the AWS Cloud, processing video in real time, and performing batch-oriented analytics.

You can use the service to store video data, access video content in real time as it is uploaded to the cloud, and monitor live streams.

Kinesis Video Streams allows you to capture large amounts of live data from millions of devices. This includes both video and other data such as thermal imagery and audio data. Your applications can access and process this data with low latency. You can also integrate Kinesis with a variety of video APIs for additional processing and treatment of video content. Kinesis can be configured to store data for a specified retention period, with encryption for data at rest.

The following components interact:

  • Producer — a source that provides data for the video stream. This can be any device that generates video or other time-encoded data, such as audio or thermal imagery.
  • Kinesis video stream — enables transfers of live video data, and makes it available in real time or on an ad hoc or batch basis.
  • Consumer — the recipient of the data, usually an application used to view, process or analyze video data. Consumer applications can run on Amazon EC2 instances.
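As a minimal sketch of the producer side, the function below builds the parameters for the `create_stream` API, including the retention period mentioned earlier. The stream name and retention value are assumed examples; the boto3 calls are commented out.

```python
def build_stream_request(name: str, retention_hours: int = 24) -> dict:
    """Parameters for kinesisvideo.create_stream."""
    return {
        "StreamName": name,
        "DataRetentionInHours": retention_hours,  # 0 would disable storage
        "MediaType": "video/h264",
    }

stream = build_stream_request("warehouse-camera-1", retention_hours=48)
# kvs = boto3.client("kinesisvideo")
# kvs.create_stream(**stream)
# A producer then fetches the PUT_MEDIA endpoint before streaming:
# endpoint = kvs.get_data_endpoint(StreamName=stream["StreamName"],
#                                  APIName="PUT_MEDIA")
```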

5. Amazon QuickSight

This is a fully managed, cloud-based business intelligence (BI) service. Amazon QuickSight combines data from multiple sources and presents it in a single dashboard. It provides a high level of security, built-in redundancy and global availability, as well as management tools you can use to manage large numbers of users. You can get up and running instantly without deploying or managing any infrastructure.

You can access QuickSight dashboards securely from a web browser or any mobile device.

You can use Amazon QuickSight to access data, prepare it for analysis, and store the prepared data either as a direct query or in SPICE, QuickSight’s Super-fast, Parallel, In-memory Calculation Engine. To create a new analysis, you add new or existing datasets; create charts, tables, or insights; optionally enrich them with advanced features; and publish the result to users as a dashboard.
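As a small illustration of the SPICE path, the sketch below builds the parameters for QuickSight’s `create_ingestion` API, which triggers a SPICE refresh for a dataset. The account and dataset IDs are placeholders; the boto3 call is commented out.

```python
import uuid

def build_spice_refresh(account_id: str, dataset_id: str) -> dict:
    """Parameters for quicksight.create_ingestion: refresh a SPICE dataset."""
    return {
        "AwsAccountId": account_id,
        "DataSetId": dataset_id,
        "IngestionId": str(uuid.uuid4()),  # each refresh needs a unique id
    }

refresh = build_spice_refresh("123456789012", "sales-dataset")
# qs = boto3.client("quicksight")
# qs.create_ingestion(**refresh)
```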

Conclusion

In this article, I discussed AWS services that fulfill essential functions of modern data science projects:

  • Amazon EMR — Hadoop and Spark as a service, running at any scale with no complex setup.
  • AWS Glue — serverless ETL engine for semi-structured data.
  • Amazon SageMaker — machine-learning-in-a-box, letting you assemble machine learning pipelines and deploy them to production.
  • Amazon Kinesis Video Streams — letting you process and analyze video data, the new data source most organizations are scrambling to master.
  • Amazon QuickSight — quick and easy visualization and dashboards, with no complex integrations.

I hope this will be of help as you evaluate the role of the cloud in your data science journey.
