Saturday, April 30, 2022

5 AWS Services Every Data Scientist Should Use

 


Amazon Web Services (AWS) provides a dizzying array of cloud services, from the well-known Elastic Compute Cloud (EC2) and Simple Storage Service (S3) to platform as a service (PaaS) offerings covering almost every aspect of modern computing.

Specifically, AWS provides a mature big data architecture with services covering the entire data processing pipeline — from ingestion through treatment and pre-processing, ETL, querying and analysis, to visualization and dashboarding. AWS lets you manage big data without having to set up complex infrastructure or deploy and maintain software like Spark or Hadoop yourself.

In this article I’ll cover five Amazon services, each addressing an essential element of the modern data science workflow.

1. Amazon EMR

The Amazon EMR managed cluster platform takes most of the complexity out of running big data frameworks like Apache Hadoop and Spark. You can use it to process and analyze big data on AWS resources, including EC2 instances and low-cost spot instances. Amazon EMR also allows you to transform and migrate big data between AWS databases (such as DynamoDB) and data stores (such as S3).

Storage

There are various file systems in the storage layer, with different storage options including:

  • Hadoop Distributed File System (HDFS) — a scalable, distributed file system for Hadoop that stores multiple copies of data across instances in a cluster. This ensures that the data is not lost if one instance fails. HDFS offers ephemeral storage that you can use to cache intermediate results for your workloads.
  • EMR File System (EMRFS) — provides capabilities for accessing data stored in Amazon S3 directly, similar to HDFS. Either S3 or HDFS can be used as your cluster’s file system, but Amazon S3 is typically used for storing input and output data, while HDFS is used for storing intermediate results.

Data Processing Frameworks

Data processing frameworks are the engine for processing and analyzing data. Frameworks can run on YARN or manage resources independently. Different frameworks have different capabilities (e.g. batch processing, streaming, interactive analysis, in-memory processing). The framework you choose affects the interfaces and languages used by your applications to interact with the data being processed.

The main open-source frameworks supported by Amazon EMR are:

  • Hadoop MapReduce — a programming framework for distributed computing. You provide Map and Reduce functions and it handles all the logic involved in writing distributed applications. Use the Map function to map the data to intermediate results and the Reduce function to combine them and produce a final output.
  • Apache Spark — a programming model and cluster framework used to process big data. It is a high-performance distributed processing system that uses in-memory caching and directed acyclic graph execution plans to handle large data sets.

Amazon EMR lets you launch a cluster, develop your distributed processing apps, submit work to the cluster and view results — without having to set up hardware infrastructure or deploy and configure big data frameworks yourself.
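
As a rough illustration, here is how a transient Spark cluster with a single step might be launched with boto3; the cluster name, bucket paths, instance sizes, and release label below are example values, not prescriptions.

import boto3

emr = boto3.client("emr")

# Launch a small transient cluster that runs one Spark job and then terminates
response = emr.run_job_flow(
    Name="example-spark-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "example-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR roles must already exist in the account
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://example-bucket/emr-logs/",
)
print(response["JobFlowId"])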

2. AWS Glue

AWS Glue is an extract, transform and load (ETL) service that facilitates data management. It is fully managed and cost-effective, allowing you to classify, clean, enrich, and transfer data. AWS Glue is serverless and includes a Data Catalog, a scheduler and an ETL engine that generates Scala or Python code automatically.

AWS Glue handles semi-structured data, providing dynamic frames that can be used in ETL scripts. Dynamic frames are a form of data abstraction, which you can use to arrange your data. They offer schema flexibility and advanced transformations and are compatible with Spark dataframes.
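
To make this concrete, here is a minimal sketch of a Glue ETL script that moves between a DynamicFrame and a Spark DataFrame; it assumes the awsglue runtime (only available inside a Glue job), plus a hypothetical Data Catalog database, table, and S3 path.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a Data Catalog table (e.g. one populated by a crawler) as a DynamicFrame
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="example_table"
)

# DynamicFrames interoperate with Spark DataFrames in both directions
df = dyf.toDF()                    # DynamicFrame -> Spark DataFrame
cleaned = df.dropDuplicates()      # any Spark transformation
dyf_out = DynamicFrame.fromDF(cleaned, glue_context, "dyf_out")

# Write the result back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=dyf_out,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/cleaned/"},
    format="parquet",
)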

The AWS Glue console lets you discover data sources, transform data, and monitor ETL processes. You can also access Glue from AWS services or other applications using the AWS Glue API.

You specify the ETL tasks you want AWS Glue to perform to move data from the source to the target. You can set up jobs to respond to a trigger you define, or you can run them on demand. To transform your data, you can provide a script via the console or API or use the script auto-generated by AWS Glue. You can define crawlers to scan sources in a data store and populate the Data Catalog with metadata.
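
Crawlers and on-demand job runs can also be driven from code via the AWS Glue API; the sketch below uses boto3 with placeholder names, an IAM role, and an S3 path, and assumes the ETL job itself has already been defined.

import boto3

glue = boto3.client("glue")

# Define and start a crawler that scans an S3 prefix and populates the Data Catalog
glue.create_crawler(
    Name="example-crawler",
    Role="AWSGlueServiceRole-example",
    DatabaseName="example_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
)
glue.start_crawler(Name="example-crawler")

# Run an existing ETL job on demand
run = glue.start_job_run(JobName="example-etl-job")
print(run["JobRunId"])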

3. Amazon SageMaker

This fully managed MLOps solution allows you to build and train machine learning (ML) models and easily deploy them directly to a production environment. You can use a Jupyter notebook instance to easily access data sources without having to manage servers.

SageMaker offers built-in ML algorithms optimized for big data in distributed environments, and lets you bring your own custom algorithms. Use the SageMaker Console or SageMaker Studio to deploy your model into a scalable, secure environment. As with most Amazon services, costs for data training and hosting are calculated according to actual usage, and there are no upfront or minimum fees.

To train your model, you create a training job that includes details such as the following (a minimal sketch follows the list):

  • The URL of the S3 bucket where the training data is stored
  • Where to store the output
  • The compute resources (ML compute instances) to use
  • The Amazon Elastic Container Registry (ECR) path that stores the training code. This can be one of the built-in algorithms, or your own custom Python code.
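
Here is a minimal sketch of such a training job using the SageMaker Python SDK (v2); the bucket names, IAM role ARN, and the choice of the built-in XGBoost algorithm are placeholder assumptions, not values from this article.

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# ECR path of the training code: here a built-in algorithm image (XGBoost) is resolved;
# a custom image URI could be used instead
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,                                   # training code in ECR
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder execution role
    instance_count=1,                                      # ML compute instances
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/model-artifacts/",    # where to store the output
    sagemaker_session=session,
)

# S3 URL where the training data is stored
estimator.fit({"train": "s3://example-bucket/training-data/"})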

Finally, once training jobs are running, you can tune training data, parameters, and model code using SageMaker Debugger.

4. Amazon Kinesis Video Streams

Much of the content being created and managed by organizations is transitioning to video, creating a need to process and analyze video content. Amazon Kinesis Video Streams is a fully managed service for streaming live video to the AWS Cloud, processing video in real time, and performing batch-oriented analytics.

You can use the service to store video data, access video content in real time as it is uploaded to the cloud, and monitor live streams.

Kinesis Video Streams allows you to capture large amounts of live data from millions of devices. This includes both video and other data such as thermal imagery and audio data. Your applications can access and process this data with low latency. You can also integrate Kinesis with a variety of video APIs for additional processing and treatment of video content. Kinesis can be configured to store data for a specified retention period, with encryption for data at rest.
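
As a small illustration, the boto3 snippet below creates a stream with a 24-hour retention period and looks up the endpoint a producer would use to send media; the stream name and retention value are example choices.

import boto3

kvs = boto3.client("kinesisvideo")

# Create a video stream that retains data for 24 hours
kvs.create_stream(
    StreamName="example-camera-stream",
    DataRetentionInHours=24,
)

# Producers and consumers talk to a stream-specific endpoint;
# here we look up the endpoint used to PUT media into the stream
endpoint = kvs.get_data_endpoint(
    StreamName="example-camera-stream",
    APIName="PUT_MEDIA",
)["DataEndpoint"]
print(endpoint)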

The following components interact:

  • Producer — a source that provides data for the video stream. This can be any device that generates video or other data, such as audio or thermal imagery.
  • Kinesis video stream — enables transfers of live video data, and makes it available in real time or on an ad hoc or batch basis.
  • Consumer — the recipient of the data, usually an application used to view, process or analyze video data. Consumer applications can run on Amazon EC2 instances.

5. Amazon QuickSight

This is a fully managed, cloud-based business intelligence (BI) service. Amazon QuickSight combines data from multiple sources and presents it in a single dashboard. It provides a high level of security, built-in redundancy and global availability, as well as management tools you can use to manage large numbers of users. You can get up and running instantly without deploying or managing any infrastructure.

You can access QuickSight dashboards securely from mobile devices or any device on your network.

You can use Amazon QuickSight to access data, prepare it for analysis, and hold the prepared data either as a direct query or in SPICE (QuickSight’s Super-fast, Parallel, In-memory Calculation Engine). To create a new analysis, you add existing or new datasets; create charts, tables, or insights; add variables using extended features; and publish the analysis to users as a dashboard.
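
Most of this happens in the QuickSight console, but parts of it can be scripted; the sketch below uses boto3 to list dashboards and trigger a SPICE refresh for a dataset, with the account ID, dataset ID, and ingestion ID as placeholders.

import boto3

qs = boto3.client("quicksight")
account_id = "111122223333"  # placeholder AWS account ID

# List the dashboards already published in this account
for dash in qs.list_dashboards(AwsAccountId=account_id)["DashboardSummaryList"]:
    print(dash["Name"], dash["DashboardId"])

# Refresh the SPICE copy of a dataset (an "ingestion" in QuickSight terms)
qs.create_ingestion(
    AwsAccountId=account_id,
    DataSetId="example-dataset-id",
    IngestionId="manual-refresh-001",
)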

Conclusion

In this article, I discussed AWS services that fulfill essential functions of modern data science projects:

  • Amazon EMR — Hadoop and Spark as a service, running at any scale with no complex setup.
  • AWS Glue — serverless ETL engine for semi-structured data.
  • Amazon SageMaker — machine-learning-in-a-box, letting you assemble machine learning pipelines and deploy them to production.
  • Amazon Kinesis Video Streams — letting you process and analyze video data, the new data source most organizations are scrambling to master.
  • Amazon QuickSight — quick and easy visualization and dashboards, with no complex integrations.

I hope this will be of help as you evaluate the role of the cloud in your data science journey.


Tuesday, April 19, 2022

Different ways of creating DataFrame in Pandas

 

5 Ways to Create Pandas DataFrame in Python

Improve your Python skills and learn how to create a DataFrame in different ways

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. In general, a DataFrame is like a spreadsheet, and it contains three components: index, columns, and data. DataFrames can be created in different ways.

This blog shows you five different ways to create a pandas DataFrame. Let’s start…

If you want to create the DataFrame shown below, you can do so using any of the following methods.
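
All five methods produce the same DataFrame, which prints roughly like this (the exact spacing may differ slightly):

     Name  Age Department
0    Emma   29         HR
1  Oliver   25    Finance
2   Harry   33  Marketing
3  Sophia   24         IT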

1. Create pandas DataFrame from dictionary of lists

The dictionary keys represent the column names and each list holds the contents of one column.

# Import pandas library
import pandas as pd
# Create a dictionary of list
dictionary_of_lists = {
'Name': ['Emma', 'Oliver', 'Harry', 'Sophia'],
'Age': [29, 25, 33, 24],
'Department': ['HR', 'Finance', 'Marketing', 'IT']}
# Create the DataFrame
df1 = pd.DataFrame(dictionary_of_lists)
df1

2. Create pandas DataFrame from dictionary of NumPy arrays

The dictionary keys represent the column names and each row of the array holds the contents of one column. Note that because the array mixes strings and numbers, NumPy stores every element as a string, so the Age column ends up with a string dtype (you can convert it back with df2['Age'].astype(int)).

# Import pandas and numpy libraries
import pandas as pd
import numpy as np
# Create a numpy array
nparray = np.array(
[['Emma', 'Oliver', 'Harry', 'Sophia'],
[29, 25, 33, 24],
['HR', 'Finance', 'Marketing', 'IT']])
# Create a dictionary of nparray
dictionary_of_nparray = {
'Name': nparray[0],
'Age': nparray[1],
'Department': nparray[2]}
# Create the DataFrame
df2 = pd.DataFrame(dictionary_of_nparray)
df2

3. Create pandas DataFrame from list of lists

Each inner list represents one row.

# Import pandas library
import pandas as pd
# Create a list of lists
list_of_lists = [
['Emma', 29, 'HR'],
['Oliver', 25, 'Finance'],
['Harry', 33, 'Marketing'],
['Sophia', 24, 'IT']]
# Create the DataFrame
df3 = pd.DataFrame(list_of_lists, columns = ['Name', 'Age', 'Department'])
df3

4. Create pandas DataFrame from list of dictionaries

Each dictionary represents one row and the keys are the column names.

# Import pandas library
import pandas as pd
# Create a list of dictionaries
list_of_dictionaries = [
{'Name': 'Emma', 'Age': 29, 'Department': 'HR'},
{'Name': 'Oliver', 'Age': 25, 'Department': 'Finance'},
{'Name': 'Harry', 'Age': 33, 'Department': 'Marketing'},
{'Name': 'Sophia', 'Age': 24, 'Department': 'IT'}]
# Create the DataFrame
df4 = pd.DataFrame(list_of_dictionaries)
df4

5. Create pandas DataFrame from dictionary of pandas Series

The dictionary keys represent the column names and each Series holds the contents of one column.

# Import pandas library
import pandas as pd
# Create Series
series1 = pd.Series(['Emma', 'Oliver', 'Harry', 'Sophia'])
series2 = pd.Series([29, 25, 33, 24])
series3 = pd.Series(['HR', 'Finance', 'Marketing', 'IT'])
# Create a dictionary of Series
dictionary_of_series = {'Name': series1, 'Age': series2, 'Department': series3}
# Create the DataFrame
df5 = pd.DataFrame(dictionary_of_series)
df5

I hope you find this blog useful. Thank you for reading :)


Friday, April 1, 2022

Understanding the Confusion Matrix from Scikit learn

 

A clear representation of the output of the confusion matrix

What is the default output of confusion_matrix from sklearn? Image by Author

INTRODUCTION

In one of my recent projects, a transaction monitoring system generated a lot of false positive alerts (these alerts were then manually investigated by the investigation team). We were required to use machine learning to auto-close those false alerts. The evaluation criterion for the model was Negative Predictive Value (NPV): out of all the negative predictions made by the model, how many did it identify correctly?

NPV = True Negative / (True Negative + False Negative)

The cost of a false negative is extremely high, because these are the cases where our model says a transaction is not fraudulent when in reality it is.

To get into action, I quickly displayed the confusion_matrix; below is the output from the Jupyter notebook. My binary classification model is built with target = 1 for fraud transactions and target = 0 for non-fraud transactions.

cm = confusion_matrix(y_test_actual, y_test_pred)
print(cm)
----- Output -----
[[230  33]
 [ 24  74]]

Depending upon how you interpret the confusion matrix, you can get an NPV of either 90% or 76%, because:

TN = cm[0][0] or cm[1][1], i.e. 230 or 74

FN = cm[1][0], i.e. 24

Wikipedia Representation

I referred to confusion matrix representation from Wikipedia.

Confusion Matrix from Wikipedia

This image from Wikipedia shows predicted labels on the horizontal axis and actual labels on the vertical axis. This implies,

TN = cm[1][1], i.e. 74

FN = cm[1][0], i.e. 24

NPV = 74 / (74 + 24) ≈ 76%

Sklearn Representation

Scikit-learn documentation says — Wikipedia and other references may use a different convention for axes.

Oh wait! The documentation doesn’t really make anything clearer, does it? It just says that Wikipedia and other references may use a different convention for axes.

What does “may use a different convention for axes” mean in practice? We have seen that if you use the wrong convention for the axes, your model evaluation metric can go completely off track.

If you read through the documentation, towards the bottom you will find this example:

tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()

Here, they have flattened the matrix output. In our example, this implies that:

TN = cm[0][0], i.e. 230

FN = cm[1][0], i.e. 24

NPV = 230 / (230 + 24) ≈ 90%

UNDERSTANDING THE STRUCTURE OF CONFUSION MATRIX

Clearly understanding the structure of the confusion matrix is of utmost importance. Even though you can directly use built-in functions for most of the standard metrics like accuracy, precision, and recall, you are often required to compute metrics like negative predictive value, false positive rate, and false negative rate, which are not available in the package out of the box.
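
For example, a quick way to compute them yourself is to flatten the matrix with ravel(), as shown earlier; this assumes the binary case and the y_test_actual / y_test_pred arrays from the project example above:

from sklearn.metrics import confusion_matrix

# Flatten the 2x2 matrix into its four cells (order: tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_test_actual, y_test_pred).ravel()

npv = tn / (tn + fn)  # Negative Predictive Value
fpr = fp / (fp + tn)  # False Positive Rate
fnr = fn / (fn + tp)  # False Negative Rate
print(npv, fpr, fnr)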

Now, if I ask you to pick the correct layout for the output of confusion_matrix, which one would you pick?

Confusion on Confusion Matrix. Image by Author

Would your answer be “A” because that’s what Wikipedia says, or would it be “C” because the sklearn documentation says so?

LET’S FIND OUT

Consider these are your y_true and y_pred values.

y_true = [0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1]

By looking at the given lists, we can calculate the following:

TP (True Positive) = 1

FP (False Positive) = 4

TN (True Negative) = 0

FN (False Negative) = 2

For a classic binary classification model, you would usually run the following code to get the confusion matrix.

from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)
Output of the confusion matrix

If we fill these values back into the confusion matrix, we get the matrix below.

Hence, the correct answer is “D”.

cm = confusion_matrix(y_true, y_pred)
print(cm)
--- Output ---
[[0 4]
 [2 1]]

which translates to this:

              predicted
               0     1
            ----- -----
actual  0 |   0  |  4
        1 |   2  |  1

TN (True Negative) = cm[0][0] = 0

FN (False Negative) = cm[1][0] = 2

TP (True Positive) = cm[1][1] = 1

FP (False Positive) = cm[0][1] = 4

Referring back to the original image: Option D is the default output. Image by Author

However, if you add the simple parameter labels:

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
--- Output ---
[[1 2]
 [4 0]]

which translates to this:

              predicted
               1     0
            ----- -----
actual  1 |   1  |  2
        0 |   4  |  0

TP (True Positive) = cm[0][0] = 1

FP (False Positive) = cm[1][0] = 4

TN (True Negative) = cm[1][1] = 0

FN (False Negative) = cm[0][1] = 2

Referring back to the original image: Option C is what you get with the labels parameter.

CONCLUSION:

The correct representation of the default output of the confusion matrix from sklearn is shown below: actual (true) labels run along the rows and predicted labels along the columns.

  1. Default output

#1. Default output
confusion_matrix(y_true, y_pred)

2. By adding the labels parameter, you can get the following output

#2. Using labels parameter
confusion_matrix(y_true, y_pred, labels=[1,0])

Thanks for reading!
