Monday, March 28, 2022

Statistics Books for Machine Learning

Statistical methods are used at each step in an applied machine learning project.

This means it is important to have a strong grasp of the fundamentals of the key findings from statistics and a working knowledge of relevant statistical methods.

Unfortunately, statistics is not covered in many computer science and software engineering degree programs. Even if it is, it may be taught in a bottom-up, theory-first manner, making it unclear which parts are relevant on a given project.

In this post, you will discover some top introductory books to statistics that I recommend if you are looking to jump-start your understanding of applied statistics.

I own copies of all of these books, but I don’t recommend you buy and read them all. Instead, pick one book to start with, and really read it.

Kick-start your project with my new book Statistics for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Luis Rogelio HM, some rights reserved.

Overview

This section is divided into 3 parts; they are:

  1. Popular Science
  2. Statistics Textbooks
  3. Statistical Research Methods

Popular Science

Popular science books on statistics are those that wrap the important findings from statistics, like the normal distribution and the central limit theorem, in stories and anecdotes.

Do not overlook these types of books.

I read them all the time even though I’ve pawed through statistics textbooks. The reasons I recommend them are:

  • They’re a quick and fun to read.
  • They often give a fresh perspective on dry material.
  • They’re for the lay audience.

They will help show you why a working knowledge of statistics is important in a way that you will be able to connect to your specific needs in applied machine learning.

There are many great popular science books on statistics; the three I would recommend are:

Naked Statistics: Stripping the Dread from the Data

Written by Charles Wheelan.

For those who slept through Stats 101, this book is a lifesaver. Wheelan strips away the arcane and technical details and focuses on the underlying intuition that drives statistical analysis. He clarifies key concepts such as inference, correlation, and regression analysis, reveals how biased or careless parties can manipulate or misrepresent data, and shows us how brilliant and creative researchers are exploiting the valuable data from natural experiments to tackle thorny questions.

The Drunkard’s Walk: How Randomness Rules Our Lives

Written by Leonard Mlodinow.

With the born storyteller’s command of narrative and imaginative approach, Leonard Mlodinow vividly demonstrates how our lives are profoundly informed by chance and randomness and how everything from wine ratings and corporate success to school grades and political polls are less reliable than we believe.

The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t

Written by Nate Silver.

Drawing on his own groundbreaking work, Silver examines the world of prediction, investigating how we can distinguish a true signal from a universe of noisy data. Most predictions fail, often at great cost to society, because most of us have a poor understanding of probability and uncertainty. Both experts and laypeople mistake more confident predictions for more accurate ones. But overconfidence is often the reason for failure. If our appreciation of uncertainty improves, our predictions can get better too. This is the “prediction paradox”: The more humility we have about our ability to make predictions, the more successful we can be in planning for the future.

Do you have a favorite popular science book on statistics?
Let me know in the comments below.

Statistics Textbooks

You need a solid reference text.

A textbook contains the theory, the explanations, and the equations for the methods you need to know.

Do not read these books cover to cover; rather, once you know what you need, dip into these books to learn about those methods.

In this section, I have included (in order) a proper statistics textbook, a text for those with a non-math background, and a book for those with a programming background.

Pick one book that suits your background.

All of Statistics: A Concise Course in Statistical Inference

Written by Larry Wasserman.

The book includes modern topics like non-parametric curve estimation, bootstrapping, and classification, topics that are usually relegated to follow-up courses. The reader is presumed to know calculus and a little linear algebra. No previous knowledge of probability and statistics is required. Statistics, data mining, and machine learning are all concerned with collecting and analysing data.

Statistics in Plain English

Written by Timothy C. Urdan.

This introductory textbook provides an inexpensive, brief overview of statistics to help readers gain a better understanding of how statistics work and how to interpret them correctly. Each chapter describes a different statistical technique, ranging from basic concepts like central tendency and describing distributions to more advanced concepts such as t tests, regression, repeated measures ANOVA, and factor analysis. Each chapter begins with a short description of the statistic and when it should be used. This is followed by a more in-depth explanation of how the statistic works. Finally, each chapter ends with an example of the statistic in use, and a sample of how the results of analyses using the statistic might be written up for publication. A glossary of statistical terms and symbols is also included. Using the author’s own data and examples from published research and the popular media, the book is a straightforward and accessible guide to statistics.

Practical Statistics for Data Scientists: 50 Essential Concepts

Written by Peter Bruce and Andrew Bruce.

Statistical methods are a key part of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

What is your favorite statistics textbook?
Let me know in the comments below.

Statistical Research Methods

Once you have the foundations under control, you need to know what statistical methods to use in different circumstances.

A lot of applied machine learning involves designing and executing experiments, and statistical methods are required for effectively designing those experiments and interpreting the results.

This means that you require a solid grasp of statistical methods in a research context.

This section provides a few key books on this topic.

It is hard to find good books on this topic that are not too theoretical or focused on the proprietary SPSS software platform. The first book is highly recommended and general, the second uses the free R platform, and the last is a classic textbook on the topic.

Empirical Methods for Artificial Intelligence

Written by Paul R. Cohen.

Computer science and artificial intelligence in particular have no curriculum in research methods, as other sciences do. This book presents empirical methods for studying complex computer programs: exploratory tools to help find patterns in data, experiment designs and hypothesis-testing tools to help data speak convincingly, and modeling tools to help explain data. Although many of these techniques are statistical, the book discusses statistics in the context of the broader empirical enterprise. The first three chapters introduce empirical questions, exploratory data analysis, and experiment design. The blunt interrogation of statistical hypothesis testing is postponed until chapters 4 and 5, which present classical parametric methods and computer-intensive (Monte Carlo) resampling methods, respectively. This is one of the few books to present these new, flexible resampling techniques in an accurate, accessible manner.

Statistical Research Methods: A Guide for Non-Statisticians

Written by Roy Sabo and Edward Boone.

This textbook will help graduate students in non-statistics disciplines, advanced undergraduate researchers, and research faculty in the health sciences to learn, use and communicate results from many commonly used statistical methods. The material covered, and the manner in which it is presented, describe the entire data analysis process from hypothesis generation to writing the results in a manuscript. Chapters cover, among other topics: one and two-sample proportions, multi-category data, one and two-sample means, analysis of variance, and regression. Throughout the text, the authors explain statistical procedures and concepts using a non-statistical language. This accessible approach is complete with real-world examples and sample write-ups for the Methods and Results sections of scholarly papers. The text also allows for the concurrent use of the programming language R, which is an open-source program created, maintained and updated by the statistical community. R is freely available and easy to download.

Statistics for Experimenters: Design, Innovation, and Discovery

Written by George E. P. Box, J. Stuart Hunter, and William G. Hunter.

Rewritten and updated, this new edition of Statistics for Experimenters adopts the same approaches as the landmark First Edition by teaching with examples, readily understood graphics, and the appropriate use of computers. Catalyzing innovation, problem solving, and discovery, the Second Edition provides experimenters with the scientific and statistical tools needed to maximize the knowledge gained from research data, illustrating how these tools may best be utilized during all stages of the investigative process. The authors’ practical approach starts with a problem that needs to be solved and then examines the appropriate statistical methods of design and analysis.

Do you have a favorite book on statistical research methods?
Let me know in the comments below.

Summary

You need to have a grounding in statistics to be effective at applied machine learning.

This grounding does not have to come first, but it needs to happen at some point on your journey.

I think your path through statistics should start with a book, but it really must involve a lot of practice. It is an applied field. I recommend developing code examples for every key concept that you learn along the way.
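
For instance, here is a minimal sketch of that kind of practice, assuming NumPy is installed; it checks the central limit theorem by drawing repeated samples from a skewed distribution (the distribution, sample sizes, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Draw 1,000 samples of size 50 from a (non-normal) exponential distribution.
sample_means = [rng.exponential(scale=1.0, size=50).mean() for _ in range(1000)]

# Per the central limit theorem, the sample means should be approximately
# normally distributed around the population mean (1.0 for this exponential).
print(f"mean of sample means: {np.mean(sample_means):.3f}")
print(f"std of sample means:  {np.std(sample_means):.3f}")
```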

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Have you read any great books on statistics?
Let me know in the comments below.



Saturday, March 26, 2022

Data Science Roadmap for 100 Days

Are you interested in learning Data Science but not sure where to start? If yes, then you have reached the right place.

I have come across many people who were very passionate about learning Data Science but quit after just a few weeks. I wondered: why would someone so passionate about a field stop pursuing it? By talking to some of them, I understood that the main reasons people dropped out were:

  • They were overwhelmed by the number of topics to learn to become a Data Scientist
  • They came across gatekeepers who say that to become a Data Scientist one needs to be a talented programmer, an expert in mathematics, a master of applied statistics, and very skillful in using Pandas, NumPy, and other Python libraries.

These requirements are enough to scare even an experienced Data Scientist, so it is no wonder they made people attempting to learn Data Science quit. Each of the above topics is like an ocean, and when someone tries to master them all quickly, they get frustrated and quit the learning journey. The truth is that one needs to know just enough of the above topics to become a successful Data Scientist or to get hired as one.

Show me the Path to Learn Data Science

Photo by Joshua Earle on Unsplash

To become a Data Scientist, you need to learn just enough of the following topics:

  • Basics of Python or R programming
  • If you choose Python, then libraries like Pandas and NumPy
  • Visualization libraries like ggplot, Seaborn, and Plotly.
  • Statistics
  • SQL Programming
  • Mathematics especially Linear Algebra and Calculus

In the video below, I give a step-by-step guide to learning Data Science and explain the depth of knowledge required to reach different levels of expertise.

How to plan the learning? Which topics should be covered first?

Let me clearly explain the plan to learn Data Science in 100 days. Below is a day-by-day plan to learn Data Science using Python; it spans 100 days and requires spending at least an hour each day.

Day 1: Installation of Tools

Just ensure that the required tools are installed and that you become comfortable with the tool you are going to use for the next few weeks/months. If you choose Python, then install Anaconda, which also installs the IDEs Jupyter Notebook and Spyder. In case you choose R, install RStudio. Try to play around with the IDE and become comfortable using it: learn how to install packages/libraries, execute a portion of the code, clear the memory, and so on.

Day 2 to Day 7: Basic Programming for Data Science

The next step is to learn basic programming. Below are some of the topics that should be learned; a short sketch follows the list:

  • Creation of Variable
  • String Data Type and operation commonly performed on a String Data type
  • Numeric Data type, Boolean and Operators
  • Collection data types List, Tuple, Sets, and Dictionary — it is important to understand the uniqueness of each and the differences between them
  • If-Then-Else Conditions, For Loop and While Loop Implementations
  • Functions and Lambda Function — Benefits of each of them and their differences
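
Here is a short sketch covering several of these topics in one place; the values and names are purely illustrative:

```python
# Collection types differ in mutability, ordering, and uniqueness.
fruits_list = ["apple", "banana", "apple"]    # ordered, mutable, allows duplicates
fruits_tuple = ("apple", "banana")            # ordered, immutable
fruits_set = {"apple", "banana", "apple"}     # unordered, unique elements only
fruit_prices = {"apple": 1.2, "banana": 0.5}  # dictionary of key-value pairs

# If-then-else conditions inside a for loop.
for fruit, price in fruit_prices.items():
    if price > 1.0:
        print(f"{fruit} is expensive")
    else:
        print(f"{fruit} is cheap")

# A named function and an equivalent lambda function.
def double(x):
    return 2 * x

double_lambda = lambda x: 2 * x
print(double(3), double_lambda(3))  # 6 6
```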

Day 8 to Day 17: Pandas Library

Learn about the Pandas library. Some of the topics to learn in Pandas are:

  • Creating a data frame, reading data from a file, and writing a data frame to a file
  • Indexing and Selection of data from a data frame
  • Iteration and Sorting
  • Aggregation and Group By
  • Missing Values and handling of missing values
  • Renaming and Replacing in Pandas
  • Concatenating, Merging, and Joining in a data frame
  • Summary Analysis, Cross Tabulation, and Pivot
  • Date, Categorical and Sparse Data

Spend a good 10 days thoroughly learning the above topics, as they will be very useful when you perform exploratory data analysis. While covering them, go into the granular details, like understanding the differences between merge and join or between crosstab and pivot; that way you not only learn about each of them but also know when and where to use them.

Why should I learn Pandas? Any Data Science project begins with exploratory data analysis to understand the data better, and the topics you have covered in Pandas come in handy there. Pandas also helps in reading data from different sources and formats, is fast and efficient, and provides easy functionality for performing various operations on a dataset.
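
Here is a minimal sketch of a few of these Pandas operations, using a small made-up data frame (reading from a real file would use pd.read_csv or similar):

```python
import pandas as pd

# Create a data frame from a dictionary.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "London", "London"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [100, 120, 90, None],  # a missing value to handle
})

# Selection, missing-value handling, and group-by aggregation.
recent = df[df["year"] == 2021]                       # boolean indexing
df["sales"] = df["sales"].fillna(df["sales"].mean())  # impute missing values
summary = df.groupby("city")["sales"].agg(["mean", "sum"])
print(summary)

# Merging two data frames on a key column.
populations = pd.DataFrame({"city": ["Paris", "London"], "population": [2.1, 8.9]})
merged = df.merge(populations, on="city", how="left")
print(merged)
```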

Day 18 to Day 22: Numpy Library

Having learned Pandas, the next important library to learn is NumPy. The reason to learn NumPy is that its arrays are very fast compared to Python lists. The topics to cover in NumPy include:

  • Creation of an Array
  • Indexing and Slicing
  • Data Types
  • Joining and splitting
  • Searching and Sorting
  • Filtering required data elements

Why is it important to learn NumPy? NumPy enables scientific operations on data in a fast and efficient way. It supports efficient matrix operations, which are commonly used in machine learning algorithms, and the Pandas library itself uses NumPy extensively.
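
A short sketch of these operations, with illustrative values:

```python
import numpy as np

# Creating arrays, indexing, slicing, filtering, and sorting.
a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(a[2:5])       # slicing: [4 1 5]
print(a[a > 3])     # filtering with a boolean mask: [4 5 9 6]
print(np.sort(a))   # sorting

# Vectorized operations avoid slow Python-level loops.
big = np.arange(1_000_000)
squared = big ** 2  # one vectorized operation over a million elements

# Matrix operations of the kind used in machine learning algorithms.
m = np.array([[1, 2], [3, 4]])
v = np.array([1, 0])
print(m @ v)        # matrix-vector product: [1 3]
```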

Day 23 to Day 25: Visualizations

Now it’s time to spend some quality time understanding and using some of the key visualization libraries like ggplot, Plotly, and Seaborn. Use a sample dataset and try different visualizations such as bar charts, line/trend charts, box plots, scatter plots, heatmaps, pie charts, histograms, bubble charts, and other interesting or interactive visualizations.

Photo by Luke Chesser on Unsplash

The key in a Data Science project is the communication of insights to the stakeholders, and visualizations are a great tool for that purpose.
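
As one possible starting point, here is a minimal sketch using Seaborn and Matplotlib with Seaborn’s example tips dataset (fetched by load_dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn provides sample datasets, which are handy for practice.
tips = sns.load_dataset("tips")

# A few of the chart types mentioned above, side by side.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(data=tips, x="total_bill", ax=axes[0])              # histogram
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[1])      # box plot
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[2])  # scatter plot
plt.tight_layout()
plt.show()
```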

Day 26 to Day 35: Statistics, Implementation, and Use-cases

The next important topic to cover is Statistics. Explore the commonly used descriptive statistics techniques such as mean, median, mode, range analysis, standard deviation, and variance.

Then cover slightly deeper techniques such as identifying outliers in a dataset and measuring the margin of error.

As a final step, start exploring the various statistical tests listed below and understand their application in real life; a short sketch follows the list.

  • F-test
  • ANOVA
  • Chi-Squared Test
  • T-Test
  • Z-Test
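
Here is the promised sketch: descriptive statistics and a two-sample t-test using NumPy and SciPy, on made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Descriptive statistics on a sample.
sample = rng.normal(loc=50, scale=10, size=100)
print(f"mean={sample.mean():.2f}, median={np.median(sample):.2f}, "
      f"std={sample.std(ddof=1):.2f}")

# A two-sample t-test: do two groups share the same mean?
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=55, scale=10, size=100)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t={t_stat:.2f}, p={p_value:.4f}")  # a small p suggests the means differ
```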

Day 36 to Day 40: SQL for Data Analysis

Now it’s time to learn some SQL. This is important because in most corporate use-cases the data is stored in a database, and knowing SQL greatly helps in querying the required data from the system for analysis.

You can start by installing an open-source database such as MySQL; it comes with some default databases, so just play around with the data and learn SQL. It will be good if you can focus on learning the following; a sketch follows the list:

  • Selecting data from a table
  • Joining data from different tables based on a key
  • Performing Group by and Aggregation functions on data
  • Use of case statements and filter conditions
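
Here is the promised sketch. Rather than assuming a MySQL installation, it uses Python’s built-in sqlite3 module with a made-up schema, but the concepts (joins, group by, aggregation, case statements) carry over directly:

```python
import sqlite3

# An in-memory SQLite database is an easy way to practice SQL without a server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0);
""")

# Join two tables on a key, then group, aggregate, and use a case statement.
query = """
    SELECT c.name,
           COUNT(*) AS num_orders,
           SUM(o.amount) AS total,
           CASE WHEN SUM(o.amount) > 40 THEN 'high' ELSE 'low' END AS segment
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
"""
for row in conn.execute(query):
    print(row)  # ('Alice', 2, 80.0, 'high'), ('Bob', 1, 20.0, 'low')
```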

Day 41 to Day 50: Exploratory Data Analysis (EDA)

In any Data Science project, about 80% of the time is spent on this activity, so it is best to learn this topic thoroughly. There isn’t a specific set of functionalities or topics to cover for Exploratory Data Analysis; the dataset and the use-case drive the analysis. Hence it is preferable to take some sample datasets from competitions hosted on Kaggle and learn to perform exploratory analysis on them.

Another method for learning exploratory data analysis is to write down your own questions about the dataset and try to find the answers in the data. For example, with the popular Titanic dataset, try to find answers to questions like which gender, age group, or deck had a higher probability of dying, and so on. Your ability to perform a thorough analysis will improve with time, so be patient and learn slowly and confidently.
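
A minimal sketch of this question-driven style of analysis, using the version of the Titanic dataset that Seaborn can fetch (its column names differ slightly from the Kaggle version):

```python
import seaborn as sns

# Load a copy of the Titanic dataset, so this runs without a Kaggle download.
titanic = sns.load_dataset("titanic")

# Question: did survival rates differ by gender and passenger class?
print(titanic.groupby("sex")["survived"].mean())
print(titanic.groupby(["class", "sex"])["survived"].mean())

# Question: how does age relate to survival?
print(titanic.groupby("survived")["age"].describe())
```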

By now you have learned all the core skills required of a Data Scientist, and you are ready to learn algorithms.

What happened to Mathematics?

Yes, it is important to know linear algebra and calculus, but I would prefer not to spend time learning mathematics concepts up front; instead, refer back and brush up your skills as and when they are required. A high-school level of mathematics is sufficient. For example, if you are learning about Gradient Descent, then while learning the algorithm you can spend time on the mathematics behind it. If you set out to learn all the important concepts in mathematics first, it is very time consuming and you learn far more than what is required; by learning as and when needed, you learn just enough for the task at hand.
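
To make that concrete, here is a minimal sketch of gradient descent for simple linear regression in NumPy, on made-up data; the only mathematics needed at this point is the gradient of the mean squared error:

```python
import numpy as np

# Made-up data for y ≈ w * x + b, with true w=3, b=2, plus noise.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_pred = w * x + b
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean((y_pred - y) * x)
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should be close to 3 and 2
```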

Day 51 to Day 70: Supervised Learning and Project Implementation

Spend the first 10 days getting to know some of the key algorithms in Supervised Learning and understanding the math behind them, and the next 10 days focusing on learning by developing a project. Some of the algorithms that should be covered in this period are:

  • Linear Regression and Logistic Regression
  • Decision Tree / Random Forest
  • Support Vector Machine (SVM)

In the first 10 days, the focus should be on understanding the theory behind the algorithms you have chosen. Then spend some time understanding the scenarios where each algorithm is more suitable than the others; for example, Decision Trees work well when there are a lot of categorical attributes in the dataset.

Then pick a solved example on Kaggle; you will find ample solved examples. Re-execute them, but carefully understand each and every line of code and the reasoning behind it. By now you will have good theoretical knowledge as well as working knowledge from the solved examples.

As a final step, pick a project and implement a supervised learning algorithm: start with data collection, then exploratory analysis, feature engineering, model building, and model validation. There will definitely be a lot of questions and issues, but when you complete the project you will have gained very good knowledge of the algorithm and the methodology.
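
Here is a minimal sketch of the model-building and model-validation steps, using scikit-learn and its built-in breast cancer dataset as a stand-in for your own project data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# Split into training and validation sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a logistic regression model and validate it on held-out data.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, predictions):.3f}")
```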

Day 71 to Day 90: Unsupervised Learning and Project Implementation

Now it’s time to focus on unsupervised learning. Similar to the approach used for supervised learning, spend the initial days understanding the concepts behind the algorithms you have chosen, and then learn by implementing a project.

The algorithms that should be covered here are:

  • Clustering Algorithm — Used to identify Clusters in the dataset
  • Association Analysis — Used to identify patterns in the data
  • Principal Components Analysis — Used to reduce the number of attributes
  • Recommendation System — Used to identify similar users/products and to make recommendations

In the initial days, the focus should be on understanding each of the above algorithms and techniques, the purpose of each, and the scenarios where they can be used. For example, principal components analysis is generally used for dimensionality reduction when the dataset you are working with has a very large number of columns and you want to reduce it while still retaining the information; recommendation systems are popular in e-commerce, where items a customer is likely to be interested in are recommended, based on their purchase patterns, to increase sales.

When you are comfortable with the theory and the scenarios where they can be used, it is time to pick a solved example and learn by reverse engineering it, that is, understanding each and every line of code and re-executing it.

As a final step, it is time to pick a use-case and implement it based on your learnings so far. On completing the project/use-case you will have learned a lot, and you will have gained a much better understanding of these algorithms that will stay with you.
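
Here is a minimal sketch of two of these techniques, PCA and clustering, using scikit-learn and its built-in iris dataset as a stand-in:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# PCA: reduce 4 columns to 2 while retaining most of the information.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2%}")

# Clustering: find 3 groups in the reduced data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)
print(labels[:10])  # cluster assignments for the first 10 rows
```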

Day 91 to Day 100: Natural Language Processing Basics

Make use of this time to focus on analysis and use-cases for unstructured/text data. A few things worth spending time on here include the following; a small sentiment-analysis sketch follows the list:

  • Learn to use APIs to fetch data from public sources
  • Perform some basic sentiment analysis — data from the Twitter API can be used to extract tweets for a particular hashtag, and then the sentiment and emotions behind those tweets can be computed
  • Topic Modelling — this is useful when there is a large number of documents and you want to group them into different categories
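
Here is the promised sentiment-analysis sketch. Instead of live Twitter data, it scores two hard-coded example texts with NLTK’s VADER analyzer (the lexicon needs a one-time download):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon once; subsequent runs use the cached copy.
nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()
tweets = [
    "I love this new phone, the camera is amazing!",
    "Worst customer service I have ever experienced.",
]
for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)
    print(f"{scores['compound']:+.2f}  {tweet}")  # compound > 0 is positive
```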

That’s it! You have now covered all the important concepts and are ready to apply for Data Science jobs. I have started this journey of learning Data Science in 100 days on my YouTube channel; if you are interested, please join me and start your journey to learn Data Science here.

Start your journey here

FAQs

Can anyone become a Data Scientist in 100 Days?

Yes. Just like anyone can learn to swim in just a few days, anyone can learn Data Science in 100 days or even less. But just as in swimming one can become an elite or Olympic swimmer only through hard work and continuous practice, the same goes for Data Science: with practice and hard work you can become an expert.

If I follow this journey, how much will I have learned?

By the end of this journey, you will have enough knowledge to work on a typical Data Science project. You will also have broken the learning barrier, so with minimal effort and support you will be able to continue learning advanced topics in Data Science.

Final Message before Sign-Off

At first, things might look too complicated. Don’t get overwhelmed; just take one step at a time and continue your learning journey. It might take some time, but you will definitely reach your destination.
