Saturday, March 26, 2022

Data Science Road map for 100 days

  Are you interested in learning Data Science but not sure where to start? If yes, then you have reached the right place.

I have come across many people who were very passionate about learning Data Science but quit within just a few weeks. I wondered why someone so passionate about a field would stop pursuing it. After talking to some of them, I understood that the main reasons people dropped out of learning were:

  • They were overwhelmed by the number of topics to learn to become a Data Scientist
  • They come across gatekeepers who say that to become a Data Scientist one needs to be a talented programmer, an expert in mathematics, a master of applied statistics, and highly skilled in using Pandas, NumPy, and other Python libraries.

These requirements are enough to scare even an experienced Data Scientist, so it is no wonder they made people attempting to learn Data Science quit. Each of the above topics is like an ocean, and when someone tries to master them quickly they get frustrated and abandon the learning journey. The truth is that one needs to know just enough of these topics to become a successful Data Scientist or to get hired as one.

Show me the Path to Learn Data Science

To become a Data Scientist, one needs to learn just enough of the topics below:

  • Basics of Python or R programming
  • If you choose Python, then libraries like Pandas and NumPy
  • Visualization libraries like ggplot, Seaborn, and Plotly.
  • Statistics
  • SQL Programming
  • Mathematics, especially Linear Algebra and Calculus

In the video below, I have given a step-by-step guide to learning Data Science and explained the depth of knowledge required to reach different levels of expertise in Data Science.

How to plan the learning? Which topics should be covered first?

Let me clearly explain the plan to learn Data Science in 100 days. Below is a day-by-day plan to learn Data Science using Python. The plan spans 100 days and requires spending at least an hour each day.

Day 1: Installation of Tools

Just ensure that the required tools are installed and that you become comfortable with the tool you are going to use for the next few weeks/months. If you choose Python, install Anaconda, which also installs the IDEs Jupyter Notebook and Spyder. If you choose ‘R’, install RStudio. Play around with the IDE and get comfortable using it: learn how to install packages/libraries, execute a portion of the code, clear the memory, and so on.
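
For example, assuming you chose the Python track, a minimal sketch like the one below can confirm that the core libraries are installed and importable:

```python
# Quick sanity check that the core libraries are installed and importable.
# Run this in a Jupyter Notebook cell (or any Python prompt).
import sys
import pandas as pd
import numpy as np

print("Python :", sys.version.split()[0])
print("pandas :", pd.__version__)
print("NumPy  :", np.__version__)
```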

Day 2 to Day 7: Basic Programming for Data Science

The next step is to learn basic programming. Below are some of the topics that should be learned (a short sketch follows the list):

  • Creation of variables
  • String data type and operations commonly performed on strings
  • Numeric data types, Booleans, and operators
  • Collection data types List, Tuple, Set, and Dictionary — it is important to understand the uniqueness of and differences between each of them
  • If-Then-Else conditions, For loop and While loop implementations
  • Functions and lambda functions — the benefits of each and their differences
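
To give a flavor of what you will be writing, here is a small sketch touching each topic above (all the values are made up):

```python
# Variables and string operations
name = "Data Science"
print(name.upper(), len(name), name.replace("Science", "Analysis"))

# Numeric and boolean types with operators
days, hours_per_day = 100, 1
total_hours = days * hours_per_day
is_enough = total_hours >= 100  # True

# Collection types: list (mutable), tuple (immutable), set (unique), dict (key-value)
topics = ["python", "pandas", "numpy"]        # list
point = (3, 4)                                # tuple
unique_tags = {"stats", "sql", "stats"}       # set -> {"stats", "sql"}
levels = {"beginner": 1, "expert": 3}         # dict

# Conditions and loops
for topic in topics:
    if topic == "pandas":
        print("core library:", topic)

# Functions vs. lambdas: a named reusable block vs. a small inline expression
def square(x):
    return x * x

squares = list(map(lambda x: x * x, range(5)))
print(square(4), squares)
```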

Day 8 to Day 17: Pandas Library

Learn about the Pandas library. Some of the topics one needs to learn in Pandas are:

  • Creating a data frame, reading data from a file, and writing a data frame to a file
  • Indexing and Selection of data from a data frame
  • Iteration and Sorting
  • Aggregation and Group By
  • Missing Values and handling of missing values
  • Renaming and Replacing in Pandas
  • Concatenating, Merging, and Joining in a data frame
  • Summary Analysis, Cross Tabulation, and Pivot
  • Date, Categorical and Sparse Data

Spend a good 10 days thoroughly learning the above topics, as they will be very useful when you perform exploratory data analysis. While covering these topics, go into granular details, like understanding the differences between merge and join, or crosstab and pivot; that way you not only learn each of them but also know when and where to use them.
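
To illustrate those granular differences, here is a minimal sketch, using a made-up employees/salaries dataset, contrasting merge with join and crosstab with pivot_table, and handling the missing values an outer merge produces:

```python
import pandas as pd

# Two small frames sharing a key column
employees = pd.DataFrame({"emp_id": [1, 2, 3],
                          "dept":   ["ds", "ds", "eng"]})
salaries  = pd.DataFrame({"emp_id": [1, 2, 3, 4],
                          "salary": [90, 85, 70, 60]})

# merge: column-based and SQL-like; join: index-based by default
merged = employees.merge(salaries, on="emp_id", how="inner")
joined = employees.set_index("emp_id").join(salaries.set_index("emp_id"))

# crosstab counts frequencies; pivot_table aggregates a value column
print(pd.crosstab(merged["dept"], merged["salary"] > 80))
print(merged.pivot_table(index="dept", values="salary", aggfunc="mean"))

# Missing values: the unmatched row from the outer merge becomes NaN
outer = employees.merge(salaries, on="emp_id", how="outer")
print(outer.fillna({"dept": "unknown"}))
```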

Why should I learn Pandas? Any Data Science project begins with exploratory data analysis to understand the data better, and the topics covered in Pandas come in handy there. Pandas also helps in reading data from different sources and formats, is fast and efficient, and provides easy functionality for performing various operations on a dataset.

Day 18 to Day 22: NumPy Library

Having learned Pandas, the next important library to learn is NumPy. The reason to learn NumPy is that its arrays are very fast compared to Python lists. The topics to cover in NumPy include:

  • Creation of an Array
  • Indexing and Slicing
  • Data Types
  • Joining and Splitting
  • Searching and Sorting
  • Filtering required data elements

Why is it important to learn NumPy? NumPy enables performing scientific operations on data in a fast and efficient way. It supports efficient matrix operations, which are commonly used in machine learning algorithms, and the Pandas library itself uses NumPy extensively.
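
A minimal sketch of these ideas on small made-up arrays, including the vectorized and matrix operations mentioned above:

```python
import numpy as np

# Array creation, indexing, and slicing
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape, a[0, 2], a[:, 1])    # (2, 3), 3, column [2 5]

# Vectorized operations replace explicit Python loops
prices = np.array([10.0, 12.5, 9.9])
discounted = prices * 0.9           # applied element-wise, no loop

# Filtering with a boolean mask
print(prices[prices > 10])          # [12.5]

# Matrix multiplication, the workhorse of many ML algorithms
weights = np.array([0.2, 0.3, 0.5])
print(a @ np.ones(3), prices @ weights)
```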

Day 23 to Day 25: Visualizations

Now it's time to spend some quality time understanding and using some of the key visualization libraries like ggplot, Plotly, and Seaborn. Use a sample dataset and try different visualizations such as bar charts, line/trend charts, box plots, scatter plots, heatmaps, pie charts, histograms, bubble charts, and other interesting or interactive visualizations.
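
For instance, a minimal Seaborn sketch using its bundled 'tips' practice dataset (histplot assumes Seaborn 0.11 or newer) covering a histogram, a box plot, and a scatter plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn bundles a few small practice datasets
tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.histplot(tips["total_bill"], ax=axes[0])                      # distribution
sns.boxplot(x="day", y="total_bill", data=tips, ax=axes[1])       # spread by group
sns.scatterplot(x="total_bill", y="tip", data=tips, ax=axes[2])   # relationship
plt.tight_layout()
plt.show()
```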

The key to a Data Science project is communicating insights to the stakeholders, and visualizations are a great tool for that purpose.

Day 26 to Day 35: Statistics, Implementation, and Use-cases

The next important topic to cover is Statistics. Explore the commonly used descriptive statistics techniques such as mean, median, mode, range analysis, standard deviation, and variance.

Then cover slightly deeper techniques, such as identifying outliers in a dataset and measuring the margin of error.

As a final step, start exploring the various statistical tests listed below and understand their application in real life (a small sketch follows the list):

  • F-test
  • ANOVA
  • Chi-Squared Test
  • T-Test
  • Z-Test
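
Here is a minimal sketch on synthetic data showing the descriptive statistics, an IQR-based outlier check, and two of the tests above (t-test and ANOVA) using SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=100)   # e.g. test scores of group A
group_b = rng.normal(loc=52, scale=5, size=100)   # e.g. test scores of group B

# Descriptive statistics and a simple IQR-based outlier check
q1, q3 = np.percentile(group_a, [25, 75])
iqr = q3 - q1
outliers = group_a[(group_a < q1 - 1.5 * iqr) | (group_a > q3 + 1.5 * iqr)]
print(group_a.mean(), np.median(group_a), group_a.std(), len(outliers))

# Two-sample t-test: are the two group means significantly different?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# One-way ANOVA generalizes this comparison to three or more groups
f_stat, p_anova = stats.f_oneway(group_a, group_b, rng.normal(51, 5, 100))
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```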

Day 36 to Day 40: SQL for Data Analysis

Now it's time to learn some SQL. This is important because in most corporate use-cases the data will be stored in a database, and knowing SQL will greatly help in querying the required data from the system for analysis.

You can start by installing an open-source database such as MySQL; it comes with some default databases, so just play around with the data and learn SQL. It would be good to focus on learning the below (a small sketch follows the list):

  • Selecting data from a table
  • Joining data from different tables based on a key
  • Performing Group by and Aggregation functions on data
  • Use of case statements and filter conditions
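
If installing MySQL feels heavy at first, Python's built-in sqlite3 module is enough to practice every item on this list. A minimal sketch with made-up customers and orders tables:

```python
import sqlite3

# An in-memory SQLite database is enough to practice the basics
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 45.0);
""")

# Join, group by, aggregation, and a CASE expression in one query
query = """
    SELECT c.name,
           COUNT(*)      AS num_orders,
           SUM(o.amount) AS total,
           CASE WHEN SUM(o.amount) > 100 THEN 'high' ELSE 'low' END AS segment
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY total DESC;
"""
for row in conn.execute(query):
    print(row)
```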

Day 41 to Day 50: Exploratory Data Analysis (EDA)

In any Data Science project, about 80% of the time is spent on this activity, so it is best to spend time learning this topic thoroughly. There isn't a specific set of functionalities or topics to cover for Exploratory Data Analysis; the dataset and the use-case drive the analysis. Hence it is preferable to take sample datasets from competitions hosted on Kaggle and learn to perform exploratory analysis on them.

Another way to learn exploratory data analysis is to write down your own questions about the dataset and try to answer them from the data. For example, with the popular Titanic dataset, try to find answers to questions like: people of which gender/age/deck had a higher probability of dying? Your ability to perform a thorough analysis will improve with time, so be patient and learn slowly and confidently.
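
For example, a minimal question-driven sketch on the Titanic dataset, using the copy that ships with Seaborn (downloaded on first use):

```python
import seaborn as sns

# Seaborn bundles a copy of the Titanic dataset for practice
titanic = sns.load_dataset("titanic")

# Frame a question, then answer it from the data:
# "Which gender and class had a higher probability of surviving?"
print(titanic.groupby("sex")["survived"].mean())
print(titanic.groupby(["class", "sex"])["survived"].mean().unstack())

# How bad are the missing values, and in which columns?
print(titanic.isna().mean().sort_values(ascending=False).head())
```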

By now you have learned all the core skills required of a Data Scientist; now you are ready to learn algorithms.

What happened to Mathematics?

Yes, it is important to know Linear Algebra and Calculus, but I would prefer not to spend time upfront learning mathematics concepts; instead, refer to them and brush up your skills as and when they are required. A high-school level of mathematics is sufficient. For example, if you are learning about Gradient Descent, spend time on the mathematics behind it while learning the algorithm. Starting with all the important concepts in mathematics would be very time-consuming, and you would learn far more than what is required; by learning as and when needed, you learn just enough for the task at hand.
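
As an example of learning the math only when you need it, here is a minimal gradient descent sketch that fits a straight line to made-up data; the only mathematics involved is the derivative of the squared error:

```python
import numpy as np

# Gradient descent for simple linear regression: y ≈ w*x + b
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 7.0 + rng.normal(0, 1, 200)   # true w=3, b=7, plus noise

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    error = (w * x + b) - y
    w -= lr * (2 * error * x).mean()   # d(MSE)/dw
    b -= lr * (2 * error).mean()       # d(MSE)/db

print(f"learned w={w:.2f}, b={b:.2f}")  # should approach 3 and 7
```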

Day 51 to Day 70: Supervised Learning and Project Implementation

Spend the first 10 days getting to know some of the key algorithms in Supervised Learning and understanding the math behind them, then spend the next 10 days learning by developing a project. Some of the algorithms to cover in this period are:

  • Linear Regression and Logistic Regression
  • Decision Tree / Random Forest
  • Support Vector Machine (SVM)

In the first 10 days, the focus should be on understanding the theory behind the algorithms you have chosen. Then spend some time understanding the scenarios where each algorithm is more suitable than the others; for example, Decision Trees work well when there are a lot of categorical attributes in the dataset.

Then pick a solved example on Kaggle (you will find ample solved examples) and try to re-execute it, carefully understanding each and every line of the code and the reason behind it. By now you will have good theoretical knowledge as well as working knowledge from the solved examples.

As a final step, pick a project and implement a supervised learning algorithm end to end: data collection, exploratory analysis, feature engineering, model building, and model validation. There will definitely be a lot of questions and issues, but when you complete the project you will have gained very good knowledge of the algorithm and the methodology.
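
A minimal sketch of that end-to-end flow with scikit-learn, substituting one of its bundled datasets for the data-collection step and a simple scaling step for feature engineering:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# A bundled dataset stands in for the data-collection step
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling is a simple stand-in for feature engineering
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Model validation: a held-out test set plus cross-validation
print(classification_report(y_test, model.predict(X_test)))
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
```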

Day 71 to Day 90: Unsupervised Learning and Project Implementation

Now it's time to focus on unsupervised learning. Similar to the method used for supervised learning, spend the initial days understanding the concepts behind the unsupervised algorithms you have chosen, and then learn by implementing a project.

The algorithms that should be covered here are:

  • Clustering Algorithm — Used to identify Clusters in the dataset
  • Association Analysis — Used to identify patterns in the data
  • Principal Components Analysis — Used to reduce the number of attributes
  • Recommendation System — Used to identify similar users/products and to make recommendations

In the initial days, the focus should be on understanding each of the above algorithms and techniques, their purpose, and the scenarios where they can be used. For example, Principal Components Analysis is generally used for dimensionality reduction when the dataset has a very large number of columns and you want to reduce them while still retaining most of the information. Recommendation systems are popular in e-commerce, where items a customer would likely be interested in are recommended based on their purchase patterns to increase sales.
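
A minimal sketch combining two items from the list, PCA for dimensionality reduction followed by K-Means clustering, on scikit-learn's bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# PCA: compress 4 columns down to 2 while retaining most of the variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("variance retained:", pca.explained_variance_ratio_.sum())

# K-Means: find groups in the data without using any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```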

When you are comfortable with the theory and the scenarios where these techniques apply, it is time to pick a solved example and learn by reverse engineering it, that is, understanding each and every line of code and re-executing it.

As a final step, it is time to pick a use-case and implement it based on your learning so far. On completing the project/use-case, you will have learned a lot and gained a much better understanding of these algorithms, and that will stay with you.

Day 91 to Day 100: Natural Language Processing Basics

Make use of this time to focus on analysis and use-cases for unstructured/text data. A few things worth spending time on here include (a small sketch follows the list):

  • Learn to use APIs to fetch data from public sources
  • Perform some basic sentiment analysis — data from the Twitter API can be used to extract tweets with a particular hashtag, and then the sentiment and the emotions behind those tweets can be computed
  • Topic Modelling — this is useful when there are a large number of documents and you want to group them into different categories
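
A minimal sentiment sketch using NLTK's VADER analyzer on a few made-up tweets, skipping the Twitter API step (which requires credentials):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

tweets = [
    "Loving this #DataScience roadmap, learning so much!",
    "Day 40 and I'm completely stuck. Frustrating.",
    "SQL joins finally make sense now.",
]
for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)
    print(f"{scores['compound']:+.2f}  {tweet}")  # compound score in [-1, 1]
```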

That’s it! You have now covered all the important concepts and are ready to apply for Data Science jobs. I have started this journey of learning Data Science in 100 Days on my YouTube channel; if you are interested, please join me and start your journey to learn Data Science here.

Start your journey here

FAQs

Can anyone become a Data Scientist in 100 Days?

Yes. Just like anyone can learn to swim in just a few days, anyone can learn Data Science in 100 days or even less. But just as in swimming one becomes an elite or Olympic swimmer only through hard work and continuous practice, the same goes for Data Science: with practice and hard work you can become an expert.

If I follow this Journey, How much would I have learned?

By the end of this journey, you will have enough knowledge to work on a typical Data Science project. You will also have broken the learning barrier, and hence, with minimal effort and support, you will be able to continue learning advanced topics in Data Science.

Final Message before Sign-Off

At first, things might look too complicated. Don't get overwhelmed; just take one step at a time and continue your learning journey. It might take some time, but you will definitely reach your destination.
