In part one of the data science roadmap, we presented a structured outline for learning Python, SQL, and math for data science.
In this second part, we’ll go over a structured learning path for data analysis and machine learning in Python. As a data scientist, you should be able to analyze existing data to answer business questions, spot trends, and more. Data analysis and visualization are, therefore, integral parts of a data scientist’s toolbox.
Here, we present a step-by-step guide to learning powerful Python libraries for numeric computing and data analysis: NumPy and pandas. For data visualization, we suggest learning matplotlib and seaborn. You’ll end the section with a capstone project on Exploratory Data Analysis (EDA).
Data Analysis and Visualization
Week 1: Getting Started with NumPy
- Understanding multidimensional arrays and axes
- Creating NumPy arrays
- Indexing and slicing NumPy arrays
- Useful built-in functions: min(), max(), argmax(), and more
- Operations along different axes
- Reshaping arrays
- Broadcasting in NumPy arrays
Resources
Python NumPy Tutorial, CS231n @Stanford
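To make the Week 1 topics concrete, here’s a minimal sketch covering array creation, slicing, argmax(), axis-wise operations, reshaping, and broadcasting:

```python
import numpy as np

# A 2-D array: axis 0 runs down the rows, axis 1 across the columns.
arr = np.array([[3, 7, 1],
                [9, 2, 5]])

print(arr[0, 1:])               # slicing: row 0, columns 1 onward -> [7 1]
print(arr.max(), arr.argmax())  # global max (9) and its flat index (3)
print(arr.sum(axis=0))          # sum down each column -> [12  9  6]
print(arr.reshape(3, 2))        # same data, new shape

# Broadcasting: the 1-D array is stretched across each row of arr.
print(arr + np.array([10, 20, 30]))
```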
Weeks 2–3: Data Manipulation and Analysis with pandas
- Loading data from multiple sources such as CSV and Parquet files
- Basics of pandas Series and DataFrames
- Summary statistics on DataFrames
- Indexing and slicing DataFrames
- Filtering DataFrames
- pandas groupby()
- pandas apply()
- Pivot tables and joins (Advanced)
Resources
Data Analysis with pandas, Dataschool
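As a quick reference, here’s a minimal sketch of the core pandas operations above. The file sales.csv and its columns (region, product, units, price) are made up for illustration; swap in any dataset you have.

```python
import pandas as pd

# Hypothetical file and columns; pd.read_parquet() works the same way for Parquet.
df = pd.read_csv("sales.csv")  # assumed columns: region, product, units, price

print(df.describe())                        # summary statistics
print(df.loc[df["units"] > 100])            # filtering rows
print(df.groupby("region")["units"].sum())  # aggregate per group

# apply(): run a function across each row (axis=1).
df["revenue"] = df.apply(lambda row: row["units"] * row["price"], axis=1)

# Pivot table: total revenue per region and product.
print(df.pivot_table(values="revenue", index="region",
                     columns="product", aggfunc="sum"))
```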
Week 4: Data Visualization with Matplotlib
- Understanding the matplotlib figure object
- Creating line plots and scatter plots
- Understanding legends and subplots
- Styling matplotlib plots
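Here’s a minimal sketch of the matplotlib topics above: one figure object, two subplots, a line plot and a scatter plot, legends, and light styling.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)

# One figure object containing two subplots (axes).
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(x, np.sin(x), label="sin(x)", color="tab:blue", linestyle="--")
ax1.set_title("Line plot")
ax1.legend()

ax2.scatter(x, np.cos(x), label="cos(x)", s=15, color="tab:orange")
ax2.set_title("Scatter plot")
ax2.legend()

fig.tight_layout()
plt.show()
```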
Week 5: Data Visualization with Seaborn
- Review pandas and matplotlib
- Learn seaborn distribution and pair plots
- Learn boxplots and their interpretation
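A minimal sketch using seaborn’s built-in tips dataset to produce a distribution plot, a pair plot, and a boxplot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# seaborn ships small sample datasets; "tips" is a common starting point.
tips = sns.load_dataset("tips")

sns.histplot(data=tips, x="total_bill", kde=True)  # distribution plot
plt.show()

sns.pairplot(tips)  # pairwise relationships between numeric columns
plt.show()

# Boxplot: the box spans the interquartile range; points beyond the
# whiskers are potential outliers.
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()
```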
Capstone Project: Apply Learnings from Weeks 1–5
- Exploratory Data Analysis (EDA) on a dataset from Kaggle; pick a fun, beginner-friendly dataset to work with
- Apply all pandas methods you’ve learned
- Use seaborn to create explanatory plots
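To tie the pieces together, here’s a starter skeleton for the capstone EDA. The file name train.csv is a placeholder for whichever Kaggle dataset you choose, so adjust the column handling to your data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# "train.csv" stands in for whichever Kaggle dataset you download.
df = pd.read_csv("train.csv")

print(df.shape)                    # rows and columns
print(df.dtypes)                   # column types
print(df.isna().sum())             # missing values per column
print(df.describe(include="all"))  # summary statistics

# Visual pass: pairwise relationships among the numeric columns.
sns.pairplot(df.select_dtypes("number"))
plt.show()
```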
Machine Learning: From Data Cleaning to Algorithms
So far, you’ve learned foundational Python, SQL, and math skills, and you know how to perform exploratory data analysis on a dataset. As a data scientist, you should also be able to build machine learning models that solve business problems.
You’ve already learned how to scrape data from the web and import data from various sources into your working environment. This section outlines a guided learning path for you to progress through the machine learning pipeline from data cleaning to machine learning algorithms.
Week 1: Data Cleaning and Preprocessing (Beginner)
- Understanding missing values
- Dealing with missing values: Imputation techniques
- KNNImputer and IterativeImputer in scikit-learn
- Understanding outliers
- Detecting and removing outliers
- Encoding categorical variables
Practice:
- Choose a dataset
- Perform EDA applying data analysis and visualization techniques
- Use suitable imputation techniques to handle missing values
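Here’s a minimal sketch of the Week 1 techniques: KNN imputation, a simple IQR outlier rule, and one-hot encoding. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric data with gaps; a real dataset would come from your own EDA.
X = pd.DataFrame({"age": [25, np.nan, 47, 51, np.nan],
                  "income": [38_000, 42_000, np.nan, 90_000, 61_000]})

# KNN imputation: each missing value is filled from the 2 most similar rows.
X_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(X),
                        columns=X.columns)

# Simple IQR rule for detecting and removing outliers on one column.
q1, q3 = X_filled["income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = X_filled["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
X_clean = X_filled[mask]

# One-hot encode a categorical variable with pandas.
city = pd.Series(["NY", "SF", "NY", "LA", "SF"], name="city")
print(pd.get_dummies(city))
```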
Week 2: Machine Learning Basics (Beginner)
- Steps in the machine learning pipeline
- Supervised vs unsupervised learning
- Regression vs classification
- Review: Gradient descent optimization
- Understanding loss functions, hyperparameters, train-test splits, and validation datasets
- Getting started with the scikit-learn library
Practice:
- Read the scikit-learn docs
- Summarize your understanding of loss, gradients, parameters, and hyperparameters as a Jupyter notebook report
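To get hands-on with scikit-learn, here’s a minimal sketch of the basic pipeline: load data, hold out a test set, fit a model, and measure the loss on unseen data. It uses scikit-learn’s built-in diabetes dataset so it runs as-is.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Built-in regression dataset keeps the example self-contained.
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the data so the model is evaluated on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))  # loss on held-out data
```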
Week 3: Feature Engineering for Machine Learning (Intermediate)
- Need for feature engineering
- Revise: pandas groupby() and apply() methods
- Use the above methods to apply transformations to features
- Feature space reduction using PCA (you learned PCA in the Math (Linear Algebra) for Data Science section)
Practice:
- For the dataset you chose in week 1, understand existing features and perform feature engineering by applying the suitable techniques from above.
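A minimal sketch of these feature-engineering steps on a made-up customer table: a groupby()-based aggregation feature (via transform, a groupby variant), a row-wise apply(), and PCA on the scaled result.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up customer transactions for illustration.
df = pd.DataFrame({"customer": ["a", "a", "b", "b", "c"],
                   "amount": [20.0, 35.0, 10.0, 80.0, 55.0]})

# groupby + transform: attach each customer's mean spend as a new feature.
df["cust_mean"] = df.groupby("customer")["amount"].transform("mean")

# apply: an arbitrary row-wise transformation.
df["amount_ratio"] = df.apply(lambda r: r["amount"] / r["cust_mean"], axis=1)

# PCA on standardized numeric features reduces the feature space.
X = StandardScaler().fit_transform(df[["amount", "cust_mean", "amount_ratio"]])
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (5, 2)
```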
Week 4: Intro to Supervised Learning (Intermediate)
- Linear regression
- K Nearest Neighbors (KNN)
- Logistic regression
- Understanding regularization
- L1, L2 regularization (Ridge and Lasso regression)
- Understanding the bias-variance tradeoff
Practice:
- Reassess your dataset and the problem you’re trying to solve, then train a baseline model
- If it’s a regression problem, build a simple baseline linear regression model
- If it’s a classification problem, build a baseline logistic regression model
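Here’s a minimal baseline following the classification branch above, on scikit-learn’s built-in breast cancer dataset. Note that LogisticRegression applies L2 regularization by default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # built-in classification dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline classifier: logistic regression (L2-regularized by default in
# scikit-learn; the C parameter controls the regularization strength).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# For a regression problem, swap in LinearRegression as the baseline, and
# try Ridge (L2 penalty) and Lasso (L1 penalty) from sklearn.linear_model
# to see regularization's effect on the coefficients.
```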
Week 5: Model Evaluation and Hyperparameter Tuning (Intermediate)
- Evaluation metrics for regression
- Evaluation metrics for classification
- Importance of cross-validation
- K-fold cross-validation
- Grid search, Randomized search
Practice:
- Evaluate the model built in week 4
- What do the metrics suggest?
- Identify the best model hyperparameters using randomized search
- Re-evaluate the model now
- Is there improvement in performance?
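A minimal sketch of cross-validation and randomized search, tuning the KNN classifier from Week 4 on a built-in dataset:

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
model = KNeighborsClassifier()

# 5-fold cross-validation: five train/validation splits, averaged.
print("baseline CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Randomized search samples hyperparameter settings instead of trying them all.
search = RandomizedSearchCV(
    model,
    param_distributions={"n_neighbors": randint(1, 30),
                         "weights": ["uniform", "distance"]},
    n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds the model refit with the winning hyperparameters, ready for a final evaluation on your held-out test set.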
Week 6: Unsupervised Learning (Intermediate)
- Clustering algorithms
- Hierarchical clustering
- K-means clustering
- Density-based clustering (DBSCAN)
Practice:
- Choose a dataset suitable for clustering
- Identify which clustering algorithm is well suited to the task at hand and apply it
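A minimal sketch contrasting K-means and DBSCAN on synthetic blobs; the eps and min_samples values are illustrative and would be tuned for real data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic blobs keep the example self-contained.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

# K-means needs the number of clusters up front.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN infers the number of clusters from density; -1 marks noise points.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print(set(km_labels), set(db_labels))
```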
Week 7: Other Classification Algorithms (Intermediate)
- Support Vector Machines (SVM)
- Naive Bayes
Practice:
- Choose a dataset to perform text analysis
- Build a simple text classification model using the Naive Bayes algorithm
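A minimal sketch of the Naive Bayes practice task; the four-document corpus is made up, and a real exercise would use a labeled text dataset such as one from Kaggle.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus for illustration.
texts = ["free prize, claim now", "meeting at noon tomorrow",
         "win cash now", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["claim your free cash"]))  # likely "spam"
```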
Week 8: Tree-Based and Ensemble Models (Advanced)
- Decision tree
- Random forests
- Bagging and boosting
- XGBoost
Practice:
- Train a baseline classification model
- Apply ensemble models. Is there a performance improvement?
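A minimal sketch of this comparison: a single decision tree versus a random forest (a bagged ensemble of trees) on a built-in dataset. XGBoost is a separate library but follows the same fit/predict pattern.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Single decision tree as the baseline.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Random forest: a bagged ensemble of decorrelated trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("forest accuracy:", forest.score(X_test, y_test))
# XGBoost (a boosted-tree library) uses the same interface:
#   from xgboost import XGBClassifier; XGBClassifier().fit(X_train, y_train)
```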
I hope you found this data science roadmap helpful. It’s time to start coding your way through the curriculum. Happy learning and coding!