Data Science Roadmap, Part 2: Learn Data Analysis and Machine Learning

Part one of the data science roadmap gave you a structured outline for learning Python, SQL, and math for data science.

In this second part, we’ll go over a structured learning path for data analysis and machine learning in Python. As a data scientist, you should be able to analyze existing data to answer business questions, identify trends, and more. Data analysis and visualization are, therefore, integral parts of a data scientist’s toolbox.

Here, we present a step-by-step guide to learning powerful Python libraries for numeric computing and data analysis: NumPy and pandas. For data visualization, we suggest learning matplotlib and seaborn. You’ll end the section with a capstone project on Exploratory Data Analysis (EDA).

Data Analysis and Visualization

Week 1: Getting Started with NumPy

  • Understanding multidimensional arrays and axes
  • Creating NumPy arrays
  • Indexing and slicing NumPy arrays
  • Useful built-in functions: min(), max(), argmax(), and more
  • Operations along different axes
  • Reshaping arrays
  • Broadcasting in NumPy arrays
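
To make the Week 1 topics concrete, here is a minimal sketch that touches axes, slicing, built-in reductions, reshaping, and broadcasting. The array `scores` and its values are invented for illustration.

```python
import numpy as np

# A 2-D array: axis 0 runs down the rows, axis 1 runs across the columns.
scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(scores.shape)      # (rows, columns)
print(scores[0, 1:])     # slice: first row, from the second column onward

# Reductions along an axis collapse that axis.
col_max = scores.max(axis=0)        # one maximum per column
row_argmax = scores.argmax(axis=1)  # index of the largest value in each row

# Reshaping keeps the same six values but changes the layout.
reshaped = scores.reshape(3, 2)

# Broadcasting: the column means have shape (3,) and are stretched
# across both rows of the (2, 3) array during the subtraction.
centered = scores - scores.mean(axis=0)
print(centered)
```

A useful exercise is to predict the shape of each result before printing it, then check your guess.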

Resources

Python NumPy Tutorial, CS231n @Stanford

Weeks 2 and 3: Data Manipulation and Analysis with pandas

  • Loading data from multiple sources such as CSV and Parquet files
  • Basics of pandas Series and DataFrames
  • Summary statistics on data frames
  • Indexing and slicing of data frames
  • Filtering data frames
  • pandas groupby()
  • pandas apply()
  • Pivot tables and joins (Advanced)
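
The pandas topics above can be sketched on a tiny in-memory table. The `sales` and `managers` data frames and all of their values are hypothetical; with real data you would typically start from `pd.read_csv()` or `pd.read_parquet()` instead.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "product": ["a", "a", "b", "b"],
    "units": [10, 4, 7, 9],
    "price": [2.0, 2.0, 5.0, 5.0],
})

# apply() with axis=1 computes one value per row.
sales["revenue"] = sales.apply(lambda row: row["units"] * row["price"], axis=1)

# Filtering with a boolean mask.
big_orders = sales[sales["units"] > 5]

# groupby(): split by region, then sum the revenue within each group.
revenue_by_region = sales.groupby("region")["revenue"].sum()

# Pivot table: regions as rows, products as columns, summed units as values.
pivot = sales.pivot_table(index="region", columns="product",
                          values="units", aggfunc="sum")

# Join: attach a manager to every sale by matching on the region column.
managers = pd.DataFrame({"region": ["north", "south"],
                         "manager": ["Ana", "Raj"]})
joined = sales.merge(managers, on="region", how="left")

print(sales.describe())   # summary statistics for the numeric columns
print(revenue_by_region)
print(pivot)
```

For a simple product of two columns, `sales["units"] * sales["price"]` is faster than `apply()`; the row-wise version is shown only because `apply()` is on the topic list.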

Resources

Data Analysis with pandas, Dataschool

Week 4: Data Visualization with Matplotlib

  • Understanding the matplotlib figure object
  • Creating line plots and scatter plots
  • Understanding legends and subplots
  • Styling matplotlib plots
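
As a rough sketch of how the figure object, subplots, legends, and styling fit together, the example below draws a line plot and a scatter plot side by side. The data is synthetic, and the `Agg` backend is selected only so the script also runs without a display; in a Jupyter notebook you can drop that line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)
rng = np.random.default_rng(0)

# One figure holding two axes (subplots) arranged in a single row.
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Line plot with two labelled curves and a legend.
ax_line.plot(x, np.sin(x), label="sin(x)", linestyle="--", color="tab:blue")
ax_line.plot(x, np.cos(x), label="cos(x)", color="tab:orange")
ax_line.set_title("Line plot")
ax_line.set_xlabel("x")
ax_line.legend()

# Scatter plot of noisy samples, styled with marker size and transparency.
noisy = np.sin(x) + rng.normal(scale=0.2, size=x.size)
ax_scatter.scatter(x, noisy, s=15, alpha=0.7)
ax_scatter.set_title("Scatter plot")
ax_scatter.set_xlabel("x")

fig.tight_layout()
```

Working through the axes objects (`ax_line`, `ax_scatter`) rather than the global `plt` functions makes it clearer which subplot each call affects.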

Week 5: Data Visualization with Seaborn

  • Review pandas and matplotlib
  • Learn seaborn distribution and pair plots
  • Learn boxplots and their interpretation

Capstone Project: Apply What You Learned in Weeks 1–5

  • Exploratory Data Analysis (EDA) on a dataset from Kaggle. Here are some fun, beginner-friendly Kaggle datasets to choose from.
  • Apply all pandas methods you’ve learned
  • Use seaborn to create explanatory plots

Machine Learning: From Data Cleaning to Algorithms

So far, you’ve learned the foundational Python, SQL, and math skills, and you now know how to perform exploratory data analysis on a dataset. As a data scientist, you should also be able to build machine learning models that solve business problems.

You’ve already learned how to scrape data from the web and import data from various sources into your working environment. This section outlines a guided learning path for you to progress through the machine learning pipeline from data cleaning to machine learning algorithms.

Week 1: Data Cleaning and Preprocessing (Beginner)

  • Understanding missing values
  • Dealing with missing values: Imputation techniques
  • KNNImputer and IterativeImputer in scikit-learn
  • Understanding outliers
  • Detecting and removing outliers
  • Encoding categorical variables
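
The cleaning steps above can be sketched end to end on a small table. The data frame is invented for illustration, and the 1.5 × IQR rule used here is only one common convention for flagging outliers, not the only option.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with missing values and one implausible age.
df = pd.DataFrame({
    "age": [25.0, 27.0, np.nan, 26.0, 24.0, 95.0],
    "income": [40.0, 42.0, 41.0, np.nan, 39.0, 43.0],
    "city": ["paris", "lima", "paris", "oslo", "lima", "oslo"],
})
print(df.isna().sum())  # count missing values per column

# Imputation: fill each gap with the mean of the 2 most similar rows.
numeric_cols = ["age", "income"]
imputer = KNNImputer(n_neighbors=2)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Outlier detection with the IQR rule: keep values within
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range]

# One-hot encoding: one indicator column per city.
encoded = pd.get_dummies(cleaned, columns=["city"])
print(encoded)
```

Whether to remove, cap, or keep an outlier depends on the problem; dropping the row is just the simplest choice for a sketch.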

Practice:

  • Choose a dataset.
  • Perform EDA applying data analysis and visualization techniques
  • Use suitable imputation techniques to handle missing values

Week 2: Machine Learning Basics (Beginner)

  • Steps in the machine learning pipeline
  • Supervised vs unsupervised learning
  • Regression vs classification
  • Review: Gradient descent optimization
  • Understanding loss functions, hyperparameters, the train-test split, and validation sets
  • Getting started with the scikit-learn library
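
To tie together loss, gradients, parameters, and hyperparameters, the sketch below fits a one-feature linear model with plain gradient descent. The data is synthetic (true slope 3.0, true intercept 0.5), and the learning rate and step count are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + 0.5 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(scale=0.1, size=200)

# Hold out a validation set that the training loop never sees.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def mse(w, b, X, y):
    """Loss function: mean squared error of the line w * x + b."""
    return np.mean((X[:, 0] * w + b - y) ** 2)

w, b = 0.0, 0.0        # parameters: learned from the data
learning_rate = 0.1    # hyperparameter: chosen by us, not learned
n_steps = 500          # hyperparameter
initial_loss = mse(w, b, X_train, y_train)

for _ in range(n_steps):
    error = X_train[:, 0] * w + b - y_train
    grad_w = 2 * np.mean(error * X_train[:, 0])  # d(loss)/dw
    grad_b = 2 * np.mean(error)                  # d(loss)/db
    w -= learning_rate * grad_w                  # step against the gradient
    b -= learning_rate * grad_b

val_loss = mse(w, b, X_val, y_val)
print(f"w={w:.2f}, b={b:.2f}, validation MSE={val_loss:.4f}")
```

Try changing `learning_rate` to a much larger or smaller value and watch how the validation loss responds.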

Practice:

  • Read the scikit-learn documentation
  • Summarize your understanding of loss, gradients, parameters, and hyperparameters in a Jupyter notebook report

Week 3: Feature Engineering for Machine Learning (Intermediate)

  • Need for feature engineering
  • Review: pandas groupby() and apply() methods
  • Use the above methods to apply a transformation to features
  • Feature space reduction using PCA (you learned PCA in the Math (Linear Algebra) for Data Science section)
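
One possible way to combine these ideas is sketched below: derive new features with groupby() and apply(), then compress the feature space with PCA. The `orders` table is hypothetical, and `transform("mean")` is used as a groupby helper that returns one value per original row.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical order history, invented for illustration.
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b", "c"],
    "amount": [10.0, 30.0, 5.0, 15.0, 10.0, 50.0],
    "items": [1, 3, 1, 2, 1, 5],
})

# groupby feature: each customer's average order amount, aligned to every row.
orders["customer_mean_amount"] = (
    orders.groupby("customer")["amount"].transform("mean")
)
# Ratio feature: how large is this order relative to the customer's average?
orders["amount_vs_mean"] = orders["amount"] / orders["customer_mean_amount"]

# apply() feature: a log transform to tame the skew in order amounts.
orders["log_amount"] = orders["amount"].apply(np.log1p)

# PCA: standardize first so no feature dominates because of its scale,
# then project the four features onto two principal components.
features = orders[["amount", "items", "customer_mean_amount", "log_amount"]]
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print(pca.explained_variance_ratio_)  # share of variance kept per component
```

The explained variance ratio is a quick check on how much information the reduced features retain.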

Practice:

  • For the dataset you chose in week 1, understand existing features and perform feature engineering by applying the suitable techniques from above.

Week 4: Intro to Supervised Learning (Intermediate)

  • Linear regression
  • K Nearest Neighbors (KNN)
  • Logistic regression
  • Understanding regularization
  • L1 and L2 regularization (Lasso and Ridge regression, respectively)
  • Understanding the bias-variance tradeoff
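
The difference between L1 and L2 regularization is easiest to see by comparing fitted coefficients. In this sketch the data is synthetic, only the first two of five features influence the target, and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: five features, but only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)    # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can set coefficients to zero

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    print(name, np.round(model.coef_, 3))
```

Ridge pulls every coefficient toward zero without eliminating any, while Lasso tends to drop the irrelevant features entirely. A stronger penalty adds bias but reduces variance, which is the tradeoff named in the last bullet.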

Practice:

  • Reassess your dataset and the problem you’re trying to solve by training a model on the dataset.
  • If it’s a regression problem, build a simple baseline linear regression model
  • If it’s a classification problem, build a baseline logistic regression model

Week 5: Model Evaluation and Hyperparameter Tuning (Intermediate)

  • Evaluation metrics for regression
  • Evaluation metrics for classification
  • Importance of cross-validation
  • K-fold cross-validation
  • Grid search, Randomized search
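
A minimal sketch of this workflow: score a baseline with k-fold cross-validation, tune one hyperparameter with randomized search, then evaluate on held-out data. The dataset is generated with `make_classification`, and the search range for `C` is an assumption chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import (
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)

# Synthetic binary classification problem.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 5-fold cross-validation of an untuned baseline.
baseline = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5,
                            scoring="accuracy")
print("baseline CV accuracy:", cv_scores.mean())

# Randomized search: sample 10 values of the regularization strength C
# from a log-uniform range and keep the one with the best CV score.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)

# Final check on data that played no part in tuning.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print("test accuracy:", test_accuracy, "test F1:", test_f1)
```

Grid search follows the same pattern with `GridSearchCV` and an explicit list of values in place of a distribution.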

Practice:

  • Evaluate the model built in week 4
  • What do the metrics suggest?
  • Identify the best model hyperparameters using randomized search
  • Re-evaluate the model now
  • Is there an improvement in performance?

Week 6: Unsupervised Learning (Intermediate)

  • Clustering algorithms
  • Hierarchical clustering
  • K-means clustering
  • Density-based clustering (DBSCAN)
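
The three clustering approaches can be compared on the same toy data. This sketch assumes three well-separated synthetic blobs; the `eps` and `min_samples` values for DBSCAN are picked by hand for this data and would need tuning elsewhere.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs of 50 points each.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.5, random_state=0)

# K-means: you choose the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering: merges points bottom-up
# until the requested number of clusters remains.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN: finds dense regions on its own; points in sparse regions
# are labelled -1 (noise) instead of being forced into a cluster.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print("k-means clusters:", len(set(kmeans_labels)))
print("hierarchical clusters:", len(set(hier_labels)))
print("DBSCAN clusters (excluding noise):", n_dbscan_clusters)
```

Note that DBSCAN is never told how many clusters to find, which makes it a useful contrast to the other two.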

Practice:

Week 7: Other Classification Algorithms (Intermediate)

  • Support Vector Machines (SVM)
  • Naive Bayes
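
Both classifiers share scikit-learn's fit and score interface, so a side-by-side comparison is short. This sketch uses the Iris dataset bundled with scikit-learn; the RBF kernel and `C=1.0` are default-style choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# SVMs are sensitive to feature scale, so standardize inside a pipeline.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)

# Gaussian Naive Bayes assumes each feature is normally distributed
# within a class and independent of the other features.
nb = GaussianNB().fit(X_train, y_train)

svm_accuracy = svm.score(X_test, y_test)
nb_accuracy = nb.score(X_test, y_test)
print(f"SVM accuracy: {svm_accuracy:.3f}")
print(f"Naive Bayes accuracy: {nb_accuracy:.3f}")
```

Putting the scaler in the pipeline ensures it is fit only on the training split, which avoids leaking test information.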

Practice:

Week 8: Tree-Based and Ensemble Models (Advanced)

  • Decision tree
  • Random forests
  • Boosting and Bagging
  • XGBoost
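
As a rough comparison of a single tree against ensembles, the sketch below cross-validates three models on synthetic data. scikit-learn's `GradientBoostingClassifier` stands in for boosting here so the example needs no extra installation; XGBoost is a separate library whose `XGBClassifier` follows the same fit and predict interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)

models = {
    # Baseline: one fully grown tree.
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Bagging-style ensemble: many trees trained on bootstrap samples.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Boosting: trees added one at a time, each correcting earlier errors.
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in mean_accuracy.items():
    print(f"{name}: {score:.3f}")
```

Ensembles often beat a single tree on data like this, but that is not guaranteed, so compare the printed scores rather than assuming an improvement.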

Practice:

  • Train a baseline classification model
  • Apply ensemble modeling. Is there a performance improvement?

I hope you found this data science roadmap helpful. It’s time to start coding your way through the curriculum. Happy learning and coding!
