#### Data Science Roadmap, Part 2: Learn Data Analysis and Machine Learning

In part one of the data science roadmap, you saw a structured outline for learning Python, SQL, and math for data science.

In this second part, we’ll go over a structured learning path for data analysis and machine learning in Python. As a data scientist, you should be able to analyze existing data to answer business questions, spot trends, and more. Data analysis and visualization are, therefore, integral parts of a data scientist’s toolbox.

Here, we present a step-by-step guide to learning powerful Python libraries for numeric computing and data analysis: NumPy and pandas. For data visualization, we suggest learning matplotlib and seaborn. You’ll end the section with a capstone project on Exploratory Data Analysis (EDA).

# Data Analysis and Visualization

## Week 1: Getting Started with NumPy

• Understanding multidimensional arrays and axes
• Creating NumPy arrays
• Indexing and slicing NumPy arrays
• Useful built-in functions: `min()`, `max()`, `argmax()`, and more
• Operations along different axes
• Reshaping arrays

### Resources

Python NumPy Tutorial, CS231n @Stanford
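
To make these topics concrete, here is a minimal sketch of the array operations listed above; the example array is arbitrary:

```python
import numpy as np

# A 2-D array: axis 0 runs down the rows, axis 1 across the columns.
a = np.array([[3, 7, 1],
              [9, 2, 5]])

print(a.shape)        # (2, 3)
print(a[0, 1])        # 7  (row 0, column 1)
print(a[:, 2])        # [1 5]  (slice: all rows, column 2)

# Built-in reductions, overall and along an axis.
print(a.max())        # 9
print(a.argmax())     # 3  (index into the flattened array)
print(a.min(axis=0))  # [3 2 1]  (column-wise minimum)
print(a.sum(axis=1))  # [11 16]  (row-wise sum)

# Reshaping: the total number of elements must stay the same.
b = a.reshape(3, 2)
print(b.shape)        # (3, 2)
```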

## Weeks 2–3: Data Manipulation and Analysis with pandas

• Basics of pandas Series and DataFrames
• Summary statistics on data frames
• Indexing and slicing of data frames
• Filtering data frames
• pandas `groupby()`
• pandas `apply()`
• Pivot tables and joins (Advanced)

### Resources

Data Analysis with pandas, Dataschool
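
As a quick illustration of filtering, `groupby()`, and `apply()`, here is a sketch on a small, made-up sales table:

```python
import pandas as pd

# A small, made-up sales table to exercise the methods above.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units":  [10, 4, 7, 12],
    "price":  [2.5, 3.0, 2.5, 3.0],
})

# Summary statistics and boolean filtering.
print(df["units"].describe())
big = df[df["units"] > 5]          # rows where more than 5 units sold

# groupby(): aggregate units per region.
per_region = df.groupby("region")["units"].sum()
print(per_region["north"])          # 17

# apply(): derive a revenue column row by row.
df["revenue"] = df.apply(lambda row: row["units"] * row["price"], axis=1)
print(df["revenue"].sum())          # 25.0 + 12.0 + 17.5 + 36.0 = 90.5
```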

## Week 4: Data Visualization with Matplotlib

• Understanding the matplotlib figure object
• Creating line plots and scatter plots
• Understanding legends and subplots
• Styling matplotlib plots
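
The ideas above fit in a few lines; the sketch below uses the non-interactive `Agg` backend so the figure renders without a display, and the data is made up:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend: render without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

# One figure object holding two subplots side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(x, y, label="y = x^2", color="tab:blue", linestyle="--")
ax1.set_title("Line plot")
ax1.legend()

ax2.scatter(x, y, color="tab:red")
ax2.set_title("Scatter plot")

fig.suptitle("The figure object holds both axes")
fig.savefig("demo.png")        # writes the styled figure to disk
```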

## Week 5: Data Visualization with Seaborn

• Review pandas and matplotlib
• Learn seaborn distribution and pair plots
• Learn boxplots and their interpretation
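
A boxplot’s components can be reproduced with NumPy, which helps with interpretation. The sketch below uses a made-up sample and the common 1.5 × IQR whisker convention that matplotlib and seaborn default to:

```python
import numpy as np

# Fabricated data with one obvious outlier.
data = np.array([4, 5, 5, 6, 6, 7, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach 1.5 * IQR past the quartiles by default;
# points beyond that are drawn individually as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print(q1, median, q3)     # 5.25 6.5 7.75
print(outliers)           # [30]
```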

## Capstone Project: Apply Learnings from Weeks 1–5

• Perform Exploratory Data Analysis (EDA) on a fun, beginner-friendly dataset from Kaggle
• Apply all pandas methods you’ve learned
• Use seaborn to create explanatory plots

# Machine Learning: From Data Cleaning to Algorithms

So far, you’ve learned the foundational Python, SQL, and math skills, and you know how to perform exploratory data analysis on a dataset. As a data scientist, you should also be able to build machine learning models that solve business problems.

You’ve already learned how to scrape data from the web and import data from various sources into your working environment. This section outlines a guided learning path for you to progress through the machine learning pipeline from data cleaning to machine learning algorithms.

## Week 1: Data Cleaning and Preprocessing (Beginner)

• Understanding missing values
• Dealing with missing values: Imputation techniques
• `KNNImputer` and `IterativeImputer` in scikit-learn
• Understanding outliers
• Detecting and removing outliers
• Encoding categorical variables

Practice:

• Choose a dataset
• Perform EDA, applying the data analysis and visualization techniques you’ve learned
• Use suitable imputation techniques to handle missing values
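
As a minimal illustration of imputation and encoding, here is a pandas-only sketch on a tiny, made-up table (scikit-learn’s `KNNImputer` and `IterativeImputer` are the more powerful options listed above):

```python
import numpy as np
import pandas as pd

# Toy data with a missing value and a categorical column.
df = pd.DataFrame({
    "age":  [25, 32, np.nan, 41],
    "city": ["rome", "oslo", "rome", "lima"],
})

# Simple median imputation: replace NaN with the column median.
df["age"] = df["age"].fillna(df["age"].median())
print(df["age"].tolist())        # [25.0, 32.0, 32.0, 41.0]

# One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["city"])
print(sorted(encoded.columns))   # ['age', 'city_lima', 'city_oslo', 'city_rome']
```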

## Week 2: Machine Learning Basics (Beginner)

• Steps in the machine learning pipeline
• Supervised vs unsupervised learning
• Regression vs classification
• Understanding loss functions, hyperparameters, the train-test split, and the validation dataset
• Getting started with scikit-learn library

Practice:

• Summarize your understanding of loss, gradients, parameters, and hyperparameters as a Jupyter notebook report
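
The train-test split and a loss function can be sketched from scratch in NumPy; the toy dataset below is fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 10 samples, 1 feature, noisy linear target.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(0, 0.1, 10)

# Train-test split: shuffle indices, hold out 30% for testing.
idx = rng.permutation(len(X))
n_test = int(0.3 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# A loss function scores predictions; mean squared error for regression:
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# A deliberately bad constant model vs. the true slope.
print(mse(y_test, np.full(n_test, y_train.mean())))  # large
print(mse(y_test, 3.0 * X_test.ravel()))             # near zero
```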

## Week 3: Feature Engineering for Machine Learning (Intermediate)

• Need for feature engineering
• Revise: pandas `groupby()` and `apply()` methods
• Use the above methods to apply a transformation to features
• Feature space reduction using PCA (you learned PCA in the Math (Linear Algebra) for Data Science section)

Practice:

• For the dataset you chose in week 1, understand existing features and perform feature engineering by applying the suitable techniques from above.
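
PCA itself can be sketched with NumPy’s SVD. The synthetic data below varies along only two directions, so two components capture essentially all the variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 samples in 3-D, but the data really varies along only 2 directions.
latent = rng.normal(size=(100, 2))
X = latent @ np.array([[1.0, 0.5, 0.0],
                       [0.0, 1.0, 2.0]])

# PCA via SVD: center, decompose, keep the top-k components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T              # project onto the top 2 components

explained = (S ** 2) / (S ** 2).sum()  # variance explained per component
print(X_reduced.shape)                 # (100, 2)
print(round(explained[:k].sum(), 4))   # ~1.0: two components suffice
```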

## Week 4: Intro to Supervised Learning (Intermediate)

• Linear regression
• K Nearest Neighbors (KNN)
• Logistic regression
• Understanding regularization
• L1 and L2 regularization (Lasso and Ridge regression)

Practice:

• Reassess your dataset and the problem you’re trying to solve by training a model on the dataset.
• If it’s a regression problem, build a simple baseline linear regression model
• If it’s a classification problem, build a baseline logistic regression model
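
As an illustration of linear regression with L2 regularization, here is a NumPy sketch that solves the ridge normal equations on fabricated data (for brevity, the bias term is regularized too, which libraries normally avoid):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy linear data: y = 2x + 1 + noise.
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.1, 50)

# Add a bias column, then solve the ridge normal equations:
#   w = (A^T A + alpha * I)^-1 A^T y
# alpha = 0 recovers ordinary linear regression; alpha > 0 is L2 regularization.
A = np.hstack([X, np.ones((50, 1))])
alpha = 0.1
w = np.linalg.solve(A.T @ A + alpha * np.eye(2), A.T @ y)

slope, intercept = w
print(round(slope, 2), round(intercept, 2))   # close to 2 and 1
```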

## Week 5: Model Evaluation and Hyperparameter Tuning (Intermediate)

• Evaluation metrics for regression
• Evaluation metrics for classification
• Importance of cross-validation
• K-fold cross-validation
• Grid search, Randomized search

Practice:

• Evaluate the model built in week 4
• What do the metrics suggest?
• Identify the best model hyperparameters using randomized search
• Re-evaluate the model now
• Is there improvement in performance?
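
K-fold cross-validation is easy to sketch by hand: shuffle the indices, split them into k folds, and hold out one fold at a time:

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Each sample appears in exactly one validation fold.
splits = list(kfold_indices(10, k=5))
print(len(splits))                       # 5
all_val = np.concatenate([v for _, v in splits])
print(sorted(all_val.tolist()))          # [0, 1, ..., 9]
```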

## Week 6: Unsupervised Learning (Intermediate)

• Clustering algorithms
• Hierarchical clustering
• K-means clustering
• Density-based clustering (DBSCAN)

Practice:

• Cluster the dataset from week 1 using K-means and interpret the resulting clusters
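
K-means itself fits in a few lines of NumPy. The sketch below uses naive initialization (the first k points; k-means++ is the usual choice) on two well-separated, made-up groups:

```python
import numpy as np

def kmeans(X, k, n_iter=10):
    """Plain K-means: alternate nearest-centroid assignment and mean update."""
    centroids = X[:k].copy()   # naive init; k-means++ is the usual choice
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of three points each.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centroids = kmeans(X, k=2)
print(labels)   # [0 0 0 1 1 1]
```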

## Week 7: Other Classification Algorithms (Intermediate)

• Support Vector Machines (SVM)
• Naive Bayes

Practice:

• Train SVM and Naive Bayes classifiers on your dataset and compare their performance with your week 4 baseline
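
To see why Naive Bayes is “naive”, here is a minimal Gaussian Naive Bayes sketch in NumPy on fabricated, well-separated data; it assumes features are independent given the class:

```python
import numpy as np

def gaussian_nb_predict(X_train, y_train, X_new):
    """Tiny Gaussian Naive Bayes: fit per-class feature means/variances,
    then predict the class with the highest log-posterior."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9
        prior = np.log(len(Xc) / len(X_train))
        # Naive assumption: features are independent given the class,
        # so the log-likelihood is a sum over features.
        loglik = -0.5 * np.sum(np.log(2 * np.pi * var)
                               + (X_new - mean) ** 2 / var, axis=1)
        scores.append(prior + loglik)
    return classes[np.argmax(scores, axis=0)]

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],   # class 0
              [5.0, 5.1], [4.9, 5.0], [5.1, 4.9]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])
preds = gaussian_nb_predict(X, y, np.array([[1.0, 1.0], [5.0, 5.0]]))
print(preds)   # [0 1]
```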

## Week 8: Tree-Based and Ensemble Models (Advanced)

• Decision tree
• Random forests
• Boosting and Bagging
• XGBoost

Practice:

• Train a baseline classification model
• Apply ensemble modeling. Is there a performance improvement?
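
Bagging can be sketched by hand: train many weak learners (depth-1 decision stumps here) on bootstrap samples, then majority-vote their predictions. The data below is fabricated:

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_stump(X, y):
    """Best single-feature threshold split (a depth-1 decision tree)."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            for guess in (pred, 1 - pred):
                acc = (guess == y).mean()
                if best is None or acc > best[0]:
                    best = (acc, f, t, guess is not pred)
    return best[1:]            # (feature, threshold, flipped?)

def stump_predict(stump, X):
    f, t, flipped = stump
    pred = (X[:, f] > t).astype(int)
    return 1 - pred if flipped else pred

# Two fabricated classes, 30 points each.
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# Bagging: each stump sees a bootstrap sample (drawn with replacement).
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))
    stumps.append(fit_stump(X[idx], y[idx]))

# Majority vote across the ensemble.
votes = np.mean([stump_predict(s, X) for s in stumps], axis=0)
ensemble_pred = (votes > 0.5).astype(int)
acc = (ensemble_pred == y).mean()
print(acc)                     # high training accuracy
```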

I hope you found this data science roadmap helpful. It’s time to start coding your way through the curriculum. Happy learning and coding!