In part one of the data science roadmap, you saw a structured outline for learning Python, SQL, and math for data science.
In this second part, we’ll go over a structured learning path for data analysis and machine learning in Python. As a data scientist, you should be able to analyze existing data and answer business questions, analyze trends, and more. Data analysis and visualization are, therefore, integral parts of a data scientist’s toolbox.
Here, we present a step-by-step guide to learning powerful Python libraries for numeric computing and data analysis: NumPy and pandas. For data visualization, we suggest learning matplotlib and seaborn. You’ll end the section with a capstone project on Exploratory Data Analysis (EDA).
Data Analysis and Visualization
![](https://i0.wp.com/datavalley.technology/wp-content/uploads/2022/10/2.png?resize=768%2C432&ssl=1)
Week 1: Getting Started with NumPy
- Understanding multidimensional arrays and axes
- Creating NumPy arrays
- Indexing and slicing NumPy arrays
- Useful built-in functions: min(), max(), argmax(), and more
- Operations along different axes
- Reshaping arrays
- Broadcasting in NumPy arrays
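To preview these ideas before diving into the tutorial, here's a minimal sketch covering array creation, indexing, axis operations, reshaping, and broadcasting:

```python
import numpy as np

# Create a 2-D array: 2 rows (axis 0), 3 columns (axis 1)
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Indexing and slicing
first_row = a[0]         # the first row
last_col = a[:, -1]      # the last column of every row

# Built-in reductions, overall and along an axis
total_max = a.max()          # 6
col_sums = a.sum(axis=0)     # sum down each column
argmax_idx = a.argmax()      # index of the max in the flattened array

# Reshaping: same data, new shape
b = a.reshape(3, 2)

# Broadcasting: the 1-D array is "stretched" across each row
shifted = a + np.array([10, 20, 30])
```

Note how `axis=0` collapses the rows (producing one value per column); swapping to `axis=1` would collapse the columns instead.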
Resources
Python NumPy Tutorial, CS231n @Stanford
Weeks 2 and 3: Data Manipulation and Analysis with pandas
- Loading data from multiple sources such as CSV and Parquet files
- Basics of pandas Series and DataFrames
- Summary statistics on DataFrames
- Indexing and slicing DataFrames
- Filtering DataFrames
- pandas groupby()
- pandas apply()
- Pivot tables and joins (Advanced)
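Here's a minimal sketch of these operations on a small, made-up sales table (the column names are illustrative, not from any real dataset):

```python
import pandas as pd

# A tiny, made-up sales table for illustration
df = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "sales": [100, 80, 120, 60],
})

# Summary statistics for a column
stats = df["sales"].describe()

# Filtering with a boolean mask
east = df[df["region"] == "East"]

# groupby(): total sales per region
per_region = df.groupby("region")["sales"].sum()

# apply(): element-wise transformation of a column
df["sales_k"] = df["sales"].apply(lambda s: s / 1000)

# Pivot table: regions as rows, products as columns
pivot = df.pivot_table(index="region", columns="product", values="sales")
```

The same pattern (boolean masks, `groupby`, `pivot_table`) scales directly from this toy frame to real datasets loaded from CSV or Parquet.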
Resources
Data Analysis with pandas, Dataschool
Week 4: Data Visualization with Matplotlib
- Understanding the matplotlib figure object
- Creating line plots and scatter plots
- Understanding legends and subplots
- Styling matplotlib plots
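As a quick sketch of the figure/axes model, here's one figure holding a styled line plot with a legend alongside a scatter plot (the `Agg` backend is used only so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe to omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# One figure object holding two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with a legend and basic styling on the first axes
ax1.plot(x, np.sin(x), label="sin(x)", color="tab:blue", linestyle="--")
ax1.legend()
ax1.set_title("Line plot")

# Scatter plot on the second axes
ax2.scatter(x, np.cos(x), s=10, color="tab:orange")
ax2.set_title("Scatter plot")

fig.tight_layout()
fig.savefig("demo.png")
```

Working through the `fig`/`ax` objects (rather than the implicit `plt.plot` interface) makes subplots and styling much easier to control.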
Week 5: Data Visualization with Seaborn
- Review pandas and matplotlib
- Learn seaborn distribution and pair plots
- Learn boxplots and their interpretation
Capstone Project: Apply learnings from Weeks 1–5
- Exploratory Data Analysis (EDA) on a dataset from Kaggle. Here are some fun, beginner-friendly Kaggle datasets to choose from.
- Apply all pandas methods you’ve learned
- Use seaborn to create explanatory plots
Machine Learning: From Data Cleaning to Algorithms
So far, you’ve learned the foundational Python, SQL, and math skills, and you know how to perform exploratory data analysis on a dataset. The next step is learning to build machine learning models that can be used to solve business problems.
![](https://i0.wp.com/datavalley.technology/wp-content/uploads/2022/10/1.png?resize=768%2C432&ssl=1)
You’ve already learned how to scrape data from the web and import data from various sources into your working environment. This section outlines a guided learning path for you to progress through the machine learning pipeline from data cleaning to machine learning algorithms.
Week 1: Data cleaning and preprocessing (Beginner)
- Understanding missing values
- Dealing with missing values: Imputation techniques
- KNNImputer and IterativeImputer in scikit-learn
- Understanding outliers
- Detecting and removing outliers
- Encoding categorical variables
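The steps above can be sketched end to end on a small, made-up dataset (column names and values are illustrative only). This uses scikit-learn's `KNNImputer`, median filling, one-hot encoding, and a simple z-score outlier rule:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# A made-up dataset with missing values and a categorical column
df = pd.DataFrame({
    "age": [25, np.nan, 35, 40, np.nan],
    "income": [30, 45, np.nan, 60, 50],
    "city": ["NY", "SF", "NY", "LA", "SF"],
})

# Inspect missingness: count of NaNs per column
missing_counts = df.isna().sum()

# Simple imputation: fill a column with its median
df_median = df.copy()
df_median["age"] = df_median["age"].fillna(df_median["age"].median())

# KNN imputation: fill NaNs from the nearest rows in feature space
imputer = KNNImputer(n_neighbors=2)
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# One-hot encode the categorical variable
df_encoded = pd.get_dummies(df, columns=["city"])

# Flag outliers with a z-score rule (|z| > 3 is a common cutoff)
z = (df["income"] - df["income"].mean()) / df["income"].std()
outliers = df[z.abs() > 3]
```

`IterativeImputer` follows the same fit/transform pattern but must first be enabled via `from sklearn.experimental import enable_iterative_imputer`.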
Practice:
- Choose a dataset.
- Perform EDA applying data analysis and visualization techniques
- Use suitable imputation techniques to handle missing values
Week 2: Machine Learning Basics (Beginner)
- Steps in the machine learning pipeline
- Supervised vs unsupervised learning
- Regression vs classification
- Review: Gradient descent optimization
- Understanding loss functions, hyperparameters, train-test splits, and validation datasets
- Getting started with scikit-learn library
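The core scikit-learn workflow (split, fit, predict, score) can be sketched on synthetic data where the true relationship is known, so you can check that the model recovers it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 0.5, size=200)

# Hold out a test set so evaluation reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The scikit-learn pattern: instantiate, fit, predict
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Mean squared error is a common regression loss
mse = mean_squared_error(y_test, preds)
```

Every estimator in scikit-learn follows this same `fit`/`predict` interface, which is why the library is a good starting point.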
Practice:
- Read sklearn docs
- Summarize your understanding of loss, gradients, parameters, and hyperparameters as a Jupyter notebook report
Week 3: Feature Engineering for Machine Learning (Intermediate)
- Need for feature engineering
- Revise: pandas groupby() and apply() methods
- Use the above methods to apply transformations to features
- Feature space reduction using PCA (covered in the Math (Linear Algebra) for Data Science section)
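To make these patterns concrete, here's a minimal sketch on a made-up transactions table: a groupby() aggregate feature, an apply() transformation, and PCA reducing three correlated columns to two components:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# A made-up transactions table for illustration
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "amount": [10.0, 30.0, 5.0, 15.0, 50.0],
})

# groupby() to build an aggregate feature per customer
totals = df.groupby("customer")["amount"].sum()
df["customer_total"] = df["customer"].map(totals)

# apply() to derive a transformed feature
df["log_amount"] = df["amount"].apply(np.log1p)

# PCA: three highly correlated synthetic columns collapse to
# essentially one direction of variation
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(100, 1)),
               -base + 0.01 * rng.normal(size=(100, 1))])
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```

Checking `pca.explained_variance_ratio_` tells you how much information each retained component preserves.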
Practice:
- For the dataset you chose in week 1, understand existing features and perform feature engineering by applying the suitable techniques from above.
Week 4: Intro to Supervised Learning (Intermediate)
- Linear regression
- K Nearest Neighbors (KNN)
- Logistic regression
- Understanding regularization
- L1, L2 regularization (Ridge and Lasso regression)
- Understanding Bias Variance Tradeoff
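A quick way to see L1 vs L2 regularization in action is on synthetic data where only a couple of features actually matter. Lasso's L1 penalty can zero out the irrelevant coefficients entirely, while Ridge's L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first 2 of 10 features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=100)

# Plain least squares: fits small nonzero weights to every feature
ols = LinearRegression().fit(X, y)

# Ridge (L2): shrinks all coefficients toward zero
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1): drives irrelevant coefficients exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(np.isclose(lasso.coef_, 0.0)))
```

This sparsity is why Lasso doubles as a feature-selection tool, at the cost of biasing the surviving coefficients slightly toward zero.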
Practice:
- Reassess your dataset and the problem you’re trying to solve by training a model on the dataset.
- If it’s a regression problem, build a simple baseline linear regression model
- If it’s a classification problem, build a baseline logistic regression model
Week 5: Model Evaluation and Hyperparameter Tuning (Intermediate)
- Evaluation metrics for regression
- Evaluation metrics for classification
- Importance of cross-validation
- K-fold cross-validation
- Grid search, Randomized search
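These pieces fit together in a few lines of scikit-learn: K-fold cross-validation scores a model, and randomized search samples hyperparameter candidates and cross-validates each one (a KNN classifier on synthetic data is used here purely for illustration):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

# Randomized search: sample 10 values of n_neighbors,
# scoring each candidate by 5-fold cross-validation
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": randint(1, 30)},
    n_iter=10, cv=5, random_state=0)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```

`GridSearchCV` has the same interface but tries every combination exhaustively; randomized search is usually the better first choice when the grid is large.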
Practice:
- Evaluate the model built in week 4
- What do the metrics suggest?
- Identify the best model hyperparameters using randomized search
- Re-evaluate the model now
- Is there improvement in performance?
Week 6: Unsupervised Learning (Intermediate)
- Clustering algorithms
- Hierarchical clustering
- K-means clustering
- Density-based clustering (DBSCAN)
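The contrast between the two main approaches can be sketched on synthetic blobs: K-means needs the number of clusters up front, while DBSCAN discovers it from density (the `eps` and `min_samples` values below are tuned to this toy data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.5, random_state=0)

# K-means: k must be chosen in advance
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
kmeans_labels = km.labels_

# DBSCAN: no k needed; points in low-density regions get label -1 (noise)
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
n_clusters = len(set(db.labels_) - {-1})
```

On data with non-convex cluster shapes or noise points, DBSCAN often succeeds where K-means fails; on clean spherical blobs like these, both recover the three groups.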
Practice:
- Choose a dataset suitable for clustering
- Identify which clustering algorithm is well-suited for the task at hand and apply it
Week 7: Other Classification Algorithms (Intermediate)
- Support Vector Machines (SVM)
- Naive Bayes
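As a sketch of the text-classification practice task below, here's Multinomial Naive Bayes over bag-of-words counts on a tiny, made-up sentiment dataset (the example sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny, made-up sentiment dataset
texts = ["great movie loved it",
         "wonderful film great acting",
         "terrible movie hated it",
         "awful film boring plot"]
labels = ["pos", "pos", "neg", "neg"]

# Bag-of-words counts feed naturally into Multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Classify a new sentence
pred = model.predict(["loved this wonderful movie"])[0]
```

The pipeline bundles vectorization and classification, so raw strings go in and labels come out; swapping `MultinomialNB` for an SVM (`sklearn.svm.LinearSVC`) requires changing only one line.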
Practice:
- Choose a dataset to perform text analysis
- Build a simple text classification model using the Naive Bayes algorithm
Week 8: Tree-Based and Ensemble Models (Advanced)
- Decision tree
- Random forests
- Boosting and Bagging
- XGBoost
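A minimal sketch of the progression from a single tree to ensembles, using scikit-learn's built-in implementations (XGBoost is a separate library implementing the same boosting idea; synthetic data is used here for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: a single decision tree
tree_acc = DecisionTreeClassifier(random_state=0) \
    .fit(X_train, y_train).score(X_test, y_test)

# Bagging: many trees on bootstrap samples; a random forest
# additionally subsamples features at each split
forest_acc = RandomForestClassifier(n_estimators=100, random_state=0) \
    .fit(X_train, y_train).score(X_test, y_test)

# Boosting: trees added sequentially, each correcting the
# errors of the ensemble so far
boost_acc = GradientBoostingClassifier(random_state=0) \
    .fit(X_train, y_train).score(X_test, y_test)
```

Comparing the three accuracies on your own dataset is exactly the practice exercise below: ensembles typically beat the single tree by reducing variance (bagging) or bias (boosting).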
Practice:
- Train a baseline classification model
- Apply ensemble modeling. Is there a performance improvement?
I hope you found this data science roadmap helpful. It’s time to start coding your way through the curriculum. Happy learning and coding!