In part one of the data science roadmap, we presented a structured outline for learning Python, SQL, and math for data science.
In this second part, we’ll go over a structured learning path for data analysis and machine learning in Python. As a data scientist, you should be able to analyze existing data to answer business questions, spot trends, and more. Data analysis and visualization are, therefore, integral parts of a data scientist’s toolbox.
Here, we present a step-by-step guide to learning powerful Python libraries for numeric computing and data analysis: NumPy and pandas. For data visualization, we suggest learning matplotlib and seaborn. You’ll end the section with a capstone project on Exploratory Data Analysis (EDA).
Data Analysis and Visualization
Week 1: Getting Started with NumPy
- Understanding multidimensional arrays and axes
- Creating NumPy arrays
- Indexing and slicing NumPy arrays
- Useful built-in functions: min(), max(), argmax(), and more
- Operations along different axes
- Reshaping arrays
- Broadcasting in NumPy arrays
Resources
Python NumPy Tutorial, CS231n @Stanford
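To make the Week 1 topics concrete, here’s a minimal sketch covering array creation, slicing, argmax(), axis-wise operations, reshaping, and broadcasting:

```python
import numpy as np

# A 2-D array: axis 0 runs down the rows, axis 1 across the columns.
arr = np.array([[3, 7, 1],
                [9, 2, 5]])

print(arr[0, 1:])               # slicing: row 0, columns 1 onward -> [7 1]
print(arr.max(), arr.argmax())  # global max (9) and its flat index (3)
print(arr.sum(axis=0))          # sum down each column -> [12  9  6]
print(arr.reshape(3, 2))        # same data, new shape

# Broadcasting: the 1-D array is stretched across each row of arr.
print(arr + np.array([10, 20, 30]))
```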
Weeks 2–3: Data Manipulation and Analysis with pandas
- Loading data from multiple sources such as CSV and Parquet files
- Basics of pandas Series and DataFrames
- Summary statistics on DataFrames
- Indexing and slicing DataFrames
- Filtering DataFrames
- pandas groupby()
- pandas apply()
- Pivot tables and joins (Advanced)
Resources
Data Analysis with pandas, Dataschool
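As a quick reference, here’s a minimal sketch of the core pandas operations above. The file sales.csv and its columns (region, product, units, price) are made up for illustration; swap in any dataset you have.

```python
import pandas as pd

# Hypothetical file and columns; pd.read_parquet() works the same way for Parquet.
df = pd.read_csv("sales.csv")  # assumed columns: region, product, units, price

print(df.describe())                        # summary statistics
print(df.loc[df["units"] > 100])            # filtering rows
print(df.groupby("region")["units"].sum())  # aggregate per group

# apply(): run a function across each row (axis=1).
df["revenue"] = df.apply(lambda row: row["units"] * row["price"], axis=1)

# Pivot table: total revenue per region and product.
print(df.pivot_table(values="revenue", index="region",
                     columns="product", aggfunc="sum"))
```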
Week 4: Data Visualization with Matplotlib
- Understanding the matplotlib figure object
- Creating line plots and scatter plots
- Understanding legends and subplots
- Styling matplotlib plots
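Here’s a minimal sketch of the matplotlib topics above: one figure object, two subplots, a line plot and a scatter plot, legends, and light styling.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)

# One figure object containing two subplots (axes).
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(x, np.sin(x), label="sin(x)", color="tab:blue", linestyle="--")
ax1.set_title("Line plot")
ax1.legend()

ax2.scatter(x, np.cos(x), label="cos(x)", s=15, color="tab:orange")
ax2.set_title("Scatter plot")
ax2.legend()

fig.tight_layout()
plt.show()
```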
Week 5: Data Visualization with Seaborn
- Review pandas and matplotlib
- Learn seaborn distribution and pair plots
- Learn boxplots and their interpretation
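A minimal sketch using seaborn’s built-in tips dataset to produce a distribution plot, a pair plot, and a boxplot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# seaborn ships small sample datasets; "tips" is a common starting point.
tips = sns.load_dataset("tips")

sns.histplot(data=tips, x="total_bill", kde=True)  # distribution plot
plt.show()

sns.pairplot(tips)  # pairwise relationships between numeric columns
plt.show()

# Boxplot: the box spans the interquartile range; points beyond the
# whiskers are potential outliers.
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()
```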
Capstone Project: Apply Learnings from Weeks 1–5
- Exploratory Data Analysis (EDA) on a dataset from Kaggle; pick a fun, beginner-friendly dataset to work with
- Apply all pandas methods you’ve learned
- Use seaborn to create explanatory plots
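To tie the pieces together, here’s a starter skeleton for the capstone EDA. The file name train.csv is a placeholder for whichever Kaggle dataset you choose, so adjust the column handling to your data.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# "train.csv" stands in for whichever Kaggle dataset you download.
df = pd.read_csv("train.csv")

print(df.shape)                    # rows and columns
print(df.dtypes)                   # column types
print(df.isna().sum())             # missing values per column
print(df.describe(include="all"))  # summary statistics

# Visual pass: pairwise relationships among the numeric columns.
sns.pairplot(df.select_dtypes("number"))
plt.show()
```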
Machine Learning: From Data Cleaning to Algorithms
So far, you’ve learned foundational Python, SQL, and math skills, and you know how to perform exploratory data analysis on a dataset. As a data scientist, you should also be able to build machine learning models that solve business problems.
You’ve already learned how to scrape data from the web and import data from various sources into your working environment. This section outlines a guided learning path for you to progress through the machine learning pipeline from data cleaning to machine learning algorithms.
Week 1: Data Cleaning and Preprocessing (Beginner)
- Understanding missing values
- Dealing with missing values: Imputation techniques
- KNNImputer and IterativeImputer in scikit-learn
- Understanding outliers
- Detecting and removing outliers
- Encoding categorical variables
Practice:
- Choose a dataset
- Perform EDA applying data analysis and visualization techniques
- Use suitable imputation techniques to handle missing values
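Here’s a minimal sketch of the Week 1 techniques: KNN imputation, a simple IQR outlier rule, and one-hot encoding. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric data with gaps; a real dataset would come from your own EDA.
X = pd.DataFrame({"age": [25, np.nan, 47, 51, np.nan],
                  "income": [38_000, 42_000, np.nan, 90_000, 61_000]})

# KNN imputation: each missing value is filled from the 2 most similar rows.
X_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(X),
                        columns=X.columns)

# Simple IQR rule for detecting and removing outliers on one column.
q1, q3 = X_filled["income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = X_filled["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
X_clean = X_filled[mask]

# One-hot encode a categorical variable with pandas.
city = pd.Series(["NY", "SF", "NY", "LA", "SF"], name="city")
print(pd.get_dummies(city))
```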
Week 2: Machine Learning Basics (Beginner)
- Steps in the machine learning pipeline
- Supervised vs unsupervised learning
- Regression vs classification
- Review: Gradient descent optimization
- Understanding loss functions, hyperparameters, train-test splits, and validation datasets
- Getting started with the scikit-learn library
Practice:
- Read the scikit-learn docs
- Summarize your understanding of loss, gradients, parameters, and hyperparameters as a Jupyter notebook report
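To get hands-on with scikit-learn, here’s a minimal sketch of the basic pipeline: load data, hold out a test set, fit a model, and measure the loss on unseen data. It uses scikit-learn’s built-in diabetes dataset so it runs as-is.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Built-in regression dataset keeps the example self-contained.
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the data so the model is evaluated on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))  # loss on held-out data
```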
Week 3: Feature Engineering for Machine Learning (Intermediate)
- Need for feature engineering
- Revise: pandas groupby() and apply() methods
- Use the above methods to apply transformations to features
- Feature space reduction using PCA (you learned PCA in the Math (Linear Algebra) for Data Science section)
Practice:
- For the dataset you chose in week 1, understand existing features and perform feature engineering by applying the suitable techniques from above.
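A minimal sketch of these feature-engineering steps on a made-up customer table: a groupby()-based aggregation feature (via transform, a groupby variant), a row-wise apply(), and PCA on the scaled result.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up customer transactions for illustration.
df = pd.DataFrame({"customer": ["a", "a", "b", "b", "c"],
                   "amount": [20.0, 35.0, 10.0, 80.0, 55.0]})

# groupby + transform: attach each customer's mean spend as a new feature.
df["cust_mean"] = df.groupby("customer")["amount"].transform("mean")

# apply: an arbitrary row-wise transformation.
df["amount_ratio"] = df.apply(lambda r: r["amount"] / r["cust_mean"], axis=1)

# PCA on standardized numeric features reduces the feature space.
X = StandardScaler().fit_transform(df[["amount", "cust_mean", "amount_ratio"]])
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (5, 2)
```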
Week 4: Intro to Supervised Learning (Intermediate)
- Linear regression
- K Nearest Neighbors (KNN)
- Logistic regression
- Understanding regularization
- L1, L2 regularization (Ridge and Lasso regression)
- Understanding the bias-variance tradeoff
Practice:
- Reassess your dataset and the problem you’re trying to solve, then train a baseline model
- If it’s a regression problem, build a simple baseline linear regression model
- If it’s a classification problem, build a baseline logistic regression model
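Here’s a minimal baseline following the classification branch above, on scikit-learn’s built-in breast cancer dataset. Note that LogisticRegression applies L2 regularization by default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # built-in classification dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline classifier: logistic regression (L2-regularized by default in
# scikit-learn; the C parameter controls the regularization strength).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# For a regression problem, swap in LinearRegression as the baseline, and
# try Ridge (L2 penalty) and Lasso (L1 penalty) from sklearn.linear_model
# to see regularization's effect on the coefficients.
```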
Week 5: Model Evaluation and Hyperparameter Tuning (Intermediate)
- Evaluation metrics for regression
- Evaluation metrics for classification
- Importance of cross-validation
- K-fold cross-validation
- Grid search, Randomized search
Practice:
- Evaluate the model built in week 4
- What do the metrics suggest?
- Identify the best model hyperparameters using randomized search
- Re-evaluate the model now
- Is there improvement in performance?
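A minimal sketch of cross-validation and randomized search, tuning the KNN classifier from Week 4 on a built-in dataset:

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
model = KNeighborsClassifier()

# 5-fold cross-validation: five train/validation splits, averaged.
print("baseline CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Randomized search samples hyperparameter settings instead of trying them all.
search = RandomizedSearchCV(
    model,
    param_distributions={"n_neighbors": randint(1, 30),
                         "weights": ["uniform", "distance"]},
    n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds the model refit with the winning hyperparameters, ready for a final evaluation on your held-out test set.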
Week 6: Unsupervised Learning (Intermediate)
- Clustering algorithms
- Hierarchical clustering
- K-means clustering
- Density-based clustering (DBSCAN)
Practice:
- Choose a dataset suitable for clustering
- Identify which clustering algorithm is well suited to the task at hand and apply it
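A minimal sketch contrasting K-means and DBSCAN on synthetic blobs; the eps and min_samples values are illustrative and would be tuned for real data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic blobs keep the example self-contained.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

# K-means needs the number of clusters up front.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN infers the number of clusters from density; -1 marks noise points.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print(set(km_labels), set(db_labels))
```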
Week 7: Other Classification Algorithms (Intermediate)
- Support Vector Machines (SVM)
- Naive Bayes
Practice:
- Choose a dataset to perform text analysis
- Build a simple text classification model using the Naive Bayes algorithm
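A minimal sketch of the Naive Bayes practice task; the four-document corpus is made up, and a real exercise would use a labeled text dataset such as one from Kaggle.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus for illustration.
texts = ["free prize, claim now", "meeting at noon tomorrow",
         "win cash now", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["claim your free cash"]))  # likely "spam"
```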
Week 8: Tree-Based and Ensemble Models (Advanced)
- Decision tree
- Random forests
- Bagging and boosting
- XGBoost
Practice:
- Train a baseline classification model
- Apply ensemble models. Is there a performance improvement?
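A minimal sketch of this comparison: a single decision tree versus a random forest (a bagged ensemble of trees) on a built-in dataset. XGBoost is a separate library but follows the same fit/predict pattern.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Single decision tree as the baseline.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Random forest: a bagged ensemble of decorrelated trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("forest accuracy:", forest.score(X_test, y_test))
# XGBoost (a boosted-tree library) uses the same interface:
#   from xgboost import XGBClassifier; XGBClassifier().fit(X_train, y_train)
```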
I hope you found this data science roadmap helpful. It’s time to start coding your way through the curriculum. Happy learning and coding!