Scikit-learn Advanced Features | Data Science

DataValley Team
August 6, 2020
12:01 pm

Neither the Titanic dataset nor scikit-learn is new to any data scientist, but scikit-learn has some important features that make model preprocessing and tuning easier. Specifically, this notebook covers the following concepts:

ColumnTransformer
Pipeline
SimpleImputer
StandardScaler
OneHotEncoder
OrdinalEncoder
GridSearch

The dataset used in this article can be found in Titanic Dataset

Before diving into our use case, we will explore these concepts to figure out their importance for any Machine Learning project.

Pipeline

The purpose of the pipeline is to assemble several steps that can be cross-validated while setting different parameters. For example, you may want to do some actions regarding your data, such as:

Impute Missing Values.
Scale the data.
Fit the data into a Machine Learning model.

Using the Pipeline concept in scikit-learn, you can access each step, change its parameters, and combine multiple pipelines elegantly.
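The three actions listed above can be sketched as a small pipeline. This is only an illustration: the toy arrays, the step names, and the choice of LogisticRegression as the model are assumptions made for the example, not part of the Titanic use case.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two numeric columns, one missing value
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([0, 0, 1, 1])

# Three steps: impute missing values, scale the data, fit a model
toy_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

# Each fitted step is accessible by its name
imputer_step = toy_pipeline.named_steps['imputer']
print(imputer_step.statistics_)  # per-column means learned during fit

# A step's parameter is changed with the <step>__<parameter> syntax
toy_pipeline.set_params(imputer__strategy='median')
print(toy_pipeline.get_params()['imputer__strategy'])
```

After changing a parameter with set_params, the pipeline has to be fitted again before the new setting takes effect.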

Grid Search

Grid Search is the process of performing hyperparameter tuning to determine the optimal values for a given model using an exhaustive search. For example, when selecting the regularization parameter for an SVM, you may be unsure which value works best. Grid Search will try every option you provide, fit and cross-validate the model with each combination of hyperparameters, and then return the combination that achieved the best score.
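As a rough sketch of the SVM example, the search below tries three candidate values for the regularization parameter C. The synthetic dataset from make_classification and the candidate values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data (an assumption for this sketch, not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Try every candidate value of C with 5-fold cross-validation
svm_search = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=5)
svm_search.fit(X_demo, y_demo)

print(svm_search.best_params_)  # the value of C with the best mean CV score
print(svm_search.best_score_)   # that mean cross-validated score
```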

Simple Imputer

SimpleImputer is a scikit-learn transformer that lets you handle missing values in your data efficiently. It is also flexible: you can fill the missing values using different strategies, such as the mean, the median, the most frequent value, or a constant.
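A minimal sketch of the strategies on a made-up column with one missing value (the numbers are invented for the illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one missing entry
ages = np.array([[20.0], [30.0], [np.nan], [100.0]])

# The same gap filled with three different strategies
mean_imputed = SimpleImputer(strategy='mean').fit_transform(ages)
median_imputed = SimpleImputer(strategy='median').fit_transform(ages)
constant_imputed = SimpleImputer(strategy='constant', fill_value=0.0).fit_transform(ages)

print(mean_imputed[2, 0])      # 50.0, the mean of 20, 30 and 100
print(median_imputed[2, 0])    # 30.0, the median of 20, 30 and 100
print(constant_imputed[2, 0])  # 0.0, the chosen constant
```

The outlier (100) pulls the mean well above the median, which is one reason the median is often preferred for skewed features.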

Standard Scaler

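One of the most basic approaches to scaling your data for faster training and convergence. It standardizes each feature by subtracting its mean and dividing by its standard deviation, so the result has mean = 0 and standard deviation = 1. Note that this only rescales the data; it does not change the shape of the distribution, so skewed data stays skewed.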
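The following sketch, on a made-up column, checks that the scaled values have zero mean and unit standard deviation, and that the result matches the manual formula (x - mean) / std:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up, deliberately skewed column
fares = np.array([[10.0], [20.0], [30.0], [100.0]])

scaled = StandardScaler().fit_transform(fares)

# The same computation done by hand (population standard deviation)
manual = (fares - fares.mean(axis=0)) / fares.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0
print(scaled.std(axis=0))   # approximately 1
print(np.allclose(scaled, manual))
```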

One-Hot Encoder

It’s used for handling categorical data that do not follow any order. That’s because most Machine Learning models accept nothing but numbers, so you have to find a way to transform any non-numeric data into a numeric form. A one-hot encoder does this by creating one binary column per category, set to 1 for the rows that belong to that category and 0 otherwise.
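A small sketch of the idea, using a made-up column of port codes. The handle_unknown='ignore' setting is the same one used later in this article; it encodes categories unseen during fit as all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with three distinct values
ports = np.array([['S'], ['C'], ['Q'], ['S']])

port_encoder = OneHotEncoder(handle_unknown='ignore')
# The encoder returns a sparse matrix by default; convert it for printing
ports_encoded = port_encoder.fit_transform(ports).toarray()

print(port_encoder.categories_)  # categories are sorted: C, Q, S
print(ports_encoded)
# 'S' -> [0, 0, 1], 'C' -> [1, 0, 0], 'Q' -> [0, 1, 0]

# A category that was not seen during fit becomes a row of zeros
unknown_encoded = port_encoder.transform(np.array([['X']])).toarray()
print(unknown_encoded)
```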

Ordinal Encoder

It’s used for handling categorical data, just like the One-Hot Encoder. Still, it uses a different approach, meant for ordinal data where the order is important and should be reflected in the encoding: each category is replaced by a single integer that follows the order of the categories.
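A minimal sketch with a made-up "size" column. Passing the categories explicitly tells the encoder which order to respect; without it, the categories would be sorted alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordinal column: small < medium < large
sizes = np.array([['medium'], ['small'], ['large'], ['small']])

# The order of this list defines the integer assigned to each category
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes_encoded = size_encoder.fit_transform(sizes)

print(sizes_encoded.ravel())  # small -> 0, medium -> 1, large -> 2
```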

Column Transformer

When handling more complex problems, you will need more than one Pipeline: for example, one to process numeric data, another to handle categorical data, another for ordinal data, and so on. ColumnTransformer does this job by allowing you to assign a specific pipeline to a particular set of columns and then concatenating the results. You will see this in action in our code example.

Import Packages

# Pandas for data reading and writing
import pandas as pd
# Numpy for Numerical operations
import numpy as np
# Import ColumnTransformer
from sklearn.compose import ColumnTransformer
# Import Pipeline
from sklearn.pipeline import Pipeline
# Import SimpleImputer
from sklearn.impute import SimpleImputer
# Import StandardScaler, OneHotEncoder and OrdinalEncoder
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
# Import Random Forest for Classification
from sklearn.ensemble import RandomForestClassifier
# Import train_test_split 
from sklearn.model_selection import train_test_split
# Import GridSearch
from sklearn.model_selection import GridSearchCV

Reading Data

In the following cells, we will read the train data and check for NaNs

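# Read the train data
data = pd.read_csv("train.csv")
# See some info (column types and non-null counts)
data.info()
# Count the missing values (NaNs) in each column
print(data.isna().sum())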

Splitting Data

# Split the data into predictors and target
# Note: 'Survived' is the target, so it is removed from the predictors;
# 'Name' is dropped as well because our model will not use it
X = data.drop(['Survived', 'Name'], axis = 1)
y = data['Survived']

# Split the data into train and test chunks 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Continuous Features Handling

It’s clear that we have some numerical features with missing values that need to be imputed, and these features also have to be on the same scale. In the following cell, we will handle the numerical features separately, i.e., “Age” and “Fare”.

# Now, we will create a pipeline for the numeric features
# Define a list with the numeric features
numeric_features = ['Age', 'Fare']
# Define a pipeline for numeric features
numeric_features_pipeline = Pipeline(steps= [
    ('imputer', SimpleImputer(strategy = 'median')), # Impute with median value for missing
    ('scaler', StandardScaler())                     # Conduct a scaling step
])

Categorical Features Handling

We also have some categorical features with missing values that need to be imputed and then encoded using one-hot encoding. In the following cell, we will handle the categorical features separately, i.e., “Embarked” and “Sex”.

Note: I chose the simple imputer to fill the missing cells with the word ‘missing’. My aim was to gather all missing cells into one category for further encoding.

# Now, we will create a pipeline for the categorical features
# Define a list with the categorical features
categorical_features = ['Embarked', 'Sex']
# Define a pipeline for categorical features
categorical_features_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value = 'missing')), # Impute with the word 'missing' for missing values
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))     # Convert all categorical variables to one hot encoding
])

Ordinal Features Handling

Passenger class, or ‘Pclass’ for short, is an ordinal feature, so it must be handled keeping in mind that the classes follow a fixed order (1, 2, 3).

# Now, we will create a pipeline for the ordinal features
# Define a list with the ordinal features
ordinal_features = ['Pclass']
# Define a pipeline for ordinal features
ordinal_features_pipeline = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(categories= [[1, 2, 3]]))
])

Construct a Comprehensive Preprocessor

Now, we will create a preprocessor that can handle all columns in our dataset using ColumnTransformer.

# Now, we will create a transformer to handle all columns
preprocessor = ColumnTransformer(transformers= [
    # transformer with name 'num' that will apply
    # 'numeric_features_pipeline' to numeric_features
    ('num', numeric_features_pipeline, numeric_features),
    # transformer with name 'cat' that will apply 
    # 'categorical_features_pipeline' to categorical_features
    ('cat', categorical_features_pipeline, categorical_features),
    # transformer with name 'ord' that will apply 
    # 'ordinal_features_pipeline' to ordinal_features
    ('ord', ordinal_features_pipeline, ordinal_features) 
    ])
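One detail worth keeping in mind: by default, ColumnTransformer drops every column that is not assigned to a transformer (remainder='drop'). The sketch below shows this on a tiny made-up frame; the values and the simplified transformers are assumptions for the illustration, not the preprocessor defined above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame; 'Ticket' is deliberately not assigned to any transformer
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 8.05, 53.10],
    'Sex': ['male', 'female', 'male', 'female'],
    'Ticket': ['A1', 'B2', 'C3', 'D4'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Fare']),  # 1 output column
    ('cat', OneHotEncoder(), ['Sex']),    # 2 output columns (female, male)
])
toy_output = toy_preprocessor.fit_transform(toy)

# 1 scaled column + 2 one-hot columns; 'Ticket' is dropped
print(toy_output.shape)
```

If the untouched columns should be kept instead, remainder='passthrough' can be passed to ColumnTransformer.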

Prediction Pipeline

Now, we will create a full prediction pipeline that applies our preprocessor and then passes its output to our classifier of choice, ‘Random Forest’.

# Now, we will create a full prediction pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier(n_estimators = 120, max_leaf_nodes = 100))])

Pipeline Training

Let’s train our pipeline now.

# Let's fit our classifier
clf.fit(X_train, y_train)

Pipeline Tuning

The question now is: can we push it a little bit further, i.e., can we tune every single part of our Pipeline?
Here, I will use GridSearch to decide three things:

Simple Imputer strategy: mean or median
n_estimators of Random Forest
max-leaf nodes of Random Forest

Note that you can reach any parameter by chaining the step names from the outermost level to the innermost one, separated by double underscores. For example, to access the strategy of the Simple Imputer, use preprocessor__num__imputer__strategy. Let’s see this in action.
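If you are unsure which names are available, get_params() lists them. The sketch below rebuilds a reduced version of the pipeline (numeric branch only, an assumption to keep it short) and filters the names that reach into the imputer:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced version of the article's pipeline: numeric branch only
inner = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
pre = ColumnTransformer(transformers=[('num', inner, ['Age', 'Fare'])])
model = Pipeline(steps=[('preprocessor', pre),
                        ('classifier', RandomForestClassifier())])

# Every tunable name that reaches into the imputer step
imputer_params = [name for name in model.get_params()
                  if name.startswith('preprocessor__num__imputer__')]
print(imputer_params)

# The same names work with set_params, which is what the grid search relies on
model.set_params(preprocessor__num__imputer__strategy='mean')
print(model.get_params()['preprocessor__num__imputer__strategy'])
```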

# Now, let's construct our parameters grid for the search
param_grid = {
    # Search between mean and median for missing numerical values handling
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    # Search for the best number of estimators in our random forest model
    'classifier__n_estimators': [100, 120, 150, 170, 200],
    # Search for the best max_leaf_nodes in our random forest model
    'classifier__max_leaf_nodes': [100, 120, 150, 170, 200]
}

# GridSearch for our classifier using our previously created param_grid with 10-fold cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=10)
# Fit our grid search object
grid_search.fit(X_train, y_train)

print(("Test score of the best random forest from grid search: %.3f"
       % grid_search.score(X_test, y_test)))
print('The best parameters of the imputer and the random forest are:')
print(grid_search.best_params_)

As you can see, with the best parameters we got a score (accuracy) of 0.803 on the test set, and our best parameters are:

Best max_leaf_nodes = 100
Best n_estimators = 200
Best strategy for imputation = median
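Beyond the single best combination, a fitted grid search also keeps the cross-validated score of every candidate in cv_results_. The sketch below shows one way to rank them; it uses synthetic data and a much smaller grid than the one above, both assumptions made so that the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid (assumptions for this sketch)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 20], 'max_leaf_nodes': [5, 10]},
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, ranked by mean cross-validated score
results = pd.DataFrame(search.cv_results_)
ranked = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']]
print(ranked)

# The model refitted on all the training data with the best parameters
print(search.best_estimator_)
```

Looking at the spread of scores can show whether the winner is clearly better or whether several combinations perform about the same.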