Hyperparameter tuning in Random Forest Classifier using genetic algorithm

Sakil Ansari
7 min read · Oct 10, 2020


Introduction

This article introduces how to fine-tune a Machine Learning model with an optimization technique, the Genetic Algorithm, applied to the Random Forest Classifier algorithm. The article encourages data scientists to use this technique in real-world scenarios. Its main aim is to give a glance at how the Genetic Algorithm can be implemented for hyperparameter tuning.

A Genetic Algorithm is an optimization technique that searches for the input values that produce the best output values or results.

The working of a genetic algorithm is inspired by biology, as shown in the image below.

from IPython.display import Image
Image("genetic_algorithm.png")

[Figure: workflow of a genetic algorithm. Image Source Link]
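To make the selection, crossover, and mutation loop concrete before we move to TPOT, here is a minimal, self-contained sketch of a genetic algorithm maximizing a toy objective. Everything in it (the fitness function, the population size, the mutation rate) is a made-up illustration for this article, not part of TPOT or scikit-learn.

import random

def fitness(x):
    # toy objective: maximize -(x - 3)^2, whose best value is at x = 3
    return -(x - 3) ** 2

def genetic_algorithm(pop_size=20, generations=30, mutation_rate=0.2):
    # initial population of random candidate solutions
    population = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the best half of the population
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # crossover + mutation: breed offspring from random pairs of survivors
        offspring = []
        while len(offspring) < pop_size - len(survivors):
            parent_a, parent_b = random.sample(survivors, 2)
            child = (parent_a + parent_b) / 2      # crossover
            if random.random() < mutation_rate:    # mutation
                child += random.gauss(0, 1)
            offspring.append(child)
        population = survivors + offspring
    return max(population, key=fitness)

print(genetic_algorithm())  # should print a value close to 3

The same loop of "evaluate, keep the fittest, breed and mutate" is what we will later apply to hyperparameter configurations instead of numbers.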

Application of Genetic Algorithm

  • Engineering Design
  • Traffic and Shipment Routing (Travelling Salesman Problem)
  • Robotics

Implementation of Genetic Algorithm using TPOT library

A basic pipeline structure is shown in the image below.

Image("pipeline-850x446.png")
png

To use the TPOT library, we first have to install it; pip will also pull in the existing Python libraries on which TPOT is built. So let us quickly install it.

# installing tpot
!pip install tpot

Use Case: Heart Disease Prediction using Random Forest Classifier

The dataset is collected from Kaggle (https://www.kaggle.com/ronitf/heart-disease-uci), and I will use the Random Forest Classifier algorithm to predict whether a person is suffering from heart disease or not.

Import libraries

All the necessary libraries are imported. I'll start with NumPy and pandas. For visualization, I will use the pyplot subpackage of matplotlib; I have used rcParams to add styling to the plots and rainbow for colors. For implementing the Machine Learning model and processing the data, I will use the sklearn library.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.cm import rainbow
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

To split the available dataset into training and testing sets, I'll use the train_test_split method. As I am using a tree-based ensemble technique, feature scaling is not required.

from sklearn.model_selection import train_test_split

Next, I am importing the Machine Learning algorithm:

  1. Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

Import dataset

Now that we have all the libraries we need, I can import the dataset and take a look at it. The dataset is stored in the file dataset.csv. I'll use the pandas read_csv method to read the dataset.

dataset = pd.read_csv('dataset.csv')

The dataset is now loaded into the variable dataset. I’ll just take a glimpse of the data using the describe() and info() methods before I actually start processing and visualizing it.

dataset.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age 303 non-null int64
sex 303 non-null int64
cp 303 non-null int64
trestbps 303 non-null int64
chol 303 non-null int64
fbs 303 non-null int64
restecg 303 non-null int64
thalach 303 non-null int64
exang 303 non-null int64
oldpeak 303 non-null float64
slope 303 non-null int64
ca 303 non-null int64
thal 303 non-null int64
target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
dataset.shape(303, 14)

The dataset has a total of 303 rows and there are no missing values. There are 13 features along with one target value that we wish to predict.
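As an optional sanity check (not part of the original notebook), pandas can confirm that no column has missing values:

# every column should report 0 missing values
print(dataset.isnull().sum())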

dataset.describe()

[Table: summary statistics for all 14 columns]

Understanding the data

Now, we can use visualizations to better understand our data and then look at any processing we might want to do.

rcParams['figure.figsize'] = 20, 14
plt.matshow(dataset.corr())
plt.yticks(np.arange(dataset.shape[1]), dataset.columns)
plt.xticks(np.arange(dataset.shape[1]), dataset.columns)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7f22d50628d0>
png

We can see that a few features have a negative correlation with the target value while others have a positive one.
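If you want the exact numbers behind the heatmap, one optional way (not shown in the original notebook) is to sort the correlation of every feature against the target:

# correlation of each feature with the target, from most negative to most positive
print(dataset.corr()['target'].sort_values())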

Checking the size of each Class in the dataset

rcParams['figure.figsize'] = 8, 6
target_counts = dataset['target'].value_counts()
plt.bar(target_counts.index, target_counts.values, color = ['red', 'green'])
plt.xticks([0, 1])
plt.xlabel('Target Classes')
plt.ylabel('Count')
plt.title('Count of each Target Class')

[Figure: bar chart of the count of each target class]

The two classes are not exactly 50% each, but the split is balanced enough to continue without resampling the data.
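For the exact proportions behind that statement, an optional one-liner (not part of the original notebook) is:

# fraction of samples in each class; roughly 0.54 for class 1 and 0.46 for class 0
print(dataset['target'].value_counts(normalize=True))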

Data Processing

We need to convert some categorical variables into dummy variables. I'll use the get_dummies method to create dummy columns for the categorical variables.

dataset = pd.get_dummies(dataset, columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])

Machine Learning

I'll now split our dataset into training and testing sets, and then train and evaluate the Random Forest Classifier on it.

y = dataset['target']
X = dataset.drop(['target'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)

Random Forest Classifier

Now, I’ll use the ensemble method, Random Forest Classifier, to create the model.

from sklearn.ensemble import RandomForestClassifier
rf_classifier=RandomForestClassifier(n_estimators=10).fit(X_train,y_train)
prediction=rf_classifier.predict(X_test)
y.value_counts()

1    165
0    138
Name: target, dtype: int64

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(confusion_matrix(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))

[[42  6]
 [ 9 43]]
0.85
              precision    recall  f1-score   support

           0       0.82      0.88      0.85        48
           1       0.88      0.83      0.85        52

    accuracy                           0.85       100
   macro avg       0.85      0.85      0.85       100
weighted avg       0.85      0.85      0.85       100

The main hyperparameters of a Random Forest Classifier are listed below (a short illustration of how they are passed to the classifier follows the list):

  • criterion = the function used to evaluate the quality of a split.
  • max_depth = the maximum number of levels allowed in each tree.
  • max_features = the maximum number of features considered when splitting a node.
  • min_samples_leaf = the minimum number of samples required at a leaf node.
  • min_samples_split = the minimum number of samples a node must contain before it can be split.
  • n_estimators = the number of trees in the ensemble.
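As a quick illustration of how these hyperparameters map onto the scikit-learn API, the snippet below fits a single classifier with one arbitrary, hand-picked combination of values on the training split created above; the numbers are placeholders for illustration, not tuned results.

from sklearn.ensemble import RandomForestClassifier

# one hand-picked combination, purely to show where each hyperparameter goes
example_rf = RandomForestClassifier(n_estimators=200,
                                    criterion='entropy',
                                    max_depth=10,
                                    max_features='sqrt',
                                    min_samples_split=5,
                                    min_samples_leaf=2)
example_rf.fit(X_train, y_train)
print(example_rf.score(X_test, y_test))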

Hyperparameter tuning using a genetic algorithm

Genetic Algorithms try to apply natural selection mechanisms to Machine Learning.

Suppose we create a population of N Machine Learning models, each with some predefined hyperparameters. We calculate the accuracy of each model and keep only half of them, the ones that perform best. We then generate offspring whose hyperparameters are similar to those of the best models, so that we again have a population of N models. At this point, we calculate the accuracy of each model once more and repeat the cycle for a defined number of generations. In this way, only the best models survive to the end of the process.
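The sketch below mirrors that description for our classifier. It is a deliberately simplified, hypothetical illustration of the idea (TPOT's internal machinery is more elaborate): random hyperparameter sets are scored with cross-validation, the best half survive each generation, and offspring are created by copying a surviving configuration and mutating one of its values. The search space and population sizes here are made up for readability.

import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# a tiny, illustrative search space (not the grid used with TPOT below)
search_space = {'n_estimators': [100, 200, 400],
                'max_depth': [5, 10, 20],
                'min_samples_leaf': [1, 2, 4]}

def random_config():
    # one candidate: a random choice for every hyperparameter
    return {k: random.choice(v) for k, v in search_space.items()}

def score(config):
    # fitness of a candidate = mean 4-fold cross-validation accuracy
    model = RandomForestClassifier(**config)
    return cross_val_score(model, X_train, y_train, cv=4, scoring='accuracy').mean()

population = [random_config() for _ in range(8)]
for generation in range(3):
    # selection: keep the better-performing half of the population
    population.sort(key=score, reverse=True)
    survivors = population[:4]
    # mutation: each offspring copies a survivor and changes one hyperparameter at random
    offspring = []
    for parent in survivors:
        child = dict(parent)
        key = random.choice(list(search_space))
        child[key] = random.choice(search_space[key])
        offspring.append(child)
    population = survivors + offspring

best = max(population, key=score)
print(best)

Running it prints one reasonably good hyperparameter combination; TPOT automates exactly this kind of loop, with crossover as well, over a much larger search space.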

import numpy as np
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt','log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the hyperparameter search space
param = {'n_estimators': n_estimators,
         'max_features': max_features,
         'max_depth': max_depth,
         'min_samples_split': min_samples_split,
         'min_samples_leaf': min_samples_leaf,
         'criterion': ['entropy', 'gini']}
print(param)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000], 'min_samples_split': [2, 5, 10, 14], 'min_samples_leaf': [1, 2, 4, 6, 8], 'criterion': ['entropy', 'gini']}

from tpot import TPOTClassifier

tpot_classifier = TPOTClassifier(generations=5, population_size=24, offspring_size=12,
                                 verbosity=2, early_stop=12,
                                 config_dict={'sklearn.ensemble.RandomForestClassifier': param},
                                 cv=4, scoring='accuracy')
tpot_classifier.fit(X_train, y_train)



Generation 1 - Current best internal CV score: 0.8471568627450979
Generation 2 - Current best internal CV score: 0.8471568627450979
Generation 3 - Current best internal CV score: 0.8471568627450979
Generation 4 - Current best internal CV score: 0.8521568627450979
Generation 5 - Current best internal CV score: 0.8521568627450979
Best pipeline: RandomForestClassifier(RandomForestClassifier(input_matrix, criterion=entropy, max_depth=340, max_features=log2, min_samples_leaf=8, min_samples_split=2, n_estimators=800), criterion=entropy, max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=5, n_estimators=1600)





TPOTClassifier(config_dict={'sklearn.ensemble.RandomForestClassifier': {'criterion': ['entropy', 'gini'],
                                                                        'max_depth': [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000],
                                                                        'max_features': ['auto', 'sqrt', 'log2'],
                                                                        'min_samples_leaf': [1, 2, 4, 6, 8],
                                                                        'min_samples_split': [2, 5, 10, 14],
                                                                        'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}},
               cv=4, early_stop=12, generations=5,
               log_file=<ipykernel.iostream.OutStream object at 0x7f231c259550>,
               offspring_size=12, population_size=24, scoring='accuracy',
               verbosity=2)
accuracy = tpot_classifier.score(X_test, y_test)
print(accuracy)
0.85
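If you want to keep the winning configuration, TPOT can also write the best pipeline out as a standalone Python script via its export method (the file name below is just an example):

# writes a Python script that rebuilds the best pipeline found above
tpot_classifier.export('tpot_heart_disease_pipeline.py')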

Conclusion

In this article, hyperparameter tuning of a Random Forest Classifier using a genetic algorithm was implemented on a concrete use case. A brief introduction to the genetic algorithm was presented, along with sufficient insight into the use case.

Jupyter Notebook Link:

You can find the Jupyter notebook from the following link:

Genetic Algorithm

You can find my previous articles from the following links:

Text Summarization

Embed a Live Microsoft Power BI Report in Jupyter Notebook

Epoch to human readable date using Python

Feel free to use my open-source Machine Learning libraries:

tweetdatalib6

textSummarizationLibrary

EnglishNepaliTextGeneration

You can find my research papers from the following links:

  1. Experimental Analysis of the Changes in Speech while Normal Speaking, Walking, Running, and Eating
  2. Image Quality Assessment for Fake Biometric Detection: Application to Fingerprint and Face Recognition
  3. Coin Recognition System using Artificial Neural Network on Static Image Dataset
  4. Facial Expression Recognition
  5. Automatic Cloth Pattern and Color Recognition for Visually Impaired People using SVM Algorithm
  6. Pattern Recognition Techniques: A Review
  7. Speaker Diarization of Broadcast News Audios
  8. Implementation of Magnitude and Phase Spectrum Compensation to Achieve an Enhanced Speech Signal
  9. Twitter Sentiment Analysis using Machine Learning
  10. Pattern Recognition: Introduction

Follow on

LinkedIn

Github

Medium

Twitter


Written by Sakil Ansari

Working as a Data Scientist in ML, NLP, Speech Recognition, and Deep Learning.