Random Forest

Random Forest is an ensemble of Decision Trees. With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.
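To make the connection to bagging concrete, the sketch below builds a BaggingClassifier of decision trees that each consider a random feature subset at every split, which behaves roughly like a RandomForestClassifier with comparable settings. It is illustrative only; the hyperparameter values are assumptions, not taken from this case study.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagged trees that each pick from a random subset of features at every split
# behave much like a RandomForestClassifier with comparable settings.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Run the cell below to visualize a single estimator from a random forest model, using the Iris dataset to classify the data into the appropriate species. Rendering the tree requires Graphviz, which can be installed with, for example: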

conda install graphviz

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

# Model (can also use single decision tree)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)

# Train
model.fit(iris.data, iris.target)
# Extract single tree
estimator = model.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = iris.feature_names,
                class_names = iris.target_names,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
Out[1]:

Notice how each split separates the data into buckets of similar observations. This is a single tree on a relatively simple classification dataset, but the same method applies to more complex datasets with deeper trees.

Coronavirus

Coronavirus disease (COVID-19) is an infectious disease caused by a new virus. The disease causes respiratory illness (like the flu) with symptoms such as a cough, fever, and, in more severe cases, difficulty breathing. You can protect yourself by washing your hands frequently, avoiding touching your face, and avoiding close contact (1 meter or 3 feet) with people who are unwell. An outbreak of COVID-19 started in December 2019 and, at the time this project was created, was continuing to spread throughout the world. Many governments recommended only essential outings to public places and closed most businesses that do not serve food or sell essential items. An excellent spatial dashboard built by Johns Hopkins shows the daily confirmed cases by country.

This case study was designed to drive home the important role that data science plays in real-world situations like this pandemic. It uses the Random Forest Classifier and a Kaggle dataset of South Korean COVID-19 cases to encourage research on this important topic. The goal of the case study is to build a Random Forest Classifier to predict the 'state' of the patient.

First, please load the needed packages and modules into Python. Next, load the data into a pandas dataframe for ease of use.

In [2]:
import os
import pandas as pd
from datetime import datetime,timedelta
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import plotly.graph_objects as go
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
In [3]:
url ='SouthKoreacoronavirusdataset/PatientInfo.csv'
df = pd.read_csv(url)
df.head()
Out[3]:
patient_id global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state
0 1000000001 2.0 male 1964.0 50s Korea Seoul Gangseo-gu NaN overseas inflow 1.0 NaN 75.0 2020-01-22 2020-01-23 2020-02-05 NaN released
1 1000000002 5.0 male 1987.0 30s Korea Seoul Jungnang-gu NaN overseas inflow 1.0 NaN 31.0 NaN 2020-01-30 2020-03-02 NaN released
2 1000000003 6.0 male 1964.0 50s Korea Seoul Jongno-gu NaN contact with patient 2.0 2.002000e+09 17.0 NaN 2020-01-30 2020-02-19 NaN released
3 1000000004 7.0 male 1991.0 20s Korea Seoul Mapo-gu NaN overseas inflow 1.0 NaN 9.0 2020-01-26 2020-01-30 2020-02-15 NaN released
4 1000000005 9.0 female 1992.0 20s Korea Seoul Seongbuk-gu NaN contact with patient 2.0 1.000000e+09 2.0 NaN 2020-01-31 2020-02-24 NaN released
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          2218 non-null   int64  
 1   global_num          1314 non-null   float64
 2   sex                 2073 non-null   object 
 3   birth_year          1764 non-null   float64
 4   age                 1957 non-null   object 
 5   country             2218 non-null   object 
 6   province            2218 non-null   object 
 7   city                2153 non-null   object 
 8   disease             19 non-null     object 
 9   infection_case      1163 non-null   object 
 10  infection_order     42 non-null     float64
 11  infected_by         469 non-null    float64
 12  contact_number      411 non-null    float64
 13  symptom_onset_date  193 non-null    object 
 14  confirmed_date      2077 non-null   object 
 15  released_date       223 non-null    object 
 16  deceased_date       32 non-null     object 
 17  state               2130 non-null   object 
dtypes: float64(5), int64(1), object(12)
memory usage: 312.0+ KB
In [5]:
df.shape
Out[5]:
(2218, 18)
In [6]:
#Counts of null values 
na_df=pd.DataFrame(df.isnull().sum().sort_values(ascending=False)).reset_index()
na_df.columns = ['VarName', 'NullCount']
na_df[(na_df['NullCount']>0)]
Out[6]:
VarName NullCount
0 disease 2199
1 deceased_date 2186
2 infection_order 2176
3 symptom_onset_date 2025
4 released_date 1995
5 contact_number 1807
6 infected_by 1749
7 infection_case 1055
8 global_num 904
9 birth_year 454
10 age 261
11 sex 145
12 confirmed_date 141
13 state 88
14 city 65
In [7]:
#counts of response variable values
df.state.value_counts()
Out[7]:
isolated    1791
released     307
deceased      32
Name: state, dtype: int64

Create a new column named 'n_age': the patient's approximate age, calculated from the birth year and the year of the confirmed date.

In [8]:
def calAge(cols):
    """Approximate age at confirmation: year of confirmed_date minus birth_year."""
    confirmed_date, birth_year = cols[0], cols[1]
    if pd.isnull(confirmed_date) or pd.isnull(birth_year):
        return np.nan
    # confirmed_date is a string like '2020-01-22'; its first four characters are the year
    return int(confirmed_date[0:4]) - int(birth_year)
In [9]:
df['n_age'] = df[['confirmed_date','birth_year']].apply(calAge,axis=1)

Handle Missing Values

Print the number of missing values by column.

In [10]:
missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count', ascending=False)
Out[10]:
count %
disease 2199 99.143372
deceased_date 2186 98.557259
infection_order 2176 98.106402
symptom_onset_date 2025 91.298467
released_date 1995 89.945897
contact_number 1807 81.469793
infected_by 1749 78.854824
infection_case 1055 47.565374
global_num 904 40.757439
n_age 455 20.513977
birth_year 454 20.468891
age 261 11.767358
sex 145 6.537421
confirmed_date 141 6.357078
state 88 3.967538
city 65 2.930568
province 0 0.000000
country 0 0.000000
patient_id 0 0.000000

Fill the 'disease' missing values with 0 and remap the True values to 1.

In [11]:
df['disease'].replace({np.nan:0,True:1},inplace=True)
df['disease'].value_counts()
Out[11]:
0    2199
1      19
Name: disease, dtype: int64

Fill null values in the following columns with their mean: 'global_num', 'birth_year', 'infection_order', 'infected_by', and 'contact_number'.

In [12]:
df.describe()
Out[12]:
patient_id global_num birth_year disease infection_order infected_by contact_number n_age
count 2.218000e+03 1314.000000 1764.000000 2218.000000 42.000000 4.690000e+02 411.000000 1763.000000
mean 4.014678e+09 4664.816591 1974.988662 0.008566 2.285714 2.600789e+09 24.128954 44.997164
std 2.192419e+09 2874.044464 19.412643 0.092178 1.254955 1.570638e+09 91.087792 19.409018
min 1.000000e+09 1.000000 1916.000000 0.000000 1.000000 1.000000e+09 0.000000 0.000000
25% 1.700000e+09 1908.500000 1962.000000 0.000000 1.250000 1.200000e+09 2.000000 27.000000
50% 6.001000e+09 5210.500000 1974.500000 0.000000 2.000000 2.000000e+09 5.000000 45.000000
75% 6.004000e+09 7481.500000 1993.000000 0.000000 3.000000 4.100000e+09 16.000000 58.000000
max 7.000000e+09 8717.000000 2020.000000 1.000000 6.000000 6.113000e+09 1160.000000 104.000000
In [13]:
meanFills = ['global_num','birth_year','infection_order','infected_by','contact_number']
for col in meanFills:
    df[col].replace({np.nan:df[col].mean()},inplace=True)
df[meanFills].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   global_num       2218 non-null   float64
 1   birth_year       2218 non-null   float64
 2   infection_order  2218 non-null   float64
 3   infected_by      2218 non-null   float64
 4   contact_number   2218 non-null   float64
dtypes: float64(5)
memory usage: 86.8 KB
In [14]:
df.describe()
Out[14]:
patient_id global_num birth_year disease infection_order infected_by contact_number n_age
count 2.218000e+03 2218.000000 2218.000000 2218.000000 2218.000000 2.218000e+03 2218.000000 1763.000000
mean 4.014678e+09 4664.816591 1974.988662 0.008566 2.285714 2.600789e+09 24.128954 44.997164
std 2.192419e+09 2211.785463 17.311232 0.092178 0.170662 7.216328e+08 39.171414 19.409018
min 1.000000e+09 1.000000 1916.000000 0.000000 1.000000 1.000000e+09 0.000000 0.000000
25% 1.700000e+09 4205.250000 1965.000000 0.000000 2.285714 2.600789e+09 24.128954 27.000000
50% 6.001000e+09 4664.816591 1974.988662 0.000000 2.285714 2.600789e+09 24.128954 45.000000
75% 6.004000e+09 5900.250000 1988.000000 0.000000 2.285714 2.600789e+09 24.128954 58.000000
max 7.000000e+09 8717.000000 2020.000000 1.000000 6.000000 6.113000e+09 1160.000000 104.000000

Fill the rest of the missing values with any method.

In [15]:
missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing = missing[missing['count']>0]
missing.sort_values(by='count', ascending=False)
Out[15]:
count %
deceased_date 2186 98.557259
symptom_onset_date 2025 91.298467
released_date 1995 89.945897
infection_case 1055 47.565374
n_age 455 20.513977
age 261 11.767358
sex 145 6.537421
confirmed_date 141 6.357078
state 88 3.967538
city 65 2.930568
In [16]:
df.mode().iloc[0]
Out[16]:
patient_id                      1000000001
global_num                     4664.816591
sex                                 female
birth_year                     1974.988662
age                                    20s
country                              Korea
province                  Gyeongsangbuk-do
city                          Gyeongsan-si
disease                                0.0
infection_case        contact with patient
infection_order                   2.285714
infected_by              2600788987.586354
contact_number                   24.128954
symptom_onset_date              2020-02-27
confirmed_date                  2020-03-01
released_date                   2020-03-13
deceased_date                   2020-02-23
state                             isolated
n_age                                 51.0
Name: 0, dtype: object
In [17]:
df.fillna(df.mode().iloc[0],inplace=True)
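The mode fill above is the simplest option. The IterativeImputer and ExtraTreesRegressor imported at the top of the notebook offer a model-based alternative for the numeric columns; the sketch below shows how it could be wired up. It would need to run before the mode fill, and the parameter values are illustrative, not part of the original pipeline.

# Sketch: model-based imputation of the numeric columns (run before the mode fill)
num_cols = df.select_dtypes(include='number').columns
imputer = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=1),
                           max_iter=10, random_state=1)
df[num_cols] = imputer.fit_transform(df[num_cols])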

Check for any remaining null values.

In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          2218 non-null   int64  
 1   global_num          2218 non-null   float64
 2   sex                 2218 non-null   object 
 3   birth_year          2218 non-null   float64
 4   age                 2218 non-null   object 
 5   country             2218 non-null   object 
 6   province            2218 non-null   object 
 7   city                2218 non-null   object 
 8   disease             2218 non-null   int64  
 9   infection_case      2218 non-null   object 
 10  infection_order     2218 non-null   float64
 11  infected_by         2218 non-null   float64
 12  contact_number      2218 non-null   float64
 13  symptom_onset_date  2218 non-null   object 
 14  confirmed_date      2218 non-null   object 
 15  released_date       2218 non-null   object 
 16  deceased_date       2218 non-null   object 
 17  state               2218 non-null   object 
 18  n_age               2218 non-null   float64
dtypes: float64(6), int64(2), object(11)
memory usage: 329.4+ KB
In [19]:
df.head()
Out[19]:
patient_id global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state n_age
0 1000000001 2.0 male 1964.0 50s Korea Seoul Gangseo-gu 0 overseas inflow 1.0 2.600789e+09 75.0 2020-01-22 2020-01-23 2020-02-05 2020-02-23 released 56.0
1 1000000002 5.0 male 1987.0 30s Korea Seoul Jungnang-gu 0 overseas inflow 1.0 2.600789e+09 31.0 2020-02-27 2020-01-30 2020-03-02 2020-02-23 released 33.0
2 1000000003 6.0 male 1964.0 50s Korea Seoul Jongno-gu 0 contact with patient 2.0 2.002000e+09 17.0 2020-02-27 2020-01-30 2020-02-19 2020-02-23 released 56.0
3 1000000004 7.0 male 1991.0 20s Korea Seoul Mapo-gu 0 overseas inflow 1.0 2.600789e+09 9.0 2020-01-26 2020-01-30 2020-02-15 2020-02-23 released 29.0
4 1000000005 9.0 female 1992.0 20s Korea Seoul Seongbuk-gu 0 contact with patient 2.0 1.000000e+09 2.0 2020-02-27 2020-01-31 2020-02-24 2020-02-23 released 28.0

Remove date columns from the data.

In [20]:
df = df.drop(['symptom_onset_date','confirmed_date','released_date','deceased_date'],axis =1)

Review the count of unique values by column.

In [21]:
print(df.nunique())
patient_id         2218
global_num         1304
sex                   2
birth_year           97
age                  11
country               4
province             17
city                134
disease               2
infection_case       16
infection_order       7
infected_by         207
contact_number       73
state                 3
n_age                96
dtype: int64

Review the percent of unique values by column.

In [22]:
print(df.nunique()/df.shape[0])
patient_id         1.000000
global_num         0.587917
sex                0.000902
birth_year         0.043733
age                0.004959
country            0.001803
province           0.007665
city               0.060415
disease            0.000902
infection_case     0.007214
infection_order    0.003156
infected_by        0.093327
contact_number     0.032913
state              0.001353
n_age              0.043282
dtype: float64

Review the range of values per column.

In [23]:
df.describe().T
Out[23]:
count mean std min 25% 50% 75% max
patient_id 2218.0 4.014678e+09 2.192419e+09 1.000000e+09 1.700000e+09 6.001000e+09 6.004000e+09 7.000000e+09
global_num 2218.0 4.664817e+03 2.211785e+03 1.000000e+00 4.205250e+03 4.664817e+03 5.900250e+03 8.717000e+03
birth_year 2218.0 1.974989e+03 1.731123e+01 1.916000e+03 1.965000e+03 1.974989e+03 1.988000e+03 2.020000e+03
disease 2218.0 8.566276e-03 9.217769e-02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
infection_order 2218.0 2.285714e+00 1.706622e-01 1.000000e+00 2.285714e+00 2.285714e+00 2.285714e+00 6.000000e+00
infected_by 2218.0 2.600789e+09 7.216328e+08 1.000000e+09 2.600789e+09 2.600789e+09 2.600789e+09 6.113000e+09
contact_number 2218.0 2.412895e+01 3.917141e+01 0.000000e+00 2.412895e+01 2.412895e+01 2.412895e+01 1.160000e+03
n_age 2218.0 4.622858e+01 1.747213e+01 0.000000e+00 3.200000e+01 5.100000e+01 5.500000e+01 1.040000e+02

Check for duplicated rows

In [24]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF
Out[24]:
patient_id global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number state n_age

Print the categorical columns and their associated levels.

In [25]:
dfo = df.select_dtypes(include=['object'], exclude=['datetime'])
dfo.shape
#get levels for all variables
vn = pd.DataFrame(dfo.nunique()).reset_index()
vn.columns = ['VarName', 'LevelsCount']
vn.sort_values(by='LevelsCount', ascending =False)
vn
Out[25]:
VarName LevelsCount
0 sex 2
1 age 11
2 country 4
3 province 17
4 city 134
5 infection_case 16
6 state 3

Plot the correlation heat map for the features.

In [26]:
sns.heatmap(df.corr(),annot=True,cmap='coolwarm',fmt=".1f",annot_kws={'size':10});

Plot the boxplots to check for outliers.

In [27]:
fig, axes = plt.subplots(nrows = 2,ncols = 4,figsize = (10,5), dpi=77)
plt.tight_layout()
for i,col in enumerate(df.select_dtypes(exclude = ['object']).columns):
    axes[i//4, i%4].boxplot(df[col])
    axes[i//4, i%4].set_xlabel(col)
    axes[i//4, i%4].set_xticklabels([])
plt.show()

Create dummy features for object type features.

In [28]:
df = pd.get_dummies(df,drop_first=True,prefix='')
df.head()
Out[28]:
patient_id global_num birth_year disease infection_order infected_by contact_number n_age _male _100s ... _Seongdong-gu APT _Shincheonji Church _Suyeong-gu Kindergarten _contact with patient _etc _gym facility in Cheonan _gym facility in Sejong _overseas inflow _isolated _released
0 1000000001 2.0 1964.0 0 1.0 2.600789e+09 75.0 56.0 1 0 ... 0 0 0 0 0 0 0 1 0 1
1 1000000002 5.0 1987.0 0 1.0 2.600789e+09 31.0 33.0 1 0 ... 0 0 0 0 0 0 0 1 0 1
2 1000000003 6.0 1964.0 0 2.0 2.002000e+09 17.0 56.0 1 0 ... 0 0 0 1 0 0 0 0 0 1
3 1000000004 7.0 1991.0 0 1.0 2.600789e+09 9.0 29.0 1 0 ... 0 0 0 0 0 0 0 1 0 1
4 1000000005 9.0 1992.0 0 2.0 1.000000e+09 2.0 28.0 0 0 ... 0 0 0 1 0 0 0 0 0 1

5 rows × 188 columns

Split the data into test and train subsamples

In [29]:
df.drop('disease',axis=1)
Out[29]:
patient_id global_num birth_year infection_order infected_by contact_number n_age _male _100s _10s ... _Seongdong-gu APT _Shincheonji Church _Suyeong-gu Kindergarten _contact with patient _etc _gym facility in Cheonan _gym facility in Sejong _overseas inflow _isolated _released
0 1000000001 2.000000 1964.0 1.000000 2.600789e+09 75.000000 56.0 1 0 0 ... 0 0 0 0 0 0 0 1 0 1
1 1000000002 5.000000 1987.0 1.000000 2.600789e+09 31.000000 33.0 1 0 0 ... 0 0 0 0 0 0 0 1 0 1
2 1000000003 6.000000 1964.0 2.000000 2.002000e+09 17.000000 56.0 1 0 0 ... 0 0 0 1 0 0 0 0 0 1
3 1000000004 7.000000 1991.0 1.000000 2.600789e+09 9.000000 29.0 1 0 0 ... 0 0 0 0 0 0 0 1 0 1
4 1000000005 9.000000 1992.0 2.000000 1.000000e+09 2.000000 28.0 0 0 0 ... 0 0 0 1 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2213 6100000085 4664.816591 1990.0 2.285714 2.600789e+09 24.128954 30.0 1 0 0 ... 0 0 0 1 0 0 0 0 1 0
2214 7000000001 139.000000 1998.0 2.285714 2.600789e+09 87.000000 22.0 1 0 0 ... 0 0 0 0 1 0 0 0 1 0
2215 7000000002 222.000000 1998.0 2.285714 2.600789e+09 84.000000 22.0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
2216 7000000003 4345.000000 1972.0 2.285714 2.600789e+09 21.000000 48.0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
2217 7000000004 5534.000000 1974.0 2.285714 2.600789e+09 74.000000 46.0 1 0 0 ... 0 0 0 0 1 0 0 0 1 0

2218 rows × 187 columns

In [30]:
from sklearn.model_selection import train_test_split

# define X (features) and y (target); this run uses the binary 'disease' column as the target
X=df.drop('disease',axis=1)
y=df.disease

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
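Because there are only 19 positive 'disease' cases, a plain random split can leave the test set with very few positives. A stratified split, sketched below as an optional alternative to the cell above (not part of the original notebook), keeps the class ratio roughly equal in both subsets.

# Optional alternative (sketch): stratified split preserves the rare positive-class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)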

Scale data to prep for model creation

In [31]:
#scale data
from sklearn import preprocessing
import numpy as np
# build scaler based on training data and apply it to test data to then also scale the test data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)
In [32]:
from sklearn.metrics import precision_recall_curve, f1_score, auc
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, log_loss
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

Fit Random Forest Classifier

The fitted model reaches an overall accuracy of about 99% on the test set, indicating it can classify patients in the South Korea dataset. Because the target is heavily imbalanced, however, accuracy alone can overstate performance, so the per-class results and confusion matrices below deserve a close look.

In [33]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=300, random_state = 1,n_jobs=-1)
model_res = clf.fit(X_train_scaled, y_train)
y_pred = model_res.predict(X_test_scaled)
y_pred_prob = model_res.predict_proba(X_test_scaled)
lr_probs = y_pred_prob[:,1]
ac = accuracy_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test, y_pred)

print('Random Forest: Accuracy=%.3f' % (ac))

print('Random Forest: f1-score=%.3f' % (f1))
Random Forest: Accuracy=0.989
Random Forest: f1-score=0.985
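Because only 19 of the 2,218 patients have disease == 1, accuracy is dominated by the majority class. The short sketch below, which is an addition rather than part of the original notebook, uses the classification_report imported earlier to show per-class precision, recall, and F1.

# Per-class precision / recall / F1 for the imbalanced 'disease' target
print(classification_report(y_test, y_pred, digits=3))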

Create Confusion Matrix Plots

Confusion matrices are a great way to review model performance in a multi-class classification problem. Identifying which class the misclassified observations end up in helps you decide whether you need to build additional features to improve the overall model. In the example below we plot a regular counts confusion matrix as well as a row-normalized (percentage) confusion matrix. The percentage confusion matrix is particularly helpful when you have unbalanced class sizes.

In [34]:
class_names = ['0 (no disease)', '1 (disease)']  # labels for the binary 'disease' target fitted above
In [35]:
import itertools
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
#plt.savefig('figures/RF_cm_multi_class.png')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
#plt.savefig('figures/RF_cm_proportion_multi_class.png', bbox_inches="tight")
plt.show()
Confusion matrix, without normalization
[[439   1]
 [  4   0]]
Normalized confusion matrix
[[1. 0.]
 [1. 0.]]
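For reference, recent scikit-learn releases (1.0 and later) include ConfusionMatrixDisplay, which produces a comparable plot without a custom helper. This is a sketch added here for convenience, not part of the original case study.

from sklearn.metrics import ConfusionMatrixDisplay

# Row-normalized confusion matrix for the binary 'disease' predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, normalize='true', cmap=plt.cm.Blues)
plt.show()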

Plot feature importances

The random forest algorithm can be used as a regression or classification model. In either case it tends to be a bit of a black box, where understanding what's happening under the hood can be difficult. Plotting the feature importances is one way that you can gain a perspective on which features are driving the model predictions.

In [36]:
feature_importance = clf.feature_importances_
# make importances relative to the maximum importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
# indices of the 30 most important features, in ascending order for plotting
sorted_idx = np.argsort(feature_importance)[-30:]

pos = np.arange(sorted_idx.shape[0]) + .5
print(pos.size)
sorted_idx.size
plt.figure(figsize=(10,10))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
30
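Impurity-based importances can be inflated for high-cardinality or strongly correlated features, and most columns here are correlated dummy variables. As a complementary check (a sketch, not part of the original notebook; the parameter values are illustrative), permutation importance measures how much test-set accuracy drops when each feature is shuffled.

from sklearn.inspection import permutation_importance

# Mean drop in test accuracy when each feature is shuffled
result = permutation_importance(clf, X_test_scaled, y_test,
                                n_repeats=5, random_state=1, n_jobs=-1)
top = result.importances_mean.argsort()[-10:][::-1]
for i in top:
    print(f"{X.columns[i]:<35s} {result.importances_mean[i]:.4f}")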

The popularity of random forest is primarily due to how well it performs in a multitude of data situations. It tends to handle highly correlated features well, whereas a linear regression model would not. This case study demonstrates that ability with a fairly small set of underlying features, many of which are highly correlated with one another. Random forest is also an efficient way to investigate the importance of a set of features in a large dataset. Consider random forest one of your first choices when building a tree-based model, especially for multiclass classification.