Random Forest

Random Forest is an ensemble of Decision Trees. With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.
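To make the connection to bagging concrete, the sketch below builds a BaggingClassifier of decision trees that each consider a random feature subset at every split, which behaves roughly like a RandomForestClassifier with comparable settings. It is illustrative only; the hyperparameter values are assumptions, not taken from this case study.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagged trees that each pick from a random subset of features at every split
# behave much like a RandomForestClassifier with comparable settings.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Run the cell below to visualize a single estimator from a random forest model, using the Iris dataset to classify the data into the appropriate species. Rendering the tree requires Graphviz, which can be installed with, for example: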

conda install graphviz

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

# Model (can also use single decision tree)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)

# Train
model.fit(iris.data, iris.target)
# Extract single tree
estimator = model.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = iris.feature_names,
                class_names = iris.target_names,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
Out[1]:

Notice how each split separates the data into buckets of similar observations. This is a single tree on a relatively simple classification dataset, but the same method applies to more complex datasets with deeper trees.

Coronavirus

Coronavirus disease (COVID-19) is an infectious disease caused by a new virus. The disease causes respiratory illness (like the flu) with symptoms such as a cough, fever, and, in more severe cases, difficulty breathing. You can protect yourself by washing your hands frequently, avoiding touching your face, and avoiding close contact (1 meter or 3 feet) with people who are unwell. An outbreak of COVID-19 started in December 2019 and, at the time this project was created, was continuing to spread throughout the world. Many governments recommended only essential outings to public places and closed most businesses that do not serve food or sell essential items. An excellent spatial dashboard built by Johns Hopkins shows the daily confirmed cases by country.

This case study was designed to drive home the important role that data science plays in real-world situations like this pandemic. It uses the Random Forest Classifier and a Kaggle dataset of South Korean COVID-19 cases to encourage research on this important topic. The goal of the case study is to build a Random Forest Classifier to predict the 'state' of the patient.

First, please load the needed packages and modules into Python. Next, load the data into a pandas dataframe for ease of use.

In [2]:
import os
import pandas as pd
from datetime import datetime,timedelta
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import plotly.graph_objects as go
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
In [3]:
url ='SouthKoreacoronavirusdataset/PatientInfo.csv'
df = pd.read_csv(url)
df.head()
Out[3]:
patient_id global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state
0 1000000001 2.0 male 1964.0 50s Korea Seoul Gangseo-gu NaN overseas inflow 1.0 NaN 75.0 2020-01-22 2020-01-23 2020-02-05 NaN released
1 1000000002 5.0 male 1987.0 30s Korea Seoul Jungnang-gu NaN overseas inflow 1.0 NaN 31.0 NaN 2020-01-30 2020-03-02 NaN released
2 1000000003 6.0 male 1964.0 50s Korea Seoul Jongno-gu NaN contact with patient 2.0 2.002000e+09 17.0 NaN 2020-01-30 2020-02-19 NaN released
3 1000000004 7.0 male 1991.0 20s Korea Seoul Mapo-gu NaN overseas inflow 1.0 NaN 9.0 2020-01-26 2020-01-30 2020-02-15 NaN released
4 1000000005 9.0 female 1992.0 20s Korea Seoul Seongbuk-gu NaN contact with patient 2.0 1.000000e+09 2.0 NaN 2020-01-31 2020-02-24 NaN released
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          2218 non-null   int64  
 1   global_num          1314 non-null   float64
 2   sex                 2073 non-null   object 
 3   birth_year          1764 non-null   float64
 4   age                 1957 non-null   object 
 5   country             2218 non-null   object 
 6   province            2218 non-null   object 
 7   city                2153 non-null   object 
 8   disease             19 non-null     object 
 9   infection_case      1163 non-null   object 
 10  infection_order     42 non-null     float64
 11  infected_by         469 non-null    float64
 12  contact_number      411 non-null    float64
 13  symptom_onset_date  193 non-null    object 
 14  confirmed_date      2077 non-null   object 
 15  released_date       223 non-null    object 
 16  deceased_date       32 non-null     object 
 17  state               2130 non-null   object 
dtypes: float64(5), int64(1), object(12)
memory usage: 312.0+ KB
In [5]:
df.shape
Out[5]:
(2218, 18)
In [6]:
#Counts of null values 
na_df=pd.DataFrame(df.isnull().sum().sort_values(ascending=False)).reset_index()
na_df.columns = ['VarName', 'NullCount']
na_df[(na_df['NullCount']>0)]
Out[6]:
VarName NullCount
0 disease 2199
1 deceased_date 2186
2 infection_order 2176
3 symptom_onset_date 2025
4 released_date 1995
5 contact_number 1807
6 infected_by 1749
7 infection_case 1055
8 global_num 904
9 birth_year 454
10 age 261
11 sex 145
12 confirmed_date 141
13 state 88
14 city 65
In [7]:
#counts of response variable values
df.state.value_counts()
Out[7]:
isolated    1791
released     307
deceased      32
Name: state, dtype: int64

Create a new column named 'n_age': the patient's approximate age, calculated from the birth year and the year of the confirmed date.

In [8]:
def calAge(cols):
    """Approximate age at confirmation: year of confirmed_date minus birth_year."""
    confirmed_date, birth_year = cols[0], cols[1]
    if pd.isnull(confirmed_date) or pd.isnull(birth_year):
        return np.nan
    # confirmed_date is a string like '2020-01-22'; its first four characters are the year
    return int(confirmed_date[0:4]) - int(birth_year)
In [9]:
df['n_age'] = df[['confirmed_date','birth_year']].apply(calAge,axis=1)

Handle Missing Values

Print the number of missing values by column.

In [10]:
missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count', ascending=False)
Out[10]:
count %
disease 2199 99.143372
deceased_date 2186 98.557259
infection_order 2176 98.106402
symptom_onset_date 2025 91.298467
released_date 1995 89.945897
contact_number 1807 81.469793
infected_by 1749 78.854824
infection_case 1055 47.565374
global_num 904 40.757439
n_age 455 20.513977
birth_year 454 20.468891
age 261 11.767358
sex 145 6.537421
confirmed_date 141 6.357078
state 88 3.967538
city 65 2.930568
province 0 0.000000
country 0 0.000000
patient_id 0 0.000000

Fill the 'disease' missing values with 0 and remap the True values to 1.

In [11]:
df['disease'].replace({np.nan:0,True:1},inplace=True)
df['disease'].value_counts()
Out[11]:
0    2199
1      19
Name: disease, dtype: int64

Fill null values in the following columns with their mean: 'global_num', 'birth_year', 'infection_order', 'infected_by', and 'contact_number'.

In [12]:
df.describe()
Out[12]:
patient_id global_num birth_year disease infection_order infected_by contact_number n_age
count 2.218000e+03 1314.000000 1764.000000 2218.000000 42.000000 4.690000e+02 411.000000 1763.000000
mean 4.014678e+09 4664.816591 1974.988662 0.008566 2.285714 2.600789e+09 24.128954 44.997164
std 2.192419e+09 2874.044464 19.412643 0.092178 1.254955 1.570638e+09 91.087792 19.409018
min 1.000000e+09 1.000000 1916.000000 0.000000 1.000000 1.000000e+09 0.000000 0.000000
25% 1.700000e+09 1908.500000 1962.000000 0.000000 1.250000 1.200000e+09 2.000000 27.000000
50% 6.001000e+09 5210.500000 1974.500000 0.000000 2.000000 2.000000e+09 5.000000 45.000000
75% 6.004000e+09 7481.500000 1993.000000 0.000000 3.000000 4.100000e+09 16.000000 58.000000
max 7.000000e+09 8717.000000 2020.000000 1.000000 6.000000 6.113000e+09 1160.000000 104.000000
In [13]:
meanFills = ['global_num','birth_year','infection_order','infected_by','contact_number']
for col in meanFills:
    df[col].replace({np.nan:df[col].mean()},inplace=True)
df[meanFills].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   global_num       2218 non-null   float64
 1   birth_year       2218 non-null   float64
 2   infection_order  2218 non-null   float64
 3   infected_by      2218 non-null   float64
 4   contact_number   2218 non-null   float64
dtypes: float64(5)
memory usage: 86.8 KB
In [14]:
df.describe()
Out[14]:
patient_id global_num birth_year disease infection_order infected_by contact_number n_age
count 2.218000e+03 2218.000000 2218.000000 2218.000000 2218.000000 2.218000e+03 2218.000000 1763.000000
mean 4.014678e+09 4664.816591 1974.988662 0.008566 2.285714 2.600789e+09 24.128954 44.997164
std 2.192419e+09 2211.785463 17.311232 0.092178 0.170662 7.216328e+08 39.171414 19.409018
min 1.000000e+09 1.000000 1916.000000 0.000000 1.000000 1.000000e+09 0.000000 0.000000
25% 1.700000e+09 4205.250000 1965.000000 0.000000 2.285714 2.600789e+09 24.128954 27.000000
50% 6.001000e+09 4664.816591 1974.988662 0.000000 2.285714 2.600789e+09 24.128954 45.000000
75% 6.004000e+09 5900.250000 1988.000000 0.000000 2.285714 2.600789e+09 24.128954 58.000000
max 7.000000e+09 8717.000000 2020.000000 1.000000 6.000000 6.113000e+09 1160.000000 104.000000

Fill the rest of the missing values with any method.

In [15]:
missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing = missing[missing['count']>0]
missing.sort_values(by='count', ascending=False)
Out[15]:
count %
deceased_date 2186 98.557259
symptom_onset_date 2025 91.298467
released_date 1995 89.945897
infection_case 1055 47.565374
n_age 455 20.513977
age 261 11.767358
sex 145 6.537421
confirmed_date 141 6.357078
state 88 3.967538
city 65 2.930568
In [16]:
df.mode().iloc[0]
Out[16]:
patient_id                      1000000001
global_num                     4664.816591
sex                                 female
birth_year                     1974.988662
age                                    20s
country                              Korea
province                  Gyeongsangbuk-do
city                          Gyeongsan-si
disease                                0.0
infection_case        contact with patient
infection_order                   2.285714
infected_by              2600788987.586354
contact_number                   24.128954
symptom_onset_date              2020-02-27
confirmed_date                  2020-03-01
released_date                   2020-03-13
deceased_date                   2020-02-23
state                             isolated
n_age                                 51.0
Name: 0, dtype: object
In [17]:
df.fillna(df.mode().iloc[0],inplace=True)
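The mode fill above is the simplest option. The IterativeImputer and ExtraTreesRegressor imported at the top of the notebook offer a model-based alternative for the numeric columns; the sketch below shows how it could be wired up. It would need to run before the mode fill, and the parameter values are illustrative, not part of the original pipeline.

# Sketch: model-based imputation of the numeric columns (run before the mode fill)
num_cols = df.select_dtypes(include='number').columns
imputer = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=1),
                           max_iter=10, random_state=1)
df[num_cols] = imputer.fit_transform(df[num_cols])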

Check for any remaining null values.

In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   patient_id          2218 non-null   int64  
 1   global_num          2218 non-null   float64
 2   sex                 2218 non-null   object 
 3   birth_year          2218 non-null   float64
 4   age                 2218 non-null   object 
 5   country             2218 non-null   object 
 6   province            2218 non-null   object 
 7   city                2218 non-null   object 
 8   disease             2218 non-null   int64  
 9   infection_case      2218 non-null   object 
 10  infection_order     2218 non-null   float64
 11  infected_by         2218 non-null   float64
 12  contact_number      2218 non-null   float64
 13  symptom_onset_date  2218 non-null   object 
 14  confirmed_date      2218 non-null   object 
 15  released_date       2218 non-null   object 
 16  deceased_date       2218 non-null   object 
 17  state               2218 non-null   object 
 18  n_age               2218 non-null   float64
dtypes: float64(6), int64(2), object(11)
memory usage: 329.4+ KB
In [19]:
df.head()
Out[19]:
patient_id global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number symptom_onset_date confirmed_date released_date deceased_date state n_age
0 1000000001 2.0 male 1964.0 50s Korea Seoul Gangseo-gu 0 overseas inflow 1.0 2.600789e+09 75.0 2020-01-22 2020-01-23 2020-02-05 2020-02-23 released 56.0
1 1000000002 5.0 male 1987.0 30s Korea Seoul Jungnang-gu 0 overseas inflow 1.0 2.600789e+09 31.0 2020-02-27 2020-01-30 2020-03-02 2020-02-23 released 33.0
2 1000000003 6.0 male 1964.0 50s Korea Seoul Jongno-gu 0 contact with patient 2.0 2.002000e+09 17.0 2020-02-27 2020-01-30 2020-02-19 2020-02-23 released 56.0
3 1000000004 7.0 male 1991.0 20s Korea Seoul Mapo-gu 0 overseas inflow 1.0 2.600789e+09 9.0 2020-01-26 2020-01-30 2020-02-15 2020-02-23 released 29.0
4 1000000005 9.0 female 1992.0 20s Korea Seoul Seongbuk-gu 0 contact with patient 2.0 1.000000e+09 2.0 2020-02-27 2020-01-31 2020-02-24 2020-02-23 released 28.0

Remove date columns from the data.

In [20]:
df = df.drop(['symptom_onset_date','confirmed_date','released_date','deceased_date'],axis =1)

Review the count of unique values by column.

In [21]:
print(df.nunique())
patient_id         2218
global_num         1304
sex                   2
birth_year           97
age                  11
country               4
province             17
city                134
disease               2
infection_case       16
infection_order       7
infected_by         207
contact_number       73
state                 3
n_age                96
dtype: int64

Review the percent of unique values by column.

In [22]:
print(df.nunique()/df.shape[0])
patient_id         1.000000
global_num         0.587917
sex                0.000902
birth_year         0.043733
age                0.004959
country            0.001803
province           0.007665
city               0.060415
disease            0.000902
infection_case     0.007214
infection_order    0.003156
infected_by        0.093327
contact_number     0.032913
state              0.001353
n_age              0.043282
dtype: float64

Review the range of values per column.

In [23]:
df.describe().T
Out[23]:
count mean std min 25% 50% 75% max
patient_id 2218.0 4.014678e+09 2.192419e+09 1.000000e+09 1.700000e+09 6.001000e+09 6.004000e+09 7.000000e+09
global_num 2218.0 4.664817e+03 2.211785e+03 1.000000e+00 4.205250e+03 4.664817e+03 5.900250e+03 8.717000e+03
birth_year 2218.0 1.974989e+03 1.731123e+01 1.916000e+03 1.965000e+03 1.974989e+03 1.988000e+03 2.020000e+03
disease 2218.0 8.566276e-03 9.217769e-02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
infection_order 2218.0 2.285714e+00 1.706622e-01 1.000000e+00 2.285714e+00 2.285714e+00 2.285714e+00 6.000000e+00
infected_by 2218.0 2.600789e+09 7.216328e+08 1.000000e+09 2.600789e+09 2.600789e+09 2.600789e+09 6.113000e+09
contact_number 2218.0 2.412895e+01 3.917141e+01 0.000000e+00 2.412895e+01 2.412895e+01 2.412895e+01 1.160000e+03
n_age 2218.0 4.622858e+01 1.747213e+01 0.000000e+00 3.200000e+01 5.100000e+01 5.500000e+01 1.040000e+02

Check for duplicated rows

In [24]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF
Out[24]:
patient_id global_num sex birth_year age country province city disease infection_case infection_order infected_by contact_number state n_age

Print the categorical columns and their associated levels.

In [25]:
dfo = df.select_dtypes(include=['object'], exclude=['datetime'])
dfo.shape
#get levels for all variables
vn = pd.DataFrame(dfo.nunique()).reset_index()
vn.columns = ['VarName', 'LevelsCount']
vn.sort_values(by='LevelsCount', ascending =False)
vn
Out[25]:
VarName LevelsCount
0 sex 2
1 age 11
2 country 4
3 province 17
4 city 134
5 infection_case 16
6 state 3

Plot the correlation heat map for the features.

In [26]:
sns.heatmap(df.corr(),annot=True,cmap='coolwarm',fmt=".1f",annot_kws={'size':10});

Plot the boxplots to check for outliers.

In [27]:
fig, axes = plt.subplots(nrows = 2,ncols = 4,figsize = (10,5), dpi=77)
plt.tight_layout()
for i,col in enumerate(df.select_dtypes(exclude = ['object']).columns):
    axes[i//4, i%4].boxplot(df[col])
    axes[i//4, i%4].set_xlabel(col)
    axes[i//4, i%4].set_xticklabels([])
plt.show()

Create dummy features for object type features.

In [28]:
df = pd.get_dummies(df,drop_first=True,prefix='')
df.head()
Out[28]:
patient_id global_num birth_year disease infection_order infected_by contact_number n_age _male _100s ... _Seongdong-gu APT _Shincheonji Church _Suyeong-gu Kindergarten _contact with patient _etc _gym facility in Cheonan _gym facility in Sejong _overseas inflow _isolated _released
0 1000000001 2.0 1964.0 0 1.0 2.600789e+09 75.0 56.0 1 0 ... 0 0 0 0 0 0 0 1 0 1
1 1000000002 5.0 1987.0 0 1.0 2.600789e+09 31.0 33.0 1 0 ... 0 0 0 0 0 0 0 1 0 1
2 1000000003 6.0 1964.0 0 2.0 2.002000e+09 17.0 56.0 1 0 ... 0 0 0 1 0 0 0 0 0 1
3 1000000004 7.0 1991.0 0 1.0 2.600789e+09 9.0 29.0 1 0 ... 0 0 0 0 0 0 0 1 0 1
4 1000000005 9.0 1992.0 0 2.0 1.000000e+09 2.0 28.0 0 0 ... 0 0 0 1 0 0 0 0 0 1

5 rows × 188 columns

Split the data into test and train subsamples

In [29]:
df.drop('disease',axis=1)
Out[29]:
patient_id global_num birth_year infection_order infected_by contact_number n_age _male _100s _10s ... _Seongdong-gu APT _Shincheonji Church _Suyeong-gu Kindergarten _contact with patient _etc _gym facility in Cheonan _gym facility in Sejong _overseas inflow _isolated _released
0 1000000001 2.000000 1964.0 1.000000 2.600789e+09 75.000000 56.0 1 0 0 ... 0 0 0 0 0 0 0 1 0 1
1 1000000002 5.000000 1987.0 1.000000 2.600789e+09 31.000000 33.0 1 0 0 ... 0 0 0 0 0 0 0 1 0 1
2 1000000003 6.000000 1964.0 2.000000 2.002000e+09 17.000000 56.0 1 0 0 ... 0 0 0 1 0 0 0 0 0 1
3 1000000004 7.000000 1991.0 1.000000 2.600789e+09 9.000000 29.0 1 0 0 ... 0 0 0 0 0 0 0 1 0 1
4 1000000005 9.000000 1992.0 2.000000 1.000000e+09 2.000000 28.0 0 0 0 ... 0 0 0 1 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2213 6100000085 4664.816591 1990.0 2.285714 2.600789e+09 24.128954 30.0 1 0 0 ... 0 0 0 1 0 0 0 0 1 0
2214 7000000001 139.000000 1998.0 2.285714 2.600789e+09 87.000000 22.0 1 0 0 ... 0 0 0 0 1 0 0 0 1 0
2215 7000000002 222.000000 1998.0 2.285714 2.600789e+09 84.000000 22.0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
2216 7000000003 4345.000000 1972.0 2.285714 2.600789e+09 21.000000 48.0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
2217 7000000004 5534.000000 1974.0 2.285714 2.600789e+09 74.000000 46.0 1 0 0 ... 0 0 0 0 1 0 0 0 1 0

2218 rows × 187 columns

In [30]:
from sklearn.model_selection import train_test_split

# define X (features) and y (target); this run uses the binary 'disease' column as the target
X=df.drop('disease',axis=1)
y=df.disease

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
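Because there are only 19 positive 'disease' cases, a plain random split can leave the test set with very few positives. A stratified split, sketched below as an optional alternative to the cell above (not part of the original notebook), keeps the class ratio roughly equal in both subsets.

# Optional alternative (sketch): stratified split preserves the rare positive-class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)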

Scale data to prep for model creation

In [31]:
#scale data
from sklearn import preprocessing
import numpy as np
# build scaler based on training data and apply it to test data to then also scale the test data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)
In [32]:
from sklearn.metrics import precision_recall_curve, f1_score, auc
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, log_loss
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

Fit Random Forest Classifier

The fitted model reaches an overall accuracy of about 99% on the test set, indicating it can classify patients in the South Korea dataset. Because the target is heavily imbalanced, however, accuracy alone can overstate performance, so the per-class results and confusion matrices below deserve a close look.

In [33]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=300, random_state = 1,n_jobs=-1)
model_res = clf.fit(X_train_scaled, y_train)
y_pred = model_res.predict(X_test_scaled)
y_pred_prob = model_res.predict_proba(X_test_scaled)
lr_probs = y_pred_prob[:,1]
ac = accuracy_score(y_test, y_pred)

f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test, y_pred)

print('Random Forest: Accuracy=%.3f' % (ac))

print('Random Forest: f1-score=%.3f' % (f1))
Random Forest: Accuracy=0.989
Random Forest: f1-score=0.985
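Because only 19 of the 2,218 patients have disease == 1, accuracy is dominated by the majority class. The short sketch below, which is an addition rather than part of the original notebook, uses the classification_report imported earlier to show per-class precision, recall, and F1.

# Per-class precision / recall / F1 for the imbalanced 'disease' target
print(classification_report(y_test, y_pred, digits=3))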

Create Confusion Matrix Plots

Confusion matrices are a great way to review model performance in a multi-class classification problem. Identifying which class the misclassified observations end up in helps you decide whether you need to build additional features to improve the overall model. In the example below we plot a regular counts confusion matrix as well as a row-normalized (percentage) confusion matrix. The percentage confusion matrix is particularly helpful when you have unbalanced class sizes.

In [34]:
class_names = ['0 (no disease)', '1 (disease)']  # labels for the binary 'disease' target fitted above
In [35]:
import itertools
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
#plt.savefig('figures/RF_cm_multi_class.png')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
#plt.savefig('figures/RF_cm_proportion_multi_class.png', bbox_inches="tight")
plt.show()
Confusion matrix, without normalization
[[439   1]
 [  4   0]]
Normalized confusion matrix
[[1. 0.]
 [1. 0.]]
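For reference, recent scikit-learn releases (1.0 and later) include ConfusionMatrixDisplay, which produces a comparable plot without a custom helper. This is a sketch added here for convenience, not part of the original case study.

from sklearn.metrics import ConfusionMatrixDisplay

# Row-normalized confusion matrix for the binary 'disease' predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, normalize='true', cmap=plt.cm.Blues)
plt.show()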

Plot feature importances

The random forest algorithm can be used as a regression or classification model. In either case it tends to be a bit of a black box, where understanding what's happening under the hood can be difficult. Plotting the feature importances is one way that you can gain a perspective on which features are driving the model predictions.

In [36]:
feature_importance = clf.feature_importances_
# make importances relative to the maximum importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
# indices of the 30 most important features, in ascending order for plotting
sorted_idx = np.argsort(feature_importance)[-30:]

pos = np.arange(sorted_idx.shape[0]) + .5
print(pos.size)
sorted_idx.size
plt.figure(figsize=(10,10))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
30
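Impurity-based importances can be inflated for high-cardinality or strongly correlated features, and most columns here are correlated dummy variables. As a complementary check (a sketch, not part of the original notebook; the parameter values are illustrative), permutation importance measures how much test-set accuracy drops when each feature is shuffled.

from sklearn.inspection import permutation_importance

# Mean drop in test accuracy when each feature is shuffled
result = permutation_importance(clf, X_test_scaled, y_test,
                                n_repeats=5, random_state=1, n_jobs=-1)
top = result.importances_mean.argsort()[-10:][::-1]
for i in top:
    print(f"{X.columns[i]:<35s} {result.importances_mean[i]:.4f}")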

The popularity of random forest is primarily due to how well it performs in a multitude of data situations. It tends to handle highly correlated features well, whereas a linear regression model would not. This case study demonstrates that ability with a fairly small set of underlying features, many of which are highly correlated with one another. Random forest is also an efficient way to investigate the importance of a set of features in a large dataset. Consider random forest one of your first choices when building a tree-based model, especially for multiclass classification.