Springboard Decision Tree Specialty Coffee Case Study - Tier 3

The Scenario

Imagine you've just finished the Springboard Data Science Career Track course and have been hired as a data scientist by a rising, popular specialty coffee company - RR Diner Coffee. Congratulations!

RR Diner Coffee sells two types of product:

  • specialty coffee beans, in bulk (by the kilogram only)
  • coffee equipment and merchandise (grinders, brewing equipment, mugs, books, t-shirts).

RR Diner Coffee has three stores, two in Europe and one in the USA. The flagship store is in the USA, and everything is quality assessed there before being shipped out. Customers further away from the USA flagship store have higher shipping charges.

You've been taken on at RR Diner Coffee because the company is turning towards using data science and machine learning to systematically make decisions about which coffee farmers it should strike deals with.

RR Diner Coffee typically buys coffee from farmers, processes it on site, brings it back to the USA, roasts it, packages it, markets it, and ships it (only in bulk, and after quality assurance) to customers internationally. These customers all own coffee shops in major cities like New York, Paris, London, Hong Kong, Tokyo, and Berlin.

Now, RR Diner Coffee has a decision to make about whether to strike a deal with a legendary coffee farm (known as the Hidden Farm) in rural China: there are rumours that its coffee tastes of lychee and dark chocolate, while also being as sweet as apple juice.

It's a risky decision, as the deal will be expensive, and the coffee might not be bought by customers. The stakes are high: times are tough, stocks are low, farmers are reverting to old deals with the larger enterprises, and the publicity of selling Hidden Farm coffee could save the RR Diner Coffee business.

Your first job, then, is to build a decision tree to predict how many of RR Diner Coffee's most loyal customers would buy the Hidden Farm Chinese coffee.

To this end, you and your team have conducted a survey of 710 of the most loyal RR Diner Coffee customers, collecting data on the customers':

  • age
  • gender
  • salary
  • whether they have bought at least one RR Diner Coffee product online
  • their distance from the flagship store in the USA (standardized to a number between 0 and 11)
  • how much they spent on RR Diner Coffee products on the week of the survey
  • how much they spent on RR Diner Coffee products in the month preceding the survey
  • the number of RR Diner coffee bean shipments each customer has ordered over the preceding year.

You also asked each customer participating in the survey whether they would buy the Hidden Farm coffee, and some (but not all) of the customers gave responses to that question.

You sit back and think: if more than 70% of the interviewed customers are likely to buy the Hidden Farm coffee, you will strike the deal with the local Hidden Farm farmers and sell the coffee. Otherwise, you won't strike the deal and the Hidden Farm coffee will remain in legends only. There's some doubt in your mind about whether 70% is a reasonable threshold, but it'll do for the moment.

To solve the problem, then, you will build a decision tree to implement a classification solution.


As ever, this notebook is tiered, meaning you can select the tier that is right for your confidence and skill level. There are 3 tiers, with tier 1 being the easiest and tier 3 being the hardest. This is tier 3, so it will be challenging.

1. Sourcing and loading

  • Import packages
  • Load data
  • Explore the data

2. Cleaning, transforming and visualizing

  • Cleaning the data
  • Train/test split

3. Modelling

  • Model 1: Entropy model - no max_depth
  • Model 2: Gini impurity model - no max_depth
  • Model 3: Entropy model - max depth 3
  • Model 4: Gini impurity model - max depth 3

4. Evaluating and concluding

  • How many customers will buy Hidden Farm coffee?
  • Decision

5. Random Forest

  • Import necessary modules
  • Model
  • Revise conclusion

0. Overview

This notebook uses decision trees to determine whether the factors of salary, gender, age, how much money the customer spent last week and during the preceding month on RR Diner Coffee products, how many kilogram coffee bags the customer bought over the last year, whether they have bought at least one RR Diner Coffee product online, and their distance from the flagship store in the USA, could predict whether customers would purchase the Hidden Farm coffee if a deal with its farmers were struck.

1. Sourcing and loading

1a. Import Packages

In [1]:
import pandas as pd
import numpy as np
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from io import StringIO  
from IPython.display import Image  
import pydotplus

1b. Load data

In [2]:
# Read in the data to a variable called coffeeData
coffeeData = pd.read_csv('data/RRDinerCoffeeData.csv')

1c. Explore the data

As we've seen, exploration entails doing things like checking out the initial appearance of the data with head(), the dimensions of our data with .shape, the data types of the variables with .info(), the number of non-null values, how much memory is being used to store the data, and finally the major summary statistics capturing central tendency, dispersion and the null-excluding shape of the dataset's distribution.

How much of this can you do yourself by this point in the course? Have a real go.

In [3]:
# Call head() on your data 
coffeeData.head()
Out[3]:
Age Gender num_coffeeBags_per_year spent_week spent_month SlrAY Distance Online Decision
0 36 Female 0 24 73 42789 0.003168 0 1.0
1 24 Male 0 44 164 74035 0.520906 0 NaN
2 24 Male 0 39 119 30563 0.916005 1 1.0
3 20 Male 0 30 107 13166 0.932098 1 NaN
4 24 Female 0 20 36 14244 0.965881 0 1.0
In [4]:
# Call .shape on your data
coffeeData.shape
Out[4]:
(702, 9)
In [5]:
# Call info() on your data
coffeeData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      702 non-null    int64  
 1   Gender                   702 non-null    object 
 2   num_coffeeBags_per_year  702 non-null    int64  
 3   spent_week               702 non-null    int64  
 4   spent_month              702 non-null    int64  
 5   SlrAY                    702 non-null    int64  
 6   Distance                 702 non-null    float64
 7   Online                   702 non-null    int64  
 8   Decision                 474 non-null    float64
dtypes: float64(2), int64(6), object(1)
memory usage: 49.5+ KB
In [6]:
# Call describe() on your data to get the relevant summary statistics for your data 
coffeeData.describe(include='all').T
Out[6]:
count unique top freq mean std min 25% 50% 75% max
Age 702.0 NaN NaN NaN 34.24359 13.927945 16.0 23.0 28.0 46.0 90.0
Gender 702 9 Male 355 NaN NaN NaN NaN NaN NaN NaN
num_coffeeBags_per_year 702.0 NaN NaN NaN 2.710826 1.593629 0.0 1.0 3.0 4.0 5.0
spent_week 702.0 NaN NaN NaN 32.853276 15.731878 0.0 24.25 36.0 43.0 62.0
spent_month 702.0 NaN NaN NaN 107.923077 55.348485 0.0 62.0 113.5 150.75 210.0
SlrAY 702.0 NaN NaN NaN 43819.843305 26192.626943 1617.0 22812.25 41975.0 60223.0 182058.0
Distance 702.0 NaN NaN NaN 4.559186 3.116275 0.003168 1.877812 4.196167 6.712022 10.986203
Online 702.0 NaN NaN NaN 0.531339 0.499373 0.0 0.0 1.0 1.0 1.0
Decision 474.0 NaN NaN NaN 0.639241 0.480728 0.0 0.0 1.0 1.0 1.0

2. Cleaning, transforming and visualizing

2a. Cleaning the data

Some datasets don't require any cleaning, but almost all do. This one does. We need to replace '1.0' and '0.0' in the 'Decision' column with 'YES' and 'NO' respectively, clean up the values of the 'Gender' column, and change the column names to words which maximize meaning and clarity.

First, let's rename spent_week, spent_month, and SlrAY to spent_last_week, spent_last_month, and salary, respectively.

In [7]:
# Check out the names of our data's columns 
coffeeData.columns
Out[7]:
Index(['Age', 'Gender', 'num_coffeeBags_per_year', 'spent_week', 'spent_month',
       'SlrAY', 'Distance', 'Online', 'Decision'],
      dtype='object')
In [8]:
# Make the relevant name change: rename spent_week to spent_per_week
coffeeData.rename(columns={'spent_week': 'spent_per_week'},inplace=True)
In [9]:
# Check out the column names
coffeeData.columns
Out[9]:
Index(['Age', 'Gender', 'num_coffeeBags_per_year', 'spent_per_week',
       'spent_month', 'SlrAY', 'Distance', 'Online', 'Decision'],
      dtype='object')
In [10]:
# Let's have a closer look at the gender column. Its values need cleaning.
coffeeData['Gender'].value_counts()
Out[10]:
Male      355
Female    340
MALE        1
female      1
M           1
male        1
F           1
f           1
FEMALE      1
Name: Gender, dtype: int64
In [11]:
# See the gender column's unique values 
coffeeData['Gender'].unique()
Out[11]:
array(['Female', 'Male', 'female', 'F', 'f ', 'FEMALE', 'MALE', 'male',
       'M'], dtype=object)

We can see a bunch of inconsistency here.

Use replace() to make the values of the gender column just Female and Male.

In [12]:
# Strip whitespace, then replace all alternate values for Male and Female with 'Male' and 'Female'
coffeeData['Gender'] = coffeeData['Gender'].str.strip()
coffeeData['Gender'] = coffeeData['Gender'].replace(['male','M','MALE'],'Male')
coffeeData['Gender'] = coffeeData['Gender'].replace(['female','f','F','FEMALE'],'Female')
coffeeData['Gender'].value_counts()
Out[12]:
Male      358
Female    344
Name: Gender, dtype: int64
In [13]:
# Check out the unique values for the 'gender' column
coffeeData['Gender'].unique()
Out[13]:
array(['Female', 'Male'], dtype=object)
In [14]:
# Replace all alternate values with "Male"
# done above
In [15]:
# Let's check the unique values of the column "gender"
# done above
In [16]:
# Check out the unique values of the column 'Decision'
coffeeData['Decision'].unique()
Out[16]:
array([ 1., nan,  0.])

We now want to replace 1.0 and 0.0 in the Decision column by YES and NO respectively.

In [17]:
# Replace 1.0 and 0.0 by 'Yes' and 'No'
coffeeData['Decision'] = coffeeData['Decision'].replace(1.0,'YES')
coffeeData['Decision'] = coffeeData['Decision'].replace(0,'NO')
In [18]:
# Check that our replacing those values with 'YES' and 'NO' worked, with unique()
coffeeData['Decision'].unique()
Out[18]:
array(['YES', nan, 'NO'], dtype=object)
In [19]:
coffeeData['Decision'].isna().sum()
Out[19]:
228

2b. Train/test split

To execute the train/test split properly, we need to do five things:

  1. Drop all rows with a null value in the Decision column, and save the result as NOPrediction: a dataset that will contain all known values for the decision
  2. Visualize the data using scatter and boxplots, with several variables on the y-axis and the decision on the x-axis
  3. Get the subset of coffeeData with null values in the Decision column, and save that subset as Prediction
  4. Divide the NOPrediction subset into X and y, and then further divide those subsets into train and test subsets for X and y respectively
  5. Create dummy variables to deal with categorical inputs

1. Drop all null values within the Decision column, and save the result as NOPrediction

In [20]:
# NOPrediction will contain all known values for the decision
# Call dropna() on coffeeData, and store the result in a variable NOPrediction 
# Call describe() on the Decision column of NOPrediction after calling dropna() on coffeeData
NOPrediction = coffeeData.dropna(subset=['Decision'])
NOPrediction['Decision'].dropna().describe(include='all')
Out[20]:
count     474
unique      2
top       YES
freq      303
Name: Decision, dtype: object

2. Visualize the data using scatter and boxplots, with several variables on the y-axis and the decision on the x-axis

In [21]:
# Exploring our new NOPrediction dataset
# Make a boxplot on NOPrediction where the x axis is Decision, and the y axis is spent_per_week
import cufflinks as cf
cf.go_offline()
In [22]:
NOPrediction.columns
Out[22]:
Index(['Age', 'Gender', 'num_coffeeBags_per_year', 'spent_per_week',
       'spent_month', 'SlrAY', 'Distance', 'Online', 'Decision'],
      dtype='object')
In [23]:
NOPrediction.head()
Out[23]:
Age Gender num_coffeeBags_per_year spent_per_week spent_month SlrAY Distance Online Decision
0 36 Female 0 24 73 42789 0.003168 0 YES
2 24 Male 0 39 119 30563 0.916005 1 YES
4 24 Female 0 20 36 14244 0.965881 0 YES
5 20 Female 0 23 28 14293 1.036346 1 YES
6 34 Female 0 55 202 91035 1.134851 0 YES
In [24]:
# NOPrediction.groupby('Decision').sum()['spent_per_week']
NOPrediction.pivot(columns='Decision', values='spent_per_week').iplot(kind='box')

Can you admissibly conclude anything from this boxplot? Write your answer here:

In [25]:
# Make a scatterplot on NOPrediction, where x is Distance, y is spent_month and hue is Decision 
NOPrediction.iplot(kind='scatter',x='Distance',y='spent_month',mode='markers',
                   size=10, 
                   categories="Decision",
                   xTitle='Distance',
                   yTitle='spent_month',) 

Can you admissibly conclude anything from this scatterplot? Remember: we are trying to build a tree to classify unseen examples. Write your answer here:

Two things can be observed.

Firstly, the greater the distance, the higher the proportion of 'YES' decisions compared to 'NO' decisions. Secondly, the shorter the distance, the more customers spent in the preceding month.

Both observations are fairly intuitive and make sense.

3. Get the subset of coffeeData with null values in the Decision column, and save that subset as Prediction

In [26]:
# Get just those rows whose value for the Decision column is null  
Prediction = coffeeData[coffeeData['Decision'].isnull()]
Prediction.head()
Out[26]:
Age Gender num_coffeeBags_per_year spent_per_week spent_month SlrAY Distance Online Decision
1 24 Male 0 44 164 74035 0.520906 0 NaN
3 20 Male 0 30 107 13166 0.932098 1 NaN
7 24 Female 0 20 34 17425 1.193188 0 NaN
11 24 Female 0 40 153 84803 1.655096 1 NaN
12 21 Female 0 38 122 42338 1.714179 1 NaN
In [27]:
# Call describe() on Prediction
Prediction.describe().T
Out[27]:
count mean std min 25% 50% 75% max
Age 228.0 31.802632 14.302293 16.000000 22.000000 25.000000 39.000000 67.000000
num_coffeeBags_per_year 228.0 2.960526 1.585514 0.000000 2.000000 3.000000 4.000000 5.000000
spent_per_week 228.0 33.394737 15.697930 0.000000 25.750000 37.000000 44.000000 62.000000
spent_month 228.0 110.407895 53.786536 0.000000 65.000000 113.500000 151.250000 210.000000
SlrAY 228.0 41923.741228 27406.768360 1617.000000 15911.500000 40987.500000 58537.000000 182058.000000
Distance 228.0 3.428836 2.153102 0.010048 1.699408 3.208673 5.261184 10.871566
Online 228.0 0.570175 0.496140 0.000000 0.000000 1.000000 1.000000 1.000000

4. Divide the NOPrediction subset into X and y

In [28]:
# Check the names of the columns of NOPrediction
NOPrediction.columns
Out[28]:
Index(['Age', 'Gender', 'num_coffeeBags_per_year', 'spent_per_week',
       'spent_month', 'SlrAY', 'Distance', 'Online', 'Decision'],
      dtype='object')
In [29]:
# Let's do our feature selection.
# Make a variable called 'features', and a list containing the strings of every column except "Decision"
features = ['Age', 'Gender', 'num_coffeeBags_per_year', 'spent_per_week',
       'spent_month', 'SlrAY', 'Distance', 'Online']
# Make an explanatory variable called X, and assign it: NoPrediction[features]
X = NOPrediction[features]

# Make a dependent variable called y, and assign it: NoPrediction.Decision
y = NOPrediction.Decision

5. Create dummy variables to deal with categorical inputs

One-hot encoding replaces each unique value of a given column with a new column, and puts a 1 in the new column for a given row only if its initial value for the original column matches the new column. Check out this resource if you haven't seen one-hot encoding before.

Note: We do this before the train/test split because doing it afterwards could mean that some categories end up only in the train or only in the test split of our data by chance. This would lead to different shapes for X_train and X_test, which could cause downstream issues when fitting or predicting with a trained model.
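
If the encoding is ever applied to two subsets separately (as happens with the Prediction subset later in this notebook), a useful safeguard is to align the dummy columns of the second subset to those of the first. This is only a hedged sketch of that pattern, using the Prediction and features variables defined above; the notebook's own cells handle the encoding below.

X_encoded = pd.get_dummies(X, drop_first=True, prefix='dummy')
new_encoded = pd.get_dummies(Prediction[features], drop_first=True, prefix='dummy')
# Reindex so that any dummy column missing from the second subset is added and filled with 0,
# guaranteeing both frames have identical columns in identical order
new_encoded = new_encoded.reindex(columns=X_encoded.columns, fill_value=0)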

In [30]:
# One-hot encode all features in X.
X = pd.get_dummies(X,drop_first=True,prefix='dummy')
X
Out[30]:
Age num_coffeeBags_per_year spent_per_week spent_month SlrAY Distance Online dummy_Male
0 36 0 24 73 42789 0.003168 0 0
2 24 0 39 119 30563 0.916005 1 1
4 24 0 20 36 14244 0.965881 0 0
5 20 0 23 28 14293 1.036346 1 0
6 34 0 55 202 91035 1.134851 0 0
... ... ... ... ... ... ... ... ...
696 29 5 20 74 29799 10.455068 0 0
697 45 5 61 201 80260 10.476341 0 0
698 54 5 44 116 44077 10.693889 1 1
699 63 5 33 117 43081 10.755194 1 1
701 90 5 39 170 15098 10.891566 0 1

474 rows × 8 columns

6. Further divide those subsets into train and test subsets for X and y respectively: X_train, X_test, y_train, y_test

In [31]:
# Call train_test_split on X, y. Make the test_size = 0.25, and random_state = 246
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=246)

3. Modelling

It's useful to look at the scikit-learn documentation on decision trees https://scikit-learn.org/stable/modules/tree.html before launching into applying them. If you haven't seen them before, take a look at that link, in particular the section 1.10.5.

Model 1: Entropy model - no max_depth

We'll give you a little more guidance here, as the Python is hard to deduce, and scikit-learn takes some getting used to.

Theoretically, let's remind ourselves of what's going on with a decision tree implementing an entropy model.

Ross Quinlan's ID3 Algorithm was one of the first, and one of the most basic, to use entropy as a metric.

Entropy is a measure of how uncertain we are about which category the data-points fall into at a given point in the tree. The information gain of a specific feature with a threshold (such as 'spent_last_month <= 138.0') is the difference in entropy that exists before and after splitting on that feature; i.e., the information we gain about the categories of the data-points by splitting on that feature and that threshold.

Naturally, we want to minimize entropy and maximize information gain. Quinlan's ID3 algorithm is designed to output a tree such that the features at each node, starting from the root and going all the way down to the leaves, have maximal information gain. We want a tree whose leaves have elements that are homogeneous, that is, all of the same category.
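
To make the arithmetic concrete, here is a minimal sketch of entropy and information gain for a binary split. The split on spent_month at a threshold of 138 is only an illustration (it is not taken from the fitted tree), and the sketch assumes X_train and y_train from section 2b.

import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, mask):
    # Entropy of y minus the weighted entropy of the two subsets defined by the boolean mask
    left, right = y[mask], y[~mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted

# Illustrative split: spent_month <= 138
gain = information_gain(y_train.to_numpy(), (X_train['spent_month'] <= 138).to_numpy())
print(gain)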

The first model will be the hardest. Persevere and you'll reap the rewards: you can use almost exactly the same code for the other models.

In [32]:
# Declare a variable called entr_model and use tree.DecisionTreeClassifier. 
entr_model = tree.DecisionTreeClassifier(criterion = 'entropy',random_state = 1234)

# Call fit() on entr_model
entr_model.fit(X_train,y_train)

# Call predict() on entr_model with X_test passed to it, and assign the result to a variable y_pred 
y_pred = entr_model.predict(X_test)

# Call Series on our y_pred variable with the following: pd.Series(y_pred)
y_pred = pd.Series(y_pred)

# Check out entr_model
entr_model
Out[32]:
DecisionTreeClassifier(criterion='entropy', random_state=1234)
In [33]:
entr_model.classes_
Out[33]:
array(['NO', 'YES'], dtype=object)
In [34]:
# Now we want to visualize the tree
dot_data = StringIO()

# We can do so with export_graphviz
tree.export_graphviz(entr_model, out_file=dot_data,  
                filled=True, rounded=True,special_characters=True, 
                feature_names=X_train.columns,
                class_names = entr_model.classes_
                )


# Alternatively for class_names use entr_model.classes_
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[34]:

Model 1: Entropy model - no max_depth: Interpretation and evaluation

In [35]:
# Run this block for model evaluation metrics 
print("Model Entropy - no max depth")
# print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
# print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
# print('Precision score for "Yes"' , metrics.precision_score(y_test,y_pred, pos_label = "YES"))
# print('Precision score for "No"' , metrics.precision_score(y_test,y_pred, pos_label = "NO"))
# print('Recall score for "Yes"' , metrics.recall_score(y_test,y_pred, pos_label = "YES"))
# print('Recall score for "No"' , metrics.recall_score(y_test,y_pred, pos_label = "NO"))

from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred))


from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=confusion_matrix(y_test,y_pred), figsize=(3, 3))
plt.show()
Model Entropy - no max depth
              precision    recall  f1-score   support

          NO       1.00      0.98      0.99        41
         YES       0.99      1.00      0.99        78

    accuracy                           0.99       119
   macro avg       0.99      0.99      0.99       119
weighted avg       0.99      0.99      0.99       119

What can you infer from these results? Write your conclusions here:

It's almost perfect on the test set. The tree structure looks complex, probably because no depth limit was imposed.

Model 2: Gini impurity model - no max_depth

Gini impurity, like entropy, is a measure of how well a given feature (and threshold) splits the data into categories.

Their equations are similar, but Gini impurity doesn't require logarithmic functions, which can be computationally expensive.
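
For comparison with the entropy sketch above, here is a minimal (hedged) sketch of the Gini impurity of a node, again assuming y_train from section 2b.

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

# Impurity of a node containing the entire training set
print(gini_impurity(y_train.to_numpy()))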

In [36]:
# Make a variable called gini_model, and assign it exactly what you assigned entr_model with above, but with the
# criterion changed to 'gini'
gini_model  = tree.DecisionTreeClassifier(criterion = 'gini',random_state = 1234)

# Call fit() on the gini_model as you did with the entr_model
gini_model.fit(X_train,y_train)

# Call predict() on the gini_model as you did with the entr_model 
y_pred = gini_model.predict(X_test)


# Turn y_pred into a series, as before
y_pred = pd.Series(y_pred)

# Check out gini_model
gini_model
Out[36]:
DecisionTreeClassifier(random_state=1234)
In [37]:
# As before, but make the model name gini_model
dot_data = StringIO()

tree.export_graphviz(gini_model, out_file=dot_data,  
                filled=True, rounded=True,special_characters=True, 
                feature_names=X_train.columns,
                class_names = gini_model.classes_
                )

# Alternatively for class_names use gini_model.classes_
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[37]:
In [175]:
# Run this block for model evaluation
print("Model Gini impurity model")
# print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
# print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
# print('Precision score' , metrics.precision_score(y_test,y_pred, pos_label = "YES"))
# print('Recall score' , metrics.recall_score(y_test,y_pred, pos_label = "NO"))

from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred))

from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=confusion_matrix(y_test,y_pred), figsize=(3, 3))
plt.show()
Model Gini impurity model
              precision    recall  f1-score   support

          NO       0.97      0.83      0.89        41
         YES       0.92      0.99      0.95        78

    accuracy                           0.93       119
   macro avg       0.94      0.91      0.92       119
weighted avg       0.94      0.93      0.93       119

How do the results here compare to the previous model? Write your judgements here:

It's less accurate than the entropy model with no max depth, but it still looks like a good result for this example. The tree structure looks unbalanced (the left side looks heavier), probably due to the Gini impurity criterion.

Model 3: Entropy model - max depth 3

We're going to try to limit the depth of our decision tree, using entropy first.

As you know, we need to strike a balance with tree depth.

Insufficiently deep, and we're not giving the tree the opportunity to spot the right patterns in the training data.

Excessively deep, and we're probably going to make a tree that overfits to the training data, at the cost of very high error on the (hitherto unseen) test data.

Sophisticated data scientists use methods like random search with cross-validation to systematically find a good depth for their tree. We'll start with picking 3, and see how that goes.
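
As a hedged illustration of that idea (a sketch only, not part of the graded flow), a cross-validated search over max_depth might look like the following, assuming X_train and y_train from section 2b. GridSearchCV is used here for simplicity; RandomizedSearchCV works the same way for larger grids.

from sklearn.model_selection import GridSearchCV
from sklearn import tree

param_grid = {'criterion': ['entropy', 'gini'], 'max_depth': list(range(1, 11))}
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=1234),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)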

In [80]:
# Made a model as before, but call it entr_model2, and make the max_depth parameter equal to 3. 
# Execute the fitting, predicting, and Series operations as before
entr_model2 = tree.DecisionTreeClassifier(criterion="entropy", max_depth = 3, random_state = 1234)
entr_model2.fit(X_train,y_train)

y_pred = entr_model2.predict(X_test)
y_pred = pd.Series(y_pred)
entr_model2
Out[80]:
DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=1234)
In [78]:
# As before, we need to visualize the tree to grasp its nature
dot_data = StringIO()

tree.export_graphviz(entr_model2, out_file=dot_data,  
                filled=True, rounded=True,special_characters=True, 
                feature_names=X_train.columns,
                class_names = entr_model2.classes_
                )

# Alternatively for class_names use entr_model2.classes_
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[78]:
In [79]:
# Run this block for model evaluation 
print("Model Entropy model max depth 3")
# print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
# print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
# print('Precision score for "Yes"' , metrics.precision_score(y_test,y_pred, pos_label = "YES"))
# print('Recall score for "No"' , metrics.recall_score(y_test,y_pred, pos_label = "NO"))


from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred))

from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=confusion_matrix(y_test,y_pred), figsize=(3, 3))
plt.show()
Model Entropy model max depth 3
              precision    recall  f1-score   support

          NO       1.00      0.73      0.85        41
         YES       0.88      1.00      0.93        78

    accuracy                           0.91       119
   macro avg       0.94      0.87      0.89       119
weighted avg       0.92      0.91      0.90       119

So our accuracy decreased, but is this certainly an inferior tree to the unrestricted-depth tree we built in Model 1? Write your conclusions here:

The result is worse than the first model without a max depth, but the tree structure is very simple, so the computation must have been far more efficient than the one without a max depth.

Model 4: Gini impurity model - max depth 3

We're now going to try the same with the Gini impurity model.

In [92]:
# As before, make a variable, but call it gini_model2, and ensure the max_depth parameter is set to 3
gini_model2=tree.DecisionTreeClassifier(criterion='gini', random_state = 1234, max_depth = 3 )

# Do the fit, predict, and series transformations as before. 
gini_model2.fit(X_train,y_train)
y_pred=gini_model2.predict(X_test)
y_pred = pd.Series(y_pred)
gini_model2
Out[92]:
DecisionTreeClassifier(max_depth=3, random_state=1234)
In [93]:
dot_data = StringIO()

tree.export_graphviz(gini_model2, out_file=dot_data,  
                filled=True, rounded=True,special_characters=True, 
                feature_names=X_train.columns,
                class_names = gini_model2.classes_
                )

# Alternatively for class_names use gini_model2.classes_
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[93]:
In [94]:
print("Gini impurity  model - max depth 3")
# print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
# print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
# print('Precision score' , metrics.precision_score(y_test,y_pred, pos_label = "YES"))
# print('Recall score' , metrics.recall_score(y_test,y_pred, pos_label = "NO"))

from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred))

from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=confusion_matrix(y_test,y_pred), figsize=(3, 3))
plt.show()
Gini impurity  model - max depth 3
              precision    recall  f1-score   support

          NO       0.97      0.95      0.96        41
         YES       0.97      0.99      0.98        78

    accuracy                           0.97       119
   macro avg       0.97      0.97      0.97       119
weighted avg       0.97      0.97      0.97       119

Now this is an elegant tree. Its accuracy might not be the highest, but it's still the best model we've produced so far. Why is that? Write your answer here:

The result is slightly worse than the full-depth models. But although the tree structure is just as simple as that of the depth-3 entropy model, the drop in performance is much smaller. Furthermore, the computation must also have been far more efficient than for the model without a max depth.

4. Evaluating and concluding

4a. How many customers will buy Hidden Farm coffee?

Let's first ascertain how many loyal customers claimed, in the survey, that they will purchase the Hidden Farm coffee.

In [95]:
# Call value_counts() on the 'Decision' column of the original coffeeData
coffeeData['Decision'].value_counts()
Out[95]:
YES    303
NO     171
Name: Decision, dtype: int64

Let's now determine the number of people that, according to the model, will be willing to buy the Hidden Farm coffee.

  1. First we subset the Prediction dataset into new_X considering all the variables except Decision
  2. Use that dataset to predict a new variable called potential_buyers
In [96]:
coffeeData.columns.isin(features)
Out[96]:
array([ True,  True,  True,  True,  True,  True,  True,  True, False])
In [97]:
# Feature selection
# Make a variable called feature_cols, and assign it a list containing all the column names except 'Decision'
feature_cols = features
# Make a variable called new_X, and assign it the subset of Prediction, containing just the feature_cols 
new_X = Prediction[features]
In [98]:
# Call get_dummies() on the Pandas object pd, with new_X plugged in, to one-hot encode all features in the training set
new_X = pd.get_dummies(new_X,drop_first=True,prefix='dummy')

# Make a variable called potential_buyers, and assign it the result of calling predict() on a model of your choice; 
# don't forget to pass new_X to predict()
potential_buyers =  gini_model2.predict(new_X)
In [99]:
# Let's get the numbers of YES's and NO's in the potential buyers 
# Call unique() on np, and pass potential_buyers and return_counts=True 
np.unique(potential_buyers,return_counts=True)
Out[99]:
(array(['NO', 'YES'], dtype=object), array([ 45, 183]))

The total number of potential buyers is 303 + 183 = 486

In [100]:
# Print the total number of surveyed people 
total = len(y)+len(potential_buyers)
In [101]:
# Let's calculate the proportion of buyers
(sum(y=='YES')+sum(potential_buyers=='YES'))/total
Out[101]:
0.6923076923076923
In [102]:
# Print the percentage of people who want to buy the Hidden Farm coffee, by our model 
sum(potential_buyers=='YES'),len(potential_buyers),str(sum(potential_buyers=='YES')/len(potential_buyers)*100)+'%'
Out[102]:
(183, 228, '80.26315789473685%')

4b. Decision

Remember how you thought at the start: if more than 70% of the interviewed customers are likely to buy the Hidden Farm coffee, you will strike the deal with the local Hidden Farm farmers and sell the coffee. Otherwise, you won't strike the deal and the Hidden Farm coffee will remain in legends only. Well now's crunch time. Are you going to go ahead with that idea? If so, you won't be striking the deal with the Chinese farmers.
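
As a small sketch of that decision rule (assuming total, y and potential_buyers from section 4a), the threshold check itself is one line:

# Apply the 70% threshold set out at the start of the case study
threshold = 0.70
proportion = (sum(y == 'YES') + sum(potential_buyers == 'YES')) / total
decision = 'strike the deal' if proportion > threshold else 'do not strike the deal'
print(f"{proportion:.1%} of loyal customers are predicted to buy: {decision}")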

They're called decision trees, aren't they? So where's the decision? What should you do? (Cue existential cat emoji).

Ultimately, though, we can't write an algorithm to actually make the business decision for us. This is because such decisions depend on our values, what risks we are willing to take, the stakes of our decisions, and how important it is for us to know that we will succeed. What are you going to do with the models you've made? Are you going to risk everything, strike the deal with the Hidden Farm farmers, and sell the coffee?

The philosopher of language Jason Stanley once wrote that the number of doubts our evidence has to rule out in order for us to know a given proposition depends on our stakes: the higher our stakes, the more doubts our evidence has to rule out, and therefore the harder it is for us to know things. We can end up paralyzed in predicaments; sometimes, we can act to better our situation only if we already know certain things, which we can know only if our stakes were lower and we'd already bettered our situation.

Data science and machine learning can't solve such problems. But what they can do is help us make great use of our data to inform our decisions.

5. Random Forest

You might have noticed an important fact about decision trees. Each time we run a given decision tree algorithm to make a prediction (such as whether customers will buy the Hidden Farm coffee) we will actually get a slightly different result. This might seem weird, but it has a simple explanation: machine learning algorithms are by definition stochastic, in that their output is at least partly determined by randomness.

To account for this variability and ensure that we get the most accurate prediction, we might want to actually make lots of decision trees, and get a value that captures the centre or average of the outputs of those trees. Luckily, there's a method for this, known as the Random Forest.

Essentially, Random Forest involves making lots of trees with similar properties, and then performing summary statistics on the outputs of those trees to reach that central value. Random forests are hugely powerful classifiers, and they can improve predictive accuracy and control over-fitting.
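
As a hedged sketch of that aggregation step (assuming X_train, y_train and X_test from section 2b; note that scikit-learn's RandomForestClassifier averages the per-tree class probabilities rather than taking a strict majority vote):

from sklearn.ensemble import RandomForestClassifier
import numpy as np

demo_forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1234)
demo_forest.fit(X_train, y_train)

# Average probability of 'YES' across all trees in the forest, then threshold it
yes_index = list(demo_forest.classes_).index('YES')
avg_proba = demo_forest.predict_proba(X_test)[:, yes_index]
demo_pred = np.where(avg_proba >= 0.5, 'YES', 'NO')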

Why not try to inform your decision with a random forest? You'll need to make use of the RandomForestClassifier class within scikit-learn's sklearn.ensemble module.

5a. Import necessary modules

In [103]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

5b. Model

You'll use your X_train and y_train variables just as before.

You'll then need to make a variable (call it firstRFModel) to store your new Random Forest model. You'll assign this variable the result of calling RandomForestClassifier().

Then, just as before, you'll call fit() on that firstRFModel variable, and plug in X_train and y_train.

Finally, you should make a variable called y_pred, and assign it the result of calling the predict() method on your new firstRFModel, with the X_test data passed to it.

In [107]:
# Plug in appropriate max_depth and random_state parameters 
firstRFModel = RandomForestClassifier(max_depth=3,random_state=True)

# Model and fit
firstRFModel.fit(X_train,y_train)
y_pred = firstRFModel.predict(X_test)
y_pred = pd.Series(y_pred)
firstRFModel
Out[107]:
RandomForestClassifier(max_depth=3, random_state=True)
In [113]:
print("RandomForestClassifier, max_depth=3")
# print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
# print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
# print('Precision score' , metrics.precision_score(y_test,y_pred, pos_label = "YES"))
# print('Recall score' , metrics.recall_score(y_test,y_pred, pos_label = "NO"))

from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred))

from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=confusion_matrix(y_test,y_pred), figsize=(3, 3))
plt.show()
RandomForestClassifier, max_depth=3
              precision    recall  f1-score   support

          NO       0.97      0.83      0.89        41
         YES       0.92      0.99      0.95        78

    accuracy                           0.93       119
   macro avg       0.94      0.91      0.92       119
weighted avg       0.94      0.93      0.93       119

In [144]:
# This may not be the best way to view each estimator, as each plot is quite small
fn=X_train.columns
cn=["NO", "YES"]
fig, axes = plt.subplots(nrows = 1,ncols = 5,figsize = (10,2), dpi=900)
for index in range(0, 5):
    tree.plot_tree(firstRFModel.estimators_[index],
                   feature_names = fn, 
                   class_names=cn,
                   filled = True,
                   ax = axes[index]);

    axes[index].set_title('Estimator: ' + str(index), fontsize = 11)
fig.savefig('rf_5trees.png')

# ref:https://stackoverflow.com/questions/40155128/plot-trees-for-a-random-forest-in-python-with-scikit-learn
In [145]:
# Make a prediction
potential_buyers =  firstRFModel.predict(new_X)

#  Let's get the numbers of YES's and NO's in the potential buyers 
# Call unique() on np, and pass potential_buyers and return_counts=True 
np.unique(potential_buyers,return_counts=True)
Out[145]:
(array(['NO', 'YES'], dtype=object), array([ 42, 186]))
In [146]:
# Print the percentage of people who want to buy the Hidden Farm coffee, by our model 
sum(potential_buyers=='YES'),len(potential_buyers),str(sum(potential_buyers=='YES')/len(potential_buyers)*100)+'%'
Out[146]:
(186, 228, '81.57894736842105%')

5c. Revise conclusion

Has your conclusion changed? Or is the result of executing random forest the same as your best model reached by a single decision tree?

Referring to the tables below: among the models with max_depth == 3, this random forest model's (RFM) predictions are slightly worse than the Gini model's, although better than the entropy model's.

Theoretically speaking, an RFM should perform in a less biased way. One possible reason why we got a worse result with the RFM is that a single decision tree can fit (overfit) the features more closely.
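
One quick way to probe that overfitting hypothesis (a hedged sketch, assuming the fitted models from sections 3 and 5b are still in memory) is to compare train and test accuracy for each model:

from sklearn.metrics import accuracy_score

for name, model in [('entr_model', entr_model), ('gini_model', gini_model),
                    ('entr_model2', entr_model2), ('gini_model2', gini_model2),
                    ('firstRFModel', firstRFModel)]:
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # A large train/test gap suggests overfitting
    print(f"{name}: train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")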

Model Entropy model max depth 3

              precision    recall  f1-score   support

          NO       1.00      0.73      0.85        41
         YES       0.88      1.00      0.93        78

    accuracy                           0.91       119
   macro avg       0.94      0.87      0.89       119
weighted avg       0.92      0.91      0.90       119

Gini impurity model - max depth 3

              precision    recall  f1-score   support

          NO       0.97      0.95      0.96        41
         YES       0.97      0.99      0.98        78

    accuracy                           0.97       119
   macro avg       0.97      0.97      0.97       119
weighted avg       0.97      0.97      0.97       119

RandomForestClassifier, max_depth=3

              precision    recall  f1-score   support

          NO       0.97      0.83      0.89        41
         YES       0.92      0.99      0.95        78

    accuracy                           0.93       119
   macro avg       0.94      0.91      0.92       119
weighted avg       0.94      0.93      0.93       119