Clustering Case Study: Customer Segmentation with K-Means - Tier 3


This case study is based on a blog post from the yhat blog. Feel free to refer to that post for additional information and solutions.

Structure of the mini-project:

  1. Sourcing and loading
    • Load the data
    • Explore the data
  2. Cleaning, transforming and visualizing
    • Data Wrangling: Exercise Set 1
      • Creating a matrix with a binary indicator for whether they responded to a given offer
      • Ensure that in doing so, NaN values are dealt with appropriately
  3. Modelling
    • K-Means clustering: Exercise Sets 2 and 3
      • Choosing K: The Elbow method
      • Choosing K: The Silhouette method
      • Choosing K: The Gap statistic method
    • Visualizing clusters with PCA: Exercise Sets 4 and 5
  4. Conclusions and next steps
    • Conclusions
    • Other clustering algorithms (Exercise Set 6)
In [1]:
%matplotlib inline
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")
In [2]:
# Set the pandas display options
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

1. Sourcing and loading

1a. Load the data

The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent to customers) and transaction level data from customers. The transactional data shows which offer customers responded to, and what the customer ended up buying. The data is presented as an Excel workbook containing two worksheets. Each worksheet contains a different dataset.

In [3]:
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)

1b. Explore the data

In [4]:
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()
Out[4]:
offer_id campaign varietal min_qty discount origin past_peak
0 1 January Malbec 72 56 France False
1 2 January Pinot Noir 72 17 France False
2 3 February Espumante 144 32 Oregon True
3 4 February Champagne 72 48 France True
4 5 February Cabernet Sauvignon 144 44 New Zealand True

We see that the first dataset contains information about each offer such as the month it is in effect and several attributes about the wine that the offer refers to: the variety, minimum quantity, discount, country of origin and whether or not it is past peak. The second dataset in the second worksheet contains transactional data -- which offer each customer responded to.

In [5]:
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()
Out[5]:
customer_name offer_id n
0 Smith 2 1
1 Smith 24 1
2 Johnson 17 1
3 Johnson 24 1
4 Johnson 26 1

2. Cleaning, transforming and visualizing

2a. Data Wrangling

We're trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group like-minded customers together. We can then study those groups to look for patterns and trends which can help us formulate future offers.

The first thing we need is a way to compare customers. To do this, we're going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer.

Checkup Exercise Set I

Exercise: Create a data frame where each row has the following columns (use the pandas `merge` and `pivot_table` functions for this purpose):

  • customer_name
  • One column for each offer, with a 1 if the customer responded to the offer and a 0 otherwise

Make sure you also deal with any weird values such as `NaN`. Read the documentation to develop your solution.

In [6]:
#your turn
df = df_transactions.merge(df_offers,on='offer_id',how='left').drop(['campaign'],axis=1)
df.head()
Out[6]:
customer_name offer_id n varietal min_qty discount origin past_peak
0 Smith 2 1 Pinot Noir 72 17 France False
1 Smith 24 1 Pinot Noir 6 34 Italy False
2 Johnson 17 1 Pinot Noir 12 47 Germany False
3 Johnson 24 1 Pinot Noir 6 34 Italy False
4 Johnson 26 1 Pinot Noir 144 83 Australia False
In [7]:
df.describe()
Out[7]:
offer_id n min_qty discount
count 324.000000 324.0 324.000000 324.000000
mean 17.012346 1.0 58.407407 59.481481
std 9.703332 0.0 49.741444 20.327877
min 1.000000 1.0 6.000000 17.000000
25% 8.000000 1.0 6.000000 45.000000
50% 18.000000 1.0 72.000000 56.000000
75% 26.000000 1.0 72.000000 83.000000
max 32.000000 1.0 144.000000 89.000000
In [8]:
import numpy as np
pivoted = df.pivot_table(
    values='n', 
    index='customer_name', 
    columns='offer_id', 
    fill_value=0,
    aggfunc = np.sum)
pivoted.head()
Out[8]:
offer_id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
customer_name
Adams 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0
Allen 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
Anderson 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
Bailey 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Baker 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0

3. Modelling

3a. K-Means Clustering

Recall that in K-Means Clustering we want to maximize the distance between centroids and minimize the distance between data points and the respective centroid for the cluster they are in. True evaluation for unsupervised learning would require labeled data; however, we can use a variety of intuitive metrics to try to pick the number of clusters K. We will introduce three methods: the Elbow method, the Silhouette method and the Gap statistic.

3ai. Choosing K: The Elbow Sum-of-Squares Method

The first method looks at the sum-of-squares error in each cluster against $K$. We compute the distance from each data point to the center of the cluster (centroid) to which the data point was assigned.

$$SS = \sum_k \sum_{x_i \in C_k} \left( x_i - \mu_k \right)^2$$

where $x_i$ is a point, $C_k$ represents cluster $k$ and $\mu_k$ is the centroid for cluster $k$. We can plot SS vs. $K$ and choose the elbow point in the plot as the best value for $K$. The elbow point is the point at which the plot starts descending much more slowly.

Hint: the Elbow Method is discussed in part 2 of the Harvard Clustering lecture.
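
As a quick sanity check (a sketch, not part of the original exercises, and assuming the `pivoted` offer matrix created in Exercise Set I), scikit-learn's `inertia_` attribute of a fitted `KMeans` model is exactly the $SS$ defined above, i.e. the sum of squared distances from each point to its assigned centroid:

import numpy as np
from sklearn.cluster import KMeans

X_check = pivoted.to_numpy()   # binary customer-by-offer matrix from Exercise Set I
model_check = KMeans(n_clusters=3, random_state=10).fit(X_check)

# SS computed by hand: squared distance of each point to its assigned centroid
ss_manual = sum(
    ((X_check[model_check.labels_ == k] - center) ** 2).sum()
    for k, center in enumerate(model_check.cluster_centers_)
)
print(ss_manual, model_check.inertia_)   # the two values agree up to floating-point error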

Checkup Exercise Set II

Exercise:

  • 1. What values of $SS$ do you believe represent better clusterings? Why?
  • 2. Create a numpy matrix `x_cols` with only the columns representing the offers (i.e. the 0/1 columns)
  • 3. Write code that applies the `KMeans` clustering method from scikit-learn to this matrix.
  • 4. Construct a plot showing $SS$ for each $K$ and pick $K$ using this plot. For simplicity, test $2 \le K \le 10$.
  • 5. Make a bar chart showing the number of points in each cluster for k-means under the best $K$.
  • 6. What challenges did you experience using the Elbow method to pick $K$?
In [9]:
# 2. A numpy matrix (here named X rather than x_cols) with only the columns representing the offers
X = pivoted.to_numpy()
In [10]:
# 3. Write code that applies the [`KMeans`] clustering method 
# from scikit-learn to this matrix.

# 4. Construct a plot showing  𝑆𝑆  for each  𝐾  
# and pick  𝐾  using this plot. 
# For simplicity, test  2≤𝐾≤10 .

from sklearn.cluster import KMeans


ks = range(2, 11)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters = k,random_state=10)
    
    # Fit a new model to samples
    model.fit(X)
    
    # Append the SS(inertia) to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-*')
plt.xlabel('number of clusters, k')
plt.ylabel('Sum of Squares(inertia)')
plt.xticks(ks)
plt.show()

According to the SS plot, no sharp elbow stands out, so the choice is not clear-cut; within the tested range, K=10 gives the lowest SS, so we take K=10 as the best K here.

In [11]:
# 5. Make a bar chart showing the number of points in each cluster 
# for k-means under the best  𝐾 .

# Set the best K and fit a model
best_K = 10
model = KMeans(n_clusters = best_K,random_state=42)
model.fit(X)

counts = np.bincount(model.labels_)

# Plot a bar chart of the number of points in each cluster
plt.bar(range(best_K), counts); 
plt.xlabel("Cluster ID")
plt.ylabel("Count")
plt.xticks(range(best_K))
plt.show()

6. What challenges did you experience using the Elbow method to pick 𝐾 ?

Choosing the best K manually is difficult: it is hard to define a reasonable quantitative criterion for judging whether a given K is good, so in the end the choice rests on human judgment, which varies from person to person and from one attempt to the next.

3aii. Choosing K: The Silhouette Method

There exists another method that measures how well each datapoint $x_i$ "fits" its assigned cluster and also how poorly it fits into other clusters. This is a different way of looking at the same objective. Denote $a_{x_i}$ as the average distance from $x_i$ to all other points within its own cluster $k$. The lower the value, the better. On the other hand $b_{x_i}$ is the minimum average distance from $x_i$ to points in a different cluster, minimized over clusters. That is, compute separately for each cluster the average distance from $x_i$ to the points within that cluster, and then take the minimum. The silhouette $s(x_i)$ is defined as

$$s(x_i) = \frac{b_{x_i} - a_{x_i}}{\max{\left( a_{x_i}, b_{x_i}\right)}}$$

The silhouette score is computed on every datapoint in every cluster. The silhouette score ranges from -1 (a poor clustering) to +1 (a very dense clustering), with 0 denoting the situation where clusters overlap. Some criteria for the silhouette coefficient are provided in the table below.


| Range | Interpretation |
|---|---|
| 0.71 - 1.0 | A strong structure has been found. |
| 0.51 - 0.7 | A reasonable structure has been found. |
| 0.26 - 0.5 | The structure is weak and could be artificial. |
| < 0.25 | No substantial structure has been found. |

Source: http://www.stat.berkeley.edu/~spector/s133/Clus.html

Hint: Scikit-learn provides a function to compute this for us (phew!) called sklearn.metrics.silhouette_score. Take a look at this article on picking $K$ in scikit-learn, as it will help you in the next exercise set.
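
As a minimal usage sketch (again assuming the `pivoted` matrix from above), `silhouette_score` returns the mean of $s(x_i)$ over all points for a given labelling:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_sil = pivoted.to_numpy()
sil_labels = KMeans(n_clusters=3, random_state=10).fit_predict(X_sil)
print(silhouette_score(X_sil, sil_labels))   # closer to +1 is better; near 0 means overlapping clusters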

Checkup Exercise Set III

Exercise 1: Using the documentation for the `silhouette_score` function above, construct a series of silhouette plots like the ones in the article linked above.

Exercise 2: Compute the average silhouette score for each $K$ and plot it. What $K$ does the plot suggest we should choose? Does it differ from what we found using the Elbow method?

In [12]:
# Your turn.
# 1.Using the documentation for the `silhouette_score` function above, 
# construct a series of silhouette plots 
# like the ones in the article linked above.
In [13]:
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

range_n_clusters = range(2, 11)

ave_silhouette_score = []

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    
    # Add the silhouette average to the list
    ave_silhouette_score.append(silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Srange_n_clustersilhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

plt.plot(range_n_clusters,ave_silhouette_score)
plt.xlabel("K")
plt.ylabel("Average Silhouette Score")
    
plt.show()
For n_clusters = 2 The average silhouette_score is : 0.09174871508750351
For n_clusters = 3 The average silhouette_score is : 0.1107183912025193
For n_clusters = 4 The average silhouette_score is : 0.12349204708263416
For n_clusters = 5 The average silhouette_score is : 0.11482891379977885
For n_clusters = 6 The average silhouette_score is : 0.11879508142787866
For n_clusters = 7 The average silhouette_score is : 0.10874624428071616
For n_clusters = 8 The average silhouette_score is : 0.14097216560635834
For n_clusters = 9 The average silhouette_score is : 0.14178613108021745
For n_clusters = 10 The average silhouette_score is : 0.12819851045484193

2. Compute the average silhouette score for each 𝐾 and plot it. What 𝐾 does the plot suggest we should choose? Does it differ from what we found using the Elbow method?

Based on the chart, the best K is 9, and this result is different from what we found in the Elbow method.

3aiii. Choosing $K$: The Gap Statistic

There is one last method worth covering for picking $K$, the so-called Gap statistic. The computation for the gap statistic builds on the sum-of-squares established in the Elbow method discussion, and compares it to the sum-of-squares of a "null distribution," that is, a random set of points with no clustering. The estimate for the optimal number of clusters $K$ is the value for which $\log{SS}$ falls the farthest below that of the reference distribution:

$$G_k = E_n^*\{\log SS_k\} - \log SS_k$$

In other words, a good clustering yields a much larger difference between the reference distribution and the clustered data. The reference distribution is a Monte Carlo (randomization) procedure that constructs $B$ random distributions of points within the bounding box (limits) of the original data and then applies K-means to each synthetic distribution of data points. $E_n^*\{\log SS_k\}$ is just the average of $\log SS_k$ over all $B$ replicates. We then compute the standard deviation $\sigma_{SS}$ of the values of $\log SS_k$ computed from the $B$ replicates of the reference distribution and compute

$$s_k = \sqrt{1+1/B}\sigma_{SS}$$

Finally, we choose $K=k$ such that $G_k \geq G_{k+1} - s_{k+1}$.
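
The exercise sets do not cover the Gap statistic, so here is a hedged sketch of the procedure described above, assuming the `pivoted` offer matrix from Exercise Set I; `gap_statistic` is a helper written here for illustration, not a library function.

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_range, B=10, random_state=10):
    rng = np.random.RandomState(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in k_range:
        # log(SS_k) on the observed data
        log_ss = np.log(KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_)
        # log(SS_k) on B reference data sets drawn uniformly within the bounding box of X
        ref_log_ss = np.empty(B)
        for b in range(B):
            X_ref = rng.uniform(mins, maxs, size=X.shape)
            ref_log_ss[b] = np.log(KMeans(n_clusters=k, random_state=random_state).fit(X_ref).inertia_)
        gaps.append(ref_log_ss.mean() - log_ss)            # G_k
        sks.append(np.sqrt(1 + 1 / B) * ref_log_ss.std())  # s_k
    return np.array(gaps), np.array(sks)

X_gap = pivoted.to_numpy()
ks = np.arange(2, 11)
gaps, sks = gap_statistic(X_gap, ks)
# Choose the smallest k with G_k >= G_{k+1} - s_{k+1}
candidates = ks[:-1][gaps[:-1] >= gaps[1:] - sks[1:]]
print("Gap statistic suggests K =", candidates[0] if len(candidates) else ks[-1])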

Aside: Choosing $K$ when we Have Labels

Unsupervised learning expects that we do not have the labels. In some situations, we may wish to cluster data that is labeled. Computing the optimal number of clusters is much easier if we have access to labels. There are several methods available. We will not go into the math or details since it is rare to have access to the labels, but we provide the names and references of these measures.

  • Adjusted Rand Index
  • Mutual Information
  • V-Measure
  • Fowlkes–Mallows index

Hint: See this article for more information about these metrics.
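
For illustration only (we have no ground-truth labels for this data set, so `true_labels` below is a hypothetical placeholder), scikit-learn exposes each of the measures listed above in `sklearn.metrics`:

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

X_lbl = pivoted.to_numpy()
true_labels = np.random.RandomState(0).randint(0, 3, size=len(X_lbl))   # hypothetical ground truth
pred_labels = KMeans(n_clusters=3, random_state=10).fit_predict(X_lbl)

print("Adjusted Rand Index:   ", metrics.adjusted_rand_score(true_labels, pred_labels))
print("Mutual Information:    ", metrics.adjusted_mutual_info_score(true_labels, pred_labels))
print("V-Measure:             ", metrics.v_measure_score(true_labels, pred_labels))
print("Fowlkes-Mallows index: ", metrics.fowlkes_mallows_score(true_labels, pred_labels))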

3b. Visualizing Clusters using PCA

How do we visualize clusters? If we only had two features, we could likely plot the data as is. But we have 100 data points each containing 32 features (dimensions). Principal Component Analysis (PCA) will help us reduce the dimensionality of our data from 32 to something lower. For a visualization on the coordinate plane, we will use 2 dimensions. In this exercise, we're going to use it to transform our multi-dimensional dataset into a 2 dimensional dataset.

This is only one use of PCA for dimension reduction. We can also use PCA when we want to perform regression but we have a set of highly correlated variables. PCA untangles these correlations into a smaller number of features/predictors all of which are orthogonal (not correlated). PCA is also used to reduce a large set of variables into a much smaller one.
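
A small sketch of that point, using the offer matrix from above: the PCA components are mutually uncorrelated even though the original columns are not.

import numpy as np
from sklearn.decomposition import PCA

X_corr = pivoted.to_numpy()
components = PCA(n_components=5).fit_transform(X_corr)
print(np.round(np.corrcoef(components, rowvar=False), 3))   # off-diagonal correlations are ~0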

Hint: PCA was discussed in the previous subunit. If you need help with it, consult this useful article and this visual explanation.

Checkup Exercise Set IV

Exercise 1: Use PCA to plot your clusters:

  • 1. Use scikit-learn's [`PCA`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) function to reduce the dimensionality of your clustering data to 2 components
  • 2. Create a data frame with the following fields:
    • customer name
    • cluster id the customer belongs to
    • the two PCA components (label them `x` and `y`)
  • 3. Plot a scatterplot of the `x` vs `y` columns
  • 4. Color-code points differently based on cluster ID
  • 5. How do the clusters look?
  • 6. Based on what you see, what seems to be the best value for $K$? Moreover, which method of choosing $K$ seems to have produced the optimal result visually?

Exercise 2: Now look at both the original raw data about the offers and transactions and look at the fitted clusters. Tell a story about the clusters in context of the original data. For example, do the clusters correspond to wine variants or something else interesting?

In [14]:
#your turn
# 1.1, 1.2
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)

X_pca = pd.DataFrame(X_pca,columns='x y'.split())
X_pca.head()
Out[14]:
x y
0 1.007580 0.108215
1 -0.287539 0.044715
2 -0.392032 1.038391
3 0.699477 -0.022542
4 0.088183 -0.471695
In [15]:
# 1.3, 1.4
# The cluster labels come from the K-Means model in 3a, with K=10 chosen by the Elbow method.
import matplotlib.colors

best_K = 10
model = KMeans(n_clusters = best_K,random_state=42)
model.fit(X)

# Note: this colormap has only 3 colors, so the 10 cluster labels get binned into 3 color groups
cmap = matplotlib.colors.ListedColormap(['red', 'green', 'blue'])
plt.scatter(X_pca.x,X_pca.y,c=model.labels_,cmap=cmap);

plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.show()

1.5 How do the clusters look? 1.6 Based on what you see, what seems to be the best value for 𝐾? Moreover, which method of choosing 𝐾 seems to have produced the optimal result visually?

The clusters are now much easier to see. Based on these two PCA components, there appear to be at least three clusters: the blue and green groups are well separated and the red points make up the rest, although by eye the number of clusters could plausibly be as high as seven or eight.

When it comes to finding the best K visually, the PCA projection seems most helpful because, unlike the Elbow or Silhouette methods, it lets us eyeball the data distribution directly.

2. Now look at both the original raw data about the offers and transactions and look at the fitted clusters. Tell a story about the clusters in context of the original data. For example, do the clusters correspond to wine variants or something else interesting?

In [235]:
# K we use here is 3
n_clusters=3
model = sklearn.cluster.KMeans(n_clusters=n_clusters,random_state=42)
cluster_assignments = model.fit_predict(X)

# Visualise wine variants vs cluster
df2 = pivoted.copy()
df2['cluster'] = model.labels_

df1 = df.merge(df2[['cluster']],on='customer_name',how='left')
df1[['cluster']] = df1[['cluster']].astype('category')
df1[['offer_id']] = df1[['offer_id']].astype('category')
df1.head()
Out[235]:
customer_name offer_id n varietal min_qty discount origin past_peak cluster
0 Smith 2 1 Pinot Noir 72 17 France False 2
1 Smith 24 1 Pinot Noir 6 34 Italy False 2
2 Johnson 17 1 Pinot Noir 12 47 Germany False 2
3 Johnson 24 1 Pinot Noir 6 34 Italy False 2
4 Johnson 26 1 Pinot Noir 144 83 Australia False 2
In [283]:
# Plot purchase counts per wine varietal, broken down by cluster
df1.pivot_table(index=['varietal'],columns='cluster',aggfunc='size').plot(kind='barh',figsize=(14,7));

Based on this chart, there seems to be a relationship between the choice of wine varietal and the clusters. Customers who chose the first three varietals shown in the chart seem more likely to fall into cluster 0, those who chose Espumante or Pinot Grigio tend to fall into cluster 1, and those who chose Pinot Noir mostly fall into cluster 2.

In [282]:
# Plot purchase counts per country of origin, broken down by cluster
df1.pivot_table(index=['origin'],columns='cluster',aggfunc='size').plot(kind='barh',figsize=(14,7));

Looking purely at frequency, wines originating from France were bought by far the most. Overall, we can see some variation in purchase counts across origins and clusters.

In [19]:
# Plot the lift per offer: the proportion of customers in each cluster who took the
# offer, minus the global proportion across all customers.

n_clusters=3
model = sklearn.cluster.KMeans(n_clusters=n_clusters)
cluster_assignments = model.fit_predict(X)

colors = ['red', 'green', 'blue']
offer_proportions = X.sum(axis=0) / 100  # There are 100 customers

for i in range(n_clusters):
    plt.figure(i)
    cluster = X[cluster_assignments == i]
    offer_proportions_cluster = cluster.sum(axis=0) / cluster.shape[0]  # Number of customers in cluster
    lift = offer_proportions_cluster - offer_proportions
    plt.bar(range(1, 33), lift, color = colors[i])
    
    

Summary

We have taken the columns of 0/1 indicator variables and transformed them into a 2-D dataset. One component was arbitrarily called x and the other y. Each point can now be plotted in a scatterplot, color-coded by its cluster so that the clusters are easier to see.

Exercise Set V

As we saw earlier, PCA has a lot of other uses. Since we wanted to visualize our data in 2 dimensions, we restricted the number of dimensions to 2 in PCA. But what is the true optimal number of dimensions?

Exercise: Using a new PCA object shown in the next cell, plot the `explained_variance_` field and look for the elbow point, the point where the curve's rate of descent seems to slow sharply. This value is one possible value for the optimal number of dimensions. What is it?

In [20]:
#your turn
# Initialize a new PCA model with a default number of components.
import sklearn.decomposition
pca = sklearn.decomposition.PCA()
pca.fit(X)

# Do the rest on your own :)
variance = pca.explained_variance_ratio_

# Plot() it 
plt.plot(range(len(variance)), variance)

# Label the axes
plt.xlabel("Number of Components")
plt.ylabel("Proportion of Variance Explained")
Out[20]:
Text(0, 0.5, 'Proportion of Variance Explained')

4. Conclusions and next steps

4a. Conclusions

What can you conclude from your investigations? Make a note, formulate it as clearly as possible, and be prepared to discuss it with your mentor in your next call.

4b. Other clustering algorithms

K-means is only one of many clustering algorithms. Below is a brief description of several of them, and the table that follows summarizes the clustering algorithms available in scikit-learn.

  • Affinity Propagation does not require the number of clusters $K$ to be known in advance! AP uses a "message passing" paradigm to cluster points based on their similarity.

  • Spectral Clustering uses the eigenvalues of a similarity matrix to reduce the dimensionality of the data before clustering in a lower dimensional space. This is tangentially similar to what we did to visualize k-means clusters using PCA. The number of clusters must be known a priori.

  • Ward's Method applies to hierarchical clustering. Hierarchical clustering algorithms take a set of data and build a hierarchy of clusters; in the divisive (top-down) variant, all observations start in a single cluster, which is successively split into more and more clusters at each layer of the hierarchy. Ward's method is the criterion used to decide which two clusters should be merged (or, equivalently, where a cluster should be split) at each step. With hierarchical clustering, a hierarchy is constructed, and there is not really the concept of "number of clusters"; the number of clusters simply determines how low or how high in the hierarchy we cut and can be determined empirically or by looking at the dendrogram.

  • Agglomerative Clustering is the bottom-up counterpart of divisive hierarchical clustering: it is not divisive, it is agglomerative. That is, every observation starts in its own cluster and, at each iteration or level of the hierarchy, observations are merged into fewer and fewer clusters until only one cluster remains. As with hierarchical clustering, the constructed hierarchy contains all possible numbers of clusters and it is up to the analyst to pick the number by reviewing statistics or the dendrogram.

  • DBSCAN is based on point density rather than distance. It groups together points with many nearby neighbors. DBSCAN is one of the most cited algorithms in the literature. It does not require knowing the number of clusters a priori, but does require specifying the neighborhood size.

Clustering Algorithms in Scikit-learn

| Method name | Parameters | Scalability | Use Case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |

Source: http://scikit-learn.org/stable/modules/clustering.html

Exercise Set VI

Exercise: Try clustering using the following algorithms.

  1. Affinity propagation
  2. Spectral clustering
  3. Agglomerative clustering
  4. DBSCAN

How do their results compare? Which performs the best? Tell a story why you think it performs the best.

In [40]:
from math import sqrt
 
# Squared Euclidean distance
# between two points
def dist(p1, p2):
     
    x0 = p1[0] - p2[0]
    y0 = p1[1] - p2[1]
    return x0 * x0 + y0 * y0
 
# Function to find the maximum
# distance between any two points
def maxDist(p):
 
    n = len(p)
    maxm = 0
 
    # Iterate over all possible pairs
    for i in range(n):
        for j in range(i + 1, n):
             
            # Update maxm
            maxm = max(maxm, dist(p[i], p[j]))
 
    # Return actual distance
    return sqrt(maxm)
 
# This code is contributed by mohit kumar 29
# https://www.geeksforgeeks.org/maximum-distance-between-two-points-in-coordinate-plane-using-rotating-calipers-method/
In [39]:
from sklearn.cluster import AffinityPropagation
X = pivoted.to_numpy()

pca = sklearn.decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)

# Set preference to -1 times the largest pairwise distance in the data set
# https://stackoverflow.com/questions/33187354/affinity-propagation-preferences-initialization
preference = -maxDist(X)

# Compute Affinity Propagation
af = AffinityPropagation(preference=preference,random_state=42).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)


# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')

for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
In [148]:
# Reset PCA
X = pivoted.to_numpy()
pca = sklearn.decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)
In [161]:
# Clustering with the SpectralClustering and visualizing

from sklearn.cluster import SpectralClustering

colors = plt.get_cmap("tab10")

plt.figure(figsize=(14,28))
nrows = 5
ncols = 2

for i in range(2,11):
    sc = SpectralClustering(n_clusters=i).fit(X)
    plt.subplot(5, 2, i-1)
    plt.scatter(X[:,0], X[:,1], c=sc.labels_,cmap=colors)
    plt.title("n_cluster-"+str(i))
plt.tight_layout();

With regard to the best K, it is difficult to say that one value is better than the others. Looking at n_cluster-5, most of the data is labeled the same way as with Affinity Propagation, but the points close to the boundaries differ. At n_cluster-10, two of the right-end clusters (blue and orange) seem poorly clustered and counter-intuitive. Spectral clustering might therefore be inappropriate here.

In [196]:
# Reset PCA
X = pivoted.to_numpy()
pca = sklearn.decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)

# Clustering with AgglomerativeClustering and visualizing

from sklearn.cluster import AgglomerativeClustering

colors = plt.get_cmap("tab10")

plt.figure(figsize=(14,28))
nrows = 5
ncols = 2

for i in range(2,11):
    hc = AgglomerativeClustering(n_clusters = i, affinity = 'euclidean', linkage ='ward')
    hc.fit_predict(X)
    plt.subplot(5, 2, i-1)
    plt.scatter(X[:,0], X[:,1], c=hc.labels_,cmap=colors)
    plt.title("n_cluster-"+str(i))
plt.tight_layout();

Agglomerative clustering gives the most reasonable results so far. Even n_cluster-10 divides the points into clusters in quite an intuitive way. This may be because Ward's method matches well how we would visually group points in 2D space.

In [320]:
from sklearn.neighbors import NearestNeighbors

# Reset PCA
X = pivoted.to_numpy()
pca = sklearn.decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)

# Find the optimal epsilon
ns = 4
nbrs = NearestNeighbors(n_neighbors=ns).fit(X)
distances, indices = nbrs.kneighbors(X)
distanceDec = sorted(distances[:,ns-1], reverse=True)
plt.plot(list(range(1,len(X)+1)), distanceDec);

plt.xlabel("Distance")
plt.ylabel("epsilon")
plt.title('Elbow plot')
Out[320]:
Text(0.5, 1.0, 'eps elbow plot')
In [327]:
# Optimal eps
eps =0.22
min_samples=4
In [325]:
print(f'We define a core sample as being a sample in the dataset such that there exist {min_samples} other samples within a distance of {eps}. Higher min_samples or lower eps indicate higher density necessary to form a cluster.')
We define a core sample as being a sample in the dataset such that there exist 4 other samples within a distance of 0.22. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
In [326]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Reset PCA
X = pivoted.to_numpy()
pca = sklearn.decomposition.PCA(n_components=2)
pca.fit(X)
X = pca.transform(X)

# Compute DBSCAN

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)

# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Estimated number of clusters: 5
Estimated number of noise points: 17

Although there are some noise points (outliers), DBSCAN estimates the same number of clusters as Affinity Propagation did.
