In this project we apply unsupervised learning techniques to create customer segments. We do this by analysing a dataset of various customers' annual spending amounts (reported in monetary units) across diverse product categories. A great blog post on customer segmentation can be found here.
One goal of this project is to describe the variation in the different types of customers that a wholesale distributor interacts with. Achieving this has several benefits for the distributor: chiefly, it equips them with insight into how best to structure their delivery service to meet the needs of each customer. This is a critical and practical problem that many distributors and organisations face.
The other goal of the project is simply to group customers into clusters of similar spending characteristics. This equips an organisation to better understand its audience and to curate production plans that meet demand in a timely manner.
The dataset for this project can be found on the UCI Machine Learning Repository. The site currently maintains 622 data sets as a service to the machine learning community.
The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. Below is a list and description of the variables in the data set:
1) FRESH: annual spending (m.u.) on fresh products (Continuous);
2) MILK: annual spending (m.u.) on milk products (Continuous);
3) GROCERY: annual spending (m.u.) on grocery products (Continuous);
4) FROZEN: annual spending (m.u.) on frozen products (Continuous);
5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
7) CHANNEL: customer's channel - Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal);
8) REGION: customer's region - Lisbon, Oporto or Other (Nominal).
More information can be found here.
For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis, with focus instead on the six product categories recorded for customers.
We begin by loading the necessary packages for the project. We also import the .py file we created for adding some supplementary visualisations, and then load the data for the project.
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')
# Show matplotlib plots inline
%matplotlib inline
# Import supplementary code visuals.py
import visuals as vs
# Load the dataset
data = pd.read_csv("customers.csv")
data.drop(['Region', 'Channel'], axis = 1, inplace = True)
print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
Here we begin having a look at our data through a series of different plots and summaries to understand how each feature is related to the others.
As mentioned the data set has six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. Let's produce a quick summary of the variables in our data set.
# Summary
display(data.describe().transpose())
We see that the ranges appear to be quite vast for all the variables. Let's visualise these results better using box plots.
# Box-Plots of Summary for Variables
sns.set(rc = {'figure.figsize':(18,12)}) # used for controls of seaborn plot
plt.subplot(3, 2, 1)
sns.boxplot(y=data["Fresh"], width=0.3);
plt.subplot(3, 2, 2)
sns.boxplot(y=data["Milk"], width=0.3);
plt.subplot(3, 2, 3)
sns.boxplot(y=data["Grocery"], width=0.3);
plt.subplot(3, 2, 4)
sns.boxplot(y=data["Frozen"], width=0.3);
plt.subplot(3, 2, 5)
sns.boxplot(y=data["Detergents_Paper"], width=0.3);
plt.subplot(3, 2, 6)
sns.boxplot(y=data["Delicatessen"], width=0.3);
plt.tight_layout()
plt.show()
We see that all the variables are highly positively (right) skewed, with large outlying values.
We can select a subsample to aid us in better understanding (some of) the customers and how their data will transform through the analysis. We select a few sample data points and explore them in more detail below. We experimented with various indices (which represent the customers to track) in order to obtain customers that vary significantly from one another.
# Customers to track
indices = [26,176,392]
# Create DataFrame of Subsample
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset: \n")
display(samples)
A possible question we could be posed with, is to discern just by viewing the expenditure of these three customers, what kind of establishment (customer) could each of the three samples we’ve chosen represent?
Customer 1
# Customer 1
print("Customer 1:")
display(samples.loc[0])
Customer 2
# Customer 2
print("Customer 2:")
display(samples.loc[1])
Customer 3
# Customer 3
print("Customer 3:")
display(samples.loc[2])
To compare the three customers, below we plot each sample's spending per category alongside the dataset means.
# Comparison of Samples
# Get the means
mean_data = data.describe().loc['mean', :]
# Append means to the samples' data (DataFrame.append was removed in pandas 2.0)
samples_bar = pd.concat([samples, mean_data.to_frame().T], ignore_index=True)
# Construct indices
samples_bar.index = indices + ['mean']
# Plot bar plot
samples_bar.plot(kind='bar', figsize=(14,8));
For Customer 2 (index 176), we see a very large spend on Fresh items compared to the other customers, and even in comparison to the mean of the dataset (of all customers).
Some extremes for other variables are also reflected/highlighted below in the comparison of percentiles plot.
# Percentiles comparison
# First, calculate the percentile ranks of the whole dataset.
percentiles = data.rank(pct=True)
# Then, round to three decimal places and multiply by 100
percentiles = 100*percentiles.round(decimals=3)
# Select the indices you chose from the percentiles dataframe
percentiles = percentiles.iloc[indices]
# Now, create the heat map using the seaborn library
sns.heatmap(percentiles, vmin=1, vmax=99, annot=True);
If we had to consider other customers, let's use indices 43, 12 and 39. The resulting sample is shown below.
indices2 = [43, 12, 39]
# Create DataFrame of Subsample
samples2 = pd.DataFrame(data.loc[indices2], columns = data.keys()).reset_index(drop = True)
print("Chosen second samples of wholesale customers dataset: \n")
display(samples2)
Looking at the quantities of the annual expenditures, we can perhaps deduce, or at least speculate on, the type of establishment. We tried to be more specific here, for fun.
Index 0: Coffee Cafe
- Low spending on "Fresh", "Frozen" and "Delicatessen".
- Majority of spending on "Grocery", "Milk" and "Detergents_Paper".
- With some spending on "Delicatessen", it may be a cafe establishment serving drinks, coffee perhaps, with some ready-made food as a complementary product.
Index 1: Upscale Restaurant
- Low spending on "Frozen".
- Majority of spending is a mix of "Fresh", "Milk" and "Grocery".
- This may be an upscale restaurant with almost no spending on frozen foods.
- Most upscale restaurants only use fresh foods.
Index 2: Fresh Food Retailer
- Majority of spending is on "Fresh" goods with little spending on everything else except on "Frozen".
- This may be a grocery store specializing in fresh foods with some frozen goods.
An interesting question arises when we look at the data. Is it possible that maybe one or more of the six product categories is actually not relevant for understanding customer purchasing? That is, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products?
This is a clear deviation from our data exploration but is an interesting question to which we can simply answer by running a regression. That is, we shall train a simple supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.
We do this with the feature MILK removed (we need to predict that given the other variable consumptions).
# Import Modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# New Data without Milk (need to predict it)
new_data = data.drop(['Milk'],axis=1)
# Create Train and test data sets
X_train, X_test, y_train, y_test = train_test_split(new_data,data['Milk'],test_size=0.25,random_state=101)
# fit regression decision tree (CART)
regressor = DecisionTreeRegressor(random_state=101).fit(X_train,y_train)
# Get Prediction
score = regressor.score(X_test,y_test)
print(score)
We trained the model with 75% of the data from our original data set, and tested it using the remaining 25% of customer records. We tried to predict the 'Milk' feature (i.e. annual spending on milk products) based on the other features in the dataset, and our model achieved an $R^2$ score of 0.2957. Since $R^2$ is at most 1 (and can even be negative for a model that performs worse than simply predicting the mean), the model we built for customers' milk purchasing habits isn't very good, although it is possible that there is some correlation between this feature and the others.
It's safe to say that the 'Milk' feature is necessary for identifying customers' spending habits, because it isn't possible to predict how a customer spends on Milk based on their spending on the other product categories. We can say that the 'Milk' feature adds extra (and maybe key) information to the data which a model cannot easily infer from the other features alone.
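As an aside, $R^2$ is not actually bounded below by 0: a model that predicts worse than simply guessing the mean of the target receives a negative score. A small sketch with made-up numbers illustrates this:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Predicting the mean of y_true gives R^2 = 0 by definition
mean_preds = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_preds))   # 0.0

# A model that predicts worse than the mean yields a negative R^2
bad_preds = np.array([40.0, 30.0, 20.0, 10.0])
print(r2_score(y_true, bad_preds))    # -3.0
```

This matters below, where some features will produce negative $R^2$ scores when used as the target.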
Suppose we wanted to find the $R^2$ score for each variable, not just 'Milk'. We can create a loop that treats each variable in turn as the target. Let's do that below.
# Create list to loop through
dep_vars = list(data.columns)
# Create loop to test each feature as a dependent variable
for var in dep_vars:
    # 'drop' function to drop the given feature
    new_data = data.drop([var], axis = 1)
    # Create feature Series
    new_feature = pd.DataFrame(data.loc[:, var])
    # Split the data into training and testing sets using the given feature as the target
    X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=101)
    # Create a decision tree regressor and fit it to the training set
    dtr = DecisionTreeRegressor(random_state=101)
    dtr.fit(X_train, y_train)
    # Report the score of the prediction using the testing set (returns R^2)
    score = dtr.score(X_test, y_test)
    print('R2 score for {} as dependent variable: {}'.format(var, score))
So essentially what we did was use a loop and predicted every single feature as a dependent variable with the results shown above.
We see that "Fresh" and "Frozen" as dependent variables have negative $R^2$ scores. Their negative scores imply that they are necessary for identifying customers' spending habits because the remaining features cannot explain the variation in them.
Similarly, "Milk" and "Delicatessen" have very low $R^2$ scores. Their low scores also imply that they are necessary for identifying customers' spending habits.
We now turn our attention to the whole data set again. We can construct a scatter matrix of each of the six product features present in the data.
Apart from simply finding associations between variables in our data set, this scatter plot matrix can also help with an earlier question we asked. That is, we can investigate the manner in which MILK may or may not be related to the other variables.
# Produce a scatter matrix
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (20,10), diagonal = 'kde');
Looking at the plot above, we see the relationship between each pair of variables. On the diagonal, we also get the density distribution of each variable.
We can identify a few pairs of features that exhibit some degree of correlation, most notably Grocery and Detergents_Paper.
As we tried to predict the 'Milk' feature earlier, we found that it was in fact not really possible to predict how a customer spends on Milk based on their spending on the other product categories. The correlation results above confirm the suspicion that Milk isn't correlated with most of the features in the dataset, although it shows a relatively mild correlation with 'Grocery' and 'Detergents_Paper'.
Moreover, the features that are strongly correlated (i.e., Grocery and Detergents_Paper) do lend credence to our initial claim that Grocery may not be necessary for identifying customers' spending habits.
The distribution of all the features appears to be similar: strongly right-skewed, as we suspected from the box plots earlier on. That is, most of the data points fall in the first few intervals. Judging by the summary statistics of the features that we calculated earlier, especially the mean and maximum values, we can expect that there are some outliers in each of the distributions. We also saw several outliers when we drew up the box plots.
The data are not normally distributed, and many outliers are present. This indicates that a normalising transformation is required, as many clustering algorithms (which we shall look at later) work best when the features are approximately normally distributed.
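As a quick sketch of what such a transformation does (on synthetic lognormal data, not our actual customers.csv), we can measure the skewness before and after taking logs:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(101)
# Synthetic right-skewed spending data (lognormal, similar in shape to our features)
spend = rng.lognormal(mean=8, sigma=1, size=440)

# Strongly positive before the transform, near zero after it
print("Skewness before log transform:", skew(spend))
print("Skewness after log transform:", skew(np.log(spend)))
```

The same check could be run on each column of the real data to confirm that the log transform reduces the skew we saw in the box plots.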
Let's get the actual correlation heatmap for the data.
sns.set(rc = {'figure.figsize':(20,15)})
sns.heatmap(data.corr(), annot=True, cmap = "summer"); # heatmap with seaborn
This lets us cross-reference the scatter matrix above to draw more accurate insights from the data. The higher a value's colour sits on the bar, the higher the correlation.
Here we shall seek to preprocess our data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers.
That is, we shall conduct:
- feature scaling (via a logarithmic transformation); and
- outlier detection and removal.
This is all to ensure the results we obtain from the analysis are significant and meaningful.
We noticed in our box plots, density plots and summary statistics that the distributions of the variables are not normal. Specifically, the mean and median differ significantly (indicating a large skew). This often indicates that non-linear scaling is required.
One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data for reducing skewness.
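We don't use Box-Cox in this project, but as an illustrative sketch (again on synthetic positive data, not customers.csv), scipy's `boxcox` fits the power $\lambda$ by maximum likelihood; $\lambda = 0$ corresponds exactly to the natural logarithm:

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(101)
# Box-Cox requires strictly positive data, which holds for spending amounts
spend = rng.lognormal(mean=8, sigma=1, size=440)

# boxcox returns the transformed data and the fitted power lambda;
# for lognormal data the fitted lambda should be close to 0 (i.e. the log)
transformed, fitted_lambda = boxcox(spend)
print("Fitted lambda:", fitted_lambda)
```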
A simpler approach, which can work in most cases, is applying the natural logarithm. We use this very method below.
# Scale the data using the natural logarithm
log_data = data.apply(lambda x: np.log(x))
# Scale the sample data using the natural logarithm
log_samples = samples.apply(lambda x: np.log(x))
# Produce a scatter matrix - transformed data
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
After applying the natural logarithm scaling to the data, the distribution of each feature appears much more normal. Moreover, the pairs of variables which were correlated before are still correlated after the transformation.
print("Correlation Matrix (Original Data):")
display(data.corr())
print("Correlation Matrix (Transformed Data):")
display(log_data.corr())
Upon closer investigation we do see that the correlation appears to be weaker though.
Identifying outliers in data is an important part of statistical analyses. The presence of outliers can often skew results which take into consideration these data points.
There are many methods and rules of thumb that describe what makes an outlier. We make use of Tukey's method for identifying outliers: a simple rule based on the quartiles of the data. An outlier step is calculated as 1.5 times the interquartile range (IQR), and a data point with a feature value beyond an outlier step outside of the IQR for that feature is considered abnormal.
That is; the first quartile $Q1$ is the value ≥ 1/4 of the data, the second quartile $Q2$ or the median is the value ≥ 1/2 of the data, and the third quartile $Q3$ is the value ≥ 3/4 of the data. The inter-quartile range, IQR, is $Q3 − Q1$. Tukey’s rule says that the outliers are values more than 1.5 times the inter-quartile range from the quartiles — either below $Q1 − 1.5IQR$, or above $Q3 + 1.5IQR$ (Tukey's Method).
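To make the rule concrete, here is a tiny worked example of Tukey's method on a toy feature (made-up numbers, not from our dataset):

```python
import numpy as np

# Toy feature with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

Q1 = np.percentile(values, 25)   # 3.25
Q3 = np.percentile(values, 75)   # 7.75
step = 1.5 * (Q3 - Q1)           # 6.75

# Anything below Q1 - step or above Q3 + step is flagged
outliers = values[(values < Q1 - step) | (values > Q3 + step)]
print(outliers)   # [100]
```

The loop below applies exactly this logic to each (log-transformed) feature.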
# Select the indices for data points that we wish to remove
outliers = []
# For each feature (keys) find the data points with extreme high or low values
for feature in log_data.keys():
    # Calculate Q1 (25th percentile) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    # Calculate Q3 (75th percentile) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    # Calculate an outlier step (1.5 times the interquartile range)
    step = (Q3 - Q1) * 1.5
    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    out = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    display(out)
    outliers = outliers + list(out.index.values)
# Keep only the data points flagged as outliers for more than one feature
outliers = list(set([x for x in outliers if outliers.count(x) > 1]))
print("Outliers: {}".format(outliers))
# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
print("The good dataset now has {} observations after removing outliers.".format(len(good_data)))
# Original Data
print('Original shape of data:\n', log_data.shape)
# Processed Data
print('New shape of data:\n', good_data.shape)
Upon quick inspection, our sample doesn't contain any of the outlier values. The good data is now a matrix measuring (435x6) instead of the original (440x6).
There were 5 data points that were considered outliers for more than one feature under our definition above. So, instead of removing all outliers (which would mean losing a lot of information), we remove only those that occur for more than one feature. Since they are outliers in multiple categories, not just one, they are not representative of our general customers.
More information regarding whether to drop outliers is provided by the article here.
Here we shall use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.
More specfically; PCA refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features $X_1, X_2, . . . , X_p$ and no associated response $Y$. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the $p$ features.
TL;DR: PCA starts by normalising the data set, then performs an eigen-analysis of the covariance matrix. To reduce the dimension, the data set is projected onto the first few principal components. It is a method to reduce the dimensionality of data whilst retaining as much information as possible.
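To make that eigen-analysis concrete, here is a minimal PCA sketch in NumPy on synthetic data. (This is not how scikit-learn's `PCA` is implemented internally, which uses an SVD, but the results agree.)

```python
import numpy as np

rng = np.random.default_rng(101)
X = rng.normal(size=(100, 3))

# Centre the data, then eigen-decompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by descending explained variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the first two principal components
projected = Xc @ eigvecs[:, :2]
print("Explained variance ratio:", eigvals[:2] / eigvals.sum())
```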
Previously, using natural logarithms, we scaled the data to a more normal distribution and removed the necessary outliers. As such, we can now apply PCA to good_data to discover which dimensions best maximise the variance of the features involved.
Note that a component (dimension) from PCA can be considered a new “feature” of the space, however it is a composition of the original features present in the data.
# Import PCA module
from sklearn.decomposition import PCA
# Apply PCA by fitting the good data
pca = PCA().fit(good_data)
# Transform log_samples using the PCA fit
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
If we look at the loadings for each component:
Dimension 1: Milk, Grocery, and Detergents_Paper load highly on this component. This might represent Hotels, where these items are usually needed for the guests. This is in line with our initial findings, where the 3 features are highly correlated. An increase in PC1 is associated with large increases in "Milk", "Grocery" and "Detergents_Paper" spending.
Dimension 2: Fresh, Frozen, and Delicatessen. This dimension might represent 'restaurants', where these items are used as ingredients in cooking dishes. We also note a very low loading/contribution from Detergents_Paper. An increase in PC2 is associated with large increases in "Fresh", "Frozen" and "Delicatessen" spending.
Dimension 3: has a high positive weight for the Delicatessen and Frozen features, and a low positive weight for Milk, but negative weights for everything else. This dimension might represent Delis. An increase in PC3 is associated with a large increase in "Delicatessen" and a large decrease in "Fresh" spending.
Dimension 4: has positive weights for Frozen, Detergents_Paper and Grocery, while being negative for Fresh and Delicatessen. It's a bit tricky to pin this segment down, but there are shops that sell frozen goods almost exclusively. An increase in PC4 is associated with a large increase in "Frozen" and a large decrease in "Delicatessen" spending.
In addition to finding these dimensions, PCA also reports the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone.
We see that the first four components account for over 90% of the variation in the data. The first and second components together explain approx. 70.8% of the variance; the first four together explain approx. 93.11%. However, to establish how many components we need, we can use a scree plot.
PC_values = np.arange(pca.n_components_) + 1
plt.plot(PC_values, pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()
The x-axis displays the principal component and the y-axis displays the percentage of total variance explained by each individual principal component. We look for an elbow in the plot. A clear elbow is at 3 components. We shall use 4, given that with 4 we can explain over 90% of the data variation.
We can also use the following code to display the exact percentage of total variance explained by each principal component:
# Explained Variance by PC's
print(pca.explained_variance_ratio_)
As mentioned before and shown in our earlier plot, the first principal component explains 44.3% of the total variation in the dataset. The second principal component explains 26.4% of the total variation. And so on.
Note that one of the main goals of using PCA is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: some of the total variance in the data is lost.
Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. We have established that using just four principal components, we can explain over 93% of the variation in the data.
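The cumulative ratio can be computed with np.cumsum. The six ratios below are illustrative values consistent with the totals quoted above (roughly 44.3%, 26.4%, and a four-component total of 93.11%); in practice the exact numbers come from the fitted `pca` object's `explained_variance_ratio_`:

```python
import numpy as np

# Illustrative explained-variance ratios (PC1..PC6), consistent with the totals above
ratios = np.array([0.4430, 0.2638, 0.1231, 0.1012, 0.0485, 0.0204])

cumulative = np.cumsum(ratios)
print(cumulative)

# Smallest number of components explaining at least 90% of the variance
n_needed = np.argmax(cumulative >= 0.90) + 1
print(n_needed)   # 4
```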
As such, since a significant amount of the variance is explained by only two or three dimensions (roughly 70-80%), the reduced data can be visualised afterwards in a bi-plot.
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)
# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
# Create a biplot
vs.biplot(good_data, reduced_data, pca);
A bi-plot is a scatter plot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the bi-plot shows the projection of the original features along the components. A bi-plot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.
The original feature projections are in red. A point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.
We can also deduce a lot of other information from the biplot. Essentially, the biplot shows where the points are the projected observations and the vectors are the projected variables.
The variance of the data along the principal component directions is associated with the magnitude of the eigenvalues - that is, the length of the vector is a measure of the variance of the data when projected onto that axis. Moreover, the further away these vectors are from a PC origin, the more influence they have on that PC.
Also, the angles between the vectors tell us how characteristics correlate with one another. When two vectors are close, forming a small angle, the two variables they represent are positively correlated.
Clustering analysis concerns itself with grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).
In this part of the project we shall examine, implement and choose to use either a K-Means clustering algorithm, PAM clustering or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data.
All three methods are classified as non-hierarchical clustering algorithms, as opposed to hierarchical clustering methods. A key difference is that for non-hierarchical clustering the desired number of clusters must be specified in advance. Moreover, this type of clustering does not build a tree-like cluster structure; rather, it deals with splitting up and merging clusters.
To begin the algorithm, each observation is randomly assigned to one of the specified clusters. Once assigned, the centroid of each cluster is calculated. The objective of the K-means algorithm is to minimise the within-cluster sum of squared Euclidean distances (ESS) across all clusters. This ensures that all observations are close to the centroid of the cluster they lie in. The algorithm iterates by assigning each observation to the closest cluster centroid, reducing the ESS, and continues to iterate until no observation moves from the cluster it is located in.
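The iteration just described can be sketched in a few lines of NumPy. This is a simplified version for illustration: it seeds the centroids from the first k observations rather than a random assignment, and ignores the empty-cluster edge case that a production implementation must handle.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100):
    """Minimal K-means: assign points to the nearest centroid, recompute, repeat."""
    # For simplicity, seed the centroids with the first k observations
    centroids = X[:k].astype(float).copy()
    for _ in range(n_iter):
        # Assign each observation to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster's centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break   # no observation changed cluster: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups, interleaved
X_demo = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [4.9, 5.0], [0.0, 0.2], [5.0, 4.8]])
labels, centroids = kmeans_sketch(X_demo, k=2)
print(labels)   # [0 1 0 1 0 1]
```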
Advantages of K-Means clustering:
- Simple to implement and interpret;
- Scales well to large data sets and is computationally fast.
Disadvantages:
- The number of clusters must be chosen in advance;
- Sensitive to the initial centroid positions and to outliers;
- Assumes roughly spherical, similarly sized clusters.
In order to assess the most suitable number of clusters, we examine an elbow plot below. We can choose the right number of clusters with the help of the Within-Cluster-Sum-of-Squares (WCSS) method. WCSS stands for the sum of the squared distances of the data points in each and every cluster from its centroid.
The process:
1) Execute K-means clustering on the dataset for different values of K (e.g. ranging from 1 to 10).
2) For each value of K, calculate the WCSS value.
3) Plot a curve of the WCSS values against the respective number of clusters K.
4) The sharp point of bend (looking like an elbow joint) of the plot will be considered the best/optimal value of K.
#import module
from sklearn.cluster import KMeans
# Data
x = reduced_data
# Common keyword arguments for KMeans
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 101,
}
# Initialise Within-Cluster-Sum-of-Squares (WSS)
wss=[]
# loop to get wss
for i in range(1, 7):
    kmeans = KMeans(n_clusters=i, **kmeans_kwargs)
    kmeans.fit(x)
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)
# plot elbow plot with wss
number_clusters = range(1,7)
plt.plot(number_clusters, wss)
plt.title('The Elbow Plot')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
We shall use 2 clusters, as indicated by the elbow. Note that if we have trouble choosing the elbow point of the curve, we can use a Python package, kneed, to identify the elbow point programmatically.
# modules
from kneed import KneeLocator
# Apply program to choose k
kl = KneeLocator(
    range(1, 7), wss, curve="convex", direction="decreasing"
)
print(kl.elbow)
We can also apply the silhouette method to choose the number of clusters. See the code below.
# modules
from sklearn.metrics import silhouette_score
# A list holds the silhouette coefficients for each k
silhouette_coefficients = []
# Notice you start at 2 clusters for silhouette coefficient
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(reduced_data)
    score = silhouette_score(reduced_data, kmeans.labels_)
    silhouette_coefficients.append(score)
plt.style.use("fivethirtyeight")
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
# Apply the k-means algorithm with 2 clusters (as chosen above)
kmeans = KMeans(n_clusters=2, random_state=101)
# Fit on the reduced data and show the first 5 cluster assignments
identified_clusters = kmeans.fit_predict(x)
identified_clusters[0:5]
# plot Clusters
plt.scatter(reduced_data["Dimension 1"], reduced_data["Dimension 2"], c=kmeans.labels_)
plt.show()
Instead of hard-assigning data points to a cluster, if we are uncertain about where the data points belong or to which group, we can use this method. It uses the probability of a sample belonging to each cluster to determine soft assignments.
Advantages of Gaussian Mixture Model clustering:
- Soft assignment: each point receives a probability of belonging to each cluster;
- Clusters can be elliptical rather than strictly spherical, as each component has its own covariance.
Disadvantages:
- The number of components must still be specified in advance;
- More parameters to estimate, so it can converge slowly or to a poor local optimum.
Given the soft assignment of points, it may be better suited to our wholesale customer data (i.e., there may be a mixed-membership problem in our dataset where there is no clear demarcation), so Gaussian Mixture Model clustering might be the better option over the other two algorithms.
A problem with hard assignment is that there might be some hidden patterns in the data that we may miss by assigning only one cluster to each data point. For example, let’s take the case of the Supermarket customer in our sample: while doing PCA, it had similar and high positive weights for multiple dimensions, i.e. it didn’t belong to one dimension over the other. So a supermarket may be a combination of a fresh produce store/grocery store/frozen goods store.
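A quick sketch of this soft assignment on synthetic data (not our customer data): `GaussianMixture.predict_proba` returns, for every point, a probability of membership in each component, and these probabilities sum to 1 per point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(101)
# Two overlapping synthetic customer groups
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=101).fit(X)

# Soft assignment: one membership probability per component, summing to 1
probs = gmm.predict_proba(X[:3])
print(probs.round(3))
```

Points near the boundary between the groups get probabilities close to 0.5/0.5 rather than a forced hard label.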
As with the K-Means algorithm, the number of clusters is not known a priori, and there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data, if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient.
Again, (as we previously mentioned under K-Means) the silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.
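For a single point, the silhouette coefficient is $s = (b - a)/\max(a, b)$, where $a$ is the mean distance to the other points in its own cluster and $b$ is the mean distance to the points of the nearest other cluster. A tiny worked check (toy numbers) against scikit-learn:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data in two clusters
X = np.array([[0.0], [0.5], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])

# Manual silhouette for point 0:
a = np.mean([0.5])        # distance to the other cluster-0 point
b = np.mean([4.0, 5.0])   # distances to the cluster-1 points
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X, labels)[0]
print(s_manual, s_sklearn)
```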
# import modules
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
# Create range of clusters
n_clusters = list(range(2,11))
print("Different Number of Clusters (K) Tried: ", n_clusters)
# for each clusters k, generate silhouette score
for n in n_clusters:
    # Apply the GMM algorithm to the reduced data
    clusterer = GaussianMixture(n_components=n).fit(reduced_data)
    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)
    # Find the cluster centers
    centers = clusterer.means_
    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)
    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data, preds, metric="mahalanobis")
    print("The silhouette_score for {} clusters is {}".format(n, score))
Note that the Silhouette Coefficient is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample. It therefore makes sense to use the same distance metric here as the one used in the clustering algorithm: Euclidean for K-Means and Mahalanobis for a general GMM.
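As a side note, the Mahalanobis metric is defined relative to an inverse covariance matrix, and it can be worth passing that matrix (the VI keyword, forwarded by scikit-learn to SciPy's distance routines) explicitly rather than relying on it being inferred from the data. A hedged sketch on synthetic data, comparing the Euclidean and Mahalanobis silhouettes:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
# Two well-separated synthetic blobs (stand-ins for the reduced data)
X = np.vstack([rng.normal(-2, 0.5, (75, 2)),
               rng.normal(2, 0.5, (75, 2))])

labels = GaussianMixture(n_components=2, random_state=42).fit(X).predict(X)

# The inverse covariance of the data defines the Mahalanobis metric
VI = np.linalg.inv(np.cov(X, rowvar=False))

score_euc = silhouette_score(X, labels, metric="euclidean")
score_mah = silhouette_score(X, labels, metric="mahalanobis", VI=VI)
print("Euclidean silhouette:   {:.3f}".format(score_euc))
print("Mahalanobis silhouette: {:.3f}".format(score_mah))
```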
Of the several cluster numbers tried, 2 clusters had the best silhouette score. Similarly, when we use KMeans, the best score is also obtained with the same number of clusters. Let's visualise this clustering.
# Apply the GMM algorithm to the reduced data
clusterer = GaussianMixture(n_components=2).fit(reduced_data)
# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)
# Find the cluster centers
centers = clusterer.means_
# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)
Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster’s center point corresponds to the average customer of that segment.
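This "center as average customer" reading can be checked directly: for well-separated clusters, a fitted GMM's means_ sit very close to the plain averages of the points assigned to each cluster. An illustrative sketch on synthetic data (not the wholesale dataset):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(3)
# Two well-separated synthetic blobs
X = np.vstack([rng.normal(-3, 0.5, (80, 2)),
               rng.normal(3, 0.5, (80, 2))])

gmm = GaussianMixture(n_components=2, random_state=3).fit(X)
preds = gmm.predict(X)

# Compare each fitted center with the empirical mean of its cluster
for k in range(2):
    empirical_mean = X[preds == k].mean(axis=0)
    print("Cluster", k,
          "GMM mean:", np.round(gmm.means_[k], 2),
          "empirical mean:", np.round(empirical_mean, 2))
```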
Recall that we earlier reduced the dimensionality of the original data and scaled it by a logarithm. We can recover representative customer spending from these center points by applying the inverse transformations, which we attempt below.
# Inverse transform the centers
log_centers = pca.inverse_transform(centers)
# Exponentiate the centers
true_centers = np.exp(log_centers)
# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
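The round trip above (log-transform, PCA, then inverse_transform followed by exponentiation) can be illustrated end-to-end on toy data. This is a hedged sketch with fake log-normal "spends", not the wholesale dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
# Fake annual spends for 100 customers across 6 categories
spend = rng.lognormal(mean=8, sigma=1, size=(100, 6))

# Forward transformations: log scaling, then PCA to 2 dimensions
log_spend = np.log(spend)
pca = PCA(n_components=2).fit(log_spend)
reduced = pca.transform(log_spend)

# A "center" in the reduced space, e.g. the mean of a cluster
center = reduced.mean(axis=0, keepdims=True)

# Inverse transformations: undo PCA, then undo the log scaling
log_center = pca.inverse_transform(center)
true_center = np.exp(log_center)
print(np.round(true_center))
```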
If we consider the summary statistics for the product categories and the total purchase cost of each product category for the representative data points above, we can perhaps try to deduce what set of establishments could each of the customer segments represent.
Taking an educated guess:
Segment 0: best represents supermarkets. They spend a higher-than-median amount on Milk, Grocery, Detergents_Paper and Deli, all of which are essential stock in such places.
Segment 1: best represents restaurants. Their spending on Fresh and Frozen is higher than the median, and lower than, but still close to, the median on Deli. Their spending on Milk, Grocery and Detergents_Paper is lower than the median, which adds to our assessment.
Let's find which cluster each sample point is predicted to belong to. The sample points were the sample customers we drew from the original data (indices 26, 176, 392).
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
# Purchases from samples
print(samples)
It is evident that sample point 0 (customer 0) belongs to cluster 0 (segment 0), where spending on "Milk", "Grocery" and "Detergents_Paper" is high.
When we were drawing up guesses as to what establishments the sample points might represent, we had decided on restaurant, supermarket and café for customers (sample points) 0, 1 and 2 respectively.
Our judgement could be correct for sample points 0 and 2, while incorrect, or rather inconsistent with our predictions, for sample point 1. Looking at the visualisation of our clusters in the previous section, it could be that sample 1 is the point close to the boundary of both clusters.
Note that we had decided to drop the Channel and Region features early on in this project. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when the same PCA dimensionality reduction is applied to the original dataset. Each data point is labelled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' in the reduced space. In addition, the sample points are circled in the plot, which identifies their labelling.
# Display the clustering results based on 'Channel' data
dup_outliers = [128, 65, 66, 75, 154]
vs.channel_results(reduced_data, dup_outliers, pca_samples)
The underlying distribution of Hotel/Restaurant/Cafe customers versus Retailer customers is shown above. The number of clusters we chose is consistent with this underlying distribution of 2 major clusters, hence the clustering algorithm did well. There are customer segments that would be purely classified as "Retailers" or "Hotels/Restaurants/Cafes" on the extreme left and right respectively. This underlying classification is consistent with our observation that cluster 0 customers are typically supermarkets (retailers) and cluster 1 customers are typically restaurants and cafés.
We can now discuss ways in which we could make use of the clustered data. That is, we can consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme.
Given that each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), we can treat the customer segment as an engineered feature of the data. Such feature engineering has many benefits.
More simply, assume the wholesale distributor recently acquired new customers, each of whom provided estimates of anticipated annual spending in each product category. Knowing these estimates, the wholesale distributor can classify each new customer into a customer segment to determine the most appropriate delivery service. This can be done in a supervised learning fashion: to label the new customers, the distributor first builds and trains a supervised learner on the data we labelled through clustering. The features to fit are the estimated spends, and the target variable is the customer segment, i.e. 0 or 1 (grocery store or restaurant). The classifier can then be used to predict segments for new incoming data.
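The supervised step described above can be sketched as follows. This is a hedged illustration on synthetic stand-ins: the fake spends and the rule generating the labels are assumptions for demonstration only; in practice the features would be the real estimated spends and the targets the segments produced by the clustering.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(7)
# Fake annual spends for 400 existing customers across 6 categories
X = rng.lognormal(mean=8, sigma=1, size=(400, 6))
# Stand-in segment labels (in practice, the labels from the clustering)
y = (X[:, 1] + X[:, 2] > np.median(X[:, 1] + X[:, 2])).astype(int)

# Hold out some customers to check that the segments are learnable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7)

clf = RandomForestClassifier(random_state=7).fit(X_train, y_train)
print("Held-out accuracy: {:.2f}".format(clf.score(X_test, y_test)))

# New customers' estimated spends -> predicted segment -> delivery scheme
new_customers = rng.lognormal(mean=8, sigma=1, size=(3, 6))
print("Predicted segments:", clf.predict(new_customers))
```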