Oct 26, 2021

How to Perform Simple & Effective Customer Segmentation | A Walkthrough with Data from a Delicatessen

Andy Clarke

Customer segmentation involves splitting a customer base into distinct groups. These customer segments are defined by specific and shared characteristics, behaviours or preferences that help businesses to spot patterns and associate customers with one another.

Market segmentation, also known as customer segmentation, is a powerful data science technique designed to expose hidden patterns in customer data. It is only as useful as the data that drives it.

This article walks through the steps involved in a simple customer segmentation analysis. Using sales data from a delicatessen, we'll segment customers according to their buying preferences and behaviour. To achieve this, we'll use a powerful machine learning technique known as clustering.

Then, we'll use our analysis to spot patterns between a customer's buying preferences and other characteristics including demographic information, income and education level.

What Kind of Data Do I Need To Perform Customer Segmentation?

The question of data is one of the most important and common questions surrounding customer segmentation analysis. A significant number of popular demonstrations of the technique involve grouping customers based on their buying preferences - something which would require sales data. In truth, customer segmentation can be completed with any kind of data that contains information about a customer base.

Some forms of psychographic segmentation are effectively carried out using social media data - like Twitter posts. Geographical segmentations are best carried out with latitude and longitude variables as data can be visualised using maps and more easily enriched with external APIs. Behavioural segmentation will typically require variables that record actions taken by customers or users.

3 Tips on Working With Customer Segmentation Datasets

  • The data should not be aggregated. Instead, each row should contain information about one customer.
  • The variables you have should define the insights you set out to obtain.
  • For clustering analysis - remember the variables you choose to use as factors will define how your customers are grouped into clusters.
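If your raw data is a log of individual purchases rather than a table of customers, it needs reshaping before it fits the first tip. The sketch below shows one way to do that with plain Python. The field names (`customer_id`, `category`, `amount`, `channel`) and the values are hypothetical and are not taken from the delicatessen dataset.

```python
from collections import defaultdict

# Hypothetical transaction log: one row per purchase, not per customer.
transactions = [
    {"customer_id": 1, "category": "wine", "amount": 40.0, "channel": "web"},
    {"customer_id": 1, "category": "meat", "amount": 15.0, "channel": "store"},
    {"customer_id": 2, "category": "wine", "amount": 8.0, "channel": "store"},
    {"customer_id": 2, "category": "fruit", "amount": 5.0, "channel": "store"},
    {"customer_id": 1, "category": "wine", "amount": 60.0, "channel": "web"},
]


def one_row_per_customer(rows):
    """Collapse purchase rows into a single feature row for each customer."""
    customers = defaultdict(lambda: defaultdict(float))
    for row in rows:
        features = customers[row["customer_id"]]
        # Total spend per product category.
        features["spent_" + row["category"]] += row["amount"]
        # Number of purchases per channel.
        features[row["channel"] + "_purchases"] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {cid: dict(features) for cid, features in customers.items()}


customer_table = one_row_per_customer(transactions)
for customer_id, features in sorted(customer_table.items()):
    print(customer_id, features)
```

Each key of `customer_table` is now one customer, and the values are the kind of per-customer variables that can later be used as factors.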

Customer Segmentation | An Example Dataset

Useful data about a customer base is hard to come by. Why? Because businesses aren't legally able to hand out data on their customers. Moreover, why would they want to? Anonymized datasets solve the first of these obstacles but even still - these are few and far between.

The data we are working with today is one of the best examples of a customer segmentation dataset that we've seen. Shown in the table below, this data includes information about the customers of a delicatessen and details their income, marital status, education level alongside the type of products they purchase and how they have responded to marketing campaigns.

Although businesses will often be working with internal datasets, it can be useful to practise and prepare with model datasets. Here are the best customer segmentation datasets known to our team ...

  • This dataset | Kaggle link here.
  • 1000 Supermarket Transactions with Customer Data | Kaggle link here.
  • Telco Customer Churn Dataset | Kaggle link here.

Preparing the Analysis

Clustering works by grouping your data points based on the similarity of their features. Therefore, clustering our customer data is going to group similar customers based on specific features. Here's the key - the features that are chosen will define how our customers are grouped. That's why effective customer segmentation starts from a solid understanding of the data.
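To make "similarity of features" concrete, here is a minimal sketch that measures how close customers are to one another. It assumes two made-up features, rescales them so that neither dominates, and uses Euclidean distance. Both the rescaling and the distance measure are illustrative assumptions and not a description of what Graphext does internally.

```python
import math
import statistics

# Hypothetical customers: (amount spent on wine, number of web purchases).
customers = {
    "A": (500.0, 8.0),
    "B": (480.0, 7.0),
    "C": (20.0, 1.0),
    "D": (35.0, 2.0),
}


def standardise(rows):
    """Rescale every feature to mean 0 and standard deviation 1."""
    columns = list(zip(*rows.values()))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(values, means, stdevs))
        for name, values in rows.items()
    }


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.dist(a, b)


scaled = standardise(customers)
print("A-B:", round(distance(scaled["A"], scaled["B"]), 3))
print("A-C:", round(distance(scaled["A"], scaled["C"]), 3))
```

Customers A and B end up much closer to each other than A and C, so a clustering algorithm working on these two features would tend to group A with B. Swap in different features and the grouping changes, which is why the choice of features matters so much.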

Here are the features of the data we'll use to cluster our customers ...

Number of days since customer's last purchase

Amount Spent on Wine
Amount spent on wine in last 2 years

Amount Spent on Fruit
Amount spent on fruit in last 2 years

Amount Spent on Meat
Amount spent on meat in last 2 years

Amount Spent on Fish
Amount spent on fish in last 2 years

Amount Spent on Sweets
Amount spent on sweets in last 2 years

Discounted Purchases
Number of purchases made with a discount

Web Purchases
Number of purchases made through the company’s web site

Catalogue Purchases
Number of purchases made using a catalogue

Store Purchases
Number of purchases made directly in stores

Web Visits
Number of visits to company’s web site in the last month

The features here paint a strong picture of a customer's buying preferences. We have information on what types of products they purchase, how they make their purchases and how often they buy at a discount.

Building a Clustering Analysis with Graphext

We'll build our clustering model with Graphext, choosing UMAP as our clustering algorithm and using the project setup wizard to select the data features that our clusters are calculated from.

The process takes around 5 minutes. Here's how it works ...

  1. Upload the data to Graphext.
  2. Click on the dataset once it has uploaded.
  3. Choose Models as your analysis type and then choose Cluster.
  4. Inside the Select Target And Factors tab - select your data features from the list of Other Variables and move them to the list of Factors.
  5. Click Next and name your project.
  6. Click Execute.
  7. That's it. Once the project is ready ... Open it up.

Performing the Analysis

There is no gold standard method for inspecting a customer segmentation project. The avenues of investigation are typically determined by the insights that business teams want to find. Setting aims helps to direct progress.

Our Aims

  1. To find a small customer segment that is the most likely to respond positively to marketing campaigns about wine.
  2. To determine which features of the data (income, education ...) best define this customer segment.

In other words, our analysis should aim to find the customers most likely to purchase wine after being targeted with a discount offer and to find out which characteristics these customers share with each other.

The Usefulness of Clusters

Clustering algorithms are especially useful for performing segmentation analysis because they add a categorical variable to your data - labelling each row with the cluster that it belongs to. With our clustering project built and our data ready to explore in a network visualisation, the UMAP cluster variable is a great place to start.
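As a rough illustration of why that extra categorical variable is so handy, the sketch below treats the cluster label like any other category and counts customers per cluster. The rows, field names and cluster numbers are hypothetical.

```python
from collections import Counter

# Hypothetical rows after clustering: the original features plus a new
# "cluster" label. Names and values are made up for illustration.
customers = [
    {"customer_id": 1, "spent_wine": 520, "cluster": 15},
    {"customer_id": 2, "spent_wine": 12, "cluster": 3},
    {"customer_id": 3, "spent_wine": 610, "cluster": 15},
    {"customer_id": 4, "spent_wine": 45, "cluster": 3},
    {"customer_id": 5, "spent_wine": 300, "cluster": 18},
]

# Because the label is categorical, it can be counted and grouped on
# like any other category.
cluster_sizes = Counter(row["cluster"] for row in customers)

for cluster, size in cluster_sizes.most_common():
    print(f"cluster {cluster}: {size} customers")

# Size of the biggest cluster, the same kind of figure quoted for the deli data.
largest_size = cluster_sizes.most_common(1)[0][1]
```

The same label can be used to filter, compare or colour the data, which is what the rest of the analysis relies on.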

Currently, we have 36 clusters in a dataset of 2240 customers. Our biggest cluster has 120 customers and our smallest has just 28.

Network visualisations are well suited to exploring clustering and segmentation projects because they express proximity by positioning similar nodes close together on a graph.

For a simple customer segmentation analysis, we have quite a lot of clusters here. Configuring the resolution (strength) of our clusters rewrites the cluster boundaries so that they are either broken up further or joined together.

Merging our clusters slightly, so that we have just 21, is more appropriate for the size of the customer base we are analyzing.

Finding Our Key Customer Segment

To move the analysis forward, we need to expose customer segments that spend more money on wine than others. On top of this, we need to make sure that the customers we are picking out are also responsive to discount offers.

Using a filter query we can use these conditions to pick out our key cluster of customers; those with a high amount spent on wine combined with a good discount response rate.

Here's how we'll do it ...

  1. Find the Amount Spent on Wine variable.
  2. Filter the data so that only the upper quartile of Amount Spent on Wine is selected.

Sanity Check - You should have 25% of the data active.

  1. Find the Discounted Purchases variable.
  2. Keeping your current filter active, filter the data again so that your active data includes only those in the upper quartile range of Discounted Purchases.

Sanity Check - You should have 7% of the data active.
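For readers who want to reproduce the two filters outside Graphext, here is a minimal sketch using Python's `statistics.quantiles`. The customer tuples are made up, and the sketch assumes that both quartile thresholds are calculated on the full dataset rather than on the already filtered rows.

```python
import statistics

# Hypothetical customers: (customer_id, amount spent on wine, discounted purchases).
customers = [
    (1, 10, 1), (2, 25, 2), (3, 40, 1), (4, 60, 3),
    (5, 90, 2), (6, 150, 4), (7, 300, 1), (8, 450, 6),
    (9, 600, 2), (10, 800, 7), (11, 950, 5), (12, 1200, 3),
]


def upper_quartile_threshold(values):
    """Return the third quartile (75th percentile) of the values."""
    return statistics.quantiles(values, n=4, method="inclusive")[2]


# Assumption: thresholds come from the whole dataset, not the active selection.
wine_q3 = upper_quartile_threshold([c[1] for c in customers])
discount_q3 = upper_quartile_threshold([c[2] for c in customers])

# First filter: top quartile of wine spend.
wine_top = [c for c in customers if c[1] >= wine_q3]
# Second filter, applied on top of the first: top quartile of discounted purchases.
key_segment = [c for c in wine_top if c[2] >= discount_q3]

print(f"After wine filter: {len(wine_top) / len(customers):.0%} of customers active")
print(f"After discount filter: {len(key_segment) / len(customers):.0%} of customers active")
```

On real data the first share may not land on exactly 25% when many customers have identical values, and the second share depends on how strongly wine spend and discount use go together.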

That's it.

We now have access to the customers that best match the aims that we set out with. These are the people that spend more on wine and respond positively to discount deals. The deli could already start to target them with specific campaigns ...

Now, we can start to find out more about who these people are.

Useful To Do ...

Reorder the UMAP Cluster variable so that data is sorted by TF-IDF. This will help to reveal the clusters with a greater presence of your key customer characteristics. You should see that your key customers are most represented in Clusters 15, 18 and 1.
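The exact TF-IDF weighting that Graphext applies is not described here. As a simple stand-in, the sketch below ranks clusters by how over-represented key customers are in each one compared with the dataset as a whole, a ratio often called lift. This is a different, simpler measure than TF-IDF, and the cluster labels and flags are hypothetical.

```python
from collections import Counter

# Hypothetical (cluster label, is_key_customer) pairs, one per customer.
rows = [
    (15, True), (15, True), (15, False),
    (18, True), (18, False), (18, False),
    (3, False), (3, False), (3, False), (3, False),
]


def rank_clusters_by_lift(rows):
    """Sort clusters by how over-represented key customers are in each one.

    Assumes at least one key customer exists, otherwise the overall
    share is zero and the ratio is undefined.
    """
    cluster_sizes = Counter(cluster for cluster, _ in rows)
    key_counts = Counter(cluster for cluster, is_key in rows if is_key)
    overall_share = sum(key_counts.values()) / len(rows)
    lift = {
        cluster: (key_counts[cluster] / size) / overall_share
        for cluster, size in cluster_sizes.items()
    }
    return sorted(lift.items(), key=lambda item: item[1], reverse=True)


ranking = rank_clusters_by_lift(rows)
for cluster, value in ranking:
    print(f"cluster {cluster}: lift {value:.2f}")
```

A lift above 1 means key customers make up a larger share of that cluster than of the whole customer base.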

Associations between customers in clusters are meaningful. As this customer base grows, these 3 clusters will become more populated with customers that share characteristics with this key segment.

The Characteristics of Customers in Our Key Cluster

With the delicatessen dataset filtered to present customers inside our key cluster, we can start to compare the distribution of values for this customer segment vs the entire dataset.

This relative comparison is going to start to expose the demographic features of our key customers; those who spend more on wine and are typically engaged with marketing campaigns.

Learning about demographic features including income and education level would be hugely beneficial to the business team at the delicatessen. Spotting meaningful relationships between a customer's characteristics and their buying preferences has incredible potential to enrich decision making for the deli's business team.

Comparing the values in a key segment with the value distribution for the entire dataset is an incredibly useful way to understand who a business's most valuable customers are.
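A minimal sketch of this kind of relative comparison is shown below for a single categorical variable. The education labels and counts are invented for illustration and do not reproduce the deli's actual figures.

```python
from collections import Counter

# Hypothetical education levels; values are illustrative, not the deli's data.
all_customers = ["Graduation"] * 50 + ["Master"] * 20 + ["PhD"] * 20 + ["Basic"] * 10
key_segment = ["Graduation"] * 4 + ["Master"] * 2 + ["PhD"] * 4


def shares(values):
    """Return the share of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def compare_distributions(segment, population):
    """Return {category: (segment share, overall share, ratio)}."""
    segment_shares = shares(segment)
    comparison = {}
    for category, overall in shares(population).items():
        in_segment = segment_shares.get(category, 0.0)
        comparison[category] = (in_segment, overall, in_segment / overall)
    return comparison


comparison = compare_distributions(key_segment, all_customers)
for category, (seg, overall, ratio) in sorted(comparison.items()):
    print(f"{category:<12} segment={seg:.0%} overall={overall:.0%} ratio={ratio:.2f}")
```

A ratio above 1 marks a category that is over-represented in the segment, and a ratio below 1 marks one that is under-represented.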


Customers in our key segment tend to be born between 1950 and 1980. This makes them between 41 and 71 years old at the time of writing, so the business team at the deli could immediately start to draw conclusions about the interests and attitudes of these people and use that understanding creatively as part of marketing campaigns or design choices.


People with PhDs are overrepresented amongst this key customer segment. What's more, the trend extends, more marginally, to people educated to postgraduate level.

This helps to confirm our suspicion that customers in this segment are more likely to have higher levels of education. The insight could be used to direct an email marketing campaign.


The income of customers in our key segment is approximately normally distributed between 40,000 and 80,000. Specifically targeting discount deals on wine at customers with an income of around 60,000 would therefore have a good chance of success.

Marital Status

The data suggests that divorced people are overrepresented amongst customers in our key segment.

Useful Resources on Customer Segmentation

A Beginners Guide to Market Segmentation: Types, Techniques & Examples to Better Understand Your Customer Base (with Data)

Customer Analytics at Graphext

Clustering Supermarket Transactions | Use Case Guide

With the level of detail we've reached through the demographic insights about our key customer segment, the action taken by the delicatessen to communicate wine discount deals would be much more informed. Insights derived from this kind of customer segmentation analysis - no matter how simple they are - will help to direct and inform marketing action and enrich business intelligence.


The Data

Sales & Customer Data from a Delicatessen

Explore Yourself

🍷 Delicatessen | Customer Segmentation
