What customers say matters. Digital markets are thriving but are becoming saturated with competition. It takes more than a fancy website nowadays to convince people to buy your stuff. Trust is becoming increasingly important. As a customer, how do I know that you aren't ripping me off with a cheap product that will quickly fall apart? How can I be sure that the hotel room you are selling me hasn't been photoshopped to high heaven?

Reviews. Customers want to know how other customers feel about your products and services. Are they all you hype them up to be or does the feedback say otherwise? Customer reviews are a fantastic way of persuading people that you are a business worth buying from. But as the scale of feedback starts to stack up, the challenge becomes monitoring, organising and responding appropriately to what people are saying.


Analyzing the content of text with Natural Language Processing (NLP) and Machine Learning lets us efficiently process huge quantities of text. We can use NLP to extract topics, adjectives, places, sentiment and more language features from reviews. Businesses fighting to understand feedback from hundreds or thousands of customer reviews can analyze what people are saying using techniques like text clustering or topic modelling to gain insights that no human could realistically achieve.

But it takes time and resources to do so. Programming NLP algorithms to analyze customer reviews is not for the faint-hearted. Graphext's text analysis flows allow people to command and execute powerful data science algorithms without having to write any code. Depending on the size of your dataset, it takes around 10 minutes to go from raw data to meaningful insights about your customer feedback.

About This Guide

This guide walks you through the process of analyzing customer reviews with Graphext. We will analyze a dataset of 42,656 reviews about 3 Disneyland branches. Because we want to model the topics of the reviews, we will choose Text → Topics as our analysis type and focus on analyzing the content of the reviews. We'll extract information from the reviews including significant terms, adjectives and nouns.

Since the analysis will model the topic of customer reviews about Disneyland, our project will cluster reviews according to their topics. We'll investigate the topics that are most common in reviews across the 3 theme park branches, then we'll use the language features extracted by Graphext to understand:

  • How people reviewed aspects of their experience at Disneyland.
  • Why people felt negatively about a topic.
  • How opinions differed between branches.

Download the dataset here.

Step 1. Building the Project

Uploading the dataset

First, we need to upload the dataset containing 42,656 reviews about 3 Disneyland branches: California, Hong Kong and Paris. Make sure you have access to the dataset on your local computer. Then, start from the Datasets panel of your Personal team in Graphext. Select New Dataset and either browse for the file on your computer or drag and drop it into the Adding Dataset window.

Graphext should process your upload within a couple of minutes. Once it's ready, select it from the Datasets panel to start inspecting it. You should see a dataset with 42,656 rows and 6 columns: Review_ID, Rating, Year_Month, Reviewer_Location, Review_Text and Branch.
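If you want to sanity-check the file before uploading it, the row count and column names can be confirmed outside Graphext. The following is a minimal Python sketch and not part of the Graphext workflow. The summarize_csv helper is hypothetical and the two sample rows are invented for illustration; only the column names come from the dataset described above.

```python
import csv
import io

# Column names as listed in this guide.
EXPECTED_COLUMNS = ["Review_ID", "Rating", "Year_Month",
                    "Reviewer_Location", "Review_Text", "Branch"]

def summarize_csv(handle):
    """Return (row_count, column_names) for an open CSV file handle."""
    reader = csv.DictReader(handle)
    row_count = sum(1 for _ in reader)
    return row_count, reader.fieldnames

# Invented stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO(
    "Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch\n"
    "1,4,2019-4,Australia,\"Great day, long queues\",Disneyland_HongKong\n"
    "2,2,2018-7,France,Food was expensive,Disneyland_Paris\n"
)
rows, columns = summarize_csv(sample)
print(rows, columns == EXPECTED_COLUMNS)
```

To run the check on the real download, pass summarize_csv an open handle to your local copy of the file instead of the in-memory sample.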

Setting Up the Project

To analyze reviews, we are most interested in the Review_Text column containing the content of the customer reviews.

Start building the project by selecting Text as your analysis type from the project setup wizard in your right sidebar. Then, choose Topics. The wizard will now have set up your project using Graphext's default configuration for this flow. Let's review it before executing the project so that we know exactly what to expect when we start exploring.

Select Edit from inside the Configuration tab. Now, from inside the project setup wizard you should see 3 tabs: Data Enrichment, Data Extraction and Text.

Open up the Data Enrichment tab. Graphext has calculated that there is a column representing date values and has automatically selected to enrich the dataset with more specific columns for Hour, Month and Weekday name. This seems useful. We don't need any other data enrichment so close this tab.

Open up the Data Extraction tab. This is where Graphext decides which information to extract from text in your dataset. Notice that we are already extracting a range of additional variables from the Review_Text column such as significant terms, sentiment and keywords. Additionally, the last question in this tab refers to language features that Graphext will extract whilst building your project. Click inside of the keywords dropdown and add Verbs to the list of keyword types.

Open up the Text tab. Here, notice that Graphext has already configured a step regarding the length of text in your data. Leave the configuration inside the Text tab as it is.


Take a look at the top right of your project setup wizard and notice that we have a statistic stating that our project will use 30k rows of our dataset. This is a large sample but it would be better to work with the full dataset.

Click on the text that reads 30k rows and change the option presented here so that it reads 100%.

Now click Next, name your project something like Analyzing Disneyland Reviews and execute the project. It should take Graphext around 10 minutes to build the project.

Step 2. Exploring the Project

Once your project is ready, you will be able to open it from inside of the Project panel belonging to your Personal team. Click on the Analyzing Disneyland Reviews project card to open it up.


The first thing you will see is your project's Graph containing all 42,656 reviews in your dataset, each represented by a node. These nodes have been clustered together according to the content of the review. When executing your project, Graphext modelled the topic of each review and linked similar reviews together to form clusters, which currently determine the color of the nodes in your Graph.
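To build some intuition for "linking similar reviews together to form clusters", here is a deliberately simplistic Python sketch. It is not Graphext's method, which this guide does not describe. It assumes word-overlap (Jaccard) similarity, a hypothetical threshold of 0.25 and four invented reviews, then groups any reviews that are linked directly or through other reviews.

```python
def jaccard(a, b):
    """Share of distinct words that two word sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_reviews(reviews, threshold=0.25):
    """Group reviews whose word overlap meets the threshold.

    Returns a list of clusters, each a sorted list of review indices.
    """
    words = [set(text.lower().split()) for text in reviews]
    parent = list(range(len(reviews)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of sufficiently similar reviews.
    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):
            if jaccard(words[i], words[j]) >= threshold:
                parent[find(i)] = find(j)

    # Collect linked reviews into clusters.
    groups = {}
    for i in range(len(reviews)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Invented reviews, for illustration only.
reviews = [
    "hotel breakfast was great",
    "great breakfast at the hotel",
    "long queue for every ride",
    "queue for the ride was long",
]
clusters = cluster_reviews(reviews)
print(clusters)
```

In this toy example the two hotel reviews end up in one group and the two queue reviews in another, which mirrors the idea of each colored cluster in the Graph standing for a topic.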

Let's have a look at these clusters. Find the Cluster variable from inside your left sidebar and notice that the color of bars inside of the variable chart corresponds to the colors of clusters inside your Graph. These clusters represent the topics Graphext was able to discern inside the reviews. For instance, the blue cluster seems to represent reviews about hospitality in some sense.


Now turn your attention to your right sidebar where most of your variables are stored. Halfway down the sidebar you have an icon to expand the sidebar. Click on this icon to spread your variable charts over the whole screen.

Browse through the variables and take notice of the additional variables that Graphext has added during the course of building your project. On top of the existing variables in the data, we now have access to all of the extracted information on the language features of the reviews.

Step 3. Inspecting Clusters

It's time to dive into our investigation. Let's attempt to understand a little more about reviews inside of the blue cluster. Select the blue cluster bar from inside of your Cluster variable chart. Your Graph will now only present values inside of this cluster. Now change your Absolute / Relative dropdown so that it reads Relative - we want the values in our other charts to represent values relative to our selection.

Relative representation changes the values inside your other variable charts so that they display their value dynamically and in proportion to any selection that you make.
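One way to picture relative representation is as a comparison of two proportions: the share of each value inside your selection and its share across the whole dataset. The Python sketch below illustrates that reading with invented data. The shares helper and the toy cluster labels are assumptions made for the example, and Graphext's own calculation may differ.

```python
from collections import Counter

def shares(values):
    """Map each distinct value to its fraction of the list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Invented (cluster, rating) pairs, for illustration only.
reviews = [("blue", 4), ("blue", 4), ("blue", 5), ("red", 5),
           ("red", 5), ("red", 5), ("red", 3), ("blue", 3)]

all_ratings = [rating for _, rating in reviews]
selected = [rating for cluster, rating in reviews if cluster == "blue"]

overall = shares(all_ratings)   # proportions across the whole dataset
within = shares(selected)       # proportions inside the selection
for rating in sorted(overall):
    print(rating,
          f"selection {within.get(rating, 0):.0%}",
          f"overall {overall[rating]:.0%}")
```

Here 4 star ratings make up a larger share of the selection than of the dataset as a whole, which is the kind of difference the Relative setting is designed to surface.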

Hospitality Reviews: Significant Terms

With the blue cluster selected we can start to explore the features of this cluster using our other variable sidebar charts. Remaining inside of the left sidebar, take a look at the second variable card; Significant Terms in Review_Text. The words inside of this list have updated so that they represent the most significant terms extracted in relation to this cluster.

Since this list is sorted by TF-IDF, which favors terms that are frequent in this cluster but comparatively rare elsewhere, you should see that terms like meal, breakfast, drink, hotel and food are the most distinctive of reviews belonging to this cluster. It would already be fair to assume that these reviews are generally about hospitality.
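For readers unfamiliar with TF-IDF, the Python sketch below shows the textbook form of the score: term frequency within a document multiplied by the logarithm of the inverse document frequency. It is an illustration only. How Graphext tokenizes text or weights terms is not covered in this guide, and the three toy "clusters" are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / number of terms in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Invented text, with each cluster treated as a single document.
clusters = [
    "the hotel breakfast was good and the food was good",
    "the queue for the ride was long",
    "the staff at the park was friendly",
]
hospitality = tf_idf(clusters)[0]
top_terms = sorted(hospitality, key=hospitality.get, reverse=True)[:3]
print(top_terms)
```

Note that a word such as "the" appears in every document, so its score is zero even though it is the most frequent word. That is why a TF-IDF ranking highlights topic words rather than merely common ones.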

Renaming a Cluster

Since so many words related to hospitality are included as significant terms inside this cluster, it makes sense to rename this cluster so that it is easier to determine the topic at a glance. Click on the edit icon from inside of the Cluster variable chart. Then, click on the blue cluster bar to select it and select the Rename icon. Inside of the text box that appears, rename your cluster Hospitality and click OK followed by the save icon from the top right of the Cluster variable chart. Fantastic. That is much clearer.

Hospitality Reviews: Investigating Features

Reselect the newly named Hospitality cluster and turn your attention to the right sidebar. Here, the other 27 variables inside the dataset are showing their values in relation to reviews in the Hospitality cluster.

Browse through the variable charts. You can see from the Rating variable chart that there are relatively more 4 star reviews inside of this cluster and relatively fewer 5 star reviews.

Use the search bar at the top of your left variable sidebar to find the Branch variable chart. There is something interesting going on here. It seems like there were many more reviews about hospitality in relation to Disneyland Paris than there were in relation to Disneyland California. This could be for many reasons but is certainly something that would interest the park owners. Let's save it as an insight.

Hospitality Reviews: Distribution amongst Branches

Click on the three dots on the top right of the Branch variable chart to bring up the More Options menu list. Now select Save as Insight from the menu list. Name your insight something like There are far more reviews about hospitality in Paris compared with California and click Save. Now head over to the project's Insights panel to check it over. Excellent. Our insight shows exactly the finding that we've made.

Hospitality Reviews: Language Features

Navigate back to the project's Graph and find the Branch variable card once more. Make sure your Graph is still only representing reviews inside of the Hospitality cluster. Add another filter by clicking on the bar representing Disneyland_Paris from inside of the Branch variable chart so that our Graph only represents Hospitality reviews about Disneyland_Paris. With both of these filters active, we can start investigating the language features of this specific segment of reviews.

Use the search bar again to find the Adjectives variable chart from inside of your right sidebar. This variable reveals the adjectives that were used in relation to Hospitality reviews about Disneyland_Paris. Change the sorting of this variable list so that it shows the values most frequently appearing inside the selection. You can do this by clicking on the sorting icon from the top of the variable chart.

It seems like there are some interesting findings here too! Alongside words we might expect to find such as good and great, we also can see that expensive, long, small and fast were adjectives very frequently used in relation to reviews about Hospitality at Disneyland Paris. Save this as an insight and name it something like The Adjectives used to Describe Hospitality at Disneyland Paris.

Step 4. Understanding Bad Reviews

Make sure you've cleared all of your filters so that your Graph represents 100% of your dataset. Let's move on to inspect the features of bad reviews. We can do this by selecting reviews rated as either 1 star or 2 stars using the Rating variable.

Bad Reviews: Topics

The first thing we need to do is isolate the bad reviews. Find the Rating variable chart from inside of your right sidebar and drag your cursor over the bars to select the reviews that were rated as either 1 or 2 stars. Your Graph should now represent roughly 9% of the dataset, or 3,626 reviews. Check that your Absolute / Relative dropdown is still set to Relative. With this filter active we can start to investigate the features of bad reviews.
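The selection above boils down to a simple filter and a percentage. As a minimal Python sketch, the hypothetical bad_review_share helper below counts ratings at or below 2 stars in an invented list, and the last line checks the arithmetic behind the figures quoted in this guide.

```python
def bad_review_share(ratings, threshold=2):
    """Return (count, fraction) of ratings at or below the threshold."""
    bad = sum(1 for rating in ratings if rating <= threshold)
    return bad, bad / len(ratings)

# Invented ratings, only to exercise the function.
count, fraction = bad_review_share([1, 2, 3, 4, 5, 5, 4, 2])
print(count, f"{fraction:.1%}")

# Figures quoted in this guide: 3,626 bad reviews out of 42,656.
guide_share = 3626 / 42656
print(f"{guide_share:.1%}")
```

The quoted figures work out to about 8.5% of the dataset, which rounds to the 9% displayed in the project.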

Take a look at the Cluster variable chart and notice that the values here have shifted significantly after applying the bad review filter. You can see that most bad reviews were about either the people, member, staff cast cluster or the fast pass, fast, pass, queue cluster.

Click on the people, member, staff cast cluster and take a look at the significant terms inside of this cluster. It seems that most of the reviews here are in some way about Customer Service. Rename this cluster Customer Service using the same process as you used to rename the Hospitality cluster.

Now click on the fast pass, fast, pass, queue cluster and once again check out the significant terms here. It seems as though these reviews refer to Waiting Times. Rename the cluster Waiting Times.

Finally, with both of these clusters given appropriate names, you can save this representation of your clusters as an insight. Use the more options menu list to do so and name your insight something like Customer Service and Waiting Times Often Motivate Bad Reviews.

Bad Reviews: Language Features

Now, let's use Graphext's extraction of the language features of these reviews to understand more about why people were motivated to give poor reviews. Find the variable Nouns from your right sidebar.

Before paying too much attention, change the sorting of the variable chart so that it represents frequency inside of our selection. There are some predictable words appearing here - for instance not everyone will enjoy the park, ride or the day. Nonetheless, it is very interesting to see that hour and food are nouns much more common in bad reviews - we can discern this because the blue bar for each of these values, showing relative representation, far exceeds the grey bar showing the value in proportion to the whole dataset. Save this chart as an insight and call it something like The Nouns Associated with Bad Reviews.

Step 5. Comparing Branches

Head over to the Compare panel. Let's have a look at the features distinguishing reviews about the 3 Disneyland branches that we have in the dataset. Choose Branch as the variable you want to compare using the search box and then add in values for all 3 Disneyland branches using the plus icon on the far right of the Compare panel.

Comparing Clusters among Branches

Your Compare panel now presents a series of charts showing the variables best explaining the difference in reviews between each of the Disneyland branches. Take a look at the Cluster chart. The first row in this chart is hardly surprising - it shows that reviews in the hong kong cluster almost exclusively refer to Disneyland Hong Kong.

Moving on, you can see that reviews belonging to the Hospitality, Waiting Times and Customer Service clusters were most common in relation to Disneyland Paris. Interestingly it seems as though visitors to Disneyland's Hong Kong branch were not so concerned about reviewing Customer Service or Waiting Times. This is a good finding so let's save it as an insight. Click on the three dots from the top right of the Cluster compare chart and choose Save as Insight. Name the insight something like The Topics Most Frequently Occurring in Reviews about 3 Disneyland Branches.

Comparing Ratings among Branches

Move on to inspect the Rating compare chart. It is clear from this chart that the ratings given to each of the branches differ noticeably. The three lines represent three quite different rating distributions, which is notable given that we are looking at a dataset of more than 40,000 reviews.

The Rating chart reveals that Disneyland California is the branch with the most 5 star reviews. Disneyland Paris has the worst rating distribution of the three branches.
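The comparison the Rating chart makes can be thought of as one rating distribution per branch. The Python sketch below computes such distributions from invented rows. The rating_distribution helper is hypothetical, the branch labels other than Disneyland_Paris are assumed spellings, and the numbers are not taken from the real dataset.

```python
from collections import Counter, defaultdict

def rating_distribution(rows):
    """Return {branch: {rating: fraction of that branch's reviews}}."""
    counts = defaultdict(Counter)
    for branch, rating in rows:
        counts[branch][rating] += 1
    return {
        branch: {rating: n / sum(c.values()) for rating, n in sorted(c.items())}
        for branch, c in counts.items()
    }

# Invented (branch, rating) rows, for illustration only.
rows = [
    ("Disneyland_California", 5), ("Disneyland_California", 5),
    ("Disneyland_California", 4), ("Disneyland_Paris", 5),
    ("Disneyland_Paris", 3), ("Disneyland_Paris", 1),
    ("Disneyland_HongKong", 4), ("Disneyland_HongKong", 5),
]
dist = rating_distribution(rows)
for branch, by_rating in dist.items():
    print(branch, {rating: f"{share:.0%}" for rating, share in by_rating.items()})
```

Normalizing within each branch matters because the branches have different numbers of reviews, so raw counts alone would not be comparable.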

Comparing Language Features among Branches

Scroll further down the series of charts in the project's Compare panel. Click Show More to reveal more than the 9 charts that are already presented here. Find the chart representing Adjectives and inspect it. This chart shows how different adjectives have been used in reviews of each of our 3 Disneyland branches. It is quite a revealing chart and would be exceptionally useful in helping management at Disneyland understand how their customers differ across their sites.

Small is an adjective most associated with Disneyland Hong Kong, whereas fast is rarely associated with this branch. Expensive is a more common term in reviews about Disneyland Paris. Save this chart as an insight.

This kind of finding can help to inform action taken by the customer support or marketing teams at Disneyland. For instance, if there is a perception that Disneyland Paris is too expensive, this can be picked up by site management, giving them the opportunity to change this perception.

Like most datasets containing customer reviews, this is a rich dataset allowing for many avenues of analysis. Even during the short time that we have spent analyzing reviews about 3 Disneyland branches, we have been able to determine the difference in review topics among the different sites as well as going some way to understanding why reviews of Disneyland Paris are comparatively worse than reviews about Disneyland Hong Kong and California.
