Understanding more about why people are experiencing health problems is crucial to improving the way that healthcare professionals treat or prevent similar issues. Data science plays an important role in monitoring patients' health, suggesting preventative steps and detecting diseases at an early stage.

Predictive analytics can root out the causes of an illness because of its power to model vast quantities of information on patients and their health.

We can use prediction to understand the factors driving health problems. Models are able to reverse-engineer the characteristics of disorders or diseases in order to understand the relationships between a patient's habits, diets, symptoms, treatments and ultimately the outcome of the illness.

About This Guide

This guide is intended to walk you through the process of analyzing a healthcare dataset with Graphext. We will build a prediction model that analyzes a dataset of 5110 patients. The dataset contains information about the lifestyle and existing health conditions of a patient alongside an important variable signifying whether the patient suffered a stroke or not.

The model we build will use the Stroke variable as its target - the variable we want to gain a deeper understanding of - and every other patient characteristic as a factor - the variables used to establish a relationship to Stroke.

The aim of the project is to discover whether our model is able to recognise the factors most influential in determining whether a patient has suffered a stroke or not. Our analysis will:

  • Model and cluster patients according to their characteristics.
  • Inspect the features of clusters where more patients suffered a stroke.
  • Compare the most influential features determining the outcome of the target - Stroke - variable.
  • Bust some myths about causation and confounding variables.

Download the dataset here.

Step 1. Building the Project

Uploading the Dataset

We need access to the dataset in Graphext to start building our project. First make sure you have access to it on your local computer by downloading it here. Then, start from the Datasets panel of your Personal team in Graphext. Select New Dataset and either browse for the file on your computer or drag and drop it into the Adding Dataset window.

Graphext will process your upload within a couple of minutes. Once it's ready, select the dataset to start inspecting it. You should see a dataset with 5110 rows and 12 columns: id, gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, smoking_status and stroke.

Scroll to the far-right of the dataset table and find the stroke column. This will be our model's target variable. It's important to note that each row in the dataset relates to one patient. Whether a patient suffered a stroke or not is represented by the presence of 0 or 1 in the stroke column. 1 represents that a stroke has been suffered, 0 represents that it has not.

Before moving on, change the type of the stroke column using the dropdown menu currently reading Number. Change it so that stroke is considered a Boolean variable. Despite having numerical values, it is preferable to consider this as a Boolean, since the 0 and 1 values correspond to True or False.
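If you want to run the same two checks outside Graphext, the Python sketch below confirms the column list and treats stroke as a Boolean. It is not part of the Graphext workflow. The two sample rows are invented for illustration, the column names follow the list given in this guide, and load_patients is a hypothetical helper name.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "id", "gender", "age", "hypertension", "heart_disease", "ever_married",
    "work_type", "residence_type", "avg_glucose_level", "bmi",
    "smoking_status", "stroke",
]

# Two made-up rows standing in for the real file (illustration only).
SAMPLE_CSV = """id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
1,Female,71,1,0,Yes,Private,Urban,110.5,27.3,never smoked,1
2,Male,34,0,0,No,Private,Rural,85.2,24.1,smokes,0
"""


def load_patients(handle):
    """Read patient rows and convert the 0/1 stroke flag to True/False."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {reader.fieldnames}")
    patients = []
    for row in reader:
        row["stroke"] = row["stroke"] == "1"  # "1" -> True, "0" -> False
        patients.append(row)
    return patients


patients = load_patients(io.StringIO(SAMPLE_CSV))
print(len(patients), [p["stroke"] for p in patients])
```

To read a real file instead of the in-memory sample, pass an open file handle (opened with newline="") to load_patients.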

Setting up the Project

Let's build our model. We want to understand which factors of the dataset are more related to the likelihood of suffering a stroke. We'll use stroke as our target variable and list every other variable as a factor for the model to analyze.

Start by selecting Models as your analysis type from the project setup wizard in your right sidebar. Then, choose Train and Predict.

Graphext will now present you with 3 tabs to configure the model. For this project we will focus on the Select Target and Factors tab but if you want to learn more about the other possibilities here, check out our articles on data enrichment and aggregation.

Find the stroke variable in your Other Variables list, tick its checkbox and send it to your Target variables list. Select every other variable from the Other Variables list - excluding the id column - and send it to your list of Factors. Your model's target and factors should be as follows:

1 Target: stroke

10 Factors: gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, smoking_status
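The same split of columns into one target and ten factors can be sketched in a few lines of Python. This only illustrates the selection logic described above, not how Graphext stores it internally. The names TARGET, EXCLUDED and factors are chosen for this example.

```python
COLUMNS = [
    "id", "gender", "age", "hypertension", "heart_disease", "ever_married",
    "work_type", "residence_type", "avg_glucose_level", "bmi",
    "smoking_status", "stroke",
]

TARGET = "stroke"
EXCLUDED = {"id"}  # an identifier says nothing about a patient's health

# Every remaining column becomes a factor.
factors = [c for c in COLUMNS if c != TARGET and c not in EXCLUDED]

print(TARGET, len(factors))
```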

For more information on Prediction Models in Graphext - check out this article.

Now that you've told your model how to reverse-engineer the dataset in order to model the values in the stroke variable, four other tabs will appear: Model Selection, Model Params, Validation, and Network and Clusters. Turn your attention to the Network and Clusters tab. Open it up and find the option asking: Do you want to use the target variable as UMAP target?

Select No from the menu. This prevents values in the stroke column from affecting the layout of your data inside the Graph or network visualisation. Selecting Yes will result in your data being largely divided into two groups (those that suffered strokes and those that did not) - something we don't want for this project.

Once you've done this - go ahead and continue through the project wizard. Name your project something like Predicting Stroke Probability and execute it.

Step 2. Exploring the Project

Once your project is ready, you will be able to open it from inside of the Project panel belonging to your Personal team. Click on the Predicting Stroke Probability project and open it.

The first thing that you see will be your project's Graph containing all 5110 patients in the dataset represented in a network visualisation. The layout of the network means that patients with similar factor values have been grouped together and now form clusters. We'll take a look at your clusters soon.

Variable Charts

Take a look at the variables presented in your two sidebars. In particular, notice the GX_Target and Prediction variable charts presented at the top of your right sidebar. GX_Target is a variable that has been created by Graphext as you executed the project and contains values from your target variable in Boolean form. True represents the actual values for patients that did suffer a stroke. False represents the actual values for those that did not.

Now turn your attention to the Prediction variable chart directly underneath GX_Target. These values represent the model's predictions for the stroke variable. As you can see there is a slightly different distribution of values here - more True values than the GX_Target variable. This represents the errors our model made.


Error scores are important. They tell us how well the model performed - which in turn signifies how well our project will be able to highlight the most influential factors when considering the likelihood of suffering a stroke.

At the top of the left sidebar, you can see Error and Confusion_Matrix. These variables tell us how well the model performed and where it made its mistakes.

Error - signifies how many correct and incorrect predictions your model made

Confusion Matrix - signifies the distribution of correct and incorrect predictions

Hover over the value representing Wrong predictions inside the Error variable chart. Although the model made over 200 errors, the error rate is low considering that we are working with a dataset of 5110 patients. We can use the confusion matrix to confirm that all of the model's errors were false positives: the model predicted a stroke for patients who did not suffer one. This is known as overprediction.
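To make the two charts concrete, here is a minimal Python sketch of how an error rate and a confusion matrix are derived from actual and predicted values. The six values are invented, and confusion_matrix is a hypothetical helper rather than anything exposed by Graphext. For scale, just over 200 wrong predictions among 5110 patients corresponds to an error rate of roughly 4%.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) combination for a Boolean target."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(True, True)],
        "false_positive": counts[(False, True)],   # overprediction
        "false_negative": counts[(True, False)],   # missed strokes
        "true_negative": counts[(False, False)],
    }


# Invented values for illustration only.
actual = [True, False, False, True, False, False]
predicted = [True, True, False, True, False, True]

matrix = confusion_matrix(actual, predicted)
wrong = matrix["false_positive"] + matrix["false_negative"]
error_rate = wrong / len(actual)

print(matrix)
print(f"{wrong} wrong predictions, error rate {error_rate:.2f}")
```

In this toy example every error is a false positive, which mirrors the overprediction pattern described above.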

Step 3. Investigating Clusters

Enough housekeeping. Let's dive into our analysis. A good place to start is by inspecting the features of clusters in the data. When Graphext built your project it created 40 clusters, grouping patients together based on the similarity of their characteristics.

We want to find out which of our 40 clusters contain a large number of patients that suffered a stroke. This could help us identify the features that stroke sufferers have in common.

Note: The number of clusters in your project might vary slightly from the numbers shown in this guide.

Identifying Key Clusters

Start by coloring your Graph by the GX_Target variable. Click the color mapping icon from within the GX_Target variable chart and notice how nodes on the Graph are now either orange - representing those that suffered strokes - or blue - representing those that did not.

Already, it is possible to see a pattern in the network. Most orange nodes are grouped towards the upper periphery of the Graph. To inspect this further, click on the True bar inside of the GX_Target variable chart so that your Graph displays only those patients that suffered a stroke. You should see that only 249 nodes have been selected.

Switch up the Absolute vs Relative dropdown menu from the top of your right sidebar so that it represents Relative values. Now, values in your variable sidebar charts are represented in proportion to the selection you've made - making it easier to spot the characteristics of patients that have suffered a stroke.

Relative representation displays values in proportion to selections that you make.
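The difference between Absolute and Relative values can be sketched in Python as counts versus proportions within a selection. The rows below are invented, and relative_distribution is a hypothetical helper used only for this illustration.

```python
from collections import Counter

# Invented rows for illustration only.
patients = [
    {"ever_married": "Yes", "stroke": True},
    {"ever_married": "Yes", "stroke": True},
    {"ever_married": "Yes", "stroke": True},
    {"ever_married": "No", "stroke": True},
    {"ever_married": "Yes", "stroke": False},
    {"ever_married": "No", "stroke": False},
    {"ever_married": "No", "stroke": False},
    {"ever_married": "Yes", "stroke": False},
]


def relative_distribution(rows, column):
    """Share of each value of `column` among the given rows."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


selection = [p for p in patients if p["stroke"]]  # like clicking the True bar

absolute = Counter(p["ever_married"] for p in selection)
relative = relative_distribution(selection, "ever_married")
baseline = relative_distribution(patients, "ever_married")

print(dict(absolute), relative, baseline)
```

Comparing relative against baseline shows whether a value is over-represented in the selection, which is what the Relative view makes visible in the sidebar charts.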

Find the UMAP_Cluster variable chart at the bottom of your left sidebar. This contains the clusters generated by Graphext when you built your project. You can see from the distribution of values in these charts - the blue bars - that some clusters don't feature any patients that suffered a stroke, while others feature many.

Let's find out which clusters have the most stroke victims. Click the sorting arrow icon from the top of the UMAP_Cluster variable chart. Sort the values according to their frequency in the selection you've made. You should see that Cluster 8 and Cluster 13 appear at the top of your list. These clusters are the ones we want to focus our next stage of analysis on.
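Sorting clusters by their frequency within the selection amounts to counting stroke patients per cluster and ordering the counts. A minimal Python sketch follows, using invented (cluster, stroke) pairs. The cluster labels are placeholders and will not match the counts in your project.

```python
from collections import Counter

# Invented (cluster, suffered_stroke) pairs for illustration only.
rows = [
    (8, True), (8, True), (8, True), (8, False),
    (13, True), (13, True),
    (2, True), (2, False),
    (5, False),
]

# Count only the patients in the current selection (stroke == True).
stroke_counts = Counter(cluster for cluster, stroke in rows if stroke)

# most_common() returns clusters ordered from most to fewest stroke patients.
ranked = stroke_counts.most_common()
print(ranked)
```

Clusters with no stroke patients, such as cluster 5 here, do not appear in the ranking at all.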

The Features of Key Clusters

Clear all of your filters and select Cluster 8 from the list of clusters. With your Absolute vs Relative dropdown menu still set to Relative, the values in your sidebar charts should now represent the features of Cluster 8.

Scroll through the variable charts in your right sidebar and pay close attention to the distribution of values in each of the charts. We are looking to distinguish the characteristics of patients in this cluster.

It seems as though patients in this cluster are all over the age of 65, most of them are married, they generally have a low Avg_Glucose_Level and are slightly more prone to heart disease and hypertension when compared with the distribution of values for all patients in the dataset.

Let's save these findings as an insight. To do this, first save the representation of the Graph as an insight using the insight icon at the top of your Graph panel. Name the insight something like Cluster 8 - The Highest Density of Stroke Sufferers and save it.

Head over to your project's Insights panel and find the insight card that you just saved. We will now add the relevant charts to this card. Click the icon to edit the insight and remove the UMAP_Cluster chart by clicking on the trash icon.

Now, use the plus icon at the bottom of your insights card to add in charts for GX_Target, Age, Avg_Glucose_Level, Ever_Married and Hypertension. That's great. Now customize the layout of your insight until you are happy with it.

Step 4. Comparing Target Values

Head over to the Compare panel. Let's generate some charts that expose the differences between patients that suffered a stroke and those that did not.

Comparing Stroke Patients

Use the search bar to set the variable you want to compare as the GX_Target variable. This will bring up a series of charts showing only the distribution of False values - those patients that did not suffer a stroke - across variables in the dataset. Each chart represents one variable and variable charts are ordered in terms of their relevance to showing the difference between values you are comparing.

But there is only one value in the charts right now. That's not much use. Use the plus icon to add in True values in the dataset. You should now see two colors in each of your charts.

Notice that the dropdown sitting on the top left of your compare charts currently reads Show Important variables that explain the Difference. This means that Graphext has restricted the charts on show to Target and Factor variables. We aren't much interested in the Target variable, so change this dropdown so that it shows Factors.

Great. You should see that the most relevant variables explaining the difference between patients that did and did not suffer a stroke are: Age, Ever_married, Avg_Glucose_Level, Hypertension and Heart_Disease. This makes sense! The relevance of these variables pretty much confirms the suspicions we had from inspecting Cluster 8.
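This guide does not describe how Graphext scores relevance, so the Python sketch below swaps in a simple stand-in: the absolute difference between the two groups' means, divided by the overall standard deviation of the factor. It only illustrates the idea of ordering factors by how strongly they separate the two target values. The rows are invented, and standardized_difference is a hypothetical helper.

```python
from statistics import mean, pstdev

# Invented rows for illustration only.
patients = [
    {"age": 78, "hypertension": 1, "bmi": 28.0, "stroke": True},
    {"age": 82, "hypertension": 0, "bmi": 27.0, "stroke": True},
    {"age": 70, "hypertension": 1, "bmi": 30.0, "stroke": True},
    {"age": 30, "hypertension": 0, "bmi": 27.5, "stroke": False},
    {"age": 45, "hypertension": 0, "bmi": 29.0, "stroke": False},
    {"age": 25, "hypertension": 1, "bmi": 28.5, "stroke": False},
    {"age": 38, "hypertension": 0, "bmi": 28.0, "stroke": False},
    {"age": 52, "hypertension": 0, "bmi": 27.0, "stroke": False},
]


def standardized_difference(rows, factor, target="stroke"):
    """Gap between group means, scaled by the factor's overall spread."""
    positives = [r[factor] for r in rows if r[target]]
    negatives = [r[factor] for r in rows if not r[target]]
    spread = pstdev([r[factor] for r in rows])
    if spread == 0:
        return 0.0  # a constant factor cannot separate the groups
    return abs(mean(positives) - mean(negatives)) / spread


factors = ["age", "hypertension", "bmi"]
ranking = sorted(
    factors,
    key=lambda f: standardized_difference(patients, f),
    reverse=True,
)
print(ranking)
```

A score like this only works for numeric or 0/1 factors. Categorical factors such as work_type would need a different measure.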

Save the first five charts as an insight using the insight icon sitting on the top right of your compare charts. Call it something like The Most Important Variables Distinguishing Patients that Did or Did Not Suffer a Stroke.

The Confounding Variable

Most of the variables that are common among patients that suffered a stroke make logical sense. For instance, we would expect to see that older people are more likely to suffer strokes compared with their younger counterparts.

But Ever_Married? It doesn't seem logical that people that are married are much more likely to be stroke sufferers than single people. Tempting as it is to conclude that from the compare charts shown here, there is probably something else going on.

Let's open up the two values from Ever_Married in our Compare panel in an attempt to determine if there is something obvious that is obscured from us here.

Change up the variables shown in the Compare panel using the variable dropdown menus at the very top of your window. Make sure that you see Yes and No values represented in your charts before taking a moment to closely inspect the charts here.

The first chart - Age - is significant. Remember that Age was our most relevant variable when determining the difference between patients that did and did not suffer a stroke? Well, it seems like Age is also playing a significant role in distinguishing those patients that are and are not married. Older patients are much more likely to be married than younger patients.

Age is acting as a confounding variable here: it influences both whether a patient has ever been married and whether they have suffered a stroke. The positive relationship between Ever_Married and stroke is therefore likely to reflect the relationship each of them has with Age.

This means that the influence that Age has on being married is also causing Ever_Married to show up as a relevant variable when we compare our Factors against our Target variable.

This pattern is closely related to Simpson's paradox, a statistical phenomenon in which a trend that appears in combined data weakens, disappears or even reverses once the data is split into groups - or the other way around. It's tempting to read a trend in the combined data as a causal effect, but Simpson's paradox is a reminder that comparisons within groups, such as age bands, can tell a different story.
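One way to check whether a factor such as Ever_Married still matters once Age is accounted for is to compare stroke rates within age bands. The Python sketch below uses invented counts, deliberately constructed so that married patients show a higher stroke rate overall but an identical rate within each age band. The numbers are not taken from the dataset, and the age bands are an assumption made for this example.

```python
from collections import defaultdict

# (age_band, ever_married) -> (patients, strokes). Invented counts only.
counts = {
    ("under_50", "Yes"): (100, 1),
    ("under_50", "No"): (400, 4),
    ("50_plus", "Yes"): (400, 40),
    ("50_plus", "No"): (100, 10),
}

# Stroke rate per marital status, ignoring age.
totals = defaultdict(lambda: [0, 0])
for (band, married), (patients, strokes) in counts.items():
    totals[married][0] += patients
    totals[married][1] += strokes
overall_rates = {m: strokes / patients for m, (patients, strokes) in totals.items()}

# Stroke rate per marital status within each age band.
stratified_rates = {
    key: strokes / patients for key, (patients, strokes) in counts.items()
}

print("overall:", overall_rates)
for (band, married), rate in stratified_rates.items():
    print(band, married, f"{rate:.2%}")
```

In this toy example the overall gap between married and unmarried patients comes entirely from married patients being older. If a gap remained within the age bands of the real data, Ever_Married would deserve a closer look.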

Now that we've discovered the key factors most influential when considering the probability that patients will suffer a stroke, we've achieved the main purpose of this project. However, there is plenty more exploration to do. To take a look at the further analysis we conducted on the project, check it out for yourself here.
