Models

Understanding what makes your employees tick is a fundamental part of making good business decisions. People can be vastly different from one another in many ways but also very similar in other ways. Within companies, there are sub-communities of employees with specific preferences, behaviours and attitudes. Recognising these communities, who belongs in them and how to treat them is difficult. This is where your data comes in.

Today, data and smart technology are playing momentous roles in most departments of most companies in most sectors. HR should be no different. This guide looks at how we can use prediction models built into Graphext in order to understand what is driving the behaviour of people in your company. The process that we will follow is not limited to understanding employee behaviour. In fact, the same flow can be applied to analyzing the preferences, motivations or performance level of employees.

"Understanding your employee's perspective can go a long way towards increasing productivity and happiness."

- Kathryn Minshew

As well as predicting missing values in datasets, models can also be used to understand relationships between variables in your data. Your predictions will tell you more about which feature of your dataset is most important when considering the value of your target variable. You can learn more about prediction models in Graphext here.

This guide is focused on the process of building a model to understand the reasons why employees left their jobs. We will be working with a public dataset containing records for 14,999 employees, including their Salary, their Monthly Hours, their Satisfaction Level and the Number of Projects they worked on, among other variables. The dataset was originally published on Kaggle but you can download it here.

Building the Project

Uploading the Dataset

Before we start building a model to analyze the reasons why employees might have left their jobs, we need to upload the dataset to Graphext. First, make sure you have the dataset downloaded to your computer. Then, start from the Datasets panel of your Personal team in Graphext. Select New Dataset and either browse for the file on your computer or drag and drop it into the Adding Dataset window.

It should take less than 2 minutes for Graphext to process your data. Once it's ready, select it from the Datasets panel to start inspecting it. Scan through the variables in the dataset. You should see the following columns: Work Accident, Average Monthly Hours, Last Evaluation, Left, Number of Projects, Promoted in the last 5 years?, Salary, Department, Satisfaction Level, Years in the Company.

Target and Factors

In our project, we want to build a model that predicts whether an employee has left the company or not based on the values they have for every other variable in the dataset. Left represents whether the employee left or not. This will be our target variable. Every other variable in the dataset will be used as a factor in an attempt to understand which is most significant when considering why an employee did or didn't leave.

Factors are the features of your data you want to use to calculate relationships to your target variable.

Targets are the variables that your model will make predictions about. Their relationships to factors define how a model calculates a prediction.
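
The split between a target and its factors can be sketched outside Graphext in a few lines of Python. This is only an illustration: the column names and the two rows below are hypothetical stand-ins shaped like the dataset, not values taken from it.

```python
# Hypothetical rows shaped like the employee dataset (values are made up).
rows = [
    {"satisfaction_level": 0.38, "number_of_projects": 2, "salary": "low", "left": "Yes"},
    {"satisfaction_level": 0.80, "number_of_projects": 4, "salary": "medium", "left": "No"},
]

TARGET = "left"  # the variable the model predicts


def split_target_and_factors(row, target=TARGET):
    """Return (factors, target_value) for a single row."""
    factors = {name: value for name, value in row.items() if name != target}
    return factors, row[target]


pairs = [split_target_and_factors(row) for row in rows]
factor_names = sorted(pairs[0][0])  # every column except the target
targets = [target_value for _, target_value in pairs]

print(factor_names)
print(targets)
```

Every column other than the target ends up in the factors, which mirrors the choice made in this guide of sending all remaining variables to the list of factors.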

Setting Up the Project

Let's build the model. From the project setup wizard in your right sidebar, choose Models as your analysis type. Next, choose Train and Predict. This will bring up a series of dropdown containers where you can specify exactly how your model will work.

Find the Clusters and Network Creation dropdown. Let's add your factors and target variables to the correct list. Choose Left from the list of Other Variables and make it your target variable. Then, select every other variable using the checkbox next to Other Variables and send them to your list of factors. That's it. Your model will now know how to interpret your dataset in order to make predictions on whether an employee left the company or not.

Next, open up the Network and Clusters tab. Find the dropdown next to the question: Do you want to use the target variable as UMAP target? Make sure that the dropdown reads Yes. This helps to ensure that the clusters that appear on our Graph will be defined by the model's prediction on whether the employee left the company as well as by their characteristics. This should leave us with a clear distinction between the employees that left and those that did not, allowing us to start recognising some sub-communities quickly.

Then, inside the Network Visualization tab, choose Left as the variable you want to color your nodes by. When we open the Graph we want to get an immediate indication of which clusters of employees left the company and which did not.

That's it. Click Continue and name your project something like Predicting Employee Behaviour. Executing your project will tell Graphext to build the model according to your instructions. It should take around 5 minutes to build your project.


Exploring Your Communities

Time to inspect how your model did! From the Projects panel of your personal team, find the Predicting Employee Behaviour project and open it. The project will open on your Graph with orange nodes representing employees that left their jobs and blue nodes representing employees that haven't left their jobs. Take a moment to digest the visualization and try to identify the four communities that have appeared.

Identifying Communities

There is a clear distinction between the employees that left and those that didn't. We can immediately see that the employees that didn't leave are clustered together in a big 'brain' shape. Employees that left are divided into three distinct sub-communities. It's highly likely that these communities feature employees that left the company for different reasons. Our task now is to identify what these reasons were.

Errors & Confusion

First, find the Error variable from the top of your left sidebar. Select the Wrong value from the Error variable chart. This value tells you how many mistakes your model made when predicting whether an employee left the company or not. Check how many values are wrong. The number is low relative to the 14,999 employees in the dataset, which suggests that the model's predictions are reliable for this data.

Find the Confusion Matrix variable further down the left sidebar. This variable indicates where your model's mistakes took place. Hover over the Wrong No and Wrong Yes values. It looks like the model made most of its mistakes by not correctly identifying that an employee left the company when they did. Click the Wrong No value from the Confusion Matrix variable chart. Highlighting these nodes on the Graph shows that most of the employees that fall into this category are clustered together with the community of employees that didn't leave. It's likely that there isn't much distinguishing them from the characteristics of employees that didn't leave and this is why the model struggled here.
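
To make the Error and Confusion Matrix values concrete, here is a minimal sketch of how such labels could be derived from actual and predicted values. The data is hypothetical, and the labelling rule is an assumption based on the reading above: Wrong No is taken to mean that the model predicted No for an employee who actually left.

```python
from collections import Counter

# Hypothetical actual and predicted values for the Left variable.
actual = ["Yes", "No", "Yes", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "No", "Yes", "No"]


def confusion_label(actual_value, predicted_value):
    """Label a prediction as Right or Wrong, followed by the predicted value."""
    outcome = "Right" if actual_value == predicted_value else "Wrong"
    return f"{outcome} {predicted_value}"


# Count how many rows fall into each confusion category.
confusion = Counter(confusion_label(a, p) for a, p in zip(actual, predicted))

# The Error variable collapses the categories into Right versus Wrong.
wrong = sum(count for label, count in confusion.items() if label.startswith("Wrong"))
error_rate = wrong / len(actual)

print(dict(confusion))
print(f"{wrong} wrong out of {len(actual)} ({error_rate:.0%})")
```

In this toy data the model misses two leavers (Wrong No) and flags one employee who stayed (Wrong Yes), the same kind of imbalance described above.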


Segmenting Communities

It makes sense for us to create a segmentation dividing the 4 communities that we can see on the Graph. Segmentations are new groups of values that you create based on some kind of filtering or selection process. After we've created a segmentation we can start working with these groups of employees as if they were a variable in the dataset, making it much easier to start identifying the features of each community.

Creating a Segmentation

To create a segmentation, select the New Segmentation icon from the top of your right sidebar. Then name your segmentation Communities and click OK. To add a value to a segmentation, you have to filter or select some data points. Start with the community of employees that left the company positioned towards the top of your Graph. Use the direct selection tool to drag your cursor over the data points in this community. It doesn't have to be totally accurate but try to include as many as possible.

Now that you've selected these nodes you can add them as a value to your segmentation. Select the plus icon from inside the Communities segmentation and name this community Left Community 1. That's it! The values inside of Left Community 1 represent the employees within your selection.

Repeat the process for the other two communities of employees that left the company. You should end up with 3 values inside Communities: Left Community 1, Left Community 2 and Left Community 3.

Once you've added the 3 communities of employees that left the company to your new segmentation, add the community of employees that didn't leave. It might be tricky to directly select them from the Graph so let's filter these employees using the Left variable from the top of your right sidebar. First, make sure all of your filters are cleared so that your Graph represents 100% of the dataset. Then, select the No value inside the Left variable chart and add this selection to your Communities segmentation, calling it Stayed Community.

Finally, save your Communities segmentation by clicking the save icon from the top right of the variable card.
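
Conceptually, a segmentation behaves like a new categorical column whose value depends on which selection or filter a row matched. The sketch below illustrates that idea only; the selection names, the employee rows and the rule that hand-drawn selections take precedence over the Left filter are all assumptions, not a description of how Graphext stores segmentations.

```python
# Hypothetical employees: an id, the Left value, and the hand-drawn
# selection (if any) that the node fell inside on the Graph.
employees = [
    {"id": 1, "left": "Yes", "selection": "selection_1"},
    {"id": 2, "left": "Yes", "selection": "selection_2"},
    {"id": 3, "left": "Yes", "selection": "selection_3"},
    {"id": 4, "left": "No", "selection": None},
    {"id": 5, "left": "Yes", "selection": None},  # a leaver missed by every selection
]

# Each hand-drawn selection becomes one value of the segmentation.
SELECTION_TO_COMMUNITY = {
    "selection_1": "Left Community 1",
    "selection_2": "Left Community 2",
    "selection_3": "Left Community 3",
}


def assign_community(employee):
    """Return the segmentation value for one employee, or None if unassigned."""
    community = SELECTION_TO_COMMUNITY.get(employee["selection"])
    if community is not None:
        return community
    if employee["left"] == "No":  # the filter on the Left variable
        return "Stayed Community"
    return None


communities = {e["id"]: assign_community(e) for e in employees}
print(communities)
```

Rows that match no selection and no filter stay unassigned, which is why the selections do not have to be totally accurate for the segmentation to be useful.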

Region Labels

Now that we've got access to a variable that neatly defines our 4 communities of employees, let's use this to label our Graph. Open the project settings using the gear icon at the top of your Graph. Find the Region Labels dropdown menu and choose Communities from the menu list. Now click Save and return to your Graph. That's much clearer. Now you can recognise the important communities in your data at a glance. Finally, use the raindrop icon inside of the Communities card to color your Graph using the values in your segmentation.


Exploratory Analysis

We can start exploring the differences between the 3 communities of employees that left the company using size and color mapping. Size mapping involves changing the size of nodes on your Graph according to their value for a quantitative variable. Color mapping involves changing the color of nodes on the Graph according to their value for a variable.

Size Mapping

Let's begin exploring the features of our communities using size mapping. Browse through the variables in your right sidebar and locate Satisfaction Level. Notice that this variable is represented by a histogram where the values range from 0 (Low Satisfaction) to 1 (High Satisfaction). Select the circular icon representing size mapping.

Now check your Graph. The size of each node now represents its value for the Satisfaction Level variable. Larger nodes represent employees that had a high level of satisfaction and smaller nodes represent employees that weren't satisfied. It looks like employees in Left Community 2 were pretty unsatisfied with their jobs.
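
The idea behind size mapping can be sketched as a simple scaling function. The linear scale and the radius range used here are assumptions chosen for illustration; this guide does not describe the scale Graphext actually applies.

```python
def size_for_value(value, low=0.0, high=1.0, min_radius=2.0, max_radius=10.0):
    """Linearly map a value in [low, high] to a node radius.

    Values outside the range are clamped to its ends.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(value, low), high)
    fraction = (clamped - low) / (high - low)
    return min_radius + fraction * (max_radius - min_radius)


# Hypothetical Satisfaction Level values between 0 and 1.
for satisfaction in (0.0, 0.5, 1.0):
    print(satisfaction, size_for_value(satisfaction))
```

Color mapping follows the same pattern, with the fraction selecting a position along a color scale instead of a radius.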

Color Mapping

Now let's use color mapping to inspect our communities a little more. Find Average Monthly Hours from your right sidebar and click the raindrop icon to apply color mapping to this variable. The color of nodes inside your Graph now represents their value inside of the numerical range of Average Monthly Hours. It's insightful to see that employees within Left Community 2 worked a lot of hours. Combining size mapping and color mapping, we can form a hypothesis about this community: these employees weren't satisfied with their jobs and worked a lot of hours each month.

Additionally, it's interesting to note that employees in Left Community 3 worked relatively little compared with the other communities of employees that left their jobs. This could well be a reason as to why they have been clustered together.

Find Number of Projects from your right sidebar and apply color mapping to this variable. This reveals a similar trend to the Average Monthly Hours worked by our communities. It seems as though employees in Left Community 2 worked on a lot of projects whilst employees in Left Community 3 worked on a small number of projects.


Inspecting Community Features

Using color and size mapping we've managed to get a pretty good idea of some of the characteristics of Left Community 2 and Left Community 3 but we haven't found out much about Left Community 1 yet. We can select this community directly to start inspecting how the values of employees within it are distributed among our other variables.

Filtering

Find your Communities segmentation from the top of the left sidebar and select the bar representing Left Community 1. Notice that not only has your Graph blurred out all nodes apart from those representing employees inside this community, but also that your variable sidebar charts have changed. This is a really useful method of understanding the features of groups in your data.

Absolute vs Relative Representation

From the top of your right sidebar, make sure that the Absolute vs Relative dropdown menu is set to Relative. Relative representation displays data inside your variable charts in proportion to the number of values in the selection. Using relative representation we get an impression of how employees in Left Community 1 match up to the employees in the rest of our dataset.
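
The difference between the two representations can be shown with a small sketch. The salary values below are hypothetical, and the functions only illustrate the idea of counts versus proportions rather than how Graphext draws its charts.

```python
from collections import Counter

# Hypothetical Salary values for the whole dataset and for one selection.
all_salaries = ["low"] * 5 + ["medium"] * 4 + ["high"] * 1
selected_salaries = ["low"] * 3 + ["medium"] * 1


def absolute(values):
    """Absolute representation: the raw count of each value."""
    return dict(Counter(values))


def relative(values):
    """Relative representation: each count as a share of the group it belongs to."""
    total = len(values)
    if total == 0:
        return {}
    return {value: count / total for value, count in Counter(values).items()}


print(absolute(selected_salaries))  # small counts, hard to compare with the full dataset
print(relative(selected_salaries))  # shares of the selection
print(relative(all_salaries))       # shares of the whole dataset
```

In absolute terms the selection has fewer low salaries than the dataset (3 against 5), but in relative terms low salaries make up a larger share of it (75% against 50%), which is the comparison that matters when a selection is much smaller than the dataset.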

Variable Chart Analysis

Browse through the charts in your left sidebar and take note of how the values inside the charts have changed to represent the features of employees in Left Community 1. From the Average Monthly Hours chart, you can see that employees in this community worked a relatively high number of hours each month. From the Salary chart you can see that these employees were paid a relatively low salary. This might be an interesting discovery!

Saving Insights from Variable Charts

It's a good idea to save your findings as insights throughout the course of your analysis. Then, when you return to form a narrative from your findings you can use insight cards to jump back into the point of analysis where you saved a specific insight.

With the filter still active, select the three dots from the top right of the Salary variable chart. Then from the menu list click Save as insight. Name your insight something like Employees in Left Community 1 get paid relatively little. Now head over to the Insights panel of your project to see your insight in action. You can change how elements are displayed inside insight cards but for now let's leave it as it is.


Comparing Communities

The Compare panel offers useful visualizations highlighting the differences between communities in your data. Go to the Compare panel where you will be presented with an empty search field prompting you to pick a variable to compare.

Comparing Communities with Everything

Choose Communities as the variable you want to compare. This will bring you to the Compare panel where a series of charts have been generated. Now use the plus icon at the top of the panel to add another value into your charts.

First, we want to compare employees in Left Community 2 with the values for every employee in the dataset. Click inside of the Stayed Community tab and change the value to represent Left Community 2. Now, click on the variable dropdown above the second value and change Communities to Everything.

Your compare charts are now visualizing the values which best explain the difference between employees in Left Community 2 and every employee. It's particularly striking to see how the Satisfaction Level of employees in this community differs hugely from the Satisfaction Level of other employees. Click on the three dots from the top right of the Satisfaction Level chart and save this chart as an insight.
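
One way to think about which variables best explain the difference between a community and everyone is to rank variables by how far the community's average sits from the overall average. The sketch below uses a standardized mean difference on hypothetical data. It is an assumption for illustration and not the method the Compare panel uses, which this guide does not describe.

```python
from statistics import mean, pstdev

# Hypothetical employees with two numeric variables (values are made up).
employees = [
    {"community": "Left Community 2", "satisfaction_level": 0.10, "average_monthly_hours": 280},
    {"community": "Left Community 2", "satisfaction_level": 0.12, "average_monthly_hours": 290},
    {"community": "Stayed Community", "satisfaction_level": 0.70, "average_monthly_hours": 200},
    {"community": "Stayed Community", "satisfaction_level": 0.65, "average_monthly_hours": 190},
    {"community": "Left Community 3", "satisfaction_level": 0.40, "average_monthly_hours": 140},
    {"community": "Left Community 3", "satisfaction_level": 0.42, "average_monthly_hours": 150},
]
VARIABLES = ["satisfaction_level", "average_monthly_hours"]


def standardized_difference(rows, community, variable):
    """(community mean - overall mean) / overall standard deviation."""
    everyone = [row[variable] for row in rows]
    group = [row[variable] for row in rows if row["community"] == community]
    spread = pstdev(everyone)
    if spread == 0:
        return 0.0
    return (mean(group) - mean(everyone)) / spread


effects = {
    variable: standardized_difference(employees, "Left Community 2", variable)
    for variable in VARIABLES
}
# Largest absolute difference first.
ranking = sorted(effects.items(), key=lambda item: abs(item[1]), reverse=True)

for variable, effect in ranking:
    print(f"{variable}: {effect:+.2f}")
```

A negative value means the community sits below the overall average for that variable and a positive value means it sits above, so in this toy data Left Community 2 shows low satisfaction alongside high monthly hours.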

Comparing Communities with Each Other

It's also going to be useful to compare each of our communities with each other to spot the different reasons why employees have left their jobs. Delete the Everything value from your charts and add in three more values. Then, for the sake of clarity, delete the Stayed Community value that was just added. This looks a bit more colorful. Your compare charts now represent the variables that best explain the difference between the three communities of employees that left their jobs.

There are lots of insights to gain from these charts. It's interesting to see that the division of Satisfaction Level is extremely varied between our three communities. It is becoming more likely that our model considered this an important factor.

Find the chart representing Salary. This chart could also be important despite the fact that its relevance to the difference between our communities doesn't appear to be highly significant. What is striking about this chart is that the values for all three communities are bunched together. Moreover, around 60% of employees in each of the communities have a low salary. This is often an important aspect of whether employees decide to stay in their jobs or not and could definitely be a motivating factor as to why these communities of employees have left their jobs.
