Analysis Types | Models

Inside Graphext's Models analysis type, you'll find a series of out-of-the-box data science algorithms including models to undertake clustering and prediction tasks. We built Models to let you quickly customize and deploy powerful models without having to use Python or R.

Our models can be used to predict or cluster your data as well as analyze the correlation between variables in your data or study the co-occurrence of values.

What are Data Models?

Data modelling means understanding data through a mathematical representation of it. This representation is built by processing the variables in your data and capturing the relationships between them. Because a model learns these relationships from past data, it can be used to classify or make predictions about new data.

Using Models

The technology behind the analysis types in Models has either been built by our team or integrated from open-source machine learning projects. Our idea here is to give you quick access to powerful data science algorithms without having to write code.

Inside each Models analysis type, Graphext will ask you to answer a series of questions. Your answers configure the parameters of the model and connect it to your dataset. Executing your project then runs the model, and you'll see its output variables in your data.

For prediction models, use your project's Models panel to evaluate the performance of your model.

Types of Models in Graphext

Cluster

Clustering projects use machine learning technology to group your data. When setting up a clustering model, Graphext will ask you to specify the variables to use as factors to calculate the similarity between rows in your data. By measuring how far your rows share values, your clustering model computes the strength of the relationship between them and uses this to group them into clusters.

How to Build Clustering Projects?

Choosing Cluster as your analysis type will lead you to Graphext's project setup wizard. Here you can configure how your clustering model will work, connect it to your dataset and customize the appearance and layout of your project's network visualisation - an extremely useful tool for inspecting your clusters.

Selecting Targets & Factors

Selecting factors is crucial to setting up a clustering project. Your factors are the variables used to calculate similarity between rows in your dataset. The more factor values two rows share, the more likely they are to be grouped together in the same cluster. Move variables from your list of Other Variables to your list of Factors to continue setting up your project.

Target(s)

A pinned variable. This won't affect your clusters unless you use UMAP and choose to set this variable as your UMAP target inside the Network and Clusters tab.

Factors

The variables your model will use to calculate similarity.

Other Variables

All variables not used as factors or targets. These won't affect your clusters or your network layout.

You can use most variable types as factors, but you can't use text values. Clustering projects work best with more factors, but the more you add, the less impact each one will have on the definition of your clusters.
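
To make the idea of similarity concrete, here is a minimal Python sketch that scores two rows by the fraction of factors on which they hold the same value. It is an illustration only: this article does not state how Graphext actually measures similarity, and the function name, rows and column names below are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def shared_factor_similarity(a: Row, b: Row, factors: List[str]) -> float:
    """Fraction of the chosen factors on which two rows hold the same value."""
    if not factors:
        raise ValueError("at least one factor is required")
    matches = sum(1 for f in factors if a[f] == b[f])
    return matches / len(factors)


# Hypothetical rows, loosely modelled on a supermarket transactions table.
rows = [
    {"city": "Yangon", "payment": "Cash", "product_line": "Food", "gender": "F"},
    {"city": "Yangon", "payment": "Cash", "product_line": "Food", "gender": "M"},
    {"city": "Mandalay", "payment": "Card", "product_line": "Sports", "gender": "M"},
]

# "gender" is left out of the factors, so it plays the role of an Other Variable.
factors = ["city", "payment", "product_line"]

print(shared_factor_similarity(rows[0], rows[1], factors))  # 1.0
print(shared_factor_similarity(rows[0], rows[2], factors))  # 0.0
```

In this sketch, adding `gender` as a fourth factor drops the first score from 1.0 to 0.75, which is one way to picture how each extra factor reduces the weight of the others.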

Targets matter less here, but set one if you want to pin a specific variable in your project. Note that setting a target variable can have a big influence on the layout of your network visualisation - Graph. You can configure whether you want this to happen inside the Network and Clusters tab of the project setup wizard.

Network and Clusters

Your Network and Clusters tab lets you customize the network visualisation - Graph - that your project will create as well as configure the parameters for the clustering algorithm that Graphext will execute.

Choosing a dimensionality reduction algorithm will affect the way that clusters are calculated in your project as well as the layout of your network - see our article on Creating Graphs and Layouts for more information.

Switching between UMAP and k-NNG will change the algorithm used to cluster your data - and the parameters needed to do so!

If you've chosen UMAP and have a target variable, you'll be able to choose whether or not to use it as your UMAP target. Doing so means that your target variable will have a heavy influence on your network and clusters. Otherwise, your target variable will be ignored when your network and clusters are calculated.

Depending on which algorithm you choose to use to cluster your data - UMAP vs k-NNG - you'll be able to set parameters to customize how your clusters are created. You'll also be offered the chance to create HDBSCAN clusters - an alternative clustering algorithm which will add another set of clusters to your project.
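
As a rough picture of what a k-nearest-neighbour graph (k-NNG) does, the Python sketch below links each row to its `k` closest rows and then treats every connected group of rows as one cluster. This is a deliberately simplified stand-in: the clustering algorithm Graphext runs on the graph is not described in this article, and the function names and sample points are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def knn_graph(points: Sequence[Sequence[float]], k: int) -> Dict[int, Set[int]]:
    """Link every point to its k nearest neighbours (edges are undirected)."""
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(graph: Dict[int, Set[int]]) -> List[int]:
    """Label each node with the index of the connected group it belongs to."""
    labels = [-1] * len(graph)
    current = 0
    for start in graph:
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] != -1:
                continue
            labels[node] = current
            stack.extend(n for n in graph[node] if labels[n] == -1)
        current += 1
    return labels


# Two hypothetical groups of rows, described by two numeric factors each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = connected_components(knn_graph(points, k=2))
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With `k=3` every point is forced to link to one row from the other group, so the two groups merge into a single component. That sensitivity is why the neighbour count is the kind of parameter worth experimenting with.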

What to Expect from Clustering Projects?

The network visualisation - Graph - created in clustering projects is a great place to start. Your network displays nodes (the data points, or rows, of your dataset) linked together based on the similarity of their values for each of the factors that you selected. Graphext will automatically color your Graph according to the clusters that were created. You can inspect the features of each cluster by selecting it from the Cluster variable sidebar chart.

The cluster that each node belongs to has been added as a new categorical variable in your project. You can use this to filter or segment your data further within each cluster. Export the dataset from the Details panel to access the new cluster-related variables that were created.
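
Once the dataset is exported, the cluster variable can be used like any other categorical column. The sketch below summarises one numeric column per cluster using only Python's standard library. The inline CSV stands in for an exported file, and the column names `Cluster` and `Total` are assumptions made for this example rather than names guaranteed by Graphext.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Stand-in for a dataset exported from a clustering project.
# The column names "Cluster" and "Total" are assumed for illustration.
exported_csv = """Cluster,Total
0,120.5
0,80.0
1,310.0
1,290.0
1,300.0
"""


def summarise_by_cluster(csv_text, cluster_col, value_col):
    """Return {cluster: (row_count, mean_value)} for one numeric column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[cluster_col]].append(float(row[value_col]))
    return {cluster: (len(values), mean(values)) for cluster, values in groups.items()}


summary = summarise_by_cluster(exported_csv, "Cluster", "Total")
for cluster, (count, average) in summary.items():
    print(f"cluster {cluster}: {count} rows, mean total {average:.2f}")
```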

Use Case Guide | Clustering Supermarket Transactions

This guide is intended to walk you through the process of creating a clustering model to group your data. We'll be using a dataset of 1000 supermarket transactions from stores in Myanmar. The aim of our project is to group these transactions in order to find patterns in the buying habits of the supermarket's customers.

Train and Predict

Train and Predict projects use machine learning technology to make predictions about the value of data. Prediction models you build in Graphext use open-source technology like logistic regression or CatBoost and are matched to your data and prediction task as you complete the project setup wizard.

When setting up a train and predict project, Graphext will ask you to set a target variable and to specify the factors used to understand relationships in your data. By learning how your factors relate to your existing target values, your prediction model will be able to make predictions about your target.

What are Prediction Models?

Prediction models are used to make predictions about the values of a target variable. They analyze existing relationships between a set of variables (factors) in your data and your target variable. The model stores this understanding as a mathematical representation, which can then be used to predict values in new data where the target value is not known.

Because prediction models can learn from your existing data to produce new values, they belong to the field of machine learning. But rather than being a specific algorithm, prediction is a task that many different algorithms attempt to solve - each with its own approach.
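
The learn-then-predict cycle described above can be sketched with a tiny logistic regression written in plain Python. This is a teaching sketch under stated assumptions, not the implementation Graphext uses: the data is made up, the two factors are assumed to be already standardized (a satisfaction score and a monthly hours score), and the target is 1 when an employee left.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and a bias with plain batch gradient descent."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            for j, v in enumerate(xi):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 when the modelled probability is at least 0.5, else 0."""
    return int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5)


# Rows where the target is known: [satisfaction, monthly_hours], both standardized.
X_known = [[1.0, -0.9], [0.8, -1.1], [1.2, -0.7], [-1.0, 1.0], [-1.1, 0.8], [-0.9, 1.2]]
y_known = [0, 0, 0, 1, 1, 1]  # 1 = left the company (hypothetical labels)

weights, bias = train_logistic(X_known, y_known)

# Rows where the target is not known yet.
new_rows = [[0.9, -1.0], [-1.0, 1.1]]
print([predict(weights, bias, row) for row in new_rows])  # [0, 1]
```

The model never sees the new rows during training. It only reuses the relationship it learned between the factors and the known target values.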

How to Build Train & Predict Projects?

Choosing Train and Predict as your analysis type will lead you to Graphext's project setup wizard. Here you can configure how your prediction model will make predictions, customize its parameters and choose how to validate your model.

You'll also be able to customize the appearance and layout of your project's network visualisation - an extremely useful tool for inspecting your data.

Selecting Targets & Factors

Setting your target and factors is an essential part of building prediction models with Graphext. Your target variable is the variable you want to make predictions about. Your factors will be the variables used to calculate relationships to your target variable. These relationships form the basis of how your model will calculate its predictions.

Move variables from your list of Other Variables to your list of Factors to continue setting up your project.

Target

Targets are the variables that your model will make predictions about. Their relationships to factors define how a model calculates a prediction. You can only set one target variable in Train and Predict projects.

Factors

Factors are the features of your data you want to use to calculate relationships to your target variable. They will also be used to form the clusters that Graphext creates in your project.

Other Variables

All variables not used as factors or targets. These won't affect your predictions, clusters or your network layout.

You can use most variable types as factors, but you can't use text values. Your choice of factors will have a huge impact on the predictions your model produces. As a general rule, try to avoid choosing multiple factors with values that are closely aligned to one another. Closely aligned factors carry largely the same information and can have an unduly heavy influence on your model's performance.
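
One simple way to spot closely aligned numeric factors before choosing them is to compute the Pearson correlation for every pair and flag the pairs above a threshold. The sketch below does this in plain Python. The 0.9 threshold, the function names and the sample columns are assumptions for illustration, not a rule taken from Graphext.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a constant list would divide by zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def closely_aligned(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| at or above threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical candidate factors; weekly_hours is just monthly_hours / 4.
columns = {
    "monthly_hours": [150, 160, 200, 250, 270],
    "weekly_hours": [37.5, 40.0, 50.0, 62.5, 67.5],
    "number_of_projects": [3, 5, 2, 4, 3],
}

print(closely_aligned(columns))  # [('monthly_hours', 'weekly_hours', 1.0)]
```

Here only one of `monthly_hours` and `weekly_hours` would be worth keeping as a factor, since the second adds no new information.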

Model Selection

Inside the Model Selection tab, you can choose which algorithm you want to use in your prediction project. The choices available to you here depend on whether or not the target you are predicting is a quantitative variable.

Different algorithms use different methods to achieve prediction tasks. Switching between them will change the parameters you need to complete in the Model Params tab. If you want to learn more about the difference between the prediction algorithms we have - check out our documentation on training models.

Model Params

The Model Params tab of your project setup lets you customize the other input values that your model will use. These change depending on your choice of algorithm, and by default Graphext fills in every parameter value for you.

Parameters can include the method used to handle missing values, the number of iterations your model will make or the depth of your model's decision trees.

If you'd like to change parameter values, be sure to understand how your changes will affect the model's performance. We'd recommend checking the documentation behind each algorithm if you don't know what each parameter refers to.
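
For a sense of what such parameters look like in practice, here is a hypothetical parameter set for a gradient-boosted tree model, written as a Python dictionary. The keys follow CatBoost's Python parameter names (`iterations`, `depth`, `learning_rate`, `nan_mode`). The labels and defaults shown in Graphext's Model Params tab are not given in this article and may differ, and the checks below are this sketch's own sanity checks rather than rules enforced by any library.

```python
# Hypothetical values, chosen only for illustration.
model_params = {
    "iterations": 500,     # how many boosting iterations (trees) to build
    "depth": 6,            # depth of each decision tree
    "learning_rate": 0.1,  # step size applied to each new tree
    "nan_mode": "Min",     # how missing numeric values are treated
}

ALLOWED_NAN_MODES = {"Forbidden", "Min", "Max"}


def check_params(params):
    """Return a list of problems found in a parameter set (empty if none)."""
    problems = []
    if params["iterations"] < 1:
        problems.append("iterations must be at least 1")
    if params["depth"] < 1:
        problems.append("depth must be at least 1")
    if not 0 < params["learning_rate"] <= 1:
        problems.append("learning_rate should be in (0, 1]")
    if params["nan_mode"] not in ALLOWED_NAN_MODES:
        problems.append("nan_mode must be one of " + ", ".join(sorted(ALLOWED_NAN_MODES)))
    return problems


print(check_params(model_params))  # []
```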

Validation

Inside the Validation tab, you can choose the scoring functions used to evaluate your model's performance. These will then appear inside of your project's Models panel.

It is also possible to choose how to split your dataset for the model's training and testing process. We recommend sticking with Cross-Validation unless you are working with an especially large dataset or need quick results.
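
The difference between cross-validation and a single train/test split can be sketched in a few lines of Python. The helpers below only produce row indices and are hypothetical names for illustration. Real tools usually shuffle the rows first, which this sketch leaves out to keep the output easy to follow.

```python
from typing import List, Tuple


def k_fold_indices(n_rows: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split row indices into k (train, test) pairs; each row is tested exactly once."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


def holdout_indices(n_rows: int, test_fraction: float = 0.2) -> Tuple[List[int], List[int]]:
    """Single train/test split: the last test_fraction of the rows is held out."""
    cut = n_rows - max(1, round(n_rows * test_fraction))
    return list(range(cut)), list(range(cut, n_rows))


for train, test in k_fold_indices(10, 5):
    print("train", train, "test", test)

print(holdout_indices(10))  # ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])
```

Cross-validation with five folds means training and scoring five models instead of one, which is why a single split can be the more practical choice on very large datasets or when results are needed quickly.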

Learn more about training and testing in our article on building prediction models in Graphext.

Network and Clusters

Your Network and Clusters tab lets you customize the network visualisation - Graph - that your project will create as well as configure the parameters for the clustering algorithm that Graphext will execute.

Choosing a dimensionality reduction algorithm will affect the way that clusters are calculated in your project as well as the layout of your network - see our article on Creating Graphs and Layouts for more information.

Switching between UMAP and k-NNG will change the algorithm used to cluster your data - and the parameters needed to do so!

If you've chosen UMAP and have a target variable, you'll be able to choose whether or not to use it as your UMAP target. Doing so means that your target variable will have a heavy influence on your network and clusters. Otherwise, your target variable will be ignored when your network and clusters are calculated.

Depending on which algorithm you choose to use to cluster your data - UMAP vs k-NNG - you'll be able to set parameters to customize how your clusters are created. You'll also be offered the chance to create HDBSCAN clusters - an alternative clustering algorithm that will add another set of clusters to your project.

What to Expect from Train and Predict Projects?

On the surface, Train and Predict projects look similar to Cluster projects. Unless you've chosen to set a UMAP target, your project's Graph displays data points (rows) linked together based on the similarity of their values for each of the factors that you selected. Even though you've built a prediction model, Graphext will still calculate clusters for you to inspect.

A crucial part of prediction tasks involves evaluating and interpreting your model's performance. Graphext will add the predictions that your model made to a new variable called Prediction. You'll also be able to see variables recording the error and the confusion matrix produced by the model.

On top of these variables, which you can explore across your project, Train and Predict projects also add a new project panel: Models. You can use this panel to understand how your model was built, check its accuracy scores and explore the feature importance of the factors that it used.
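
To see what an accuracy score and a confusion matrix summarise, the sketch below computes both from a column of actual target values and a column of predictions. The values are made up, and the helper names are hypothetical. It shows the general technique rather than how the Models panel computes its scores.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of rows where the prediction matches the actual value."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical target values and the matching Prediction values.
actual = ["left", "stayed", "stayed", "left", "stayed", "left"]
predicted = ["left", "stayed", "left", "left", "stayed", "stayed"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:<7} predicted={p:<7} rows={count}")

print(f"accuracy: {accuracy(actual, predicted):.2f}")  # accuracy: 0.67
```

The off-diagonal pairs, such as rows that were actually `stayed` but predicted as `left`, are the errors. Looking at which kind of error dominates often says more about a model than the accuracy figure alone.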

Read our guide to interpreting prediction models built with Graphext.

Use Case Guide | Predicting Employee Behaviour

This guide focuses on building a model to understand why employees left their jobs. We will be working with a public dataset of 14,999 employees that includes their Salary, their Monthly Hours, their Satisfaction Level and the Number of Projects they worked on.
