Models

Updated

August 3, 2023

Inside Graphext's Models analysis type, you'll find a series of out-of-the-box data science algorithms, including models for clustering and prediction tasks. We built Models to let you quickly customize and deploy powerful models without having to use Python or R.

Our models can be used to predict or cluster your data, to analyze the correlation between variables, or to study the co-occurrence of values.

What are Data Models?

Data modelling means understanding data through a mathematical representation of it. This representation is created by processing the variables in your data and capturing the relationships between them. Because a model captures relationships found in past data, it can be used to classify or make predictions about new data.

Using Models

The technology behind the analysis types in Models has either been built by our team or integrated from open-source machine learning projects. The idea is to give you quick access to powerful data science algorithms without having to write code.

Inside each Models analysis type, Graphext will ask you to answer a series of questions. Your answers configure the parameters of the model and connect it to your dataset. Executing your project then runs the model, and you'll see its output variables in your data.

For prediction models, use your project's Models panel to evaluate the performance of your model.
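As a rough illustration of what evaluating a prediction model involves, the sketch below computes accuracy: the share of rows where the predicted value matches the known one. The `accuracy` helper and the sample labels are hypothetical, and the metrics shown in Graphext's Models panel may differ.

```python
def accuracy(actual, predicted):
    """Share of rows where the predicted value equals the actual value."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one row is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical known target values and the values a model predicted for them.
actual_values = ["yes", "no", "yes", "yes", "no"]
predicted_values = ["yes", "no", "no", "yes", "no"]

score = accuracy(actual_values, predicted_values)
print(f"accuracy: {score:.2f}")
```

Accuracy is only one possible measure; for imbalanced targets, other metrics are usually more informative.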

Types of Models in Graphext

Cluster

Clustering projects use machine learning technology to group your data. When setting up a clustering model, Graphext will ask you to specify the variables to use as factors when calculating the similarity between rows in your data. By measuring how many values your rows share, your clustering model computes the strength of the relationship between them and uses this to group them into clusters.

How to Build Clustering Projects?

Choosing Cluster as your analysis type will lead you to Graphext's project setup wizard. Here you can configure how your clustering model will work, connect it to your dataset and customize the appearance and layout of your project's network visualisation - an extremely useful tool for inspecting your clusters.

Selecting Targets & Factors

Selecting factors is crucial to setting up a clustering project. Your factors are the variables used to calculate similarity between rows in your dataset. The more factor values two rows share, the more likely they are to be grouped together in the same cluster. Move variables from your list of Other Variables to your list of Factors to continue setting up your project.
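To make the idea of factor-based similarity concrete, here is a minimal sketch that scores two rows by the fraction of factor values they share. The rows, the factor names and the `shared_factor_similarity` helper are all hypothetical, and the similarity measure Graphext actually computes may be different.

```python
def shared_factor_similarity(row_a, row_b, factors):
    """Fraction of the chosen factors on which two rows have the same value."""
    if not factors:
        raise ValueError("at least one factor is required")
    matches = sum(1 for factor in factors if row_a[factor] == row_b[factor])
    return matches / len(factors)


# Hypothetical rows from a transactions dataset.
sample_rows = [
    {"city": "Yangon", "payment": "Cash", "member": True},
    {"city": "Yangon", "payment": "Cash", "member": False},
    {"city": "Mandalay", "payment": "Card", "member": False},
]
chosen_factors = ["city", "payment", "member"]

sim_first_second = shared_factor_similarity(sample_rows[0], sample_rows[1], chosen_factors)
sim_first_third = shared_factor_similarity(sample_rows[0], sample_rows[2], chosen_factors)
print(sim_first_second, sim_first_third)
```

In this toy measure each factor contributes equally, so with three factors a single matching value accounts for one third of the score.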

Target(s)

A pinned variable. This won't affect your clusters unless you use UMAP and choose to set this variable as your UMAP target inside the Network and Clusters tab.

Factors

The variables your model will use to calculate similarity.

Other Variables

All variables not used as factors or targets. These won't affect your clusters or your network layout.

You can use most variable types as factors, but you can't use text values. Clustering projects generally work best with several factors, but the more you add, the less impact each one will have on the definition of your clusters.

Targets aren't as important here, but set one if you want to pin a specific variable in your project. Note that setting a target variable can have a big influence on the layout of your network visualisation (the Graph). You can configure whether you want this to happen inside the Network and Clusters tab of the project setup wizard.

Network and Clusters

Your Network and Clusters tab lets you customize the network visualisation (the Graph) that your project will create and configure the parameters of the clustering algorithm that Graphext will run.

Choosing a dimensionality reduction algorithm will affect the way that clusters are calculated in your project as well as the layout of your network. See our article on Creating Graphs and Layouts for more information.

Switching between UMAP and k-NNG will change the algorithm used to cluster your data - and the parameters needed to do so!

If you've chosen UMAP and have a target variable, you'll be able to choose whether or not to use it as your UMAP target. If you do, your target variable will have a heavy influence on your network and clusters. If you don't, your target variable will be ignored when the network and clusters are built.

Depending on which algorithm you choose to cluster your data (UMAP or k-NNG), you'll be able to set parameters that control how your clusters are created. You'll also be offered the chance to create HDBSCAN clusters. HDBSCAN is an alternative clustering algorithm that adds another set of clusters to your project.
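For readers unfamiliar with k-NNG (k-nearest-neighbour graph), the sketch below shows the general idea: each row is linked to the k rows closest to it. It uses plain Euclidean distance on made-up two-dimensional points, and the `knn_graph` helper is hypothetical. It is not Graphext's implementation and leaves out the clustering step that runs on top of the graph.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    if not 0 < k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    graph = {}
    for i, point in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by Euclidean distance, breaking ties by index for determinism.
        others.sort(key=lambda j: (math.dist(point, points[j]), j))
        graph[i] = others[:k]
    return graph


# Two well-separated groups of hypothetical numeric rows.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.3),
]
graph = knn_graph(points, k=2)
for node, neighbours in graph.items():
    print(node, "->", neighbours)
```

With a small k, links stay inside each tight group, which is why the choice of parameters such as k changes the clusters that emerge.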

What to Expect from Clustering Projects?

The network visualisation (the Graph) created in clustering projects is a great place to start. Your network displays nodes, each representing a data point (a row), linked together based on the similarity of their values for each of the factors that you selected. Graphext will automatically color your Graph according to the clusters that were created. You can inspect the features of each cluster by selecting it from the Cluster variable sidebar chart.

The cluster that each node belongs to is added as a new categorical variable in your project. You can use this to filter or segment your data further within each cluster. Exporting the dataset from the Details panel gives you access to the new cluster-related variables that were created.
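Once the dataset has been exported, the cluster variable can be used like any other categorical column. The sketch below groups rows by cluster and summarises a numeric field for each group. The `cluster` and `total` column names, the sample values and the `summarise_by_cluster` helper are assumptions for illustration; the column names in a real export may differ.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_cluster(rows, value_field, cluster_field="cluster"):
    """Return the row count and mean of value_field for each cluster label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_field]].append(row[value_field])
    return {
        label: {"rows": len(values), "mean": mean(values)}
        for label, values in groups.items()
    }


# Hypothetical exported rows with a cluster label per row.
exported_rows = [
    {"total": 10.0, "cluster": "A"},
    {"total": 20.0, "cluster": "A"},
    {"total": 100.0, "cluster": "B"},
    {"total": 120.0, "cluster": "B"},
    {"total": 80.0, "cluster": "B"},
]

summary = summarise_by_cluster(exported_rows, "total")
for label, stats in sorted(summary.items()):
    print(label, stats)
```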

Use Case Guide | Clustering Supermarket Transactions

This guide is intended to walk you through the process of creating a clustering model to group your data. We'll be using a dataset of 1000 supermarket transactions from stores in Myanmar. The aim of our project is to group these transactions in order to find patterns in the buying habits of the supermarket's customers.

Train and Predict

Train and Predict projects use machine learning technology to predict the values of a variable in your data. Prediction models you build in Graphext use open-source technology like logistic regression or CatBoost and are matched to your data and prediction task as you complete the project setup wizard.

When setting up a Train and Predict project, Graphext will ask you to set a target variable and to specify the factors used to understand relationships in your data. By learning how your factors relate to your existing target values, your prediction model will be able to make predictions about your target.

What are Prediction Models?

Prediction models are used to predict the values of a target variable. They analyze existing relationships between a set of variables (factors) in your data and your target variable. What the model learns takes the form of a mathematical representation of those relationships, which can then be used to predict values in data where the target value is not known.

Because prediction models learn from your existing data to produce new values, they belong to the field of machine learning. But rather than being a specific algorithm, prediction is a task that many different algorithms attempt to solve, each taking a different approach.
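As one small example of such an algorithm, the sketch below fits a logistic regression with a single factor using plain gradient descent, then predicts the target for rows where it is unknown. The data, the helper names and the training settings are all hypothetical. This is a teaching sketch, not the way Graphext trains its models.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit p(y=1 | x) = sigmoid(w * (x - mean_x) + b) by batch gradient descent."""
    n = len(xs)
    mean_x = sum(xs) / n  # centre the factor so the updates stay well behaved
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * (x - mean_x) + b) - y
            grad_w += error * (x - mean_x)
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, mean_x


def predict_proba(model, x):
    """Predicted probability that the target equals 1 for factor value x."""
    w, b, mean_x = model
    return sigmoid(w * (x - mean_x) + b)


# Rows where the target is already known: (factor value, target).
known_rows = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
model = fit_logistic([x for x, _ in known_rows], [y for _, y in known_rows])

# Rows where the target is not known yet.
for x in (2.5, 7.5):
    print(x, round(predict_proba(model, x), 3))
```

Low factor values come out with a probability below 0.5 and high ones above it, mirroring the pattern in the rows the model learned from.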

How to Build Train & Predict Projects?

Choosing Train and Predict as your analysis type will lead you to Graphext's project setup wizard. Here you can configure how your prediction model will make predictions, customize its parameters and choose how to validate your model.

You'll also be able to customize the appearance and layout of your project's network visualisation - an extremely useful tool for inspecting your data.
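A common way to validate a prediction model is to hold back part of the data, train on the rest, and check predictions against the held-back rows. The sketch below shows such a hold-out split. The `train_test_split` helper, the 25% test fraction and the seed are assumptions for illustration, and the validation options offered in the setup wizard may work differently.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical row identifiers standing in for a dataset.
all_rows = list(range(20))
train_rows, test_rows = train_test_split(all_rows, test_fraction=0.25, seed=42)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Fixing the seed keeps the split reproducible, so two runs of the same project can be compared on the same held-back rows.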
