There are two principal approaches to constructing network maps, or embeddings, in Graphext:
Dimensionality reduction using UMAP is the faster and more precise method, and should be preferred unless the focus is on analysis of longish natural language texts. In the latter case, the k-NNG method currently provides a measure of text similarity that usually works a little better. Also, if the dataset contains a broad mix of quantitative and categorical variables, and UMAP dimensionality reduction does not generate the desired result, k-NNG may be worth trying, as it sometimes achieves a better balance between columns of different type.
The k-NNG approach (k-Nearest Neighbor Graph) constructs a graph (network) where each point in a dataset is connected to the k other points most similar to it. On its own, this simply constructs an abstract graph, consisting of nodes that represent the rows in the dataset, and links connecting them if they are similar enough.
In order to be able to view this graph on the (2-dimensional) screen, the nodes need to be assigned coordinates in the x/y plane. This is usually referred to as calculating a layout for the graph. The method employed in Graphext to do this is a force-based layout, which aims to place points such that the length of links reflects the similarity between them, while also aiming to avoid links crossing each other. We have chosen the forceAtlas2 layout algorithm as a sensible default, as it tends to work well enough across a variety of use cases.
The resulting graph, laid out on the 2-dimensional screen, can be thought of as an “embedding”. We have effectively embedded the original dataset in the space of x/y coordinates such that the positions of nodes reflect the similarity or dissimilarity between their corresponding dataset rows. While certain structures, like clusters of related points, are often immediately visible to the naked eye in such graphs, their identification can further be automated using clustering algorithms (in the context of graphs/networks also referred to as community detection). In Graphext, we use the Louvain algorithm as a sensible default to detect clusters in graphs, as it produces good results in most cases. It also has the advantage of being fast to calculate, and so we can offer to re-calculate it on-the-fly from within Graphext’s UI (using the “Automatic segmentation” feature).
Construction of a k-NN graph is simple if one is able to measure the similarity between pairs of data points. E.g. if a dataset only contains numeric columns, and each data point is thus a simple list of numbers (vector), we can simply use the Euclidean distance to measure the similarity between two points (the straight-line distance between points in what we intuitively consider the normal 2d, 3d, or n-dimensional space, such as the x/y plane we know from school).
This approach is not applicable, however, if the dataset contains other types of data, such as categorical columns, dates, free-form natural language text and so on, since now we cannot easily interpret points as living in such a simple space (e.g. what would we be the coordinates of a categorical variable?). This means we need another measure of similarity that can be applied to datasets having mixed types of data. In Graphext, our choice is a custom measure that is somewhat similar to what’s known as the Gower distance. The idea is simple: to measure the similarity between two data points, we apply a different measure for each type of column. E.g., for each numerical column we may apply some form of normalized difference between the quantitative values; for categorical columns we may simply check if the two points belong to the same category (distance = 0) or not (distance = 1); for text columns we may calculate similarity in terms of common word usage, etc. The final similarity then is simply the weighted average of all these individual column-wise measures. Since this provides a way to compare two points with arbitrary data types, we can then simply calculate the distances between all pairs, and use these to generate the k-NN graph.
We use our own (fast, C++) implementation of the force-based layout algorithm forceAtlas2. The basic idea of this algorithm is that nodes generally repel each other, while links act as springs applying an attracting force between connected nodes. The algorithm applies these forces to the nodes iteratively until a stable configuration is found in which the forces all balance out and the nodes stop moving (mostly). If the result is not ideal for a given dataset, e.g. nodes being too close together, or too far apart, the algorithm has a number of parameters that can be changed to create aesthetically different results (never changing the actual connectivity, and so also having no influence whatsoever on any clustering applied to the network, for example).
For graph clustering we also have our own, optimized implementation of the Louvain algorithm (written in Rust, and compiled to also work in the browser, i.e. from the Graphext UI). Louvain iteratively assigns nodes to clusters such as to maximize a measure called modularity, which is the relative density of edges within clusters with respect to edges between clusters. In other words, it tries to assign nodes to clusters such that there are many more connections between the nodes in the same cluster than there are between nodes in different clusters. As with the forceAtlas2 layout, the default configuration should generate decent results in most cases. However, a somewhat common occurrence is the generation of either too many or too few clusters. This can easily be addressed, however, as the algorithm has a resolution parameter that influences the number of clusters generated. For further details see the documentation of the corresponding cluster_network step.
Since it doesn’t scale very well for larger datasets, we recommend the k-NNG method principally for analyses with a focus on longish natural language text, where the current Tf-Idf similarity tends to provide slightly better results than the word embeddings used in the dimensionality reduction approach. Also, if the dataset contains a broad mix of quantitative and categorical columns, and dimensionality reduction does not provide the desired result, it may be worth trying with k-NNG, as sometimes it achieves a better balance between columns of different data types.
For each step in the k-NNG as described above, there is a corresponding method in Graphext to be used in recipes. A partial recipe may look like this, for example:
Here, the step link_similar_rows creates the abstract graph. Given an input dataset, the step’s output is a new dataset, links, in which each row is a link connecting one row to another. It has 3 columns: source, target and weight. The first two refer to the ids of the rows that are connected by the link (row numbers), while the weight column contains the numerical value of the similarity between the two rows.
To calculate node x/y positions we use the step layout_network. This takes the just generated links dataset, and adds the x and y columns to the original rows, such that each row/node now has an associated position.
Finally, we detect clusters in the graph with cluster_network. This step also receives the links dataset as input, and returns a simple categorical column containing the cluster labels.
Details may vary depending on the version of Graphext (if used on-premise), and the particular choice of analysis selected initially in the Wizard, such as whether you’ve chosen to do a specific analysis of customers, say, or a generic segmentation (Other); but in most cases you should be able to select the option of doing a simple Segmentation (or Segment your data), which without further qualification means a clustering using the k-NNG approach (selecting Segmentation using UMAP & HDBSCAN, on the other hand, would correspond to the dimensionality reduction approach).
In newer versions of the Wizard, you may also find a Network and Clusters panel in the Advanced Settings section. Here you should be able to select k-NNG & Louvain as the Dimensionality reduction and clustering algorithms option. This will automatically configure the recipe to include the steps as shown above (or similar).
As mentioned, the k-NNG approach is inherently slower than the dimensionality reduction described in the next section. The only way to reduce execution time somewhat is to use fewer variables, i.e. passing a smaller subset of columns to the link_similar_rows step.
Our second approach to embedding data points is based on dimensionality reduction, and more specifically (at least for now), the UMAP algorithm. The idea, in this case, is to find a new way to represent our data using a smallish number of numerical columns (e.g. x and y only), but such that distances measured on our dimensionally reduced dataset maintain the same neighborhood structure as the original data. In other words, we want to find x/y positions for each data point, such that rows in the original dataset that are very similar will be found close to each other in terms of their x/y positions; and rows that are very dissimilar in the original dataset should be farther away from each other in terms of x and y. UMAP is a mathematically sound and principled approach to do just that.
Thus, when using dimensionality reduction (whether UMAP or other algorithms), we do not explicitly create a graph of nearest neighbors, and then calculate a layout. We rather consider the dimensionally reduced dataset itself our layout. Or, to state the same in reverse, to layout our dataset, we directly reduce our dataset to two dimensions (using UMAP or other algorithms) and interpret these as the data points’ x and y positions.
Note that the fewer dimensions we retain, the more information we lose from the original data. When calculating the 2-dimensional layout, the goal therefore is to at least capture the most salient structures in the dataset (e.g. clusters, the relative similarity between different clusters, local differences within clusters etc.). However, we do not have to restrict ourselves to always reducing our data to two dimensions only. For any purpose other than visualizing the dataset on the screen, it may make sense to keep more than two dimensions (and therefore more of the original information). This applies e.g. when trying to identify clusters. It may be, for example, that certain clusters can only be separated well in 3 dimensions (as an analogy, imagine e.g. one cloud behind another, and being able to distinguish them only from certain angles). In graphext, we therefore often embed our dataset twice using dimensionality reduction:
Since the result of a dimensionality reduction is not a graph, but in essence simply a new dataset with fewer (numerical) columns, we cluster the reduced data not using graph-based community detection, but using “ordinary” clustering algorithms. By default this is HDBSCAN in Graphext, but in principle it could also be k-means, agglomerative clustering etc. (see scikit-learn for an overview of possibilities).
The vast majority of machine learning algorithms, including dimensionality reduction techniques like UMAP and cluster algorithms such as HDBSCAN, require their input data to be purely numerical and without any missing data. Few algorithms provide support for other data types (such as categorical, date and time etc.), for a mix of different types in the same dataset, or support missing data out of the box.
Graphext therefore provides a way to convert a dataset with arbitrary data types into a purely numerical dataset without missing data. It does so by defining for each possible type of input column a transformation from non-numeric to numeric values. As an example, ordered categorical variables (ordinals) such as the day of week, may be converted into a series of numbers (0..7). Non-ordered categorical variables of low-cardinality (containing few different categories) may be expanded into multiple new columns of 0s and 1s, indicating whether each row belongs to a specific category or not (one-hot encoding). Similar transformations are applied to dates, multivalued categoricals etc. Missing values (NaNs) are imputed, i.e. replaced with an appropriate value from the corresponding column (e.g. the median in a quantitative column). In addition, a new column of 0s and 1s is added, indicating whether the original column had a missing value or not.
All this is done automatically behind the scenes whenever you pass a dataset to an algorithm that expects its input to be numerical and complete. While this is convenient (since otherwise essentially no machine learning algorithm would be applicable to real-world datasets), it is currently not configurable, and so in specific, and hopefully rare, cases the way missing data is handled may not be ideal, for example. Note, however, that most of the time the machine learning algorithms we apply to your data are not supposed to give you the most precise predictions possible. Rather, the goal is to help you more easily explore your dataset and generate hypotheses (which you may then want to confirm or not using a more targeted approach).
For many clustering algorithms it is difficult to determine in an intuitive way how many clusters should be found. In k-means, for example, you need to know beforehand how many clusters you’d like to identify, while many times such information is not available a priori. Other algorithms (such as Louvain, described above), may provide a resolution parameter, but again, it is not usually obvious how to select it. HDBSCAN tries to address this problem with a more intuitive configuration. Its main parameter influencing the resulting number of clusters is min_cluster_size, which as the name implies, is the minimum number of data points any group should have to be considered a cluster. Intuitively, HDBSCAN will (hierarchically) identify clusters of densely packed points, where groups smaller than min_cluster_size are considered noise if not part of a larger cluster in the hierarchy.
As described above, although no explicit nearest neighbour graph is needed in principle to use the dimensionality reduction approach (neither for creating a layout, nor for clustering), Graphext will usually derive one anyway from the dimensionally reduced embeddings. Having the connectivity between data points available allows for easier navigation between neighbours in the UI, for example, and allows for the use of Louvain to cluster all or parts of the dataset on the fly (“automatic segmentation”). Note that the step to create this connectivity is not the same as the k-NNG approach however. Since we already have our dimensionally reduced embeddings (e.g. as the output of the embed_dataset step), we can use very fast algorithms to find each row’s nearest neighbour (we use Spotify’s Annoy here), and so this extra step does not add significant computational overhead.
Ignoring recipes provided out-of-the box in Graphext, we recommend using the dimensionality reduction approach with UMAP as default in most cases, more so when datasets are large, or when little is known about the distributions in the data. UMAP is considerably faster, and generally more precise in maintaining both local and global similarity between data points. When analyzing longish texts, however, the text similarity measure used by the k-NNG approach works a little better sometimes. Also, when working with a lot of mixed variables types (numerical, categorical etc.), the k-NNG may be worth trying, since in some cases it seems to achieve a better balance between the different types.
Again, each step in the flowchart for dimensionality reduction corresponds to a step/method in Graphext recipes.
Here, layout_network is used to reduce the input dataset to 2 dimensions (x/y) using UMAP. Next, embed_dataset employs UMAP again to create 10-dimensional embeddings. In the resulting output column, each original dataset row is now represented by a vector of 10 numbers.
Having embedded the dataset in 10 dimensions, we then use cluster_embeddings with the HDBSCAN algorithm by default to identify clusters in the dataset. Finally, link_embeddings is used to calculate for each embedded row its k nearest neighbours. The output is again a dataset of links indicating which nodes should be connected as well as how similar they are quantitatively.
Again, details may vary depending on the version of Graphext (if used on-premise), and the particular choice of analysis selected initially in the Wizard; but if the type of analysis supports the use of dimensionality reduction, then you should be able to select something like Segmentation using UMAP & HDBSCAN as an option from the first set of questions (if the segmentation option doesn’t specify any concrete method, the k-NNG approach is used instead).
In newer versions of Graphext, you may also find a Network and Clusters panel in the Advanced Settings section of the Wizard. Here you should be able to select UMAP & HDBSCAN as the Dimensionality reduction and clustering algorithms option. This will automatically configure the recipe to include the steps as shown above (or similar).
Although UMAP is already pretty fast (when compare to related algorithms like t-SNE), for very large dataset there are a number of tweaks that increase performance at the cost of a possible (but not necessary) reduction in precision:
In some cases, if clusters are sufficiently different from each other (if the corresponding graph is essentially disconnected), UMAP may push the clusters rather far apart from each other. Visually this means individual points may become small in the UI, and although this doesn’t affect the “quality” of the embedding, it may make navigating the resulting layout less convenient. In this case you may try to change the parameter n_epochs (which is 200 by default for large datasets and 500 for smaller ones). Setting this to smaller values means the algorithm will have less time to push clusters apart, but also less time to converge on the “optimal” solution.
This is usually a sign that you’ve used only (or principally) input columns that are (nearly) unique, i.e. have a different value in each row. Typical cases are e.g. the use of numeric customer IDs or dates, where each row has a different ID or date. If this is the only information the algorithm has to go on, then each row will essentially be connected to the next higher and lower ID or date, and the result will be a more or less tangled thread of data points with values increasing or decreasing from one end of the thread to the other. Simply include more informative features in the corresponding steps to create a proper clustering.
Since the embedding of a dataset, and its clustering, is usually done unsupervised, i.e. without knowing the ground truth for at least part of the data, there is no direct way to confirm whether a particular clustering is “correct”, or even any “good” or “bad”. In fact, without a ground truth, or some other context, any clustering whatsoever is as valid as any other. What matters in this case, usually, is whether the clustering is useful. Being useful here could mean, e.g., that different clusters of customers correspond to my intuition about the groups of customers I expect; or that the clusters allow me to predict, understand or classify customer behaviour more easily when compared to not distinguishing between groups of them.
Fortunately, for the purpose of evaluating a particular approach or algorithm, we can make use of labelled data (where the ground truth is known) to get an idea of its usefulness. The idea here is simply to see whether the clusters identified automatically and from the features of the dataset alone, correspond to the labels a human has assigned to data points manually. E.g., the famous Iris dataset, contains information about 3 different species of Iris flowers. It provides 4 variables that describe the shape of their petals and stalks, as well as a 5th column indicating the species (versicolor, setosa or virginica). To see whether any of Graphext’s embedding and clustering approach actually “works”, we simply embed and cluster the dataset using the four feature variables only, and then compare whether the found clusters of flowers correspond to the different species as identified in the label column. In other words, we’d expect all flowers in the same, automatically identified, cluster to be of the same species, even though we haven’t used information about species in the clustering.
Of course, this will only work if the features provided in the dataset correlate to a sufficient degree with the labels. If the measured features of the flowers didn’t tell us anything about their species, then a clustering wouldn’t be able to “recover” the species either. Fortunately, there are many datasets used to benchmark classification algorithms, which we can simply re-purpose for the task of cluster evaluation. Doing this for a sufficiently diverse number of datasets (in terms of size, complexity, data types etc.), we can convince ourselves that, and to which degree, each approach recovers the original labels.
We include and share a number of projects doing just this in the Graphext team Validation (you may need to ask us to add you to the team to be able to access it). All projects here are based on typical datasets used to benchmark classification algorithms. As such, each contains the human-annotated ground-truth labels in a categorical column, which we’ve pinned to the top of the left column in the Graphext UI. Just below on the left, you’ll then find the clusters we’ve identified without taking into account the true labels.
To easily compare the two, you can e.g. color the nodes by the true labels and confirm that each true class corresponds to one (or more) clusters (sometimes automatic clusters will make a finer or coarser distinction between groups of similar data points). E.g., in the Iris project, with color and node title indicating the true species, the following figure confirms that automatic clusters almost perfectly distinguish the 3 different species:
Conversely, you could select each automatic cluster and confirm the “purity” of the true labels of the corresponding nodes. In Figure 4, for example, we have selected cluster 2, with true labels being ~99% “versicolor” and a single node corresponding to an unusually small example of virginica; thus confirming that the cluster (like the others) does indeed correspond to a particular species of flower.
We have included a variety of different datasets and projects in the Validation team. For each dataset we have created at least one project using the UMAP approach, and one using the k-NNG approach, with the name indicating which was used. The ground truth labels are always to be found on the top-left, followed by the automatic clusters. See the box below for a list of current validation projects (at the time of writing).
Note that depending on the size of the dataset we may have adjusted a few parameters in the projects’ recipes when compared to their defaults (usually considering fewer/more neighbouring nodes in an algorithm, or adjusting the minimum size of clusters). Also, when processing texts or images, for example, there may be slight variations from the example recipes described above. You can always select “Recreate project” from the sandwich menu in the top left corner of a project view to inspect the recipe and how each algorithm was configured.
We have given an overview of the two principal approaches to laying out a dataset in Graphext:
We have compared the advantages and disadvantages of each approach and explained when to use them. Finally, we have described a method to evaluate how well these methods work using human-annotated datasets, in which the ground truth can be compared with automatically identified clusters. A number of such datasets and corresponding projects using each approach are publicly available in Graphext.
One of Graphext’s most differentiating features is the way we allow you to see all your data at a glance in a single “topological map”1, or graph. The Graph view intuitively highlights the local and global structures in your data by mapping rows that are similar to each other to close-by points on the screen, and conversely, mapping dissimilar rows to more distant points. This lets you immediately identify clusters of data points that are more similar amongst themselves than they are to others, detect outliers that are far away from the bulk of the data, and so on.
We know that data isn't always clean and simple.
Have a look through these topics if you can't see what you are looking for.