You can use the code editor to build projects using datasets that you have stored in your Graphext workspace.
The code editor lets you assemble the code that executes your project, giving you finer control over the configuration of your network and the transformations applied to your dataset.
"Type a few lines of code, you create an organism."
- Richard Powers
Projects are built from nothing more than a sequence of steps. Steps are functions that accept some data and output new, transformed or enriched data. A recipe can have an arbitrary number of steps and can generate an arbitrary number of intermediate datasets, but its output must always be a single dataset that serves as the basis for your project's network visualization.
You can open the code editor at any point after selecting a dataset and before executing the project. As you choose options from the project sidebar, functions are added to your code editor.
To access the code editor, select a dataset from your Graphext workspace then click Open text editor from the bottom of the right sidebar window that appears.
When you open the code editor, you will already have chosen a dataset that serves as the main input for the recipe. This dataset is made available by default under the name ds, so the simplest possible recipe is written as follows.
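A minimal sketch of such a recipe, using the default dataset name ds:

```
create_project(ds)
```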
This builds a project with a single step called create_project, which accepts a dataset as input and has no output. This is a special case: in practice you'll almost always want to transform or enrich your dataset, so adding further steps is usually necessary.
Steps are functions that are applied to your data or your project, each affecting one or the other in a specific way.
In general, the syntax for adding a step is very simple and always written as follows.
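In schematic form, the syntax looks like this (step_name, the inputs and the parameter are placeholders, not real step or column names):

```
step_name(input_1, input_2, {"param": value}) -> (output_1, output_2)
```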
The inputs may be either specific columns of a dataset, a dataset itself, or a model. Details about the expected types of inputs depend on the specific step in question. Documentation on the different steps available in the code editor is available here.
As the step's last argument in parentheses, you can provide parameters that configure how the step transforms the input data. Finally, in another set of parentheses (and separated by ->), you provide names for the outputs the step will generate. Again, the outputs may be one or more columns or datasets.
Read the documentation on Graphext steps to see which steps you can use to build projects.
To differentiate between input datasets and columns, column names need to be prefixed with the name of the dataset they belong to, while datasets can be referred to by name alone. In other words, ds refers to the dataset with the name ds, and to pick out a specific column you'd use either ds.my_column or ds["my_column"]. The two forms are generally interchangeable, but the latter is required if a column name contains spaces.
A simple step that splits the texts in a given column at the first comma might be written as follows.
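A hedged reconstruction of such a step (the step name split and the input column ds.text are assumptions; check the steps documentation for the exact signature):

```
split(ds.text, {"pattern": ","}) -> (ds.left_part, ds.right_part)
```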
The result of the split will be two new columns named left_part and right_part in the dataset ds. The columns resulting from the split will now be included in the project that is created.
Usually, when you start typing the beginning of a step's name in the code editor, the rest of the step's signature will autocomplete, including the default names of any outputs it creates. You only need to change these names if you don't like the defaults or if they clash with outputs you have already generated.
Each step's individual documentation will describe its valid parameters. Invalid parameters will be highlighted by the Code Editor.
In this example, "pattern" is the parameter's name and "," is its value. In general, all parameter names must be quoted strings, while values may be quoted strings, numbers, lists of numbers or strings, or another nested object in curly braces following the same rules.
In the context of building a project, steps combine in a sequence to help construct different stages of the project. Steps can be grouped together with regard to the stage of the project setup that they should be used in.
For instance, steps related to both filtering and enrichment belong to the stage of the project setup wherein your dataset is modified. The stages of setting up a project are as follows:
During this stage, you will modify your original dataset by adding columns, training models or enriching your data. We classify these functions as: Transform; Enrichment; Aggregate, Join & Combine; Filtering & Sampling; Embedding; and Models & Inference.
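For illustration, an enrichment step in this stage might look like the following sketch (the step name infer_language and the column names are assumptions; consult the steps documentation for the real signature):

```
infer_language(ds.text) -> (ds.language)
```

This would add a new language column to ds containing the detected language of each text.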
Some steps create a network from your dataset. These steps output a new dataset: links. The links dataset has three columns: Source, Target and Weight. Each row in this dataset represents a link in your Graph, the network visualization.
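A sketch of what such a step might look like (the step name link_embeddings and the embedding column are assumptions for illustration only):

```
link_embeddings(ds.embedding) -> (links)
```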
Next follows a choice of steps used to construct clusters in your dataset.
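A hypothetical clustering step might be sketched as follows (the step name cluster_network and the output column are assumptions; see the steps documentation for the actual options):

```
cluster_network(links) -> (ds.cluster)
```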
In this stage of building your project, a series of steps creates the coordinates that map your data points onto the Graph. Each row in your data is given an x and a y coordinate. Importantly, these output variables must be named x and y.
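A sketch of a layout step (the step name layout_network is an assumption; the output names x and y, however, are required as noted above):

```
layout_network(links) -> (ds.x, ds.y)
```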
This is the final stage of setting up your project. There are two possible steps involved in this stage of execution:
This stage is executed after Graphext has built your project. It configures aspects of your project such as the size, labels and color of nodes, as well as any other customizations you made while building the project, such as the order of variables. Steps in this stage have no output.
This stage adds automatic insights to your project. As with the previous stage, steps here have no output.
This final stage is not always required. The steps here write your dataset output to a database. This is very useful for projects representing large amounts of data, and can help speed up loading times.
The order of your steps should follow the structure of the stages set out above.
While it is not always essential to follow this structure, working with datasets or dataset variables as inputs and outputs of your steps can quickly get messy if you aren't careful about the order of your steps.
For instance, the first snippet below will embed the original dataset, and only then add a new column with the languages of its texts.
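A reconstructed sketch of this first ordering (the step names embed_dataset and infer_language and the column names are assumptions):

```
embed_dataset(ds) -> (ds.embedding)
infer_language(ds.text) -> (ds.language)
```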
The second version, however, will use the original dataset together with the language column to calculate the embeddings.
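A reconstructed sketch of the second ordering, in which the language column exists before the embedding step runs (again, step and column names are assumptions):

```
infer_language(ds.text) -> (ds.language)
embed_dataset(ds) -> (ds.embedding)
```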
The Code Editor will highlight any errors in your code. It does this by continuously validating the steps that you write.
Code validation checks:
If the transpiler finds errors, the recipe is not valid and you will not be able to execute the project. To find out what the problem is, open the Code Editor, look for the mistake (highlighted in red) and hover over it. A popup will give you more information about the error.
We know that data isn't always clean and simple.
Have a look through these topics if you can't find what you are looking for.