Import and Export Data

Somewhere, in a parallel universe ... there is only one format for data files. Here on Earth, Graphext supports a range of different file types.

File Formats

Graphext will use an uploaded file's extension to decide how to interpret the data it contains. The sections below list our supported file formats along with their recognized extensions.

Some details of how we import files into Graphext are shared between all or most of the supported file formats.

For example, in most cases Graphext will inspect the raw data to try and infer the correct data type for each column (categorical, numeric, date etc.). This is not the case for formats that already have well defined column types, such as Apache Arrow (.arr / .arrow), Parquet (.pqt / .parquet), and SPSS (.sav). In these cases, instead of inferring the data types, we simply map them to the Graphext equivalent (e.g. Arrow's dictionary type to Graphext's category type).
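The direct mapping for well-typed formats can be sketched as a simple lookup table. Note this is a hypothetical illustration only: the type names and the fallback behavior below are assumptions for the sketch, not Graphext's internal mapping.

```python
# Hypothetical sketch: well-typed formats (Arrow, Parquet, SPSS) carry their
# own column types, so instead of inferring types we map them directly.
# The names and fallback here are illustrative, not Graphext internals.
ARROW_TO_GRAPHEXT = {
    "dictionary": "category",  # e.g. Arrow's dictionary type -> category
    "int64": "number",
    "double": "number",
    "timestamp": "date",
    "string": "text",
}

def map_type(source_type):
    # Types without a known equivalent fall back to plain text here.
    return ARROW_TO_GRAPHEXT.get(source_type, "text")

print(map_type("dictionary"))  # 'category'
print(map_type("timestamp"))   # 'date'
```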

Additionally, Graphext will automatically detect and convert the following list of strings to missing values (equivalent to no value, or an empty cell): "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#INF", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#INF", "1.#INF000000", "1.#QNAN", "<NA>", "N/A", "n/a", "NA", "NAN", "NaN", "nan", "NULL", "Null", "null" and "" (the empty string). The conversion will apply only if the whole field corresponds to one of these strings, i.e. if any of these values occurs as a substring inside a longer text, it will be left unchanged.
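The whole-field rule can be sketched in a few lines. The helper name `to_missing` is hypothetical, used here only to illustrate the behavior; the string list mirrors the one documented above.

```python
# The set of strings treated as missing values (matches the list above).
NA_STRINGS = {
    "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#INF", "-1.#QNAN",
    "-NaN", "-nan", "1.#IND", "1.#INF", "1.#INF000000", "1.#QNAN",
    "<NA>", "N/A", "n/a", "NA", "NAN", "NaN", "nan",
    "NULL", "Null", "null", "",
}

def to_missing(field):
    """Return None (missing) only if the WHOLE field is an NA string."""
    return None if field in NA_STRINGS else field

print(to_missing("NULL"))         # whole field matches -> missing
print(to_missing("NULL island"))  # substring only -> left unchanged
```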

CSV

Extensions: .csv and .tsv

CSV files (comma-separated values) are delimited text files using commas to separate values. Each line of the file contains a data record, and each record contains one or more fields separated by commas. Graphext expects column names to be listed in the first line of the file. TSV files (.tsv) follow the same formatting rules as CSV files but use the tab character ("\t") to delimit fields.

Graphext treats CSV and TSV as equivalent, and will try to infer the delimiter and other details of the format from the file's content. Note that if the file is not formatted correctly, or uses unusual characters as delimiters, or to quote fields containing the delimiter, Graphext may fail to infer the correct format and consequently to read the file correctly. In particular, while Graphext will try to skip initial lines that don't appear to be part of actual tabular data, this cannot be guaranteed to always work correctly, and so we discourage use of such preambles in CSV files.
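As a rough analogy for this kind of inference, Python's standard `csv.Sniffer` guesses a dialect (delimiter, quoting) from a sample of the raw text. This is only an illustration of the general idea, not the logic Graphext itself uses:

```python
import csv
import io

# A semicolon-delimited sample; the sniffer must infer the delimiter
# from the content alone, much like Graphext does on upload.
sample = "field_1;field_2;field_3\naaa;bbb;ccc\nzzz;yyy;xxx\n"

dialect = csv.Sniffer().sniff(sample)
rows = list(csv.reader(io.StringIO(sample), dialect))
print(dialect.delimiter)  # ';'
print(rows[1])            # ['aaa', 'bbb', 'ccc']
```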

For the curious, CSV files are read using Graphext's open-source lector library, which is documented in some detail here.

For more details regarding the "specification" of CSV files see the section below.

Excel

Extensions: .xls and .xlsx

We support both XLS and XLSX files, the most commonly used formats for Microsoft Excel spreadsheets. These files store data in worksheets containing cells arranged as a grid of rows and columns. Like other file types, you can upload these directly to Graphext. If the file has several sheets, we will import the first sheet only.

To ensure Graphext reads your data correctly, we recommend that the sheet contain a single table only, that the table start in the first row and first column, and that the first row correspond to the table's column names. If the sheet contains comments, charts or other elements not part of the tabular data, the import may not work as expected.

JSON

Extensions: .json and .jsonl

JSON files ("JavaScript Object Notation") are primarily used for transmitting data between web applications and servers. They store data in a format similar to a JavaScript object or Python dictionary. You can upload JSON files directly to Graphext. We support normal JSON files (.json) as well as line-delimited JSON files (.jsonl). See below for details about the supported file structures (both row- and column-oriented).

Note that we currently do not support the import of nested data. Any column containing nested JSON data will be imported as plain strings. You will nevertheless be able to extract specific fields from nested data using one of our data transformation steps.
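For example, a cell imported as a plain JSON string can be parsed back and a nested field extracted after import. This is a generic sketch using Python's standard library, not a Graphext API:

```python
import json

# A cell that arrived as a plain string of nested JSON.
cell = '{"user": {"name": "Ada", "id": 7}, "tags": ["a", "b"]}'

# Parse the string and pick out the nested fields you need.
parsed = json.loads(cell)
print(parsed["user"]["name"])  # 'Ada'
print(parsed["tags"])          # ['a', 'b']
```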

Apache Arrow

Extensions: .arr and .arrow

You can also upload binary Apache Arrow files written in streaming or random access (batch) mode to Graphext. Arrow is a language-independent columnar memory format for flat and hierarchical data. It allows for very fast imports, since the format is compact, very efficient to read, and doesn't require inference of data types. It is also the format Graphext, and many other tools and databases, use internally to store their data.

As with JSON, we currently do not support the import of nested data. Any column containing nested data will be imported as plain strings. You will nevertheless be able to extract specific fields from nested data using one of our data transformation steps after import. Other types not currently supported will either be imported as categorical (text), or be represented by a column containing only missing values (to at least preserve the correct number of columns).

Apache Parquet

Extensions: .pqt and .parquet

Apache Parquet is a widely used open-source, column-oriented data file format designed for efficient data storage and retrieval. Importing it into Graphext is equivalent to importing Apache Arrow files, offering the same performance benefits (and with the same caveats regarding unsupported data types).

SPSS SAV

Extension: .sav

SAV files are part of the SPSS Statistics file format family. Information in a SAV file is divided into a header, a sequence of tagged 'records' comprising the file's dictionary, and finally the data itself.

Note that while we try our best to import such files, SAV is a proprietary format with no official documentation, and as such is not well supported in the greater data ecosystem. If you are able to export your data in another of our supported formats we would recommend that instead.

GML & GraphML

Extensions: .gml and .graphml

The Graph Modelling Language (.gml) and GraphML (.graphml) formats let you import data already representing a graph or network. They can be exported from tools like Gephi or Cytoscape, or libraries such as igraph or NetworkX. We will be able to read those files as long as igraph is able to read them. This means some advanced features, for example hypergraphs or ports, are not supported.

ZIP archives

Extension: .zip

ZIP files allow you to upload and concatenate multiple dataset files at once. The archive should contain files in any of the supported formats mentioned above, and all files should share the same schema (column names and data types). Graphext will concatenate all contained datasets vertically, i.e. by appending their rows, and so the columns must be consistent across the different files.
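The concatenation behavior can be sketched with Python's standard library. This is an illustration of the documented behavior (append rows from files sharing one schema), not Graphext's actual importer:

```python
import csv
import io
import zipfile

# Build a small in-memory ZIP with two CSVs sharing the same schema.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("part1.csv", "field_1,field_2\naaa,bbb\n")
    zf.writestr("part2.csv", "field_1,field_2\nccc,ddd\n")

# Concatenate by appending each file's rows under one shared header.
header = None
rows = []
with zipfile.ZipFile(buf) as zf:
    for name in sorted(zf.namelist()):
        text = zf.read(name).decode("utf-8")
        records = list(csv.reader(io.StringIO(text)))
        header = records[0]       # schemas must match across files
        rows.extend(records[1:])  # append this file's data rows

print(header)  # ['field_1', 'field_2']
print(rows)    # [['aaa', 'bbb'], ['ccc', 'ddd']]
```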


File Structures

Text-like file formats, like CSV and JSON, may be subject to specific restrictions on how the data is structured inside the file.

CSV

While there is no "official" CSV standard, most implementations follow some common rules. We recommend adhering to the following guidelines adapted from the Internet Engineering Task Force, which you may also access directly here.

  1. The first line in the file is a header line with the same format as normal record lines. This header contains names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. For example:
 field_1,field_2,field_3
  aaa,bbb,ccc
  zzz,yyy,xxx

  2. Each actual data record is located on a new line, delimited by a line break.
  3. The last record in the file may or may not have an ending line break.
  4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and will not be ignored. The last field in the record must not be followed by a comma. For example:
Good
 field_1,field_2,field_3
  aaa,bbb,ccc
  zzz,yyy,xxx
Bad
 field_1,field_2,field_3
  aaa,bbb,ccc,
  zzz,yyy,xxx,

  5. Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
 "aaa","bbb","ccc"
  zzz,yyy,xxx

  6. Fields containing line breaks, double quotes, or commas must be enclosed in double quotes. For example:
 "aaa","b
  bb","ccc"
  zzz,yyy,xxx

  7. If double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote. For example:
 "aaa","He said ""Hi!""","ccc"
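Python's standard `csv.writer` applies these quoting rules automatically, so you rarely need to escape fields by hand when producing a CSV:

```python
import csv
import io

# The writer quotes fields containing delimiters or quotes, and doubles
# any embedded double quotes, following the rules listed above.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["aaa", 'He said "Hi!"', "c,cc"])
print(buf.getvalue())  # aaa,"He said ""Hi!""","c,cc"
```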

JSON

We support three different JSON (JavaScript Object Notation) formats, which will be detected automatically by inspecting the beginning of a .json file.

Json Lines

In the JSON lines format, each line in the file is a JSON object representing a dataset row. The object in each row contains field names as keys and the corresponding field's value. For example:

 {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"}
  {"field_1": "zzz", "field_2": "yyy", "field_3": "xxx"}

For further details see the official JSON Lines documentation.
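Producing or reading this format only requires serializing one object per line, as in this standard-library sketch:

```python
import json

rows = [
    {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
    {"field_1": "zzz", "field_2": "yyy", "field_3": "xxx"},
]

# Write: one JSON object per line.
jsonl = "\n".join(json.dumps(r) for r in rows)

# Read back: parse each non-empty line independently.
parsed = [json.loads(line) for line in jsonl.splitlines()]
print(parsed[1]["field_2"])  # 'yyy'
```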

List of Records

In this format the file contains a JSON list of objects, where each object contains field names and values as key-value pairs. For example:

 [
    {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
    {"field_1": "zzz", "field_2": "yyy", "field_3": "xxx"}
  ]

Notice how the first level represents a list, and that objects within this list are separated by a comma. Line breaks and spaces between fields are not required, so the following is an equivalent but more compact format that is equally valid:

 [{"field_1":"aaa","field_2":"bbb","field_3":"ccc"},{"field_1":"zzz","field_2":"yyy","field_3":"xxx"}]
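A list-of-records file parses into a Python list of dictionaries in one call, compact or not:

```python
import json

# The compact form from above; whitespace between fields is irrelevant.
compact = '[{"field_1":"aaa","field_2":"bbb"},{"field_1":"zzz","field_2":"yyy"}]'

records = json.loads(compact)
print(len(records))           # 2
print(records[0]["field_1"])  # 'aaa'
```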

Object of Columns

The last supported JSON format is column-oriented. In this format the file contains a JSON object at the top level. Its key-value pairs map each field/column name to a JSON object of index-value pairs, one per row. For example:

 {
    "field_1": {"0": "aaa", "1": "zzz"},
    "field_2": {"0": "bbb", "1": "yyy"},
    "field_3": {"0": "ccc", "1": "xxx"}
 }

In this format, line breaks and spaces between fields are also ignored, and so the following is equivalent:

 {"field_1":{"0":"aaa","1":"zzz"},"field_2":{"0":"bbb","1":"yyy"},"field_3":{"0":"ccc","1":"xxx"}}
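Converting this column-oriented layout back into row-oriented records is a small exercise with the standard library (note that JSON object keys are always strings, so the row indices come back as "0", "1", ...):

```python
import json

data = json.loads(
    '{"field_1":{"0":"aaa","1":"zzz"},"field_2":{"0":"bbb","1":"yyy"}}'
)

# Collect the row indices from any one column, then build one record
# per index by pulling that index's value from every column.
indices = sorted(next(iter(data.values())))
rows = [{col: vals[i] for col, vals in data.items()} for i in indices]
print(rows)  # [{'field_1': 'aaa', 'field_2': 'bbb'}, {'field_1': 'zzz', 'field_2': 'yyy'}]
```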

A Note on Automatic Detection

As can be seen in the examples, each JSON format is easily identified by inspecting the first few lines of the file. We use the following heuristic:

  1. If the file starts with [ - assume the List of Records format.
  2. If the file contains more than one line, and each of the first two lines starts with { and ends with } - assume the JSON Lines format.
  3. In all other cases - assume the Object of Columns format.
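The three steps above can be sketched directly. The function name and return labels are illustrative only; this is the documented heuristic, not Graphext's actual code:

```python
def detect_json_format(text):
    """A sketch of the three-step detection heuristic described above."""
    # 1. Starts with '[' -> List of Records.
    if text.lstrip().startswith("["):
        return "records"
    # 2. More than one line, first two lines each look like an object
    #    -> JSON Lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) > 1 and all(
        ln.startswith("{") and ln.endswith("}") for ln in lines[:2]
    ):
        return "jsonlines"
    # 3. Everything else -> Object of Columns.
    return "columns"

print(detect_json_format('[{"a": 1}]'))          # 'records'
print(detect_json_format('{"a": 1}\n{"a": 2}'))  # 'jsonlines'
print(detect_json_format('{"a": {"0": 1}}'))     # 'columns'
```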

Need Something Different?

We know that data isn't always clean and simple.
Have a look through these topics if you can't see what you are looking for.