Supported File Types
Updated
December 16, 2022
Somewhere, in a parallel universe ... there is only one format for data files. Here on Earth, Graphext supports a range of different file types.
File Formats
For most formats Graphext will inspect the raw data to try and infer the correct type for each column (categorical, numeric, date etc.). This is not the case only for SPSS (.sav) and Arrow (.arr) files, which already come with reliable type information.
Additionally, Graphext will automatically detect and convert the following list of strings to missing values (equivalent to no value, or and empty cell): "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#INF", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#INF", "1.#INF000000", "1.#QNAN", "<NA>", "N/A", "n/a", "NA", "NAN", "NaN", "nan", "NULL", "Null", "null", "NONE", "None", "none". The conversion will apply only if the whole cell corresponds to one of these strings, i.e. if any of these values occurs as a substring inside a longer text, it will be left unchanged.
CSV
CSV or comma-separated values files are delimited text files using commas to separate values. Each line of the file contains a data record. Each record contains one or more fields separated by commas. CSV column names are listed in the first line of files.
See here for file structure.
XLS or XLSX
XLS or XLSX files are formats commonly used in Microsoft Excel spreadsheets. These files store data in worksheets containing cells arranged as a grid of rows and columns. Like other file types you can upload these directly to Graphext. In case the file has several sheets, we will import only the first one.
JSON
JSON or JavaScript Object Notation files are primarily used for transmitting data between web applications and servers. They store data in a format similar to a JavaScript object or Python dictionary. You can upload JSON files directly to Graphext.
See here for file structure.
SAV
SAV files are part of the SPSS Statistics File Format or SPSS family. Information in a SAV file is divided into a header, a sequence of tagged 'records' comprising a dictionary for the file and finally the data itself.
GML
Data saved in the Graph Modelling Language format lets you import data already representing a graph or network. GML is a hierarchical ASCII-based file format for describing graphs.
ARROW
You can also upload binary Apache Arrow files written in streaming mode to Graphext. Arrow files are defined as a language-independent columnar memory format for flat and hierarchical data.
ZIP
ZIp files allow you to upload and concatenate multiple CSV files at once. You'll probably want all files in the archive to have the same structure (the same columns). If this is not the case, parts of the resulting table where a file didn't have a column that is present in other files will have missing values.
File Structures
CSV
While there is no "official" CSV standard, most implementations follow some common rules. We recommend adhering to the following guidelines adapted from the Internet Engineering Task Force, which you may also access directly here.
- The first line in the file is a header line with the same format as normal record lines. This header contains names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. For example:
- Each actual data record is located on a new line, delimited by a line break.
- The last record in the file may or may not have an ending line break.
- Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and will not be ignored. The last field in the record must not be followed by a comma. For example:
Good
Bad
- Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
- Fields containing line breaks, double quotes, and commas must be enclosed in double-quotes. For example:
- If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
JSON
We support three different JSON (JavaScript Object Notation) formats, which will be detected automatically by inspecting the beginning of a .json file.
Json Lines
In the JSON lines format, each line in the file is a JSON object representing a dataset row. The object in each row contains field names as keys and the corresponding field's value. For example:
For further details see the official JSON Lines documentation.
List of Records
In this format the file contains a JSON list of objects, where each object contains field names and values as key-value pairs. For example:
Notice how the first level represents a list, and that objects within this list are separated by a comma. Line breaks and spaces between fields are not required, so the following is an equivalent but more compact format that is equally valid:
Object of Columns
The last supported JSON format is column-oriented. In this format the file contains at the highest level a JSON object. This object has key-value pairs where each key is the name of a field/column, and each value is a JSON list containing {index: value} objects. For example:
In this format, line breaks and spaces between fields are also ignored, and so the following is equivalent:
A Note on Automatic Detection
As can be seen in the examples, each JSON format is easily identified by inspecting the first few lines of the file. We use the following heuristic:
- If the file starts with [ - assume the List of Records format.
- If the file contains more than 1 line, and each of the first 2 lines starts with { and ends with } - assume the JSON Lines format.
- In all other cases - assume the Object of Columns format.
Need Something Different?
We know that data isn't always clean and simple.
Have a look through these topics if you can't see what you are looking for.