Building Projects

Some steps allow the use of advanced queries to filter rows in a dataset. Right now, the following steps allow this:

An advanced query allows you to filter data using a simple text query, the syntax of which may be familiar to users of SQL or Elasticsearch. For example, you may want to find the tweets with more than a certain amount of retweets - "RT > 12"; all customers that have paid their invoices - "invoice_paid: True"; specific patterns in the title of documents - "title: Obama"; or find the votes whose values ​​are greater than the median - "num_votes > MEDIAN".

Advances queries allow you to succinctly express such conditions, as well as compose multiple conditions in a single query.

Text and categorical queries

In categorical or text columns you can search for specific words (tokens) or categories. You can combine searches using boolean logic (i.e. string together multiple conditions using the AND/OR keywords).

Select rows where the "department" column contains "engineering":

department: engineering

or where the "text" column contains "he" and "she":

text: he AND she

or filter the four most frequent categories in the "department" column:

department: TOP(4)

Numerical and date queries

You can compare the values in numeric and date columns against constants to test for equality or against a desired range (mininum and/or maximum). For queries in date columns you can use dates expressed in the ISO format 8061 standard (e.g. "2019-01-01").

Column statistics in filter conditions

In addition to using constants and ranges, you may also compare date and numeric values against certain column statistics. Specifically, all the following statistics are supported:

  • MIN: minimum value
  • MAX: maximum value
  • MEAN: the average
  • P25: or Q1, 25% of data lies below this point
  • MEDIAN: that is, P50 or Q2. This is the median of the data and 50% of it lies below this point
  • P75: or Q3, 75% of data lies below this point

Date queries

Dates have to be specified in the ISO format 8061 standard, where the date part is required and the time part optional. I.e. both the following dates are valid: 2020-02-23 and 2020-02-23T14:30:00. In the first example, only year, month and date are specified, while the second includes the time also (hours, minutes and seconds).

To select all rows whose "date" field falls into the year 2019:

date: >= 2019-01-01 AND <= 2019-12-31

Or to match a specific date:

date: 2019-01-01

The = operator is implicit in the above query, so the following query produces the same result:

date: =2019-01-01

Select dates before 2020, January 1st, 12:35 PM:

date: <= 2020-01-01T12:35:00

Dates before the median date in the same column:

date: >= 2020-02-23 AND <= P75

And you may of course also combine constants with calculated statistics:

date: >= 2020-02-23 AND <= P75

Valid operators

Note that queries in date columns only support =, >= or  operators, while > and < are not supported.

Numerical queries

In numeric columns, numbers can also be specified using scientific notation. The following two numbers are both valid and represent the same number: 145000 and 1.45e5.

To select rows whose age field falls into the range 10 to 55:

age: >= 10 AND <= 55

To match a specific number:

age: 12

The = operator is implicit in the above query, so the following query produces the same result:

age: =12

Include ages from 10 up to but not including 55 (exclusive smaller/greater than):

age: >= 10 AND < 55

Using scientific notation:

age: >= 5.6e-4 AND <= 9.35e-2

Using a computed statistic:

age: >= P25

Combining both constants and computed statistics:

age: >= 5 AND < MEAN

Valid operators

Queries in numeric columns support =, >, >=, < and operators.

Need Something Different?

We know that data isn't always clean and simple.
Have a look through these topics if you can't see what you are looking for.