Technical Docs | Text Analysis

Advances in natural language processing mean that data science is increasingly capable of analyzing text written (or spoken) in the world's many languages. Text analysis is a powerful tool, but whenever you work with text data, it is important to analyze the content of text fields in the appropriate language. Not doing so can produce noisy or incorrect results.

"A different language is a different vision of life."

- Federico Fellini


Language Models

We use two types of language models at Graphext: spaCy models and Stanza models. Although the implementation differs between the two, the way you add them to your project does not. To add any language model, use the Data Extraction tab inside your project setup wizard.

The spaCy language models are fast and robust. We use them for common languages, and our team of engineers and data scientists has built them into Graphext.

Languages Supported by spaCy Models

English

Spanish

French

Portuguese

German

Italian

Stanza models are slower and less thoroughly tested. We use them for less common languages, which lets us offer a wider range of language support on request. Because they are less thoroughly tested, choosing one prompts you to confirm that you want to work with an experimental language.

Languages Supported by Stanza Models

Arabic

Catalan

Basque

Turkish
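
The split between the two model families can be sketched as a simple lookup. This is only an illustration built from the language lists on this page: the names `SPACY_LANGUAGES`, `STANZA_LANGUAGES`, and `choose_backend` are hypothetical and are not part of Graphext, spaCy, or Stanza.

```python
# Hypothetical lookup tables, copied from the language lists on this page.
SPACY_LANGUAGES = {"English", "Spanish", "French", "Portuguese", "German", "Italian"}
STANZA_LANGUAGES = {"Arabic", "Catalan", "Basque", "Turkish"}


def choose_backend(language: str) -> dict:
    """Return which model family would handle a language and whether it is experimental."""
    name = language.strip().title()
    if name in SPACY_LANGUAGES:
        return {"language": name, "backend": "spaCy", "experimental": False}
    if name in STANZA_LANGUAGES:
        # Stanza-backed languages are the ones that need an "experimental" confirmation.
        return {"language": name, "backend": "Stanza", "experimental": True}
    # Not in either list: such languages would have to be requested.
    return {"language": name, "backend": None, "experimental": None}


for lang in ("english", "Basque", "Japanese"):
    print(choose_backend(lang))
```

A language that appears in neither list comes back with no backend, which corresponds to the "on request" case described below.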

Don't see what you are looking for?

We are able to include more language models on request. Send us an email with your requirements and we'll get back to you.


Incorporating Language Support

When you build a text-based project in Graphext, you will be asked which text fields you want to analyze and what language those fields are in. Graphext can infer the language directly from the text itself. Alternatively, you can explicitly tell Graphext the language of the text you will analyze.

Language support is incorporated into your projects as you are setting them up.
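
The difference between setting a language explicitly and inferring it can be sketched as follows. This page does not describe how Graphext actually infers language, so the example swaps in a toy stopword-counting heuristic purely for illustration. The names `STOPWORDS`, `infer_language`, and `resolve_language` are hypothetical.

```python
from typing import Optional

# Tiny, hand-picked stopword samples (an assumption for this sketch);
# a real detector would rely on far more evidence.
STOPWORDS = {
    "English": {"the", "and", "is", "of", "to", "in"},
    "Spanish": {"el", "la", "y", "es", "de", "en", "los"},
    "French": {"le", "la", "et", "est", "de", "les", "un"},
}


def infer_language(text: str) -> Optional[str]:
    """Guess the language by counting stopword hits; return None if nothing matches."""
    words = text.lower().split()
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def resolve_language(text: str, explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly set language; otherwise fall back to inference."""
    return explicit if explicit is not None else infer_language(text)


print(resolve_language("the cat is in the garden"))
print(resolve_language("el perro es de los vecinos"))
print(resolve_language("el perro es de los vecinos", explicit="Catalan"))
```

An explicit setting always wins in this sketch, which is useful when texts are short or mix languages and inference is likely to be unreliable.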

How to Incorporate Language Support

  1. Choose a dataset with at least one text field to analyze.
  2. From inside the project setup wizard, choose 'Text' or 'Social Media' as your analysis type.
  3. Inside the 'Data Extraction' tab, choose how you would like to set the language of your text.
  4. You can set the language manually or have it inferred directly from the text itself.
  5. That's it. Now execute your project.
  6. Done. Setting the language of your text makes your analysis more accurate.

Need Something Different?

We know that data isn't always clean and simple.
Have a look through these topics if you can't see what you are looking for.