Advances in natural language processing mean that data science is increasingly capable of analyzing text written (or spoken) in the world's many languages. Text analysis is a powerful tool, but only when the content of each text field is analyzed in the appropriate language. Analyzing text in the wrong language can produce noisy or incorrect results.
"A different language is a different vision of life."
- Federico Fellini
We use two types of language models at Graphext: spaCy models and Stanza models. Although the implementation differs between the two, the way you add them to your project does not. To add any language model, use the Data Extraction tab inside your project setup wizard.
The spaCy language models we use are fast and robust. We use them for common languages, and our team of engineers and data scientists has built them into Graphext.
Stanza models are slower and less thoroughly tested. We use them for less common languages, which lets us offer a wider range of language support on request. For that reason, when you choose a Stanza model you will be asked to confirm that you want to work with an experimental language.
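The split between the two model families can be pictured as a simple routing rule. The sketch below is only an illustration under stated assumptions: the language sets, the `ModelChoice` type, and the `choose_model` function are hypothetical names invented for this example, and they do not reflect Graphext's actual internals or its real list of supported languages.

```python
from dataclasses import dataclass

# Hypothetical language sets, for illustration only.
# They are NOT Graphext's actual support list.
SPACY_LANGUAGES = {"en", "es", "fr", "de"}
STANZA_LANGUAGES = {"eu", "gl", "is"}


@dataclass(frozen=True)
class ModelChoice:
    language: str
    backend: str        # "spacy" or "stanza"
    experimental: bool  # True when the user had to confirm the choice


def choose_model(language: str, confirm_experimental: bool = False) -> ModelChoice:
    """Pick a backend for a language code, preferring the well-tested one."""
    code = language.strip().lower()
    if code in SPACY_LANGUAGES:
        return ModelChoice(code, "spacy", experimental=False)
    if code in STANZA_LANGUAGES:
        # Experimental languages require an explicit confirmation.
        if not confirm_experimental:
            raise PermissionError(
                f"'{code}' uses an experimental model; confirmation required"
            )
        return ModelChoice(code, "stanza", experimental=True)
    raise LookupError(f"no model available for '{code}'; it can be requested")
```

In this sketch, `choose_model("en")` resolves to the spaCy backend straight away, while `choose_model("eu")` raises until it is called with `confirm_experimental=True`, mirroring the confirmation step described above.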
We are able to include more language models on request. Send us an email with your requirements and we'll get back to you.
When you build a text-based project in Graphext, you will be asked to specify which text fields you want to analyze and to set the language of those fields. Graphext can infer the language directly from the text itself. Alternatively, you can explicitly tell Graphext which language the text is written in.
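To make the difference between inferred and explicit language concrete, here is a minimal sketch. It assumes a deliberately naive stopword-counting heuristic, which is not how Graphext infers language; the `STOPWORDS` table and the `infer_language` and `resolve_language` functions are hypothetical names used only for this example.

```python
from collections import Counter
from typing import Iterable, Optional

# Tiny illustrative stopword lists (an assumption for this sketch).
# A real detector would rely on a trained model instead.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it"},
    "es": {"el", "la", "y", "es", "de", "que", "en"},
    "fr": {"le", "la", "et", "est", "de", "que", "les"},
}


def infer_language(text: str) -> Optional[str]:
    """Return the language whose stopwords match the most tokens, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def resolve_language(
    texts: Iterable[str], explicit: Optional[str] = None
) -> Optional[str]:
    """Use the explicit language if given; otherwise take a majority vote."""
    if explicit:
        return explicit
    votes = Counter(
        lang for lang in map(infer_language, texts) if lang is not None
    )
    return votes.most_common(1)[0][0] if votes else None
```

Setting the language explicitly skips inference entirely, which is the safer option when a field is short, mixed, or full of names and codes that a detector could misread.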
Language support is incorporated into your projects as you are setting them up.
We know that data isn't always clean and simple.
Have a look through these topics if you can't see what you are looking for.