Data Academy | Text Analysis

In a nutshell, NLP - natural language processing - is computer manipulation of human language. Although scientists have tried to program computers to understand text and speech for more than 50 years, massive recent advancements in the field mean that businesses of all shapes and sizes now use NLP to support decision-making and study markets. NLP encompasses lots of data science tasks but we'll be focusing on text analysis here - roughly defined as the process of deriving insights from written language.

The broad field of computer science called natural language processing (NLP) is focused on equipping computers with the ability to process human language. To do this, computers need to be able to understand the rules of human language and compute this understanding using statistical models.

NLP is, therefore, a marriage between linguistics and mathematical modelling. Because of recent advancements in the latter, computers can now process human language using machine learning and deep learning.

"Natural language processing strives to build machines that understand and respond to text or voice data—and respond with text or speech of their own—in much the same way humans do."

- IBM Cloud Education

The power of machine learning models to analyze large volumes of text has opened up the field of NLP to the business world. Products and services we interact with every day are powered by NLP. Amazon's Alexa is built on speech recognition software. Auto-correct uses NLP to predict which word you meant to type. Emails you receive are classified as updates, promotions or spam depending on the language inside the message.

But NLP is also used to study human language on the web, on social platforms and in customer reviews in such a way that enterprises can learn about what people think and feel. This field of NLP is called text analytics and is used across the world to enrich business intelligence.

NLP Today & How We Got Here ...

In 1950, Alan Turing published a paper in Mind called "Computing Machinery and Intelligence" in which he first introduced the concept of what is now known as the Turing test. The test puts forward the idea of the "Imitation Game", a challenge that replaces the question 'can machines think?' and instead asks whether a machine can act indistinguishably from the way that a human does. Language - being the human vehicle of communication - is a key part of Turing's test.

The advent of machine learning algorithms that could process human language in the 1980s paved the way for the NLP techniques that we recognise today. Early successes came in the field of machine translation and later, as the internet became ... well, the internet ... natural language models had a lot more data to learn from.

Since the early 2000s, text analysis models have enjoyed greater exposure to all forms of language thanks to the internet. Models trained on unstructured text data posted online are becoming increasingly useful to businesses because of their ability to effectively transform and analyze language from a wide variety of contexts.

The internet changed the game for text analysis, providing much more data to train NLP models - data that also provides businesses with relevant market insights.

For instance, gathering data from specific conversations taking place on Twitter has helped train a variety of context-specific sentiment analysis and text classification models. Digital news articles are frequently the subject of named entity recognition models that can detect what is making headlines. Message archives between customers and support teams can now be analyzed with topic detection models that segment text - and the people that write it - based on what it is referring to.

The models themselves are also increasing in complexity. Simple methods such as counting word frequency have been surpassed by more sophisticated measures like TF-IDF, which weights a word by how often it appears in a document relative to how common it is across all documents. Methods used to transform language data so that computers can process it have moved on from bag-of-words to more advanced vectorization techniques.
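To make the difference between raw word counts and TF-IDF concrete, here is a minimal sketch in plain Python. The three example documents are invented, and the weighting used (term frequency multiplied by log(N / document frequency)) is one common variant of TF-IDF rather than the only definition; libraries often add smoothing or normalisation.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the delivery was fast and the driver was friendly",
    "the delivery was late and the box was damaged",
    "the app is easy to use",
]

def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(docs)

# Compare the top word by raw count with the top word by TF-IDF
# for the second document.
raw_counts = Counter(docs[1].split())
top_by_count = raw_counts.most_common(1)[0][0]
top_by_tfidf = max(scores[1], key=scores[1].get)
print(top_by_count, top_by_tfidf)
```

In this toy corpus, "the" is the most frequent word in the second document, yet it scores zero under TF-IDF because it appears in every document. Words unique to that document, such as "late" or "damaged", rise to the top instead.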

The Computational Challenge of Human Language

But these models are mathematical attempts to map human language and come with built-in imperfections. Text classification models are predictive and can only make decisions about the content of language based on the data and methods that were used to train them.

This is not to say that language models are useless but rather to point out that accuracy varies massively from one model to the next. Asking a computer program to account for all the subtleties, hidden meanings and variations in human language is a mammoth task. Some models perform better than others but none can truly claim to be 100% accurate.

"It is hard from the standpoint of the child, who must spend many years acquiring a language … it is hard for the adult language learner, it is hard for the scientist who attempts to model the relevant phenomena, and it is hard for the engineer who attempts to build systems that deal with natural language input or output.

These tasks are so hard that Turing could rightly make fluent conversation in natural language the centrepiece of his test for intelligence."

Mathematical Linguistics | 2010

Instead, language models are created to be effective at specific tasks. Different text analysis tasks require a different elemental understanding of human language and break it down with different mathematical models that reflect the linguistic rules defining the specific task.

A model that detects sarcasm in a TV sitcom requires a completely different set of rules from one that is designed to parse and extract significant terms (n-grams) from a series of Twitter posts.
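For readers unfamiliar with n-grams, the sketch below shows one simple way to extract them: slide a window of n tokens across each post and count the results. The posts are invented, and splitting on whitespace is an assumption made to keep the example short.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented example posts.
posts = [
    "customer service was slow",
    "slow customer service again",
    "great customer service",
]

# Count two-word sequences (bigrams) across all posts.
bigram_counts = Counter(
    gram for post in posts for gram in ngrams(post.split(), 2)
)
print(bigram_counts.most_common(3))
```

Here the bigram ("customer", "service") appears in all three posts, which is the kind of recurring term an analyst might flag as significant.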

Confining NLP models to specific tasks allows researchers to focus on improving accuracy for each task. Because the focus of research is often driven by market demand, sentiment analysis models are generally accepted to be more advanced than models that detect irony.

NLP for Text Analysis: Techniques

Transforming Unstructured Text Data

A crucial part of most text analysis models involves transforming language into a format that computers can read. Computers do not process words as a series of sounds or symbols in the same way that humans do. Instead, language must be transformed into a statistical representation in order for computers to analyze it.

There are many ways to transform text data but generally, the process involves representing words as lists of numbers - or vectors. Common methods of vectorizing text include bag-of-words (also called count vectorization) and one-hot encoding.
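As a rough illustration of two of these methods, the sketch below builds a vocabulary from two invented sentences, then produces a one-hot vector for a single word and a bag-of-words vector for a whole sentence. The function names are our own and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

# Invented example sentences.
sentences = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word, sorted, mapped to a column index.
vocab = sorted({word for s in sentences for word in s.split()})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's vocabulary position."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(sentence):
    """Vector of word counts; word order is discarded."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

print(vocab)                       # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(one_hot("cat"))              # [1, 0, 0, 0, 0, 0]
print(bag_of_words(sentences[0]))  # [1, 0, 1, 1, 1, 2]
```

Both representations have one position per vocabulary word. That keeps them easy to read but makes the vectors long and mostly zero on realistic vocabularies, which is one reason more compact vectorization techniques were developed.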

Sentiment Analysis, Classification & Intent Detection Models

Detecting intent using NLP models is a field concentrated on classifying the human motivation behind language. Intent detection models are able to assign a label to each new item of language that is passed into them. This label might be concerned with whether the language is considered angry, happy or sad, or whether it contains hate speech or not.

Intent detection models are predictive and learn how to classify text using training data with annotated labels.

The most commonly known example of intent detection is sentiment analysis, which typically involves classifying text as either containing a positive, negative or neutral sentiment. Sentiment analysis models are able to achieve this classification through a process of training and testing in which the model can learn from a dataset that already contains positive, negative and neutral annotations - or labels - for existing text.
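The train-and-test process described above can be sketched with a tiny Naive Bayes classifier written in plain Python. Naive Bayes is our choice for this illustration, not a claim about how any particular sentiment model works, and the labelled examples are invented. Real projects would use far more data and usually an established library.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-labelled training set, invented for illustration.
train = [
    ("great product love it", "positive"),
    ("love the fast delivery", "positive"),
    ("terrible quality broke quickly", "negative"),
    ("awful support terrible experience", "negative"),
    ("arrived on tuesday", "neutral"),
    ("the parcel arrived today", "neutral"),
]

def fit(examples):
    """Count words per label; these counts are the 'learned' model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)

# Held-out examples the model has not seen, used to measure accuracy.
held_out = [
    ("love this great delivery", "positive"),
    ("terrible awful quality", "negative"),
]
accuracy = sum(predict(model, text) == label for text, label in held_out) / len(held_out)
print(accuracy)
```

The key design point is the split between `train` and `held_out`: the model only learns from annotated examples in the first list, and its accuracy is judged on text it has never seen.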

Most intent detection models work in a similar way to this and are predictive in their nature. As you might be starting to imagine, the business use cases for intent detection models are vast and wide-ranging. Brands use intent detection to monitor how their products or services are perceived by users.

Social media platforms use intent detection models to identify hateful and offensive language published by their users. Researchers can track the emotions of fans towards their sports teams throughout the course of a season or competition.

Topic Modelling & Word Association

Machine learning models can also be used to discover what is being spoken or written about in language data. Topic modelling is an unsupervised learning technique used to infer content-related patterns in large volumes of text. Unlike classification models, topic modelling will not produce neatly packaged labels. Instead, these algorithms group text considered to be semantically similar and will also often tell you the terms that were used to infer this relationship.

Associating words with one another has huge potential in the field of text analytics. The image below represents a keyword analysis built with Graphext in which the significant terms in a text field have been extracted and linked to other, semantically similar significant terms.

Modelling Grammar

Grammatical structures are crucial to the meaning of language. Without an understanding of how language is structured - and the effect this has on meaning - computers aren't able to effectively interpret who is being spoken about or which kind of 'bat' is being referenced.

Since words are always dependent on the other words around them, modelling grammar is an important part of many advanced text analysis techniques and is key to effectively vectorizing text - a process that itself is central to NLP model building.

But analysis of word positions, sentence structures and writing style can also be of interest in and of itself. Techniques like parsing can represent sentences in a syntax tree that maps the relationships of words to one another. Part-of-speech tagging (POS tagging) helps analysts to extract key language features like verbs, adjectives or nouns.

NLP for Text Analysis: Tools

Open Source NLP Technology: Libraries in Python & R

One of the most common ways to approach text analysis is using a programming language like Python. Data scientists will often work with open source libraries like NLTK or spaCy inside interactive notebooks because they can clean up and transform their data step by step.

Open-source libraries like NLTK give analysts quick access to powerful pre-built NLP algorithms that they can deploy in their own analysis. This might simply involve stemming words (returning them to their root) or tokenization (breaking text into tokens that a computer can better understand).
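To show what these two steps do, here is a deliberately simplified sketch using only the Python standard library. The `tokenize` and `toy_stem` functions are our own stand-ins, not NLTK's API; the stemmers and tokenizers shipped with libraries like NLTK handle many more cases.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def toy_stem(token):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("Customers loved the packaging, but deliveries arrived late.")
stems = [toy_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Note that stems such as "packag" or "arriv" are not dictionary words. A stemmer only needs to map related forms to the same string so that, for example, "arrived" and "arriving" are counted together.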

We are also starting to see an increase in the availability of open-source machine learning technologies. Specific models that achieve very particular tasks can be found on community sites like Hugging Face, which itself offers a wide range of context-specific NLP models for text transformation, classification, tokenization and many more use cases.

Business Friendly Text Analysis Tools

The huge downside to text analysis performed with programming languages is that, to do so, you need a very strong knowledge of how to code. This is often inaccessible for business analysts, marketing teams, CEOs and other business strategists who will often be the people that require text analysis insights and drive these projects forward.

In order to negotiate this skill divide, companies have developed software that gives business analysts the ability to conduct powerful text analysis projects without having to code themselves. Low-code tools like Graphext offer access to built-in NLP algorithms including topic detection, sentiment analysis and entity extraction.

Because of the way that these tools empower non-technical users, they are quickly becoming a popular option for businesses looking for more NLP insights.

Why is Text Analysis Important For Businesses Today?

Market Research

Global markets are more talkative than they have ever been. The internet has brought cascades of data connecting people from across the world in conversations about the trending topics of today. Social platforms, forums, blogs and comment sections have not only changed the way that markets interact with one another but they also offer the possibility of understanding how, why and when market interest might be shifting from one thing to another.

Use Case Example

We collected headlines from 38 national news publishers and plotted the evolution of UK news throughout 2020.

This is what market research looks like today. Broadly speaking, businesses can gain a huge amount of knowledge to drive decision making by using text analysis techniques to derive insights from data posted by members of the public online.

A simple example here might involve recognising when a brand is spoken about on Twitter then analyzing the sentiment of the text used to reference the brand. Over time, this type of data can tell businesses whether to give their PR representative a promotion or not.
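A very rough version of this idea can be sketched in a few lines of Python. Everything here is hypothetical: the brand name "Acme", the posts, and the five-word sentiment lexicon are invented, and a word-list score is a much cruder approach than a trained sentiment model.

```python
from collections import defaultdict

# Hypothetical posts as (date, text) pairs.
posts = [
    ("2021-01-04", "Loving my new Acme headphones, great sound"),
    ("2021-01-19", "Acme support was awful today"),
    ("2021-02-02", "Great service from Acme, love it"),
    ("2021-02-10", "Nothing to do with the brand at all"),
]

# Tiny hand-made lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {"great": 1, "love": 1, "loving": 1, "awful": -1, "terrible": -1}

def score(text):
    """Sum the lexicon values of the words in a post."""
    words = text.lower().replace(",", " ").split()
    return sum(LEXICON.get(word, 0) for word in words)

def monthly_brand_sentiment(posts, brand):
    """Average sentiment score per month for posts that mention the brand."""
    by_month = defaultdict(list)
    for date, text in posts:
        if brand.lower() in text.lower():
            by_month[date[:7]].append(score(text))  # 'YYYY-MM' as the month key
    return {month: sum(s) / len(s) for month, s in sorted(by_month.items())}

trend = monthly_brand_sentiment(posts, "Acme")
print(trend)
```

The output is one average score per month, which is the kind of series a business could chart to see whether the tone of brand mentions is improving or declining.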

Consumer Opinion & Customer Feedback

Consumer opinion is incredibly important. The reviews that customers write about products on eCommerce sites have huge influence on the future sales of that product. The feedback that support teams receive through live chat channels like Intercom provides invaluable information on issues that customers face as well as improvements that they suggest.

Positive customer feedback is an incredibly important benchmark for business success in the future.

Use Case Example

We used Graphext to collect and analyze conversations surrounding Lloyds Bank on Twitter.

Using text analysis to derive insights from feedback helps to pinpoint exactly where the main pain points of customers are. For instance, businesses working with a number of customer support teams that operate in different countries might not easily be able to recognise that customers are always angry about a specific product feature.

Analyzing their message archives using an emotion recognition model together with an entity extraction model would help the business to recognise that its customers are united in their annoyance and that action is necessary.

Conclusions & Further Reading

Deriving insights from text data can be as simple as counting word frequencies. Simple approaches to analyzing critical business text data like customer reviews or employee feedback can be incredibly useful. Advanced methods of text analysis are also becoming more accessible to businesses, which can take advantage of low-code or no-code platforms to perform data science tasks like sentiment analysis or entity extraction.

Due to the unstructured nature of language data, text analysis can be tricky. Human language is not just a set of numbers and can contain ambiguous meanings. It's also worth remembering that many NLP tasks rely on models that are predictive in their nature. But with so many pressing industry demands for useful text analysis algorithms, steady developments in research fields related to NLP mean that the accuracy of these models is continuing to improve.

If you want to read more about NLP for text analysis, here are some of our favourite resources on the topic.
