Say you ran a survey and collected responses from 1,000 individuals.
You’ve included two open-ended questions in your survey and all 1,000 of your respondents answered them, using 15 words each.
Using simple arithmetic, you’ll find that you’ve collected 2,000 open-ended responses (2 * 1,000) that totaled 30,000 words (2,000 * 15).
With such a daunting amount of text to read, how can you reasonably expect to review and identify the key insights from your responses?
The answer involves Natural Language Processing, often referred to as NLP, which is essentially the process of using computers to help understand large amounts of text data.
Throughout this page, we’ll provide an introduction to Natural Language Processing and discuss how to use it to help review your survey results. By the end, you’ll have an idea of how to use Natural Language Processing in your future surveys.
Natural Language Processing is a field where computer programming and machine learning techniques attempt to understand and make use of large volumes of text data.
Natural Language Processing offers hundreds of ways to review your open-ended survey responses. Unfortunately, you don’t have the time to review each of these applications and decide on the best one.
We’ll fast-track your review process by walking you through 3 of the most popular Natural Language Processing use cases.
The word cloud allows you to identify the relative frequency of different keywords using an easily digestible visual.
For example, in a previous study, we asked Americans to describe millennials in a single word. Their responses led to the following word cloud:
The bigger words in the chart appear more often in responses relative to the other words. In this case, these words tend to be negative—e.g. “lazy” and “spoiled.”
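The counting behind a word cloud can be sketched in a few lines of Python. This is a minimal sketch rather than the tool used for the chart described above: the sample answers, the function names, and the font-size range are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical one-word answers, invented for illustration only.
responses = ["Lazy", "entitled", "lazy", "Spoiled", "creative",
             "lazy", "spoiled", "Tech-savvy"]

def word_frequencies(responses, stopwords=frozenset()):
    """Count how often each lowercased word appears across all responses."""
    counts = Counter()
    for response in responses:
        tokens = re.findall(r"[a-z][a-z'-]*", response.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

def relative_sizes(counts, min_size=10, max_size=60):
    """Scale each count to a font size; the most frequent word gets max_size."""
    if not counts:
        return {}
    top = max(counts.values())
    return {word: min_size + (max_size - min_size) * count / top
            for word, count in counts.items()}

counts = word_frequencies(responses)
sizes = relative_sizes(counts)

# Print words from most to least frequent with their assumed font sizes.
for word, count in counts.most_common():
    print(f"{word}: count={count}, size={sizes[word]:.0f}")
```

A word cloud library would then draw each word at its computed size. Passing a set of common words as `stopwords` keeps filler such as "the" from dominating the picture.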
Now that you know how it works, you might be asking yourself, “How do word clouds help my survey analysis?”
Here are some of their key benefits:
But here are some of their drawbacks to consider:
TFIDF (term frequency-inverse document frequency) measures how distinctive a word or group of words is within a set of responses. It’s calculated as follows:
The closer the number is to 1, the more important the word is. What’s the reasoning behind this formula? A word can carry real signal without being among the most frequent words overall, so it’s easily neglected or missed despite its value to your analysis. TFIDF solves this challenge by highlighting the most important unique words or groups of words.
For example, let’s say we gathered responses from the question: “If you had $1,000 and you could save it, invest it, or use it to pay off bills, what would you do with it?”
We end up finding that many young adults would spend the money on school-related expenses, as words and phrases like “tuition” and “buying textbooks” have a high TFIDF rating.
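To make the idea concrete, here is a minimal Python sketch of a TFIDF calculation. It assumes one common textbook definition (term frequency multiplied by the log of inverse document frequency), which may differ from the exact formula this page relies on; in particular, scores in this variant are not rescaled to a 0-to-1 range. The age groups and answers are hypothetical, and each group's answers are treated as one "document."

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {word: score} dict per document.

    Assumed variant:
      tf  = times the word appears in the document / words in the document
      idf = log(number of documents / documents containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical answers, one combined string per age group (invented data).
groups = ["18-24", "25-44", "45+"]
documents = [
    "pay tuition and buy textbooks then save the rest",
    "pay off bills and save the rest",
    "invest it and save the rest",
]

scores = tf_idf(documents)
for group, group_scores in zip(groups, scores):
    top = sorted(group_scores, key=group_scores.get, reverse=True)[:3]
    print(group, top)
```

Words that every group uses, such as "save," score zero in this variant because they do not distinguish one group from another, while words used by only one group rise to the top.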
Use TFIDF when you want to…
Just keep the following pitfalls in mind…
Topic modeling is an advanced natural language processing technique that involves using algorithms to identify the main themes or ideas (topics) in a large amount of text data. Topic modeling algorithms examine text to look for clusters of similar words and then group them based on the statistics of how often the words appear and what the balance of topics is.
As a result, topic modeling helps you understand the key themes from your survey responses as well as the relative importance of each theme.
Let’s say we asked respondents whether or not they like swimming. We followed up with an open-ended question where the respondent can explain their answer. Our topic model produces the following chart, based on the clusters of similar words that appear in the open-ended responses.
Eight main topics emerge, based on the frequency of word clusters that appeared in our open-ended responses. The lines on either side of each topic show a 95% confidence interval, which reflects the uncertainty in that topic’s estimated weight.
As you can see, the topic clusters that appear for respondents who said they don’t like swimming are negative, while the clusters for respondents who said they like swimming are positive. In our example above, “exhausting” was the most relevant topic when respondents disliked swimming. Meanwhile, “fun” was the most applicable topic when respondents said they liked swimming.
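For readers curious about what a topic modeling algorithm does under the hood, the sketch below fits a very small topic model in plain Python. It assumes one specific technique, latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling, which is not necessarily the method behind the chart described above. The responses, the stopword list, the number of topics, and the `alpha` and `beta` settings are all illustrative assumptions, and it does not compute confidence intervals.

```python
import random
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"is", "and", "the", "in", "a", "it", "i", "me"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word), where
    doc_topic[d][k] counts tokens of document d assigned to topic k and
    topic_word[k][w] counts how often word w was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Hypothetical open-ended answers about swimming (invented data).
responses = [
    "swimming is fun and relaxing",
    "fun exercise and relaxing in the pool",
    "the pool is fun",
    "exhausting and the water is cold",
    "cold water and exhausting laps",
    "laps in cold water",
]
docs = [tokenize(r) for r in responses]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)

# A topic's weight here is the share of all tokens assigned to it.
total_tokens = sum(len(doc) for doc in docs)
weights = [sum(row[k] for row in doc_topic) / total_tokens for k in range(2)]

for k, weight in enumerate(weights):
    top = [w for w, c in topic_word[k].most_common(3) if c > 0]
    print(f"topic {k}: weight={weight:.2f}, top words={top}")
```

The sampler is random, so the topics it finds depend on the seed and on the data; with only a handful of short responses the clusters can be unstable. In practice you would run an established library on the full set of responses and choose the number of topics by inspection.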
Here are some of its shortcomings:
Deciding on the right application of Natural Language Processing isn’t simple. But choosing between these 3 use cases makes the process much easier. So go forward and embrace your free responses with confidence. You’ll be better equipped to uncover the key insights they provide.