Almost every firm, company, or agency has a collection of text data that is difficult to manage. Big blocks of text, whether sentences or paragraphs in Word documents or plain text files, can't be queried, searched, averaged, or summarized by analysts as easily as the numeric fields stored in databases.

Dedicating an employee to reading and interpreting every single block of text is time-consuming and may not yield an unbiased view of the big picture. Yet there is often valuable information stored in these text fields. Perhaps they represent call center transcripts, customer support emails, HR documentation (such as employee performance reviews), written product reviews from a company website or marketplaces such as Amazon, or social media posts. All of these examples contain information that could be highly valuable to a company and could assist management in making data-driven decisions.

Natural Language Processing

Natural Language Processing (often referred to as NLP) is a collection of techniques for processing and analyzing text data. These techniques include the pre-processing tasks required to transform unstructured blocks of text into the structured format that any linguistic analysis needs. The purpose of each task is to split blocks of text into pieces that a computer can partially interpret. While there are many different types of NLP pre-processing tasks, the major ones include tokenization, removing stop words, part-of-speech tagging, lemmatizing, and text chunking, summarized below.

  • Tokenization: Tokenization is the process of splitting one single chunk of text into “tokens”. For example, a document might be split into paragraph tokens, where one token is equal to one paragraph. A paragraph might be split into sentences, where one token is equal to one sentence. Most commonly, though, sentences are split into words, where one token is equal to one word. Tokenizing at the word level allows us to evaluate each individual word within a document or set of documents.
    • Example: [“The dog went to the park yesterday.”] becomes [“The”, “dog”, “went”, “to”, “the”, “park”, “yesterday”, “.”]
  • Removing Stop Words: Recall that the purpose of NLP is to process text such that a computer can partially understand natural language. Computers do not read and interpret natural language the way a human does, so we use these pre-processing tasks to break natural language down into pieces from which a computer can draw meaningful information. Stop words are words that are highly common in natural language but provide little information about the meaning of the text. Words such as “the”, “and”, and “did” are used constantly in English but tell a computer almost nothing about what a block of text means. As a result, we remove stop words as part of the NLP process.
    • Example: [“The dog went to the park yesterday.”] becomes [“dog”, “went”, “park”, “yesterday”, “.”] after tokenizing and removing stop words. Notice that the general meaning of the sentence is still intact.
  • Part-of-Speech (POS) Tagging: Part-of-speech tagging involves labeling each word as a noun, adjective, verb, etc. Certain analysis techniques depend on knowing a word’s grammatical role. Printing a summary of the most common descriptive words used to describe a product in Amazon reviews, for example, requires the computer to know which words are adjectives.
    • Example: [“The dog went to the park yesterday.”] becomes [(“The”, determiner), (“dog”, noun), (“went”, verb), (“to”, preposition), (“the”, determiner), (“park”, noun), (“yesterday”, adverb)].
  • Lemmatization: Lemmatization involves converting different forms of a word to a single word that represents its lexical root. Within our analysis, we may not want to treat the word “phone” differently from the word “phones.” Lemmatization converts both of these words to the same word, “phone,” for purposes of analysis.
    • Example: “geese” becomes “goose,” “cacti” becomes “cactus,” “cats” becomes “cat.”
  • Text Chunking: After identifying the part of speech of each word within a block of text, we can “chunk,” or group, words into meaningful phrases, often “noun phrases.” In other words, we’re trying to split a sentence into meaningful but independent pieces.
    • Example: “Jerry goes to the mall when his parents are out of town” could become “Jerry goes to the mall” and “his parents are out of town,” depending on how the chunks are defined. Chunking serves different purposes, and chunks may be defined at different levels of specificity depending on the goal.
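The first four tasks above can be sketched as a small pipeline. The snippet below is a minimal, dependency-free illustration in Python: the stop-word list, lemma dictionary, and part-of-speech dictionary are tiny stand-ins invented for this example, not real linguistic resources. In practice you would use a dedicated library such as NLTK or spaCy, which ship with full vocabularies and trained taggers.

```python
import re

# Toy resources for illustration only (assumptions, not a real vocabulary):
STOP_WORDS = {"the", "a", "an", "to", "and", "did", "is", "are"}
LEMMAS = {"went": "go", "geese": "goose", "cacti": "cactus", "cats": "cat",
          "phones": "phone"}
POS = {"dog": "noun", "go": "verb", "park": "noun", "yesterday": "adverb"}

def tokenize(text):
    # Split into word tokens, keeping punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    # Drop tokens that appear in the stop-word list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def lemmatize(tokens):
    # Map each token to its lexical root when we know one.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

def pos_tag(tokens):
    # Label each token with a part of speech, if known.
    return [(t, POS.get(t, "unknown")) for t in tokens]

sentence = "The dog went to the park yesterday."
tokens = tokenize(sentence)
# ['The', 'dog', 'went', 'to', 'the', 'park', 'yesterday', '.']
filtered = remove_stop_words(tokens)
# ['dog', 'went', 'park', 'yesterday', '.']
lemmas = lemmatize(filtered)
# ['dog', 'go', 'park', 'yesterday', '.']
tagged = pos_tag(lemmas)
# [('dog', 'noun'), ('go', 'verb'), ('park', 'noun'),
#  ('yesterday', 'adverb'), ('.', 'unknown')]
```

Note that the order of the steps matters: here lemmatization runs before tagging, so “went” is tagged as its root “go.” Real pipelines often tag first, because knowing the part of speech improves lemmatization.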

Once we have pre-processed our text using these NLP techniques, we can start to draw meaningful conclusions from it and, in some circumstances, make predictions. Some examples of NLP use cases include determining the general public sentiment about a company, product, or person; summarizing the most common positive and negative feedback from customers about a product or service; categorizing documents based on the language within them; analyzing whether there is bias in a firm’s performance review process based on the review text, rating, or financial compensation; generating useful automatic responses to customer help requests; creating more sophisticated targeted advertising; or performing market intelligence (gauging public sentiment regarding competitors).

This Natural Language Processing blog is the first in a multi-part series aimed at exploring many of the potential use cases for NLP. Join us for future blog posts regarding sentiment analysis, data analysis and visualization with text data, document categorization, and text summarization.