This series of blog posts focuses on Natural Language Processing (often referred to as NLP). See our last post, “Natural Language Processing: What Is It & How Can It Be Used?”, for an introduction to NLP. This post builds on that article by providing an overview of a potential use case for NLP: sentiment analysis.


NLP Sentiment Analysis

Sentiment analysis focuses on understanding the general sentiment behind a block of text, usually expressed as “positive” or “negative.” 

Sentiment analysis is often used to determine the public feeling regarding a company, person, product, or current event from various types of text: social media posts, product reviews, news articles, voice transcripts, etc. This information can be highly useful in a variety of contexts. For example, a politician may want to determine whether the public feels positively or negatively towards them prior to an election.

Companies may want to measure whether the public feels positively or negatively about their flagship product. Executives may be curious as to whether customers calling customer service centers are generally happy (positive sentiment) or unhappy (negative sentiment). In each of these examples, sentiment analysis can help us understand whether the general tone of each block of text is positive or negative and, if enough data are available, whether sentiment is positive or negative overall and whether it varies across regions or demographics.

As an example of a sentiment analysis implementation, we acquired Amazon Marketplace review data for pet supplies products. We trained models to identify text associated with poorly rated reviews (1 or 2 stars) and highly rated reviews (4 or 5 stars), then tested whether they could estimate the general sentiment of a review based solely on its language. As a test of accuracy, we compared the predicted sentiment (positive or negative) against the actual number of stars associated with each review in our testing dataset. The idea is that, having trained our models on the Amazon review data, we can predict whether a similar pet product review from another platform is positive or negative, which in turn enables us to provide summary-level information about overall sentiment on the product across a variety of platforms.
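The labeling and accuracy check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names are hypothetical, and leaving 3-star reviews out entirely is an assumption, since only 1-2 star and 4-5 star reviews are described here.

```python
def label_from_stars(stars):
    """Map a star rating to a sentiment label."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # assumption: 3-star reviews are excluded


def accuracy(predicted_labels, actual_stars):
    """Share of reviews whose predicted sentiment matches the star-based label."""
    pairs = [(p, label_from_stars(s)) for p, s in zip(predicted_labels, actual_stars)]
    scored = [(p, a) for p, a in pairs if a is not None]
    if not scored:
        raise ValueError("no reviews with a usable star rating")
    return sum(p == a for p, a in scored) / len(scored)


# Illustrative data only: two of the four predictions match the star ratings.
predictions = ["positive", "negative", "positive", "negative"]
stars = [5, 1, 2, 4]
print(accuracy(predictions, stars))  # 0.5
```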

Sentiment Analysis Step 1: Pre-processing Text

Anytime we work with unstructured text data, we must first pre-process the data using NLP pre-processing techniques. A summary of these techniques is available in the “Natural Language Processing: What Is It & How Can It Be Used?” blog post. As part of this sentiment analysis, we tokenized the text into word tokens, removed stop words and punctuation, and lemmatized so that different forms of the same word would be treated the same (e.g., ‘dogs’ and ‘dog’ would be treated as the same word in the analysis). For this analysis, we did not tag parts of speech or chunk the text.
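To make these steps concrete, here is a small, self-contained Python sketch of tokenizing, removing stop words and punctuation, and lemmatizing. It is an illustration under stated assumptions rather than the pipeline used for this analysis: the stop word list and the lemma lookup table are tiny hypothetical stand-ins, and a real project would more likely rely on an NLP library for both.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "my", "it", "this", "and", "of", "on", "for"}

# Hypothetical lemma lookup table standing in for a real lemmatizer.
LEMMAS = {"dogs": "dog", "loves": "love", "loved": "love", "shampoos": "shampoo"}


def preprocess(text):
    """Tokenize into lowercase words, drop stop words and punctuation, lemmatize."""
    # Keeping only letters, digits, and apostrophes discards punctuation.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for token in tokens:
        token = token.strip("'")
        if not token or token in STOP_WORDS:
            continue
        # Fall back to the token itself when no lemma is known.
        cleaned.append(LEMMAS.get(token, token))
    return cleaned


print(preprocess("My dogs love this shampoo, and the shampoos smell great!"))
# ['dog', 'love', 'shampoo', 'shampoo', 'smell', 'great']
```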

Sentiment Analysis Step 2: Classifying Text

We now have, for each review, information that can be partially understood and classified by a computer: the review text, which has been broken into individual words and cleaned, and a category, whether the review is positive (4 or 5 stars) or negative (1 or 2 stars). We can use this information to train a number of classifiers. To do this, we identify the 4,000 most common words used within the review text and train our classifiers using these words. During training, each classifier learns how strongly each of these words is associated with positive (4 or 5 stars) or negative (1 or 2 stars) reviews. Highly rated reviews are assumed to use positive language and poorly rated reviews are assumed to use negative language.
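One common way to turn the most frequent words into classifier inputs is a set of word-presence features, sketched below. The function names are hypothetical, and the choice of true/false presence features (rather than counts or weights) is an assumption, as the exact feature representation is not specified here.

```python
from collections import Counter


def top_words(tokenized_reviews, n=4000):
    """Return the n most common words across all pre-processed reviews."""
    counts = Counter(token for review in tokenized_reviews for token in review)
    return [word for word, _ in counts.most_common(n)]


def review_features(tokens, vocabulary):
    """Mark, for each vocabulary word, whether it appears in this review."""
    present = set(tokens)
    return {word: (word in present) for word in vocabulary}


# Illustrative data only; n is kept small so the output is readable.
reviews = [["dog", "love", "shampoo"], ["dog", "bad", "smell"], ["dog", "love", "pad"]]
vocabulary = top_words(reviews, n=2)
print(vocabulary)                                     # ['dog', 'love']
print(review_features(["dog", "bad"], vocabulary))    # {'dog': True, 'love': False}
```

Each feature dictionary, paired with its positive or negative label, would then serve as one training example for a classifier.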

We chose to implement multiple classifiers and allow the classifiers to “vote” (also referred to as an ensemble modeling technique). In other words, each classifier attempts to classify the block of review text as either “positive” or “negative” based on the training data and casts a “vote.” If the majority of classifiers determine that the review is positive, the final classification is positive. If the majority determine that the review is negative, the final classification is negative. We also report the percentage of classifiers that voted with the majority as a simple measure of confidence.
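The voting step itself is simple enough to sketch directly. The function below is a hypothetical illustration that takes the labels produced by each classifier for one review; it assumes an odd number of classifiers and two possible labels, so ties cannot occur.

```python
from collections import Counter


def majority_vote(votes):
    """Return the winning label and the share of classifiers that voted for it."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Illustrative votes from five classifiers for a single review.
votes = ["negative", "negative", "negative", "positive", "positive"]
print(majority_vote(votes))  # ('negative', 0.6)
```

With five classifiers, three votes for the winning label corresponds to a reported confidence of 60%, and a unanimous vote to 100%.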

This analysis uses five classifiers: Naive Bayes, Logistic Regression, Stochastic Gradient Descent (SGD) Classifier, Support Vector Classifier (SVC), and Linear Support Vector Classifier (LinearSVC). Each of the classifiers performs well on the testing data, with accuracy ranging from 85.7 percent (Naive Bayes) to 88.7 percent (Logistic Regression).

NLP Sentiment Analysis Results

As an example of the power of NLP, see the following Amazon reviews and classified sentiments (positive or negative). NLP was able to accurately determine the sentiment of these reviews (and thousands of others) in seconds, without the need for a staff member to read each one and summarize.

“This is the best smelling dog shampoo I have found so far. Does not have the chemical/over-perfumed smell of other shampoos. Works wonders on my westie mix’s coat—he has fewer itches, and smells good for much longer than with other shampoos.”

Classified Sentiment: Positive

Percentage of Classifiers Voted Positive: 100%

“Great warming outdoor pad for any outside animals you would like to keep warm. The fleece cover is soft and the cord is well protected. I placed it on top of a cushion I had already had in an outdoor pet house. My feral cat loves it in this frigid weather!”

Classified Sentiment: Positive

Percentage of Classifiers Voted Positive: 100%

“We used this product on our dog and it didn’t work for us. I don’t know if it was just our dog or not, but with this type of problem, you should try different products until you find one that works for you. Maybe this might work for your dog.”

Classified Sentiment: Negative

Percentage of Classifiers Voted Negative: 60%

Join us for future NLP blog posts regarding data analysis and visualization with text data, document categorization, and text summarization.