A Semantic Analysis of Amazon Fine Food Reviews (2024)


Published in Artificial Intelligence in Plain English · 8 min read · Apr 15, 2020

Text is at the heart of how we communicate with others, especially with companies online. Every type of communication, whether a chat message, a social media post, or a review of a product or service, contains potentially relevant and useful information that companies need to capture and understand in order to improve.

Capturing the information is not the hard part. The real difficulty lies in understanding what is being said and doing so at scale.

Humans understand text using their knowledge of the language and of the context in which it is written. Machines cannot rely on the same techniques.

So how do giant tech companies like YouTube, Twitter, and Facebook know what you might prefer? How does an online bot know what you need to know? Amazing, isn’t it?

The magic behind this comes from the techniques of Natural Language Processing (NLP), and semantic analysis is one of them.

Semantic analysis is the process of understanding natural language, the way humans communicate with meaning and context.

The semantic analysis of natural language content starts by reading all the words in the material to capture the meaning of the text. It identifies the text elements and assigns them their logical and grammatical roles.

It analyzes the context in the surrounding text and examines the text structure to disambiguate words that have more than one meaning.

In this project, we create classifiers to classify positive and negative reviews using various machine learning techniques.

Now, you may ask how this is useful for companies like Amazon. And even if it is helpful to them, why use machine learning?

To answer these questions: a classifier of this type lets a company know which of its products or services are doing well and which need improvement. These companies also receive reviews in volumes far too large to analyze by hand, which is where ML techniques come in.

Prerequisites

1. Basic understanding of Python and statistics.

2. Familiarity with Anaconda and Jupyter Notebook. You can also use other platforms such as Colab or Kaggle.

We will be using a freely available dataset from Kaggle. It consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012.

Reviews include product and user information, ratings, and a plain text review. The dataset also includes reviews from all other Amazon categories. You can download the dataset here.

Exploratory Data Analysis (EDA) and Preprocessing

Exploratory data analysis is the process of analyzing data sets to summarize their main characteristics, often with visualization techniques.

It is essential to explore the dataset before the actual implementation, so that we are aware of the existing attributes, their relationships, and their nature. Let’s explore our dataset as below:

"""
Importing necessary packages
"""
import re

# for large and multidimensional arrays
import numpy as np
# for data manipulation and analysis
import pandas as pd
# Natural Language Toolkit
import nltk
# stopwords corpus (run nltk.download("stopwords") once if it is missing)
from nltk.corpus import stopwords
# stemmer
from nltk.stem import PorterStemmer
# for Bag of Words vectors
from sklearn.feature_extraction.text import CountVectorizer
# for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# for Word2Vec
from gensim.models import word2vec

"""
Loading the dataset
"""
df = pd.read_csv("amazon-fine-food-reviews/Reviews.csv")

"""
Exploratory Data Analysis
"""
print(df.head(10))  # see the first 10 rows of the dataset
print(df.columns)  # list all the columns
print(df[df["Score"] > 3])  # see the rows with a positive score

Data preprocessing transforms raw data into a form that is more useful and efficient for analysis. Here, we will perform the following preprocessing tasks:

1. Data cleansing: remove duplicate entries

Reviews might contain duplicate entries, so we remove them to avoid biasing the analysis.

final_df = df.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"])

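2. Helpfulness numerator should never be greater than helpfulness denominator

The helpfulness numerator is the number of users who found the review helpful. The helpfulness denominator is the number of users who indicated whether they found the review helpful or not. A row whose numerator exceeds its denominator is therefore invalid, and we keep only the rows where the numerator is less than or equal to the denominator.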

final = final_df[final_df["HelpfulnessNumerator"] <= final_df["HelpfulnessDenominator"]]
final_X = final["Text"]
final_Y = final["Score"]
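The confusion matrix later in this article uses two classes (negative and positive), while Score ranges from 1 to 5. The mapping from scores to classes is not shown here, so the following is only one possible sketch. It assumes that scores above 3 are positive (matching the EDA filter above), scores below 3 are negative, and neutral 3-star reviews are dropped. The toy DataFrame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews
toy = pd.DataFrame({
    "Score": [5, 4, 3, 1, 2, 5],
    "Text": ["great", "good", "okay", "awful", "bad", "love it"],
})

# Assumption: 3-star reviews are treated as neutral and dropped
toy = toy[toy["Score"] != 3]

# Assumption: Score > 3 means positive (1), otherwise negative (0)
labels = (toy["Score"] > 3).astype(int)

print(labels.value_counts())
```

If you apply a mapping like this to the real data, filter the reviews and the scores together so that final_X and final_Y stay aligned row by row.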

3. Stemming and removing stopwords

Stemming is the process of converting a word into its base/root form. This helps to reduce the vector dimension, because similar words are mapped to the same root. For example, "cats" becomes "cat" and "playing" becomes "play".

Stopword removal is the process of removing unnecessary words from the text which, when removed, do not change its sentiment. For example: "Ball is green" => "Ball green".

tmp = []
snow_stemmer = nltk.stem.SnowballStemmer(language="english")
stop_words = set(stopwords.words("english"))  # build the lookup set once
clean = re.compile("<.*?>")  # pattern matching HTML tags
for sentence in final_X:
    sentence = sentence.lower()  # convert the sentence to lowercase
    sentence = re.sub(clean, " ", sentence)  # remove HTML tags
    sentence = re.sub(r"[?|!|\'|\"|#]", r"", sentence)
    sentence = re.sub(r"[.|,|:|(|)|\|/]", r" ", sentence)  # remove punctuation
    # remove stopwords and then stem the result
    words = [snow_stemmer.stem(word) for word in sentence.split()
             if word not in stop_words]
    tmp.append(words)
final_X = tmp
# prepare sentences from the lists of words
sent = []
for row in final_X:
    sent.append(" ".join(row))  # join the words with single spaces
final_X = sent

Encoding Technique (Bag of Words)

After EDA and preprocessing, our reviews contain only text, which needs to be encoded as numerical patterns, i.e. vectors, because machines cannot work with words directly.

In this tutorial, we will use the bag of words (BOW) encoding technique, in which we build a dictionary (vocabulary) of words and count how often each word appears in each document.

In practice, we use a binary bag of words, which does not count word frequency: it stores 1 if a word appears in the review and 0 otherwise. Here, we use CountVectorizer, a class provided by scikit-learn, to build the BOW.

# binary=True is a constructor option; fit_transform takes the documents
count_vect = CountVectorizer(max_features=5000, binary=True)
binary_bow_data = count_vect.fit_transform(final_X)
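To make the difference between a counting BOW and a binary BOW concrete, here is a minimal sketch on two made-up reviews. The toy sentences and the variable names are assumptions for illustration and are not part of the original project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews (not taken from the dataset)
toy_reviews = ["good good coffee", "bad coffee"]

count_vect_demo = CountVectorizer()               # counts every occurrence
binary_vect_demo = CountVectorizer(binary=True)   # records only presence (1) or absence (0)

count_matrix = count_vect_demo.fit_transform(toy_reviews).toarray()
binary_matrix = binary_vect_demo.fit_transform(toy_reviews).toarray()

# Columns follow the alphabetically sorted vocabulary
vocabulary = sorted(count_vect_demo.vocabulary_)
print(vocabulary)
print(count_matrix)
print(binary_matrix)
```

The word "good" appears twice in the first review, so the counting version stores 2 in that column while the binary version stores 1.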

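Unigram BOW has a disadvantage: it ignores word order. For example, the reviews "tasty, not bland" and "bland, not tasty" contain exactly the same words, so they get identical unigram vectors even though their sentiments are opposite.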

To tackle this problem, we use the bigram/N-gram technique. A unigram takes one word at a time, a bigram takes two consecutive words at a time, and in general an N-gram takes N consecutive words at a time.
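As a rough illustration of why N-grams help, the sketch below encodes two made-up reviews that use the same words in a different order. It relies on the ngram_range parameter of CountVectorizer; the toy sentences and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews: same words, opposite meaning
toy_reviews = ["tasty not bland", "bland not tasty"]

# Unigrams only: word order is lost
unigram_vect = CountVectorizer(ngram_range=(1, 1))
unigram_matrix = unigram_vect.fit_transform(toy_reviews).toarray()

# Bigrams only: pairs of consecutive words keep some of the order
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vect.fit_transform(toy_reviews).toarray()
bigram_vocabulary = sorted(bigram_vect.vocabulary_)

print(unigram_matrix)      # both rows are identical
print(bigram_vocabulary)
print(bigram_matrix)       # the rows now differ
```

Passing ngram_range=(1, 2) would keep both unigrams and bigrams, at the cost of a larger vocabulary.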

To learn more about word embedding techniques, read this article.

Scaling Data

We need to bring the independent features in the dataset to a comparable scale, so that features with very different magnitudes, values, or units are handled consistently.

Here we use the StandardScaler class of scikit-learn, which standardizes each feature to a standard deviation of 1. By default it also centers each feature around mean 0, but centering is not supported for sparse matrices such as our BOW data, so we pass with_mean=False and only scale by the standard deviation.

from sklearn.preprocessing import StandardScaler

final_bow_np = StandardScaler(with_mean=False).fit_transform(binary_bow_data)
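The effect of with_mean=False can be checked on a small sparse matrix. The sketch below uses a made-up 4x3 binary matrix as a stand-in for the BOW data; the matrix and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# Hypothetical binary BOW matrix: 4 reviews, 3 vocabulary words
toy_bow = csr_matrix(np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
], dtype=np.float64))

# with_mean=False: divide each column by its standard deviation, no centering
scaler = StandardScaler(with_mean=False)
scaled = scaler.fit_transform(toy_bow)

# Each column now has unit standard deviation, and zeros stay zeros
column_std = scaled.toarray().std(axis=0)
print(column_std)
print(issparse(scaled), scaled.nnz == toy_bow.nnz)
```

Because the zero entries are left untouched, the result stays sparse, which matters for a matrix with thousands of columns.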

Train-Test Split

With the data sorted and preprocessed, we now use 70% of our dataset for the training set (with cross-validation) and the remaining 30% for testing.

I will leave this to you. Hint: use “final_bow_np” as X and the “Score” values (final_Y) as Y.
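If you want to check your attempt, one possible sketch of a 70/30 split with scikit-learn's train_test_split is shown below. It uses small toy arrays in place of final_bow_np and final_Y, and it assumes the sorted row order should be preserved, hence shuffle=False. Both choices are assumptions and not necessarily what the complete code does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for final_bow_np (features) and final_Y (labels)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Assumption: keep the existing row order, so the last 30% becomes the test set
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

print(X_train.shape, X_test.shape)
```

Dropping shuffle=False gives a random split instead, in which case passing random_state makes the result reproducible.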

K-NN Algorithm

K-NN (k-nearest neighbors) is a supervised machine learning algorithm used for both classification and regression problems.

It fundamentally depends on a distance metric. The better that metric reflects label similarity, the better the classifier will be.

Wait a minute, what is supervised machine learning?
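To show how the distance metric drives the prediction, here is a minimal from-scratch sketch of K-NN classification with Euclidean distance and majority voting. The function knn_predict and the toy points are hypothetical and only illustrate the idea; the article itself uses scikit-learn's KNeighborsClassifier.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

near_first_cluster = knn_predict(X_toy, y_toy, np.array([0.15, 0.1]))
near_second_cluster = knn_predict(X_toy, y_toy, np.array([0.95, 1.0]))
print(near_first_cluster, near_second_cluster)
```

A query point simply takes the label that is most common among its closest neighbours, which is why a distance that groups similar reviews together matters so much.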

Machine Learning is broadly classified into two groups:

  1. Supervised Machine Learning
  2. Unsupervised Machine Learning

Supervised machine learning is a technique that learns a function from labeled data in order to predict the appropriate result for unlabeled data.

Supervised Machine Learning is generally used to solve two basic problems:

  1. Classification problem: In classification, there is a fixed, predefined number of classes, and given an example we need to identify which class it belongs to. The result is always a discrete value.
  2. Regression problem: In regression, we try to establish a relationship between dependent and independent variables. The independent variables are the features, while the dependent variable is the continuous output we want to predict.

Note: Since this is not a tutorial series on machine learning, I have only covered a few things you need to know to get started. To learn more, you can take Coursera’s foundation course on Machine Learning.

Applying the K-NN algorithm

First of all, we will find the optimal value of “k” using cross-validation.

Cross-validation is a resampling procedure that repeatedly splits the training data into training and validation folds to estimate how well a model generalizes. Here we use it to compare candidate values of k and pick the best one.
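One common way to do this search is sketched below with scikit-learn's cross_val_score. The synthetic data from make_classification, the range of odd k values, and the choice of 10 folds are all assumptions for illustration. In the project you would pass your own X_train and Y_train instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the real training split
X_train, Y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# Assumption: try odd values of k to avoid ties in binary voting
candidate_ks = list(range(1, 30, 2))

cv_scores = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 10 cross-validation folds
    scores = cross_val_score(knn, X_train, Y_train, cv=10, scoring="accuracy")
    cv_scores.append(scores.mean())

# The k with the highest mean cross-validated accuracy
optimal_k = candidate_ks[int(np.argmax(cv_scores))]
print("optimal k:", optimal_k)
```

Plotting cv_scores against candidate_ks is a convenient way to see how sensitive the model is to the choice of k.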

K-NN with optimal K

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train, Y_train)
pred = knn.predict(X_test)

Model Evaluation

We evaluate our classification model using two metrics:

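1. Confusion Matrix

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

# assumes Y_test and pred hold binary labels (negative / positive)
plt.figure()
cm = confusion_matrix(Y_test, pred)
class_label = ["negative", "positive"]
df_cm_test = pd.DataFrame(cm, index=class_label, columns=class_label)
sns.heatmap(df_cm_test, annot=True, fmt="d")
plt.title("Confusion Matrix for Test Data")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()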

2. Accuracy

print("The accuracy for the model with BOW encoding is ",round(accuracy_score(Y_test, pred), 4))
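The two metrics are closely related: accuracy is the share of predictions that fall on the diagonal of the confusion matrix. The sketch below checks this on made-up labels; the arrays are hypothetical and are not results from the model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (0 = negative, 1 = positive)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy = correct predictions / all predictions
accuracy_from_cm = (tp + tn) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```

On imbalanced data such as product reviews, the confusion matrix is worth reading alongside the accuracy, because a model can score well just by predicting the majority class.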

The accuracy of the model is not very satisfactory, as this was only for learning purposes. We can use other encoding techniques such as Word2Vec or TF-IDF and compare the results.
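As a starting point for that comparison, here is a minimal TF-IDF sketch using the TfidfVectorizer imported at the beginning. The toy reviews are made up, and this is only an illustration of the encoding, not a tuned replacement for the BOW step.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews (not taken from the dataset)
toy_reviews = ["great coffee great price", "bad coffee", "great taste"]

tfidf_vect = TfidfVectorizer()
tfidf_matrix = tfidf_vect.fit_transform(toy_reviews).toarray()
vocab_index = tfidf_vect.vocabulary_  # maps each word to its column

# In the second review, the rarer word "bad" outweighs the more common "coffee"
bad_weight = tfidf_matrix[1, vocab_index["bad"]]
coffee_weight = tfidf_matrix[1, vocab_index["coffee"]]
print(bad_weight, coffee_weight)

# Each row is L2-normalized by default
row_norms = np.linalg.norm(tfidf_matrix, axis=1)
print(row_norms)
```

In the pipeline above, you could swap CountVectorizer for TfidfVectorizer and rerun the K-NN steps to compare accuracy.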

You can view the complete code here.

But I recommend that you try it on your own, encounter errors, and search for solutions.

Thank you for taking the time to read this. Stay home and stay safe in this pandemic.

Keep reading.

Author: Lilliana Bartoletti
