A Semantic Analysis of Amazon Fine Food Reviews (2024)

Published in

Artificial Intelligence in Plain English

8 min read

Apr 15, 2020

Exploratory Data Analysis (EDA) and Preprocessing

Exploratory data analysis is the process of analyzing data sets to summarize their main characteristics, often with visualization techniques.

It is essential to explore a dataset before implementation to understand its attributes, their relationships, and their nature. Let’s explore our dataset as below:

"""
Importing necessary packages
"""
# for large and multidimensional arrays
import numpy as np
# for regular expressions
import re
# for data manipulation and analysis
import pandas as pd
# Natural Language processing tool-kit
import nltk
# stopwords corpus
from nltk.corpus import stopwords
# Stemmer
from nltk.stem import PorterStemmer
# for Bag of Words Vector
from sklearn.feature_extraction.text import CountVectorizer
# For TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# for word2Vec
from gensim.models import word2vec

"""
Loading Datasets
"""
df = pd.read_csv("amazon-fine-food-reviews/Reviews.csv")

"""
Exploratory Data Analysis
"""
print(df.head(10))  # See the first 10 rows of the dataset
print(df.columns)  # List all the columns
print(df[df["Score"] > 3])  # See the reviews with a positive score

Data preprocessing transforms raw data into a form that is more useful and efficient for analysis. Here, we will perform the following preprocessing tasks:

1. Data Cleansing: Remove duplicate entries

Reviews might contain duplicate entries. So, we need to remove the duplicate entries so that we get unbiased data for Analysis.

final_df = df.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"])

2. Helpfulness numerator should never be greater than helpfulness denominator

The helpfulness numerator is the number of users who found the review helpful. The helpfulness denominator is the number of users who indicated whether they found the review helpful or not. A numerator larger than the denominator is therefore invalid, so we keep only the rows where it is less than or equal to the denominator.

final = final_df[final_df["HelpfulnessNumerator"] <= final_df["HelpfulnessDenominator"]]
final_X = final["Text"]
final_Y = final["Score"]
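
As a rough illustration of what the two cleaning steps above do, here is a plain-Python sketch on a few made-up rows. The column names follow the code above; the row values and the names `rows`, `KEY_COLUMNS`, `deduped`, and `valid` are my own assumptions for illustration, not part of the dataset or the original code.

```python
# Made-up rows that reuse the column names from the code above.
rows = [
    {"UserId": "A1", "ProfileName": "ann", "Time": 100, "Text": "Great tea",
     "HelpfulnessNumerator": 1, "HelpfulnessDenominator": 2},
    {"UserId": "A1", "ProfileName": "ann", "Time": 100, "Text": "Great tea",
     "HelpfulnessNumerator": 1, "HelpfulnessDenominator": 2},
    {"UserId": "B2", "ProfileName": "bob", "Time": 200, "Text": "Too sweet",
     "HelpfulnessNumerator": 3, "HelpfulnessDenominator": 2},
    {"UserId": "C3", "ProfileName": "cy", "Time": 300, "Text": "Stale chips",
     "HelpfulnessNumerator": 0, "HelpfulnessDenominator": 1},
]

KEY_COLUMNS = ("UserId", "ProfileName", "Time", "Text")

# Step 1: keep only the first row seen for each combination of key columns.
seen = set()
deduped = []
for row in rows:
    key = tuple(row[col] for col in KEY_COLUMNS)
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Step 2: drop rows that claim more helpful votes than total votes.
valid = [row for row in deduped
         if row["HelpfulnessNumerator"] <= row["HelpfulnessDenominator"]]

print(len(rows), len(deduped), len(valid))   # 4 3 2
print([row["UserId"] for row in valid])      # ['A1', 'C3']
```

Keeping the first occurrence of each duplicate matches the default behaviour of pandas' drop_duplicates.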

3. Stemming and removing stopwords

Stemming is the process of converting a word into its base/root word. This helps to reduce the vector dimension, as similar words are no longer counted separately. For example, cats becomes cat and playing becomes play.

Stopword removal is the process of removing unnecessary words from the text which, when removed, do not change the sentiment of the text. For example: Ball is green => Ball green
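
To make the stopword step concrete, here is a minimal sketch in plain Python. The stopword set `TOY_STOPWORDS` and the helper `remove_stopwords` are hypothetical names I made up for this illustration; the real list from NLTK is much longer.

```python
# A tiny hand-written stopword set (an assumption); NLTK's English list is far longer.
TOY_STOPWORDS = {"is", "the", "a", "an", "and", "to"}

def remove_stopwords(sentence):
    """Lowercase the sentence and drop every word found in TOY_STOPWORDS."""
    return [word for word in sentence.lower().split() if word not in TOY_STOPWORDS]

print(remove_stopwords("Ball is green"))  # ['ball', 'green']
```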

tmp = []
snow_stemmer = nltk.stem.SnowballStemmer(language="english")
stop_words = set(stopwords.words("english"))  # load the stopword list only once
clean = re.compile("<.*?>")  # pattern that matches html tags
for sentence in final_X:
    sentence = sentence.lower()  # converting the sentence to lowercase
    sentence = re.sub(clean, " ", sentence)  # remove html tags
    sentence = re.sub(r"[?!'\"#|]", r"", sentence)  # removing punctuation
    sentence = re.sub(r"[.,:()/]", r" ", sentence)  # replacing punctuation with a space
    # removing stopwords and then stemming the result
    words = [snow_stemmer.stem(word) for word in sentence.split()
             if word not in stop_words]
    tmp.append(words)
final_X = tmp

# preparing sentences from list of words, separated by single spaces
sent = []
for row in final_X:
    sent.append(" ".join(row))
final_X = sent

Encoding Technique (Bag of Words)

After EDA and preprocessing, our reviews contain only text, which needs to be encoded into numerical patterns, i.e. vectors, as machines cannot work with words directly.

In this tutorial, we will be using the bag of words encoding technique, where we build a dictionary of the words in the corpus and count how often each word appears in each document.

But in practice, we use a binary bag of words in which we do not count the frequency of the word. Instead, we place ‘1’ if it appears in the review or ‘0’ otherwise. Here, we will use CountVectorizer, a class provided by scikit-learn, to perform BOW.

count_vect = CountVectorizer(max_features=5000, binary=True)
binary_bow_data = count_vect.fit_transform(final_X)
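
To see the difference between the count version and the binary version, here is a small from-scratch sketch. It does not use scikit-learn, and the two toy documents and the variable names are my own assumptions.

```python
from collections import Counter

# Two made-up documents, already cleaned.
docs = ["good good food", "bad food"]

# The dictionary: every distinct word, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Count BOW: how many times each dictionary word appears in each document.
count_bow = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

# Binary BOW: 1 if the word appears at all, 0 otherwise.
binary_bow = [[1 if count > 0 else 0 for count in row] for row in count_bow]

print(vocab)       # ['bad', 'food', 'good']
print(count_bow)   # [[0, 1, 2], [1, 1, 0]]
print(binary_bow)  # [[0, 1, 1], [1, 1, 0]]
```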

There is a disadvantage of using unigram BOW, which I will explain in the example below:

[Figure: example of the unigram BOW disadvantage]

To tackle this problem, we use the bigram/N-gram technique. In unigram, we take one word at a time, and in bigram, we consider two words at a time. Similarly, in N-gram, we take ’N’ words at a time, as shown below:

[Figures: unigram, bigram, and N-gram examples]
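
Here is a minimal sketch of the N-gram idea in plain Python. The helper `ngrams` and the two reviews are made up by me (they are not the article's own example); they just show one case where unigrams lose information that bigrams keep.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    words = sentence.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

review_a = "food is good not bad"
review_b = "food is bad not good"

# Unigrams: both reviews contain exactly the same words, so a unigram BOW
# gives them the same vector even though they mean opposite things.
print(sorted(ngrams(review_a, 1)) == sorted(ngrams(review_b, 1)))  # True

# Bigrams keep some word order, so the two reviews now differ.
print(ngrams(review_a, 2))  # ['food is', 'is good', 'good not', 'not bad']
print(ngrams(review_b, 2))  # ['food is', 'is bad', 'bad not', 'not good']
```

With scikit-learn, the ngram_range parameter of CountVectorizer, for example ngram_range=(1, 2), asks for unigrams and bigrams together.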

To learn more on Word Embedding techniques, read this article

Scaling Data

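We need to scale the independent features in the dataset to a common range so that features with widely varying magnitudes, values, or units are handled consistently.

Here we use the “StandardScaler” class of scikit-learn, which assumes your data is normally distributed within each feature and, by default, scales it so that the distribution is centered around mean 0 with a standard deviation of 1. Because our BOW data is a sparse matrix, we pass with_mean=False: centering would destroy the sparsity, so each feature is only scaled to unit variance.

from sklearn.preprocessing import StandardScaler

final_bow_np = StandardScaler(with_mean=False).fit_transform(binary_bow_data)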
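
To show what scaling without centering means, here is a rough from-scratch sketch. It is not scikit-learn's implementation, the matrix `X` is made up, and I assume every column has a non-zero standard deviation.

```python
from statistics import pstdev

# Made-up feature matrix: rows are documents, columns are features.
X = [
    [0.0, 2.0],
    [2.0, 0.0],
    [4.0, 0.0],
]

columns = list(zip(*X))
# Population standard deviation of each column (assumed non-zero here).
scales = [pstdev(column) for column in columns]

# Divide by the standard deviation only; the mean is NOT subtracted,
# so every zero entry stays zero and a sparse matrix stays sparse.
X_scaled = [[value / scale for value, scale in zip(row, scales)] for row in X]

for row in X_scaled:
    print(row)
```

After this step each column has a standard deviation of 1, but its mean is generally not 0.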

Train-Test Split

As we have already sorted our data, we now use 70% of our data set for the training set with cross-validation and the remaining 30% for testing.

I will leave this to you guys. Hint: use “final_bow_np” as X and “final_Y” (the “Score” column) as Y.
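
If you get stuck, here is one possible sketch of an ordered 70/30 split in plain Python. The helper `ordered_split` and the toy data are hypothetical, and turning a score above 3 into a "positive" label is my assumption (based on the EDA step and the class labels used later), not something the article spells out.

```python
def ordered_split(X, y, train_percent=70):
    """Split already-sorted data: the first train_percent % is train, the rest is test."""
    cut = len(X) * train_percent // 100
    return X[:cut], X[cut:], y[:cut], y[cut:]

# Stand-ins for the real data: ten fake feature rows and ten made-up scores.
toy_X = [[i] for i in range(10)]
toy_scores = [5, 4, 1, 5, 2, 4, 5, 1, 3, 5]

# Assumption: a score above 3 counts as positive, everything else as negative.
toy_Y = ["positive" if score > 3 else "negative" for score in toy_scores]

X_train, X_test, Y_train, Y_test = ordered_split(toy_X, toy_Y)
print(len(X_train), len(X_test))  # 7 3
print(Y_test)                     # ['negative', 'negative', 'positive']
```

A comparable call in scikit-learn would be train_test_split(X, Y, test_size=0.3, shuffle=False) from sklearn.model_selection, which also keeps the original order.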

K-NN Algorithm

K-NN (k-nearest neighbors) is a supervised machine learning algorithm used for both classification and regression problems.

It fundamentally depends on a distance metric. The better that metric reflects label similarity, the better the classifier will be.
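
To show how the distance metric drives the prediction, here is a minimal from-scratch sketch of K-NN classification with Euclidean distance and a majority vote. The function name `knn_predict` and the toy points are my own; scikit-learn's KNeighborsClassifier is the tool actually used later on.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D points: one cluster per class.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]

print(knn_predict(train_X, train_y, (1, 1)))  # negative
print(knn_predict(train_X, train_y, (5, 4)))  # positive
```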

Wait a minute, what is Supervised Machine Learning?

Machine Learning is broadly classified into two groups:

Supervised Machine Learning
Unsupervised Machine Learning

Supervised Machine Learning is the technique of learning a function from labeled data in order to determine the appropriate result for unlabeled data.

Supervised Machine Learning is generally used to solve two basic problems:

Classification problem: In classification, there is a fixed, pre-defined number of classes and, given an example, we need to identify which class it belongs to. The result is always a discrete value.
Regression problem: In regression, we try to establish a relationship between dependent and independent variables. The independent variables are the features, while the dependent variable is the output we want to predict, and it is continuous.

Note: Since this is not a tutorial series on Machine Learning, I have only discussed the few things you need to know to get started. To understand more, you can review Coursera’s foundation course on Machine Learning.

Applying K-NN algorithm

First of all, we will find the optimal value of “k” using Cross-Validation.

Cross-Validation is a resampling procedure used in machine learning to estimate how well a model performs on unseen data, which lets us pick the best-performing estimator, here the best value of “k”.
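
Here is a rough from-scratch sketch of picking “k” with cross-validation. It is not the article's original code: the helpers `knn_predict` and `cv_accuracy`, the fold layout, the toy data, and the candidate values of k are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(zip((math.dist(x, query) for x in train_X), train_y))
    return Counter(label for _, label in by_distance[:k]).most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=5):
    """Mean K-NN accuracy over n_folds folds; fold f holds out rows f, f+n_folds, ..."""
    fold_scores = []
    for fold in range(n_folds):
        held_out = set(range(fold, len(X), n_folds))
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [label for i, label in enumerate(y) if i not in held_out]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out)
        fold_scores.append(hits / len(held_out))
    return sum(fold_scores) / n_folds

# Made-up 2-D training data: two clusters plus one deliberately mislabeled point.
X_train = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1),
           (5, 5), (5, 6), (6, 5), (6, 6), (5, 7), (7, 5), (6, 7), (7, 6),
           (5.5, 5.5)]
Y_train = [0] * 8 + [1] * 8 + [0]

candidate_ks = [1, 3, 5, 7]
cv_scores = {k: cv_accuracy(X_train, Y_train, k) for k in candidate_ks}

# Pick the first k with the highest mean cross-validated accuracy.
optimal_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("optimal k:", optimal_k)
```

In scikit-learn, cross_val_score from sklearn.model_selection does the fold bookkeeping for you, for example cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, Y_train, cv=10, scoring="accuracy").mean() for each candidate k.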

K-NN with optimal K

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train, Y_train)
pred = knn.predict(X_test)

Model Evaluation

We evaluate our classification model using two metrics:

1. Confusion Matrix

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

plt.figure()
cm = confusion_matrix(Y_test, pred)
class_label = ["negative", "positive"]
df_cm_test = pd.DataFrame(cm, index=class_label, columns=class_label)
sns.heatmap(df_cm_test, annot=True, fmt="d")
plt.title("Confusion Matrix for Test Data")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

2. Accuracy

from sklearn.metrics import accuracy_score

print("The accuracy for the model with BOW encoding is", round(accuracy_score(Y_test, pred), 4))
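
To see how the two metrics relate, here is a small from-scratch sketch that builds a confusion matrix and reads the accuracy off its diagonal. The helpers `confusion_counts` and `accuracy` and the five toy labels are made up for illustration; rows are true labels and columns are predicted labels, the same layout as the heatmap above.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a matrix whose rows are true labels and columns are predicted labels."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[position[truth]][position[guess]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of predictions on the diagonal, i.e. predicted label == true label."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for five reviews.
toy_true = ["negative", "negative", "positive", "positive", "positive"]
toy_pred = ["negative", "positive", "positive", "positive", "negative"]

toy_cm = confusion_counts(toy_true, toy_pred, ["negative", "positive"])
print(toy_cm)            # [[1, 1], [1, 2]]
print(accuracy(toy_cm))  # 0.6
```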

The accuracy of the model is not very satisfactory, as this was only for learning purposes. We can use other embedding techniques such as Word2Vec or TF-IDF and compare the results.

You can view the complete code here.

But I recommend that you try it on your own, encounter errors, and search for solutions.

Thank you for taking the time to read. Stay home and stay safe in this pandemic.
Keep reading.