Published in · 8 min read · Apr 15, 2020
Text is at the heart of how we communicate with others, especially with companies online. Every type of communication, whether a chat message, a social media post, or a review of a product or service, contains potentially relevant and useful information that companies need to capture and understand to improve their results.
Capturing the information is not the hard part. The real pain lies in understanding what is being said, and in doing so at scale.
Humans understand text with the help of their knowledge of the language and of the context in which something is said. Machines cannot rely on the same techniques.
So, how do giant tech companies like YouTube, Twitter, Facebook, and so on know what you might prefer? How does an online bot know what you need to know? Amazing, isn’t it?
The magic behind this comes from the different techniques provided by Natural Language Processing (NLP), and Semantic Analysis happens to be one of them.
Semantic Analysis describes the process of understanding natural language — the way humans can communicate with meaning and context.
Semantic analysis of natural language content starts with reading all the words in the material to capture the meaning of the text. It identifies the text elements and assigns them to their logical and grammatical roles.
It analyzes the context in the surrounding text, and it examines the text structure to accurately disambiguate the proper meaning of words that have more than one definition.
In this project, we create classifiers to classify positive and negative reviews using various machine learning techniques.
Now, you may ask how this is going to be useful for companies like Amazon. And even if it is helpful to them, why use Machine Learning?
To answer these questions: classifiers of this type let companies know which of their products or services are doing well and which need improvement. Also, these companies receive reviews in volumes that are very difficult to analyze manually, without ML techniques.
1. Basic Understanding of Python and Statistics.
2. Familiarity with Anaconda and Jupyter Notebook. You can also use other platforms like Colab or Kaggle.
We will be using a freely available dataset from Kaggle. This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012.
Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories. You can download the dataset here.
Exploratory Data Analysis (EDA) and Preprocessing
Exploratory data analysis is the process of analyzing data sets to summarize their main characteristics, often with visualization techniques.
It is essential to explore the dataset once before the actual implementation, to be aware of the existing attributes, their relationships, and their nature. Let’s explore our dataset as below:
"""
Importing necessary packages
"""
# for large and multidimensional arrays
import numpy as np
import re
# for data manipulation and analysis
import pandas as pd
# Natural Language Processing toolkit
import nltk
# stopwords corpus
from nltk.corpus import stopwords
# Stemmer
from nltk.stem import PorterStemmer
# for Bag of Words Vector
from sklearn.feature_extraction.text import CountVectorizer
# for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# for Word2Vec
from gensim.models import word2vec

"""
Loading Datasets
"""
df = pd.read_csv("amazon-fine-food-reviews/Reviews.csv")

"""
Exploratory Data Analysis
"""
print(df.head(10))  # see the first 10 rows of the dataset
print(df.columns)  # list all the columns
print(df[df["Score"] > 3])  # see the rows with a positive score
Data preprocessing transforms raw data into a form that is more useful and efficient for analysis. Here, we will perform the following preprocessing tasks:
1. Data Cleansing: Remove duplicate entries
Reviews might contain duplicate entries. So, we need to remove the duplicate entries so that we get unbiased data for Analysis.
final_df = df.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"])
2. Helpfulness numerator should never be greater than helpfulness denominator
The helpfulness numerator is the number of users who found the review helpful. The helpfulness denominator is the number of users who indicated whether they found the review helpful or not, so the numerator can be at most equal to the denominator.
final = final_df[final_df["HelpfulnessNumerator"] <= final_df["HelpfulnessDenominator"]]
final_X = final["Text"]
final_Y = final["Score"]
3. Stemming and removing stopwords
Stemming is the process of reducing a word to its base/root form. This helps reduce the vector dimension, because similar words are no longer treated as separate features. For example: cats to cat, playing to play, etc.
Stopword removal is the process of removing unnecessary words from the text which, when removed, do not change the sentiment of the text. For example: Ball is green => Ball green
tmp = []
snow_stemmer = nltk.stem.SnowballStemmer(language="english")
# build the stopword set once (needs nltk.download("stopwords") the first time)
stop_words = set(stopwords.words("english"))
clean = re.compile("<.*?>")

for sentence in final_X:
    sentence = sentence.lower()  # converting the sentence to lowercase
    sentence = re.sub(clean, " ", sentence)  # remove html tags
    sentence = re.sub(r"[?|!|\'|\"|#]", r"", sentence)
    sentence = re.sub(r"[.|,|:|(|)|\|/]", r" ", sentence)  # removing punctuation
    # removing stopwords and then stemming the result
    words = [snow_stemmer.stem(word) for word in sentence.split()
             if word not in stop_words]
    tmp.append(words)
final_X = tmp

# preparing sentences from list of words
sent = []
for row in final_X:
    # join with spaces so the words do not run together
    sent.append(" ".join(row))
final_X = sent
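One thing to keep in mind: the confusion matrix later in this article uses two class labels, "negative" and "positive", while final_Y still holds the raw 1 to 5 star Score. The conversion step is not shown here, so the following is only a minimal sketch of one common approach, under my own assumptions: reviews with Score > 3 count as positive (matching the EDA filter above), the rest as negative, and 3-star reviews are dropped as neutral. The toy DataFrame and the names toy, toy_X and toy_Y are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the filtered review table
toy = pd.DataFrame({
    "Text": ["great taste", "awful", "it is okay", "love it"],
    "Score": [5, 1, 3, 4],
})

# Assumption: 3-star reviews are treated as neutral and dropped
toy = toy[toy["Score"] != 3]

# Score > 3 becomes "positive", everything else "negative"
toy_X = toy["Text"]
toy_Y = toy["Score"].apply(lambda s: "positive" if s > 3 else "negative")

print(toy_Y.tolist())  # ['positive', 'negative', 'positive']
```

If you filter rows like this, filter the text and the labels together so that they stay aligned.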
Encoding Technique (Bag of Words)
After EDA and preprocessing, our reviews contain only clean text. This text still needs to be encoded into numerical patterns, i.e. vectors, because machines cannot work with words directly.
In this tutorial, we will be using a bag of words encoding technique where we count the frequency of a word that appears in each document and prepare a dictionary.
But in practice, we use a binary bag of words in which we do not count the frequency of the word. Instead, we place ‘1’ if it appears in the review or ‘0’ otherwise. Here, we will use CountVectorizer, a class provided by scikit-learn, to perform BOW.
count_vect = CountVectorizer(max_features=5000, binary=True)  # binary=True stores 1/0 instead of counts
binary_bow_data = count_vect.fit_transform(final_X)
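To make the difference between a count-based and a binary bag of words concrete, here is a small sketch on a made-up two-review corpus. The sentences and variable names are hypothetical and only meant for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good coffee", "bad coffee"]  # hypothetical mini corpus

count_bow = CountVectorizer()              # counts every occurrence
binary_bow = CountVectorizer(binary=True)  # records presence only

counts = count_bow.fit_transform(docs).toarray()
flags = binary_bow.fit_transform(docs).toarray()

# Column order of the matrices, recovered from the learned vocabulary
vocab = sorted(count_bow.vocabulary_, key=count_bow.vocabulary_.get)

print(vocab)   # ['bad', 'coffee', 'good']
print(counts)  # rows: [0 1 2] and [1 1 0]
print(flags)   # rows: [0 1 1] and [1 1 0]
```

The word "good" appears twice in the first review, so the count version stores 2 while the binary version stores 1.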
There is a disadvantage of using unigram BOW: each word is counted on its own, so the word order and the context carried by neighbouring words are lost.
To tackle this problem, we use the bigram/N-gram technique. In unigram, we take one word at a time, and in bigram, we consider two words at a time. Similarly, in N-gram, we take ’N’ words at a time.
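As a rough illustration of why word order matters, the sketch below encodes two made-up reviews with opposite meaning. This may not be the example originally intended here. It relies on the ngram_range parameter of CountVectorizer, and the sentences are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, opposite meaning (hypothetical reviews)
docs = ["taste is good not bad", "taste is bad not good"]

unigram = CountVectorizer(ngram_range=(1, 1))  # one word at a time
bigram = CountVectorizer(ngram_range=(2, 2))   # two words at a time

uni = unigram.fit_transform(docs).toarray()
bi = bigram.fit_transform(docs).toarray()

print((uni[0] == uni[1]).all())  # True: unigram BOW cannot tell them apart
print((bi[0] == bi[1]).all())    # False: bigrams keep some word order
print(sorted(bigram.vocabulary_))
```

With unigrams both reviews get the same vector, while bigrams such as "not bad" and "not good" separate them. Passing ngram_range=(1, 2) would keep unigrams and bigrams together.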
To learn more on Word Embedding techniques, read this article
Scaling Data
We need to scale the independent features in the dataset to a common range to handle highly varying magnitudes, values, or units.
Here we use the “StandardScaler” class of scikit-learn, which assumes your data is normally distributed within each feature and scales it so that the distribution is centered around mean 0 with a standard deviation of 1. Because the bag-of-words matrix is sparse, we pass with_mean=False, which skips the centering step and only scales each feature to unit variance.
from sklearn.preprocessing import StandardScaler
final_bow_np = StandardScaler(with_mean=False).fit_transform(binary_bow_data)
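If you are wondering why with_mean=False is there, the small sketch below may help. It uses a hypothetical 4x2 sparse matrix rather than the real review data, and shows that StandardScaler scales a sparse matrix without centering it, while asking for centering raises a ValueError.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# Hypothetical binary bag-of-words matrix: 4 reviews, 2 words
toy = csr_matrix(np.array([[1.0, 0.0],
                           [0.0, 0.0],
                           [1.0, 1.0],
                           [0.0, 0.0]]))

# Scale each column to unit variance without subtracting the mean
scaled = StandardScaler(with_mean=False).fit_transform(toy)
print(issparse(scaled))   # the result stays sparse
print(scaled.toarray())

# Centering would turn the zeros into non-zeros, so scikit-learn refuses
try:
    StandardScaler(with_mean=True).fit_transform(toy)
    centering_failed = False
except ValueError:
    centering_failed = True
print(centering_failed)
```

Zero entries stay zero after scaling, which is what keeps the matrix sparse and memory-friendly.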
Train-Test Split
As we have already sorted our data, we now use 70% of our data set for the training set with cross-validation and the remaining 30% for testing.
I will leave this to you. Hint: use “final_bow_np” as X and “final_Y” (the “Score” column) as Y.
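If you get stuck, here is a minimal sketch of a 70/30 split with scikit-learn's train_test_split, shown on made-up arrays instead of the real data. Passing shuffle=False keeps the original row order, which is an assumption on my part based on the remark that the data is already sorted. Drop it if you prefer a random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for final_bow_np and the labels
X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.array([0, 1] * 5)

# First 70% of the rows for training, last 30% for testing
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=False
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

train_test_split also accepts sparse matrices, so the same call should work on the scaled bag-of-words data.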
K-NN Algorithm
K-Nearest Neighbors (K-NN) is a supervised machine learning algorithm used for both classification and regression problems.
It fundamentally depends on a distance metric. The better that metric reflects label similarity, the better the classifier will be.
Wait a minute, what is Supervised Machine Learning?
Machine Learning is broadly classified into two groups:
- Supervised Machine Learning
- Unsupervised Machine Learning
Supervised Machine Learning is the technique of learning a function from labeled data in order to determine the appropriate result for unlabeled data.
Supervised Machine Learning is generally used to solve two basic problems:
- Classification problem: In classification, there is a fixed, pre-defined number of classes and, given an example, we need to identify which class it belongs to. The result is always a discrete value.
- Regression problem: In regression, we try to establish a relationship between dependent and independent variables. The independent variables are the features, while the dependent variable is the output we want to predict and is continuous.
Note: Since this is not a tutorial series on Machine Learning, I have only discussed a few things to know as we get started. To understand more, you can review this Coursera foundation course on Machine Learning.
Applying K-NN algorithm
First of all, we will find the optimal value of “k” using Cross-Validation.
Cross-Validation is a resampling procedure used in machine learning to estimate how well a model performs on unseen data, which makes it useful for choosing hyperparameters such as “k”.
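The search for “k” could look roughly like the sketch below. It is not the exact code behind this article: it uses 10-fold cross_val_score on a synthetic dataset from make_classification, and the candidate range of odd values from 1 to 19 is my own assumption. On the real data you would run it on the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

candidate_ks = list(range(1, 20, 2))  # odd values avoid ties in a two-class vote
cv_scores = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 10 folds for this value of k
    scores = cross_val_score(knn, X_demo, y_demo, cv=10, scoring="accuracy")
    cv_scores.append(scores.mean())

# pick the k with the highest mean cross-validated accuracy
optimal_k = candidate_ks[int(np.argmax(cv_scores))]
print("optimal k:", optimal_k)
```

Plotting cv_scores against candidate_ks is a quick way to see how sensitive the model is to the choice of k.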
K-NN with optimal K
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train, Y_train)
pred = knn.predict(X_test)  # predict labels for the test set
Model Evaluation
We evaluate our classification model using two metrics:
1. Confusion Matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

plt.figure()
cm = confusion_matrix(Y_test, pred)
class_label = ["negative", "positive"]
df_cm_test = pd.DataFrame(cm, index=class_label, columns=class_label)
sns.heatmap(df_cm_test, annot=True, fmt="d")
plt.title("Confusion Matrix for Test Data")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
2. Accuracy
from sklearn.metrics import accuracy_score
print("The accuracy for the model with BOW encoding is", round(accuracy_score(Y_test, pred), 4))
The accuracy of the model is not very satisfactory, as this was only for learning purposes. We can use other embedding techniques such as Word2Vec or TF-IDF and compare the results.
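As a starting point for that comparison, here is a minimal sketch of TF-IDF encoding with the TfidfVectorizer imported at the top of this article. The three short reviews are made up. On the real data you would pass the preprocessed final_X instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good coffee", "bad coffee", "coffee beans"]  # hypothetical reviews

tfidf_vect = TfidfVectorizer(max_features=5000)
tfidf_data = tfidf_vect.fit_transform(docs)

print(tfidf_data.shape)  # (3, 4): 3 reviews, 4 distinct words

# Weights of the two words in the first review
vocab = tfidf_vect.vocabulary_
row0 = tfidf_data[0].toarray().ravel()
print(row0[vocab["coffee"]], row0[vocab["good"]])
```

"coffee" appears in every review, so it gets a lower weight than "good", which appears in only one. This down-weighting of common words is the main difference from plain bag of words.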
You can view the complete code here.
But I recommend that you try it on your own, run into errors, and search for solutions.
Thank you for taking the time to read. Stay home and stay safe in this pandemic.
Keep reading.