Project - Final Report

Personalized Hotel Recommender using Sentiment Analysis

Introduction

Travel is one of the most basic human activities, and finding quality lodging is often the hardest part of a trip. According to travel reservation data from 2018, there were around 148 million online bookings, of which 82% were made using an app or website [1]. This highlights how dependent individuals are on the internet and applications to find lodging. Most of these websites and applications offer some measure to assist customers in making hotel reservations, but they frequently do not focus on the user or take their unique needs into consideration. Many studies on hotel recommendation systems focus only on user preferences and ignore hotel reviews, or fail to account for the important differences between users [2]. The main goal of this project is to provide users with personalized hotel recommendations.

The dataset that we are using consists of reviews of various hotels in Europe. It contains features such as Hotel Address, Review Date, Average Score (the average of the ratings for a single hotel), Hotel Name, Reviewer Nationality, Total Number of Reviews, Positive Review, Negative Review, Reviewer Score (the rating given to the hotel by the customer), and Days Since Review.

Problem Definition and Motivation

When reserving a hotel, the most popular approach to making a good decision has been to read the ratings and reviews. The main issue is that each of these evaluations reflects the viewpoint of one individual. A 5-star rating for one person may be a 2-star rating for another. Some people would choose a cheaper hotel over one with better ambiance, while others would prioritize ambiance regardless of cost.

Instead of displaying a generic hotel rating, our goal is to help the customer by making more individualized hotel suggestions that consider their preferences for different factors like pricing, cuisine, atmosphere, etc. In turn, this facilitates a quicker and more effective process for the customer to choose the best hotel depending on their preferences.

Dataset Collection and Cleaning

The project uses the “515k-hotel-reviews-data-in-europe” dataset for training the models. The data was scraped from Booking.com and comprises about 515,000 customer reviews of 1493 upmarket hotels in Europe. The csv file has 17 feature columns in total, including Hotel Address, Review Date, Average Score, Hotel Name, etc. Columns that were not used were removed because they do not contribute significantly to the model.

Additionally, some of the data samples contained only an overall rating rather than a descriptive comment. Since the project concentrates on user preferences rather than the quantitative review value, the missing comment in such samples is replaced with an empty string.

Screenshots of both the original and modified data sets are provided below.

Original Dataset from Kaggle:

Download Original Dataset

Processed Dataset:

Download Cleaned Dataset

Data Preprocessing:

The preprocessing step includes tokenization, part-of-speech tagging, stop-word removal, stemming, noun extraction, and noun filtering. Tokenization mainly focuses on detecting words. The words then receive tags that identify their syntactic role in the sentence (such as verb, adjective, etc.). Stop words like "a", "an", "the", "that", "of", and "from" are deleted. Finally, the prefixes and suffixes of the words are removed and the stem of each word is retained.

The stemming process reduces all derivatives of a word that are not semantically different to a common concept. For example, if a document contains words like "eating" and "eaten", they are all treated as "eat". As we are looking for user preferences, which are usually expressed as nouns, the words that have received the noun tag are extracted. Since the number of these nouns may be very large, unrelated nouns are filtered out.
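
The tokenization, stop-word removal, and stemming steps described above can be illustrated with a minimal standard-library sketch. This is not the project's code: the stop-word list, the suffix list, and the function names are assumptions made for illustration, and the toy stemmer is far cruder than a real stemming library. Part-of-speech tagging and noun extraction are left out because they need a trained tagger.

```python
import re

# Small illustrative stop-word list (assumption); a real pipeline uses a fuller one.
STOP_WORDS = {"a", "an", "the", "that", "of", "from", "and", "was", "were", "is"}

# Suffixes stripped by this toy stemmer, checked in this order (assumption).
SUFFIXES = ("ing", "en", "ed", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The eating and the eaten"))  # ['eat', 'eat']
```

Because the stemmer only strips fixed suffixes, it will mangle some words (for example "garden" becomes "gard"), which is why an established stemmer is preferable in practice.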

Data Analysis:

Analysing the number of reviews based on the reason for reviewer visit

The bar chart shows that most people visit hotels for leisure trips as couples or by themselves. Fewer people came with their family or group, and even fewer came with friends. Out of 515k reviews, there are 100k reviews tagged as business, which means 19% of the reviewers came for business trips. However, we should consider that people who came for leisure trips are usually more likely to have time or are more willing to write reviews, while those who came for business trips may be too busy or simply do not want to write any reviews.

Aspects that attract customers

In the positive reviews, most people appear satisfied with the location, the convenient and easy-to-find restaurants, the friendly and helpful staff, the clean rooms, and the comfortable beds.

Aspects that need improvement

The negative reviews also mention “breakfast”, “room” and “staff” quite often, but here people were probably complaining about rude staff, small rooms, and the coffee provided at breakfast. The air conditioning or the shower system may need improvement, as words like “hot”, “cold”, “air”, “bathroom” and “shower” appear in the word cloud. Hotels may also need to address soundproofing and parking issues.

Results and Discussions:

Supervised Learning

Approach:

To categorize our data into positive, negative, and neutral classes, we use supervised learning methods: Multi-class Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine. For training, validation, and testing, the data was split in a 60:20:20 ratio.

The cleaned and analyzed dataset has uncategorized reviews. To categorize them, we use a library called VADER Sentiment Analyzer, which calculates a sentiment score for each review in the dataset. The sentiment score ranges between -1 and +1. To classify the reviews into negative, neutral, and positive buckets, thresholds are fixed. The classification of the reviews based on the sentiment score is as follows:

Define the class as Negative, if the sentiment score is between -1 and -0.25
Define the class as Neutral, if the sentiment score is between -0.25 and +0.25
Define the class as Positive, if the sentiment score is between +0.25 and +1.
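
The three rules above can be written as a small function. This is a sketch rather than the project's code: the function name is hypothetical, and because the rules do not say which class the boundary values -0.25 and +0.25 belong to, the sketch assumes they fall into Neutral.

```python
def sentiment_class(score, threshold=0.25):
    """Map a sentiment score in [-1, +1] to Negative, Neutral, or Positive.

    Assumption: scores exactly equal to -threshold or +threshold are Neutral.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("sentiment score must lie between -1 and +1")
    if score < -threshold:
        return "Negative"
    if score > threshold:
        return "Positive"
    return "Neutral"


for score in (-0.6, 0.0, 0.8):
    print(score, sentiment_class(score))
```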

The classification produced by the VADER Sentiment Analyzer is assumed to be the ground-truth labels of the data. With the data split 60:20:20 into train, validation, and test samples, supervised learning algorithms are applied to classify the data. The models' performance is then evaluated with metrics such as F1 score, Accuracy, Precision, Recall, and ROC-AUC.
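
A 60:20:20 split can be sketched with the standard library as follows. The function name, the fixed seed, and the choice to shuffle before splitting are assumptions for illustration; the report does not state how the split was actually produced.

```python
import random


def split_60_20_20(samples, seed=0):
    """Shuffle the samples and split them into train, validation, and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding issues
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_60_20_20(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```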

Implementation:

Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine were used to classify the reviews into positive, negative, and neutral classes.
The confusion matrices for all the models are as follows:

Naive Bayes (Confusion Matrix):

Logistic Regression (Confusion Matrix):

Support Vector Machine (Confusion Matrix):

Evaluation Metrics:

Based on the confusion matrices above, Precision, Recall, and F1 Score are computed for all three models and tabulated below:

Naive Bayes	Precision	Recall	F1 Score
Negative	0.62	0.66	0.64
Neutral	0.65	0.54	0.59
Positive	0.70	0.78	0.74

Logistic Regression	Precision	Recall	F1 Score
Negative	0.70	0.67	0.68
Neutral	0.69	0.68	0.69
Positive	0.81	0.84	0.82

Support Vector Machine	Precision	Recall	F1 Score
Negative	0.65	0.71	0.68
Neutral	0.68	0.65	0.66
Positive	0.82	0.78	0.80
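
For reference, per-class Precision, Recall, and F1 Score follow directly from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the function name and the toy counts are hypothetical and are not the project's confusion matrices.

```python
def per_class_metrics(confusion, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical counts for illustration only.
toy_confusion = [
    [8, 1, 1],  # true Negative
    [2, 6, 2],  # true Neutral
    [0, 3, 7],  # true Positive
]
for label, values in per_class_metrics(
    toy_confusion, ["Negative", "Neutral", "Positive"]
).items():
    print(label, [round(v, 2) for v in values])
```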

Based on the Precision, Recall, and F1 Score calculated above, metrics such as Macro Average Precision, Weighted Average Precision, Macro Average Recall, Weighted Average Recall, Macro Average F-1 Score, Weighted Average F-1 Score, and Accuracy are computed and tabulated below.

Evaluation Metrics	Macro Average Precision	Weighted Average Precision	Macro Average Recall	Weighted Average Recall	Macro Average F-1 Score	Weighted Average F-1 Score	Accuracy
Naive Bayes	0.66	0.65	0.66	0.66	0.65	0.65	0.65
Logistic Regression	0.73	0.73	0.73	0.73	0.73	0.73	0.73
Support Vector Machine	0.72	0.72	0.71	0.71	0.71	0.71	0.71
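
As a worked check, the macro average is the unweighted mean of the per-class values, while the weighted average weights each class by its number of samples (its support). The sketch below uses the per-class Naive Bayes precision values reported earlier; the function names are hypothetical, and the supports are made up for illustration because the report does not list the class counts.

```python
def macro_average(values):
    """Unweighted mean of per-class metric values."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean of per-class metric values weighted by class support."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Naive Bayes precision for Negative, Neutral, Positive (from the report).
nb_precision = [0.62, 0.65, 0.70]
print(round(macro_average(nb_precision), 2))  # 0.66, matching the reported value

# Hypothetical class supports, for illustration only.
example_supports = [300, 300, 400]
print(round(weighted_average(nb_precision, example_supports), 2))
```

Small differences between recomputed and reported averages can occur because the per-class values in the tables are already rounded to two decimals.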

From the above metrics, Logistic Regression performs best at classifying the reviews into positive, negative, and neutral, followed by the Support Vector Machine and then Naive Bayes.

Further, to support this conclusion, Receiver Operating Characteristic (ROC) curves were plotted for all the models.

ROC Graphs for our implementations:

Naive Bayes (ROC Curve):

Logistic Regression (ROC Curve):

Support Vector Machine (ROC Curve):

The ROC curves confirm that Logistic Regression performs better than the Support Vector Machine and Naive Bayes.

Unsupervised Learning:

Approach:

The purpose of unsupervised learning is to group nouns with similar semantics into clusters, which are then named manually based on the words they contain. To achieve this, we first extract the nouns in each sentence using an open-source tool called pyABSA, and obtain their confidence factor with the help of the supervised learning algorithm implemented above. Second, the extracted words are translated into word embeddings with the help of BERT to capture contextual information. These word embeddings are used for clustering the words. Once the clusters are generated, we manually name each cluster with an appropriate aspect. Finally, clusters are mapped to each hotel based on the word set and confidence level.

Implementation:

Clustering of words is implemented using K-means, Gaussian Mixture Models (GMM), and Hierarchical Clustering. Because Hierarchical Clustering does not need the number of clusters to be defined at the beginning, we chose to implement it by following the steps below.

Hierarchical Clustering:

Steps:

First, we create a noun vector from each review. Ex: [‘Food’, ‘Noodles’, ‘room’]
After the noun vector is formed, we use the spaCy library to determine how similar each word is to every other word in the vector. A matrix of size (words x words) with similarity values is generated.
The correlation matrix for a single example is shown below:

The dendrogram obtained from Hierarchical Clustering is dissected into clusters.

The threshold value used to split the dendrogram obtained in the previous step is decided based on the number of clusters required.
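
The steps above can be sketched in plain Python as agglomerative clustering over a word-similarity matrix, where merging stops once no pair of clusters is at least as similar as the threshold. This is an illustration under stated assumptions: the similarity values are invented (the project obtained them from spaCy), average linkage is an assumed choice, and the function name is hypothetical.

```python
def agglomerative_clusters(words, similarity, threshold):
    """Average-linkage agglomerative clustering cut at a similarity threshold."""
    clusters = [[i] for i in range(len(words))]  # start with one cluster per word

    def linkage(a, b):
        # Mean pairwise similarity between the members of clusters a and b.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = linkage(clusters[x], clusters[y])
                if sim >= best_sim:
                    best_pair, best_sim = (x, y), sim
        if best_pair is None:  # no pair reaches the threshold: stop merging
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [[words[i] for i in cluster] for cluster in clusters]


# Hypothetical similarity values for four nouns.
words = ["food", "noodles", "room", "bed"]
similarity = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
]
print(agglomerative_clusters(words, similarity, threshold=0.5))
# [['food', 'noodles'], ['room', 'bed']]
```

Lowering the threshold merges more aggressively and yields fewer clusters, which is the knob the last step refers to.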

K-means & GMM:

The clusters obtained from Hierarchical Clustering were fairly accurate. However, when applied to the entire dataset, the model produced a large number of clusters that could not be limited by modifying the threshold value, so it is not suitable for this use case. Therefore, we implemented the K-means clustering algorithm with 8 as the number of clusters and were able to get the desired clusters of words at the expense of accuracy. We implemented the same using GMM (Gaussian Mixture Model) with 8 components, and the results show higher accuracy than K-means. These clustering algorithms were evaluated using metrics such as the Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index to identify the one with better performance.
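
To make the K-means step concrete, here is a minimal sketch of the algorithm on toy 2-D points. It is not the project's implementation: the project clustered BERT word embeddings into 8 clusters, whereas this example uses two clusters of made-up points, hypothetical function names, and a simplified initialization that takes the first k points as starting centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # simplified initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up 2-D points standing in for word embeddings.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```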

Evaluation Metric

Unsupervised Learning

Model	Silhouette Coefficient	Davies-Bouldin Index	Calinski-Harabasz Index
K-means	0.43	0.705	850.62
GMM	0.45	0.68	1050
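
For context on the first column, the Silhouette Coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the nearest other cluster; the reported score is the mean over all points. Higher Silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate better-separated clusters. The sketch below is an illustration on made-up points with a hypothetical function name, not the code used to produce the table.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette score over all points, using Euclidean distance."""
    cluster_ids = set(labels)
    if len(cluster_ids) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lq in zip(points, labels) if lq == c)
            / labels.count(c)
            for c in cluster_ids
            if c != labels[i]
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(round(silhouette_coefficient(points, [0, 0, 1, 1]), 2))  # 0.9
```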

Team Contribution

Sunil Ravilla - Supervised Learning, Dashboard
Prasanth Bathala - Supervised Learning, Evaluation Metrics
Nigam Katta - Unsupervised Learning, Evaluation Metrics
Hemanth Sai Surya Kumar Tammana - Unsupervised Learning, Exploratory Data Analysis
Sahaj Jha - Unsupervised Learning, Data Collection and Cleaning

References

[1] https://www.stratosjets.com/blog/online-travel-statistics/

[2] K. Takuma, J. Yamamoto, S. Kamei and S. Fujita, "A Hotel Recommendation System Based on Reviews: What Do You Attach Importance To?," 2016 Fourth International Symposium on Computing and Networking (CANDAR), 2016, pp. 710-712, doi: 10.1109/CANDAR.2016.0129.

[3] Abro, Sindhu, et al. "Aspect Based Sentimental Analysis of Hotel Reviews: A Comparative Study." Sukkur IBA Journal of Computing and Mathematical Sciences 4.1 (2020): 11-20.

[4] Schouten, Kim, and Flavius Frasincar. "Survey on aspect-level sentiment analysis." IEEE Transactions on Knowledge and Data Engineering 28.3 (2015): 813-830.

[5] Musto, Cataldo, et al. "A multi-criteria recommender system exploiting aspect-based sentiment analysis of users' reviews." Proceedings of the eleventh ACM conference on recommender systems. 2017.

[6] Pathuri, Siva Kumar, N. Anbazhagan, and G. Balaji Prakash. "Feature Based Sentimental Analysis for Prediction of Mobile Reviews Using Hybrid Bag-Boost algorithm." 2020 7th International Conference on Smart Structures and Systems (ICSSS). IEEE, 2020.
