This project focuses on performing sentiment analysis on hotel reviews. The goal is to collect, clean, and analyze review data to develop a machine learning model that predicts the sentiment (positive, negative, or neutral) of each review. The project involves four key stages: data collection, data cleaning, building a sentiment analysis machine learning model, and visualizing the dataset on a map.
- Project Overview
- Requirements
- Data Collection
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Model Training
- Evaluation
- How to Run
- File Structure
- Contributors
To run this project, you will need the following libraries installed:
- Python 3.x
- Pandas
- NumPy
- Pickle (model loading)
- Scikit-learn
- NLTK (Natural Language Toolkit)
- BeautifulSoup (web scraping)
- WordCloud
- Keras
- SQLAlchemy
- PyMongo
- Matplotlib (data visualization)
You can install all dependencies by running:
```bash
pip install -r requirements.txt
```
Hotel reviews can be collected from various sources, such as:
- Web scraping hotel booking platforms (e.g., TripAdvisor, Yelp.com); see 00 Data Collection for details, and the sketch after this list
- Public datasets from sources like Kaggle (Booking hotel reviews dataset).
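A minimal scraping sketch is shown below. The URL and CSS selectors are hypothetical placeholders, and requests is assumed alongside BeautifulSoup; the actual selectors used in 00 Data Collection depend on the target site's markup (and its terms of service).

```python
# Minimal scraping sketch (hypothetical URL and CSS selectors).
import requests
from bs4 import BeautifulSoup

url = "https://example.com/hotel-reviews"  # placeholder, not a real endpoint
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Assume each review sits in a <div class="review"> with a nested <p class="text">
reviews = [div.get_text(strip=True) for div in soup.select("div.review p.text")]
print(f"Scraped {len(reviews)} reviews")
```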
Raw data may contain noise, missing values, or irrelevant information. The data_cleaning.py script handles data preprocessing (a minimal sketch follows the list), which includes:
- Removing duplicates
- Filling or removing missing values
- Normalizing text (removing punctuation, converting to lowercase)
- Tokenization, stopword removal, and lemmatization using NLTK or spaCy
- Encoding sentiment labels (e.g., Positive = 1, Negative = 0)
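The following is a minimal cleaning sketch along these lines, assuming a pandas DataFrame with `review` and `sentiment` columns (the actual column names and steps in data_cleaning.py may differ):

```python
# Sketch of the cleaning steps above using pandas + NLTK
# (column names such as "review" and "sentiment" are assumptions).
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # strip punctuation and digits
    tokens = text.split()                    # simple whitespace tokenization
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

df = pd.read_csv("reviews.csv")              # assumed input file
df = df.drop_duplicates().dropna(subset=["review"])
df["clean_review"] = df["review"].apply(clean_text)
df["label"] = df["sentiment"].map({"Positive": 1, "Negative": 0})
```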
For sentiment analysis, several machine learning models can be used, including:
- Logistic Regression
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
The model_training.py script implements the training workflow. It includes:
- Vectorization: Use TF-IDF or Word2Vec for text vectorization.
- Model Training: Train the machine learning model using labeled data.
- Hyperparameter Tuning: Apply GridSearchCV or RandomizedSearchCV to optimize hyperparameters. Example:
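Below is a minimal training sketch combining TF-IDF vectorization, Logistic Regression, and GridSearchCV. It assumes the cleaned data from the previous step with `clean_review` and `label` columns; the pipeline, parameter grid, and file names are illustrative, not the exact settings in model_training.py.

```python
# Minimal training sketch: TF-IDF + Logistic Regression with GridSearchCV
# (column names, file names, and parameter grid are illustrative assumptions).
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("clean_reviews.csv")        # assumed output of the cleaning step
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_review"], df["label"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)

# Persist the fitted pipeline so it can later be loaded with pickle
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```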
After training, the model is evaluated using the following metrics (a short evaluation sketch follows the list):
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
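A short evaluation sketch, reusing the fitted `search` object and the test split from the training sketch above:

```python
# Evaluate on the held-out test set with the metrics listed above
# (assumes `search`, `X_test`, and `y_test` from the training sketch).
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

y_pred = search.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```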
Contributors: Hugo Villanueva ([email protected]). Feel free to open an issue or contribute to the project!