This project aims to predict emotions from text data using various machine learning and deep learning models. It includes preprocessing steps, different vectorization techniques, and a Streamlit web application for interactive predictions.
- Project Overview
- Dataset
- Data Preprocessing
- Models Used
- Project Structure
- Installation
- Usage
- Streamlit App
- Conclusion
The main goal of this project is to predict emotions such as happiness, anger, and sadness from textual data. We compared several models and vectorization techniques to find the best-performing combination.
The dataset used in this project consists of WhatsApp chat data from Indian users. This data presents unique challenges as many Indian users often type in their native languages using the English script, which includes a lot of slang and colloquial expressions. For example, the word "khatarnaak" (खतरनाक), which means "dangerous" in Hindi, is often used to describe something intense or impressive in a positive way. This linguistic mix makes it challenging for models to accurately interpret and predict emotions.
Despite these challenges, the best-performing model, a Linear SVM trained on Word2Vec features, achieved a validation accuracy of 73%.
The text preprocessing pipeline includes the following steps:
- Convert to lowercase
- Remove whitespace
- Remove newline characters
- Remove ".com" substrings
- Remove URLs
- Remove punctuation
- Remove HTML tags
- Remove emojis
- Handle problematic characters within words (’, iâm, 🙠, and so on)
- Convert acronyms
- Expand contractions
- Handle slang and abbreviations
- Correct spelling
- Lemmatize text
- Discard non-alphabetic characters
- Keep specific parts of speech
- Remove stopwords
graph TD;
A[Input Text] --> B[Convert to lowercase];
B --> C[Remove whitespace];
C --> D[Remove newline characters];
D --> E[Remove .com];
E --> F[Remove URLs];
F --> G[Remove punctuation];
G --> H[Remove HTML tags];
H --> I[Remove emojis];
I --> J[Handle problematic characters];
J --> K[Convert acronyms];
K --> L[Expand contractions];
L --> M[Handle slang and abbreviations];
M --> N[Correct spelling];
N --> O[Lemmatize text];
O --> P[Discard non-alphabetic characters];
P --> Q[Keep specific parts of speech];
Q --> R[Remove stopwords];
R --> S[Preprocessed Text];
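A few of the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's `text_normalization.py`: the function name `normalize`, the regular expressions, and the tiny stopword set are all hypothetical, and only a subset of the steps is covered. In this sketch, URLs and HTML tags are removed before punctuation so that the patterns can still recognize them.

```python
import re
import string

# Tiny illustrative stopword set (an assumption); the project downloads NLTK's list.
STOPWORDS = {"a", "an", "the", "is", "am", "are", "i", "to", "of", "and", "so"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def normalize(text: str) -> str:
    """Apply a simplified subset of the cleaning steps to one message."""
    text = text.lower()                      # convert to lowercase
    text = URL_RE.sub(" ", text)             # remove URLs
    text = TAG_RE.sub(" ", text)             # remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = NON_ALPHA_RE.sub(" ", text)       # discard emojis, digits, other non-letters
    tokens = text.split()                    # split() also collapses whitespace and newlines
    return " ".join(t for t in tokens if t not in STOPWORDS)  # remove stopwords


print(normalize("I am SO happy!!\nSee <b>this</b>: https://example.com/pic 😀"))
# -> happy see this
```

Steps such as spelling correction, lemmatization, and part-of-speech filtering depend on external libraries and are left out of this sketch.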
Ten models were tested for each vectorization method. The best performer for each method was:
- TF-IDF Vectorizer with XGBoost
- Bag of Words (BoW) Vectorizer with XGBoost
- Word2Vec with Linear SVM
- GloVe with Bidirectional LSTM
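Classifiers such as a Linear SVM expect one fixed-length feature vector per message, while Word2Vec produces one vector per word. A common way to bridge the two is to average the word vectors of a message. Whether this project aggregates vectors this way is not stated here, so treat the following as a sketch under that assumption; `TOY_VECTORS` and `document_vector` are hypothetical names, and the 3-dimensional vectors are made up for illustration.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embeddings, invented for this example only.
# A trained Word2Vec model would supply real vectors with many more dimensions.
TOY_VECTORS: Dict[str, List[float]] = {
    "happy": [1.0, 0.0, 0.0],
    "angry": [0.0, 1.0, 0.0],
    "very": [0.0, 0.0, 1.0],
}


def document_vector(tokens: List[str], vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(component) / len(known) for component in zip(*known)]


print(document_vector(["very", "happy", "khatarnaak"], TOY_VECTORS, dim=3))
# -> [0.5, 0.0, 0.5]
```

Out-of-vocabulary tokens such as "khatarnaak" are simply skipped in this sketch, which is one reason transliterated slang is hard for embedding-based models.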
├── Datasets/
│   ├── angriness.csv
│   ├── happiness.csv
│   ├── sadness.csv
├── assets/
│   ├── comment.png
│   ├── streamlit app overview.png
├── streamlit app/
│   ├── pkl files/
│   │   ├── best_xgb_model.pkl
│   │   ├── bow_vectorizer.pkl
│   ├── main.py
│   ├── text_normalization.py
│   ├── requirement.txt
├── emotion_classificatiom.ipynb
├── project report.pdf
├── README.md
- Clone the repository:
git clone https://github.com/vn33/Intensity-Analysis-EmotionClassification.git
- Install the dependencies:
pip install -r "streamlit app/requirement.txt"
- Download necessary NLTK data:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
- Download the spaCy model:
python -m spacy download en_core_web_sm
Run the Streamlit app from the repository root:
streamlit run "streamlit app/main.py"
Enter text into the input box and click "Predict" to see the emotion prediction.
The Streamlit app allows users to input text and get an emotion prediction. It normalizes the input text, converts it to features with a saved vectorizer, and passes those features to a pre-trained model to make the prediction.
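The prediction flow can be sketched as follows. This is not the code in `main.py`; it assumes the `.pkl` files were written with Python's `pickle` module and that the loaded objects expose scikit-learn-style `transform` and `predict` methods. The helper names `load_pickle` and `predict_emotion`, and the `labels` argument that maps class indices to emotion names, are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any, Callable, Sequence


def load_pickle(path: Path) -> Any:
    """Load a pickled object. Only unpickle files from a source you trust."""
    with path.open("rb") as handle:
        return pickle.load(handle)


def predict_emotion(
    text: str,
    normalize: Callable[[str], str],
    vectorizer: Any,
    model: Any,
    labels: Sequence[str],
) -> str:
    """Normalize one message, vectorize it, and map the predicted class to a label."""
    cleaned = normalize(text)
    features = vectorizer.transform([cleaned])   # one-row feature matrix
    class_index = int(model.predict(features)[0])
    return labels[class_index]


# Example wiring (paths taken from the project structure; label order is an assumption):
# base = Path("streamlit app") / "pkl files"
# vectorizer = load_pickle(base / "bow_vectorizer.pkl")
# model = load_pickle(base / "best_xgb_model.pkl")
# print(predict_emotion("so happy today", str.lower, vectorizer, model,
#                       ["angriness", "happiness", "sadness"]))
```

Passing the normalizer, vectorizer, and model in as arguments keeps the function easy to test with stand-in objects and avoids reloading the pickles on every prediction.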
This project demonstrates the use of various text preprocessing techniques and machine learning models to predict emotions from text. Despite the challenges posed by the unique linguistic characteristics of the dataset, our best model achieved a validation accuracy of 73%. The Streamlit app provides an interactive way to test the model and see its predictions in real time.