Twitter-Sentiment-Analysis

SC1015 Introduction to Data Science and Artificial Intelligence Mini-Project

About

Mini-project for SC1015 - "Data Science and Artificial Intelligence" focusing on the detection of hate speech in tweets using DS and NLP concepts.

The dataset can be found here: https://www.kaggle.com/datasets/dv1453/twitter-sentiment-analysis-analytics-vidya?select=train_E6oV3lV.csv

Contributors

Bhat Sachin → @Sachin-Bhat
Nalin Sharma → @nalin0503

Motivations/Problem Statement

Motivations: In a democratic context, the right to free speech is deemed essential by many. People wish to voice their opinions on key decisions, which captures the essence of a democracy. This fundamental right can be used to promote collaborative action, spread awareness and foster two-way communication between the citizens of a country and its government. However, we must consider the flip side: derogatory, hurtful and biased opinions on a public platform may be the bad apple that plagues the collective mindset of our societies, affecting them in ways that may be irreversible. Hate speech online can materialise into physical hate crimes, so it is probable that a government may wish to regulate the online presence of its citizens. If this were to happen, what would be the best approach algorithmically?

Problem statement/definition: Effective implementation of Data Science and Natural Language Processing (NLP) concepts to find the best model to detect hate speech in tweets.

How can we effectively detect hate speech in tweets?

Features

Bag-Of-Words
TF-IDF (Term Frequency - Inverse Document Frequency)
Word Embeddings
Word2Vec
Doc2Vec
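
To make the first two features concrete, here is a minimal sketch of what Bag-Of-Words and TF-IDF compute. It is written from scratch with the standard library only, so it is not the code used in the project. The function names, the toy tweets, and the unsmoothed formula idf = log(N / df) are assumptions for illustration; libraries such as scikit-learn use smoothed variants.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) with tf = count / len(doc), idf = log(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    vectors = []
    for doc, row in zip(docs, counts):
        length = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(c / length) * math.log(n_docs / df[j]) for j, c in enumerate(row)]
        )
    return vocab, vectors

# Toy, made-up tokenised tweets (not taken from the dataset).
tweets = [
    ["love", "this", "sunny", "day"],
    ["hate", "this", "traffic"],
    ["love", "love", "coffee"],
]
vocab, bow = bag_of_words(tweets)
_, weights = tf_idf(tweets)
```

Each tweet becomes a fixed-length numeric vector over the shared vocabulary, which is the form the classifiers listed in the next section expect as input.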

Models (Classification)

Support Vector Machine (SVM)
Logistic Regression (LReg)
RandomForest (RF)
XGBoost (XGB)

Conclusion

Overall, XGBoost turned out to be the best model.
This is likely because gradient boosting builds its trees sequentially, with each new tree greedily fitted to correct the errors of the trees before it.
Among the features, Word2Vec performed best, which we attribute to the volume of data points available.
We further optimised the XGBoost model using hyperparameter tuning and grid search.
This gave us better F1 scores.
Furthermore, these predictions, once processed, could be useful for analysing hate crime motives.
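
For readers unfamiliar with the F1 score mentioned above, the following is a minimal from-scratch sketch of how it is computed. The function name, the toy labels, and the convention that label 1 marks a hateful tweet are assumptions for illustration; the project itself may rely on a library implementation.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up, imbalanced labels: only 2 of 10 tweets are positive.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = f1_score(y_true, y_pred)
```

On this toy data the accuracy is 0.8 while the F1 score is only 0.5, which illustrates why F1 tends to be a more informative metric than accuracy when one class is rare.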

Limitations

The program may take a long time to run due to the high number of epochs and the large sample size. You may reduce either one or both if you specifically need faster results, although that would compromise accuracy.
Hyperparameter tuning is not fully automated: the parameter values have to be updated manually between tuning rounds.
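
One way the manual tuning step could be automated is an exhaustive loop over a parameter grid. The sketch below is an illustration under stated assumptions, not the project's code: grid_search and toy_evaluate are hypothetical names, the grid values are made up, and toy_evaluate merely stands in for "train the model with these parameters and return its validation F1 score".

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    # Hypothetical stand-in for training a model and scoring it on validation data.
    depth_penalty = abs(params["max_depth"] - 6) * 0.05
    rate_penalty = abs(params["learning_rate"] - 0.1)
    return 1.0 - depth_penalty - rate_penalty

# Illustrative grid; the values are assumptions, not the ones used in the project.
grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_evaluate)
```

The number of combinations grows multiplicatively with each added parameter, so this approach adds to the running time noted in the first limitation.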

Reflections/Learning Points

Acquired knowledge of how Jupyter Notebook, VSCode and GitHub work together.
Learnt about the functionalities of the programs stated above.
Soft skills - learnt how to present a DSAI project in a structured, articulate manner, training us for our professional capacities in the future.
Performing Data Prep, Cleaning and EDA on a large textual dataset.
Basics of 'text mining' in general.
An understanding of APIs and their documentation.
Natural Language Processing concepts such as text normalisation, wordclouds to represent data, extracting features from tokenised strings, word embeddings and the workings of the various models as stated in previous sections.
Computation of F1 Scores
Use of added modules such as gensim and PorterStemmer to aid our project
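
As an illustration of the text normalisation step listed above, here is a minimal sketch using only the standard library. The function name, the cleaning rules (dropping URLs, user handles, punctuation and very short words) and the example tweet are assumptions and may differ from the rules used in the notebook. Stemming, for example with PorterStemmer, would follow this step and is not shown.

```python
import re

def normalise_tweet(text, min_len=3):
    """Lowercase a tweet, strip noise, and return its remaining tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user handles
    text = re.sub(r"[^a-z#\s]", " ", text)     # keep only letters and hashtags
    # Drop very short tokens, which rarely carry sentiment.
    return [tok for tok in text.split() if len(tok) >= min_len]

# Made-up example tweet.
example = "@user Loving the #weekend!!! Read more at https://example.com :)"
tokens = normalise_tweet(example)
```

The resulting token lists are what feature extractors such as Bag-Of-Words, TF-IDF or word embeddings operate on.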

Contributions

Bhat Sachin → Data Collection, Model Building for SVM and XGBoost, Feature Extraction, Hyperparameter Tuning
Nalin Sharma → Data Preparation, Cleaning, EDA, Model Building for Logistic Regression and RandomForest, Presentation slides and script

References

https://www.washingtonpost.com/nation/2018/11/30/how-online-hate-speech-is-fueling-real-life-violence/
https://time.com/6121915/reddit-international-hate-speech/
https://scikit-learn.org/stable/
https://docs.python.org/3/
https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
https://monkeylearn.com/blog/what-is-tf-idf/
https://medium.com/red-buffer/doc2vec-computing-similarity-between-the-documents-47daf6c828cd
https://www.educative.io/edpresso/what-is-the-f1-score
https://machinelearningmastery.com/gentle-introduction-bag-words-model/
https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4
https://anchormen.nl/blog/digital-transformation/accuracy-precision-recall-models/
https://hackinghate.eu/news/when-online-hate-speech-goes-extreme-the-case-of-hate-crimes/
https://www.kdnuggets.com/2020/12/xgboost-what-when.html
https://cloud.google.com/ai-platform/training/docs/hyperparameter-tuning-overview

General Setup Instructions

Not all modules are available by default in the Anaconda Navigator package environment. To run the project on your system, add conda-forge to your list of channels, for example by running conda config --add channels conda-forge in a terminal.

When a module needs to be installed, please install it by running the following command in a terminal:

conda install name-of-module
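
To find out which modules still need installing before opening the notebook, a small check like the one below may help. This is a sketch under assumptions: the function name is hypothetical and the list of required modules is a guess based on the libraries named in this README. Note that import names and conda package names can differ (for example, sklearn is installed as scikit-learn).

```python
from importlib.util import find_spec

def find_missing(modules):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in modules if find_spec(name) is None]

# Assumed list of import names; adjust it to match the notebook's imports.
required = ["gensim", "nltk", "xgboost", "sklearn"]

missing = find_missing(required)
for name in missing:
    print(f"Missing module: {name}")
```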
