COVID-Vaccine-Stance-Detection

The idea is to do semantic analysis and detect category from Arabic Tweets. For every tweet , we should classify it as positive or negative or natural regrading to COVID219 Vaccine. The second task to do classification based on cateogry for same tweets.We have 7 types: Personal,News..etc

This serves as the project for the Natural Language Processing course taught to juniors in CUFE for 2023. We ranked 3 in our class

Technologies 📚

Running the Application 🚀

Clone the repository

$ git clone https://github.com/Iten-No-404/COVID-Vaccine-Stance-Detection

Pipeline

Preprocessing:

We defined our own tokenizer splitting on whitespaces & punctuations. Cleaning tokens from punctuation.Removing tashkeel. Replacing variants of all arabic letters to their original parents (ؤ to و and أ to ا and so on). Removing all extra Arabic characters (like المد). Remove any non-arabic charactersDefined our own set of stop words consisting of 2 & 3 characters (so as not to lose much semantics), then removed them.We used NLTK Stemmer to stem our output tokens. We created 11 variants of our dataset, each with either Downsampling or Upsampling or both. Each with either equalised Stance Classes, or equalised Category Classes. All those Variants along with our Original Dataset without any up/downSampling.

Feature Extraction:

Classical features:

Sometimes , we used SMOTE to oversampling features to solve the problem of overfitting

TF-IDF
N-gram

Trainable features/embeddings:

Word2Vec: CBOW using gensim models.
Word2Vec: Skip-Gram using gensim models.

Classical Classifiers:

Multinomial Naive Bayes
SVM Linear
Random Forest
Logistic Regression
Model parameters are optimised using GridSearchCV like Random Forest
10-Fold Cross validation with random forest but it was overfitting

Deep Learning models:

Pytorch’s RNN
AraBert’s Transformer (Using transfer learning to train the last Fully Connected layer)
XML Roberta (Using transfer learning to train the last Fully Connected layer)
Qarib
MarBert

Final models:

For both the stance & category detection, we ended up using two arabert models trained on different variants of the dataset (Train 3 for stance & train 10 for category). They were chosen because they gave the best F1-scores & especially the macro acreage scores. They ended up yielding nearly the best accuracy along with actually learning the dataset and predicting based on the tweet contents, not just predicting a single class & plainly overfitting.

Best Results

Arabert with stance

Train data	f1-score	accuracy	Macro avg
Train 3	0.57,0.90,0.53	0.81	0.66

Arabert with category

Train data	accuracy	Macro avg
Train 10	0.588	.438
Train 11	0.522	0.392

EXtra Results

Features: TF-IDF Features

Classifier	Naive Bayes	Logistic Regression	SVM linear	Random Forest (100)	Random Forest(300)
stance	69.6	67.6	67.4	69.3	71.0
category	56.6	63.08	65.0	65.3	64.5

Grid Search RandomForest for stance 0.67 accuracy , with f1-score :0.36,0.4,0.79 and macro avg :0.52

Features: Bigram

Classifier	Naive Bayes	Logistic Regression	SVM linear	Random Forest (100)
Stance F1-Score	0.3,0.39,0.83	0.33,0.34,0.81	0.32,0.36,0.81	0.41,0.41,0.82
Stance Macro avg	0.51	0.49	0.5	0.52
CategoryMacroavg	0.13	0.31	0.31	0.23

Features: TF-IDF and Bigram

Classifier	Naive Bayes	Logistic Regression	SVM linear	Random Forest (100)	Random Forest (300)
Stance	12.6	80.4	12.6	80.4	84.0
Category	54.5	54.5	54.5	54.5	64.5

Features : CBOW & Skip Gram & TF-IDF

Classifier	Naive Bayes	Logistic Regression	SVM linear	Random Forest (100)	Random Forest(300)
stance	67.7	70.5	12.7	71.0	76.8
category	49.7	57.1996	13.9	12.8	54.555

Arabert Word Embeddings with RNN

Task	f1-score	Macro avg
stance	0.00,0.00,0.85	0.30
category	0.71, the rest are 0.0	0.07

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
dataset		dataset
output		output
stopwords		stopwords
train/runs		train/runs
.gitignore		.gitignore
LSTM.ipynb		LSTM.ipynb
NLP Project Team 1 Document.pdf		NLP Project Team 1 Document.pdf
NLP Project Team 1 Presentation.pptx		NLP Project Team 1 Presentation.pptx
README.md		README.md
arabert.ipynb		arabert.ipynb
feature_extraction.ipynb		feature_extraction.ipynb
nltk.png		nltk.png
preprocessing.ipynb		preprocessing.ipynb
pytorch.png		pytorch.png
rnn.ipynb		rnn.ipynb
roberta.ipynb		roberta.ipynb
xlm_roberta.ipynb		xlm_roberta.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

COVID-Vaccine-Stance-Detection

Technologies 📚

Running the Application 🚀

Pipeline

Preprocessing:

Feature Extraction:

Classical features:

Trainable features/embeddings:

Classical Classifiers:

Deep Learning models:

Final models:

Best Results

Arabert with stance

Arabert with category

EXtra Results

Features: TF-IDF Features

Features: Bigram

Features: TF-IDF and Bigram

Features : CBOW & Skip Gram & TF-IDF

Arabert Word Embeddings with RNN

About

Releases

Packages

Contributors 4

Languages

Iten-No-404/COVID-Vaccine-Stance-Detection

Folders and files

Latest commit

History

Repository files navigation

COVID-Vaccine-Stance-Detection

Technologies 📚

Running the Application 🚀

Pipeline

Preprocessing:

Feature Extraction:

Classical features:

Trainable features/embeddings:

Classical Classifiers:

Deep Learning models:

Final models:

Best Results

Arabert with stance

Arabert with category

EXtra Results

Features: TF-IDF Features

Features: Bigram

Features: TF-IDF and Bigram

Features : CBOW & Skip Gram & TF-IDF

Arabert Word Embeddings with RNN

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages