Text classification is performed with the following machine learning techniques:
- Logistic Regression
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Random Forest
- Multinomial Naive Bayes
- Support Vector Machine (SVM)
- AdaBoost
- Extreme Gradient Boosting (XGBoost)
- Deep Network (CNN - RNN)
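To make one of the techniques above concrete, here is a minimal sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing, written in plain Python. It is an illustration only, not the project's implementation: the class `MultinomialNB`, its methods, and the toy training data are all hypothetical, and a real run would more likely rely on an established library.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one class label per document.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior of the class.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[label][tok]
                # Smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data (not taken from the project's datasets).
train_docs = [
    ["win", "free", "prize", "now"],
    ["free", "money", "offer"],
    ["meeting", "at", "noon"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = MultinomialNB().fit(train_docs, train_labels)
prediction = model.predict(["free", "prize"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long documents.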
Additionally, the datasets were pre-processed following standard NLP practice, which covers:
- Standard punctuation removal
- Tokenisation
- Lowercasing
- Stop words removal
- Threshold words
- Stemming
- Lemmatisation
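The first steps in the list above can be sketched in plain Python as follows. This is a hedged illustration, not the project's code: the helper names `clean` and `threshold`, the short `STOP_WORDS` set, and the sample sentences are hypothetical, and "threshold words" is assumed here to mean dropping tokens that occur fewer than `min_count` times in the corpus. Stemming and lemmatisation are left out because they normally depend on an NLP library.

```python
import string
from collections import Counter

# Small illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "this"}


def clean(text):
    """Remove punctuation, tokenise on whitespace, lowercase, drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok.lower() for tok in no_punct.split()]
    return [tok for tok in tokens if tok not in STOP_WORDS]


def threshold(docs, min_count=2):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]


# Made-up sample sentences, for illustration only.
raw = [
    "This movie is GREAT, great fun!",
    "The plot of this movie is weak.",
]
cleaned = [clean(text) for text in raw]
filtered = threshold(cleaned, min_count=2)
```

With these two sentences, only `movie` and `great` survive the frequency threshold, since every other token appears once.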
The pre-processed files were vectorized in three different ways:
- TF-IDF
- Word2Vec
- GloVe
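As an illustration of the first vectorization scheme, the sketch below computes TF-IDF weights in plain Python. It assumes one common variant (term frequency as count divided by document length, inverse document frequency as `log(N / df)`); library implementations usually add smoothing and normalisation, so their numbers will differ. The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumes tf = count / document length and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up tokenised documents, for illustration only.
docs = [
    ["movie", "great", "great"],
    ["movie", "weak", "plot"],
]
vectors = tfidf(docs)
```

A term that appears in every document, such as `movie` here, gets a weight of zero, while terms specific to one document receive the largest weights.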
All of the models above were applied to the following datasets:
- Toxic comment
- Spam email
- Movie reviews
Datasets can be found at:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge