Text classification of different datasets using Machine Learning and Natural Language Processing techniques

ajimenezjulio/text_classification

A Journey through Machine Learning

Text classification is performed with the following machine learning techniques:

  • Logistic Regression
  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis
  • Random Forest
  • Multinomial Naive Bayes
  • Support Vector Machine (SVM)
  • AdaBoost
  • Extreme Gradient Boosting (XGBoost)
  • Deep Network (CNN - RNN)
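
To make one of the listed techniques concrete, here is a minimal sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing, written in plain Python. It is not taken from this repository: the class name `TinyMultinomialNB`, the toy training documents and the `spam`/`ham` labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes (hypothetical helper, not the repo's code)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one class label per document
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Start from the log prior of the class
            score = math.log(count / n_docs)
            total = sum(self.word_counts[label].values())
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # words never seen in training are ignored
                # Add-one smoothed log likelihood of the word given the class
                smoothed = (self.word_counts[label][tok] + 1) / (total + len(self.vocab))
                score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for this example
train_docs = [
    ["win", "free", "prize", "now"],
    ["free", "offer", "click", "now"],
    ["meeting", "agenda", "for", "monday"],
    ["please", "review", "the", "agenda"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
prediction = model.predict(["free", "prize"])
```

In practice a library implementation would normally be preferred; the sketch only shows the counting and scoring that the technique rests on.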

Additionally, the datasets were pre-processed following common NLP practice, which covers:

  • Standard punctuation removal
  • Tokenisation
  • Lowercasing
  • Stop-word removal
  • Threshold words
  • Stemming
  • Lemmatisation
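
Several of the steps above can be sketched in dependency-free Python as follows. This is an illustration, not the repository's pipeline: the stop-word list is a tiny hypothetical one, the function names `clean` and `threshold` are invented here, and "threshold words" is assumed to mean dropping words that occur fewer than a minimum number of times in the corpus. Stemming and lemmatisation are left out because they usually rely on a library such as NLTK.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "this"}


def clean(text):
    """Remove punctuation, lowercase, tokenise on whitespace, drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


def threshold(docs, min_count=2):
    """Keep only words seen at least min_count times across all documents.

    This is an assumed reading of 'threshold words'.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]


# Toy corpus, invented for this example
corpus = [
    "This movie is GREAT, truly great!",
    "A great cast and a dull plot.",
    "The plot is dull.",
]
cleaned = [clean(doc) for doc in corpus]
filtered = threshold(cleaned, min_count=2)
```

With this toy corpus, `cleaned[0]` is `["movie", "great", "truly", "great"]`, and after thresholding only `great`, `dull` and `plot` survive.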

The pre-processed files were vectorised in three different ways:

  • Tf-Idf
  • Word2Vec
  • GloVe
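
As a rough illustration of the first option, the sketch below computes Tf-Idf weights by hand using the textbook definitions tf = term count / document length and idf = log(N / df). The function name `tfidf` and the toy documents are assumptions made for this example. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation by default, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {
                term: (count / len(doc)) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return vectors


# Toy tokenised documents, invented for this example
docs = [
    ["great", "movie", "great"],
    ["dull", "movie"],
]
vectors = tfidf(docs)
```

Here `movie` appears in every document, so its idf is log(1) = 0 and its weight vanishes, while `great` and `dull` keep positive weights because each is specific to one document.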

All of the models above were applied to the following datasets:

  • Toxic comment
  • Spam email
  • Movie reviews

Datasets can be found at:

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

http://csmining.org/index.php/enron-spam-datasets.html
