GitHub - chenhe95/TwitterEmotion: Predicting the emotion of the tweeter from the words and picture of the tweet.

Repository contents:

liblinear
lightspeed
report
README.txt
common-english-words.txt
common-french-words.txt
common-spanish-words.txt
cv_test_10f.m
english
find_top.py
group.txt
knn.m
load_models.m
main.m
models.mat
models_knn.mat
models_nb.mat
models_svm_pca.mat
nb.m
predict_labels.m
predict_labels_knn.m
predict_labels_nb.m
predict_labels_random_forest.m
predict_labels_rf.m
predict_labels_svm_pca.m
random_forest.m
readme
reshape_img.m
save_models.m
std_word_counts.m
stemmer.c
svm_pca.m
test_pred.m
topwords.csv
useless

Prerequisites:

We include the liblinear and lightspeed toolboxes in our path.

file: startup.m

Word pre-processing:

We remove English, French and Spanish stop words, using lists we found on the web:
- english: http://www.textfixer.com/resources/common-english-words.txt
- spanish/french: http://www.ranks.nl/stopwords/

We remove 'http://' and 'rt'. We also remove one-character words, as well as Unicode characters. Finally, we convert the word dataset into a 0/1 matrix indicating the absence/presence of each word.

files: common-spanish-words.txt, common-french-words.txt, find_top.py, useless, english
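
The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of find_top.py: the function names tokenize and binary_matrix, the tiny stop-word set, and the example tweets are all hypothetical, and we assume that "removing Unicode" means dropping any token that contains a non-ASCII character.

```python
def tokenize(tweet, stop_words):
    """Lowercase a tweet and drop the tokens the README lists as noise."""
    tokens = []
    for tok in tweet.lower().split():
        if tok.startswith("http://") or tok == "rt":
            continue  # links and retweet markers
        if not tok.isascii():
            continue  # assumption: tokens with non-ASCII characters are dropped
        if len(tok) <= 1 or tok in stop_words:
            continue  # one-character words and stop words
        tokens.append(tok)
    return tokens


def binary_matrix(tweets, stop_words):
    """Return (vocab, rows) where rows[i][j] is 1 if word j occurs in tweet i."""
    tokenized = [tokenize(t, stop_words) for t in tweets]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {word: j for j, word in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            row[index[tok]] = 1  # presence only, repeated words still give 1
        rows.append(row)
    return vocab, rows


# Hypothetical stand-in for the English/French/Spanish stop-word files.
stop_words = {"the", "a", "is", "le", "el"}
tweets = ["RT the sun is out http://t.co/x", "Le sun sun ☀ rain"]
vocab, rows = binary_matrix(tweets, stop_words)
# vocab -> ['out', 'rain', 'sun'], rows -> [[1, 0, 1], [0, 1, 1]]
```

Note that the second tweet contains "sun" twice but still gets a single 1 in that column, which is the absence/presence encoding described above.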

Description of the four models:

1). Generative: Naïve Bayes
2). Discriminative: Random Forest
3). Instance based method: KNN
4). Semi-supervised learning: PCA then SVM

1). We trained Naïve Bayes on the words. Without pre-processing the words, this method gave us an accuracy of 80%. With pre-processing, the accuracy rose to 81.42% on the leaderboard.

files: nb.m, predict_labels_nb.m, models_nb.mat
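
As a rough illustration of Naïve Bayes on a 0/1 word matrix, here is a small Python sketch. It is not a port of nb.m: we assume the Bernoulli variant with Laplace smoothing (the README does not say which variant or smoothing was used), and train_nb, predict_nb and the toy data are hypothetical.

```python
import math


def train_nb(X, y):
    """Fit Bernoulli Naive Bayes with Laplace smoothing on 0/1 rows X."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        # P(word j present | class c), smoothed so it is never exactly 0 or 1.
        p = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
             for j in range(n_features)]
        model[c] = (log_prior, p)
    return model


def predict_nb(model, x):
    """Return the class with the highest log posterior for one 0/1 row."""
    best, best_score = None, -math.inf
    for c, (log_prior, p) in model.items():
        score = log_prior
        for xj, pj in zip(x, p):
            score += math.log(pj) if xj else math.log(1 - pj)
        if score > best_score:
            best, best_score = c, score
    return best


# Toy data (assumption): word 0 is typical of class 1, absent from class 0.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0]
model = train_nb(X, y)
label = predict_nb(model, [1, 0, 0])  # word 0 present -> class 1
```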

2). We trained a random forest on the words and obtained 77.76% test accuracy on the leaderboard.

We do not use cross-validation when training random forests. This is supported by http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm:
"In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree."

files: random_forest.m, predict_labels_random_forest.m, models_random_forest.mat
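
The "about one-third" figure in the quote can be checked directly. A bootstrap sample of size n drawn with replacement misses a given case with probability (1 - 1/n)^n, which approaches 1/e (roughly 0.368) as n grows. The Python sketch below simulates this; oob_fraction and its parameters are hypothetical names for illustration and it does not train any trees.

```python
import random


def oob_fraction(n_cases, n_trees, seed=0):
    """Average fraction of cases left out of each bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # One bootstrap sample: n_cases draws with replacement.
        in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
        total += (n_cases - len(in_bag)) / n_cases
    return total / n_trees


frac = oob_fraction(n_cases=1000, n_trees=50)
expected = (1 - 1 / 1000) ** 1000  # close to 1/e
```

Each tree can then be evaluated on the cases missing from its own sample, which is why the out-of-bag error serves as an internal estimate of the test error.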

3). Using KNN, we get a cross-validation error of 40.78%.

files: knn.m, predict_labels_knn.m, models_knn.mat
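
For reference, a cross-validation error for KNN can be computed as in the Python sketch below. It is an illustration only, not a port of knn.m or cv_test_10f.m: the Hamming distance, the value of k, the interleaved fold assignment and the toy data are all assumptions, since the README does not state them.

```python
def hamming(a, b):
    """Number of positions where two 0/1 rows differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: hamming(train_X[i], x))[:k]
    votes = {}
    for i in nearest:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    # Ties are broken in favour of the smallest label.
    return max(sorted(votes), key=lambda label: votes[label])


def cv_error(X, y, k, n_folds):
    """Fraction of rows misclassified when each fold is held out in turn."""
    errors = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_X, train_y, X[i], k) != y[i]:
                errors += 1
    return errors / len(X)


# Toy data (assumption): class 1 uses the first words, class 0 the last ones.
X = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0],
     [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 1]]
y = [1, 0, 1, 0, 1, 0]
err = cv_error(X, y, k=1, n_folds=3)
```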

4). We apply PCA to the words, then train an SVM on the reduced features. We get a 22.87% cross-validation error.

files: svm_pca.m, predict_labels_svm_pca.m, models_svm_pca.mat
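
The PCA-then-SVM pipeline can be sketched as follows in plain Python. This is not the method in svm_pca.m, which relies on the liblinear toolbox: here PCA is done by power iteration with deflation and the linear SVM is trained by subgradient descent on the regularized hinge loss, both swapped in to keep the example self-contained. All function names, the +1/-1 label encoding, the hyperparameters and the toy data are assumptions.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_fit(X, n_components, n_iter=200, seed=0):
    """Return (means, components) via power iteration with deflation."""
    rng = random.Random(seed)
    means = [sum(col) / len(X) for col in zip(*X)]
    R = [[v - m for v, m in zip(row, means)] for row in X]  # centered data
    components = []
    for _ in range(n_components):
        v = [rng.random() for _ in means]
        for _ in range(n_iter):
            scores = [dot(row, v) for row in R]
            # w = R^T R v, one multiplication by the scatter matrix.
            w = [sum(s * row[j] for s, row in zip(scores, R))
                 for j in range(len(v))]
            norm = dot(w, w) ** 0.5
            v = [x / norm for x in w]
        components.append(v)
        # Deflate: remove the direction just found from every row.
        deflated = []
        for row in R:
            s = dot(row, v)
            deflated.append([x - s * vj for x, vj in zip(row, v)])
        R = deflated
    return means, components


def pca_transform(X, means, components):
    centered = [[v - m for v, m in zip(row, means)] for row in X]
    return [[dot(row, c) for c in components] for row in centered]


def svm_fit(Z, y, lam=0.01, eta=0.1, epochs=200):
    """Linear SVM (hinge loss, L2 penalty) by subgradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, label in zip(Z, y):
            violated = label * (dot(w, z) + b) < 1
            w = [wj * (1 - eta * lam) for wj in w]  # regularization step
            if violated:
                w = [wj + eta * label * zj for wj, zj in zip(w, z)]
                b += eta * label
    return w, b


def svm_predict(Z, w, b):
    return [1 if dot(w, z) + b >= 0 else -1 for z in Z]


# Toy 0/1 word matrix (assumption) with labels encoded as +1/-1.
X = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
     [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
y = [1, 1, 1, -1, -1, -1]
means, components = pca_fit(X, n_components=2)
Z = pca_transform(X, means, components)
w, b = svm_fit(Z, y)
predictions = svm_predict(Z, w, b)
```

New tweets must be projected with the same means and components that were fitted on the training data before being passed to svm_predict.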