This file contains our analysis of our Stage 2 Results.
Author: Dennis Asamoah Owusu
Team: Hopper (Jon, Dennis, Sai)
Task: Multilingual Emoji Prediction (English and Spanish)
For stage 2, we tried a number of strategies to improve upon our baseline results. The first approach was to use a character-based bidirectional LSTM classifier, as was done by Barbieri et al. [1]. Because testing the models was slow, we were only able to try two settings for the neural network size and embedding size. A network size and embedding size of 128 seemed to work better than 64, so we used the former in our experiments. We wondered how a uni-directional character-based LSTM and a word-based LSTM would fare, so we tried those too. We also tried a character-based LSTM combined with a deep convolutional neural network (CNN), inspired by dos Santos and Gatti's work on using CNNs for sentiment analysis of short texts [2].
For the combined LSTM and CNN, we performed an additional experiment where, for training, we chose an equal number of tweets from each class (a sketch of this balanced sampling appears after the table below). As we had a very dominant class (accounting for roughly 20% of tweets), we wondered whether this could be affecting classifier performance, since the performance of naive Bayes classifiers, for instance, degrades when there is such a dominant class [3]. The difference between using the combined CNN and LSTM on a dataset with equally represented classes and using it on the entire training set was negligible. We also combined the CNN with a bidirectional LSTM. Not too excited by the results of our neural network models, we tried a Bag of Words model with a Linear Support Vector Machine (LSVM) classifier. This turned out to provide the best results. The table below shows the macro F1 scores of the various classifiers. Since we performed 10-fold cross-validation, we show the mean and standard deviation of each classifier's results. We add the results from our baseline, a Bag of Words model with a Bernoulli naive Bayes classifier, for comparison. En and Es stand for English and Spanish respectively.
CLASSIFIER MEAN (En) SDEV (En) MEAN (Es) SDEV (Es)
Word-LSTM 23.27 1.03 N/A N/A
LSTM 27.83 0.47 N/A N/A
Bi-LSTM 28.23 2.14 N/A N/A
CNN + LSTM 29.30 0.36 N/A N/A
CNN + LSTM (Fair)* 29.35 0.44 N/A N/A
CNN + Bi-LSTM 29.29 0.36 16.15 0.49
LSVM 32.73 0.24 17.98 0.31
Baseline 29.10 0.20 16.49 0.42
* Fair means each class was equally represented in the training data.
All LSTMs and CNNs were character-based except for the Word-LSTM.
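The "Fair" training set mentioned above was built by sampling the same number of tweets from every emoji class. The snippet below is a minimal sketch of how such a class-balanced subset can be constructed, assuming the tweets and their labels live in parallel Python lists; the names are placeholders rather than our actual pipeline code.

import random
from collections import defaultdict

def balance_classes(tweets, labels, seed=0):
    """Return a subset in which every class is represented by the same number of tweets."""
    by_class = defaultdict(list)
    for tweet, label in zip(tweets, labels):
        by_class[label].append(tweet)
    # Every class is capped at the size of the rarest class.
    n = min(len(examples) for examples in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label, examples in by_class.items():
        for tweet in rng.sample(examples, n):
            balanced.append((tweet, label))
    rng.shuffle(balanced)
    return balanced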
Another experiment we performed, worth noting, was done to understand how much semantically similar classes such as camera and camera with flash impact misclassification. From our stage 1 results, we noticed that our model confused label 5 (smiling face with sunglasses) with label 1 (smiling face with heart eyes) significantly. We inferred that the model might have trouble discriminating between classes with similar semantics. That inference is supported by the fact that the model also struggles to discriminate between labels 10 (Camera) and 18 (Camera with flash).
To measure the impact of these semantically similar classes on our model, we collapsed the classes that were semantically similar and reran the BOW model. All the "heart" emojis were put into one class; the "smiling" emojis (except the heart-eyes one) were put into another class; the "camera" emojis were put into another class; and each remaining emoji kept its own class. In the end, we had 13 classes. After running the model on the 13 classes, the macro F1 score improved by 6 points, suggesting that the semantic similarity between classes such as Camera and Camera with Flash does have an effect, although the effect was not nearly as significant as we had expected. It is worth noting that after collapsing the semantically similar classes, a lot of tweets were misclassified as the most frequent class. Thus, it may be that the semantic similarity of the classes really does matter, but the gain in performance from collapsing the classes was offset by the fact that we now had one very large class that attracted too many classifications.
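To make the collapsing concrete, below is a minimal sketch of mapping labels onto merged super-classes before retraining. The groupings shown are illustrative, based on the English label ids listed later in this file, and may not match our exact 13-class mapping.

# Illustrative grouping only; the exact membership of our 13 collapsed classes may differ.
HEARTS = {0, 3, 8, 13}      # red heart, two hearts, blue heart, purple heart
SMILING = {5, 6, 16}        # smiling faces other than the heart-eyes one
CAMERAS = {10, 18}          # camera, camera with flash

def collapse_label(label):
    """Map an original emoji label to its collapsed super-class."""
    if label in HEARTS:
        return "hearts"
    if label in SMILING:
        return "smiling"
    if label in CAMERAS:
        return "camera"
    return label            # every other emoji keeps its own class

original_labels = [0, 5, 10, 18, 2]                       # example labels only
collapsed = [collapse_label(y) for y in original_labels]  # ['hearts', 'smiling', 'camera', 'camera', 2]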
One interesting observation about our character-based neural models was that they succeeded in classifying the following tweet, which our Bag of Words model with the Bernoulli naive Bayes classifier misclassified: "Fall in love with yourself first. @user photolaju @ New York, New York". We supposed from our analysis of our BOW model that "photolaju", an important indicator that this tweet should be labelled camera, was not being taken into account, since subsequence information is not well captured by word-level BOW models. Once we used a character-based tokenization, where subsequence information is better captured, as in the neural models, our models successfully classified the tweet into the camera class.
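As a toy illustration of the difference (not our model code), the snippet below contrasts word-level and character-level tokenization of that tweet: at the character level the informative subsequence "photo" inside "photolaju" remains visible to the model.

tweet = "Fall in love with yourself first. @user photolaju @ New York, New York"

word_tokens = tweet.lower().split()   # word-level BOW: "photolaju" is one opaque, rarely seen token
char_tokens = list(tweet.lower())     # character-level input, as fed to the character-based LSTMs

print("photolaju" in word_tokens)        # True, but only as a whole word
print("photo" in "".join(char_tokens))   # True: the informative subsequence survives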
ANALYSIS OF LSVM RESULTS
Since the Bag of Words model with the LSVM classifier provided the best results, we present an analysis of its performance. The general idea of a Bag of Words model for classification is to determine how closely each word in the training data correlates with each class and then to use that knowledge to decide what class to assign to a document based on the words in that document. Thus, if a document contains words that are more closely related to class A than to any other class, then the document is classified as class A. Take the following tweets from our training data.
“I voted! @ Upper Tampa Bay Regional Public Library”
“We Voted. #thebraunfamily #thisonesforyougrandpa…”
“VOTED #NoMoneyTeam @ Tottenville, Staten Island”
“I voted! God bless America ! @ Quail Hollow Golf & Country Club”
“Doing my part today #ivoted #electionday # @ Richmond's First Baptist Church”
In our training data, all the above tweets (documents) were classified as having the American flag emoji. All these tweets (and many other tweets in this class) contain the word "voted", and thus "voted" shares a close relation with this class. While the word "voted" may appear in other classes, it likely does not appear there as frequently as it does in this class. "Voted" in a tweet is therefore a strong suggestion that the tweet may belong to this class. By looking at all the words in each tweet, and at which class each word strongly suggests, a sense of what class the tweet belongs to can be derived.
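The snippet below is a toy sketch of this word-class association idea, counting how often each word co-occurs with each class; the data and class names are placeholders taken loosely from the examples above.

from collections import Counter, defaultdict

# Toy training data: (tweet, class) pairs.
train = [
    ("I voted! @ Upper Tampa Bay Regional Public Library", "US_flag"),
    ("VOTED #NoMoneyTeam @ Tottenville, Staten Island", "US_flag"),
    ("Great films and a BEAUTIFUL night", "smiling"),
]

word_class_counts = defaultdict(Counter)
for tweet, label in train:
    for word in tweet.lower().split():
        word_class_counts[word.strip("!#@.,")][label] += 1

print(word_class_counts["voted"])   # Counter({'US_flag': 2}): "voted" strongly suggests the flag class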
The job of the classifier, then, is to look at the training data, derive these associations between words and classes, and use that knowledge to predict the class of new tweets. Of the classifiers we tried with a Bag of Words model, the one that performed best was the Linear Support Vector Machine (LSVM) classifier. LSVMs are by default binary classifiers. Each document to be classified is represented as a vector based on the features of the document; in our case, the features were the words making up the document (Bag of Words). Each document, turned into a feature vector, is then represented in an n-dimensional vector space, where n is the dimension of the feature vectors.
Having represented these vectors, the Support Vector Machine learns a hyperplane that separates the vectors by class. Ideally, you want all the vectors belonging to the first class to be, say, above the hyperplane and all the vectors belonging to the second class to be below it. Each new document to be classified is represented in the vector space, and its class is determined by whether it falls above or below the plane. The way the separating hyperplane is drawn (e.g. its slope) is determined by the vectors close to it, known as support vectors. Specifically, the hyperplane is drawn so as to maximize the distance between it and these support vectors. This is a simplification of how LSVMs work, but it is sufficient to show that they can be used with Bag of Words features to classify documents into one of two classes. The LSVM binary classifier can be employed in different ways to perform multi-class classification; in our case, we used a one-vs-rest strategy. This means the multi-class problem is broken up into many binary classifications, one per class. Each binary classifier decides whether the input belongs to its class or to one of the rest of the classes. We compare all the results and pick the class with the highest score. (See "Other Helpful References" at the end.)
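Below is a minimal sketch of such a Bag of Words + LSVM pipeline. We assume a scikit-learn-style implementation here purely for illustration (the text above does not name a library), and the training data is a placeholder; note that scikit-learn's LinearSVC applies a one-vs-rest scheme to multi-class data by default.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Placeholder training data; the real system is trained on the full tweet corpus
# and evaluated with 10-fold cross-validation using the macro F1 score.
train_tweets = [
    "I voted! God bless America !",
    "VOTED #NoMoneyTeam",
    "Fun weekend with great people",
    "Good morning Church Flow",
]
train_labels = [11, 11, 5, 5]   # 11 = US flag, 5 = smiling face with smiling eyes

bow_lsvm = make_pipeline(
    CountVectorizer(),   # bag-of-words term counts
    LinearSVC(),         # linear SVM, one binary classifier per class (one-vs-rest)
)
bow_lsvm.fit(train_tweets, train_labels)
print(bow_lsvm.predict(["We Voted. #thebraunfamily"]))   # expected: [11]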
Below is a confusion matrix for the English results (first fold):
0: 8848 610 248 245 42 100 42 59 27 62 36 34 173 7 5 13 2 57 6 6 ∑ = 10622
1: 72 2903 665 368 143 165 143 143 71 69 90 74 0 9 13 28 12 67 38 4 ∑ = 5077
2: 34 828 3171 119 173 125 123 51 26 51 68 57 0 5 23 75 10 61 54 13 ∑ = 5067
3: 46 872 255 933 30 124 52 89 75 99 43 32 0 11 12 21 4 30 12 3 ∑ = 2743
4: 13 381 382 71 1168 31 88 72 12 14 60 27 0 5 9 52 1 21 30 2 ∑ = 2439
5: 29 732 456 202 76 327 100 60 29 69 41 39 0 2 17 17 14 54 16 1 ∑ = 2281
6: 18 553 489 90 133 98 385 77 26 24 35 49 1 1 19 33 6 22 20 3 ∑ = 2082
7: 22 510 221 146 107 57 87 422 14 21 57 24 0 3 9 19 5 48 27 6 ∑ = 1805
8: 33 524 180 346 34 63 70 54 236 44 24 40 0 6 11 10 3 22 10 2 ∑ = 1712
9: 30 505 232 268 38 113 40 41 26 234 22 19 0 8 11 16 4 29 9 3 ∑ = 1648
10: 4 167 157 29 50 30 35 34 14 4 769 13 0 1 6 6 3 10 198 0 ∑ = 1530
11: 14 285 146 40 28 41 57 27 21 6 21 909 0 2 3 14 0 8 11 0 ∑ = 1633
12: 658 12 6 4 2 4 14 2 3 0 0 0 622 0 0 0 0 0 1 0 ∑ = 1328
13: 29 434 160 284 27 49 24 40 48 48 12 20 0 82 8 6 2 10 7 0 ∑ = 1290
14: 7 325 433 92 35 80 81 31 14 25 16 17 0 5 45 20 4 25 17 4 ∑ = 1276
15: 10 245 387 60 142 33 68 41 29 13 28 16 0 1 10 244 2 11 11 0 ∑ = 1351
16: 8 429 383 50 57 97 64 37 10 16 19 27 0 1 12 9 28 24 8 2 ∑ = 1281
17: 10 111 58 21 24 19 20 27 5 4 4 7 0 1 3 1 0 937 2 1 ∑ = 1255
18: 5 174 172 30 62 22 25 49 9 8 423 12 0 1 8 11 2 10 305 3 ∑ = 1331
19: 9 291 460 85 58 75 69 22 15 15 18 17 0 2 19 12 4 16 17 5 ∑ = 1209
The first column has the emoji labels (classes). Label numbers and their corresponding emojis are shown below. The last column shows the total number of tweets in the dataset that correctly belong to each class. Thus, 10,622 of the tweets correctly belong to label 0 and 5,077 tweets correctly belong to label 1. The columns in between the first and the last show how our model classified the tweets. For instance, for label 0, out of the 10,622 tweets belonging to that class, we labelled 8848 as belonging to label 0, 610 as belonging to label 1, 248 as belonging to label 2 and so on.
0 ❤ _red_heart_
1 😍 _smiling_face_with_hearteyes_
2 😂 _face_with_tears_of_joy_
3 💕 _two_hearts_
4 🔥 _fire_
5 😊 _smiling_face_with_smiling_eyes_
6 😎 _smiling_face_with_sunglasses_
7 ✨ _sparkles_
8 💙 _blue_heart_
9 😘 _face_blowing_a_kiss_
10 📷 _camera_
11 🇺🇸 _United_States_
12 ☀ _sun_
13 💜 _purple_heart_
14 😉 _winking_face_
15 💯 _hundred_points_
16 😁 _beaming_face_with_smiling_eyes_
17 🎄 _Christmas_tree_
18 📸 _camera_with_flash_
19 😜 _winking_face_with_tongue_
The most important trend we observe is that labels 0, 1 and 2 perform pretty well in terms of true positives: ~83% for label 0 (8848/10622), ~57% for label 1 (2903/5077) and ~63% for label 2 (3171/5067), while at the same time attracting many false positives from other classes. Take label 5, for instance. 327 tweets are correctly classified as belonging to label 5, yet 732 tweets that should have been classified as label 5 are misclassified as label 1. This trend of misclassifying tweets as label 1 can also be seen in the rows for labels 6, 7, 8, 13, 14, 15, 16 and 19. Similarly, label 2 is incorrectly assigned to many tweets, and the number of misclassified tweets often exceeds the number of tweets correctly classified; see the row for label 6 as an example. We suppose that the size of these three classes contributes to this phenomenon. Labels 1 and 2 are almost twice the size of the fourth most populous class, while label 0 is about four times larger.
Below are some of the tweets that were classified as 1 when they should have been classified as 5.
Hi. I have a light skin friend named Felicia. This is her. Hi Felicia @ Waveland Bowl
It's an 8.1 Blood Orange! - Drinking an Ogre by @user at @user —
Great films and a BEAUTIFUL night (@ Columbus Circle in New York, NY)
#FellowHeadliner The Amazing Julia @ Edc New York Citi Field 2016
Full Set of Lash Extensions! @ Geneva, New York
Below are some tweets that were correctly classified as 5.
I think today is about to be a great day..
I want to thank @user for this awesome bag. Thanks man
Fun weekend with great people @ Glass Bowl
@user sounds good see you there
Good morning Church Flow @ New Mercies
Classification for label 10 fared pretty well: 50% (769/1530). Confusion with labels 0, 1 and 2 was also not as significant (4, 167 and 157 tweets respectively). It is interesting to note, though, that 198 of the label 10 tweets were misclassified as label 18. Thus, the confusion between camera and camera_with_flash persists. Classifications for labels 11, 12 and 17 also fared very well.
Finally, we compared the results of our Bag of Words model with the LSVM against our Bag of Words model with the Bernoulli naive Bayes classifier to see if we could understand why the LSVM performed a few percentage points better. We show the true positives for each label for both models below.
LABEL BOW (Bernoulli NB) LSVM
0 8534 8848
1 2331 2903
2 2696 3171
3 789 933
4 1043 1168
5 243 327
6 334 385
7 339 422
8 219 236
9 286 234
10 667 769
11 919 909
12 697 622
13 68 82
14 46 45
15 173 244
16 55 28
17 882 937
18 153 305
19 6 5
With the exception of labels 9, 11, 12 and 16, where the Bernoulli naive Bayes finds more true positives, and label 14, where performance is nearly equal, the LSVM performs better for each label. We conclude that, for this task and data, the LSVM is a better classifier than the Bernoulli naive Bayes classifier. Below are some tweets for label 18 that the LSVM classified correctly but the Bernoulli naive Bayes did not. We choose label 18 because the percentage difference in performance (in favor of the LSVM) is greatest there.
Social Damage's final west coast tour. :vvntal #Hardcore @ Bridgetown DIY
Different angles to the same goal. by @user @ New…
When iris.apfel speaks...knowledge and wisdom is all you hear so listen up... :@drummondphotog…
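For reference, a per-label true-positive comparison like the one above can be read off the diagonal of each model's confusion matrix. The sketch below assumes scikit-learn-style predictions; the commented-out usage names are hypothetical placeholders.

from sklearn.metrics import confusion_matrix

def true_positives_per_label(y_true, y_pred, labels):
    """The diagonal of the confusion matrix gives the correctly classified tweets per label."""
    return confusion_matrix(y_true, y_pred, labels=labels).diagonal()

# Hypothetical usage, given held-out labels and each model's predictions:
# tp_nb  = true_positives_per_label(y_test, nb_predictions,  labels=list(range(20)))
# tp_svm = true_positives_per_label(y_test, svm_predictions, labels=list(range(20)))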
Below is a confusion matrix for the Spanish results (first fold):
0: 1362 288 62 63 22 19 17 4 7 20 4 2 1 1 0 2 8 0 0 ∑ = 1882
1: 299 690 128 44 41 18 21 7 11 50 12 0 2 1 0 3 11 0 0 ∑ = 1338
2: 109 198 484 13 28 13 18 8 6 15 5 0 1 2 0 0 5 1 2 ∑ = 908
3: 327 164 30 56 12 12 3 2 7 13 3 0 0 0 1 4 7 0 0 ∑ = 641
4: 133 225 78 21 97 21 16 10 15 15 9 0 1 1 0 2 3 0 0 ∑ = 647
5: 152 115 47 18 23 48 10 8 4 6 0 0 1 0 0 1 5 0 0 ∑ = 438
6: 71 65 64 21 24 6 120 9 8 3 2 1 1 0 0 2 0 0 0 ∑ = 397
7: 83 99 81 9 20 9 18 26 4 13 5 0 1 0 0 7 9 0 0 ∑ = 384
8: 98 127 35 18 20 6 13 9 40 8 4 0 1 0 0 4 2 0 0 ∑ = 385
9: 22 70 10 1 9 0 7 0 2 209 1 0 0 0 0 0 0 0 0 ∑ = 331
10: 66 107 58 6 27 5 7 1 9 20 15 0 1 0 1 5 3 0 1 ∑ = 332
11: 126 71 27 22 9 4 4 2 5 9 4 7 0 1 0 3 3 0 0 ∑ = 297
12: 145 74 19 21 10 2 5 1 2 6 1 0 4 0 0 0 6 0 0 ∑ = 296
13: 41 94 63 7 16 5 10 7 2 5 5 0 0 1 0 1 1 0 0 ∑ = 258
14: 132 74 12 27 6 6 4 0 4 2 0 0 0 0 1 2 6 0 0 ∑ = 276
15: 92 54 20 15 13 3 2 6 2 9 1 0 0 0 0 14 11 0 0 ∑ = 242
16: 72 62 54 9 8 4 2 7 2 2 1 2 0 0 0 0 58 0 0 ∑ = 283
17: 154 59 6 17 3 5 2 0 5 5 1 1 0 0 0 2 2 0 0 ∑ = 262
18: 49 101 53 2 23 9 5 3 1 9 2 0 0 1 0 0 3 0 0 ∑ = 261
Similar to the confusion matrix for English, the first column has the emoji labels. Label numbers and their corresponding emojis are shown below. Note that label 19 is missing from the confusion matrix. This is because there was not a single tweet in the training data that had that emoji. The data in the other columns is exactly as described for the English.
0 ❤ _red_heart_
1 😍 _smiling_face_with_hearteyes_
2 😂 _face_with_tears_of_joy_
3 💕 _two_hearts_
4 😊 _smiling_face_with_smiling_eyes_
5 😘 _face_blowing_a_kiss_
6 💪 _flexed_biceps_
7 😉 _winking_face_
8 👌 _OK_hand_
9 🇪🇸 _Spain_
10 😎 _smiling_face_with_sunglasses_
11 💙 _blue_heart_
12 💜 _purple_heart_
13 😜 _winking_face_with_tongue_
14 💞 _revolving_hearts_
15 ✨ _sparkles_
16 🎶 _musical_notes_
17 💘 _heart_with_arrow_
18 😁 _beaming_face_with_smiling_eyes_
19 🔝 _TOP_arrow_
From the confusion matrix, we observe the same trend where the first three labels have high true positives while at the same time attracting many false positives from other labels; an analysis similar to the one we did for the English data shows this to be the case. One other striking thing in the Spanish confusion matrix is that the model fares very well for label 9 (Spanish flag): 63% (209/331). Below are some of those tweets.
Label 9 Tweets in Spanish
Views From España @ Hotel Riviera Ibiza
Puerta de la mar #portadelamar #Valencia #España #spain @ Porta de…
@ Plaza de Colón
Amigos como estos hay muy pocos, los quiero, zoquetes #Madrid #España #Europe #Eurotrip # …
#MuseodelPrado #Madrid # @ Museo Nacional del Prado
#fountain #Valladolid #spain #drummingaroundtheworld @ Plaza Mayor, Valladolid
Park chillings in Parque del Retiro #parquedelretiro #spain #madrid #citytrip #city…
Translation of Above Tweets in English (Google Translate)
Views From Spain @ Riviera Ibiza Hotel
Door of the sea #portadelamar #Valencia # Spain #spain @ Porta de ...
@ Plaza de Colón
Friends like these there are very few, I love them, zoquetes #Madrid # Spain #Europe #Eurotrip # ...
#MuseodelPrado #Madrid # @ Museo Nacional del Prado
#fountain #Valladolid #spain #drummingaroundtheworld @ Plaza Mayor, Valladolid
Park chillings in Retiro Park #parquedelretiro #spain #madrid #citytrip # city ...
So, in the end, it would seem that the flag emoji, whether American or Spanish, is predictable, but for the other emojis we are not so sure at this point! :-)
CITED
[1] Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are emojis predictable? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, pages 105–111. http://www.aclweb.org/anthology/E17-2017.
[2] Cicero dos Santos and Maira Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin, Ireland.
[3] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. 2003. Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the Twentieth International Conference on Machine Learning. pages 616–623.
OTHER HELPFUL REFERENCES
https://machinelearningmastery.com/support-vector-machines-for-machine-learning/
https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html