How to handle emoji in Python, plus a quick Python script to count emoji in Tweets as an example. (Python 2.7)

minikanyas/emoji-counter

 
 

This is a quick Python script to count emoji in Tweets. The IPython notebook is a long explanation of how character encoding works.

Credit to http://apps.timwhitlock.info/emoji/tables/unicode and, of course, unicode.org.

There is also a Stack Overflow post that is pretty helpful for understanding character encodings: http://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences

And a webpage: http://csharpindepth.com/Articles/General/Unicode.aspx

I have taken the emoji data from http://www.unicode.org/Public/emoji/2.0/emoji-data.txt (emoji_data.txt), parsed the modifiers to add skin tone modifiers, and added the regional indicator letters (to detect flags). If that page is updated, you should be able to copy the new version and update the dict.
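To illustrate why the regional indicator letters and skin tone modifiers need special handling, here is a minimal sketch. It is written for Python 3 (the repo itself targets Python 2.7) and is not the code from this repo; the function names are hypothetical. It relies only on the Unicode ranges U+1F1E6..U+1F1FF (regional indicators) and U+1F3FB..U+1F3FF (skin tone modifiers). A flag is a pair of regional indicator letters, so it has to be counted as one unit rather than two characters.

```python
# Sketch only: hypothetical helper names, Python 3 str (one item per code point).
REGIONAL_INDICATOR_START = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
REGIONAL_INDICATOR_END = 0x1F1FF    # REGIONAL INDICATOR SYMBOL LETTER Z
SKIN_TONE_START = 0x1F3FB           # EMOJI MODIFIER FITZPATRICK TYPE-1-2
SKIN_TONE_END = 0x1F3FF             # EMOJI MODIFIER FITZPATRICK TYPE-6


def is_regional_indicator(ch):
    """True if ch is one of the 26 regional indicator letters."""
    return REGIONAL_INDICATOR_START <= ord(ch) <= REGIONAL_INDICATOR_END


def is_skin_tone_modifier(ch):
    """True if ch is one of the five skin tone modifiers."""
    return SKIN_TONE_START <= ord(ch) <= SKIN_TONE_END


def find_flags(text):
    """Return each adjacent pair of regional indicators as one flag string."""
    flags = []
    i = 0
    while i < len(text):
        if (i + 1 < len(text)
                and is_regional_indicator(text[i])
                and is_regional_indicator(text[i + 1])):
            flags.append(text[i:i + 2])
            i += 2  # consume both letters of the flag
        else:
            i += 1
    return flags


def flag_to_country_code(flag):
    """Map a two-letter flag sequence back to ASCII letters, e.g. 'US'."""
    return "".join(
        chr(ord(ch) - REGIONAL_INDICATOR_START + ord("A")) for ch in flag
    )
```

For example, `find_flags(u"Hi \U0001F1FA\U0001F1F8!")` returns a single two-character string, and `flag_to_country_code` maps it to `"US"`.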

Run parse_unicode_tables.py to create the emoji dictionary that we use to count emoji. Edit the encoding (top of the file) to create a UTF-8 or a UTF-32 encoded dictionary (right now it creates a UTF-8 dictionary). parse_unicode_tables.py assigns a unique ID to each emoji in the dict, so that you can choose which characters to count (uncomment the print statements at the end to see a table of all of the emoji being recorded and their modifiers). The dictionaries are saved as pickled files emoji_dict_utf-{8,32}.pkl. Running for UTF-8 will also save unicode_markers_utf-8.pkl (a dictionary of marker bytes in UTF-8).
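As a rough picture of what an encoded emoji dictionary with unique IDs might look like, here is a small Python 3 sketch. It is an assumption about the general shape, not the actual output of parse_unicode_tables.py: the sample emoji list, the `build_emoji_dict` name, and the choice of big-endian UTF-32 are all illustrative. It pickles to an in-memory byte string rather than to a .pkl file.

```python
import pickle

# Hypothetical sample; the real dictionary is built from emoji-data.txt.
SAMPLE_EMOJI = [u"\U0001F600", u"\U0001F44D", u"\u2764"]


def build_emoji_dict(chars, encoding):
    """Map each emoji's encoded bytes to a unique integer ID."""
    return {ch.encode(encoding): idx for idx, ch in enumerate(chars)}


utf8_dict = build_emoji_dict(SAMPLE_EMOJI, "utf-8")
# "utf-32-be" avoids the byte order mark that plain "utf-32" prepends.
utf32_dict = build_emoji_dict(SAMPLE_EMOJI, "utf-32-be")

# The same emoji has different keys depending on the encoding:
# U+1F600 is four bytes in both, but the byte values differ.
grinning_utf8 = u"\U0001F600".encode("utf-8")       # b'\xf0\x9f\x98\x80'
grinning_utf32 = u"\U0001F600".encode("utf-32-be")  # b'\x00\x01\xf6\x00'

# Round-trip through pickle, as one would when saving and loading the dict.
restored = pickle.loads(pickle.dumps(utf8_dict))
```

Keying on encoded bytes means the dictionary must match the encoding of the text being scanned, which is why there is one dictionary per encoding.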

Then run: cat tweet_file.json | python parse_utf8.py (if you created a UTF-8 dict)

Or: cat tweet_file.json | python parse_utf32.py (if you created the UTF-32 dict)

They both do exactly the same thing; they just use different inputs and parse the strings differently. There is no huge speed difference. I've included both as two examples of solving the problem.
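The difference between the two parsing styles can be sketched as follows. This is a Python 3 illustration of the general idea, not the code in parse_utf8.py or parse_utf32.py, and the function names are hypothetical. In UTF-8 the lead byte of each character says how many bytes it occupies, while in UTF-32 every character is exactly four bytes.

```python
def utf8_char_length(lead_byte):
    """Number of bytes in a UTF-8 character, judged from its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    raise ValueError("not a UTF-8 lead byte: %r" % lead_byte)


def split_utf8(data):
    """Split UTF-8 bytes into one byte string per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_char_length(data[i])  # indexing bytes yields an int in Python 3
        chars.append(data[i:i + n])
        i += n
    return chars


def split_utf32(data):
    """Split UTF-32 bytes (no byte order mark) into 4-byte characters."""
    return [data[i:i + 4] for i in range(0, len(data), 4)]
```

Each resulting chunk can then be looked up in a dictionary keyed by encoded bytes. The UTF-32 version is simpler because of the fixed width; the UTF-8 version does a little more work per character but reads the input in the encoding Tweets usually arrive in.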

Old version (don't use, this is a very silly way to solve the problem): find_emoji.py, emoji_dict.py

All scripts expect a JSON payload, one record per line, and count emoji in the "body" field.
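The input format described above can be handled with a loop like the one below. This is a simplified Python 3 sketch, not the repo's scripts: it matches decoded characters against a hypothetical `emoji_set` instead of using the pickled byte dictionaries, and `count_emoji` is an illustrative name.

```python
import json
from collections import Counter


def count_emoji(lines, emoji_set):
    """Count emoji in the "body" field of newline-delimited JSON records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip records that are not JSON objects
        body = record.get("body") or ""
        for ch in body:
            if ch in emoji_set:
                counts[ch] += 1
    return counts
```

To mirror the piped usage, pass `sys.stdin` as `lines`, for example `count_emoji(sys.stdin, emoji_set)`. Note that this sketch counts single code points only, so flags and skin tone sequences would need the extra handling the dictionaries provide.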

example_tweet_ids.txt is a file of IDs for Tweets containing emoji. Most of the emoji in emoji_dict are covered here. Try using twurl to get these Tweets from the public Twitter API.

Feel free to use/modify. No guarantees of anything.
