A collection of keyword lists used to censor chat messages in chat and live streaming apps used in China. Each of these apps implements censorship on the client side, which allows us to reverse engineer them and extract the keyword lists used to block content in chat messages. Our analysis shows how keyword lists are downloaded to the applications and, where present, how the lists are encrypted and decrypted. Changes to each application's keyword lists are tracked over its data collection period.
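As a rough illustration of what tracking changes between two collected snapshots of a keyword list can look like, here is a minimal Python sketch. The function name `diff_keyword_lists` and the placeholder keywords are assumptions made for this example; they are not taken from the datasets or from the tooling used in the reports.

```python
def diff_keyword_lists(old, new):
    """Return (added, removed) keywords between two snapshots of a list."""
    old_set, new_set = set(old), set(new)
    added = sorted(new_set - old_set)      # present only in the newer snapshot
    removed = sorted(old_set - new_set)    # present only in the older snapshot
    return added, removed


# Placeholder snapshots (hypothetical values, not real keywords from the data).
snapshot_day1 = ["keyword_a", "keyword_b", "keyword_c"]
snapshot_day2 = ["keyword_b", "keyword_c", "keyword_d"]

added, removed = diff_keyword_lists(snapshot_day1, snapshot_day2)
print("added:", added)
print("removed:", removed)
```

Comparing sets rather than line-by-line text means that reordering within a list is not reported as a change, which may or may not match how a given analysis wants to treat ordering.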
Full details on the reverse engineering methods and results are available in the reports below:
- [Chat program censorship and surveillance in China: Tracking TOM-Skype and Sina UC](http://firstmonday.org/ojs/index.php/fm/article/view/4628/3727)
- [Asia Chats: Investigating Regionally-based Keyword Censorship in LINE](https://citizenlab.org/2013/11/asia-chats-investigatingregionally-based-keyword-censorship-line/)
- [Every Rose Has Its Thorn: Censorship and Surveillance on Social Video Platforms in China](https://www.usenix.org/conference/foci15/workshop-program/presentation/knockel)
Datasets include raw keyword lists collected from the applications and processed datasets that include translations of keywords. Keywords were translated to English using a combination of machine and human translation. Based on interpreting these translations with contextual information, we coded each keyword into content categories grouped under six general [themes](https://github.com/citizenlab/chat-censorship/blob/master/themes_keyword_censorship.csv) according to a [code book](https://github.com/citizenlab/chat-censorship/blob/master/categories_keyword_censorship.csv).
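The two-level coding described above (keywords assigned to categories, categories grouped under themes) can be pictured with a small Python sketch. All names here are hypothetical placeholders: the category labels, theme labels, and the `count_by_theme` helper are assumptions for illustration and do not reflect the actual code book or the column layout of the CSV files.

```python
from collections import Counter

# Hypothetical code book: each category belongs to exactly one theme.
CATEGORY_TO_THEME = {
    "category_1": "theme_a",
    "category_2": "theme_a",
    "category_3": "theme_b",
}

# Hypothetical coded keywords as (keyword, category) pairs.
coded_keywords = [
    ("keyword_a", "category_1"),
    ("keyword_b", "category_2"),
    ("keyword_c", "category_3"),
    ("keyword_d", "category_1"),
]


def count_by_theme(coded, category_to_theme):
    """Tally how many keywords fall under each theme via their category."""
    return Counter(category_to_theme[category] for _, category in coded)


theme_counts = count_by_theme(coded_keywords, CATEGORY_TO_THEME)
for theme, count in sorted(theme_counts.items()):
    print(theme, count)
```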
All data is provided under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license and is available in full here and summarized here.