SIGARRA News Corpus
I manually annotated a subset of SIGARRA news, from its different domains, using the brat tool.
SIGARRA is the information system of the University of Porto (UP), where every organic unit has its own domain. SIGARRA has a news section, so I manually annotated some of its news items using the brat tool. First, the news items were gathered from the information system and saved to a CSV file with the following attributes: news id, title, subtitle, source url, content and published date. The gathered news items were published between 2016-12-14 and 2017-03-01.
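As a rough illustration of the gathering step, the sketch below reads such a CSV and keeps only rows published inside the collection window. The column names (`id`, `title`, `subtitle`, `source_url`, `content`, `published_date`) and the ISO date format are assumptions made for this example; the actual file header may differ.

```python
import csv
import io
from datetime import date

# Hypothetical column names; the real CSV header may use different labels.
FIELDS = ["id", "title", "subtitle", "source_url", "content", "published_date"]

# Collection window stated in this README.
START, END = date(2016, 12, 14), date(2017, 3, 1)


def load_news(handle):
    """Yield one dict per news item published inside the collection window."""
    reader = csv.DictReader(handle)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        # Assumes dates start with YYYY-MM-DD; any time suffix is ignored.
        published = date.fromisoformat(row["published_date"][:10])
        if START <= published <= END:
            yield row


# Tiny made-up sample, standing in for the real file.
sample = io.StringIO(
    "id,title,subtitle,source_url,content,published_date\n"
    "1,Example title,Example subtitle,https://example.org/1,Example body,2017-01-10\n"
    "2,Out of range,,https://example.org/2,Another body,2017-04-01\n"
)
news = list(load_news(sample))
```

With a real file, `load_news(open(path, newline="", encoding="utf-8"))` would be used in place of the in-memory sample.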
To my knowledge, apart from the HAREM collection, this is the only publicly available annotated corpus for European Portuguese to date. The developed corpus is roughly twice the size of the HAREM collection (approximately 86k tokens for HAREM versus 185k tokens for SIGARRA) and has about 1.7 times as many entity annotations (7255 for HAREM versus 12644 for SIGARRA).
Entity classes: Hora (Hour), Evento (Event), Organizacao (Organization), Curso (Course), Pessoa (Person), Localizacao (Location), Data (Date) and UnidadeOrganica (Organic Unit).
Entity tag | Number of annotations | % |
---|---|---|
Data | 2811 | 22.23% |
Organizacao | 2320 | 18.35% |
Pessoa | 2159 | 17.08% |
UnidadeOrganica | 1814 | 14.35% |
Localizacao | 1593 | 12.60% |
Hora | 1015 | 8.03% |
Curso | 521 | 4.12% |
Evento | 411 | 3.25% |
Total | 12644 | 100% |
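Counts like the ones in the table can be recomputed from the annotation files. The sketch below assumes the annotations are stored in brat's standoff format, where each text-bound entity is a line of the form `T1<TAB>Tag start end<TAB>text`. The function names and the sample annotations are made up for illustration.

```python
from collections import Counter


def count_entities(ann_text):
    """Count text-bound annotations (lines starting with 'T') per entity tag."""
    counts = Counter()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes and other annotation types
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # malformed line
        tag = parts[1].split(" ", 1)[0]  # first token of "Tag start end"
        counts[tag] += 1
    return counts


def shares(counts):
    """Return each tag's percentage of all annotations, rounded to 2 decimals."""
    total = sum(counts.values())
    return {tag: round(100 * n / total, 2) for tag, n in counts.most_common()}


# Made-up standoff sample, not taken from the corpus.
sample_ann = (
    "T1\tPessoa 0 10\tJoão Silva\n"
    "T2\tData 25 35\t2017-01-10\n"
    "T3\tData 40 50\t2017-01-11\n"
    "#1\tAnnotatorNotes T1\texample note\n"
)
counts = count_entities(sample_ann)
percentages = shares(counts)
```

To process the whole corpus, the counters returned for each `.ann` file can be summed before calling `shares`.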