This repository has been archived by the owner on Feb 21, 2025. It is now read-only.
Available Corpora
knoxa edited this page Aug 3, 2017 · 8 revisions
Below is a list of freely available text corpora that may be useful for developing or testing Baleen. The list is not exhaustive, and Baleen has not been developed specifically to work with any of these corpora, so performance may vary.
- Relationship and Entity Extraction Evaluation Dataset: https://github.com/dstl/re3d
- Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
- Message Understanding Conference (MUC) datasets: http://www-nlpir.nist.gov/related_projects/muc/muc_data/muc_data_index.html
- The MUC-3 dataset converted to HTML and made available as a GitHub project: https://github.com/dstl/muc3
- Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/
- VAST 2014 Challenge: http://vacommunity.org/VAST+Challenge+2014
- John Smith corpus for document and corpus coreference: download from http://alias-i.com/lingpipe/demos/data/johnSmith.tar.gz and see the Natural Language Clustering section of http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html. "197 New York Times articles about 35 different people named John Smith. Each article mentions a single John Smith. The clusters with more than one document contain the following numbers of documents: 2, 2, 2, 4, 4, 5, 9, 15, 20, 22, 88". This corpus has been used as a standard, publicly available dataset in many corpus coreference papers.
- Four newsgroups for document classification: download from http://alias-i.com/lingpipe/demos/data/fourNewsGroups.tar.gz and see the Natural Language Clustering section of http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html. "The four newsgroups data is a subset of 178 newsgroup posts balanced among the groups alt.atheism, misc.forsale, soc.religion.christian, and talk.religion.misc, a particularly challenging subset."
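
The cluster sizes quoted for the John Smith corpus can be checked against the stated totals. The sketch below is only a sanity check on the quoted figures; it assumes that every John Smith not covered by one of the listed multi-document clusters appears in exactly one article, which the quote implies but does not state outright. The variable names are illustrative.

```python
# Sizes of the clusters with more than one document, as quoted above.
multi_doc_clusters = [2, 2, 2, 4, 4, 5, 9, 15, 20, 22, 88]
total_people = 35      # distinct people named John Smith, per the quote
total_articles = 197   # New York Times articles, per the quote

# Assumption: each remaining person forms a cluster of exactly one article.
singleton_clusters = total_people - len(multi_doc_clusters)
reconstructed_total = sum(multi_doc_clusters) + singleton_clusters

print(singleton_clusters, reconstructed_total)
assert reconstructed_total == total_articles
```

Under that assumption the figures are consistent, and the distribution is heavily skewed towards one large cluster, which is worth keeping in mind when interpreting coreference scores on this corpus.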
A larger list of corpora, along with a list of other NLP-related tools, is available on Stanford University's website: http://www-nlp.stanford.edu/links/statnlp.html#Corpora