Add semeval datasets. Fix #18 (#19)

Merged: 7 commits, Feb 7, 2018
9 changes: 6 additions & 3 deletions README.md
@@ -106,11 +106,15 @@ To load a model or corpus, use either the Python or command line interface of [G
|------|-----------|-----------|-------------|---------|
| 20-newsgroups | 13 MB | <ul><li>http://qwone.com/~jason/20Newsgroups/</li></ul> | The notorious collection of approximately 20,000 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups. | not found |
| fake-news | 19 MB | <ul><li>https://www.kaggle.com/mrisdal/fake-news</li></ul> | News dataset, contains text and metadata from 244 websites and represents 12,999 posts in total from a specific window of 30 days. The data was pulled using the webhose.io API, and because it's coming from their crawler, not all websites identified by their BS Detector are present in this dataset. Data sources that were missing a label were simply assigned a label of 'bs'. There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read. | https://creativecommons.org/publicdomain/zero/1.0/ |
| patent-2017 | 2944 MB | <ul><li>http://patents.reedtech.com/pgrbft.php</li></ul> | Patent Grant Full Text. Contains the full text including tables, sequence data and 'in-line' mathematical expressions of each patent grant issued in 2017. | not found |
| quora-duplicate-questions | 20 MB | <ul><li>https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs</li></ul> | Over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair or not. | probably https://www.quora.com/about/tos |
| semeval-2016-2017-task3-subtaskA-unannotated | 223 MB | <ul><li>http://alt.qcri.org/semeval2016/task3/</li> <li>http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/18</li> <li>https://github.com/Witiko/semeval-2016_2017-task3-subtaskA-unannotated-english</li></ul> | SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling. | These datasets are free for general research use. |
| semeval-2016-2017-task3-subtaskBC | 6 MB | <ul><li>http://alt.qcri.org/semeval2017/task3/</li> <li>http://alt.qcri.org/semeval2017/task3/data/uploads/semeval2017-task3.pdf</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/18</li> <li>https://github.com/Witiko/semeval-2016_2017-task3-subtaskB-english</li></ul> | SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18. | All files released for the task are free for general research use |
| text8 | 31 MB | <ul><li>http://mattmahoney.net/dc/textdata.html</li></ul> | First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets. | not found |
| wiki-english-20171001 | 6214 MB | <ul><li>https://dumps.wikimedia.org/enwiki/20171001/</li></ul> | Extracted Wikipedia dump from October 2017. Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz` | https://dumps.wikimedia.org/legal.html |

### Models

| name | num vectors | file size | base dataset | read_more | description | parameters | preprocessing | license |
|------|-------------|-----------|--------------|------------|-------------|------------|---------------|---------|
| conceptnet-numberbatch-17-06-300 | 1917247 | 1168 MB | ConceptNet, word2vec, GloVe, and OpenSubtitles 2016 | <ul><li>http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972</li> <li>https://github.com/commonsense/conceptnet-numberbatch</li> <li>http://conceptnet.io/</li></ul> | ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. | <ul><li>dimension - 300</li></ul> | - | https://github.com/commonsense/conceptnet-numberbatch/blob/master/LICENSE.txt |
@@ -123,10 +127,9 @@ To load a model or corpus, use either the Python or command line interface of [G
| glove-wiki-gigaword-300 | 400000 | 376 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased; https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 300</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-300.txt`. | http://opendatacommons.org/licenses/pddl/ |
| glove-wiki-gigaword-50 | 400000 | 65 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased; https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 50</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`. | http://opendatacommons.org/licenses/pddl/ |
| word2vec-google-news-300 | 3000000 | 1662 MB | Google News (about 100 billion words) | <ul><li>https://code.google.com/archive/p/word2vec/</li> <li>https://arxiv.org/abs/1301.3781</li> <li>https://arxiv.org/abs/1310.4546</li> <li>https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf</li></ul> | Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/). | <ul><li>dimension - 300</li></ul> | - | not found |
| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | <ul><li>https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models</li> <li>http://rusvectores.org/en/</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/3</li></ul> | Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words. | <ul><li>window_size - 10</li> <li>dimension - 300</li></ul> | The corpus was lemmatized and tagged with Universal PoS | https://creativecommons.org/licenses/by/4.0/deed.en |

| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | <ul><li>https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models</li> <li>http://rusvectores.org/en/</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/3</li></ul> | Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words. | <ul><li>dimension - 300</li> <li>window_size - 10</li></ul> | The corpus was lemmatized and tagged with Universal PoS | https://creativecommons.org/licenses/by/4.0/deed.en |

(this table is generated automatically by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
(generated by generate_table.py based on list.json)
Owner: Why this change? It's better with the links. More concrete.

Contributor Author: This is part of the auto-generated output. In general, the script has no idea where exactly it lives on GitHub (these links were spelled out by hand).

Owner (@piskvorky, Feb 8, 2018): Aha. Can you add the links back? (probably to the script)

Contributor Author: @piskvorky the links were never part of the script; I can only add them manually.

Contributor Author: UPD: Done 80ad749

Owner: It's not a good idea to do this manually. Please add it to the script.

Contributor Author: @piskvorky please look again at #19 (comment)

Owner (@piskvorky, Feb 8, 2018): What do you mean? Add the links to the script, so you don't have to do this manually the next time(s).

Contributor Author: @piskvorky imagine: if I moved, renamed, or removed this script or list.json, the link would break.

Done: fcc89c2

Owner: Thanks.
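(For illustration only: a hypothetical sketch of the change discussed above, namely having generate_table.py emit the footer with the hard-coded links itself rather than re-adding them to the README by hand. The URLs are taken from the original footer line; this is not the actual generate_table.py code.)

```python
# Hypothetical sketch, not the actual generate_table.py: emit the README footer
# with explicit links so it never needs to be re-added manually after regeneration.
REPO = "https://github.com/RaRe-Technologies/gensim-data/blob/master"

footer = (
    "(this table is generated automatically by "
    f"[generate_table.py]({REPO}/generate_table.py) "
    f"based on [list.json]({REPO}/list.json))"
)
print(footer)
```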


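For context, the corpora and models listed in the README tables above are fetched through Gensim's downloader API. A minimal sketch of inspecting and loading one of the newly added SemEval corpora; the dataset name comes from the table, and what `load()` returns for these corpora is assumed to follow the `record_format: "dict"` declared in list.json below:

```python
import gensim.downloader as api

# Print the metadata recorded in list.json (description, file size, checksum, license).
print(api.info("semeval-2016-2017-task3-subtaskA-unannotated"))

# Download on first use (cached by default under ~/gensim-data/) and open the corpus.
corpus = api.load("semeval-2016-2017-task3-subtaskA-unannotated")
```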
52 changes: 52 additions & 0 deletions list.json
@@ -1,5 +1,57 @@
{
"corpora": {
"semeval-2016-2017-task3-subtaskBC": {
"num_records": -1,
"record_format": "dict",
"file_size": 6344358,
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
"license": "All files released for the task are free for general research use",
"fields": {
"2016-train": ["..."],
"2016-dev": ["..."],
"2017-test": ["..."],
"2016-test": ["..."]
},
"description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18.",
"checksum": "701ea67acd82e75f95e1d8e62fb0ad29",
"file_name": "semeval-2016-2017-task3-subtaskBC.gz",
"read_more": ["http://alt.qcri.org/semeval2017/task3/", "http://alt.qcri.org/semeval2017/task3/data/uploads/semeval2017-task3.pdf", "https://github.com/RaRe-Technologies/gensim-data/issues/18", "https://github.com/Witiko/semeval-2016_2017-task3-subtaskB-english"],
"parts": 1
},
"semeval-2016-2017-task3-subtaskA-unannotated": {
"num_records": 189941,
"record_format": "dict",
"file_size": 234373151,
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskA-unannotated-eng/__init__.py",
"license": "These datasets are free for general research use.",
Owner: Do we have a link supporting this conclusion?

Contributor Author: CC: @Witiko

@Witiko (Feb 6, 2018): The datasets were put together from the following files:

  1. QL-unannotated-data-subtaskA.xml.zip
  2. semeval2016-task3-cqa-ql-traindev-v3.2.zip
  3. semeval2016_task3_test.zip
  4. semeval2017_task3_test.zip

Files 2 and 4 contain explicit license notices – see the “License” section of #18. Files 1 and 3 contain no licensing notices, so technically all rights are reserved. However, this seems like a clear oversight on the part of the task authors, who left some of the ZIP archives without license instructions. I can check this with the task authors if you want all bases covered.

"fields": {
"THREAD_SEQUENCE": "",
"RelQuestion": {
"RELQ_CATEGORY": "question category, according to the Qatar Living taxonomy",
"RELQ_DATE": "date of posting",
"RELQ_ID": "question indentifier",
"RELQ_USERID": "identifier of the user asking the question",
"RELQ_USERNAME": "name of the user asking the question",
"RelQBody": "body of question",
"RelQSubject": "subject of question"
},
"RelComments": [
{
"RelCText": "text of answer",
"RELC_USERID": "identifier of the user posting the comment",
"RELC_ID": "comment identifier",
"RELC_USERNAME": "name of the user posting the comment",
"RELC_DATE": "date of posting"
}

]
},
"description": "SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling.",
"checksum": "2de0e2f2c4f91c66ae4fcf58d50ba816",
"file_name": "semeval-2016-2017-task3-subtaskA-unannotated.gz",
"read_more": ["http://alt.qcri.org/semeval2016/task3/", "http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf", "https://github.com/RaRe-Technologies/gensim-data/issues/18", "https://github.com/Witiko/semeval-2016_2017-task3-subtaskA-unannotated-english"],
"parts": 1
},
"patent-2017": {
"num_records": 353197,
"record_format": "dict",
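To make the `fields` layout of the subtask A entry above concrete, here is a hedged sketch of walking its records. The key names follow the spec in this list.json entry; the assumption that `api.load()` yields an iterable of such dicts is mine, not something stated in this PR:

```python
import gensim.downloader as api

data = api.load("semeval-2016-2017-task3-subtaskA-unannotated")

# Assumed record layout, mirroring the "fields" spec above:
# {"THREAD_SEQUENCE": ..., "RelQuestion": {...}, "RelComments": [{...}, ...]}
for record in data:
    question = record["RelQuestion"]
    print(question["RELQ_ID"], question["RelQSubject"])
    for comment in record["RelComments"]:
        print("  comment", comment["RELC_ID"], "by", comment["RELC_USERNAME"])
    break  # one thread is enough for a quick sanity check
```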