Add semeval datasets. Fix #18 #19
Changes from all commits
92489bf
edf74bf
ebb873e
e71079b
edcd6dc
30d6734
2db13e2
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,57 @@ | ||
{ | ||
"corpora": { | ||
"semeval-2016-2017-task3-subtaskBC": { | ||
"num_records": -1, | ||
"record_format": "dict", | ||
"file_size": 6344358, | ||
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py", | ||
"license": "All files released for the task are free for general research use", | ||
"fields": { | ||
"2016-train": ["..."], | ||
"2016-dev": ["..."], | ||
"2017-test": ["..."], | ||
"2016-test": ["..."] | ||
}, | ||
"description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18.", | ||
"checksum": "701ea67acd82e75f95e1d8e62fb0ad29", | ||
"file_name": "semeval-2016-2017-task3-subtaskBC.gz", | ||
"read_more": ["http://alt.qcri.org/semeval2017/task3/", "http://alt.qcri.org/semeval2017/task3/data/uploads/semeval2017-task3.pdf", "https://github.com/RaRe-Technologies/gensim-data/issues/18", "https://github.com/Witiko/semeval-2016_2017-task3-subtaskB-english"], | ||
"parts": 1 | ||
}, | ||
"semeval-2016-2017-task3-subtaskA-unannotated": { | ||
"num_records": 189941, | ||
"record_format": "dict", | ||
"file_size": 234373151, | ||
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskA-unannotated-eng/__init__.py", | ||
"license": "These datasets are free for general research use.", | ||
Do we have a link supporting this conclusion?
CC: @Witiko
The datasets were put together from the following files:
Files 2 and 4 contain explicit license notices – see the “License” section of #18. Files 1 and 3 contain no licensing notices, so technically all rights are reserved. However, to me it seems like a clear oversight on the side of the task authors who left some of the ZIP archives without instructions. I can check this with the task authors if you want all bases covered.
"fields": { | ||
"THREAD_SEQUENCE": "", | ||
"RelQuestion": { | ||
"RELQ_CATEGORY": "question category, according to the Qatar Living taxonomy", | ||
"RELQ_DATE": "date of posting", | ||
"RELQ_ID": "question identifier", | ||
"RELQ_USERID": "identifier of the user asking the question", | ||
"RELQ_USERNAME": "name of the user asking the question", | ||
"RelQBody": "body of question", | ||
"RelQSubject": "subject of question" | ||
}, | ||
"RelComments": [ | ||
{ | ||
"RelCText": "text of answer", | ||
"RELC_USERID": "identifier of the user posting the comment", | ||
"RELC_ID": "comment identifier", | ||
"RELC_USERNAME": "name of the user posting the comment", | ||
"RELC_DATE": "date of posting" | ||
} | ||
] | ||
}, | ||
"description": "SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling.", | ||
"checksum": "2de0e2f2c4f91c66ae4fcf58d50ba816", | ||
"file_name": "semeval-2016-2017-task3-subtaskA-unannotated.gz", | ||
"read_more": ["http://alt.qcri.org/semeval2016/task3/", "http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf", "https://github.com/RaRe-Technologies/gensim-data/issues/18", "https://github.com/Witiko/semeval-2016_2017-task3-subtaskA-unannotated-english"], | ||
"parts": 1 | ||
}, | ||
"patent-2017": { | ||
"num_records": 353197, | ||
"record_format": "dict", | ||
Why this change? It's better with the links. More concrete.
This is part of the auto-generated output. In general, the script has no idea where exactly it is hosted on GitHub (these links were spelled out by hand).
Aha. Can you add the links back? (probably to the script)
@piskvorky the links were never part of the script; I can only add them manually.
UPD: Done 80ad749
Not a good idea to do it manually. Please add to the script.
@piskvorky please look again at #19 (comment)
What do you mean? Add the links to the script, so you don't have to do this manually the next time(s).
@piskvorky imagine that I move, rename, or remove this script or lists.json: the link will then be broken.
Done: fcc89c2
Thanks.
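For reference, here is a minimal sketch of how records shaped like the `fields` entry of `semeval-2016-2017-task3-subtaskA-unannotated` might be flattened into plain texts for language modelling, as the dataset description suggests. The helper name `iter_texts` and the sample record are made up for illustration; they are not part of this PR or of the reader code in `__init__.py`, and the real records may differ in detail.

```python
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def iter_texts(records: Iterable[Record]) -> Iterator[str]:
    """Yield question subject, question body, and comment texts, in that order.

    Hypothetical helper: it only assumes the key names listed under "fields"
    in the metadata above (RelQuestion, RelQSubject, RelQBody, RelComments,
    RelCText). Empty or missing texts are skipped.
    """
    for record in records:
        question = record.get("RelQuestion", {})
        for key in ("RelQSubject", "RelQBody"):
            text = question.get(key)
            if text:
                yield text
        for comment in record.get("RelComments", []):
            text = comment.get("RelCText")
            if text:
                yield text


# Invented sample record that mirrors the documented layout; not real data.
sample_records: List[Record] = [
    {
        "THREAD_SEQUENCE": "example-thread",
        "RelQuestion": {
            "RELQ_CATEGORY": "Example category",
            "RELQ_DATE": "2016-01-01",
            "RELQ_ID": "example-question",
            "RELQ_USERID": "U1",
            "RELQ_USERNAME": "alice",
            "RelQBody": "Where can I renew my visa?",
            "RelQSubject": "Visa renewal",
        },
        "RelComments": [
            {
                "RelCText": "Try the immigration office.",
                "RELC_USERID": "U2",
                "RELC_ID": "example-comment-1",
                "RELC_USERNAME": "bob",
                "RELC_DATE": "2016-01-02",
            },
            {
                "RelCText": "",  # empty comments are skipped
                "RELC_USERID": "U3",
                "RELC_ID": "example-comment-2",
                "RELC_USERNAME": "carol",
                "RELC_DATE": "2016-01-03",
            },
        ],
    }
]

texts = list(iter_texts(sample_records))
```

With gensim installed, the real records could presumably be obtained with `gensim.downloader.load("semeval-2016-2017-task3-subtaskA-unannotated")` and passed to the same helper, assuming the dataset is published under that name and yields dicts in this layout.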