Add semeval datasets. Fix #18 #19
Conversation
"record_format": "dict", | ||
"file_size": 234373151, | ||
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskA-unannotated-eng/__init__.py", | ||
"license": "These datasets are free for general research use.", |
Do we have a link supporting this conclusion?
CC: @Witiko
The datasets were put together from the following files:
- QL-unannotated-data-subtaskA.xml.zip
- semeval2016-task3-cqa-ql-traindev-v3.2.zip
- semeval2016_task3_test.zip
- semeval2017_task3_test.zip
Files 2 and 4 contain explicit license notices – see the “License” section of #18. Files 1 and 3 contain no licensing notices, so technically all rights are reserved. However, this seems like a clear oversight on the part of the task authors, who left some of the ZIP archives without licensing instructions. I can check with the task authors if you want all bases covered.
Last needed changes
CC: @Witiko
According to Section 5 of the 2016 task paper linked in the “Papers” section of #18, the main evaluation metric is MAP (Mean Average Precision). Supplementary evaluation metrics include Mean Reciprocal Rank (MRR), Average Recall (AvgRec), Precision, Recall, F1, and Accuracy.
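To make the main metric concrete, here is a minimal sketch of average precision on a single toy ranking (the relevance labels are hypothetical, not from the task data); MAP is this value averaged over all queries:

```python
# One ranked list of binary relevance judgements (True = relevant).
ranking = [True, False, True, True]  # hypothetical labels

hits, precisions = 0, []
for rank, relevant in enumerate(ranking, start=1):
    if relevant:
        hits += 1
        precisions.append(hits / rank)  # precision at each relevant hit
print(sum(precisions) / len(precisions))  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806
```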
Along with the updated datatype of the
I apologize for these late changes.
The SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English, collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling. The SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development data (317 original questions, 3,169 related questions, and 31,690 comments) and test datasets in English. The description of the tasks and the collected data is given in Sections 3 and 4.1 of the 2016 task paper linked in the “Papers” section of #18.
The main data field for Subtask B is
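For reference, a quick way to eyeball the record structure is to print the first record of each dataset. The field names below are the ones used in the evaluation code later in this thread; treat this as a sketch, not documentation:

```python
import gensim.downloader as api

# Subtask A unannotated: an iterable of question threads.
thread = next(iter(api.load("semeval-2016-2017-task3-subtaskA-unannotated")))
print(thread["RelQuestion"]["RelQSubject"])  # question title
print(len(thread["RelComments"]))            # number of comments in the thread

# Subtasks B and C: original questions, each with related threads
# (presumably under a "Threads" key, judging from the code below).
datasets = api.load("semeval-2016-2017-task3-subtaskBC")
orgquestion = next(iter(datasets["2016-dev"]))
print(orgquestion["OrgQSubject"])            # original question title
print(len(orgquestion["Threads"]))           # related threads
```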
Using the Subtask A unannotated dataset, we build a corpus:

```python
import gensim.downloader as api
from gensim.utils import simple_preprocess

# Collect the subject, body, and comment texts of every thread
# as lists of preprocessed tokens.
corpus = []
for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQSubject"]))
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQBody"]))
    for relcomment in thread["RelComments"]:
        corpus.append(simple_preprocess(relcomment["RelCText"]))
```

The code example below covers Subtasks B and C and takes the corpus we have just built. For each original question, we compare it against the questions in the related threads (for Subtask B) and the comments in the related threads (for Subtask C) using cosine similarity. This produces rankings that we evaluate using the Mean Average Precision (MAP) evaluation metric.

```python
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
import numpy as np
corpus = []
for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQSubject"]))
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQBody"]))
    for relcomment in thread["RelComments"]:
        corpus.append(simple_preprocess(relcomment["RelCText"]))
dictionary = Dictionary(corpus)

datasets = api.load("semeval-2016-2017-task3-subtaskBC")

def produce_test_data(dataset):
    # Yield (original question BoW, gold relevance labels) pairs.
    for orgquestion in datasets[dataset]:
        # Related questions with binary relevance labels (Subtask B).
        relquestions = [
            (
                dictionary.doc2bow(
                    simple_preprocess(thread["RelQuestion"]["RelQSubject"])
                    + simple_preprocess(thread["RelQuestion"]["RelQBody"])),
                thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"]
                in ("PerfectMatch", "Relevant"))
            for thread in orgquestion["Threads"]]
        # Related comments with binary relevance labels (Subtask C).
        relcomments = [
            (
                dictionary.doc2bow(simple_preprocess(relcomment["RelCText"])),
                relcomment["RELC_RELEVANCE2ORGQ"] == "Good")
            for thread in orgquestion["Threads"]
            for relcomment in thread["RelComments"]]
        orgquestion = dictionary.doc2bow(
            simple_preprocess(orgquestion["OrgQSubject"])
            + simple_preprocess(orgquestion["OrgQBody"]))
        yield (orgquestion, dict(subtaskB=relquestions, subtaskC=relcomments))

def average_precision(similarities, relevance):
    # Precision at the rank of each relevant document, averaged.
    precision = [
        (num_correct + 1) / (num_total + 1)
        for num_correct, num_total in enumerate(
            num_total for num_total, (_, relevant) in enumerate(
                sorted(zip(similarities, relevance), reverse=True))
            if relevant)]
    return np.mean(precision) if precision else 0.0

def evaluate(dataset, subtask):
    # MAP (in percent) of cosine-similarity rankings over all questions.
    results = []
    for orgquestion, subtasks in produce_test_data(dataset):
        documents, relevance = zip(*subtasks[subtask])
        index = MatrixSimilarity(documents, num_features=len(dictionary))
        similarities = index[orgquestion]
        assert len(similarities) == len(documents)
        results.append(average_precision(similarities, relevance))
    return np.mean(results) * 100.0

for dataset in ("2016-dev", "2016-test", "2017-test"):
    print("MAP score on the %s dataset:\t%.02f (Subtask B)\t%.02f (Subtask C)" % (
        dataset, evaluate(dataset, "subtaskB"), evaluate(dataset, "subtaskC")))
```

The above code prints a MAP score for Subtasks B and C on each of the three datasets.
Can I help with this?
```diff
- (this table is generated automatically by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
+ (generated by generate_table.py based on list.json)
```
Why this change? It's better with the links. More concrete.
This is part of the auto-generated output. In general, the script has no idea where it is hosted on GitHub (these links were spelled out by hand).
Aha. Can you add the links back? (probably to the script)
@piskvorky The links were never part of the script; I can only add them manually.
Update: done in 80ad749.
It's not a good idea to do this manually. Please add it to the script.
@piskvorky Please look again at #19 (comment).
What do you mean? Add the links to the script, so you don't have to do this manually next time.
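A possible shape for that change, as a hypothetical sketch of the footer generation in generate_table.py (not the actual script):

```python
# Hypothetical sketch: build the footer links from one repository constant,
# so nothing has to be edited by hand when the table is regenerated.
REPO_URL = "https://github.com/RaRe-Technologies/gensim-data"
FOOTER = (
    "(this table is generated automatically by "
    "[generate_table.py]({0}/blob/master/generate_table.py) "
    "based on [list.json]({0}/blob/master/list.json))".format(REPO_URL)
)
print(FOOTER)
```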
@piskvorky Imagine I moved/renamed/removed this script or list.json: the link would be broken.
Done: fcc89c2
Thanks.