SQuAD to DPR dataset converter #765
Conversation
First commit of the squad2dpr script.
Thanks for your efforts on this! It's very cool to have this script, and your code is really well organised. I have made some comments regarding a few pieces of functionality.
Regarding the parameters for initializing Document Stores and Retrievers, what you have is already plenty. I expect that the large majority of users will be happy with the default values.
It seems to me that your script handles SQuAD v1 datasets well. I just added some suggestions that should help with handling both SQuAD v1 and v2.
As for where this file should go, I think I would couple it with the retriever files i.e. haystack/retriever/squad_to_dpr.py
Feel free to reach out again if you need any clarification or assistance!
test/benchmarks/squad_to_dpr.py (outdated snippet):
class IteratorAsList(list):
So I usually use list(iter_obj) to cast an iterator into a list! Unless I'm missing something, I think that should do the trick :)
Hi, thanks for your thorough review! I will answer inline :)
Yes, indeed! I was under the impression that using this trick (from StackOverflow), I could have json.dump() treat the iterator as a list and thus allow for an end-to-end lazy evaluation of my iterator. I tested with list(); as usual, it resolves the iterator before sending the produced list to dump(). That also works, and surprisingly it does not seem to fill the available RAM (the whole reason I am using iterators). I will remove this class and use the classic list().
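For reference, the plain list() approach being settled on can be sketched as follows (the record contents here are made up for illustration):

```python
import io
import json

def dump_records(records, fp):
    # list() walks the whole iterator first, so every record is
    # materialized in memory before json.dump() starts writing.
    json.dump(list(records), fp)

# A lazy generator standing in for the converted DPR records.
records = ({"question": f"q{i}", "answers": []} for i in range(3))

buf = io.StringIO()
dump_records(records, buf)
print(buf.getvalue())
```

This trades end-to-end laziness for simplicity; as noted above, in practice the memory cost of materializing the list was not a problem.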
test/benchmarks/squad_to_dpr.py (outdated snippet):
parser.add_argument("--dpr_output_path", dest="dpr_output_path",
                    help="The name of the DPR JSON formatted output file",
                    metavar="DPR_out", required=True)
parser.add_argument("--num_hard_negative_ctxs", dest="nb_hard_neg_ctxs",
I ran into an error when running this script. Should this be dest="nb_hard_neg_ctxs"?
Yes indeed, fixed it as well!
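For context, the failure mode here is an argparse dest that does not match the attribute name the rest of the script reads. A minimal sketch (the flag name is from the snippet above; the downstream attribute access is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
# dest must match the attribute that downstream code reads
# (args.nb_hard_neg_ctxs); a mismatched dest would make that
# attribute lookup fail with an AttributeError later on.
parser.add_argument("--num_hard_negative_ctxs", dest="nb_hard_neg_ctxs",
                    type=int, default=30,
                    help="Number of hard negative contexts per question")

args = parser.parse_args(["--num_hard_negative_ctxs", "10"])
print(args.nb_hard_neg_ctxs)
```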
test/benchmarks/squad_to_dpr.py (outdated snippet):
# 7. Split (train, dev, test) and save dataset
total_nb_questions = get_number_of_questions(squad_data)
split_and_save_dataset(iter_dpr=iter_DPR,
This is a cool feature! But some input datasets might already be divided into train and dev; for example, the SQuAD datasets are already split. Can we either remove this step or make it optional?
When not performing the dataset split, we will also want to be able to specify the name of the output file on the command line, e.g.
python squad_to_dpr.py --squad_file_path ~/data/squad20/dev-v2.0.json --dpr_output_filename ./squad_dev.dpr.json
Yes, of course; I was stuck thinking in terms of my own use case. I will add this option.
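One way the optional split could look, as a rough sketch (the function name, flag, and split ratios are hypothetical, not the script's actual API):

```python
import json
import tempfile
from pathlib import Path

def save_dpr_dataset(dpr_records, output_path, split=False, ratios=(0.8, 0.1, 0.1)):
    # Either write a single file named by the caller, or derive
    # train/dev/test files from the same base path.
    records = list(dpr_records)
    output_path = Path(output_path)
    if not split:
        output_path.write_text(json.dumps(records))
        return [output_path]
    n_train = int(len(records) * ratios[0])
    n_dev = int(len(records) * ratios[1])
    parts = {"train": records[:n_train],
             "dev": records[n_train:n_train + n_dev],
             "test": records[n_train + n_dev:]}
    paths = []
    for name, part in parts.items():
        path = output_path.with_suffix(f".{name}.json")
        path.write_text(json.dumps(part))
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    records = [{"question": f"q{i}"} for i in range(10)]
    single = save_dpr_dataset(records, Path(tmp) / "squad_dev.dpr.json")
    splits = save_dpr_dataset(records, Path(tmp) / "squad_train.dpr.json", split=True)
    print(len(single), len(splits))
```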
test/benchmarks/squad_to_dpr.py (outdated snippet):
tqdm.write(f"Using SQuAD-like file {squad_file_path}")
# 1. Load squad file data
squad_impossible_path, squad_data = add_is_impossible(load_squad_file(squad_file_path=squad_file_path),
We are generally using datasets in SQuAD v2 format, where the is_impossible field already exists. Let's have a check here to see if the is_impossible field exists already. If it doesn't (e.g. SQuAD v1), we'll call add_is_impossible to add it; otherwise, let's bypass it.
Also, we should skip the samples where is_impossible=True
Same as before; I will add this check.
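A sketch of both suggestions together, checking whether the field is already present and skipping unanswerable questions (the helper names are hypothetical):

```python
def has_is_impossible(squad_data):
    # Peek at the first question to see whether the dataset already
    # carries the is_impossible field (SQuAD v2 style).
    for article in squad_data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                return "is_impossible" in qa
    return False

def iter_answerable_questions(squad_data):
    # Yield (context, qa) pairs, dropping unanswerable questions.
    for article in squad_data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    continue
                yield paragraph["context"], qa

squad_data = {"data": [{"paragraphs": [{"context": "ctx", "qas": [
    {"question": "a?", "answers": [], "is_impossible": True},
    {"question": "b?", "answers": [{"text": "x"}], "is_impossible": False},
]}]}]}
print(has_is_impossible(squad_data))
print([qa["question"] for _, qa in iter_answerable_questions(squad_data)])
```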
test/benchmarks/squad_to_dpr.py (outdated snippet):
context = paragraph["context"]
for question in paragraph["qas"]:
    answers = [a["text"] for a in question["answers"]]
    hard_negative_ctxs = get_hard_negative_context(retriever=retriever,
In some datasets, like the SQuAD v2 dev set, each question might have multiple answers. It might be safer here to pass not just the first answer but all the answers into get_hard_negative_context().
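A sketch of the suggested change, filtering retrieved contexts against every answer variant rather than only the first (the retriever here is a toy stand-in, not Haystack's API):

```python
def get_hard_negative_contexts(retriever, question, answers, n_ctxs=30):
    # Keep only contexts containing none of the answer variants.
    # Checking all variants avoids mislabelling a context as a hard
    # negative when it contains an alternative phrasing of the answer.
    candidates = retriever(question)
    return [c for c in candidates
            if not any(ans.lower() in c.lower() for ans in answers)][:n_ctxs]

# Toy retriever for illustration.
fake_retriever = lambda q: ["Paris is the capital.", "Berlin is in Germany."]
result = get_hard_negative_contexts(fake_retriever, "Capital of France?",
                                    ["Paris", "the capital of France"])
print(result)
```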
test/benchmarks/squad_to_dpr.py (outdated snippet):
main(squad_file_path=squad_file_path, dpr_output_path=dpr_output_path,
     document_store_type_config=("ElasticsearchDocumentStore", store_dpr_config),
     # retriever_type_config=("ElasticsearchRetriever", retriever_bm25_config), # dpr
Having the commented-out alternative retriever_type_config is a great idea! It makes for easy switching between DPR and BM25. I assume you actually meant to have retriever_type_config=("DensePassageRetriever", retriever_dpr_config), # dpr or something similar?
Yes, it is another typo; it should indeed read retriever_type_config=("DensePassageRetriever", retriever_dpr_config), # dpr
Just made a few small changes in the latest commits:
@psorianom Thanks a lot for your work on this! This all looks good to me, so I'll be merging it into master!
LGTM
Duplicate file removed by 5665d55
Hello,
This is a first proposal for a squad to dpr dataset converter, as mentioned here: #705
I tried to take into account @brandenchan's comments, although they may not follow what you had in mind (notably regarding the API design). Still, it is a nice overview of where I am going with this.
I also have some doubts about how we could handle the large number of possible parameters (different arguments for different stores/retrievers). I took the path of "let the user handle that": you can pass a dict with the appropriate config keywords to each document store/retriever you want to use.
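The "pass a config dict per component" approach could be sketched like this (the registry and component classes are dummies for illustration, not Haystack's actual API):

```python
class DummyStore:
    def __init__(self, **cfg):
        self.cfg = cfg

class DummyRetriever:
    def __init__(self, document_store, **cfg):
        self.document_store = document_store
        self.cfg = cfg

def build_components(document_store_type_config, retriever_type_config, registry):
    # Each component is named by class and configured by a
    # user-supplied dict of constructor keyword arguments.
    store_name, store_cfg = document_store_type_config
    retriever_name, retriever_cfg = retriever_type_config
    store = registry[store_name](**store_cfg)
    retriever = registry[retriever_name](document_store=store, **retriever_cfg)
    return store, retriever

registry = {"ElasticsearchDocumentStore": DummyStore,
            "DensePassageRetriever": DummyRetriever}

store, retriever = build_components(
    ("ElasticsearchDocumentStore", {"index": "document"}),
    ("DensePassageRetriever", {"batch_size": 16}),
    registry,
)
print(type(store).__name__, retriever.cfg)
```

This keeps the script agnostic of which store/retriever combination the user picks, at the cost of not validating the config keys itself.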
As a side question, should we be able to handle SQuAD v1 datasets (without an is_impossible field)? For now, since we work with SQuAD v1 datasets, I just add the is_impossible field with a helper function.
Also, I put the script in benchmarks only because there is a similar converter file there (nq2squad)! I really did not know where to place it :)
Cheers!