SQuAD to DPR dataset converter #765
Conversation
First commit of the squad2dpr script.
Thanks for your efforts on this! It's very cool to have this script, and your code is really well organised. I have made some comments regarding a few pieces of functionality.
Regarding the parameters for initializing Document Stores and Retrievers, what you have is already plenty. I expect that the large majority of users will be happy with the default values.
It seems to me that your script handles SQuAD v1 datasets well. I just added some suggestions that should help with handling both SQuAD v1 and v2.
As for where this file should go, I think I would couple it with the retriever files i.e. haystack/retriever/squad_to_dpr.py
Feel free to reach out again if you need any clarification or assistance!
test/benchmarks/squad_to_dpr.py (outdated snippet):
class IteratorAsList(list):
So I usually use list(iter_obj) to cast an iterator into a list! Unless I'm missing something, I think that should do the trick :)
Hi, thanks for your thorough review! I will answer inline :)
Yes, indeed! I was under the impression that using this trick (from StackOverflow), I could have json.dump() treat the iterator as a list and thus allow for an end-to-end lazy evaluation of my iterator. I tested with list(); as usual, it resolves the iterator before sending the produced list to dump(). That also works, and surprisingly it does not seem to fill the available RAM (the whole reason I am using iterators). I will remove this class and use the classic list().
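For reference, the plain list() approach being settled on can be sketched as follows (the record contents here are made up for illustration):

```python
import io
import json

def dump_records(records, fp):
    # list() walks the whole iterator first, so every record is
    # materialized in memory before json.dump() starts writing.
    json.dump(list(records), fp)

# A lazy generator standing in for the converted DPR records.
records = ({"question": f"q{i}", "answers": []} for i in range(3))

buf = io.StringIO()
dump_records(records, buf)
print(buf.getvalue())
```

This trades end-to-end laziness for simplicity; as noted above, in practice the memory cost of materializing the list was not a problem.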
test/benchmarks/squad_to_dpr.py (outdated snippet):
parser.add_argument("--dpr_output_path", dest="dpr_output_path",
                    help="The name of the DPR JSON formatted output file",
                    metavar="DPR_out", required=True)
parser.add_argument("--num_hard_negative_ctxs", dest="nb_hard_neg_ctxs",
I ran into an error when running this script. Should this be dest="nb_hard_neg_ctxs"?
Yes indeed, fixed it as well!
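For context, the failure mode here is an argparse dest that does not match the attribute name the rest of the script reads. A minimal sketch (the flag name is from the snippet above; the downstream attribute access is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
# dest must match the attribute that downstream code reads
# (args.nb_hard_neg_ctxs); a mismatched dest would make that
# attribute lookup fail with an AttributeError later on.
parser.add_argument("--num_hard_negative_ctxs", dest="nb_hard_neg_ctxs",
                    type=int, default=30,
                    help="Number of hard negative contexts per question")

args = parser.parse_args(["--num_hard_negative_ctxs", "10"])
print(args.nb_hard_neg_ctxs)
```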
test/benchmarks/squad_to_dpr.py (outdated snippet):
# 7. Split (train, dev, test) and save dataset
total_nb_questions = get_number_of_questions(squad_data)
split_and_save_dataset(iter_dpr=iter_DPR,
This is a cool feature! But some input datasets might already be divided into train and dev; for example, the SQuAD datasets are already split. Can we either remove this step or make it optional?
When not performing the dataset split, we will also want to be able to specify the name of the output file on the command line, e.g.
python squad_to_dpr.py --squad_file_path ~/data/squad20/dev-v2.0.json --dpr_output_filename ./squad_dev.dpr.json
Yes, of course; I was stuck thinking in terms of my own use case. I will add this option.
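One way the optional split could look, as a rough sketch (the function name, flag, and split ratios are hypothetical, not the script's actual API):

```python
import json
import tempfile
from pathlib import Path

def save_dpr_dataset(dpr_records, output_path, split=False, ratios=(0.8, 0.1, 0.1)):
    # Either write a single file named by the caller, or derive
    # train/dev/test files from the same base path.
    records = list(dpr_records)
    output_path = Path(output_path)
    if not split:
        output_path.write_text(json.dumps(records))
        return [output_path]
    n_train = int(len(records) * ratios[0])
    n_dev = int(len(records) * ratios[1])
    parts = {"train": records[:n_train],
             "dev": records[n_train:n_train + n_dev],
             "test": records[n_train + n_dev:]}
    paths = []
    for name, part in parts.items():
        path = output_path.with_suffix(f".{name}.json")
        path.write_text(json.dumps(part))
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    records = [{"question": f"q{i}"} for i in range(10)]
    single = save_dpr_dataset(records, Path(tmp) / "squad_dev.dpr.json")
    splits = save_dpr_dataset(records, Path(tmp) / "squad_train.dpr.json", split=True)
    print(len(single), len(splits))
```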
test/benchmarks/squad_to_dpr.py (outdated snippet):
tqdm.write(f"Using SQuAD-like file {squad_file_path}")
# 1. Load squad file data
squad_impossible_path, squad_data = add_is_impossible(load_squad_file(squad_file_path=squad_file_path),
We are generally using datasets in SQuAD v2 format, where the is_impossible field already exists. Let's have a check here to see if the is_impossible field exists already. If it doesn't (e.g. SQuAD v1), we'll call add_is_impossible to add it; otherwise, let's bypass it.
Also, we should skip the samples where is_impossible=True
Same as before; I will add this check.
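A sketch of both suggestions together, checking whether the field is already present and skipping unanswerable questions (the helper names are hypothetical):

```python
def has_is_impossible(squad_data):
    # Peek at the first question to see whether the dataset already
    # carries the is_impossible field (SQuAD v2 style).
    for article in squad_data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                return "is_impossible" in qa
    return False

def iter_answerable_questions(squad_data):
    # Yield (context, qa) pairs, dropping unanswerable questions.
    for article in squad_data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    continue
                yield paragraph["context"], qa

squad_data = {"data": [{"paragraphs": [{"context": "ctx", "qas": [
    {"question": "a?", "answers": [], "is_impossible": True},
    {"question": "b?", "answers": [{"text": "x"}], "is_impossible": False},
]}]}]}
print(has_is_impossible(squad_data))
print([qa["question"] for _, qa in iter_answerable_questions(squad_data)])
```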
test/benchmarks/squad_to_dpr.py (outdated snippet):
context = paragraph["context"]
for question in paragraph["qas"]:
    answers = [a["text"] for a in question["answers"]]
    hard_negative_ctxs = get_hard_negative_context(retriever=retriever,
In some datasets, like the SQuAD v2 dev set, each question might have multiple answers. It might be safer here to pass not just the first answer but all the answers into get_hard_negative_context().
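A sketch of the suggested change, filtering retrieved contexts against every answer variant rather than only the first (the retriever here is a toy stand-in, not Haystack's API):

```python
def get_hard_negative_contexts(retriever, question, answers, n_ctxs=30):
    # Keep only contexts containing none of the answer variants.
    # Checking all variants avoids mislabelling a context as a hard
    # negative when it contains an alternative phrasing of the answer.
    candidates = retriever(question)
    return [c for c in candidates
            if not any(ans.lower() in c.lower() for ans in answers)][:n_ctxs]

# Toy retriever for illustration.
fake_retriever = lambda q: ["Paris is the capital.", "Berlin is in Germany."]
result = get_hard_negative_contexts(fake_retriever, "Capital of France?",
                                    ["Paris", "the capital of France"])
print(result)
```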
test/benchmarks/squad_to_dpr.py (outdated snippet):
main(squad_file_path=squad_file_path, dpr_output_path=dpr_output_path,
     document_store_type_config=("ElasticsearchDocumentStore", store_dpr_config),
     # retriever_type_config=("ElasticsearchRetriever", retriever_bm25_config), # dpr
Having the commented-out alternative retriever_type_config is a great idea! It makes for easy switching between DPR and BM25. I assume you actually meant to have retriever_type_config=("DensePassageRetriever", retriever_dpr_config), # dpr or something similar?
Yes, it is another typo; it should indeed read retriever_type_config=("DensePassageRetriever", retriever_dpr_config), # dpr
Just made a few small changes in the latest commits:
@psorianom Thanks a lot for your work on this! This all looks good to me, so I'll be merging it into master!
LGTM
Duplicate file removed by 5665d55
Hello,
This is a first proposal for a squad to dpr dataset converter, as mentioned here: #705
I tried to take into account @brandenchan's comments, although they may not follow what you had in mind (notably regarding the API design). Still, it is a nice overview of where I am going with this.
I also have some doubts about how we could handle the large number of possible parameters (different arguments for different stores/retrievers). I took the path of "let the user handle that": you can pass a dict with the appropriate config keywords to each document store/retriever you want to use.
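The "pass a config dict per component" approach could be sketched like this (the registry and component classes are dummies for illustration, not Haystack's actual API):

```python
class DummyStore:
    def __init__(self, **cfg):
        self.cfg = cfg

class DummyRetriever:
    def __init__(self, document_store, **cfg):
        self.document_store = document_store
        self.cfg = cfg

def build_components(document_store_type_config, retriever_type_config, registry):
    # Each component is named by class and configured by a
    # user-supplied dict of constructor keyword arguments.
    store_name, store_cfg = document_store_type_config
    retriever_name, retriever_cfg = retriever_type_config
    store = registry[store_name](**store_cfg)
    retriever = registry[retriever_name](document_store=store, **retriever_cfg)
    return store, retriever

registry = {"ElasticsearchDocumentStore": DummyStore,
            "DensePassageRetriever": DummyRetriever}

store, retriever = build_components(
    ("ElasticsearchDocumentStore", {"index": "document"}),
    ("DensePassageRetriever", {"batch_size": 16}),
    registry,
)
print(type(store).__name__, retriever.cfg)
```

This keeps the script agnostic of which store/retriever combination the user picks, at the cost of not validating the config keys itself.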
As a side question, should we be able to handle SQuAD v1 datasets (without an is_impossible field)? For now, since we work with SQuAD v1 datasets, I just add the is_impossible field with a helper function.
Also, I put the script in benchmarks only because there is a similar converter file there (nq2squad)! I really did not know where to place it :)
Cheers!