Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQuAD to DPR dataset converter #765

Merged
merged 5 commits into from
Feb 1, 2021
Merged

Conversation

psorianom
Copy link
Contributor

@psorianom psorianom commented Jan 24, 2021

Hello,

This is a first proposal for a squad to dpr dataset converter, as mentioned here: #705

I tried to take into account @brandenchan comments, although they may not follow what you had in mind (notably regarding to the API design). Still, it is a nice overview of where I am going with this.
I have also some doubts of how we could handle the large number of possible parameters (different arguments for different stores/retrievers). I took the path of "let the user handle that", so you can pass a dict with the appropriate config keywords to each document store/retriever you want to use.

As a side question, should we be able to treat SQuAD v1 datasets (without an is_impossible field) ? For now, since we work with v1 SQuAD datasets, I just add the is_impossible field with a helper function.

Also, I put the script in benchmarks only cause there is a siimilar converter file there (nq2squad) ! I really did not know where to place it :)
Cheers!

First commit of the squad2dpr script.
@brandenchan brandenchan self-assigned this Jan 25, 2021
Copy link
Contributor

@brandenchan brandenchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your efforts on this! Its very cool to have this script and your code is really well organised. I have made some comments regarding a few pieces of functionality.

Regarding the parameters for initializing Document Stores and Retrievers, what you have is already plenty. I expect that large majority of users will be happy with default values.

It seems to me that your script handles SQuAD v1 datasets well. I just added some suggestions that should help with handling both SQuAD v1 and v2.

As for where this file should go, I think I would couple it with the retriever files i.e. haystack/retriever/squad_to_dpr.py

Feel free to reach out again if you need any clarification or assistance!

"""


class IteratorAsList(list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I usually use list(iter_obj) in order to cast iterators into a list! Unless I'm missing something I think that should do the trick :)

Copy link
Contributor Author

@psorianom psorianom Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi thanks for your thorough review ! I will answer inline :)
yes, indeed! I was under the impression that using this trick (from StackOverflow), I could have json.dump() treat the iterator as a list and thus allow for an end to end lazy evaluation of my iterator. I tested with list(), as usual, it resolves the iterator before sending the produced list to dump(). It does work also, and surprisingly it does not seem to fill the available RAM (the whole reason I am using iterators).
I will remove this class and use the classic list().

parser.add_argument("--dpr_output_path", dest="dpr_output_path",
help="The name of the DPR JSON formatted output file",
metavar="DPR_out", required=True)
parser.add_argument("--num_hard_negative_ctxs", dest="nb_hard_neg_ctxs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into an error when running this script. Should this be dest="nb_hard_neg_ctxs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes indeed, fixed it also !


# 7. Split (train, dev, test) and save dataset
total_nb_questions = get_number_of_questions(squad_data)
split_and_save_dataset(iter_dpr=iter_DPR,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool feature! But some input datasets might already be divided into train and dev. For example, the Squad datasets are already split. Can we either remove this step or make it step optional?

When not performing the dataset split, we will also want to be able to specify the name of the output file in the command line e.g.

python squad_to_dpr.py --squad_file_path ~/data/squad20/dev-v2.0.json --dpr_output_filename ./squad_dev.dpr.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, ofc. I am stuck thinking in my own use case. I will add this option.

tqdm.write(f"Using SQuAD-like file {squad_file_path}")

# 1. Load squad file data
squad_impossible_path, squad_data = add_is_impossible(load_squad_file(squad_file_path=squad_file_path),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are generally using datasets in Squad v2 format where the is_impossible field already exists. Let's have a check here to see if the is_impossible field exists already. If it doesn't (e.g. Squad1) we'll call add_is_impossible to add it. Otherwise, let's bypass it.

Copy link
Contributor

@brandenchan brandenchan Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should skip the samples where is_impossible=True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before. Will add this check.

context = paragraph["context"]
for question in paragraph["qas"]:
answers = [a["text"] for a in question["answers"]]
hard_negative_ctxs = get_hard_negative_context(retriever=retriever,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some datasets, like Squadv2 dev, each question might have multiple answers. It might be safer here to pass not just the first answer but all the answers into get_hard_negative_context()


main(squad_file_path=squad_file_path, dpr_output_path=dpr_output_path,
document_store_type_config=("ElasticsearchDocumentStore", store_dpr_config),
# retriever_type_config=("ElasticsearchRetriever", retriever_bm25_config), # dpr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the commented out alternative retriever_type_config is a great idea! Makes for easy switching between DPR and BM25. I assume you actually meant to have
retriever_type_config=("DensePassageRetriever", retriever_dpr_config), # dpr
or something similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is another typo. Should read retriever_type_config=("DensePassageRetriever", retriever_dpr_config), # dpr indeed.

@psorianom psorianom requested a review from brandenchan January 29, 2021 07:44
@brandenchan
Copy link
Contributor

Just made a few small changes in the latest commits:

  • Merged latest master to fix a CI issue
  • Moved the script to haystack/retriever/squad_to_dpr.py
  • Added a docstring to give an example usage of the script

@brandenchan
Copy link
Contributor

@psorianom Thanks a lot for you work on this! This all looks good to me so I'll be merging it into master!

Copy link
Contributor

@brandenchan brandenchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@brandenchan brandenchan merged commit 16b8291 into deepset-ai:master Feb 1, 2021
@brandenchan
Copy link
Contributor

Duplicate file removed by 5665d55

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants