WIP: Add DPR Training #273

Closed
wants to merge 23 commits into from

kolk
Contributor

@kolk kolk commented Jul 29, 2020

  1. created squad_schema
  2. added convert_df_to_squad and convert_dpr_to_squad functions in utils.py

@kolk kolk requested a review from tholor July 29, 2020 11:46
@tholor tholor changed the title dpr json to squad json conversion Add DPR Training Jul 29, 2020
@tholor tholor removed their request for review July 29, 2020 14:29
@tholor tholor changed the title Add DPR Training WIP: Add DPR Training Jul 29, 2020
squad_data.append(data(title=title, paragraphs=paras_in_title))
return SquadSchema(data=squad_data).json()

def convert_dpr_to_squad(json_data, **kwargs):
Member

Let's change this function so that it takes "input_file" and "output_file" as args and writes the results to the output file
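
A minimal sketch of what the requested signature could look like. It is not the PR's code: the DPR field names (`question`, `answers`, `positive_ctxs`, `title`, `text`), the `id` scheme, and the use of plain dicts instead of the PR's `SquadSchema` are all assumptions made for illustration.

```python
import json
import re


def find_answer_start(answer, text):
    # Same whole-word regex lookup as in the diff under review.
    pattern = r"\b{}\b".format(re.escape(answer))
    return [m.start() for m in re.finditer(pattern, text)]


def convert_dpr_to_squad(input_file, output_file):
    """
    Convert a DPR-style JSON file to a SQuAD-style JSON file.

    :param input_file: path of the DPR JSON file to read (field names assumed)
    :param output_file: path the SQuAD-style JSON is written to
    """
    with open(input_file, encoding="utf-8") as f:
        dpr_records = json.load(f)

    # Group paragraphs by the title of the positive context they come from.
    paragraphs_by_title = {}
    for q_idx, record in enumerate(dpr_records):
        for c_idx, ctx in enumerate(record.get("positive_ctxs", [])):
            answers = [
                {"text": answer, "answer_start": start}
                for answer in record.get("answers", [])
                for start in find_answer_start(answer, ctx["text"])
            ]
            paragraph = {
                "context": ctx["text"],
                "qas": [
                    {
                        "id": f"{q_idx}-{c_idx}",  # hypothetical id scheme
                        "question": record["question"],
                        "answers": answers,
                    }
                ],
            }
            paragraphs_by_title.setdefault(ctx.get("title", ""), []).append(paragraph)

    squad = {
        "data": [
            {"title": title, "paragraphs": paragraphs}
            for title, paragraphs in paragraphs_by_title.items()
        ]
    }
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(squad, f, ensure_ascii=False)
```

Taking file paths rather than parsed JSON keeps the read, convert, and write steps in one call, which is what the comment above asks for.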

"""
# escape all (
answer = r"\b{}\b".format(re.escape(answer))
return [m.start() for m in re.finditer(answer, text)]
Member

What happens in these cases?

  • The answer is not found in the text.
  • Multiple occurrences of the answer are found.

Do we make sure the labels end up correctly indexed even when these cases occur? And does the user get notified if their data contains such cases?

Contributor Author

Right now, multiple matches are treated as separate entries in the list of answers for a question in qas.
I'll add a warning for the case where no answer is found and for the case where multiple answers are found, but I'll keep handling multiple answers as described here.
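
One way the proposed warnings could look, as a rough sketch. The regex lookup is taken from the diff; the `logging` setup and the warning messages are assumptions, not the PR's final code.

```python
import logging
import re

logger = logging.getLogger(__name__)


def find_answer_start(answer, text):
    """Return the start index of every whole-word match of the answer in the text."""
    pattern = r"\b{}\b".format(re.escape(answer))
    starts = [m.start() for m in re.finditer(pattern, text)]
    if not starts:
        # No label can be created for this answer, so tell the user.
        logger.warning("Answer %r not found in text.", answer)
    elif len(starts) > 1:
        # Every match is kept and later becomes its own entry in `answers`.
        logger.warning("Answer %r found %d times in text.", answer, len(starts))
    return starts
```

With this shape, the caller still receives all matches, and the two edge cases show up in the logs rather than passing silently.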


def find_answer_start(answer: str, text: str):
"""
get index of beginning of answer in text
Member

For docstrings: Please start with uppercase and have one empty line between the description and the :param section

Contributor Author

ok
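
For illustration, the requested docstring layout applied to this function might look as follows. The parameter descriptions and the `:return:` line are assumptions, since the diff only shows the first line of the docstring.

```python
import re


def find_answer_start(answer: str, text: str):
    """
    Get the index of the beginning of the answer in the text.

    :param answer: answer string to look for
    :param text: passage that is searched for the answer
    :return: list of start indices, one per whole-word match
    """
    answer = r"\b{}\b".format(re.escape(answer))
    return [m.start() for m in re.finditer(answer, text)]
```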

@@ -0,0 +1,25 @@
from pydantic import BaseModel
Member

Please rename this file to schemas.py

@kolk kolk changed the base branch from eval_dpr to master July 31, 2020 11:02
@antoniolanza1996
Contributor

Hi @kolk @tholor @Timoeller ,
is this PR still in your roadmap? Are you going to integrate DPR training on Haystack or not? If so, do you have an approximate release date?

Thanks

@tholor
Member

tholor commented Sep 9, 2020

Hey @antoniolanza1996 ,

is this PR still in your roadmap? Are you going to integrate DPR training on Haystack or not?

Yes, absolutely. @kolk is still working on the implementation, but we moved to an implementation in FARM as we want to re-use as much of the training utility features there as possible (checkpointing, streaming, parallelization ...).

If so, do you have an approximate release date?

You can follow the progress in deepset-ai/FARM#513. We are 95% done with the preprocessing pipeline and will then continue with implementing the model + training. As there are still a few special cases ahead of us (dual encoder model that doesn't really fit into FARM's concept of an AdaptiveModel, eval with avg. rank ...), I can't commit to an exact release date yet, but I would roughly estimate 2-4 weeks from now.

@tholor tholor added this to the #3 milestone Oct 21, 2020
@kolk kolk closed this Oct 28, 2020
@julian-risch julian-risch deleted the dpr_data_preprocessing branch November 15, 2021 07:06
4 participants