WIP: Add DPR Training #273
haystack/utils.py
Outdated
        squad_data.append(data(title=title, paragraphs=paras_in_title))
    return SquadSchema(data=squad_data).json()


def convert_dpr_to_squad(json_data, **kwargs):
Let's change this function so that it takes "input_file" and "output_file" as args and writes the results to the output file
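A minimal sketch of the file-based signature requested above. The helper `_to_squad` is a hypothetical placeholder for the PR's actual conversion logic, which is not shown in this thread; only the read-convert-write flow is illustrated, and the JSON layout it returns is an assumption.

```python
import json


def _to_squad(dpr_records: list) -> dict:
    # Hypothetical stand-in for the real DPR-to-SQuAD conversion,
    # which is not shown in this review thread.
    return {"data": dpr_records}


def convert_dpr_to_squad(input_file: str, output_file: str) -> None:
    """
    Read DPR-style JSON from input_file and write the converted result to output_file.

    :param input_file: Path of the JSON file to read.
    :param output_file: Path of the JSON file to write.
    """
    with open(input_file, encoding="utf-8") as f:
        dpr_records = json.load(f)

    squad = _to_squad(dpr_records)

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(squad, f, ensure_ascii=False)
```

Writing to a file instead of returning a JSON string keeps the caller from having to hold the serialized dataset in memory a second time.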
haystack/utils.py
Outdated
    """
    # escape all (
    answer = r"\b{}\b".format(re.escape(answer))
    return [m.start() for m in re.finditer(answer, text)]
What happens for these cases:
- answer not found in text?
- multiple answers found?
Do we ensure that we end up with correctly indexed labels even if those cases happen? Does the user notice if they have such cases in their data?
Right now, multiple matches are treated as different answers in the list of answers for a question in qas.
I'll add a warning for the case where no answer is found and for multiple answers, but keep handling multiple answers as described above.
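A possible shape for the warnings discussed here, as a sketch only: the regex and return value follow the snippet under review, while the logger setup and the warning messages are assumptions, not the code that was eventually merged.

```python
import logging
import re
from typing import List

logger = logging.getLogger(__name__)


def find_answer_start(answer: str, text: str) -> List[int]:
    """
    Get the start indices of all whole-word occurrences of the answer in the text.

    :param answer: Answer string to search for.
    :param text: Passage text to search in.
    :return: Start offsets of every match (empty if there is none).
    """
    # Escape regex metacharacters and require word boundaries on both sides.
    pattern = r"\b{}\b".format(re.escape(answer))
    starts = [m.start() for m in re.finditer(pattern, text)]

    if not starts:
        # Assumed message format: lets the user notice unlabeled samples.
        logger.warning("Answer '%s' not found in text.", answer)
    elif len(starts) > 1:
        logger.warning("Answer '%s' found %d times in text.", answer, len(starts))
    return starts
```

One caveat with this pattern: `\b` only matches between a word character and a non-word character, so an answer that begins or ends with punctuation (for example `C++` followed by a space) will not be found and will trigger the "not found" warning.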
haystack/utils.py
Outdated
def find_answer_start(answer: str, text: str):
    """
    get index of beginning of answer in text
For docstrings: Please start with uppercase and have one empty line between the description and the :param section
ok
haystack/squad_schema.py
Outdated
@@ -0,0 +1,25 @@
from pydantic import BaseModel
Please rename this file to schemas.py
Hi @kolk @tholor @Timoeller, Thanks
Hey @antoniolanza1996,
Yes, absolutely. @kolk is still working on the implementation, but we moved to an implementation in FARM as we want to re-use as much of the training utility features there as possible (checkpointing, streaming, parallelization ...).
You can follow the progress in deepset-ai/FARM#513. We are 95% done with the preprocessing pipeline and will then continue with implementing the model and training. As there are still a few specialties ahead of us (a dual-encoder model that doesn't really fit into FARM's concept of an AdaptiveModel, eval with avg. rank ...), I can't commit to an exact release date yet, but I would roughly estimate 2-4 weeks from now.
convert_df_to_squad and convert_dpr_to_squad functions in utils.py