WIP: Add DPR Training #273

kolk · 2020-07-29T11:46:11Z

created squad_schema
added convert_df_to_squad and convert_dpr_to_squad functions in utils.py

tholor · 2020-07-29T14:31:27Z

haystack/utils.py

+        squad_data.append(data(title=title, paragraphs=paras_in_title))
+    return SquadSchema(data=squad_data).json()
+
+def convert_dpr_to_squad(json_data, **kwargs):


Let's change this function so that it takes "input_file" and "output_file" as arguments and writes the results to the output file.
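A minimal sketch of what the requested signature could look like. This is not the code from the PR: the input field names ("question", "answers", "positive_ctxs", "title", "text") are assumptions based on the public DPR training data layout, and plain dicts stand in for the pydantic schema used in the diff.

```python
import json
import re


def convert_dpr_to_squad(input_file, output_file):
    """
    Convert a DPR-style JSON file into a SQuAD-style JSON file.

    :param input_file: Path of the DPR-style JSON file to read (assumed layout, see above).
    :param output_file: Path the SQuAD-style JSON is written to.
    """
    with open(input_file, encoding="utf-8") as f:
        dpr_samples = json.load(f)

    squad_data = []
    for sample in dpr_samples:
        # One SQuAD entry per positive passage of the question
        for ctx in sample.get("positive_ctxs", []):
            answers = []
            for answer in sample.get("answers", []):
                # Same whole-word lookup as in the diff under review
                pattern = r"\b{}\b".format(re.escape(answer))
                for match in re.finditer(pattern, ctx["text"]):
                    answers.append({"text": answer, "answer_start": match.start()})
            qas = [{"question": sample["question"], "answers": answers}]
            paragraphs = [{"context": ctx["text"], "qas": qas}]
            squad_data.append({"title": ctx.get("title", ""), "paragraphs": paragraphs})

    # Write the result to disk instead of returning a JSON string
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump({"data": squad_data}, f)
```

Taking file paths keeps the caller from having to load and serialize JSON itself, which is the point of the review comment.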

tholor · 2020-07-29T14:34:21Z

haystack/utils.py

+    """
+    # escape all (
+    answer = r"\b{}\b".format(re.escape(answer))
+    return [m.start() for m in re.finditer(answer, text)]


What happens in these cases?

Answer not found in text?

Multiple answers found?

Do we ensure that the correct labels end up indexed even when these cases occur? Does the user notice if they have such cases in their data?

kolk · 2020-07-30

Right now, multiple matches are treated as separate answers in the list of answers for a question in qas.
I'll add a warning for the case where no answer is found and for multiple matches, but keep handling multiple matches as described above.
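A sketch of how the two cases discussed above could be surfaced to the user. It reuses the regex from the diff; the logger setup, the warning texts and the return annotation are assumptions, not the code that was eventually committed.

```python
import logging
import re
from typing import List

logger = logging.getLogger(__name__)


def find_answer_start(answer: str, text: str) -> List[int]:
    """
    Get the start indices of all whole-word occurrences of an answer in a text.

    :param answer: Answer string to look for.
    :param text: Text to search in.
    :return: List of character offsets; empty if the answer does not occur.
    """
    # Escape regex metacharacters such as "(" so the answer is matched literally
    pattern = r"\b{}\b".format(re.escape(answer))
    starts = [m.start() for m in re.finditer(pattern, text)]
    if not starts:
        logger.warning("Answer '%s' not found in text", answer)
    elif len(starts) > 1:
        logger.warning("Answer '%s' found %d times in text", answer, len(starts))
    return starts
```

Note that `\b` only matches next to a word character, so an answer that begins or ends with punctuation may go unmatched; the no-match warning would at least make such samples visible.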

tholor · 2020-07-29T14:35:46Z

haystack/utils.py

+
+def find_answer_start(answer: str, text: str):
+    """
+    get index of beginning of answer in text


For docstrings: Please start with uppercase and have one empty line between the description and the :param section

tholor · 2020-07-29T14:49:33Z

haystack/squad_schema.py

@@ -0,0 +1,25 @@
+from pydantic import BaseModel


Please rename this file to schemas.py

antoniolanza1996 · 2020-09-08T21:23:45Z

Hi @kolk @tholor @Timoeller ,
is this PR still in your roadmap? Are you going to integrate DPR training on Haystack or not? If so, do you have an approximate release date?

Thanks

tholor · 2020-09-09T08:56:06Z

Hey @antoniolanza1996 ,

> is this PR still in your roadmap? Are you going to integrate DPR training on Haystack or not?

Yes, absolutely. @kolk is still working on the implementation, but we moved to an implementation in FARM as we want to re-use as much of the training utility features there as possible (checkpointing, streaming, parallelization ...).

> If so, do you have an approximate release date?

You can follow the progress in deepset-ai/FARM#513. We are 95% done with the preprocessing pipeline and will then continue with implementing the model + training. As there are still a few specialties ahead of us (dual encoder model that doesn't really fit into FARM's concept of an AdaptiveModel, eval with avg. rank ...), I can't commit to an exact release date yet, but I would roughly estimate 2-4 weeks from now.

tholor and others added 18 commits July 17, 2020 15:20

WIP add eval for DPR

e258b72

increase timeout for ES retrieval. add describe_documents()

03c991f

Update tutorials

9dae1e6

Add comment in tutorial

eff84ff

add basic tests for eval

1b0175b

WIP refactor add_eval_data

15f5bc4

refactor retriever.eval() to use Label objects and allow open domain …

fdc4720

…eval

refactor Document class. Introducing UUID

0d7944c

refactor FARMReader.eval()

585d45b

fix ES and inmemory tests

2e2113c

Refactor SQL store to comply with the changes

58a00f7

Refactor InMemory Document Store to comply with the changes

7ebbe4c

Update test fixtures

83964e9

Fix breaking tests for reader

c489f72

Fix SQL query

1a33498

Make eval comptabile with SQL and InMemory Document Stores

74a4f2e

Add default index value

55245bd

dpr json to squad json conversion

c587be7

kolk requested a review from tholor July 29, 2020 11:46

tholor changed the title ~~dpr json to squad json conversion~~ Add DPR Training Jul 29, 2020

tholor removed their request for review July 29, 2020 14:29

tholor changed the title ~~Add DPR Training~~ WIP: Add DPR Training Jul 29, 2020

tholor requested changes Jul 29, 2020

tanaysoni force-pushed the eval_dpr branch from e138b88 to cce3f81 Compare July 30, 2020 07:10

kolk added 2 commits July 30, 2020 14:34

changed filename and arguments, added checks in find_answer_start

0f1fc63

removed testing lines

6608b93

kolk changed the base branch from eval_dpr to master July 31, 2020 11:02

This was referenced Aug 1, 2020

DensePassageRetriever exception for use_amp=O1 #277

Closed

Combination of extractive QA and FAQ-style #280

Closed

tutorial 7: DPR training data preprocessing, find_answer_start bug fi…

9a3d1e2

…x in utils

Bug fix for Tutorial7_Create_preprocessed_data_for_Dense_Retrieval_tr…

28106e4

…aining.ipynb

kolk requested a review from Timoeller August 4, 2020 13:46

Timoeller mentioned this pull request Aug 7, 2020

Different document embedding values between DPR repository and Haystack DPR implementation #295

Closed

antoniolanza1996 mentioned this pull request Aug 7, 2020

Added title during DPR passage embedding && ElasticsearchDocumentStore #298

Merged

HFBertEncoder replaced with transformer DPR encoders

541c414

tholor mentioned this pull request Aug 20, 2020

Query regarding FAQ-style QA, real time QA and Elasterini #300

Closed

tholor assigned kolk Sep 14, 2020

tholor added this to the #3 milestone Oct 21, 2020

kolk closed this Oct 28, 2020

julian-risch deleted the dpr_data_preprocessing branch November 15, 2021 07:06
