Combination of extractive QA and FAQ-style #280

Closed
mongoose54 opened this issue Aug 1, 2020 · 7 comments

@mongoose54

mongoose54 commented Aug 1, 2020

In your FAQ-Style QA tutorial you mention that a combination of extractive QA and FAQ-style is a very interesting idea, and I totally agree. Do you plan to have a tutorial on this?

In the meantime I want to experiment with the following: I have 1) a corpus dataset (i.e. research papers) without QA annotations, and 2) a small QA dataset in the same domain, and I want to combine/utilize both.

I see how to fine-tune the reader model by following the fine-tuning tutorial, but I also want to use the DensePassageRetriever. How can I fine-tune it with my corpus dataset?


@tholor
Member

tholor commented Aug 2, 2020

Hi @mongoose54 ,

In your FAQ-Style QA tutorial you mention that a combination of extractive QA and FAQ-style is a very interesting idea, and I totally agree. Do you plan to have a tutorial on this?

We will refactor the FAQ-Style QA part in Haystack quite significantly soon to make it more generic for finding "most similar documents". Regarding the combination: I would see it happening at a rather high level. For example, you could have two Finders: one that does extractive QA and one that finds the most similar FAQs. At runtime, you could call both and combine the results. Depending on the use case, you could boost the results from the FAQ side a bit more, as these might be "more trusted". Or: only query the "slower reader" if the FAQ side doesn't return results above a certain score.
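
To make the "query FAQ first, fall back to extractive QA" idea concrete, here is a rough sketch. It assumes two Finder instances built elsewhere (one extractive, one FAQ-style) and the Haystack 0.x Finder methods get_answers / get_answers_via_similar_questions; the exact result fields ("answers", "score") and the threshold value are assumptions you would adapt to your own setup.

```python
# Sketch of combining an FAQ-style Finder with an extractive-QA Finder.
# Field names and threshold are assumptions, not a fixed Haystack API contract.

FAQ_SCORE_THRESHOLD = 0.8  # tune on held-out queries

def answer(question, faq_finder, extractive_finder):
    # 1) Cheap first pass: look for a sufficiently similar FAQ entry.
    faq_result = faq_finder.get_answers_via_similar_questions(
        question=question, top_k_retriever=3
    )
    faq_answers = faq_result.get("answers", [])
    if faq_answers and faq_answers[0].get("score", 0) >= FAQ_SCORE_THRESHOLD:
        return {"source": "faq", "answers": faq_answers}

    # 2) Only run the slower extractive reader when the FAQ match is weak.
    qa_result = extractive_finder.get_answers(
        question=question, top_k_retriever=10, top_k_reader=3
    )
    return {"source": "extractive", "answers": qa_result.get("answers", [])}
```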

I see how to fine-tune the reader model by following the fine-tuning tutorial, but I also want to use the DensePassageRetriever. How can I fine-tune it with my corpus dataset?

We are working on the training / fine-tuning option for DPR in #273.
However, DPR training does not operate on a raw corpus either; it requires pairs of "query" and "positive passage". So ideally you have these pairs for your domain (e.g. collected from user feedback at runtime via the Haystack API). This is often easier / faster to create than QA annotations. If you don't have this data, there will be an alternative that generates it heuristically from pairs of question and answer strings (without an annotated offset in a particular doc), but performance might suffer a bit.
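
As a rough illustration of that heuristic, the sketch below turns (question, answer-string) pairs into DPR-style training examples by searching the raw corpus for a passage that contains the answer string. The output layout mirrors the format used by the original DPR repo (question / positive_ctxs / ...); whether the upcoming Haystack trainer accepts exactly these field names is an assumption.

```python
def build_dpr_pairs(qa_pairs, passages):
    """qa_pairs: list of (question, answer) string tuples; passages: list of raw text passages."""
    examples = []
    for question, answer in qa_pairs:
        # Pick passages that contain the answer string verbatim (case-insensitive).
        positives = [p for p in passages if answer.lower() in p.lower()]
        if not positives:
            continue  # no passage contains the answer string; skip this pair
        examples.append({
            "question": question,
            "answers": [answer],
            "positive_ctxs": [{"title": "", "text": positives[0]}],
            "negative_ctxs": [],
            "hard_negative_ctxs": [],
        })
    return examples
```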

If you really want to experiment on your raw corpus, you could try training your encoders via an Inverse Cloze Task (see https://arxiv.org/pdf/1906.00300.pdf). The performance of this type of retriever is a bit worse than DPR's, but the data it requires is less demanding to obtain.
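
For reference, a minimal sketch of Inverse Cloze Task pair generation as described in the linked paper: a randomly chosen sentence from a passage becomes the pseudo-query and the rest of the passage is its positive context. Splitting on "." is a simplification; a real pipeline would use a proper sentence splitter.

```python
import random

def ict_pair(passage, keep_sentence_prob=0.1):
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    if len(sentences) < 2:
        return None  # need at least one sentence for query and one for context
    idx = random.randrange(len(sentences))
    query = sentences[idx]
    # With small probability keep the query sentence in the context, so the model
    # also sees cases of plain lexical overlap (as in the original ICT setup).
    if random.random() < keep_sentence_prob:
        context = ". ".join(sentences)
    else:
        context = ". ".join(s for i, s in enumerate(sentences) if i != idx)
    return query, context
```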

Hope this helps!

@mongoose54
Author

@tholor

Thank you for the wonderful explanation and description. I have a follow-up question:

However, DPR training does not operate on a raw corpus either but requires pairs of "query" and "positive passage".

Can you elaborate on the "query" and "positive passage"? Is the "query" also a question? Also, is there an example of how to do this with your Haystack API?

@tholor
Member

tholor commented Sep 14, 2020

Can you elaborate on the "query" and "positive passage"? Is the "query" also a question?

Yes, in our case (i.e. QA):

  • query = question
  • "positive passage" = the passage that contains the answer
    These training pairs can easily be obtained from public QA datasets, your own QA annotations or user feedback at production time.
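A hypothetical example of that conversion: given a SQuAD-style annotation file, the question becomes the query and the paragraph the answer was annotated in becomes the positive passage. The file path and helper name here are made up for illustration.

```python
import json

def squad_to_retriever_pairs(squad_path):
    with open(squad_path) as f:
        squad = json.load(f)
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            passage = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible"):
                    continue  # skip unanswerable questions (SQuAD 2.0)
                pairs.append({"query": qa["question"], "positive_passage": passage})
    return pairs
```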

Also, is there an example to do this on your Haystack API?

Inference: Yes, have a look at this tutorial, which uses the DensePassageRetriever (= "DPR").
Training: As mentioned above, we don't have the training part implemented in Haystack yet, but we are on it :). You can follow our progress in deepset-ai/FARM#513.

@guillim
Contributor

guillim commented Sep 17, 2020

Hi @mongoose54 ,

In your FAQ-Style QA tutorial you mention that a combination of extractive QA and FAQ-style is a very interesting idea, and I totally agree. Do you plan to have a tutorial on this?

We will refactor the FAQ-Style QA part in Haystack quite significantly soon to make it more generic for finding "most similar documents".

Hello @tholor,
I came across this issue and read that you are about to refactor the FAQ tutorial. About that, I should mention that I can't make it work on my side. I am using Elasticsearch, and I have read elsewhere that this could be the reason. Am I the only one with this problem? I am running it inside a Docker image, if that helps with troubleshooting.

@tholor tholor self-assigned this Sep 17, 2020
@tholor
Member

tholor commented Sep 21, 2020

Hey @guillim ,

Do you mean you get an error when running Tutorial 4 locally in Docker? If so, can you please create a new issue with the error and environment details?
I just tried the tutorial with the latest master / haystack 0.4.0 on Colab and it worked fine.

@guillim
Contributor

guillim commented Sep 21, 2020

I think I just found out where the problem comes from. I will create an issue and an associated PR. It comes down to the ElasticsearchDocumentStore initialisation, more precisely to the _create_document_index function.

@tholor
Member

tholor commented Sep 22, 2020

The bug should be fixed with #415. Feel free to re-open if not.

We will follow up on the implementation of combining multiple retrievers in #125 and the refactoring of "most similar document search" in #420

@tholor tholor closed this as completed Sep 22, 2020