Combination of extractive QA and FAQ-style #280

Closed
mongoose54 opened this issue Aug 1, 2020 · 7 comments

@mongoose54

mongoose54 commented Aug 1, 2020

In your FAQ-Style QA tutorial you mention that a combination of extractive QA and FAQ-style is a very interesting idea, and I totally agree. Do you plan to have a tutorial on this?

In the meantime I want to experiment with the following: I have 1) a corpus dataset (i.e. research papers) without QA annotations, and 2) a small QA dataset in the same domain, and I want to combine/utilize both.

I see how to fine-tune the reader model by following the fine-tuning tutorial, but I also want to use the DensePassageRetriever. How can I fine-tune it with my corpus dataset?


@tholor
Member

tholor commented Aug 2, 2020

Hi @mongoose54 ,

In your FAQ-Style QA tutorial you mention that a combination of extractive QA and FAQ-style is a very interesting idea, and I totally agree. Do you plan to have a tutorial on this?

We will refactor the FAQ-Style QA part in Haystack quite significantly soon to make it more generic for finding "most similar documents". Regarding the combination: I would see it happening at a rather high level. For example, you could have two Finders: one that does extractive QA and one that finds the most similar FAQs. At runtime, you could call both and combine the results. Depending on the use case, you could boost the results from the FAQ side a bit more, as these might be "more trusted". Or: only query the "slower reader" if the FAQ side doesn't return results above a certain score.
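
To make the "query FAQ first, fall back to extractive QA" idea concrete, here is a rough sketch. It assumes two Finder instances built elsewhere (one extractive, one FAQ-style) and the Haystack 0.x Finder methods get_answers / get_answers_via_similar_questions; the exact result fields ("answers", "score") and the threshold value are assumptions you would adapt to your own setup.

```python
# Sketch of combining an FAQ-style Finder with an extractive-QA Finder.
# Field names and threshold are assumptions, not a fixed Haystack API contract.

FAQ_SCORE_THRESHOLD = 0.8  # tune on held-out queries

def answer(question, faq_finder, extractive_finder):
    # 1) Cheap first pass: look for a sufficiently similar FAQ entry.
    faq_result = faq_finder.get_answers_via_similar_questions(
        question=question, top_k_retriever=3
    )
    faq_answers = faq_result.get("answers", [])
    if faq_answers and faq_answers[0].get("score", 0) >= FAQ_SCORE_THRESHOLD:
        return {"source": "faq", "answers": faq_answers}

    # 2) Only run the slower extractive reader when the FAQ match is weak.
    qa_result = extractive_finder.get_answers(
        question=question, top_k_retriever=10, top_k_reader=3
    )
    return {"source": "extractive", "answers": qa_result.get("answers", [])}
```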

I see how to fine-tune the reader model by following the fine-tuning tutorial, but I also want to use the DensePassageRetriever. How can I fine-tune it with my corpus dataset?

We are working on the training / fine-tuning option for DPR in #273.
However, DPR training does not operate on a raw corpus either; it requires pairs of "query" and "positive passage". So ideally you have these pairs for your domain (e.g. collected from user feedback at runtime via the Haystack API). This is often easier / faster to create than QA annotations. If you don't have this data, there will be an alternative that generates it heuristically from pairs of question and answer strings (without an annotated offset in a particular doc), but performance might suffer a bit.
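
As a rough illustration of that heuristic, the sketch below turns (question, answer-string) pairs into DPR-style training examples by searching the raw corpus for a passage that contains the answer string. The output layout mirrors the format used by the original DPR repo (question / positive_ctxs / ...); whether the upcoming Haystack trainer accepts exactly these field names is an assumption.

```python
def build_dpr_pairs(qa_pairs, passages):
    """qa_pairs: list of (question, answer) string tuples; passages: list of raw text passages."""
    examples = []
    for question, answer in qa_pairs:
        # Pick passages that contain the answer string verbatim (case-insensitive).
        positives = [p for p in passages if answer.lower() in p.lower()]
        if not positives:
            continue  # no passage contains the answer string; skip this pair
        examples.append({
            "question": question,
            "answers": [answer],
            "positive_ctxs": [{"title": "", "text": positives[0]}],
            "negative_ctxs": [],
            "hard_negative_ctxs": [],
        })
    return examples
```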

If you really want to experiment on your raw corpus, you could try training your encoders via an Inverse Cloze Task (see https://arxiv.org/pdf/1906.00300.pdf). The performance of this type of retriever is a bit worse than DPR's, but the data it requires is less demanding to obtain.
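
For reference, a minimal sketch of Inverse Cloze Task pair generation as described in the linked paper: a randomly chosen sentence from a passage becomes the pseudo-query and the rest of the passage is its positive context. Splitting on "." is a simplification; a real pipeline would use a proper sentence splitter.

```python
import random

def ict_pair(passage, keep_sentence_prob=0.1):
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    if len(sentences) < 2:
        return None  # need at least one sentence for query and one for context
    idx = random.randrange(len(sentences))
    query = sentences[idx]
    # With small probability keep the query sentence in the context, so the model
    # also sees cases of plain lexical overlap (as in the original ICT setup).
    if random.random() < keep_sentence_prob:
        context = ". ".join(sentences)
    else:
        context = ". ".join(s for i, s in enumerate(sentences) if i != idx)
    return query, context
```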

Hope this helps!

@mongoose54
Author

@tholor

Thank you for the wonderful explanation and description. I have a follow-up question:

However, DPR training does not operate on a raw corpus either but requires pairs of "query" and "positive passage".

Can you elaborate on the "query" and "positive passage"? Is the "query" also a question? Also, is there an example of how to do this with your Haystack API?

@tholor
Member

tholor commented Sep 14, 2020

Can you elaborate on the "query" and "positive passage"? Is the "query" also a question?

Yes, in our case (i.e. QA):

  • query = question
  • "positive passage" = the passage that contains the answer
    These training pairs can easily be obtained from public QA datasets, your own QA annotations or user feedback at production time.
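A hypothetical example of that conversion: given a SQuAD-style annotation file, the question becomes the query and the paragraph the answer was annotated in becomes the positive passage. The file path and helper name here are made up for illustration.

```python
import json

def squad_to_retriever_pairs(squad_path):
    with open(squad_path) as f:
        squad = json.load(f)
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            passage = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible"):
                    continue  # skip unanswerable questions (SQuAD 2.0)
                pairs.append({"query": qa["question"], "positive_passage": passage})
    return pairs
```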

Also, is there an example to do this on your Haystack API?

Inference: Yes, have a look at this tutorial, which uses the DensePassageRetriever (= "DPR").
Training: As mentioned above, we don't have the training part implemented in Haystack yet, but we are on it :). You can follow our progress in deepset-ai/FARM#513.

@guillim
Contributor

guillim commented Sep 17, 2020

Hi @mongoose54 ,

In your FAQ-Style QA tutorial you mention that a combination of extractive QA and FAQ-style is a very interesting idea, and I totally agree. Do you plan to have a tutorial on this?

We will refactor the FAQ-Style QA part in Haystack quite significantly soon to make it more generic for finding "most similar documents".

Hello @tholor,
I came across this issue and read that you are about to refactor the FAQ tutorial. About that, I should mention that I can't make it work on my side. I am using Elasticsearch, and I have read elsewhere that this could be the reason. Am I the only one with this problem? I am running it inside a Docker image, if that helps with troubleshooting.

@tholor tholor self-assigned this Sep 17, 2020
@tholor
Member

tholor commented Sep 21, 2020

Hey @guillim ,

Do you mean you get an error when running Tutorial 4 locally in Docker? If so, can you please create a new issue with the error and environment details?
I just tried the tutorial with the latest master / haystack 0.4.0 on Colab and it worked fine.

@guillim
Contributor

guillim commented Sep 21, 2020

I think I just found out where the problem comes from. I will create an issue and an associated PR. It comes down to the ElasticsearchDocumentStore initialisation, more precisely to the _create_document_index function.

@tholor
Member

tholor commented Sep 22, 2020

The bug should be fixed with #415. Feel free to re-open if not.

We will follow up on the implementation of combining multiple retrievers in #125 and the refactoring of "most similar document search" in #420

@tholor tholor closed this as completed Sep 22, 2020