
Implementing Generative Pseudo Labeling (GPL) #1908

Closed
MichelBartels opened this issue Dec 17, 2021 · 7 comments · Fixed by #2388
Labels
topic:retriever type:feature New feature or request

Comments

@MichelBartels
Contributor

UKPLab has recently released a paper and a codebase on Generative Pseudo Labeling. It seems like this could be a useful feature for Haystack.

It allows fine-tuning a dense retrieval model on unlabeled data by 1) generating synthetic questions using T5, 2) mining hard negatives using an existing dense retrieval pipeline, and 3) creating soft labels using a cross-encoder. Step 3) is important because it compensates for errors introduced in steps 1) and 2).

Step 1) could be implemented easily using the existing QuestionGenerator.
Step 2) could also be implemented easily using the existing DocumentSearchPipeline.
Step 3) is a bit more difficult: the existing SentenceTransformersRanker only outputs documents, so we could either add an argument for outputting the scores or simply load a transformers cross-encoder without using a specific Haystack class (see the sketch below).
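A rough sketch of how the three steps could fit together (only a sketch, assuming the existing QuestionGenerator and DocumentSearchPipeline APIs and a sentence-transformers cross-encoder; `retriever`, `document_store` and the model name are placeholders):

```python
from haystack.nodes import QuestionGenerator
from haystack.pipelines import DocumentSearchPipeline
from sentence_transformers import CrossEncoder

question_generator = QuestionGenerator()                               # step 1: generate synthetic queries with T5
search_pipeline = DocumentSearchPipeline(retriever)                    # step 2: mine hard negatives with the existing retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # step 3: score (query, passage) pairs for soft labels

training_examples = []
for doc in document_store.get_all_documents():
    for query in question_generator.generate(doc.content):
        # retrieve similar passages and keep one that is not the source passage as a hard negative
        results = search_pipeline.run(query=query, params={"Retriever": {"top_k": 10}})
        negatives = [d.content for d in results["documents"] if d.content != doc.content]
        if not negatives:
            continue
        # soft label: margin between the cross-encoder scores of (query, positive) and (query, negative)
        pos_score, neg_score = cross_encoder.predict([(query, doc.content), (query, negatives[0])])
        training_examples.append((query, doc.content, negatives[0], pos_score - neg_score))
```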

Additionally, we would need to create subclasses of Trainer and DataSilo, but these should be quite easy, as it would only be necessary to override __init__ and compute_loss/_dataset_from_chunk.
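Conceptually, the overridden compute_loss would implement the MarginMSE objective used by GPL: the bi-encoder's score margin between the positive and the hard negative is regressed onto the cross-encoder's margin. A plain PyTorch sketch of that computation (names and shapes are illustrative, not the actual Trainer interface):

```python
import torch.nn.functional as F

def margin_mse_loss(query_emb, pos_emb, neg_emb, ce_margin):
    # query_emb, pos_emb, neg_emb: [batch, dim] bi-encoder embeddings
    # ce_margin: [batch] cross-encoder score margins (the pseudo labels)
    pos_score = (query_emb * pos_emb).sum(dim=-1)   # dot-product similarity to the positive passage
    neg_score = (query_emb * neg_emb).sum(dim=-1)   # dot-product similarity to the hard negative
    return F.mse_loss(pos_score - neg_score, ce_margin)
```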

The training method should probably be a method of DensePassageRetriever, similar to the existing train method.

The dataset input for this method could either be a file path, like for the train method, or a DocumentStore.
An advantage of also using a file would be consistency, but using a DocumentStore would probably be easier for the user, as they would have the data in a DocumentStore in most use cases anyway.

The results from the different steps could also optionally be saved to a text file (e.g. a CSV file) for caching/debugging. They could also be stored as Label objects, but that would probably require too many modifications to the Label class.
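For example, the intermediate results could be cached as one CSV row per generated training example (the column layout here is just a suggestion):

```python
import csv

# training_examples: list of (query, positive_passage, negative_passage, margin) tuples
with open("gpl_training_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "positive", "negative", "margin"])
    writer.writerows(training_examples)
```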

@julian-risch
Member

@vblagoje Here is the issue on adding GPL to Haystack. I'll assign it to you now, although we know that you won't start working on it in the next few days. @bogdankostic can support you if you would like to discuss ideas or get feedback on a draft PR.

@vblagoje
Member

vblagoje commented Mar 28, 2022

Hi @julian-risch, @bogdankostic and @MichelBartels - I'll be implementing GPL this week. From the outset, I wanted to run four design questions/approaches by you.

  1. Document input format. As @MichelBartels suggested, it makes a lot of sense to have DocumentStore as input, but I would also add the BEIR format. So we'll support those two formats as input; if users require more input formats, we can add them later.

  2. GPL dependency-free implementation. I am leaning toward an implementation that doesn't require the GPL dependency. Yes, GPL has a nice one-liner for training, but it writes intermediate files to disk, so I'd instead implement GPL directly using only sentence_transformers, like Nils implemented it in this Colab (a rough sketch of that training step follows this list). We might reuse some internal components as @MichelBartels suggested; that way, we don't have to manage yet another dependency or be at the mercy of potential changes in GPL that we might not like.

  3. Is GPL domain adaptation a single atomic transaction? As you know, there are several distinct steps in GPL (@MichelBartels outlined them above). However, as I see it, two significant phases emerge.
    The first phase covers all three steps outlined above (question generation, negative mining and pseudo labelling); I would treat them as a single transaction.
    The second phase involves training/adapting the retriever on the training data generated in the first phase.
    My main reason for this delineation is that the first phase produces the training data for the second, which is an excellent natural border. It seems like overengineering to checkpoint and restart the individual sub-steps (question generation, negative mining and pseudo labelling) of the first phase. This way, one could restart the second phase with the training data saved from the first, fine-tune the training hyperparameters, etc. In addition, one could simply execute the two transactions in series and have GPL adaptation done in one shot.

  4. Are we aiming to GPL-optimize SBert bi-encoders and/or DPRs as well? Note that the GPL example adaptations are run on SBert bi-encoders, which are architecturally different from DensePassageRetrievers. Should we support just bi-encoders or both?
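For reference on point 2, the dependency-free training step essentially boils down to the standard sentence-transformers recipe with MarginMSELoss, roughly as in Nils' Colab (a minimal sketch; the bi-encoder model name, the `training_examples` tuples and the hyperparameters are only examples):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v3")  # example bi-encoder to adapt

# training_examples: (query, positive_passage, negative_passage, cross_encoder_margin) tuples from phase one
train_samples = [
    InputExample(texts=[query, pos, neg], label=float(margin))
    for query, pos, neg, margin in training_examples
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)
train_loss = losses.MarginMSELoss(model)  # regresses the bi-encoder margin onto the cross-encoder margin

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
    show_progress_bar=True,
)
model.save("gpl-adapted-retriever")
```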

@bogdankostic
Contributor

Hi @vblagoje, very much looking forward to your implementation! Regarding your questions:

  1. I like the idea of being able to provide data in BEIR format; this goes well with Integrate BEIR #2333, where we integrated BEIR to evaluate a given pipeline.
  2. I agree that it would be better to add GPL to Haystack without adding another dependency.
  3. I think that splitting GPL into Label Generation (including question generation, negative mining and pseudo labelling) and Training makes sense. This would allow using the generated training data on different retriever models without having to regenerate it.
  4. It would be great to support both SBert bi-encoders and DPRs. Adding GPL to sentence-transformers models (which are used in our EmbeddingRetriever) in Haystack might require a bit more work than adding it to DPR, as we currently don't have any training procedures for the models used in EmbeddingRetriever.

Happy to answer any further questions and help! :)

@vblagoje
Member

vblagoje commented Mar 30, 2022

Tagging #1239 as a soft requirement for this feature. Without batch querying, GPL will be very slow.

@SaraAmd

SaraAmd commented Mar 9, 2023


The code in the Colab notebook linked in point 2 above throws an error. Could you please fix it?

@mayankjobanputra
Contributor

@SaraAmd I think that notebook is very old. In case you want to use GPL, I suggest you look at Haystack's documentation here: https://docs.haystack.deepset.ai/docs/pseudo_label_generator

Feel free to open a discussion in case you have questions about it after reading the docs :)

@SaraAmd

SaraAmd commented Mar 9, 2023


Thank you for the response. I just need the section on negative mining, not the GPL part.
