
Implementing Generative Pseudo Labeling (GPL) #1908

Closed
MichelBartels opened this issue Dec 17, 2021 · 7 comments · Fixed by #2388
Labels
topic:retriever type:feature New feature or request

Comments

@MichelBartels
Contributor

UKPLab has recently released a paper and a codebase on Generative Pseudo Labeling. It seems like this could be a useful feature for Haystack.

It allows fine-tuning a dense retrieval model on unlabeled data by 1) generating synthetic questions using T5, 2) mining hard negatives using an existing dense retrieval pipeline, and 3) creating soft labels using a cross-encoder. Step 3) is important because it compensates for errors introduced in steps 1) and 2).

Step 1) could be implemented easily using the existing QuestionGenerator.
Step 2) could also be implemented easily using the existing DocumentSearchPipeline.
Step 3) is a bit more difficult: the existing SentenceTransformersRanker only outputs documents, so we could either add an argument for outputting the scores or simply load a transformers cross-encoder without using a specific Haystack class (see the sketch below).
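A rough sketch of how the three steps could fit together (only a sketch, assuming the existing QuestionGenerator and DocumentSearchPipeline APIs and a sentence-transformers cross-encoder; `retriever`, `document_store` and the model name are placeholders):

```python
from haystack.nodes import QuestionGenerator
from haystack.pipelines import DocumentSearchPipeline
from sentence_transformers import CrossEncoder

question_generator = QuestionGenerator()                               # step 1: generate synthetic queries with T5
search_pipeline = DocumentSearchPipeline(retriever)                    # step 2: mine hard negatives with the existing retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # step 3: score (query, passage) pairs for soft labels

training_examples = []
for doc in document_store.get_all_documents():
    for query in question_generator.generate(doc.content):
        # retrieve similar passages and keep one that is not the source passage as a hard negative
        results = search_pipeline.run(query=query, params={"Retriever": {"top_k": 10}})
        negatives = [d.content for d in results["documents"] if d.content != doc.content]
        if not negatives:
            continue
        # soft label: margin between the cross-encoder scores of (query, positive) and (query, negative)
        pos_score, neg_score = cross_encoder.predict([(query, doc.content), (query, negatives[0])])
        training_examples.append((query, doc.content, negatives[0], pos_score - neg_score))
```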

Additionally, we would need to create subclasses of Trainer and DataSilo, but these should be quite easy, as it would only be necessary to override __init__ and compute_loss/_dataset_from_chunk.
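Conceptually, the overridden compute_loss would implement the MarginMSE objective used by GPL: the bi-encoder's score margin between the positive and the hard negative is regressed onto the cross-encoder's margin. A plain PyTorch sketch of that computation (names and shapes are illustrative, not the actual Trainer interface):

```python
import torch.nn.functional as F

def margin_mse_loss(query_emb, pos_emb, neg_emb, ce_margin):
    # query_emb, pos_emb, neg_emb: [batch, dim] bi-encoder embeddings
    # ce_margin: [batch] cross-encoder score margins (the pseudo labels)
    pos_score = (query_emb * pos_emb).sum(dim=-1)   # dot-product similarity to the positive passage
    neg_score = (query_emb * neg_emb).sum(dim=-1)   # dot-product similarity to the hard negative
    return F.mse_loss(pos_score - neg_score, ce_margin)
```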

The training method should probably be a method of DensePassageRetriever, similar to the existing train method.

The dataset input for this method could either be a file path, like for the train method, or a DocumentStore.
An advantage of also using a file would be consistency, but using a DocumentStore would probably be easier for the user, as they would have the data in a DocumentStore in most use cases anyway.

The results from the different steps could also optionally be saved to a text file (e.g. a CSV file) for caching/debugging. They could also be stored as Label objects, but that would probably require too many modifications to the Label class.
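For example, the intermediate results could be cached as one CSV row per generated training example (the column layout here is just a suggestion):

```python
import csv

# training_examples: list of (query, positive_passage, negative_passage, margin) tuples
with open("gpl_training_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "positive", "negative", "margin"])
    writer.writerows(training_examples)
```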

@julian-risch
Member

@vblagoje Here is the issue on adding GPL to Haystack. I'll assign it to you now, although we know that you won't start working on it in the next few days. @bogdankostic can support you if you would like to discuss ideas or get feedback on a draft PR.

@vblagoje
Member

vblagoje commented Mar 28, 2022

Hi @julian-risch, @bogdankostic and @MichelBartels - I'll be implementing GPL this week. From the outset, I wanted to run four design questions/approaches by you.

  1. Document input format. As @MichelBartels suggested, it makes a lot of sense to have DocumentStore as input, but I would also add the BEIR format. So we'll support those two formats as input; if users require more input formats, we can add them later.

  2. GPL dependency-free implementation. I am leaning toward an implementation that doesn't require the GPL dependency. Yes, GPL has a nice one-liner for training, but it writes intermediate files to disk, so I'd instead implement GPL directly using only sentence_transformers, like Nils implemented it in this Colab (a rough sketch of that training step follows this list). We might reuse some internal components as @MichelBartels suggested; that way, we don't have to manage yet another dependency or be at the mercy of potential changes in GPL that we might not like.

  3. Is GPL domain adaptation a single atomic transaction? As you know, there are several distinct steps in GPL (@MichelBartels outlined them above). However, as I see it, two significant phases emerge.
    The first phase covers all three steps outlined above (question generation, negative mining and pseudo labelling); I would treat them as a single transaction.
    The second phase involves training/adapting the retriever on the training data generated in the first phase.
    My main reason for this delineation is that the first phase produces the training data for the second, which is an excellent natural border. It seems like overengineering to checkpoint and restart the individual sub-steps (question generation, negative mining and pseudo labelling) of the first phase. This way, one could restart the second phase with the training data saved from the first, fine-tune the training hyperparameters, etc. In addition, one could simply execute the two transactions in series and have GPL adaptation done in one shot.

  4. Are we aiming to GPL-optimize SBert bi-encoders and/or DPRs as well? Note that the GPL example adaptations are run on SBert bi-encoders, which are architecturally different from DensePassageRetrievers. Should we support just bi-encoders or both?
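For reference on point 2, the dependency-free training step essentially boils down to the standard sentence-transformers recipe with MarginMSELoss, roughly as in Nils' Colab (a minimal sketch; the bi-encoder model name, the `training_examples` tuples and the hyperparameters are only examples):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v3")  # example bi-encoder to adapt

# training_examples: (query, positive_passage, negative_passage, cross_encoder_margin) tuples from phase one
train_samples = [
    InputExample(texts=[query, pos, neg], label=float(margin))
    for query, pos, neg, margin in training_examples
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)
train_loss = losses.MarginMSELoss(model)  # regresses the bi-encoder margin onto the cross-encoder margin

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
    show_progress_bar=True,
)
model.save("gpl-adapted-retriever")
```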

@bogdankostic
Contributor

Hi @vblagoje, very much looking forward to your implementation! Regarding your questions:

  1. I like the idea of being able to provide data in BEIR format; this goes well with Integrate BEIR #2333, where we integrated BEIR to evaluate a given pipeline.
  2. I agree that it would be better to add GPL to Haystack without adding another dependency.
  3. I think that splitting GPL into Label Generation (including question generation, negative mining and pseudo labelling) and Training makes sense. This would allow using the generated training data on different retriever models without having to regenerate it.
  4. It would be great to support both SBert bi-encoders and DPRs. Adding GPL to sentence-transformers models (which are used in our EmbeddingRetriever) in Haystack might require a bit more work than adding it to DPR, as we currently don't have any training procedures for the models used in EmbeddingRetriever.

Happy to answer any further questions and help! :)

@vblagoje
Member

vblagoje commented Mar 30, 2022

Tagging #1239 as a soft requirement for this feature. Without batch querying, GPL will be very slow.

@SaraAmd

SaraAmd commented Mar 9, 2023


The code in the Colab notebook linked in point 2 above throws an error. Could you please fix it?

@mayankjobanputra
Contributor

@SaraAmd I think that notebook is very old. In case you want to use GPL, I suggest you look at Haystack's documentation here: https://docs.haystack.deepset.ai/docs/pseudo_label_generator

Feel free to open a discussion in case you have questions about it after reading the docs :)

@SaraAmd

SaraAmd commented Mar 9, 2023


Thank you for the response. I just need the section on negative mining, not the GPL part.
