Implementing Generative Pseudo Labeling (GPL) #1908
Comments
@vblagoje Here is the issue on adding GPL to Haystack. I'll assign it to you now, though we know that you won't start working on it in the next few days. @bogdankostic can support you if you would like to discuss ideas or get feedback on a draft PR. |
Hi @julian-risch, @bogdankostic and @MichelBartels - I'll be implementing GPL this week. From the outset, I wanted to run four design questions/approaches by you.
|
Hi @vblagoje, looking very much forward to your implementation! Regarding your questions:
Happy to answer any further questions and help! :) |
Tagging #1239 as a soft requirement for this feature. Without batch querying, GPL will be very slow. |
The code in that notebook throws an error. Could you please fix it? |
@SaraAmd I think that notebook is very old. In case you want to use GPL, I suggest you look at Haystack's documentation here: https://docs.haystack.deepset.ai/docs/pseudo_label_generator Feel free to open a discussion in case you have questions about it after reading the docs :) |
Thank you for the response. I just need the section for the negative mining, not the GPL part |
UKPLab has recently released a paper and the codebase on Generative Pseudo Labeling. It seems like this could be a useful feature for Haystack.
It allows fine-tuning a dense retrieval model on unlabeled data by 1) generating synthetic questions using T5, 2) mining hard negatives using an existing dense retrieval pipeline and 3) creating soft labels using a cross-encoder. Step 3) is important because it compensates for errors introduced in steps 1) and 2).
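To make the three steps concrete, here is a rough sketch of how the training examples could be assembled. It is not Haystack code: `build_gpl_examples`, `generate`, `retrieve` and `score` are hypothetical names, and the toy functions at the bottom only stand in for T5, the dense retriever and the cross-encoder. It assumes the soft label is the difference between the cross-encoder scores of the positive and the negative passage.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical placeholders, not Haystack or GPL APIs.
Example = Tuple[str, str, str, float]  # (question, positive, negative, margin)


def build_gpl_examples(
    passages: List[str],
    generate: Callable[[str], List[str]],       # passage -> synthetic questions
    retrieve: Callable[[str, int], List[str]],  # (question, top_k) -> passages
    score: Callable[[str, str], float],         # (question, passage) -> relevance score
    negatives_per_question: int = 1,
) -> List[Example]:
    examples: List[Example] = []
    for positive in passages:
        # Step 1: synthetic questions for the passage (T5 in the paper).
        for question in generate(positive):
            # Step 2: retrieved passages other than the source serve as hard negatives.
            retrieved = retrieve(question, negatives_per_question + 1)
            negatives = [p for p in retrieved if p != positive][:negatives_per_question]
            for negative in negatives:
                # Step 3: soft label, assumed here to be the score margin.
                margin = score(question, positive) - score(question, negative)
                examples.append((question, positive, negative, margin))
    return examples


# Toy stand-ins so the sketch runs without loading any model.
passages = ["haystack is a search framework", "t5 generates questions"]


def toy_generate(passage: str) -> List[str]:
    return ["what is " + passage.split()[0]]


def toy_retrieve(question: str, top_k: int) -> List[str]:
    return passages[:top_k]


def toy_score(question: str, passage: str) -> float:
    # Word overlap as a crude substitute for a cross-encoder score.
    return float(len(set(question.split()) & set(passage.split())))


examples = build_gpl_examples(passages, toy_generate, toy_retrieve, toy_score)
```

Passing the three components as callables keeps the sketch independent of any specific class, so existing components could be plugged in behind the same signatures.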
Step 1) could be implemented easily by using existing `QuestionGenerator`. Step 2) could also be implemented easily by using existing `DocumentSearchPipeline`. Step 3) is a bit more difficult as the existing `SentenceTransformersRanker` only outputs documents so we could either add an argument for outputting the scores or we could just load transformers cross-encoder without using a specific haystack class.
Additionally, we would need to create subclasses of `Trainer` and `DataSilo`, but these should be quite easy as it would just be necessary to override `__init__` and `compute_loss`/`_dataset_from_chunk`.
The training method should probably be a method of `DensePassageReader` similar to the existing `train` method. The dataset input for this method could either be a file path like the `train` method or a `DocumentStore`. An advantage of also using a file would be consistency, but using a `DocumentStore` would probably be easier for the user as they would have it in a `DocumentStore` in most use cases anyway.
The results from the different steps could also optionally be saved into a text file (e.g. csv file) for caching/debugging. They could also be stored as `Label` objects, but that would probably require too many modifications of the `Label` class.
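For the caching/debugging idea, a minimal sketch of what the CSV round trip could look like, using only the Python standard library. The column layout (`question`, `positive`, `negative`, `margin`) and the helper names `write_examples` and `read_examples` are assumptions made for illustration, not an existing Haystack format.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

# Hypothetical row layout for cached results; not an existing Haystack format.
Example = Tuple[str, str, str, float]  # (question, positive, negative, margin)
FIELDS = ["question", "positive", "negative", "margin"]


def write_examples(examples: Iterable[Example], handle: TextIO) -> None:
    """Write a header row followed by one row per training example."""
    writer = csv.writer(handle)
    writer.writerow(FIELDS)
    for question, positive, negative, margin in examples:
        writer.writerow([question, positive, negative, margin])


def read_examples(handle: TextIO) -> List[Example]:
    """Read the rows back, converting the margin column to float."""
    reader = csv.DictReader(handle)
    return [
        (row["question"], row["positive"], row["negative"], float(row["margin"]))
        for row in reader
    ]


# Round trip through an in-memory buffer; a real file opened with newline="" works the same way.
sample = [("what is haystack", "haystack, a search framework", "t5 generates questions", 2.0)]
buffer = io.StringIO(newline="")
write_examples(sample, buffer)
buffer.seek(0)
restored = read_examples(buffer)
```

The `csv` module quotes fields that contain commas or line breaks, so passages do not need to be escaped by hand.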