FAISS Store: allow multiple write calls and fix potential memory leak in update_embeddings #422
Conversation
lalitpagaria
commented
Sep 22, 2020
- Allow multiple write calls to an existing FAISS index.
- Fix an issue where update_embeddings always creates a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory used by the old one, causing a memory leak.
Looking good. Thanks for adding this @lalitpagaria! Let me know when it's ready for review.
@tholor Technically the PR is ready, but there is one shortcoming; please read below.
Currently we follow this link to enable IP search on top of an L2 metric. That is why, in my PR #385, I abstracted this functionality out into a separate class. An alternative is to expose it directly. Please let me know what you think.
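For context, the L2-to-IP trick referenced here augments each stored vector with one extra dimension so that L2 nearest-neighbour search reproduces inner-product ranking. A numpy sketch of the standard transformation (variable names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.random((100, 8)).astype(np.float32)  # database vectors (made-up data)
q = rng.random(8).astype(np.float32)          # query vector

# phi must be at least the largest squared norm in the database.
norms_sq = (xb ** 2).sum(axis=1)
phi = norms_sq.max()

# Augment each database vector with sqrt(phi - ||x||^2); queries get a 0.
aug_xb = np.hstack([xb, np.sqrt(phi - norms_sq)[:, None]])
aug_q = np.append(q, np.float32(0.0))

# L2 distance on the augmented vectors is ||q||^2 + phi - 2<q, x>, and
# ||q||^2 + phi is constant, so minimizing L2 == maximizing inner product.
l2_best = np.argmin(((aug_xb - aug_q) ** 2).sum(axis=1))
ip_best = np.argmax(xb @ q)
assert l2_best == ip_best
```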
@tholor It seems I had a wrong assumption. I tested this scenario in a notebook and verified that multiple writes work perfectly. See this notebook: https://colab.research.google.com/drive/11e3zdFP6kg2xhJ8LvplVt08B8Nf2iEGn?usp=sharing So yes, my existing PR is ready for review and does not have the issue I raised in my previous comment. In case you are not able to open the notebook, try this:
Thanks. The code example is very helpful. I will review it in the next few days. I want to investigate the transformation via phi a bit deeper, as a bug here would be very hard to trace later and could impact performance a lot.
@tholor I hope you find time to review this PR. I just want to add one more point: this PR also includes a fix for a potential memory leak in update_embeddings.
I had a look at your test script. Very helpful scenarios you have in there! However, I found one major issue: When changing the init to:
The output is
So it's only working in the scenarios with a single write to the L2 index.
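The reason multiple writes break the L2 index is that phi is a global constant: vectors stored during the first write are augmented with a phi computed before the second batch arrives. A numpy sketch of the failure mode (made-up data; the second batch deliberately has larger norms):

```python
import numpy as np

rng = np.random.default_rng(1)
xb1 = rng.random((10, 8)).astype(np.float32)        # first write
xb2 = 3.0 * rng.random((10, 8)).astype(np.float32)  # second write, larger norms

# phi chosen from the first batch only...
phi1 = (xb1 ** 2).sum(axis=1).max()

# ...is too small for the second batch: phi1 - ||x||^2 goes negative, so
# the augmentation sqrt(phi1 - ||x||^2) is undefined for those vectors.
deficit = phi1 - (xb2 ** 2).sum(axis=1)
assert (deficit < 0).any()
```

Even if phi were recomputed on the second write, the vectors already stored in the index were augmented with the old value and would have to be re-indexed, which is why the trick only holds for a single write.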
@tholor Not sure why a test failed that is unrelated to this PR. No tests fail locally on my system. I just rebased onto the latest master. Is it possible to re-run the build?
> Fix an issue where update_embeddings always creates a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory used by the old one, causing a memory leak.
@tholor Not sure if this is a valid scenario? Pardon my limited knowledge about embeddings. I assume a model (the same model) will produce the same embedding for a text if no parameters change, even when called at different times. I am not sure whether the model will update its seed. Here I have a question relevant to the use case I am trying to solve: continuous fine-tuning.
One model will always produce the same embedding for one text. There is no randomness (and therefore no seed) in these models at inference time. Not sure why you are asking this, but if it's related to the seed in the above test scenarios, this is something different: if we don't vary the seed there we basically simulate indexing the same batch of documents twice. However, we usually have different docs in the two batches (xb1 and xb2) and therefore different embeddings.
Yes, every time we have a new model (e.g. after fine-tuning), we'll need to update all embeddings.
Yes, you can now re-run GitHub Actions. There is a button at the top right to do this when you are on the page for the failed test.
Thanks for the explanation. I created this notebook to test the end-to-end pipeline and verify the phi impact; see if this helps.
After some further investigation, I think we can actually get rid of the phi normalization trick. Therefore, I think we can merge this PR now and switch from phi normalization to the native faiss implementation in a separate PR.