-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re enable IVF random sampling #2225
Re enable IVF random sampling #2225
Conversation
This reverts commit 7edd372. Subsampling is implemented in a separate PR, therefore the reletad code was removed while reverting.
Note this will only compile once #2155 is merged. |
…vf_random_sampling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Love to see the IVF build code shrinking!
* PER_CLUSTER. In both cases, we will use `pq_book_size * max_train_points_per_pq_code` training | ||
* points to train each codebook. | ||
*/ | ||
uint32_t max_train_points_per_pq_code = 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 256 here? Have we tested this empirically across many datasets ti verify this is a good default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value is inspired by FAISS, which also has 256 as default. We tested on DEEP-100M here #2052 (comment). I will share results on other datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closing this in favor of rapidsai/cuvs#122
* PER_CLUSTER. In both cases, we will use `pq_book_size * max_train_points_per_pq_code` training | ||
* points to train each codebook. | ||
*/ | ||
uint32_t max_train_points_per_pq_code = 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value is inspired by FAISS, which also has 256 as default. We tested on DEEP-100M here #2052 (comment). I will share results on other datasets.
Random sampling of training set for IVF methods was reverted in #2144 due to the large memory usage of the subsample method.
PR #2155 implements a new random sampling method. Using that we can now enable random sampling of IVF methods (#2052 and #2077), therefore this PR reverts #2144, and adjust the code to utilize the new sampling method.