Add random subsampling for IVF methods #2077
Conversation
The build time slightly increases when random subsampling is enabled. Here are measurements with IVF-Flat on H100 with subsets of the DEEP dataset:
[build-time table not preserved]
There is also a slight variation in recall. DEEP-100M, IVF-Flat:
[recall table not preserved]
IVF-PQ for the same dataset:
[recall table not preserved]
Nice change! Is there a reason why subsample is added to the raft::spatial::knn::detail::utils namespace? It is used in raft::neighbors, so it would make sense to add it under raft::neighbors::detail::utils.
To establish a baseline for the expected variation, I have run tests with the original RAFT code (fixed-stride subsampling), but with input data where the vectors are shuffled. Index building and search were run with 10 different permutations of the input, and statistics on the recall variation are presented in the tables below; recall values are in percent (%). The dataset is DEEP-10M and the subsample ratio is 10. When the number of clusters is larger (while keeping the number of probes fixed), we see a larger variation in the results.
[recall-variation tables not preserved: IVF-Flat 1k clusters; IVF-Flat 10k clusters; IVF-PQ 1k clusters, dim_pq=64, pq_bits=5; IVF-PQ 1k clusters, dim_pq=64, pq_bits=8; IVF-PQ 10k clusters, dim_pq=96, pq_bits=8]
I have investigated random subsampling versus shuffling the input data while keeping fixed-stride subsampling. In both cases the recall value fluctuates with a std < 0.14. Comparing the mean recall of the two methods over 10 iterations, the difference between the average recall values is between 0.01% and 0.03%, so I would conclude that there is no significant change in recall. Example results: 10 build/search iterations were run with different random seeds. The table shows recall values and their variation for (orig) the dataset file shuffled and (this PR) random subsampling; the last column is the difference between the mean recall values. All recall values are in percent. The table shows results for DEEP-10M; results for bigANN-10M are similar.
[recall table not preserved]
Thanks @lowener for the review!
> why subsample is added to raft::spatial::knn::detail::utils namespace?

Currently the helper function is still in the old namespace. I have started a separate branch to move the ann_utils.cuh file to the neighbors namespace, and I will submit a separate PR for 24.04.
Currently the overhead of the random subsampling is larger than ideal; I am running tests to quantify it.
Overhead reduced. I switched to a batchwise copy of the data: this reduces the time spent allocating temporary buffers and makes it possible to overlap gathering the vectors with the host-to-device copies (sketched below). The relative overhead is larger for IVF-Flat than for IVF-PQ, because IVF-PQ index building has more work to do, so preparing the training set takes a smaller fraction of the build time. When the training set is a smaller fraction of the dataset, the overhead becomes smaller still. The data is gathered into a contiguous buffer before copying to the device, which can be slightly faster than the strided copy using cudaMemcpy2D that we had earlier: in the best case the H2D copies are slightly accelerated and completely overlap with the data gathering, which can lead to a ~1% build time speedup.
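To make the batchwise overlap concrete, here is a minimal sketch of the scheme described above: rows are gathered into one pinned staging buffer while the previous batch is transferred on a CUDA stream. The function names, the batch size, and the double-buffering details are illustrative assumptions, not the exact RAFT implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <cuda_runtime.h>

// Gather one batch of selected rows into a contiguous staging buffer.
// (In the real implementation this gather itself can be parallelized;
// see the OpenMP sketch in the PR description below.)
static void gather_batch(const float* dataset, const int64_t* rows,
                         int64_t n, int64_t dim, float* out) {
  for (int64_t i = 0; i < n; ++i) {
    std::memcpy(out + i * dim, dataset + rows[i] * dim, dim * sizeof(float));
  }
}

// Copy the selected training rows to the device in batches, overlapping the
// CPU gather of batch i+1 with the H2D transfer of batch i (double buffering).
void copy_trainset_batched(const float* dataset, const int64_t* rows,
                           int64_t n_train, int64_t dim,
                           float* d_trainset, cudaStream_t stream) {
  const int64_t batch_rows = 1 << 16;  // rows per batch, a tuning parameter
  float* staging[2];
  cudaEvent_t copied[2];
  for (int b = 0; b < 2; ++b) {
    cudaMallocHost(&staging[b], batch_rows * dim * sizeof(float));  // pinned
    cudaEventCreate(&copied[b]);
    cudaEventRecord(copied[b], stream);  // both buffers start out free
  }
  int b = 0;
  for (int64_t off = 0; off < n_train; off += batch_rows, b ^= 1) {
    int64_t n = std::min(batch_rows, n_train - off);
    cudaEventSynchronize(copied[b]);  // wait until buffer b may be reused
    gather_batch(dataset, rows + off, n, dim, staging[b]);
    cudaMemcpyAsync(d_trainset + off * dim, staging[b],
                    n * dim * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaEventRecord(copied[b], stream);  // buffer b is free once this fires
  }
  cudaStreamSynchronize(stream);
  for (int bb = 0; bb < 2; ++bb) {
    cudaFreeHost(staging[bb]);
    cudaEventDestroy(copied[bb]);
  }
}
```

Because cudaMemcpyAsync from pinned memory returns immediately, the CPU starts gathering the next batch while the GPU pulls the previous one, which is what allows the gather cost to hide behind the transfer.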
Thanks for the PR @tfeher! Especially for the comprehensive analysis of the perf impact and for reducing the size of the ivf_pq_build.cuh file :)
A couple of small things regarding the use of new raft primitives below.
Thanks @achirkin for the review, I have addressed the issues.
Thanks for the updates, LGTM!
/merge
This reverts commit 9c35f73.
Random sampling of the training set for IVF methods was reverted in rapidsai/raft#2144 due to the large memory usage of the subsample method. Since then, PR rapidsai/raft#2155 has implemented a new random sampling method with improved memory utilization. Using that, we can now enable random sampling for the IVF methods (rapidsai/raft#2052 and rapidsai/raft#2077). Random subsampling has measurable overhead for IVF-Flat, therefore it is only enabled for IVF-PQ.
Authors:
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)
Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
URL: #122
While building IVF-Flat or IVF-PQ indices, we usually subsample the dataset to create a smaller training set for k-means clustering. Until now this subsampling was done with a fixed stride; this PR changes it to random subsampling (see the sketch below).
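For illustration, a minimal host-side sketch of the two selection strategies. The helper names are hypothetical, and the actual implementation differs: in particular, a full shuffle of all row indices, as below, would be too memory-hungry for large datasets, which is why the memory-efficient sampling method from rapidsai/raft#2155 mentioned above is used instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Old behavior: take every k-th row of the dataset, k = n_rows / n_train.
std::vector<int64_t> stride_indices(int64_t n_rows, int64_t n_train) {
  std::vector<int64_t> idx(n_train);
  int64_t stride = n_rows / n_train;
  for (int64_t i = 0; i < n_train; ++i) idx[i] = i * stride;
  return idx;
}

// New behavior: draw n_train distinct rows uniformly at random.
std::vector<int64_t> random_indices(int64_t n_rows, int64_t n_train,
                                    uint64_t seed) {
  std::vector<int64_t> idx(n_rows);
  std::iota(idx.begin(), idx.end(), int64_t{0});
  std::mt19937_64 rng(seed);
  std::shuffle(idx.begin(), idx.end(), rng);
  idx.resize(n_train);  // keep the first n_train shuffled indices
  return idx;
}
```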
The input is always randomized, even if all the vectors of the dataset are used.
Random sampling adds an overhead that is proportional to the training set size. If the dataset is in host memory, this overhead can be partially or completely masked by the H2D transfer, and it is small compared to the k-means training itself.
To completely overlap the random sampling of the data with the H2D copies, we use OpenMP parallelization to increase the effective bandwidth of gathering the data, as sketched below.
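A minimal sketch of such a parallel gather, assuming a row-major float dataset in host memory; the function name and signature are illustrative, not the exact RAFT code. Each thread copies a disjoint subset of the selected rows into the contiguous staging buffer, raising the effective host memory bandwidth so the gather can keep up with the transfer.

```cpp
#include <cstdint>
#include <cstring>

// Gather the selected rows into a contiguous output buffer.
// Compile with -fopenmp (or equivalent) to enable the parallel loop.
void gather_rows_omp(const float* dataset,    // [n_rows, dim], host memory
                     const int64_t* row_ids,  // [n_out] rows to gather
                     int64_t n_out, int64_t dim,
                     float* out) {            // [n_out, dim], contiguous
#pragma omp parallel for
  for (int64_t i = 0; i < n_out; ++i) {
    std::memcpy(out + i * dim, dataset + row_ids[i] * dim,
                dim * sizeof(float));
  }
}
```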