Subsampling for IVF-PQ codebook generation #2052
Conversation
/ok to test
Thanks Rui for the PR! @achirkin could you have a look at the proposed subsampling step?
Force-pushed from ec0b2f6 to 00d1ece.
Tested the performance of this PR on top of the first-level subsampling PR (#2077) on the Deep-100M dataset with different build parameters. All tests were run on an A100-80GB-PCIe GPU.
Here is the table for build performance. With codebook subsampling we see about a 30%-50% speedup, depending on how much subsampling the user chooses. Here, the 30%-50% speedup is achieved by using 10%-20% of the input (after the initial subsampling) for codebook training.
The search performance is shown in the tables below. The maximum recall difference compared to no codebook subsampling is about 0.38%, and it is in fact a slight recall increase with PQ codebook subsampling, which suggests it is run-to-run variation. I am going to rerun the tests to quantify the run-to-run variation (and will update this PR afterwards). All search results below are without refinement.
Thanks @abc99lr for the measurements! The additional subsampling for PQ codebooks gives a nice improvement in IVF-PQ build time, and I am excited about this change! In many cases we see less than 0.05% difference in recall, which looks perfect. But there are also cases with more than 0.1%, where we would like to understand whether the difference is due to run-to-run variation. I am running additional tests with PR #2077 and we will compare the diffs to that.
Updates on run-to-run variance. I reran the code (both build and search) three times, and found that even without this PR the run-to-run variance is 0.37%. Please see the following tables for the recall difference relative to the first run. The tests below use the Deep-100M dataset on an A100-80GB-PCIe GPU. 2nd run vs 1st run:
3rd run vs 1st run:
I think the 0.38% difference we saw with this PR is acceptable if we see similar run-to-run variance with #2077. The results also show that the run-to-run variance is higher when
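To make the comparison above concrete, here is a minimal sketch of how a run-to-run recall difference can be computed. The function name, array shapes, and the toy data are illustrative assumptions, not the actual benchmark harness used for the measurements.

```python
import numpy as np

def recall_at_k(results: np.ndarray, ground_truth: np.ndarray) -> float:
    """Fraction of true k-NN ids recovered, averaged over all queries.

    Both inputs are (n_queries, k) arrays of neighbor ids (hypothetical
    layout, assumed for this sketch)."""
    hits = sum(len(set(r) & set(g)) for r, g in zip(results, ground_truth))
    return hits / ground_truth.size

# Fake ground truth: 10 queries, k=10, unique ids.
rng = np.random.default_rng(0)
gt = rng.permutation(1000)[:100].reshape(10, 10)
run1 = gt.copy()            # a "perfect" first run
run2 = gt.copy()
run2[0, 0] = 2000           # one missed neighbor in the second run

r1, r2 = recall_at_k(run1, gt), recall_at_k(run2, gt)
print(f"run-to-run recall diff: {abs(r1 - r2) * 100:.2f}%")  # 1.00%
```

A 0.37% variance on Deep-100M corresponds to this kind of per-query neighbor churn between otherwise identical runs.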
Thanks Rui for the update. The results look great. I have seen similar recall variations, and I think that looks good as well. Just a few small things.
Thanks Rui for the update! The PR looks good to me!
/ok to test
/ok to test
/ok to test
/merge
Hi @achirkin, I think the change you requested has been made. Could you please approve this PR?
Dismissing to get this in before code freeze. Rui has addressed the request.
/ok to test
/merge
Sorry for being late, but yes, LGTM! :)
This reverts commit e272176.
Random sampling of the training set for IVF methods was reverted in rapidsai/raft#2144 due to the large memory usage of the subsample method. Since then, PR rapidsai/raft#2155 has implemented a new random sampling method with improved memory utilization. Using that, we can now re-enable random sampling for IVF methods (rapidsai/raft#2052 and rapidsai/raft#2077). Random subsampling has measurable overhead for IVF-Flat, therefore it is only enabled for IVF-PQ.

Authors:
- Tamas Bela Feher (https://github.com/tfeher)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #122
This PR addresses #1901 by subsampling the input dataset for PQ codebook training to reduce the runtime.
Currently, a similar strategy is applied to the `per_cluster` method, but not to the default `per_subset` method. This PR fixes that gap. Similar to the subsampling mechanism of the `per_cluster` method, we pick at minimum `256 * max(pq_book_size, pq_dim)` input rows for training each codebook (see `raft/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh`, line 408 at cf4e03d).
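The row-count rule above can be sketched as follows. This is an illustrative NumPy version, not the actual RAFT CUDA implementation; the function name and the argument values are assumptions for the example.

```python
import numpy as np

def subsample_for_codebook(train_set: np.ndarray,
                           pq_book_size: int,
                           pq_dim: int,
                           seed: int = 0) -> np.ndarray:
    """Pick a random subset of rows for PQ codebook training.

    Keeps at least 256 * max(pq_book_size, pq_dim) rows, per the rule
    described in the PR; returns the whole set if it is already smaller."""
    min_rows = 256 * max(pq_book_size, pq_dim)
    n_rows = train_set.shape[0]
    if n_rows <= min_rows:
        return train_set                      # nothing to subsample
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_rows, size=min_rows, replace=False)
    return train_set[idx]

# Example: 100k training rows, pq_book_size=256 (i.e. pq_bits=8), pq_dim=32.
data = np.random.default_rng(1).standard_normal((100_000, 32), dtype=np.float32)
sub = subsample_for_codebook(data, pq_book_size=256, pq_dim=32)
print(sub.shape)  # (65536, 32), since 256 * max(256, 32) = 65536
```

With `pq_book_size = 256` the floor is 65,536 rows per codebook, which is why training cost stops growing with the dataset size once the floor is reached.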
The following performance numbers were generated using the Deep-100M dataset. After subsampling, the search time and accuracy are not impacted (within ±5%), except in one case where I saw a 9% performance drop on search (using a 10K query batch). More extensive benchmarking across datasets seems needed for justification.
Note that after subsampling, PQ codebook generation is no longer a bottleneck in IVF-PQ index building, so further optimizations of codebook generation seem unnecessary. Although we could in theory apply the custom kernel approach (#2050) with subsampling, my early tests show the current GEMM approach performs better than the custom kernel after subsampling.
Using multiple streams could improve performance further by overlapping the kernels for different `pq_dim` values, since the kernels are small after subsampling and may not fully utilize the GPU. However, as mentioned above, since the entire PQ codebook generation is already fast, this optimization may not be worthwhile.

TODO