
[WIP] Custom fusedL2NN kernel for kmeans prediction #2050

Closed

Conversation


@abc99lr abc99lr commented Dec 7, 2023

This PR adds a custom fusedL2NN kernel to speed up the PQ codebook generation step of IVF-PQ index building. It partially addresses #1901 (we are also experimenting with other ideas, such as subsampling, to optimize the PQ codebook generation step).

The kernel critical to PQ codebook generation performance is kmeans predict, which finds the cluster nearest (as defined by the distance function) to each data point. Depending on the distance type, this can be viewed as a variant of GEMM between the dataset and cluster_centers. This PR focuses on L2 distance. The current implementation uses a CUTLASS kernel for the L2 distance calculation, with a fused reduction that performs an argmin over the distances and outputs, for each input data point, the label of the closest cluster.
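For intuition, here is a minimal sketch of the fused pattern (not the actual RAFT/CUTLASS implementation): one thread per dataset row computes the squared L2 distance to every cluster center and keeps only the argmin, so the full distance matrix is never materialized.

```cpp
#include <cuda/std/limits>

// Illustrative fused L2 + argmin kernel (a naive sketch, not this PR's kernel):
// each thread owns one dataset row, loops over all cluster centers, and writes
// only the label of the nearest center.
template <typename DataT, typename IdxT, typename LabelT>
__global__ void fused_l2_argmin_naive(const DataT* dataset,  // [n_rows, n_dim], row-major
                                      const DataT* centers,  // [n_clusters, n_dim], row-major
                                      LabelT* labels,        // [n_rows]
                                      IdxT n_rows,
                                      IdxT n_clusters,
                                      IdxT n_dim)
{
  IdxT row = static_cast<IdxT>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (row >= n_rows) { return; }

  DataT best_dist   = cuda::std::numeric_limits<DataT>::max();
  LabelT best_label = 0;

  // When n_clusters * n_dim is tiny, the centers stay resident in cache
  // while each thread streams through its own dataset row.
  for (IdxT c = 0; c < n_clusters; ++c) {
    DataT dist = 0;
    for (IdxT d = 0; d < n_dim; ++d) {
      DataT diff = dataset[row * n_dim + d] - centers[c * n_dim + d];
      dist += diff * diff;
    }
    if (dist < best_dist) {
      best_dist  = dist;
      best_label = static_cast<LabelT>(c);
    }
  }
  labels[row] = best_label;
}
```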

In the codebook generation step, the cluster_centers input matrix is small (shape [n_cluster, n_dim]): n_dim depends on pq_len, which is usually 1-24, and n_cluster depends on 2^pq_bits, which ranges from 16 to 256 (pq_bits in [4, 8]). For example, pq_bits = 8 yields n_cluster = 256, so with pq_len = 2 the cluster_centers matrix is only [256, 2]. The dataset, in contrast, is a very tall matrix (shape [n_row, n_dim]), where n_row is the number of rows in the training set after subsampling, usually on the order of millions.

This PR focuses on optimizing the GEMM with thin and small inputs. We found that a custom non-GEMM kernel achieves up to a 33x speedup over the original fused GEMM implementation for this problem when n_rows is 10M. For larger n_clusters and n_dim, the fused GEMM is still the best performer, but most of the problem sizes in the codebook generation step are not GEMM-friendly. The following benchmark results were produced on an A100-80GB-PCIe machine.

[Figure: benchmark of the custom kernel vs. the fused GEMM implementation across problem sizes on an A100-80GB-PCIe]

To avoid a performance hit, this PR only uses the custom kernel when n_clusters * n_dim is less than or equal to 256. Those cases are circled in green in the figure above. Please let me know if you have ideas for a better heuristic.
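For clarity, the heuristic amounts to a single comparison; a minimal sketch with a hypothetical helper name, not the PR's exact code:

```cpp
#include <cstdint>

// Sketch of the dispatch heuristic: prefer the custom non-GEMM kernel only
// when the cluster_centers matrix is tiny; otherwise keep the CUTLASS-based
// fused GEMM path. The 256 cutoff is the empirical threshold from the figure.
inline bool use_custom_kernel(int64_t n_clusters, int64_t n_dim)
{
  return n_clusters * n_dim <= 256;
}
```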

Here are E2E results on the Deep-100M dataset; more benchmark numbers are to be posted. Search performance is not impacted by this PR: most recall/runtime numbers are within a 5% difference, and all are within 10%. That variance appears to be run-to-run noise.

Dataset | n_iter | n_list | pq_bits | pq_dim | ratio | Original time (s) | Optimized time (s) | Speedup
-- | -- | -- | -- | -- | -- | -- | -- | --
Deep-100M | 25 | 50000 | 4 | 96 | 10 | 129 | 105 | 1.23
Deep-100M | 25 | 50000 | 5 | 96 | 10 | 128 | 103 | 1.24
Deep-100M | 25 | 50000 | 6 | 96 | 10 | 131 | 107 | 1.22
Deep-100M | 25 | 50000 | 7 | 96 | 10 | 129 | 107 | 1.21
Deep-100M | 25 | 50000 | 8 | 96 | 10 | 149 | 117 | 1.27

TODO

  • Unit tests
  • E2E benchmark numbers (deep-100M, sift-1M)

@abc99lr abc99lr requested a review from a team as a code owner December 7, 2023 22:57

@github-actions github-actions bot added the cpp label Dec 7, 2023
@abc99lr abc99lr changed the title [WIP] Custom fusedL2NN kernel for optimizing PQ codebook generation [WIP] Custom fusedL2NN kernel for PQ codebook optimization Dec 8, 2023

abc99lr commented Dec 8, 2023

Tagging @tfeher @achirkin @cjnolet for some early review, before running more benchmarks and writing test cases (although I am pretty sure the accuracy should be fine). Thanks
I'd also like to get some thoughts on

  • which particular cases we should benchmark
  • the heuristic for selecting when to use the custom kernel. Right now it depends on n_cluster * n_dim: if the value is less than or equal to 256, the custom kernel is used. This is not optimal, since it cannot pick up all the cases where the custom kernel could achieve a speedup


abc99lr commented Dec 8, 2023

I also created another PR, #2052, for subsampling support.
With subsampling, this PR no longer seems necessary, but we still need to justify that subsampling won't impact index quality. Any ideas on which benchmark numbers need to be collected?


@achirkin achirkin left a comment


Thanks @abc99lr for the PR and the analysis! LGTM apart from a small nitpick with the sqrt flag.
Two suggestions for bonus points:

  1. Could you please run the prims-bench and maybe add the relevant test cases there? `{100000, 128}, {1000000, 128}, {10000000, 128}`
  2. Would you consider changing the name of the PR to better reflect that it actually makes changes to the k-means rather than to ivf-pq?

```cpp
 * can sometimes have round-off errors, which will cause (aNorm == bNorm) ~ accVal instead.
 */
curr_distance =
  curr_distance * !((curr_distance * curr_distance <
                     raft::distance::detail::ops::get_clamp_precision<DataT>()) *
                    (dataset_norm[curr_row] == centers_norm[curr_n]));
if (sqrt) {
```

Change it to constexpr to make it more clear for both the reader and the compiler

Suggested change:

```diff
-if (sqrt) {
+if constexpr (Sqrt) {
```

```cpp
@@ -380,6 +382,91 @@ void fusedL2NNImpl(OutT* min,
  }
}

template <bool sqrt, typename DataT, typename IdxT, typename LabelT>
```

nitpick: template parameters should start with a capital

Suggested change:

```diff
-template <bool sqrt, typename DataT, typename IdxT, typename LabelT>
+template <bool Sqrt, typename DataT, typename IdxT, typename LabelT>
```

@abc99lr abc99lr changed the title [WIP] Custom fusedL2NN kernel for PQ codebook optimization [WIP] Custom fusedL2NN kernel for kmeans prediction Jan 9, 2024
@abc99lr
Copy link
Contributor Author

abc99lr commented Jan 13, 2024

Thanks for the comments @achirkin. However, after discussing with @tfeher, we believe #2052 is a better optimization idea and it cannot work together with this PR. Given that, I think it makes sense to close this PR for now. Will reopen if needed.

@abc99lr abc99lr closed this Jan 13, 2024
rapids-bot bot pushed a commit that referenced this pull request Jan 25, 2024
This PR addresses #1901 by subsampling the input dataset for PQ codebook training to reduce the runtime.

Currently, a similar strategy is applied to the `per_cluster` method, but not to the default `per_subset` method. This PR closes that gap. Similar to the subsampling mechanism of the `per_cluster` method, we pick at minimum `256*max(pq_book_size, pq_dim)` input rows for training each codebook.

https://github.com/rapidsai/raft/blob/cf4e03d0b952c1baac73f695f94d6482d8c391d8/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L408
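For illustration, the row-count rule above amounts to something like the following sketch, with hypothetical names rather than the exact code at the link:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the subsampling bound described above: train each codebook on at
// least 256 * max(pq_book_size, pq_dim) rows, capped by the rows available.
inline int64_t codebook_train_rows(int64_t n_rows, int64_t pq_book_size, int64_t pq_dim)
{
  return std::min(n_rows, 256 * std::max(pq_book_size, pq_dim));
}
```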

The following performance numbers were generated on the Deep-100M dataset. After subsampling, search time and accuracy are not impacted (within ±5%), except for one case where I saw a 9% drop in search performance (using a 10K batch for search). More extensive benchmarking across datasets seems necessary for justification.

Dataset | n_iter | n_list | pq_bits | pq_dim | ratio | Original time (s) | Subsampling (s) | Speedup [subsampling]
-- | -- | -- | -- | -- | -- | -- | -- | --
Deep-100M | 25 | 50000 | 4 | 96 | 10 | 129 | 89.5 | 1.44
Deep-100M | 25 | 50000 | 5 | 96 | 10 | 128 | 89.4 | 1.43
Deep-100M | 25 | 50000 | 6 | 96 | 10 | 131 | 90 | 1.46
Deep-100M | 25 | 50000 | 7 | 96 | 10 | 129 | 91.1 | 1.42
Deep-100M | 25 | 50000 | 8 | 96 | 10 | 149 | 93.4 | 1.60

Note that after subsampling, PQ codebook generation is no longer a bottleneck in IVF-PQ index building, so further optimization of the codebook generation seems unnecessary. Although we could in theory apply the custom kernel approach (#2050) together with subsampling, my early tests show the current GEMM approach performs better than the custom kernel once subsampling is in place.

Using multiple streams could further improve performance by overlapping the kernels for different `pq_dim` values, given that the kernels are small after subsampling and may not fully utilize the GPU. However, as mentioned above, since the entire PQ codebook step is already fast, this optimization may not be worthwhile.
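For reference, the overlap idea would look roughly like this; `train_codebook` is a hypothetical asynchronous launcher, not a RAFT API:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Sketch: give each per-subspace codebook-training sub-problem its own stream
// so the small post-subsampling kernels can overlap on the GPU.
void train_all_codebooks(int pq_dim)
{
  std::vector<cudaStream_t> streams(pq_dim);
  for (auto& s : streams) { cudaStreamCreate(&s); }

  for (int i = 0; i < pq_dim; ++i) {
    // Enqueue the i-th sub-problem asynchronously on its own stream, e.g.:
    // train_codebook(/* inputs for subspace i */, streams[i]);
  }

  for (auto& s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
}
```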

TODO 

- [x] Benchmark the performance/accuracy impacts on multiple datasets

Authors:
  - Rui Lan (https://github.com/abc99lr)
  - Ray Douglass (https://github.com/raydouglass)
  - gpuCI (https://github.com/GPUtester)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2052