
Optimize fusedL2NN when data is skinny #794

Merged
merged 5 commits into rapidsai:branch-22.10 on Sep 6, 2022

Conversation

ahendriksen
Contributor

The fusedL2NN kernel uses tiling to maximize performance. The current implementation assumes that the input matrices are at least 32 elements wide. When this is not the case, it performs redundant computations.

This PR adds a tiling policy that applies when the matrix is skinny (fewer than 32 elements wide). This results in a 1.5-2x performance improvement across GPU architectures.
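The dispatch idea can be sketched as follows. This is a minimal stand-in: `Policy`, `choosePolicy`, and the tile sizes are illustrative, not raft's actual types or values.

```cpp
#include <cassert>

// Simplified stand-in for a kernel tiling policy: kblk is the tile
// width along the inner (k) dimension.
struct Policy { int kblk; };

constexpr Policy defaultPolicy{32};  // assumes inputs are at least 32 wide
constexpr Policy skinnyPolicy{8};    // smaller tile avoids redundant work

// Pick the skinny policy when the inner dimension is narrower than
// the default tile width.
inline Policy choosePolicy(int k) {
  return (k < 32) ? skinnyPolicy : defaultPolicy;
}
```

With a dispatch like this, wide inputs keep the existing fast path while narrow inputs stop paying for tiles that are mostly padding.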

@ahendriksen ahendriksen requested a review from a team as a code owner August 26, 2022 11:26
@github-actions github-actions bot added the cpp label Aug 26, 2022
@ahendriksen ahendriksen added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 26, 2022
@ahendriksen ahendriksen force-pushed the enh-faster-fusedl2nn branch 2 times, most recently from da8fa38 to 16a6fc9 Compare September 1, 2022 13:17
@ahendriksen
Contributor Author

rerun tests

There were very few test cases for skinny matrices; they have now been added.

In updateReducedVal, a single warp can contain multiple rows (in registers). A single thread within the warp uses the first element of each row to update an output array (atomically).

In the previous implementation, a shuffle was used to move the head of each row into the first thread of the warp. Unfortunately, this would overwrite the values of all other rows. The strategy happened to work when the number of rows per warp equalled 2, so the bug never triggered.

In a recent commit, the number of rows per warp was increased to four in certain situations (skinny matrices), which triggered the bug.

In the new implementation, the values are not shuffled into the first
thread of the warp any more. Instead, the threads that contain the first
element of a row update the output in sequential order. The sequential
ordering is necessary to avoid deadlock on Pascal architecture.
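A host-side model of that sequential update is sketched below. The names are hypothetical, and the plain comparison stands in for the atomic compare-and-swap loop the real kernel performs.

```cpp
#include <cassert>
#include <vector>

// Host model of one warp: 32 lanes, rowWidth lanes per row, so the
// warp covers 32 / rowWidth rows. Each lane holds a (value, index)
// pair left over from the row-wise reduce.
struct KVP { float val; int idx; };

// Lanes holding the first element of a row update the output one
// after another, in lane order; on the GPU this sequential ordering
// avoids deadlock on Pascal, and the update itself is atomic.
void updateHeads(const std::vector<KVP>& lane, std::vector<KVP>& out,
                 int rowWidth, int firstRow) {
  for (int l = 0; l < 32; ++l) {       // sequential over lanes
    if (l % rowWidth != 0) continue;   // only row heads update
    int row = firstRow + l / rowWidth;
    if (lane[l].val < out[row].val)    // stands in for the atomic min-update
      out[row] = lane[l];
  }
}
```

Because each row head writes only its own output row, no shuffle into lane 0 is needed, and the per-row values can never clobber one another.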
In the current implementation, it looks like values from different rows are mixed together in what should be a row-wise warp reduce; all tests pass, however.

Just in case, I have added a width parameter to the shuffle so that it only shuffles within a single row of the warp.
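The effect of the width parameter can be checked with a small host-side model of `__shfl_sync` (the function below is ours, but the lane arithmetic follows the documented CUDA semantics): with width w, lane i reads from lane (i / w) * w + (srcLane mod w), so values never cross a w-lane row boundary.

```cpp
#include <cassert>

// Host model of CUDA's __shfl_sync(mask, var, srcLane, width):
// the warp is split into groups of `width` lanes, and srcLane is
// taken modulo width inside each group.
inline int shflModel(const int vals[32], int lane, int srcLane, int width) {
  int base = (lane / width) * width;   // first lane of this lane's group
  return vals[base + (srcLane % width)];
}
```

With width equal to the row width, every lane receives the head of its own row; a full-width (32) shuffle from lane 0 would instead broadcast row 0's head into every row, which is exactly the cross-row mixing the width parameter guards against.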
@tfeher (Contributor) left a comment

Thanks Allard for the PR! It looks good overall, I have just a few smaller comments.

Review thread on cpp/bench/spatial/fused_l2_nn.cu (outdated, resolved):
There was a problem with defgroup syntax.
@tfeher (Contributor) left a comment

Thanks @ahendriksen for addressing the issues. LGTM.

@cjnolet
Member

cjnolet commented Sep 6, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8c639d9 into rapidsai:branch-22.10 Sep 6, 2022
@ahendriksen ahendriksen deleted the enh-faster-fusedl2nn branch March 17, 2023 09:27