
Optimize fusedL2NN when data is skinny #794

Merged
merged 5 commits into rapidsai:branch-22.10 on Sep 6, 2022

Conversation

ahendriksen
Contributor

The fusedL2NN kernel uses tiling to maximize performance. The current implementation assumes that the input matrices are at least 32 elements wide. When this is not the case, it performs redundant computations.

This PR adds a tiling policy that applies when the matrix is skinny (fewer than 32 elements wide). This results in a 1.5-2x performance improvement across GPU architectures.
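The dispatch idea can be sketched as follows. This is a minimal stand-in: `Policy`, `choosePolicy`, and the tile sizes are illustrative, not raft's actual types or values.

```cpp
#include <cassert>

// Simplified stand-in for a kernel tiling policy: kblk is the tile
// width along the inner (k) dimension.
struct Policy { int kblk; };

constexpr Policy defaultPolicy{32};  // assumes inputs are at least 32 wide
constexpr Policy skinnyPolicy{8};    // smaller tile avoids redundant work

// Pick the skinny policy when the inner dimension is narrower than
// the default tile width.
inline Policy choosePolicy(int k) {
  return (k < 32) ? skinnyPolicy : defaultPolicy;
}
```

With a dispatch like this, wide inputs keep the existing fast path while narrow inputs stop paying for tiles that are mostly padding.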

@ahendriksen ahendriksen requested a review from a team as a code owner August 26, 2022 11:26
@github-actions github-actions bot added the cpp label Aug 26, 2022
@ahendriksen ahendriksen added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 26, 2022
@ahendriksen ahendriksen force-pushed the enh-faster-fusedl2nn branch 2 times, most recently from da8fa38 to 16a6fc9 Compare September 1, 2022 13:17
@ahendriksen
Contributor Author

rerun tests

There were very few test cases for skinny matrices; they have now been added.

In updateReducedVal, a single warp can contain multiple rows (in registers). A single thread within the warp uses the first element of each row to update an output array (atomically).

In the previous implementation, a shuffle was used to move the head of each row into the first thread of the warp. Unfortunately, this would overwrite the values of all other rows. The strategy happened to work when the number of rows per warp equalled 2, so the bug never triggered.

In a recent commit, the number of rows per warp was increased to four in certain situations (skinny matrices), which triggered the bug.

In the new implementation, the values are not shuffled into the first
thread of the warp any more. Instead, the threads that contain the first
element of a row update the output in sequential order. The sequential
ordering is necessary to avoid deadlock on Pascal architecture.
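A host-side model of that sequential update is sketched below. The names are hypothetical, and the plain comparison stands in for the atomic compare-and-swap loop the real kernel performs.

```cpp
#include <cassert>
#include <vector>

// Host model of one warp: 32 lanes, rowWidth lanes per row, so the
// warp covers 32 / rowWidth rows. Each lane holds a (value, index)
// pair left over from the row-wise reduce.
struct KVP { float val; int idx; };

// Lanes holding the first element of a row update the output one
// after another, in lane order; on the GPU this sequential ordering
// avoids deadlock on Pascal, and the update itself is atomic.
void updateHeads(const std::vector<KVP>& lane, std::vector<KVP>& out,
                 int rowWidth, int firstRow) {
  for (int l = 0; l < 32; ++l) {       // sequential over lanes
    if (l % rowWidth != 0) continue;   // only row heads update
    int row = firstRow + l / rowWidth;
    if (lane[l].val < out[row].val)    // stands in for the atomic min-update
      out[row] = lane[l];
  }
}
```

Because each row head writes only its own output row, no shuffle into lane 0 is needed, and the per-row values can never clobber one another.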
In the current implementation, it looks like values from different rows are mixed together in what should be a row-wise warp reduce; all tests pass, however.

Just in case, I have added a width parameter to the shuffle so that it only shuffles within a single row of the warp.
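The effect of the width parameter can be checked with a small host-side model of `__shfl_sync` (the function below is ours, but the lane arithmetic follows the documented CUDA semantics): with width w, lane i reads from lane (i / w) * w + (srcLane mod w), so values never cross a w-lane row boundary.

```cpp
#include <cassert>

// Host model of CUDA's __shfl_sync(mask, var, srcLane, width):
// the warp is split into groups of `width` lanes, and srcLane is
// taken modulo width inside each group.
inline int shflModel(const int vals[32], int lane, int srcLane, int width) {
  int base = (lane / width) * width;   // first lane of this lane's group
  return vals[base + (srcLane % width)];
}
```

With width equal to the row width, every lane receives the head of its own row; a full-width (32) shuffle from lane 0 would instead broadcast row 0's head into every row, which is exactly the cross-row mixing the width parameter guards against.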
@tfeher (Contributor) left a comment

Thanks Allard for the PR! It looks good overall, I have just a few smaller comments.

Review thread on cpp/bench/spatial/fused_l2_nn.cu (outdated, resolved):
There was a problem with defgroup syntax.
@tfeher (Contributor) left a comment

Thanks @ahendriksen for addressing the issues. LGTM.

@cjnolet
Member

cjnolet commented Sep 6, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8c639d9 into rapidsai:branch-22.10 Sep 6, 2022
@ahendriksen ahendriksen deleted the enh-faster-fusedl2nn branch March 17, 2023 09:27