-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize fusedL2NN when data is skinny #794
Optimize fusedL2NN when data is skinny #794
Conversation
67e1328
to
1d46f38
Compare
da8fa38
to
16a6fc9
Compare
rerun tests |
There were very few test cases for skinny matrices. They have now been added.
In updateReDucedVal, a single warp can contain multiple rows (in registers). A single thread within the warp uses the first element of each row to update an output array (atomically). In the previous implementation, a shuffle was used to move the head of each row into the first thread of the warp. Unfortunately, this would overwrite the value all other rows. This strategy, however, worked when the number of rows per warp equalled 2. Hence, the bug never triggered. In a recent commit, the number of rows per warp was increased to four in certain situations (skinny matrices). Hence, this bug triggered. In the new implementation, the values are not shuffled into the first thread of the warp any more. Instead, the threads that contain the first element of a row update the output in sequential order. The sequential ordering is necessary to avoid deadlock on Pascal architecture.
In the current implementation, it looks like values from different rows are mixed together in what should be a row-wise warp reduce. All tests do pass however. Just in case, I have added a width parameter to the shuffle so that it only shuffles within a row within the warp.
16a6fc9
to
1c6fa14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Allard for the PR! It looks good overall, I have just a few smaller comments.
There was a problem with defgroup syntax.
1c6fa14
to
36e5865
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ahendriksen for addressing the issues. LGTM.
@gpucibot merge |
The fusedL2NN kernel uses tiling to maximize performance. The current implementation assumes that the input matrices are at least 32 elements wide. When this is not the case, it performs redundant computations.
This PR adds a policy to apply when the matrix is skinny (less than 32 elements wide). This results in a 1.5 - 2x performance improvement across GPU architectures.