[SYCL][CUDA] Non-uniform algorithm implementations for ext_oneapi_cuda. #9671
Conversation
All green! @KseniyaTikhomirova friendly ping for a review.
@JackAKirk I do not have expertise with the code in ext/oneapi/experimental/cuda. Do you know anyone who does? It would be great to ask them to review it.
ControlBarrier(Group g, memory_scope FenceScope, memory_order Order) {
#if defined(__NVPTX__)
  __nvvm_bar_warp_sync(detail::ExtractMask(detail::GetMask(g))[0]);
#else
Is this an intentional change from

#if defined(__SPIR__)
...
#elif defined(__NVPTX__)
...
// no else here
#endif

to

#if defined(__NVPTX__)
...
#else // SPIR
...
#endif

?
I changed it like this purely to be consistent with other cases, which do not do any checks and just call the __spirv functions directly. I'm not sure what is best here: @Pennycook, what do you prefer?
I have no preference. But what you've done here is consistent with other parts of DPC++, at least. For example, the sub-group implementation assumes that SPIR-V intrinsics will be supported. I think this makes sense, because some SPIR-V intrinsics are implemented in libclc.
ok, agree
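For reference, a minimal sketch of the pattern the thread settles on. The __nvvm_bar_warp_sync call and the detail::ExtractMask(detail::GetMask(g)) helpers are taken from the diff above; the __spirv_ControlBarrier intrinsic exists in DPC++, but the helper names group_scope, getScope, and getMemorySemanticsMask are assumed here for the pieces not visible in the diff, so the actual header may differ:

template <typename Group>
void ControlBarrier(Group g, memory_scope FenceScope, memory_order Order) {
#if defined(__NVPTX__)
  // Synchronize only the lanes that are members of the non-uniform group.
  __nvvm_bar_warp_sync(detail::ExtractMask(detail::GetMask(g))[0]);
#else
  // No target check: assume SPIR-V intrinsics are supported (some are
  // implemented in libclc, as noted above).
  __spirv_ControlBarrier(group_scope<Group>::value, getScope(FenceScope),
                         getMemorySemanticsMask(Order));
#endif
}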
Refactored impl.
Thanks for the reviews. I've verified all types, but I think I should also add more test coverage for all supported types; maybe that should be done in a later PR.
LGTM! We seem to have testing for most of this, so I am okay with doing a follow-up expanding the coverage.
Thanks!
@JackAKirk
Yes, masked shuffle functions are used throughout. Note that masked shuffle functions are not currently exposed in the extension. I do not know if there is demand or a plan to expose them as part of this SYCL extension; I have not come across them very often so far.
A blog post like https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/ written for SYCL would be read by many users.
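For context, a minimal sketch of the kind of masked shuffle such an implementation can build on. It assumes the clang NVVM builtin __nvvm_shfl_sync_bfly_i32; the helper name is hypothetical and the real code in non_uniform_algorithms.hpp may differ:

#if defined(__NVPTX__)
// XOR-butterfly shuffle restricted to the lanes named in `mask`,
// where `mask` is the 32-bit membership mask of the non-uniform group.
inline int masked_shuffle_xor(unsigned mask, int val, int lane_mask) {
  // The final operand 0x1f bounds the shuffle to the full 32-lane warp.
  return __nvvm_shfl_sync_bfly_i32(mask, val, lane_mask, 0x1f);
}
#endif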
…_group_size == 32` (#16646) `syclcompat::permute_sub_group_by_xor` was reported to fail flakily on L0. Closer inspection revealed that the implementation of `permute_sub_group_by_xor` is incorrect for cases where `logical_sub_group_size != 32`, which is one of the test cases; this implies that the test itself is wrong. In this PR we first optimize the part of the implementation that is valid assuming the Intel SPIR-V builtins are correct (which is also the only case a user will realistically program), the case `logical_sub_group_size == 32`, in order to:
- ensure the only useful case works via the correct, optimized route;
- check that this improvement doesn't break the suspicious test.

A follow-on PR can fix the other cases where `logical_sub_group_size != 32`; this is better done later, since:
- the only use case I know of for this is implementing non-uniform group algorithms that we have already implemented (e.g. see #9671), and any user is advised to use such algorithms instead of reimplementing them;
- I think it would require a complete reworking of the test and would otherwise delay the more important change here.
---------
Signed-off-by: JackAKirk <[email protected]>
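As an illustration of the valid case, a hedged sketch of what the helper reduces to when `logical_sub_group_size == 32`. The signature mirroring the syclcompat helper is assumed, and the `!= 32` path (the incorrect one discussed above) is deliberately left out:

#include <sycl/sycl.hpp>

// Assumed to mirror syclcompat::permute_sub_group_by_xor's signature.
template <typename T>
T permute_by_xor_sketch(sycl::sub_group sg, T x, unsigned int mask,
                        int logical_sub_group_size = 32) {
  if (logical_sub_group_size == 32)
    // Full sub-group: the SYCL 2020 group function does exactly this and
    // lowers to the sub-group shuffle builtin.
    return sycl::permute_group_by_xor(sg, x, mask);
  // Narrower logical sub-groups need per-segment index clamping; that is
  // the broken path discussed above, so it is omitted here.
  return x;
}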
This PR adds CUDA support for fixed_size_group, ballot_group, and opportunistic_group algorithms. All group algorithm support added for the SPIR-V impls (e.g. those added in #9181) is correspondingly added here for the CUDA backend.
Everything except the reduces/scans uses the same impl for all non-uniform groups. Reduce algorithms also use the same impl for all group types on sm_80 for the special IsRedux type/op pairs.
Otherwise, reduces/scans fall into two impl categories:
1. fixed_size_group
2. opportunistic_group, ballot_group (and tangle_group once it is supported) all use the same impls.
Note that tangle_group is still not supported; however, I think all algorithms implemented for ballot_group/opportunistic_group will be appropriate for tangle_group when it is supported.
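To make the new surface concrete, a hedged usage sketch of these algorithms on the CUDA backend. The API names (get_ballot_group, reduce_over_group over a ballot_group) follow the sycl_ext_oneapi_non_uniform_groups extension as used in this PR and may have changed since; the output-indexing scheme is hypothetical:

#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

// Sum the positive elements seen by each 32-wide sub-group.
void ballot_reduce(sycl::queue q, const int *data, int *out, size_t n) {
  q.parallel_for(sycl::nd_range<1>{n, 32}, [=](sycl::nd_item<1> it) {
    auto sg = it.get_sub_group();
    int x = data[it.get_global_linear_id()];
    bool pred = x > 0;
    // Partition the sub-group by the predicate; each work-item gets the
    // ballot_group containing the lanes that agree with it.
    auto bg = syclex::get_ballot_group(sg, pred);
    // On CUDA this lowers to masked warp intrinsics; on sm_80, eligible
    // IsRedux type/op pairs can take the single redux-based impl
    // described above.
    int sum = sycl::reduce_over_group(bg, x, sycl::plus<int>());
    if (pred && bg.leader())
      out[it.get_group_linear_id()] = sum; // hypothetical output scheme
  });
}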