
[SYCL][CUDA] Non-uniform algorithm implementations for ext_oneapi_cuda. #9671

Merged
merged 28 commits into intel:sycl on Jun 13, 2023

Conversation

JackAKirk
Contributor

This PR adds CUDA support for the fixed_size_group, ballot_group, and opportunistic_group algorithms. All group algorithm support added for the SPIR-V implementations (e.g. those added in #9181) is correspondingly added here for the CUDA backend.

Everything except the reduces/scans uses the same implementation for all non-uniform groups. Reduce algorithms also share a single implementation across all group types on sm_80 for the special IsRedux type/op pairs.

Otherwise the reduces/scans fall into two implementation categories:
1. fixed_size_group
2. opportunistic_group, ballot_group (and tangle_group once it is supported), which all use the same implementations.

Note that tangle_group is still not supported. However, I think all algorithms implemented for ballot_group/opportunistic_group will be appropriate for tangle_group when it is supported.

@JackAKirk JackAKirk requested a review from a team as a code owner May 31, 2023 13:33
JackAKirk added 2 commits May 31, 2023 14:49
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
@JackAKirk JackAKirk closed this Jun 5, 2023
@JackAKirk JackAKirk reopened this Jun 5, 2023
@JackAKirk
Contributor Author

All green! @KseniyaTikhomirova friendly ping for a review.

Contributor

@KseniyaTikhomirova KseniyaTikhomirova left a comment

@JackAKirk I do not have expertise with the code in ext/oneapi/experimental/cuda; do you know anyone who does? It would be great to ask them to review it.

ControlBarrier(Group g, memory_scope FenceScope, memory_order Order) {
#if defined(__NVPTX__)
__nvvm_bar_warp_sync(detail::ExtractMask(detail::GetMask(g))[0]);
#else
Contributor

Is it an intentional change from

#if defined(__SPIR__)
#elif defined(__NVPTX__)
// no else here
#endif

to

#if defined(__NVPTX__)
#else // SPIR path
#endif

?

Contributor Author

I changed it like this purely to be consistent with the other cases, which do not do any checks and just call the _spirv functions directly. I'm not sure what is best here: @Pennycook, what do you prefer?

Contributor

I have no preference. But what you've done here is consistent with other parts of DPC++, at least. For example, the sub-group implementation assumes that SPIR-V intrinsics will be supported. I think this makes sense, because some SPIR-V intrinsics are implemented in libclc.

Contributor

ok, agree

@JackAKirk JackAKirk requested review from a team and steffenlarsen and removed request for a team June 5, 2023 15:02
JackAKirk added 2 commits June 9, 2023 21:25
Refactored impl.

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk
Contributor Author

Thanks for the reviews.
Hopefully I've addressed the points now. I've added all missing supported types. The half implementation can be made faster, but I think this PR is not the appropriate place to address that: see #9809.

I've verified all types, but I think I should also add more test coverage for all supported types. Maybe this should be done in a later PR.

Contributor

@steffenlarsen steffenlarsen left a comment

LGTM! We seem to have testing for most of this, so I am okay with doing a follow-up expanding the coverage.

@steffenlarsen steffenlarsen merged commit b7f09d8 into intel:sycl Jun 13, 2023
@JackAKirk
Contributor Author

> LGTM! We seem to have testing for most of this, so I am okay with doing a follow-up expanding the coverage.

Thanks!

fineg74 pushed a commit to fineg74/llvm that referenced this pull request Jun 15, 2023
…a. (intel#9671)

@jinz2014
Contributor

@JackAKirk
I have a question: is your implementation related to the masked shuffle functions in CUDA? I hope more examples will be available for users. Thanks.

@JackAKirk
Contributor Author

> @JackAKirk I have a question: is your implementation related to the masked shuffle functions in CUDA? I hope more examples will be available for users. Thanks.

Yes, masked shuffle functions are used throughout. Note that masked shuffle functions are not currently exposed in the extension. I do not know if there is demand, or a plan, to expose them as part of this SYCL extension; I have not come across them very often so far.

@jinz2014
Contributor

A blog like https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/ for SYCL would be read by many users.

martygrant pushed a commit that referenced this pull request Jan 20, 2025
…_group_size == 32` (#16646)

`syclcompat::permute_sub_group_by_xor` was reported to fail flakily on L0. Closer inspection revealed that the implementation of `permute_sub_group_by_xor` is incorrect for cases where `logical_sub_group_size != 32`, which is one of the test cases. This implies that the test itself is wrong.

In this PR we first optimize the part of the implementation that is valid, assuming the Intel SPIR-V builtins are correct (which is also the only case a user will realistically program): the case `logical_sub_group_size == 32`. This is done in order to:
- Ensure the only useful case works via the correct optimized route.
- Check that this improvement doesn't break the suspicious test.

A follow-on PR can fix the other cases where `logical_sub_group_size != 32`; this is better done later, since
- the only use case I know of for this is to implement non-uniform group algorithms that we already have implemented (e.g. see #9671), and any user is advised to use such algorithms instead of reimplementing them themselves.
- it would, I think, require a complete reworking of the test and would otherwise delay the more important change here.

---------

Signed-off-by: JackAKirk <[email protected]>