
[SYCL][CUDA] Non-uniform algorithm implementations for ext_oneapi_cuda. #9671

Merged
merged 28 commits into intel:sycl on Jun 13, 2023

Conversation

JackAKirk
Contributor

This PR adds CUDA support for the fixed_size_group, ballot_group, and opportunistic_group algorithms. All group algorithm support added for the SPIR-V implementations (e.g. those added in #9181) is correspondingly added here for the CUDA backend.

Everything except the reduces/scans uses the same implementation for all non-uniform groups. Reduce algorithms also share a single implementation across all group types on sm_80 for the special IsRedux type/op pairs.

Otherwise the reduces/scans fall into two implementation categories:
1. fixed_size_group
2. opportunistic_group, ballot_group (and tangle_group once it is supported), which all use the same implementations.

Note that tangle_group is still not supported. However, I think all algorithms implemented for ballot_group/opportunistic_group will be appropriate for tangle_group when it is supported.

@JackAKirk JackAKirk requested a review from a team as a code owner May 31, 2023 13:33
JackAKirk added 2 commits May 31, 2023 14:49
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
Signed-off-by: JackAKirk <[email protected]>
@JackAKirk JackAKirk closed this Jun 5, 2023
@JackAKirk JackAKirk reopened this Jun 5, 2023
@JackAKirk
Contributor Author

All green! @KseniyaTikhomirova friendly ping for a review.

Contributor

@KseniyaTikhomirova KseniyaTikhomirova left a comment

@JackAKirk I do not have expertise with the code in ext/oneapi/experimental/cuda; do you know anyone who does? It would be great to ask them to review it.

ControlBarrier(Group g, memory_scope FenceScope, memory_order Order) {
#if defined(__NVPTX__)
__nvvm_bar_warp_sync(detail::ExtractMask(detail::GetMask(g))[0]);
#else
Contributor

Is it an intentional change from

#if defined(__SPIR__)
#elif defined(__NVPTX__)
// no else here
#endif

to

#if defined(__NVPTX__)
#else // SPIR path
#endif

?

Contributor Author

I changed it like this purely to be consistent with the other cases, which do not do any checks and just call the _spirv functions directly. I'm not sure what is best here: @Pennycook, what do you prefer?

Contributor

I have no preference. But what you've done here is consistent with other parts of DPC++, at least. For example, the sub-group implementation assumes that SPIR-V intrinsics will be supported. I think this makes sense, because some SPIR-V intrinsics are implemented in libclc.

Contributor

ok, agree

@JackAKirk JackAKirk requested review from a team and steffenlarsen and removed request for a team June 5, 2023 15:02
JackAKirk added 2 commits June 9, 2023 21:25
Refactored impl.

Signed-off-by: JackAKirk <[email protected]>
@JackAKirk
Contributor Author

Thanks for the reviews.
Hopefully I've addressed the points now. I've added all missing supported types. The half implementation can be made faster, but I think this PR is not the appropriate place to address that: see #9809.

I've verified all types, but I think I should also add more test coverage for all supported types. Maybe this should be done in a later PR.

Contributor

@steffenlarsen steffenlarsen left a comment

LGTM! We seem to have testing for most of this, so I am okay with doing a follow-up expanding the coverage.

@steffenlarsen steffenlarsen merged commit b7f09d8 into intel:sycl Jun 13, 2023
@JackAKirk
Contributor Author

> LGTM! We seem to have testing for most of this, so I am okay with doing a follow-up expanding the coverage.

Thanks!

fineg74 pushed a commit to fineg74/llvm that referenced this pull request Jun 15, 2023
…a. (intel#9671)

@jinz2014
Contributor

@JackAKirk
I have a question: is your implementation related to the masked shuffle functions in CUDA? I hope more examples will be available for users. Thanks.

@JackAKirk
Contributor Author

> @JackAKirk I have a question: is your implementation related to the masked shuffle functions in CUDA? I hope more examples will be available for users. Thanks.

Yes, masked shuffle functions are used throughout. Note that masked shuffle functions are not currently exposed in the extension. I do not know if there is demand, or a plan, to expose them as part of this SYCL extension; I have not come across them very often so far.

@jinz2014
Contributor

A blog like https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/ for SYCL would be read by many users.

martygrant pushed a commit that referenced this pull request Jan 20, 2025
…_group_size == 32` (#16646)

`syclcompat::permute_sub_group_by_xor` was reported to fail flakily on L0. Closer inspection revealed that the implementation of `permute_sub_group_by_xor` is incorrect for cases where `logical_sub_group_size != 32`, which is one of the test cases. This implies that the test itself is wrong.

In this PR we first optimize the part of the implementation that is valid, assuming the Intel SPIR-V builtins are correct (which is also the only case a user will realistically program): the case `logical_sub_group_size == 32`. This is done in order to:
- Ensure the only useful case works via the correct optimized route.
- Check that this improvement doesn't break the suspicious test.

A follow-on PR can fix the other cases where `logical_sub_group_size != 32`; this is better done later, since
- the only use case I know of for this is to implement non-uniform group algorithms that we already have implemented (e.g. see #9671), and any user is advised to use such algorithms instead of reimplementing them themselves.
- it would, I think, require a complete reworking of the test and would otherwise delay the more important change here.

---------

Signed-off-by: JackAKirk <[email protected]>