
[SYCL][PTX][CUDA] Implicit global offset implementation #1773

Merged · 3 commits · Jul 3, 2020

Conversation

steffenlarsen
Contributor

This commit implements implicit global offset behavior for the kernels generated for the PI CUDA backend. This includes the following changes:

  • A new builtin __builtin_ptx_implicit_offset and intrinsic int.nvvm.implicit.offset for getting the global offset. For ptx-nvidiacl, this is used to implement the __spirv_GlobalOffset builtin.
  • A new pass that iterates over the uses of the int.nvvm.implicit.offset intrinsic, replacing it with a new function parameter. It then moves up the call-tree, adding a similar parameter to callers that lack it and passing it along in the adjusted calls. Entry points are an exception: they are cloned, the clone is given the new parameter, and the original uses an offset of {0,0,0} in all uses of the intrinsic or of functions with the new parameter. Any entry points that are not cloned are invariant to the offset parameter.

Additionally, the PI CUDA backend now includes an offset parameter in the set of arguments for kernels. PI CUDA attempts to load the corresponding kernel both with and without the global offset parameter. If present, the kernel with the offset parameter is used only when a non-zero global offset is given.
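
For context, a minimal sketch of how the __spirv_GlobalOffset builtins in the ptx-nvidiacl device library could sit on top of the new builtin. This is not code from the patch; the per-dimension function names and the assumption that __builtin_ptx_implicit_offset returns a pointer to three size_t values are illustrative only:

```cpp
#include <cstddef>

// Hypothetical device-library sketch. Assumes __builtin_ptx_implicit_offset()
// returns a pointer to the three per-dimension offsets and lowers to the
// int.nvvm.implicit.offset intrinsic; it only compiles with the patched compiler.
extern "C" size_t __spirv_GlobalOffset_x() { return __builtin_ptx_implicit_offset()[0]; }
extern "C" size_t __spirv_GlobalOffset_y() { return __builtin_ptx_implicit_offset()[1]; }
extern "C" size_t __spirv_GlobalOffset_z() { return __builtin_ptx_implicit_offset()[2]; }
```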

@steffenlarsen
Contributor Author

Pinging @Naghasan

@erichkeane
Contributor

Nothing of interest in the CFE, LGTM.

smaslov-intel previously approved these changes Jun 2, 2020
Review threads (resolved) on llvm/lib/Target/NVPTX/SYCL/GlobalOffset.h and llvm/lib/Target/NVPTX/SYCL/GlobalOffset.cpp; the following thread on GlobalOffset.cpp is shown inline:
Func->getAddressSpace());

if (KeepOriginal) {
NewFunc->setName(Func->getName() + "_with_offset");
Contributor

NOTE: this change breaks the original mangling, so if other tools rely on mangling rules to demangle the function name (e.g. a profiler or debugger), they might have problems decoding this suffix. I think we should add a TODO comment to address this later.

Contributor Author

I've added a TODO comment, but I wonder how to make sure that the PI CUDA backend would be able to easily find the new kernels if the naming changes.

Contributor

The best way is to not add/rename functions.
What would be the implications of always having an offset parameter?

Contributor Author

Having the offset parameter on all entry points would unnecessarily increase the number of registers used by each thread in most cases. In this implementation, the compiler is able to optimize out the 0-offset inserted into the original function.
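
In rough CUDA/C++ terms, the two variants the pass leaves behind look something like the sketch below (illustrative names and bodies, not the actual generated code). The constant zero array in the original kernel is folded away by optimization, which is why it costs no extra registers:

```cpp
#include <cstddef>

// Clone produced by the pass (the "_with_offset" variant): the implicit offset
// is an ordinary kernel parameter and reaches every former use of the intrinsic.
__global__ void vec_init_with_offset(float *out, const size_t *implicit_offset) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x + implicit_offset[0];
  out[gid] = static_cast<float>(gid);
}

// Original entry point: every former use of the intrinsic reads from a constant
// {0, 0, 0} array, which the optimizer removes entirely.
__global__ void vec_init(float *out) {
  const size_t zero_offset[3] = {0, 0, 0};
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x + zero_offset[0];
  out[gid] = static_cast<float>(gid);
}
```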

Contributor

So when a kernel is called and the offset is 0, there is no performance impact?
How does it work? With static analysis or compiling 2 versions and picking the optimized one at runtime?
Does anyone have some statistics about the utilization of global offset? I have never used it...

Contributor Author

steffenlarsen commented Jun 5, 2020


How does it work? With static analysis or compiling 2 versions and picking the optimized one at runtime?

The latter. It clones most[1] kernels and adds an offset parameter to the clone, then it inserts an array of three zeros into the original kernel, which is used in all places where either the global offset intrinsic is used or a function is called that, directly or down the line, may use the global offset intrinsic. Luckily the compiler is smart enough to figure out that all uses of elements in this array, which remains unchanged throughout, can be simplified, so the array is removed at some point during optimization. The PI CUDA backend then selects the version of the kernel with the offset parameter iff any element of the given offset is strictly greater than 0; otherwise it selects the "original" (a rough sketch of this selection follows below).

[1] Kernels that are completely invariant w.r.t. the global offset are not cloned or altered.
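
Sketched on the plugin side, that selection might look roughly like this (hypothetical names and types, not the actual PI CUDA code):

```cpp
#include <cstddef>

// Hypothetical handles for the two compiled variants of one kernel.
struct KernelVariants {
  void *original;    // kernel built without the implicit offset parameter
  void *withOffset;  // "_with_offset" clone, or nullptr if it was not emitted
};

// Pick the variant to launch: only a non-zero global offset needs the clone.
void *selectKernel(const KernelVariants &k, const size_t offset[3]) {
  const bool nonZero = offset[0] != 0 || offset[1] != 0 || offset[2] != 0;
  return (nonZero && k.withOffset) ? k.withOffset : k.original;
}
```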

Does anyone have some statistics about the utilization of global offset? I have never used it...

Me neither, but I have seen it used a couple of times. I can see it being useful.

bader previously approved these changes Jun 4, 2020
@bader bader added the cuda CUDA back-end label Jun 12, 2020
@bader
Contributor

bader commented Jun 16, 2020

@steffenlarsen, could you resolve merge conflicts, please?

Steffen Larsen and others added 2 commits June 16, 2020 13:29
This commit implements implicit global offset behavior for the kernels
generated for the PI CUDA backend. This includes the following changes:

 * A new builtin `__builtin_ptx_implicit_offset` and intrinsic
   `int.nvvm.implicit.offset` for getting the global offset. For
   `ptx-nvidiacl`, this is used to implement the
   `__spirv_GlobalOffset` builtin.
 * A new pass that iterates over the uses of the
   `int.nvvm.implicit.offset` intrinsic, replacing it with a new
   function parameter. It then moves up the call-tree, adding a
   similar parameter to callers that lack it and passing it along in
   the adjusted calls. Entry points are an exception: they are
   cloned, the clone is given the new parameter, and the original
   uses an offset of `{0,0,0}` in all uses of the intrinsic or of
   functions with the new parameter. Any entry points that are not
   cloned are invariant to the offset parameter.

Additionally, the PI CUDA backend now includes an offset parameter in
the set of arguments for kernels. PI CUDA attempts to load the
corresponding kernel both with and without the global offset
parameter. If present, the kernel with the offset parameter is used
only when a non-zero global offset is given.

Co-authored-by: David Wood <[email protected]>
Co-authored-by: Victor Lomuller <[email protected]>
Signed-off-by: Steffen Larsen <[email protected]>
@steffenlarsen
Contributor Author

@steffenlarsen, could you resolve merge conflicts, please?

Thank you for notifying me. The merge conflict has been fixed.

bader previously approved these changes Jun 16, 2020
@premanandrao
Contributor

Nothing of interest in the CFE, LGTM.

+1.

premanandrao previously approved these changes Jun 16, 2020
@steffenlarsen steffenlarsen dismissed stale reviews from premanandrao and bader via 3c9ba39 June 16, 2020 17:54
@steffenlarsen
Contributor Author

I enabled parallel_for_indexers for level0 by accident. It has been re-disabled.
