
[SYCL][PTX][CUDA] Implicit global offset implementation #1773

Merged · 3 commits · Jul 3, 2020

Conversation

steffenlarsen
Contributor

This commit implements implicit global offset behavior for the kernels generated for the PI CUDA backend. This includes the following changes:

  • A new builtin __builtin_ptx_implicit_offset and intrinsic int.nvvm.implicit.offset for getting the global offset. For ptx-nvidiacl, this is used to implement the __spirv_GlobalOffset builtin.
  • A new pass that iterates over the uses of the int.nvvm.implicit.offset intrinsic, replacing it with a new function parameter. It then moves up the call-tree, adding a similar parameter to callers that lack it and passing it along in the adjusted calls. Entry points are an exception: they are cloned, the clone is given the new parameter, and the original uses an offset of {0,0,0} in all uses of the intrinsic or of functions with the new parameter. Any entry points that are not cloned are invariant to the offset parameter.

Additionally, the PI CUDA backend now includes an offset parameter in the set of arguments for kernels. PI CUDA attempts to load the corresponding kernel both with and without the global offset parameter. If present, the kernel with the offset parameter is used only when a non-zero global offset is given.
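
For context, a minimal sketch of how the __spirv_GlobalOffset builtins in the ptx-nvidiacl device library could sit on top of the new builtin. This is not code from the patch; the per-dimension function names and the assumption that __builtin_ptx_implicit_offset returns a pointer to three size_t values are illustrative only:

```cpp
#include <cstddef>

// Hypothetical device-library sketch. Assumes __builtin_ptx_implicit_offset()
// returns a pointer to the three per-dimension offsets and lowers to the
// int.nvvm.implicit.offset intrinsic; it only compiles with the patched compiler.
extern "C" size_t __spirv_GlobalOffset_x() { return __builtin_ptx_implicit_offset()[0]; }
extern "C" size_t __spirv_GlobalOffset_y() { return __builtin_ptx_implicit_offset()[1]; }
extern "C" size_t __spirv_GlobalOffset_z() { return __builtin_ptx_implicit_offset()[2]; }
```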

@steffenlarsen
Contributor Author

Pinging @Naghasan

@erichkeane
Contributor

Nothing of interest in the CFE, LGTM.

smaslov-intel previously approved these changes Jun 2, 2020
Review threads (resolved) on llvm/lib/Target/NVPTX/SYCL/GlobalOffset.h and llvm/lib/Target/NVPTX/SYCL/GlobalOffset.cpp; the following thread on GlobalOffset.cpp is shown inline:
Func->getAddressSpace());

if (KeepOriginal) {
NewFunc->setName(Func->getName() + "_with_offset");
Contributor

NOTE: this change breaks the original mangling, so if other tools rely on mangling rules to demangle the function name (e.g. a profiler or debugger), they might have problems decoding this suffix. I think we should add a TODO comment to address this later.

Contributor Author

I've added a TODO comment, but I wonder how to make sure that the PI CUDA backend would be able to easily find the new kernels if the naming changes.

Contributor

The best way is to not add/rename functions.
What would be the implications of always having an offset parameter?

Contributor Author

Having the offset parameter on all entry points would unnecessarily increase the number of registers used by each thread in most cases. In this implementation, the compiler is able to optimize out the 0-offset inserted into the original function.
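
In rough CUDA/C++ terms, the two variants the pass leaves behind look something like the sketch below (illustrative names and bodies, not the actual generated code). The constant zero array in the original kernel is folded away by optimization, which is why it costs no extra registers:

```cpp
#include <cstddef>

// Clone produced by the pass (the "_with_offset" variant): the implicit offset
// is an ordinary kernel parameter and reaches every former use of the intrinsic.
__global__ void vec_init_with_offset(float *out, const size_t *implicit_offset) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x + implicit_offset[0];
  out[gid] = static_cast<float>(gid);
}

// Original entry point: every former use of the intrinsic reads from a constant
// {0, 0, 0} array, which the optimizer removes entirely.
__global__ void vec_init(float *out) {
  const size_t zero_offset[3] = {0, 0, 0};
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x + zero_offset[0];
  out[gid] = static_cast<float>(gid);
}
```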

Contributor

So when a kernel is called and the offset is 0, there is no performance impact?
How does it work? With static analysis or compiling 2 versions and picking the optimized one at runtime?
Does anyone have some statistics about the utilization of global offset? I have never used it...

Contributor Author

steffenlarsen commented Jun 5, 2020


How does it work? With static analysis or compiling 2 versions and picking the optimized one at runtime?

The latter. It clones most[1] kernels and adds an offset parameter to the clone, then it inserts an array of three zeros into the original kernel, which is used in all places where either the global offset intrinsic is used or a function is called that, directly or down the line, may use the global offset intrinsic. Luckily the compiler is smart enough to figure out that all uses of elements in this array, which remains unchanged throughout, can be simplified, so the array is removed at some point during optimization. The PI CUDA backend then selects the version of the kernel with the offset parameter iff any element of the given offset is strictly greater than 0; otherwise it selects the "original" (a rough sketch of this selection follows below).

[1] Kernels that are completely invariant w.r.t. the global offset are not cloned or altered.
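
Sketched on the plugin side, that selection might look roughly like this (hypothetical names and types, not the actual PI CUDA code):

```cpp
#include <cstddef>

// Hypothetical handles for the two compiled variants of one kernel.
struct KernelVariants {
  void *original;    // kernel built without the implicit offset parameter
  void *withOffset;  // "_with_offset" clone, or nullptr if it was not emitted
};

// Pick the variant to launch: only a non-zero global offset needs the clone.
void *selectKernel(const KernelVariants &k, const size_t offset[3]) {
  const bool nonZero = offset[0] != 0 || offset[1] != 0 || offset[2] != 0;
  return (nonZero && k.withOffset) ? k.withOffset : k.original;
}
```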

Does anyone have some statistics about the utilization of global offset? I have never used it...

Me neither, but I have seen it used a couple of times. I can see it being useful.

bader previously approved these changes Jun 4, 2020
@bader bader added the cuda CUDA back-end label Jun 12, 2020
@bader
Contributor

bader commented Jun 16, 2020

@steffenlarsen, could you resolve merge conflicts, please?

Steffen Larsen and others added 2 commits June 16, 2020 13:29
This commit implements implicit global offset behavior for the kernels
generated for the PI CUDA backend. This includes the following changes:

 * A new builtin `__builtin_ptx_implicit_offset` and intrinsic
   `int.nvvm.implicit.offset` for getting the global offset. For
   `ptx-nvidiacl`, this is used to implement the
   `__spirv_GlobalOffset` builtin.
 * A new pass that iterates over the uses of the
   `int.nvvm.implicit.offset` intrinsic, replacing it with a new
   function parameter. It then moves up the call-tree, adding a
   similar parameter to callers that lack it and passing it along in
   the adjusted calls. Entry points are an exception: they are
   cloned, the clone is given the new parameter, and the original
   uses an offset of `{0,0,0}` in all uses of the intrinsic or of
   functions with the new parameter. Any entry points that are not
   cloned are invariant to the offset parameter.

Additionally, the PI CUDA backend now includes an offset parameter in
the set of arguments for kernels. PI CUDA attempts to load the
corresponding kernel both with and without the global offset
parameter. If present, the kernel with the offset parameter is used
only when a non-zero global offset is given.

Co-authored-by: David Wood <[email protected]>
Co-authored-by: Victor Lomuller <[email protected]>
Signed-off-by: Steffen Larsen <[email protected]>
@steffenlarsen
Contributor Author

@steffenlarsen, could you resolve merge conflicts, please?

Thank you for notifying me. The merge conflict has been fixed.

bader previously approved these changes Jun 16, 2020
@premanandrao
Contributor

Nothing of interest in the CFE, LGTM.

+1.

premanandrao previously approved these changes Jun 16, 2020
@steffenlarsen steffenlarsen dismissed stale reviews from premanandrao and bader via 3c9ba39 June 16, 2020 17:54
@steffenlarsen
Contributor Author

I enabled parallel_for_indexers for level0 by accident. It has been re-disabled.
