-
Notifications
You must be signed in to change notification settings - Fork 744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][PTX][CUDA] Implicit global offset implementation #1773
Conversation
Pinging @Naghasan |
86c0a6f
to
2a5fe23
Compare
Nothing of interest in the CFE, LGTM. |
Func->getAddressSpace()); | ||
|
||
if (KeepOriginal) { | ||
NewFunc->setName(Func->getName() + "_with_offset"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: this change breaks original mangling, so if other tools rely on mangling rules to unmangle the name of the function (e.g. profiler, debugger, etc.) they might have problems with decoding this suffix. I think we should add a TODO comment to address this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a TODO comment, but I wonder how to make sure that the PI CUDA backend would be able to easily find the new kernels if the naming changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best way is to not add/rename functions.
What would be the implications of always having offset parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the offset parameter on all entry points would increase the number of registers used by each thread unnecessarily in most cases. In this implementation the compiler is able to optimize out the 0-offset inserted into the original function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when a kernel is called and the offset is 0, there is no performance impact?
How does it work? With static analysis or compiling 2 versions and picking the optimized one at runtime?
Does anyone has some statistics about the utilization of global offset? I have never used it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it work? With static analysis or compiling 2 versions and picking the optimized one at runtime?
The latter. It clones most[1] kernels and adds an offset parameter to the clone, then it inserts an array of three 0 into the original kernel, which is then used in all places where either the global offset intrinsic is used or any function that - either directly or down the line - may use the global offset intrinsic. Luckily the compiler is smart enough to figure out that all uses of elements in this array - which remains unchanged throughout - can be simplified and thus the array will in turn be removed at some point during optimization. The PI CUDA backend then selects the version of the kernel with the offset parameter iff any element of the given offset is strictly greater than 0, otherwise it selects the "original".
[1] Kernels that is completely invariant w.r.t. global offset is not cloned or altered.
Does anyone has some statistics about the utilization of global offset? I have never used it...
Me neither, but I have seen it being used a couple of times. I can see it being useful.
@steffenlarsen, could you resolve merge conflicts, please? |
This commit implements implicit global offset behavior for the kernels generated for the PI CUDA backend. This includes the following changes: * A new builtin `__builtin_ptx_implicit_offset` and intrinsic `int.nvvm.implicit.offset` for getting the global offset. For the `ptx-nvidiacl` this is used for implementing the `__spirv_GlobalOffset` builtin. * A new pass that iterates over the uses of the `int.nvvm.implicit.offset` intrinsic, replacing it with a new function parameter. It then moves up the call-tree, adjusting calls to functions with this new parameter by adding a similar parameter to callers without it and adding this parameter to the calls. An exception are entry points, which are instead cloned with the clone being given the new parameter and the original using an offset of `{0,0,0}` in all uses of the intrinsic or functions with the new parameter. Any entry points that are not cloned are invariant to the offset parameter. Additionally the PI CUDA backend now includes an offset parameter in the set of arguments for kernels. PI CUDA attempt to load the corresponding kernel both with and without the global offset parameter. If present, the kernel with the offset parameter is used only when a non-zero global offset is used. Co-authored-by: David Wood <[email protected]> Co-authored-by: Victor Lomuller <[email protected]> Signed-off-by: Steffen Larsen <[email protected]>
Signed-off-by: Steffen Larsen <[email protected]>
3ab1b55
to
85edd46
Compare
Thank you for notifying me. The merge conflict has been fixed. |
+1. |
Signed-off-by: Steffen Larsen <[email protected]>
I enabled |
This commit implements implicit global offset behavior for the kernels generated for the PI CUDA backend. This includes the following changes:
__builtin_ptx_implicit_offset
and intrinsicint.nvvm.implicit.offset
for getting the global offset. For theptx-nvidiacl
this is used for implementing the__spirv_GlobalOffset
builtin.int.nvvm.implicit.offset
intrinsic, replacing it with a new function parameter. It then moves up the call-tree, adjusting calls to functions with this new parameter by adding a similar parameter to callers without it and adding this parameter to the calls. An exception are entry points, which are instead cloned with the clone being given the new parameter and the original using an offset of{0,0,0}
in all uses of the intrinsic or functions with the new parameter. Any entry points that are not cloned are invariant to the offset parameter.Additionally the PI CUDA backend now includes an offset parameter in the set of arguments for kernels. PI CUDA attempt to load the corresponding kernel both with and without the global offset parameter. If present, the kernel with the offset parameter is used only when a non-zero global offset is used.