Move GPU ukernel selection to KernelConfig #19440
Open
+408
−314
This moves the logic that decides whether an op should be a ukernel out of the GPULowerToUKernels pass and into KernelConfig.

So KernelConfig decides whether the op should be a ukernel and encodes that decision into the resulting `lowering_config`, in a new parameter holding a new attribute, UKernelSpecAttr. That attribute is modeled directly after the equivalent C++ data structure that we have had in the LowerToUKernels passes, `FnNameAndDefAttrs`, which it replaces. If the attribute is present, the op was selected for ukernel lowering, and its fields give the ukernel name and some function definition attributes (to import any dependencies, such as the `rocm` module for runtime support symbols).

All the details about supplying the ukernel bitcode in a `hal.executable.object` are also moved there, becoming a side effect of `KernelConfig`.

GPULowerToUKernels becomes much simpler, since all the decision-making has already been done for it. It just looks at the `LoweringConfigAttr` and, if the ukernel attribute is there, performs the requested lowering.

The motivation for this split is that we need to know in KernelConfig whether the op is going to be a ukernel, because ops that will get lowered to a ukernel require a different configuration. The important example for us is `multi_mma`, which in the ukernel case needs to avoid tiling the reduction dimension to 1 so that the ukernel gets to see the reduction loop.

A few simplifications already arise in the current argmax ukernel logic, confirming that this was the right design choice. The old ukernel matching logic checked that the distribution tile sizes matched what the ukernel could handle. Now that is turned upside down: ukernel matching happens in a helper within KernelConfig, where we know we are setting the appropriate tile sizes on purpose.

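To make the division of labor concrete, here is a minimal, language-neutral sketch in Python. It is not the actual C++ or MLIR in this PR: `UKernelSpec`, `LoweringConfig`, `select_ukernel`, `kernel_config`, `lower_to_ukernel`, the ukernel name, and the `reduction_tile` convention are all hypothetical stand-ins chosen for illustration. The only ideas taken from the description are that selection happens during configuration, that the result is recorded as an optional spec on the lowering config, and that the lowering step acts only on that spec.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UKernelSpec:
    """Hypothetical stand-in for UKernelSpecAttr: a ukernel name plus
    function definition attributes (e.g. a module to import)."""
    name: str
    def_attrs: dict = field(default_factory=dict)


@dataclass
class LoweringConfig:
    """Hypothetical stand-in for a lowering config with an optional ukernel spec."""
    # None means "leave the reduction loop untiled" (an assumed convention).
    reduction_tile: Optional[int]
    ukernel: Optional[UKernelSpec] = None


def select_ukernel(op_name: str, ukernels_enabled: bool) -> Optional[UKernelSpec]:
    """Selection helper: runs inside configuration, not inside the lowering."""
    if ukernels_enabled and op_name == "multi_mma":
        # The name and attribute key below are made up for this sketch.
        return UKernelSpec("example_multi_mma_ukernel", {"import_module": "rocm"})
    return None


def kernel_config(op_name: str, ukernels_enabled: bool) -> LoweringConfig:
    """Configuration picks tile sizes knowing whether a ukernel was selected."""
    spec = select_ukernel(op_name, ukernels_enabled)
    if spec is not None:
        # Ukernel case: do not tile the reduction to 1, so the ukernel
        # gets to see the whole reduction loop.
        return LoweringConfig(reduction_tile=None, ukernel=spec)
    # Non-ukernel case: illustrative default of tiling the reduction to 1.
    return LoweringConfig(reduction_tile=1)


def lower_to_ukernel(op_name: str, config: LoweringConfig) -> str:
    """Lowering makes no decisions: it only follows the recorded spec."""
    if config.ukernel is None:
        return op_name  # leave the op untouched
    return f"ukernel.call @{config.ukernel.name}"
```

In this sketch the lowering function never inspects tile sizes, which mirrors the point above: the matching logic lives next to the code that sets the tile sizes, so there is nothing left to re-check later.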
Another nice improvement is that this puts just enough distance between ukernel selection (which creates the `hal.executable.object`) and ukernel lowering that we are able to insert `HoistExecutableObjectsPass` in between. This simplifies the ukernel lowering, which no longer needs to worry about preserving the `hal.executable.object`.
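
The ordering described here (selection, then hoisting, then lowering) can be sketched with a toy model in Python. Everything below is an assumption made for illustration: executables and ops are plain dictionaries, `select`, `hoist_objects`, and `lower` are hypothetical names, and "hoisting" is modeled simply as moving object entries from individual ops up to the enclosing executable. It does not reproduce the real pass implementations.

```python
def select(ops, ukernel_for):
    """Selection: tag chosen ops with a ukernel name and attach an object to them."""
    out = []
    for op in ops:
        op = dict(op)
        name = ukernel_for.get(op["name"])
        if name is not None:
            op["ukernel"] = name
            op["objects"] = [f"{name}.bc"]  # made-up bitcode file name
        out.append(op)
    return out


def hoist_objects(executable):
    """Hoisting: move objects off the ops and onto the enclosing executable."""
    hoisted = list(executable.get("objects", []))
    ops = []
    for op in executable["ops"]:
        op = dict(op)
        for obj in op.pop("objects", []):
            if obj not in hoisted:
                hoisted.append(obj)
        ops.append(op)
    return {"objects": hoisted, "ops": ops}


def lower(executable):
    """Lowering: replace tagged ops with a call. It builds a fresh op and
    carries nothing over, which is safe because objects were already hoisted."""
    ops = []
    for op in executable["ops"]:
        if "ukernel" in op:
            ops.append({"name": "ukernel_call", "callee": op["ukernel"]})
        else:
            ops.append(dict(op))
    return {"objects": list(executable.get("objects", [])), "ops": ops}


# Usage: run the three steps in order on a two-op executable.
executable = {
    "ops": select(
        [{"name": "argmax"}, {"name": "add"}],
        {"argmax": "example_argmax_ukernel"},
    )
}
result = lower(hoist_objects(executable))
```

If `lower` ran directly on the selected ops without the hoisting step, the object attached to the argmax op would be dropped along with the op, which is the bookkeeping the intermediate pass removes from the lowering.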