[FEA] Improve occupancy during hash table build #15502
Comments
Since option 1 doesn't incur the cost of JIT compilation, it may be the better approach in terms of performance. My current plan is to arrange the types in increasing order of register usage and split the large switch in
This sounds like an interesting approach.
Although I tested this out only for
Right. I guess the idea is that we dispatch the comparator/hasher type internally at runtime, based on the type requirements, and then pass the one with the least overhead to the kernel. This is a common pattern, I'd say, where each runtime branch leads to a separately compiled kernel. If we can afford the compile-time overhead in cudf, then this is the right way to go. The downside is that if we want this optimization to happen, we have to explicitly type out the
Could this be done by passing a custom IdTypeMap to the type dispatcher that dispatches unsupported types to null? Perhaps we could define a helper factory to produce such a mapping easily?
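A rough sketch of that idea, using a toy dispatcher rather than the real cudf one: a custom id-to-type map collapses the unsupported type to `void`, so the functor's expensive branch for that type is never instantiated. Everything here (`toy_type_id`, `default_id_map`, `primitive_only_map`, `toy_type_dispatcher`, `element_size_fn`) is hypothetical and only mirrors the general shape of a map-parameterized type dispatcher. It requires C++17.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a type id enum; this is not cudf::type_id.
enum class toy_type_id { INT32, FLOAT64, STRING };

// Default map: every id maps to a concrete C++ type.
template <toy_type_id Id> struct default_id_map;
template <> struct default_id_map<toy_type_id::INT32>   { using type = std::int32_t; };
template <> struct default_id_map<toy_type_id::FLOAT64> { using type = double; };
template <> struct default_id_map<toy_type_id::STRING>  { using type = std::string; };

// Restricted map: the "complex" type (STRING here) collapses to void.
template <toy_type_id Id>
struct primitive_only_map {
  using type = std::conditional_t<Id == toy_type_id::STRING,
                                  void,
                                  typename default_id_map<Id>::type>;
};

// Toy dispatcher parameterized on the id-to-type map.
template <template <toy_type_id> class IdTypeMap = default_id_map, typename Functor>
decltype(auto) toy_type_dispatcher(toy_type_id id, Functor f)
{
  switch (id) {
    case toy_type_id::INT32:
      return f.template operator()<typename IdTypeMap<toy_type_id::INT32>::type>();
    case toy_type_id::FLOAT64:
      return f.template operator()<typename IdTypeMap<toy_type_id::FLOAT64>::type>();
    case toy_type_id::STRING:
      return f.template operator()<typename IdTypeMap<toy_type_id::STRING>::type>();
  }
  return f.template operator()<void>();  // unknown id: treat as unsupported
}

// Example functor: returns the element size, or -1 for an unsupported (void) type.
struct element_size_fn {
  template <typename T>
  int operator()() const
  {
    if constexpr (std::is_void_v<T>) {
      return -1;  // cheap fallback; the real per-type code is never instantiated
    } else {
      return static_cast<int>(sizeof(T));
    }
  }
};
```

With the default map, `toy_type_dispatcher(toy_type_id::STRING, element_size_fn{})` instantiates the `std::string` path. With `toy_type_dispatcher<primitive_only_map>(...)`, the same id reaches the `void` fallback instead.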
I was wondering the same as Vyas:
Yes, exactly. I think we can do something like that, except consign even more of the types in the first branch of the
Ah okay, this would also achieve what we need, but without the complexity in the
Adding results for reference: benchmarks from cudf, all join types, showing speedups from disabling complex types on A100.
Is your feature request related to a problem? Please describe.
The cuco insert kernel has poor occupancy due to high register usage during the hash table build operation executed by cuDF. If I disable some of the code paths for complex types (commenting out dict, string, list, struct, decimal) in
cudf/cpp/include/cudf/utilities/type_dispatcher.hpp
Line 456 in 434df44
I ran some experiments disabling different subsets of types; the list gives the disabled types -> register count for the insert kernel.
Here is the speedup I see on the mixed semi join kernel for int32 keys, obtained from the improved occupancy after disabling complex types.
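To make the link between register count and occupancy concrete, here is a back-of-the-envelope sketch. It assumes the A100 per-SM limits of 65,536 32-bit registers and 2,048 resident threads, and it ignores register allocation granularity, shared memory, and block-size limits, so it gives only an upper bound. The function name and the register counts mentioned afterward are illustrative, not measurements from this issue.

```cpp
#include <algorithm>

// A100 (compute capability 8.0) per-SM limits.
constexpr int registers_per_sm   = 65536;
constexpr int max_threads_per_sm = 2048;

// Upper-bound occupancy estimate when registers are the only limiting resource.
// Ignores allocation granularity, shared memory, and block-size constraints.
constexpr double estimated_occupancy(int regs_per_thread)
{
  int const threads_by_registers = registers_per_sm / regs_per_thread;
  int const resident_threads     = std::min(threads_by_registers, max_threads_per_sm);
  return static_cast<double>(resident_threads) / max_threads_per_sm;
}
```

Under these assumptions, a kernel using 32 registers per thread can reach full occupancy, 64 registers caps it at 50%, and 128 registers caps it at 25%. The CUDA occupancy calculator API (`cudaOccupancyMaxActiveBlocksPerMultiprocessor`) gives the accurate figure for a real kernel.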
Describe the solution you'd like
Improve occupancy by disabling codepaths for complex types.
Describe alternatives you've considered