[FEA] Support choosing hash functions in join APIs #10587
Comments
I'd prefer to just have this as an option for the […]
Exposing the […]
I'm thinking about a way to enable […]
@gaohao95 I presume that your use case is from C++ where this would be possible? Would you find it useful to be able to inject a custom hash function beyond the predefined set in the […]
Yes, that would work in my case. Can the custom hash function still be inlined?
Yeah, we'd probably just expose effectively a […]
I was thinking about making […]
Closes #10587

This PR adds a `detail::hash_join` class which is templated on the hash function. It also cleans up `join` internal functions by moving code around to proper files. The implementation of `detail::hash_join` is mainly taken from `cudf::hash_join::hash_join_impl`.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- David Wendt (https://github.com/davidwendt)

URL: #10695
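To illustrate what "templated on the hash function" buys, here is a minimal host-side sketch in plain C++. It is not the cuDF implementation: `toy_hash_join`, `identity_hash`, and `multiplicative_hash` are hypothetical names, and `std::unordered_multimap` stands in for the GPU hash table. Because the functor is a template parameter, its `operator()` is visible to the compiler at the call site and can be inlined, which is the concern raised in the comments above.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical functor: returns the key unchanged.
struct identity_hash {
  std::size_t operator()(std::int32_t key) const noexcept {
    return static_cast<std::size_t>(static_cast<std::uint32_t>(key));
  }
};

// Hypothetical functor: one multiply by a golden-ratio constant
// (a stand-in for any costlier user-supplied hash).
struct multiplicative_hash {
  std::size_t operator()(std::int32_t key) const noexcept {
    return static_cast<std::uint32_t>(key) * 2654435769u;
  }
};

// Toy hash join whose hash functor is chosen at compile time.
template <typename Hash>
class toy_hash_join {
 public:
  // Build phase: insert every build-side key with its row index.
  explicit toy_hash_join(std::vector<std::int32_t> const& build) {
    for (std::size_t i = 0; i < build.size(); ++i) { table_.emplace(build[i], i); }
  }

  // Probe phase: returns (build_index, probe_index) for every matching key.
  std::vector<std::pair<std::size_t, std::size_t>> inner_join(
    std::vector<std::int32_t> const& probe) const {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t j = 0; j < probe.size(); ++j) {
      auto range = table_.equal_range(probe[j]);
      for (auto it = range.first; it != range.second; ++it) {
        out.emplace_back(it->second, j);
      }
    }
    return out;
  }

 private:
  std::unordered_multimap<std::int32_t, std::size_t, Hash> table_;
};
```

Instantiating `toy_hash_join<identity_hash>` and `toy_hash_join<multiplicative_hash>` on the same inputs yields the same matches; only the hashing cost and the bucket distribution differ.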
Is your feature request related to a problem? Please describe.
I wish cuDF's public join APIs allowed users to choose which hash function to use.
In TPC-DS queries, we observed that a lot of joins have a small build table and a large probe table. NCU profiles suggest that when the build table (and the hash table) can fit inside the L1 cache, the probe kernels are compute bound. Further, cuDF's join benchmark suggests that in this case, performance can be improved if we use a cheaper hash function like the identity hash function instead of the default Murmur3 hash.
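To give a rough sense of why the hash choice matters in a compute-bound probe, the sketch below puts an identity hash next to a standalone MurmurHash3 x86_32 specialised for a single 32-bit key. This is written from the public MurmurHash3 algorithm and is not cuDF's or cuCo's code; the function names and the seed parameter are assumptions for illustration.

```cpp
#include <cstdint>

// Identity hash: no arithmetic at all.
inline std::uint32_t identity_hash32(std::uint32_t key) noexcept { return key; }

inline std::uint32_t rotl32(std::uint32_t x, int r) noexcept {
  return (x << r) | (x >> (32 - r));
}

// MurmurHash3 x86_32 for exactly one 4-byte block (the key), no tail bytes.
inline std::uint32_t murmur3_x86_32(std::uint32_t key, std::uint32_t seed) noexcept {
  std::uint32_t k1 = key;
  std::uint32_t h1 = seed;

  // Body: mix the single block into the state.
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  h1 = h1 * 5u + 0xe6546b64u;

  // Finalization: fold in the length (4 bytes), then avalanche (fmix32).
  h1 ^= 4u;
  h1 ^= h1 >> 16;
  h1 *= 0x85ebca6bu;
  h1 ^= h1 >> 13;
  h1 *= 0xc2b2ae35u;
  h1 ^= h1 >> 16;
  return h1;
}
```

Per key, the Murmur3 path executes five multiplies, two rotates, and several shifts and XORs, while the identity path executes none. When the hash table already sits in cache, that arithmetic is a plausible contributor to the compute-bound behaviour described above.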
The following results were collected on a TITAN V with a build table size of 400 and a probe table size of 200M.
By default, cuDF uses Murmur3 both for the hash value entry in the hash table and for cuCo:
If we use the identity hash for the hash value entry in the hash table:
If we change cuCo to use the identity hash function as well:
Describe the solution you'd like
Add a `hash_function` argument to cuDF's public join APIs. If the hash function is incompatible with the data type, a runtime error should be raised.
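One way such an argument could behave is sketched below in plain host-side C++. Everything here is hypothetical: the `hash_function` enum, its enumerators, the `inner_join` overloads, and the choice of `std::invalid_argument` are assumptions for illustration, not cuDF's API. The sketch shows a runtime selector dispatching to a templated implementation, and rejecting a hash function that does not fit the key type.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical selector; the request only names a `hash_function` argument.
enum class hash_function { DEFAULT_HASH, IDENTITY };

using join_result = std::vector<std::pair<std::size_t, std::size_t>>;

// Identity hash is only defined for integral keys in this sketch.
struct identity_hash {
  std::size_t operator()(std::int64_t key) const noexcept {
    return static_cast<std::size_t>(key);
  }
};

// Shared implementation, templated on key type and hash functor.
template <typename Key, typename Hash>
join_result join_with(std::vector<Key> const& build, std::vector<Key> const& probe) {
  std::unordered_multimap<Key, std::size_t, Hash> table;
  for (std::size_t i = 0; i < build.size(); ++i) { table.emplace(build[i], i); }
  join_result out;
  for (std::size_t j = 0; j < probe.size(); ++j) {
    auto range = table.equal_range(probe[j]);
    for (auto it = range.first; it != range.second; ++it) { out.emplace_back(it->second, j); }
  }
  return out;
}

// Integer keys: every selector value is valid.
inline join_result inner_join(std::vector<std::int64_t> const& build,
                              std::vector<std::int64_t> const& probe,
                              hash_function h = hash_function::DEFAULT_HASH) {
  switch (h) {
    case hash_function::IDENTITY:
      return join_with<std::int64_t, identity_hash>(build, probe);
    case hash_function::DEFAULT_HASH:
      return join_with<std::int64_t, std::hash<std::int64_t>>(build, probe);
  }
  throw std::invalid_argument("unknown hash function");
}

// String keys: the identity hash is incompatible, so fail at run time.
inline join_result inner_join(std::vector<std::string> const& build,
                              std::vector<std::string> const& probe,
                              hash_function h = hash_function::DEFAULT_HASH) {
  if (h == hash_function::IDENTITY) {
    throw std::invalid_argument("identity hash does not support string keys");
  }
  return join_with<std::string, std::hash<std::string>>(build, probe);
}
```

Calling `inner_join(build, probe, hash_function::IDENTITY)` on integer keys selects the cheap hash, while the same selector on string keys throws. The selector defaults to the existing behaviour, so callers that omit it are unaffected.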