-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Support selecting different hash functions in hash_partition #6307
Labels
Comments
gaohao95
added
Needs Triage
Need team to review and classify
feature request
New feature or request
labels
Sep 23, 2020
kkraus14
added
libcudf
Affects libcudf (C++/CUDA) code.
and removed
Needs Triage
Need team to review and classify
labels
Sep 24, 2020
Adding a template to
|
rapids-bot bot
pushed a commit
that referenced
this issue
Dec 3, 2020
This PR intends to - Allow `hash_partition` to select a different hash function (e.g. identity hash function) in additional to `MurmurHash3_32`. (Close #6307) - Remove redundant identical `hash_partition` implementation in `src/hash/hashing.cu`. Restrictions: - MD5 is not supported. Authors: - Hao Gao <[email protected]> Approvers: - Nikolay Sakharnykh - Mark Harris - Ram (Ramakrishna Prabhu) - Mark Harris URL: #6726
rapids-bot bot
pushed a commit
that referenced
this issue
Apr 1, 2021
This PR is to allow hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to a massive hash collision and severely degrades join performance on multiple GPUs. There was an initial fix (#6726) to this problem, but it added only the code path to use identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR is to enable such configurations by making `hash_partition` accept an optional seed value. Authors: - Wonchan Lee (https://github.com/magnatelee) Approvers: - https://github.com/gaohao95 - Mark Harris (https://github.com/harrism) - https://github.com/nvdbaranec - Jake Hemstad (https://github.com/jrhemstad) URL: #7771
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Is your feature request related to a problem? Please describe.
Right now, cuDF hard-codes
MurmurHash3_32
insidehash_partition
. Here is one example.cudf/cpp/src/hash/hashing.cu
Line 481 in 5976d74
I wish cuDF can allow the user of the cpp API to choose which hash function to use for
hash_partition
.Describe the solution you'd like
In terms of API, maybe we can add a template or additional argument of hash function to
cudf::detail::hash_partition
. Raise an error if the hash function is incompatible with the data type (e.g. identity hash function with string type).Additional context
@abc99lr discovered using different hash functions has significant impact on the distributed join performance due to hash conflict. For example, when hacking cuDF to use identify hash function instead of Murmur hash like this decreases the running time of distributed join from 0.581s to 0.345s using 4 DGX-1V nodes with 32 GPUs, with 50 million rows/table/GPU.
cc @nsakharnykh @abc99lr
The text was updated successfully, but these errors were encountered: