Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support selecting different hash functions in hash_partition #6307

Closed
gaohao95 opened this issue Sep 23, 2020 · 1 comment · Fixed by #6726
Closed

[FEA] Support selecting different hash functions in hash_partition #6307

gaohao95 opened this issue Sep 23, 2020 · 1 comment · Fixed by #6726
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@gaohao95
Copy link
Contributor

gaohao95 commented Sep 23, 2020

Is your feature request related to a problem? Please describe.

Right now, cuDF hard-codes MurmurHash3_32 inside hash_partition. Here is one example.

auto const hasher = row_hasher<MurmurHash3_32, hash_has_nulls>(*device_input);

I wish cuDF can allow the user of the cpp API to choose which hash function to use for hash_partition.

Describe the solution you'd like
In terms of API, maybe we can add a template or additional argument of hash function to cudf::detail::hash_partition. Raise an error if the hash function is incompatible with the data type (e.g. identity hash function with string type).

Additional context
@abc99lr discovered using different hash functions has significant impact on the distributed join performance due to hash conflict. For example, when hacking cuDF to use identify hash function instead of Murmur hash like this decreases the running time of distributed join from 0.581s to 0.345s using 4 DGX-1V nodes with 32 GPUs, with 50 million rows/table/GPU.

cc @nsakharnykh @abc99lr

@gaohao95 gaohao95 added Needs Triage Need team to review and classify feature request New feature or request labels Sep 23, 2020
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 24, 2020
@jrhemstad
Copy link
Contributor

Adding a template to detail::hash_partition for the hash function sounds good to me.

template <typename Hash>
std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
  table_view const& input,
  std::vector<size_type> const& columns_to_hash,
  int num_partitions,
  Hash&& hash
  rmm::mr::device_memory_resource* mr,
  cudaStream_t stream)

@rapids-bot rapids-bot bot closed this as completed in #6726 Dec 3, 2020
rapids-bot bot pushed a commit that referenced this issue Dec 3, 2020
This PR intends to
- Allow `hash_partition` to select a different hash function (e.g. identity hash function) in additional to `MurmurHash3_32`. (Close #6307)
- Remove redundant identical `hash_partition` implementation in `src/hash/hashing.cu`.

Restrictions:
- MD5 is not supported.

Authors:
  - Hao Gao <[email protected]>

Approvers:
  - Nikolay Sakharnykh
  - Mark Harris
  - Ram (Ramakrishna Prabhu)
  - Mark Harris

URL: #6726
rapids-bot bot pushed a commit that referenced this issue Apr 1, 2021
This PR is to allow hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to a massive hash collision and severely degrades join performance on multiple GPUs. There was an initial fix (#6726) to this problem, but it added only the code path to use identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR is to enable such configurations by making `hash_partition` accept an optional seed value.

Authors:
  - Wonchan Lee (https://github.com/magnatelee)

Approvers:
  - https://github.com/gaohao95
  - Mark Harris (https://github.com/harrism)
  - https://github.com/nvdbaranec
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #7771
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants