[FEA] Support selecting different hash functions in hash_partition #6307

gaohao95 · 2020-09-23T18:30:14Z

Is your feature request related to a problem? Please describe.

Right now, cuDF hard-codes MurmurHash3_32 inside hash_partition. Here is one example.

Line 481 in 5976d74

    
           auto const hasher       = row_hasher<MurmurHash3_32, hash_has_nulls>(*device_input);

I wish cuDF can allow the user of the cpp API to choose which hash function to use for hash_partition.

Describe the solution you'd like
In terms of API, maybe we can add a template or additional argument of hash function to cudf::detail::hash_partition. Raise an error if the hash function is incompatible with the data type (e.g. identity hash function with string type).

Additional context
@abc99lr discovered using different hash functions has significant impact on the distributed join performance due to hash conflict. For example, when hacking cuDF to use identify hash function instead of Murmur hash like this decreases the running time of distributed join from 0.581s to 0.345s using 4 DGX-1V nodes with 32 GPUs, with 50 million rows/table/GPU.

cc @nsakharnykh @abc99lr

The text was updated successfully, but these errors were encountered:

jrhemstad · 2020-09-25T01:40:31Z

Adding a template to detail::hash_partition for the hash function sounds good to me.

template <typename Hash>
std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
  table_view const& input,
  std::vector<size_type> const& columns_to_hash,
  int num_partitions,
  Hash&& hash
  rmm::mr::device_memory_resource* mr,
  cudaStream_t stream)

This PR intends to - Allow `hash_partition` to select a different hash function (e.g. identity hash function) in additional to `MurmurHash3_32`. (Close #6307) - Remove redundant identical `hash_partition` implementation in `src/hash/hashing.cu`. Restrictions: - MD5 is not supported. Authors: - Hao Gao <[email protected]> Approvers: - Nikolay Sakharnykh - Mark Harris - Ram (Ramakrishna Prabhu) - Mark Harris URL: #6726

This PR is to allow hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to a massive hash collision and severely degrades join performance on multiple GPUs. There was an initial fix (#6726) to this problem, but it added only the code path to use identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR is to enable such configurations by making `hash_partition` accept an optional seed value. Authors: - Wonchan Lee (https://github.com/magnatelee) Approvers: - https://github.com/gaohao95 - Mark Harris (https://github.com/harrism) - https://github.com/nvdbaranec - Jake Hemstad (https://github.com/jrhemstad) URL: #7771

gaohao95 added Needs Triage Need team to review and classify feature request New feature or request labels Sep 23, 2020

kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 24, 2020

gaohao95 mentioned this issue Nov 10, 2020

Support selecting different hash functions in hash_partition #6726

Merged

rapids-bot bot closed this as completed in #6726 Dec 3, 2020

magnatelee mentioned this issue Mar 30, 2021

Allow hash_partition to take a seed value #7771

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Support selecting different hash functions in hash_partition #6307

[FEA] Support selecting different hash functions in hash_partition #6307

gaohao95 commented Sep 23, 2020 •

edited

Loading

jrhemstad commented Sep 25, 2020

[FEA] Support selecting different hash functions in hash_partition #6307

[FEA] Support selecting different hash functions in hash_partition #6307

Comments

gaohao95 commented Sep 23, 2020 • edited Loading

jrhemstad commented Sep 25, 2020

gaohao95 commented Sep 23, 2020 •

edited

Loading