Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support choosing hash functions in join APIs #10587

Closed
gaohao95 opened this issue Apr 4, 2022 · 4 comments · Fixed by #10695
Closed

[FEA] Support choosing hash functions in join APIs #10587

gaohao95 opened this issue Apr 4, 2022 · 4 comments · Fixed by #10695
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@gaohao95
Copy link
Contributor

gaohao95 commented Apr 4, 2022

Is your feature request related to a problem? Please describe.

I wish cuDF's public join APIs allow users to choose which hash function to use.

In TPC-DS queries, we observed that a lot of joins have a small build table and a large probe table. NCU profiles suggest that when the build table (and the hash table) can fit inside the L1 cache, the probe kernels are compute bound. Further, cuDF's join benchmark suggests that in this case, performance can be improved if we use a cheaper hash function like the identity hash function instead of the default Murmur3 hash.

The following result is collected on TITAN-V with build table size of 400, and probe table size of 200M.

cuDF default is to use Murmur3 hash for both the hash value entry in the hash table, as well as using Murmur3 for cuCo:

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|----------|--------------|----------|------------------|------------------|---------|------------|-------|------------|-------|
|      I64 |          I64 |        0 |              400 |        200000000 |    122x | 123.201 ms | 1.27% | 123.197 ms | 1.27% |

If we use identity hash for the hash value entry in the hash table:

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|----------|--------------|----------|------------------|------------------|---------|------------|-------|------------|-------|
|      I64 |          I64 |        0 |              400 |        200000000 |    137x | 110.115 ms | 0.68% | 110.110 ms | 0.68% |

If we change cuCo to use identity hash function as well:

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples | CPU Time  | Noise | GPU Time  | Noise |
|----------|--------------|----------|------------------|------------------|---------|-----------|-------|-----------|-------|
|      I64 |          I64 |        0 |              400 |        200000000 |    167x | 90.224 ms | 1.47% | 90.219 ms | 1.47% |

Describe the solution you'd like

Add a hash_function argument to cuDF's public join APIs. For example,

std::unique_ptr<cudf::table> inner_join(
  cudf::table_view const& left,
  cudf::table_view const& right,
  std::vector<cudf::size_type> const& left_on,
  std::vector<cudf::size_type> const& right_on,
  null_equality compare_nulls         = null_equality::EQUAL,
  cudf::hash_id hash_function         = cudf::hash_id::HASH_MURMUR3,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

If the hash function is incompatible with the data type, runtime error should be raised.

@gaohao95 gaohao95 added Needs Triage Need team to review and classify feature request New feature or request labels Apr 4, 2022
@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue labels Apr 4, 2022
@jrhemstad
Copy link
Contributor

I'd prefer to just have this as an option for the hash_join object constructor instead of the freestanding cudf::inner_join APIs (which I think we're moving towards deprecating).

Exposing the cudf::hash_id provides some level of customization, but it's still limited to the finite set of options in that enum.

I'm thinking about a way to enable hash_join to be a template such that you could provide a custom hash function when compiling against template header of the hash_join object.

@gaohao95 I presume that your use case is from C++ where this would be possible? Would you find it useful to be able to inject a custom hash function beyond the predefined set in the cudf::hash_id enum? I could imagine a hash function that is better than Identity that would still be lower compute overhead than Murmur3.

@gaohao95
Copy link
Contributor Author

gaohao95 commented Apr 5, 2022

I'm thinking about a way to enable hash_join to be a template such that you could provide a custom hash function when compiling against template header of the hash_join object.

Yes that would work in my case. Can the custom hash function still be inlined?

@jrhemstad
Copy link
Contributor

Yes that would work in my case. Can the custom hash function still be inlined?

Yeah, we'd probably just expose effectively a detail::hash_join object that is templated on the hash function like cuco::static_map is.

@PointKernel
Copy link
Member

I was thinking about making hash_join a template also.

@PointKernel PointKernel self-assigned this Apr 5, 2022
rapids-bot bot pushed a commit that referenced this issue Apr 28, 2022
Closes #10587

This PR adds a `detail::hash_join` class which is templated on the hash function. It also cleans up `join` internal functions by moving code around to proper files. The implementation of `detail::hash_join` is mainly taken from `cudf::hash_join::hash_join_impl`.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)

URL: #10695
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants