Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve implememtation details in experimental data structures #345

Merged
merged 10 commits into from
Aug 6, 2023

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Aug 2, 2023

This PR fixes issues and adds new features requested by rapidsai/cudf#13807.

It:

  • removes the requirement of the second hasher from double hashing must be constructible from an integer
  • fixes an issue in map iterator != operator
  • overloads map iterator access operator
  • allows zero capacity container
  • adds capacity_test back since several corner cases need to be exercised

@PointKernel PointKernel added type: feature request New feature request helps: rapids Helps or needed by RAPIDS labels Aug 2, 2023
@PointKernel PointKernel changed the title Fix ref issues Improve several implememtation details in experimental data structures Aug 2, 2023
@PointKernel PointKernel changed the title Improve several implememtation details in experimental data structures Improve implememtation details in experimental data structures Aug 2, 2023
@sleeepyjack
Copy link
Collaborator

h1==h2 isn't ideal. If two distinct keys suffer from a hash collision, they will follow the same exact probing pattern. Using two different hash functions reduces the likelihood of running into these so-called secondary clustering effects.

@PointKernel
Copy link
Member Author

h1==h2 isn't ideal. If two distinct keys suffer from a hash collision, they will follow the same exact probing pattern. Using two different hash functions reduces the likelihood of running into these so-called secondary clustering effects.

Agreed. The following code will fail if the custom_hasher doesn't have an int ctor:

  auto my_probe = cuco::experimental::double_hashing<custom_hasher>(custom_hasher{});

This makes me realize that our current code is based on the assumption that only one hasher type is used in double hashing and the second hasher must have an integer ctor:

__host__ __device__ constexpr double_hashing(Hash1 const& hash1 = {}, Hash2 const& hash2 = {1});

This is a too-restrictive requirement and using two identical hashers is the first step to lower the bar. I haven't figured out a way to avoid the secondary collision yet by following this idea.

@PointKernel PointKernel added the In Progress Currently a work in progress label Aug 5, 2023
@PointKernel PointKernel removed the In Progress Currently a work in progress label Aug 6, 2023
@PointKernel
Copy link
Member Author

h1==h2 isn't ideal. If two distinct keys suffer from a hash collision, they will follow the same exact probing pattern. Using two different hash functions reduces the likelihood of running into these so-called secondary clustering effects.

Changes reverted at 8ac0ee8 since it requires a larger scope of adjustment/cleanup and will be in a separate PR.

@PointKernel PointKernel merged commit 5186b39 into NVIDIA:dev Aug 6, 2023
@PointKernel PointKernel deleted the fix-ref-issues branch August 6, 2023 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
helps: rapids Helps or needed by RAPIDS type: feature request New feature request
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants