Improve implementation details in experimental data structures #345

PointKernel · 2023-08-02T23:20:24Z

This PR fixes issues and adds new features requested by rapidsai/cudf#13807.

It:

removes the requirement that the second hasher in double hashing be constructible from an integer
fixes an issue in the map iterator's != operator
overloads the map iterator's arrow (->) access operator
allows zero-capacity containers
adds capacity_test back, since several corner cases need to be exercised
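The two iterator items can be illustrated with a minimal host-side sketch. `slot_iterator` is a hypothetical stand-in and not the cuco storage iterator; it only assumes that the iterator wraps a raw slot pointer. The point is that `!=` is written as the exact negation of `==`, and that `operator->` makes `it->first` available alongside `(*it).first`.

```cpp
#include <utility>

// Hypothetical stand-in for a slot iterator; not the actual cuco class.
template <typename T>
class slot_iterator {
 public:
  explicit constexpr slot_iterator(T* current) noexcept : current_{current} {}

  // Dereference: access the slot itself.
  constexpr T& operator*() const noexcept { return *current_; }

  // Arrow operator: enables `it->first` in addition to `(*it).first`.
  constexpr T* operator->() const noexcept { return current_; }

  friend constexpr bool operator==(slot_iterator const& lhs,
                                   slot_iterator const& rhs) noexcept
  {
    return lhs.current_ == rhs.current_;
  }

  // Defined as the negation of operator== so the two can never disagree.
  friend constexpr bool operator!=(slot_iterator const& lhs,
                                   slot_iterator const& rhs) noexcept
  {
    return !(lhs == rhs);
  }

 private:
  T* current_;  // pointer to the current slot
};
```

With two iterators over a small array of `std::pair<int, int>` slots, `first != second` holds for distinct slots, `first != first` does not, and `first->second` reads the payload of the first slot.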

sleeepyjack · 2023-08-04T15:25:18Z

h1==h2 isn't ideal. If two distinct keys suffer from a hash collision, they will follow the same exact probing pattern. Using two different hash functions reduces the likelihood of running into these so-called secondary clustering effects.
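To make the concern concrete, here is a small host-side sketch using the textbook form of double hashing, `slot_i = (h1(k) + i * step(k)) mod capacity` with `step(k) = h2(k) mod (capacity - 1) + 1`. This is an assumption for illustration and not necessarily the exact formula cuco uses; `low_bits_hash`, `high_bits_hash`, and `probe_sequence` are hypothetical names, and the toy hashers are chosen only so that a collision is easy to construct.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy hashers (hypothetical): keys 3 and 19 collide under low_bits_hash.
struct low_bits_hash {
  std::uint32_t operator()(std::uint32_t key) const noexcept { return key % 16u; }
};
struct high_bits_hash {
  std::uint32_t operator()(std::uint32_t key) const noexcept { return key / 16u; }
};

// Returns the first `num_probes` slots visited for `key`.
// Assumes capacity >= 2 so that the step size is well defined.
template <typename Hash1, typename Hash2>
std::vector<std::size_t> probe_sequence(std::uint32_t key,
                                        Hash1 h1,
                                        Hash2 h2,
                                        std::size_t capacity,
                                        std::size_t num_probes)
{
  std::size_t const start = h1(key) % capacity;
  std::size_t const step  = h2(key) % (capacity - 1) + 1;  // never zero
  std::vector<std::size_t> slots;
  slots.reserve(num_probes);
  for (std::size_t i = 0; i < num_probes; ++i) {
    slots.push_back((start + i * step) % capacity);
  }
  return slots;
}
```

With a capacity of 11, keys 3 and 19 visit identical slots when `low_bits_hash` is used for both `h1` and `h2`, because the start slot and the step are derived from the same colliding hash value. Passing `high_bits_hash` as `h2` gives the two keys different steps, so their sequences diverge after the first slot.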

PointKernel · 2023-08-04T19:08:43Z

> h1==h2 isn't ideal. If two distinct keys suffer from a hash collision, they will follow the same exact probing pattern. Using two different hash functions reduces the likelihood of running into these so-called secondary clustering effects.

Agreed. The following code will fail if the custom_hasher doesn't have an int ctor:

  auto my_probe = cuco::experimental::double_hashing<custom_hasher>(custom_hasher{});

This makes me realize that our current code is based on the assumption that only one hasher type is used in double hashing and the second hasher must have an integer ctor:

cuCollections/include/cuco/probing_scheme.cuh, line 112 in 6bc62c6:

  __host__ __device__ constexpr double_hashing(Hash1 const& hash1 = {}, Hash2 const& hash2 = {1});

This requirement is too restrictive, and using two identical hashers is a first step toward relaxing it. I haven't yet figured out a way to avoid secondary collisions with this approach.
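A simplified host-only sketch of the relaxed constructor idea follows. `double_hashing_sketch` and this `custom_hasher` are hypothetical and omit the `__host__ __device__` qualifiers; the sketch only assumes that the second template parameter defaults to the first. Default-constructing the second hasher with `{}` instead of `{1}` removes the need for an integer constructor, at the cost that both hashers are then identical.

```cpp
#include <type_traits>

// Hypothetical user-defined hasher with no integer constructor.
struct custom_hasher {
  custom_hasher() = default;
  unsigned operator()(unsigned key) const noexcept { return key * 2654435761u; }
};

// With a default argument of `{1}` for the second hasher, this trait would
// have to be true for the call below to compile.
static_assert(!std::is_constructible<custom_hasher, int>::value,
              "custom_hasher cannot be built from an integer");

// Simplified stand-in for a double hashing scheme; not the cuco class.
template <typename Hash1, typename Hash2 = Hash1>
class double_hashing_sketch {
 public:
  // Both hashers default to `{}`, so only default construction is required.
  double_hashing_sketch(Hash1 const& hash1 = {}, Hash2 const& hash2 = {})
    : hash1_{hash1}, hash2_{hash2}
  {
  }

  Hash1 const& first() const noexcept { return hash1_; }
  Hash2 const& second() const noexcept { return hash2_; }

 private:
  Hash1 hash1_;
  Hash2 hash2_;
};

// Compiles without an integer constructor. Returns true because the two
// default-constructed hashers are identical, which is the h1 == h2 case.
inline bool hashers_are_identical(unsigned key)
{
  double_hashing_sketch<custom_hasher> probe{custom_hasher{}};
  return probe.first()(key) == probe.second()(key);
}
```

The helper returning `true` for any key is exactly the drawback raised earlier in the thread: the construction compiles, but both hash functions produce the same values.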

PointKernel · 2023-08-06T21:40:40Z

> h1==h2 isn't ideal. If two distinct keys suffer from a hash collision, they will follow the same exact probing pattern. Using two different hash functions reduces the likelihood of running into these so-called secondary clustering effects.

Changes reverted in 8ac0ee8, since this requires a broader adjustment/cleanup and will be handled in a separate PR.

PointKernel added 4 commits August 2, 2023 15:35

Use default ctor for the second hasher in double hashing

30789df

Make extent ctor implicit

8f7743d

Fix not logic

8226d69

Overload storage iterator arrow operator

19482ff

PointKernel added the "type: feature request" and "helps: rapids" labels Aug 2, 2023

PointKernel changed the title ~~Fix ref issues~~ Improve several implementation details in experimental data structures Aug 2, 2023

PointKernel mentioned this pull request Aug 2, 2023

Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs rapidsai/cudf#13807

Merged

3 tasks

PointKernel changed the title ~~Improve several implementation details in experimental data structures~~ Improve implementation details in experimental data structures Aug 2, 2023

PointKernel added 5 commits August 3, 2023 10:24

Allow zero capacity container

13770a5

Fix min capacity logic

26c3b7f

Fix type conversion issues

f4ff901

Merge remote-tracking branch 'upstream/dev' into fix-ref-issues

3ee1f9e

Add capacity test back

7c76204

PointKernel added the "In Progress" label Aug 5, 2023

Revert double hashing changes

8ac0ee8

PointKernel removed the "In Progress" label Aug 6, 2023

PointKernel merged commit 5186b39 into NVIDIA:dev Aug 6, 2023

PointKernel deleted the fix-ref-issues branch August 6, 2023 22:56
