Add XXHash_64 hash function to cudf #13612

davidwendt · 2023-06-23T20:46:09Z

Description

Add XXHash_64 hash function to libcudf

std::unique_ptr<column> xxhash_64(
  table_view const& input,
  uint64_t seed,
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr);
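
For orientation, here is a minimal host-side sketch of the standard XXH64 algorithm over a plain byte buffer, assuming a little-endian host. It is not the libcudf device code: the `sketch` namespace and the `xxhash_64_bytes` name are hypothetical, and how libcudf combines columns or treats nulls is not shown in this PR description and is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace sketch {  // hypothetical namespace, for illustration only

// The five 64-bit primes of XXH64, written in hex.
constexpr std::uint64_t prime1 = 0x9E3779B185EBCA87ULL;
constexpr std::uint64_t prime2 = 0xC2B2AE3D27D4EB4FULL;
constexpr std::uint64_t prime3 = 0x165667B19E3779F9ULL;
constexpr std::uint64_t prime4 = 0x85EBCA77C2B2AE63ULL;
constexpr std::uint64_t prime5 = 0x27D4EB2F165667C5ULL;

inline std::uint64_t rotate_left(std::uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

// memcpy-based loads are safe for any alignment (little-endian host assumed).
inline std::uint64_t load64(unsigned char const* p)
{
  std::uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
inline std::uint32_t load32(unsigned char const* p)
{
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

inline std::uint64_t xxh_round(std::uint64_t acc, std::uint64_t input)
{
  acc += input * prime2;
  return rotate_left(acc, 31) * prime1;
}
inline std::uint64_t merge_round(std::uint64_t acc, std::uint64_t val)
{
  acc ^= xxh_round(0, val);
  return acc * prime1 + prime4;
}

inline std::uint64_t xxhash_64_bytes(void const* data, std::size_t len, std::uint64_t seed)
{
  auto p         = static_cast<unsigned char const*>(data);
  auto const end = p + len;
  std::uint64_t h;
  if (len >= 32) {
    // Four accumulators consume the input in 32-byte stripes.
    std::uint64_t v1 = seed + prime1 + prime2;
    std::uint64_t v2 = seed + prime2;
    std::uint64_t v3 = seed;
    std::uint64_t v4 = seed - prime1;
    do {
      v1 = xxh_round(v1, load64(p));
      v2 = xxh_round(v2, load64(p + 8));
      v3 = xxh_round(v3, load64(p + 16));
      v4 = xxh_round(v4, load64(p + 24));
      p += 32;
    } while (end - p >= 32);
    h = rotate_left(v1, 1) + rotate_left(v2, 7) + rotate_left(v3, 12) + rotate_left(v4, 18);
    h = merge_round(h, v1);
    h = merge_round(h, v2);
    h = merge_round(h, v3);
    h = merge_round(h, v4);
  } else {
    h = seed + prime5;
  }
  h += len;
  // Finalize: remaining 8-byte chunks, then one 4-byte chunk, then single bytes.
  while (end - p >= 8) {
    h ^= xxh_round(0, load64(p));
    h = rotate_left(h, 27) * prime1 + prime4;
    p += 8;
  }
  if (end - p >= 4) {
    h ^= load32(p) * prime1;
    h = rotate_left(h, 23) * prime2 + prime3;
    p += 4;
  }
  while (p < end) {
    h ^= *p * prime5;
    h = rotate_left(h, 11) * prime1;
    ++p;
  }
  // Avalanche step.
  h ^= h >> 33;
  h *= prime2;
  h ^= h >> 29;
  h *= prime3;
  h ^= h >> 32;
  return h;
}

}  // namespace sketch
```

Because every load goes through `memcpy`, the result does not depend on where the buffer starts in memory.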

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cpp/src/hash/xxhash64.cu

bdice · 2023-06-30T01:03:39Z

@davidwendt Note that a bug was found in the cuCollections implementation here: NVIDIA/cuCollections#326

davidwendt · 2023-06-30T11:41:04Z

> Note that a bug was found in the cuCollections implementation here: NVIDIA/cuCollections#326

Already fixed in a5a0d4b

cpp/src/hash/xxhash64.cu

bdice

Pick one of the following and then I can approve:

xxhash64 (cudf API) and XXHash64 (functor)
xxhash_64 (cudf API) and XXHash_64 (functor)

cpp/include/cudf/hashing.hpp

cpp/tests/hashing/xxhash_64_test.cpp

cpp/src/hash/xxhash_64.cu

sleeepyjack · 2023-07-19T11:37:56Z

cpp/src/hash/xxhash_64.cu

+    auto block = reinterpret_cast<uint8_t const*>(data + offset);
+    return block[0] | (block[1] << 8) | (block[2] << 16) | (block[3] << 24);


This will always emit four pipelined LDG.E.U8 instructions. I wonder if we should add an extra path that performs a single LDG.E.32 when the pointer is correctly aligned.
Dumb question: when can the start of a string not be aligned to 4 bytes?

Instead of loading and shifting the result, a common pattern is to use a memcpy for this:

uint32_t ret; memcpy(&ret, block, sizeof(uint32_t)); return ret;

A string is almost never aligned to 4 bytes. A string is rarely allocated individually but is usually part of a larger contiguous block of memory.
The plan is to move these block functions into a separate utilities header where I think we could optimize based on type.
Reference #13706

> A string is rarely allocated individually but usually part of a larger contiguous block of memory.

Good point! Let's leave this as-is for now then.
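
The two load patterns discussed in this thread can be compared in a small host-side sketch. The function names are hypothetical and `unsigned char` stands in for whatever byte type the kernel uses. The byte-wise version always assembles the value in little-endian order, while the `memcpy` version yields the host's native byte order, so the two agree only on a little-endian machine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Byte-wise load, as in the diff: four single-byte reads combined with shifts.
// Each byte is widened to uint32_t first so the shifts never touch a signed int.
inline std::uint32_t load_u32_bytes(unsigned char const* block)
{
  return static_cast<std::uint32_t>(block[0]) | (static_cast<std::uint32_t>(block[1]) << 8) |
         (static_cast<std::uint32_t>(block[2]) << 16) |
         (static_cast<std::uint32_t>(block[3]) << 24);
}

// memcpy load: valid for any alignment, and the compiler is free to pick
// the widest load instruction it can prove safe.
inline std::uint32_t load_u32_memcpy(unsigned char const* block)
{
  std::uint32_t ret;
  std::memcpy(&ret, block, sizeof(ret));
  return ret;
}
```

Neither version requires `block` to be 4-byte aligned, which matters because a string usually starts at an arbitrary offset inside a larger buffer.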

cpp/src/hash/xxhash_64.cu

davidwendt · 2023-07-19T16:10:17Z

/merge

When hashing large keys, e.g., strings, we traverse the input key iteratively in chunks of 4 or 8 bytes. The current implementation of the `load_chunk` function incorrectly assumes that the start of the key is always aligned to the chunk size, which is not always the case (see the discussion in rapidsai/cudf#13612). Additionally, this PR fixes some uncaught `[-Wmaybe-uninitialized]` warnings when compiling the unit tests.
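
One way to handle both cases is an aligned fast path with an unaligned fallback. The sketch below is a host-side illustration under stated assumptions, not the cuCollections `load_chunk`: the name `load_chunk_u32` and its signature are hypothetical, the key is assumed to live in storage backed by 32-bit words, and the expected values in the usage note assume a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a 4-byte chunk starting `byte_offset` bytes into
// word-aligned storage.
inline std::uint32_t load_chunk_u32(std::uint32_t const* storage, std::size_t byte_offset)
{
  if (byte_offset % sizeof(std::uint32_t) == 0) {
    // Aligned start: a single 32-bit load is enough.
    return storage[byte_offset / sizeof(std::uint32_t)];
  }
  // Unaligned start: copy the four bytes individually instead of assuming alignment.
  auto const* bytes = reinterpret_cast<unsigned char const*>(storage) + byte_offset;
  std::uint32_t chunk;
  std::memcpy(&chunk, bytes, sizeof(chunk));
  return chunk;
}
```

With storage `{0x03020100, 0x07060504, 0x0B0A0908}`, offset 4 takes the aligned path and returns `0x07060504`, while offset 1 takes the fallback and returns `0x04030201` on a little-endian host.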

davidwendt added 2 commits June 23, 2023 16:17

Add XXHash_64 hash function to cudf

05868c9

Merge branch 'branch-23.08' into fea-xxhash64

a4dd39b

davidwendt added the feature request, 2 - In Progress, libcudf, and non-breaking labels Jun 23, 2023

davidwendt self-assigned this Jun 23, 2023

github-actions bot added the CMake label Jun 23, 2023

davidwendt added 4 commits June 26, 2023 15:51

fix typo in constant

f39bba0

Merge branch 'branch-23.08' into fea-xxhash64

53603fc

fix rotate function

f9436d3

Merge branch 'branch-23.08' into fea-xxhash64

f318e72

bdice reviewed Jun 27, 2023

cpp/src/hash/xxhash64.cu (outdated)

davidwendt added 6 commits June 27, 2023 14:52

add string test; convert primes to hex; fix getblock logic

de00dc4

fully-qualify calls to detail functions

660357e

Merge branch 'branch-23.08' into fea-xxhash64

25202b6

fix bug in xxhash finalize step

a5a0d4b

Merge branch 'branch-23.08' into fea-xxhash64

c9c0624

Merge branch 'branch-23.08' into fea-xxhash64

df5b8db

davidwendt added 5 commits June 30, 2023 07:55

Merge branch 'branch-23.08' into fea-xxhash64

e4abab3

add gtests for integer, double, fixed-point

aaafd8e

Merge branch 'branch-23.08' into fea-xxhash64

ac16d14

Merge branch 'branch-23.08' into fea-xxhash64

e3b6839

Merge branch 'branch-23.08' into fea-xxhash64

9dc57e4

davidwendt changed the title ~~Add XXHash_64 hash function to cudf~~ Add XXHash_64 hash function to cudf Jul 6, 2023

davidwendt added 3 commits July 10, 2023 08:10

fix merge conflicts

3a5dd7d

rename hash64 to xxhash64

3b4dbb8

local conflict fix

41c64b4

davidwendt added 3 commits July 17, 2023 13:58

rename test source file

e1e7b8d

Merge branch 'branch-23.08' into fea-xxhash64

9a64914

Merge branch 'fea-xxhash64' of github.com:davidwendt/cudf into fea-xxhash64

0624314

davidwendt requested a review from bdice July 17, 2023 19:34

bdice reviewed Jul 17, 2023

cpp/src/hash/xxhash64.cu (outdated)

bdice reviewed Jul 17, 2023

rename xxhash64 to xxhash_64

cdec016

harrism removed their request for review July 17, 2023 23:14

davidwendt added 3 commits July 18, 2023 06:43

Merge branch 'fea-xxhash64' of github.com:davidwendt/cudf into fea-xxhash64

b8f7b72

fix merge conflicts

3586abe

fix cmake style violation

74cda04

davidwendt requested a review from bdice July 18, 2023 11:32

davidwendt changed the title ~~Add XXHash_64 hash function to cudf~~ Add XXHash_64 hash function to cudf Jul 18, 2023

Merge branch 'branch-23.08' into fea-xxhash64

e2b9197

karthikeyann reviewed Jul 18, 2023

cpp/include/cudf/hashing.hpp (outdated)

cpp/tests/hashing/xxhash_64_test.cpp (outdated)

cpp/src/hash/xxhash_64.cu (outdated)

add some const decls

d47d624

bdice approved these changes Jul 18, 2023

davidwendt requested a review from karthikeyann July 18, 2023 20:27

fix doxygen wording for the hash APIs

606b736

karthikeyann approved these changes Jul 19, 2023

sleeepyjack reviewed Jul 19, 2023

davidwendt added 2 commits July 19, 2023 09:08

use device-span

a5e3838

Merge branch 'branch-23.08' into fea-xxhash64

f1f39e9

sleeepyjack approved these changes Jul 19, 2023

rapids-bot bot merged commit 541c5bf into rapidsai:branch-23.08 Jul 19, 2023

davidwendt deleted the fea-xxhash64 branch July 19, 2023 16:10

sleeepyjack mentioned this pull request Jul 28, 2023

Fix memory alignment issues in hash computation NVIDIA/cuCollections#338

Merged

ttnghia mentioned this pull request Nov 1, 2024

[Do not Review] Support hyper log log plus plus(HLL++) #17133

Closed

10 tasks
