Support Bloom filters #1303

nvdbaranec · 2023-07-28T22:38:35Z

Adds support for Spark-style bloom filters via the BloomFilter class. The GPU implementation lives in spark-rapids-jni itself rather than in cudf.

This version of the PR uses a different style of interface that encapsulates the entire Spark serialized blob of bloom filter data. It will probably render #1269 obsolete.
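For readers unfamiliar with the blob in question, here is a minimal sketch of the layout that Spark's own `BloomFilterImpl` (version 1) writes through a `DataOutputStream`, as far as I understand it: three big-endian 32-bit integers (version, number of hash functions, number of 64-bit words) followed by the words themselves, also big-endian. The function names are hypothetical and the code is plain Python for illustration only; it is not the interface added in this PR.

```python
import struct

# Header: version, num_hashes, num_longs as big-endian signed 32-bit ints.
HEADER = struct.Struct(">iii")


def pack_bloom_filter(num_hashes, words, version=1):
    """Serialize 64-bit words behind a 12-byte header (assumed Spark V1 layout)."""
    body = struct.pack(f">{len(words)}Q", *words)
    return HEADER.pack(version, num_hashes, len(words)) + body


def unpack_bloom_filter(blob):
    """Validate the header and return (num_hashes, words)."""
    if len(blob) < HEADER.size:
        raise ValueError("buffer is too small to hold a bloom filter header")
    version, num_hashes, num_longs = HEADER.unpack_from(blob, 0)
    if version != 1:
        raise ValueError(f"unexpected bloom filter version {version}")
    if num_hashes <= 0 or num_longs <= 0:
        raise ValueError("num_hashes and num_longs must be positive")
    if len(blob) != HEADER.size + 8 * num_longs:
        raise ValueError("buffer size does not match num_longs")
    words = list(struct.unpack_from(f">{num_longs}Q", blob, HEADER.size))
    return num_hashes, words


blob = pack_bloom_filter(3, [1, 2**63])
num_hashes, words = unpack_bloom_filter(blob)
```

Because every word is 64 bits wide, the usable bit count is always `64 * num_longs`, which is consistent with the restriction elsewhere in this PR that bloom filter sizes be a multiple of 64 bits.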

Added benchmark for bloom_filter_put. On an A5000, we're getting 140 GB/s write-throughput for bloom filter sizes of 512k, 1MB, 2MB, 4MB and 8MB. 12.5 milliseconds for 150 million rows. So it's not lightning fast, but it's serviceable.

Also fixed several assorted benchmark build errors. The cudf push for always providing null counts and specifying stream/mr broke a few of them.

…nents instead of an instance. Change BloomFilterInterfaces to take a BaseDeviceMemoryBuffer instead of a DeviceMemoryBuffer. Handle some exception cases. Reordered some function parameter lists for consistency/cleanliness.

jlowe

Tested with NVIDIA/spark-rapids#8775 as well as in-progress BloomFilterAggregate code. Minor nit on better comments for CudfAccessor usage.

src/main/java/ai/rapids/cudf/CudfAccessor.java

jlowe · 2023-07-31T21:53:00Z

build

nvdbaranec added 30 commits June 30, 2023 11:15

Back port spark-specific murmur32 hash code from cudf.

b03c47d

Run pre-commit to format files. We were behind a bit.

03f18eb

Merge branch 'pre_commit_pass' into murmur_hash_move

c06f9c1

Update pre-commit config to 16.0.1 to match cudf. Re-ran formatting.

39cec08

Merge branch 'pre_commit_pass' into murmur_hash_move

59aed1b

Change jni bindings to use the spark-rapids-jni implementation of mur…

963f475

…mur hash instead of the cudf version. Brought over cpp and java tests.

Documentation fix.

029ce11

Fix cpp tests to actually call the spark_rapids_jni murmur hash.

acb834c

First pass at xxhash64. cpp tests passing.

8db7b76

Improve cpp tests - null cases and more floating point edge cases.

cb11e73

Add Java tests.

0b85a03

Moved murmur32 hash implementation from cudf to spark-rapids-jni

a63e155

Signed-off-by: db <[email protected]>

PR review changes.

b59cab4

Fix copyright data in Hash.java

622f89b

Enable 32 bit decimal hash test.

8af2107

Implement xxhash64 on the gpu

55eafd0

Signed-off-by: db <[email protected]>

Merge branch 'branch-23.08' into murmur_hash_move

3583eea

Add missing newlines.

14b9e29

Merge branch 'murmur_hash_move' into xxhash64_support

7f3ed1e

PR review changes.

ed4f54b

Merge branch 'murmur_hash_move' into xxhash64_support

d7a7e16

Remove default xxhash64 class constructor. Remove unused parameter (r…

6d67c00

…ow index) from remaining constructor.

Merge thirdparty/cudf from 23.08

1278dff

Basic bloom filter support. c++ side only. Could use some more tests.

208dd29

More tests.

7886e4f

Rectify thirdparty/cudf

21217e7

Merge branch 'branch-23.08' into bloom_filter

31eff2c

Java bindings and tests.

49a24be

Merge branch 'branch-23.08' into bloom_filter

96f1e0c

Add more tests and general cleanup.

23881e1

Signed-off-by: db <[email protected]>

nvdbaranec added 9 commits July 19, 2023 14:57

Submodule update

18ab06a

Fix small issue from cudf merge.

8d8f2a6

Wave of PR review feedback.

4fb532e

Change an Exception to a Throwable.

dfd8c3a

Produce big-endian swizzled bloom filters from the GPU. Change the Bl…

9290d0a

…oomFilter class to be more restrictive about bloom filter bit sizes: must always be a multiple of 64 bits.

Change bloom filter Java functions to use a long for bloomFilterBits.…

d1307d3

… Handles nulls in the c++ code : build will ignore null input values and probe will return null for any input value.

Java tests for build/probe with null inputs.

6c27db5

Rework BloomFilter interface to wrap the entire Spark bloom filter bu…

e1e35bf

…ffer as an opaque cudf Scalar.
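The sizing and null-handling rules from the commits above (bit sizes must be a multiple of 64; build ignores null inputs; probe returns null for a null input, which is how I read the d1307d3 message) can be sketched on the CPU as follows. This is a hedged illustration, not the GPU kernel: the hash is a BLAKE2b stand-in rather than Spark's Murmur3, the double-hashing bit selection is an assumption, and all function names are hypothetical.

```python
import hashlib


def new_filter(num_bits):
    """Allocate a filter as a list of 64-bit words; size must be a multiple of 64."""
    if num_bits <= 0 or num_bits % 64 != 0:
        raise ValueError("bloom filter size must be a positive multiple of 64 bits")
    return [0] * (num_bits // 64)


def _two_hashes(value):
    # Stand-in hash (NOT Spark's Murmur3): split an 8-byte BLAKE2b digest in two.
    raw = value.to_bytes(8, "little", signed=True)
    digest = hashlib.blake2b(raw, digest_size=8).digest()
    return int.from_bytes(digest[:4], "little"), int.from_bytes(digest[4:], "little")


def _bit_positions(value, num_hashes, num_bits):
    # Assumed double-hashing scheme: the i-th position is (h1 + i * h2) mod num_bits.
    h1, h2 = _two_hashes(value)
    return [(h1 + i * h2) % num_bits for i in range(1, num_hashes + 1)]


def build(words, num_hashes, values):
    """Set bits for every non-null value; null (None) inputs are skipped."""
    num_bits = len(words) * 64
    for value in values:
        if value is None:
            continue
        for pos in _bit_positions(value, num_hashes, num_bits):
            words[pos // 64] |= 1 << (pos % 64)


def probe(words, num_hashes, values):
    """Return True/False per value, or None where the input itself is null."""
    num_bits = len(words) * 64
    results = []
    for value in values:
        if value is None:
            results.append(None)
            continue
        positions = _bit_positions(value, num_hashes, num_bits)
        results.append(all((words[p // 64] >> (p % 64)) & 1 for p in positions))
    return results


bloom = new_filter(128)
build(bloom, 3, [1, None, 42])
hits = probe(bloom, 3, [1, None, 42])
```

A value that was inserted always probes as `True`; a value that was never inserted usually probes as `False` but may be a false positive, which is inherent to bloom filters.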

nvdbaranec requested a review from jlowe July 28, 2023 22:38

nvdbaranec marked this pull request as draft July 28, 2023 22:38

nvdbaranec added 2 commits July 28, 2023 19:09

Doc updates. Add checking to the merge function to verify all input b…

529e9be

…loom filters have matching num_hashes and num_longs parameters.

Re-enable Java merge tests. Update benchmarks.

a4c4581
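The merge check described in commit 529e9be above (all inputs must share num_hashes and num_longs) can be illustrated with a small host-side sketch. Merging as a word-by-word bitwise OR is the standard bloom filter union and is an assumption about what the GPU merge computes; `HostBloomFilter` and `merge` are hypothetical names, not part of this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostBloomFilter:
    """Hypothetical host-side stand-in for one unpacked bloom filter."""
    num_hashes: int
    words: List[int]


def merge(filters):
    """OR the filters together after checking that their parameters match."""
    if not filters:
        raise ValueError("need at least one bloom filter to merge")
    first = filters[0]
    for other in filters[1:]:
        if other.num_hashes != first.num_hashes:
            raise ValueError("mismatched num_hashes")
        if len(other.words) != len(first.words):
            raise ValueError("mismatched num_longs")
    merged = list(first.words)
    for other in filters[1:]:
        for i, word in enumerate(other.words):
            merged[i] |= word
    return HostBloomFilter(first.num_hashes, merged)


a = HostBloomFilter(3, [0b0101, 0])
b = HostBloomFilter(3, [0b0011, 1 << 63])
combined = merge([a, b])
```

Validating before merging matters because filters built with different hash counts or sizes map the same value to different bits, so an OR of mismatched filters would silently produce wrong probe results.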

nvdbaranec added the feature request label Jul 29, 2023

nvdbaranec marked this pull request as ready for review July 29, 2023 00:43

Change bloom filter list_scalar type to be uint8. Add an additional i…

305dc32

…nterface for probing directly from a buffer. Improve error checking in unpacking code.

jlowe previously approved these changes Jul 31, 2023

src/main/java/ai/rapids/cudf/CudfAccessor.java

jlowe mentioned this pull request Jul 31, 2023

Support BloomFilterMightContain expression NVIDIA/spark-rapids#8775

Merged

nvdbaranec mentioned this pull request Jul 31, 2023

[FEA] Extend the Scalar API with functionality to allow access to native handles. #1307

Closed

Add a note and reference to an issue for removing the package/bounce …

3f088a3

…workaround for certain Scalar accessors.

nvdbaranec dismissed jlowe’s stale review via 3f088a3 July 31, 2023 20:08

Eof newline.

5d6ebe0

jlowe approved these changes Jul 31, 2023

nvdbaranec merged commit d22259a into NVIDIA:branch-23.08 Aug 2, 2023

nvdbaranec mentioned this pull request Aug 2, 2023

Add bloom filter support. #1269

Closed

jlowe changed the title ~~Rework BloomFilter interface~~ Support Bloom filters Aug 2, 2023

This was linked to issues Aug 2, 2023

[FEA] Implement Bloom Filter kernel for update #1053

Closed

[FEA] Implement Bloom Filter kernel for merge #1054

Closed

[FEA] Implement Bloom Filter probe kernel for testing #1055

Closed
