Replace cudf's concurrent_ordered_map with cuco::static_map in semi/anti joins #9666

vyasr · 2021-11-12T00:17:06Z

This PR resolves #9586, replacing the hash table used in semi and anti joins with cuco::static_map. It depends on NVIDIA/cuCollections#118. At present the code is slower than the original version, so we'll probably want to make some optimizations in cuco before merging this.
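For readers unfamiliar with the operation, a left semi join keeps the left rows that have at least one match in the right table, and a left anti join keeps the left rows that have none. The sketch below is a minimal host-side illustration only: `semi_or_anti_join` is a hypothetical helper, and `std::unordered_set` stands in for the device hash table (this PR uses `cuco::static_map` on the GPU, which is not shown here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not a libcudf API: returns the indices of `left` rows
// to keep. keep_matches == true gives a left semi join; keep_matches == false
// gives a left anti join.
std::vector<std::size_t> semi_or_anti_join(std::vector<std::int32_t> const& left,
                                           std::vector<std::int32_t> const& right,
                                           bool keep_matches)
{
  // Build phase: insert every right key into a hash set (duplicates collapse).
  std::unordered_set<std::int32_t> const build(right.begin(), right.end());

  // Probe phase: keep a left row when its membership matches the join type.
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < left.size(); ++i) {
    bool const found = build.count(left[i]) > 0;
    if (found == keep_matches) { result.push_back(i); }
  }

  // Each left row appears at most once in the output of a semi/anti join.
  assert(result.size() <= left.size());
  return result;
}
```

Because only membership matters, the build side needs a set-like structure rather than a multimap, which is why a single-value map is sufficient for these join types.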

vyasr · 2021-11-12T00:24:43Z

Benchmarks

Old code:

Join<int32_t, int32_t>/left_semi_join_32bit/100000/100000/manual_time                                    0.156 ms        0.177 ms         4409
Join<int32_t, int32_t>/left_semi_join_32bit/100000/400000/manual_time                                    0.259 ms        0.278 ms         2699
Join<int32_t, int32_t>/left_semi_join_32bit/100000/1000000/manual_time                                   0.474 ms        0.493 ms         1479
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/10000000/manual_time                                 30.8 ms         30.9 ms           23
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/40000000/manual_time                                 61.3 ms         61.3 ms           11
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/100000000/manual_time                                 122 ms          122 ms            6
Join<int32_t, int32_t>/left_semi_join_32bit/100000000/100000000/manual_time                                313 ms          313 ms            2
Join<int32_t, int32_t>/left_semi_join_32bit/80000000/240000000/manual_time                                 414 ms          414 ms            2
Join<int64_t, int64_t>/left_semi_join_64bit/50000000/50000000/manual_time                                  164 ms          164 ms            4
Join<int64_t, int64_t>/left_semi_join_64bit/40000000/120000000/manual_time                                 225 ms          225 ms            3
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time                              0.151 ms        0.173 ms         4556
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/400000/manual_time                              0.197 ms        0.218 ms         3482
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/1000000/manual_time                             0.265 ms        0.283 ms         2654
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/10000000/manual_time                           7.08 ms         7.10 ms           99
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/40000000/manual_time                           14.6 ms         14.6 ms           48
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/100000000/manual_time                          32.3 ms         32.3 ms           18
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000000/100000000/manual_time                         76.7 ms         76.7 ms            8
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/80000000/240000000/manual_time                           107 ms          107 ms            5
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/50000000/50000000/manual_time                           43.5 ms         43.6 ms           13
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/40000000/120000000/manual_time                          66.0 ms         66.0 ms            8

New code:

Join<int32_t, int32_t>/left_semi_join_32bit/100000/100000/manual_time                                    0.426 ms        0.446 ms         1613
Join<int32_t, int32_t>/left_semi_join_32bit/100000/400000/manual_time                                    0.840 ms        0.860 ms          833
Join<int32_t, int32_t>/left_semi_join_32bit/100000/1000000/manual_time                                    1.44 ms         1.45 ms          490
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/10000000/manual_time                                 32.2 ms         32.2 ms           22
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/40000000/manual_time                                 77.0 ms         77.0 ms            9
Join<int32_t, int32_t>/left_semi_join_32bit/10000000/100000000/manual_time                                 167 ms          167 ms            4
Join<int32_t, int32_t>/left_semi_join_32bit/100000000/100000000/manual_time                                324 ms          324 ms            2
Join<int32_t, int32_t>/left_semi_join_32bit/80000000/240000000/manual_time                                 505 ms          505 ms            2
Join<int64_t, int64_t>/left_semi_join_64bit/50000000/50000000/manual_time                                  194 ms          194 ms            4
Join<int64_t, int64_t>/left_semi_join_64bit/40000000/120000000/manual_time                                 322 ms          322 ms            2
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time                              0.191 ms        0.212 ms         3720
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/400000/manual_time                              0.229 ms        0.250 ms         3001
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/1000000/manual_time                             0.359 ms        0.378 ms         2037
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/10000000/manual_time                           8.39 ms         8.41 ms           86
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/40000000/manual_time                           16.1 ms         16.1 ms           44
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/10000000/100000000/manual_time                          42.7 ms         42.7 ms           12
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000000/100000000/manual_time                         90.2 ms         90.2 ms            6
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/80000000/240000000/manual_time                           121 ms          121 ms            5
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/50000000/50000000/manual_time                           51.8 ms         51.8 ms           11
Join<int64_t, int64_t>/left_semi_join_64bit_nulls/40000000/120000000/manual_time                          75.4 ms         75.4 ms            7
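To put a number on the regression, the first timing column of matching rows can be divided. The helper below is a hypothetical convenience for that arithmetic, not part of the benchmark suite.

```cpp
#include <cassert>

// Hypothetical helper: slowdown factor of the new code relative to the old
// code for one benchmark row. Values above 1 mean the new code is slower.
double slowdown(double old_ms, double new_ms)
{
  assert(old_ms > 0.0);
  return new_ms / old_ms;
}
```

For example, `slowdown(0.156, 0.426)` is roughly 2.7 for the smallest non-null 32-bit case, while `slowdown(30.8, 32.2)` is roughly 1.05 for the 10000000/10000000 case, so the relative gap in these rows is largest at small sizes.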

cpp/src/join/semi_join.cu

jrhemstad · 2021-11-12T01:01:04Z

+  // Note: This equality comparator violates symmetry of equality and is
+  // therefore relying on the implementation detail of the order in which its
+  // operator is invoked. If cuco makes no promises about the order of
+  // invocation this seems a bit unsafe.
+  row_equality equality_probe{*right_rows_d, *left_rows_d, compare_nulls == null_equality::EQUAL};


Yeah, if you're going to try and do the same optimization as is done in the other joins of using the row hash value as the key and the index as the payload, you're going to need to add the equivalent of pair_contains from the multimap.

Given our discussions I think we can wait on this as well. The same issue exists in other join types and probably isn't worth addressing until NVIDIA/cuCollections#110.
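The symmetry concern in the code comment above can be illustrated with a small host-side sketch. `probe_equality` here is a hypothetical stand-in, not cudf's `row_equality`: it always reads its first argument as an index into the right table and its second as an index into the left table, so swapping the arguments can change the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-table comparator for illustration only. The first index is
// looked up in `right`, the second in `left`, so the result depends on the
// order in which the caller passes the two indices.
struct probe_equality {
  std::vector<std::int32_t> const& right;
  std::vector<std::int32_t> const& left;

  bool operator()(std::size_t right_idx, std::size_t left_idx) const
  {
    assert(right_idx < right.size() && left_idx < left.size());
    return right[right_idx] == left[left_idx];
  }
};
```

If a hash table is free to call its equality functor as either `eq(slot_key, probe_key)` or `eq(probe_key, slot_key)`, a comparator like this only gives correct results for one of the two orders, which is the implementation detail the comment warns about.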

vyasr · 2021-11-12T01:57:36Z

I intended this to be a draft PR, just marked it as such. I pushed this out so that I could get some feedback and we could iron out the gaps with cuco, but this isn't really ready for the big time yet. We'll probably want to wait on further improvements in cuco.

vyasr · 2021-11-19T20:10:47Z

This PR requires NVIDIA/cuCollections#118, but once that's merged I think we can move forward with this largely as is. While there are significant improvements that could be made, they are heavily dependent on refactoring cuCollections and I don't think we benefit too much by trying to implement interim stopgap solutions.

PointKernel · 2021-11-19T22:55:02Z

This PR depends on NVIDIA/cuCollections#113, otherwise the default hash allocator won't work here.

PointKernel

LGTM

PointKernel · 2021-12-16T21:10:35Z

@vyasr Back to the benchmark results, any idea why the new implementation is slower?

vyasr · 2021-12-21T19:20:05Z

> @vyasr Back to the benchmark results, any idea why the new implementation is slower?

I'm reasonably confident that the performance regression is entirely due to the switch from cudf's concurrent unordered map to the cuco static map, which hasn't benefited from the optimizations you worked on for the multimap. @jrhemstad was fine eating the perf hit for now and postponing optimization because we were trying to get the mixed joins in #9917 up and running ASAP.

However, the work in #9917 shows that the new mixed join code is going to have to be a new kernel rather than a direct adaptation of the existing hash join code because of how we deal with shared memory. Therefore, IMO this PR is no longer a prerequisite for getting mixed joins done for semi/anti joins and that work can happen in parallel, i.e. we could start using cuco's static multimap for mixed joins without merging this PR. @jrhemstad in light of that, do you want to hold off on merging this PR until we've had a chance to do the cuco refactoring and optimized cuco::static_map? Then we could avoid a performance degradation in hash semi/anti joins.

vyasr · 2021-12-22T00:12:54Z

rerun tests

PointKernel · 2021-12-22T01:06:29Z

Hmm, cudf's concurrent_unordered_map is actually a more naive/unoptimized implementation compared to cuco::static_map. Both use linear probing, and cuco additionally uses the CG-based (cooperative group) algorithm by default. I think the test cases have relatively low occupancy (or few collisions), which may explain why the CG-based algorithm is outperformed. We need a follow-up PR dedicated to detailed profiling and performance optimization.
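As a rough illustration of the occupancy argument, here is a minimal host-side linear-probing set that reports how many slots each insert inspects. `linear_probing_set` is hypothetical and single-threaded; it is neither cudf's nor cuco's implementation, and the modulo hash is an assumption chosen to make collisions easy to see.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical open-addressing set with linear probing, for illustration only.
class linear_probing_set {
  static constexpr std::int32_t empty_key = -1;  // sentinel marking an unused slot

 public:
  explicit linear_probing_set(std::size_t capacity)
    : slots_(capacity, std::int32_t{empty_key})
  {
    assert(capacity > 0);
  }

  // Inserts `key` and returns the number of slots inspected to place it.
  std::size_t insert(std::int32_t key)
  {
    assert(key != empty_key);
    std::size_t idx = hash(key);
    for (std::size_t probes = 1; probes <= slots_.size(); ++probes) {
      if (slots_[idx] == empty_key || slots_[idx] == key) {
        slots_[idx] = key;
        return probes;
      }
      idx = (idx + 1) % slots_.size();  // linear probing: move to the next slot
    }
    assert(false && "table is full");
    return slots_.size();
  }

  bool contains(std::int32_t key) const
  {
    std::size_t idx = hash(key);
    for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
      if (slots_[idx] == key) { return true; }
      if (slots_[idx] == empty_key) { return false; }  // hit a gap: key is absent
      idx = (idx + 1) % slots_.size();
    }
    return false;
  }

 private:
  // Simple modulo hash, chosen so that collisions are easy to construct.
  std::size_t hash(std::int32_t key) const
  {
    return static_cast<std::uint32_t>(key) % slots_.size();
  }

  std::vector<std::int32_t> slots_;
};
```

When the table is sparsely occupied, most inserts and lookups finish after inspecting a single slot, so there is little probing work for a group of cooperating threads to share. That is consistent with the hypothesis above, though only profiling can confirm it.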

codereport

👍

vyasr · 2022-01-04T19:05:08Z

rerun tests

vyasr · 2022-01-05T17:20:34Z

Discussed offline, we're going to get this merged now and deal with perf later.

@gpucibot merge

jlowe · 2022-01-05T17:23:44Z

> deal with perf later.

Is there an issue to track this?

vyasr · 2022-01-05T17:32:13Z

Just made one in #9973.

vyasr · 2022-01-05T17:32:19Z

@gpucibot merge

The `concurrent_unordered_multimap` is no longer used in libcudf. It has been replaced by `cuco::static_multimap`. The majority of the refactoring was done in PRs #8934 and #9704. A similar effort is in progress for `concurrent_unordered_map` and `cuco::static_map` in #9666 (and may depend on porting some optimizations from libcudf to cuco -- need to look into this before doing a direct replacement). This partially resolves issue #10401.

cc: @PointKernel @vyasr

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vyas Ramasubramani (https://github.com/vyasr)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #10642

vyasr added the 2 - In Progress, libcudf, improvement, and non-breaking labels Nov 12, 2021

vyasr self-assigned this Nov 12, 2021

vyasr requested a review from a team as a code owner November 12, 2021 00:17

vyasr requested review from robertmaynard and codereport November 12, 2021 00:17

vyasr requested review from PointKernel and jrhemstad and removed request for robertmaynard and codereport November 12, 2021 00:17

jrhemstad reviewed Nov 12, 2021

cpp/src/join/semi_join.cu (outdated)

jrhemstad reviewed Nov 12, 2021

vyasr marked this pull request as draft November 12, 2021 01:56

PointKernel mentioned this pull request Nov 15, 2021

[Review] Add stream support for static_map execution NVIDIA/cuCollections#113

Merged

vyasr marked this pull request as ready for review November 19, 2021 20:08

vyasr added the 0 - Blocked label and removed the 2 - In Progress label Nov 19, 2021

PointKernel reviewed Nov 19, 2021

cpp/src/join/semi_join.cu (outdated)

cpp/src/join/semi_join.cu (outdated)

vyasr added 5 commits November 23, 2021 09:00

Initial version of semi/anti join using cuco that compiles successfully for non-nullable types.

c073298

Add a basic semi join test to verify that the new code works for non-nullable cases and fix a hashing bug.

e6e6365

Fully functional implementation using cuco.

e0d947c

Add some more illuminating comments.

f7e478c

Remove now unnecessary header.

2ab9702

vyasr added 2 commits December 16, 2021 10:20

Merge remote-tracking branch 'origin/branch-22.02' into feature/semi_anti_join_cuco

293ebdf

Address outstanding issues with recently added cuco features.

3360100

vyasr requested a review from a team as a code owner December 16, 2021 18:35

vyasr requested a review from PointKernel December 16, 2021 18:36

vyasr added 3 - Ready for Review Ready for review by team and removed 0 - Blocked Cannot progress due to external reasons 5 - Merge After Dependencies DO NOT MERGE Hold off on merging; see PR for details labels Dec 16, 2021

PointKernel approved these changes Dec 16, 2021

Merge remote-tracking branch 'origin/branch-22.02' into feature/semi_anti_join_cuco

528f8ff

codereport approved these changes Dec 22, 2021

jrhemstad approved these changes Jan 4, 2022

vyasr mentioned this pull request Jan 5, 2022

[FEA] Address performance regression in semi/anti joins from switching to cuco #9973

Open

rapids-bot bot merged commit 2112757 into rapidsai:branch-22.02 Jan 5, 2022

vyasr deleted the feature/semi_anti_join_cuco branch January 14, 2022 18:02

jrhemstad mentioned this pull request Mar 24, 2022

[FEA] row_comparators should use strongly typed index types to ensure commutativity #10508

Closed

bdice mentioned this pull request Apr 12, 2022

Remove concurrent_unordered_multimap. #10642

Merged
