-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Slow performance in left_anti_join and left_semi_join when nulls are in the data #7300
Comments
This issue has been labeled |
@hyperbolic2346 was this fixed by #6943? |
@kkraus14 No, this is the same issue in a different spot. Probably the same general fix as well. |
We currently have a workaround in place for this in the Spark plugin where nulls are manually filtered before performing the join. I'd love to see this fixed so we can remove it. |
This issue has been labeled |
This is still relevant. |
This issue has been labeled |
Still relevant |
Starting on this. |
…assed to left_semi_join and left_anti_join (#8277) Fixes #7300 This is fundamentally the same issue and fix as https://github.com/rapidsai/cudf/pull/6943/files from @hyperbolic2346 When nulls are considered not equal (`null_equality::NOT_EQUAL`) there is no point in adding them to the hash table used for the join as they will never compare as true against anything. Adding large numbers of nulls was causing huge performance issues. Includes a fix to doxygen comments for `left_anti_join` Performance gain is tremendous. Before: ``` Benchmark Time CPU Iterations ----------------------------------------------------------------------------------------------------------------------- Join<int32_t, int32_t>/left_anti_join_32bit_nulls/100000/100000/manual_time 1072 ms 1072 ms 1 Join<int32_t, int32_t>/left_anti_join_32bit_nulls/200000/400000/manual_time 4253 ms 4253 ms 1 Join<int32_t, int32_t>/left_anti_join_32bit_nulls/300000/1000000/manual_time 14016 ms 14016 ms 1 Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time 932 ms 932 ms 1 Join<int32_t, int32_t>/left_semi_join_32bit_nulls/200000/400000/manual_time 4481 ms 4481 ms 1 Join<int32_t, int32_t>/left_semi_join_32bit_nulls/300000/1000000/manual_time 14172 ms 14172 ms 1 ``` After: ``` ----------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------------------------------------------------------------- Join<int32_t, int32_t>/left_anti_join_32bit_nulls/100000/100000/manual_time 0.143 ms 0.162 ms 4996 Join<int32_t, int32_t>/left_anti_join_32bit_nulls/200000/400000/manual_time 0.255 ms 0.275 ms 2780 Join<int32_t, int32_t>/left_anti_join_32bit_nulls/300000/1000000/manual_time 0.514 ms 0.532 ms 1368 Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time 0.135 ms 0.155 ms 5203 Join<int32_t, int32_t>/left_semi_join_32bit_nulls/200000/400000/manual_time 0.206 ms 0.224 ms 3325 Join<int32_t, int32_t>/left_semi_join_32bit_nulls/300000/1000000/manual_time 0.368 ms 0.385 ms 1903 ``` Authors: - https://github.com/nvdbaranec Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Mike Wilson (https://github.com/hyperbolic2346) - Robert Maynard (https://github.com/robertmaynard) - Mark Harris (https://github.com/harrism) URL: #8277
Describe the bug
Performance is really bad in
left_anti_join
andleft_semi_join
calls when nulls are involved. It appears to be related to #6052 in that we are building a hash table and performance suffers with collisions.Steps/Code to reproduce bug
Generate a table with some null keys and run
left_anti_join
orleft_semi_join
on the data. The performance can be measured in minutes with a table of 10 million rows.Expected behavior
When nulls are not equal, we don't need to hash the row and store it, so this is completely wasted work to hash the row if it is null.
The text was updated successfully, but these errors were encountered: