Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Slow performance in left_anti_join and left_semi_join when nulls are in the data #7300

Closed
hyperbolic2346 opened this issue Feb 3, 2021 · 9 comments · Fixed by #8277
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@hyperbolic2346
Copy link
Contributor

Describe the bug
Performance is really bad in left_anti_join and left_semi_join calls when nulls are involved. It appears to be related to #6052 in that we are building a hash table and performance suffers with collisions.

Steps/Code to reproduce bug
Generate a table with some null keys and run left_anti_join or left_semi_join on the data. The performance can be measured in minutes with a table of 10 million rows.

Expected behavior
When nulls are not equal, we don't need to hash the row and store it, so this is completely wasted work to hash the row if it is null.

@hyperbolic2346 hyperbolic2346 added bug Something isn't working Needs Triage Need team to review and classify labels Feb 3, 2021
@jrhemstad jrhemstad added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue and removed Needs Triage Need team to review and classify labels Feb 3, 2021
@github-actions
Copy link

github-actions bot commented Mar 5, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 5, 2021

@hyperbolic2346 was this fixed by #6943?

@hyperbolic2346
Copy link
Contributor Author

@kkraus14 No, this is the same issue in a different spot. Probably the same general fix as well.

@jlowe
Copy link
Member

jlowe commented Mar 5, 2021

We currently have a workaround in place for this in the Spark plugin where nulls are manually filtered before performing the join. I'd love to see this fixed so we can remove it.

@github-actions
Copy link

github-actions bot commented Apr 5, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jlowe
Copy link
Member

jlowe commented Apr 5, 2021

This is still relevant.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jlowe
Copy link
Member

jlowe commented May 12, 2021

Still relevant

@nvdbaranec
Copy link
Contributor

Starting on this.

rapids-bot bot pushed a commit that referenced this issue May 24, 2021
…assed to left_semi_join and left_anti_join (#8277)

Fixes  #7300

This is fundamentally the same issue and fix as https://github.com/rapidsai/cudf/pull/6943/files from @hyperbolic2346 

When nulls are considered not equal (`null_equality::NOT_EQUAL`) there is no point in adding them to the hash table used for the join as they will never compare as true against anything.  Adding large numbers of nulls was causing huge performance issues.

Includes a fix to doxygen comments for `left_anti_join`

Performance gain is tremendous.

Before:

```
Benchmark                                                                             Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/left_anti_join_32bit_nulls/100000/100000/manual_time        1072 ms         1072 ms            1
Join<int32_t, int32_t>/left_anti_join_32bit_nulls/200000/400000/manual_time        4253 ms         4253 ms            1
Join<int32_t, int32_t>/left_anti_join_32bit_nulls/300000/1000000/manual_time      14016 ms        14016 ms            1
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time         932 ms          932 ms            1
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/200000/400000/manual_time        4481 ms         4481 ms            1
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/300000/1000000/manual_time      14172 ms        14172 ms            1
```


After:
```
-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/left_anti_join_32bit_nulls/100000/100000/manual_time       0.143 ms        0.162 ms         4996
Join<int32_t, int32_t>/left_anti_join_32bit_nulls/200000/400000/manual_time       0.255 ms        0.275 ms         2780
Join<int32_t, int32_t>/left_anti_join_32bit_nulls/300000/1000000/manual_time      0.514 ms        0.532 ms         1368
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/100000/100000/manual_time       0.135 ms        0.155 ms         5203
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/200000/400000/manual_time       0.206 ms        0.224 ms         3325
Join<int32_t, int32_t>/left_semi_join_32bit_nulls/300000/1000000/manual_time      0.368 ms        0.385 ms         1903
```

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Robert Maynard (https://github.com/robertmaynard)
  - Mark Harris (https://github.com/harrism)

URL: #8277
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS
Projects
None yet
5 participants