[FEA] bounds checking in joins can be expensive #3798

Closed
abellina opened this issue Oct 12, 2021 · 0 comments · Fixed by #3799
Labels: feature request, performance

abellina commented Oct 12, 2021

While working on q72 from NDS, we found that joins in this query were spending significant time calling the bounds_checker (https://github.com/rapidsai/cudf/blob/cf0b2caffd4ead3dc73025d95ae2dee11c539e9e/cpp/include/cudf/detail/gather.cuh#L56). This check is used when out_of_bounds_policy::NULLIFY is specified while gathering.
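To make the cost concrete, here is a minimal CPU sketch of a gather with and without the per-row check. It is not the cuDF implementation: `gather_sketch` is a hypothetical name, the enum is a local stand-in for `cudf::out_of_bounds_policy`, and `std::nullopt` stands in for a null row.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Local stand-in for cudf::out_of_bounds_policy (assumption: not the real type).
enum class out_of_bounds_policy { NULLIFY, DONT_CHECK };

// Hypothetical CPU gather over a single int column.
// std::nullopt plays the role of a null output row.
std::vector<std::optional<int>> gather_sketch(const std::vector<int>& source,
                                              const std::vector<int>& gather_map,
                                              out_of_bounds_policy policy)
{
  std::vector<std::optional<int>> out;
  out.reserve(gather_map.size());
  const int n = static_cast<int>(source.size());

  for (int idx : gather_map) {
    if (policy == out_of_bounds_policy::NULLIFY) {
      // Per-row range test: this is the work the issue proposes to skip
      // when the gather map is already known to be valid.
      if (idx < 0 || idx >= n) {
        out.push_back(std::nullopt);
        continue;
      }
    }
    // DONT_CHECK relies on the caller's guarantee that idx is in range.
    // The assert is a debug-only sanity check and is compiled out with NDEBUG.
    assert(idx >= 0 && idx < n);
    out.push_back(source[static_cast<std::size_t>(idx)]);
  }
  return out;
}
```

With `NULLIFY`, an out-of-range index such as 5 for a three-row source yields a null entry; with `DONT_CHECK`, the same input would be undefined behavior, which is why it is only safe for gather maps known to contain valid rows.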

@jlowe pointed out that there are several query types that don't need to do this and proposed optimizing this out. Here's what happens for each type, as I understand it:

  • LeftAnti/LeftSemi: every row in the gather map is valid. We are only picking from a single table and filtering it down.
  • InnerJoin: every row in the gather maps is valid, so no nulls are produced (in some cases the result has no rows at all).
  • LeftOuter/RightOuter: the preserved side (left for LeftOuter, right for RightOuter) doesn't need to be checked, since all the rows of that table are in its gather map and must be materialized. The opposite side of the join (i.e. the right side for a LeftOuter) must be checked, since it is likely to have gaps for unmatched rows.
  • FullOuter: this has to be checked. Both left and right gather maps have extra invalid rows that are purposefully added to those maps by the cuDF join code so they can represent the null entries on both sides.
  • CrossJoin: has no condition, so every row in the gather maps is there for the purpose of exploding the original tables (rows on left x rows on right), and all of them are valid.
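The per-join-type rules above could be captured in a small lookup like the following sketch. The names `join_type`, `gather_policies`, and `policies_for` are hypothetical and exist only for illustration; the enum is a local stand-in for `cudf::out_of_bounds_policy`, and this is not the code from the fix.

```cpp
#include <cassert>

// Local stand-in for cudf::out_of_bounds_policy (assumption: not the real type).
enum class out_of_bounds_policy { NULLIFY, DONT_CHECK };

// Hypothetical join kinds, named after the list above.
enum class join_type { LeftSemi, LeftAnti, Inner, LeftOuter, RightOuter, FullOuter, Cross };

// Policy to use when gathering from the left and right tables.
struct gather_policies {
  out_of_bounds_policy left;
  out_of_bounds_policy right;
};

constexpr gather_policies policies_for(join_type jt)
{
  constexpr auto check = out_of_bounds_policy::NULLIFY;
  constexpr auto skip  = out_of_bounds_policy::DONT_CHECK;
  switch (jt) {
    // Only the left table is gathered and every index is valid;
    // the right entry is unused and set to skip for completeness.
    case join_type::LeftSemi:
    case join_type::LeftAnti: return {skip, skip};
    // Every index on both sides refers to a real row.
    case join_type::Inner:
    case join_type::Cross: return {skip, skip};
    // The preserved side is always valid; the other side may have gaps.
    case join_type::LeftOuter: return {skip, check};
    case join_type::RightOuter: return {check, skip};
    // Both maps may contain deliberately invalid indices.
    case join_type::FullOuter: return {check, check};
  }
  // Conservative fallback for any unexpected value.
  return {check, check};
}

// Small usage example: a LeftOuter join only pays for the check on the right side.
inline void example_usage()
{
  const gather_policies p = policies_for(join_type::LeftOuter);
  assert(p.left == out_of_bounds_policy::DONT_CHECK);
  assert(p.right == out_of_bounds_policy::NULLIFY);
}
```

Falling back to `NULLIFY` for anything unrecognized keeps the sketch conservative: a needless check costs time, while a missing one risks reading out of bounds.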

The proposed optimization is to skip the bounds check wherever the gather map is known to be valid, by passing out_of_bounds_policy::DONT_CHECK instead of out_of_bounds_policy::NULLIFY, using the policy option exposed in rapidsai/cudf#9406.

@abellina abellina added feature request New feature or request ? - Needs Triage Need team to review and classify performance A performance related task/issue labels Oct 12, 2021
@abellina abellina added this to the Oct 4 - Oct 15 milestone Oct 12, 2021
@abellina abellina self-assigned this Oct 12, 2021
@Salonijain27 Salonijain27 removed the ? - Needs Triage Need team to review and classify label Oct 12, 2021