While working on q72 from NDS we found that joins in this query were spending time calling the `bounds_checker` (https://github.com/rapidsai/cudf/blob/cf0b2caffd4ead3dc73025d95ae2dee11c539e9e/cpp/include/cudf/detail/gather.cuh#L56). This is used when `out_of_bounds_policy::NULLIFY` is specified while gathering.

@jlowe pointed out that there are several join types that don't need to do this and proposed optimizing this out. Here's what happens for each type, as I understand it:

- `LeftAnti`/`LeftSemi`: every row in the gather map is valid. We are only picking from a single table, just filtering it down.
- `InnerJoin`: every row in the gather map is valid, so no nulls are produced (though in some cases the result has no rows at all).
- `LeftOuter`/`RightOuter`: the side we are joining on (left or right) doesn't need to be checked, since all the rows of its table (left or right) are in the gather map and must be manifested. The opposite side of the join (i.e. the right side for a `LeftOuter`) must be checked, because it is likely to have gaps for "outer" joins.
- `FullOuter`: this has to be checked. Both left and right gather maps have extra invalid rows that are purposefully added to those maps by the cuDF join code so they can represent the null entries on both sides.
- `CrossJoin`: has no condition, so every row in the gather maps is there for the purpose of exploding the original tables (rows on left x rows on right).

The optimization proposed is to skip the bounds checks where they can be skipped by providing `out_of_bounds_policy::DONT_CHECK`, as exposed with: rapidsai/cudf#9406
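
To make the per-join-type reasoning above concrete, here is a minimal sketch of the decision as a lookup. It is not the plugin's code: `JoinType`, `Side`, `OutOfBoundsPolicy`, and `gather_policy` are hypothetical names made up for illustration, and only the `NULLIFY` / `DONT_CHECK` value names are meant to mirror cuDF's `out_of_bounds_policy`.

```python
from enum import Enum


class JoinType(Enum):
    # Hypothetical enum; names follow the join types listed in this issue.
    INNER = "InnerJoin"
    LEFT_OUTER = "LeftOuter"
    RIGHT_OUTER = "RightOuter"
    FULL_OUTER = "FullOuter"
    LEFT_SEMI = "LeftSemi"
    LEFT_ANTI = "LeftAnti"
    CROSS = "CrossJoin"


class Side(Enum):
    # Which gather map is being applied: the one for the left or right table.
    LEFT = "left"
    RIGHT = "right"


class OutOfBoundsPolicy(Enum):
    # Stand-in for cuDF's out_of_bounds_policy (assumption: same value names).
    NULLIFY = "NULLIFY"        # bounds-check each index, emit null if invalid
    DONT_CHECK = "DONT_CHECK"  # trust the gather map, skip the bounds check


def gather_policy(join_type: JoinType, side: Side) -> OutOfBoundsPolicy:
    """Pick the cheapest policy that is still safe for this gather map."""
    # FullOuter: both maps contain deliberately invalid indices.
    if join_type is JoinType.FULL_OUTER:
        return OutOfBoundsPolicy.NULLIFY
    # LeftOuter/RightOuter: only the side opposite the preserved table
    # can have gaps, so only that map needs checking.
    if join_type is JoinType.LEFT_OUTER and side is Side.RIGHT:
        return OutOfBoundsPolicy.NULLIFY
    if join_type is JoinType.RIGHT_OUTER and side is Side.LEFT:
        return OutOfBoundsPolicy.NULLIFY
    # Inner, semi, anti, cross, and the preserved side of a one-sided
    # outer join: every index in the map is valid.
    return OutOfBoundsPolicy.DONT_CHECK
```

For example, `gather_policy(JoinType.LEFT_OUTER, Side.RIGHT)` yields `NULLIFY`, while the left map of the same join yields `DONT_CHECK`. Defaulting to `DONT_CHECK` is only reasonable here because the enum is closed; real code that might see an unknown join type would be safer falling back to `NULLIFY`.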