-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for lexicographic comparators for lists of structs #11222
Comments
Given this requirement, I'd be less inclined to make this change: #11040 (comment) |
This issue has been labeled |
Background
Part of a broad push for row operators on nested types in story issue #10186. Related to user request in #10408.
@devavret has designed lexicographic comparators for lists in #11129 but this design cannot support lists of structs. Meanwhile, structs can be lexicographically compared through decomposition (#10164).
Proposed approach
(Discussed with @devavret @ttnghia @vyasr.)
We propose to alter the self-comparator (two-table comparator) preprocessing step to replace all
list<struct<...>>
with ranked child elements aslist<size_type>
from innermost to outermost levels of nesting. The replacement data is computed by callingcudf::rank
on the child column of the list. (For two tables, the child list columns from both tables are concatenated, ranked, and then split back into a child list column for each table. This can use a fancy "chained" iterator to avoid materializing the concatenation.)The key insight is that any child type (
int
,struct<int, int>
, ...) except forlist<struct<...>>
can be ranked using existing lexicographic comparators, and replacing the list child data with the corresponding rankings will preserve the desired sorting order.After preprocessing, the columns will not contain
list<struct<...>>
at any level of nesting, and all other combinations of nested types can be compared using other methods (currently dremel encoding for lists as in #11129, and decomposition for structs as in #10164). Altogether, the number ofrank
calls during preprocessing is equal to the number of instances oflist<struct<...>>
in the type hierarchy.Notes
This preprocessing step for each
list<struct<...>>
may be expensive due to its reliance on ranks, which requires a sort and is thusO(N log N)
. Originally we considered replacing the data with an order preserving hash (#10020) which would beO(N)
but that is probably not easy to implement without stringent constraints or prior knowledge of the inputs. It may be possible to have special cases for some limited set of types where some form of compressed ranking (like an order-preserving hash) can be trivially computed without sorting (thecudf::rank
step is not needed). For example, the child ordering of a non-nullablelist<struct<int32_t>>
can be preserved by extracting the single integer's value and sorting on that. However, if null values in the parentstruct
must be considered, this may not work. The mapping must be one-to-one (injective) and monotonic, whichcudf::rank
satisfies.Additionally, this places a new requirement on the self-comparator (two-table comparator) to know the nullability and physical element comparator (#10870) at construction so that the child ranks of
list<struct<...>>
data are correctly preprocessed. Currently these arguments are given when constructing the device row comparator from the self-comparator (two-table comparator).Examples
Example 1
Given a column with type
list<struct<int, int>>
:The
struct<int, int>
leaf data looks like:The ranking of these values is
0, 2, 4, 0, 2
. Replacing these values in the list column gives alist<size_type>
column:Then this can be sorted.
Example 2
Given a column with type
list<struct<struct<list<int>, list<struct<int>>>, list<int>>>
:We can iteratively replace all
list<struct<...>
bylist<size_type>
:At this point, the child type of the root
list
isstruct<struct<list<int>, list<size_type>>, list<int>>
. This child type contains nolist<struct<...>>
and can be ranked again:The table is now preprocessed and is now comparable.
Example 3
For a binary operation like
column < scalar
, the two-table comparator will be used. The preprocessing will need to rank all of the elements of alist<struct<...>>
from the column against those of the scalar. To optimize this, the rank algorithm should be updated to make it possible to compute a rank over multiple non-contiguous column views without materializing their concatenation.Tasks
The text was updated successfully, but these errors were encountered: