Proposal for lexicographic comparators for lists of structs #11222

bdice · 2022-07-07T22:59:41Z

Background

Part of a broad push for row operators on nested types in story issue #10186. Related to user request in #10408.

@devavret has designed lexicographic comparators for lists in #11129 but this design cannot support lists of structs. Meanwhile, structs can be lexicographically compared through decomposition (#10164).

Proposed approach

(Discussed with @devavret @ttnghia @vyasr.)

We propose to alter the self-comparator (two-table comparator) preprocessing step to replace all list<struct<...>> with ranked child elements as list<size_type> from innermost to outermost levels of nesting. The replacement data is computed by calling cudf::rank on the child column of the list. (For two tables, the child list columns from both tables are concatenated, ranked, and then split back into a child list column for each table. This can use a fancy "chained" iterator to avoid materializing the concatenation.)

The key insight is that any child type (int, struct<int, int>, ...) except for list<struct<...>> can be ranked using existing lexicographic comparators, and replacing the list child data with the corresponding rankings will preserve the desired sorting order.

After preprocessing, the columns will not contain list<struct<...>> at any level of nesting, and all other combinations of nested types can be compared using other methods (currently dremel encoding for lists as in #11129, and decomposition for structs as in #10164). Altogether, the number of rank calls during preprocessing is equal to the number of instances of list<struct<...>> in the type hierarchy.

Notes

This preprocessing step for each list<struct<...>> may be expensive due to its reliance on ranks, which requires a sort and is thus O(N log N). Originally we considered replacing the data with an order preserving hash (#10020) which would be O(N) but that is probably not easy to implement without stringent constraints or prior knowledge of the inputs. It may be possible to have special cases for some limited set of types where some form of compressed ranking (like an order-preserving hash) can be trivially computed without sorting (the cudf::rank step is not needed). For example, the child ordering of a non-nullable list<struct<int32_t>> can be preserved by extracting the single integer's value and sorting on that. However, if null values in the parent struct must be considered, this may not work. The mapping must be one-to-one (injective) and monotonic, which cudf::rank satisfies.

Additionally, this places a new requirement on the self-comparator (two-table comparator) to know the nullability and physical element comparator (#10870) at construction so that the child ranks of list<struct<...>> data are correctly preprocessed. Currently these arguments are given when constructing the device row comparator from the self-comparator (two-table comparator).

Examples

Example 1

Given a column with type list<struct<int, int>>:

[{1, 2}, {2, 1}]
[{3, 4}, {1, 2}]
[{2, 1}]

The struct<int, int> leaf data looks like:

{1, 2}
{2, 1}
{3, 4}
{1, 2}
{2, 1}

The ranking of these values is 0, 2, 4, 0, 2. Replacing these values in the list column gives a list<size_type> column:

[0, 2]
[3, 0]
[2]

Then this can be sorted.

Example 2

Given a column with type list<struct<struct<list<int>, list<struct<int>>>, list<int>>>:

[{{[1], [{2}]}, [3, 4]}]
[{{[0], [{1}]}, [2, 3]}]

We can iteratively replace all list<struct<...> by list<size_type>:

list<struct<struct<list<int>, list<size_type>>, list<int>>>
[{{[1], [1]}, [3, 4]}]
[{{[0], [0]}, [2, 3]}]

At this point, the child type of the root list is struct<struct<list<int>, list<size_type>>, list<int>>. This child type contains no list<struct<...>> and can be ranked again:

list<size_type>
[1]
[0]

The table is now preprocessed and is now comparable.

Example 3

For a binary operation like column < scalar, the two-table comparator will be used. The preprocessing will need to rank all of the elements of a list<struct<...>> from the column against those of the scalar. To optimize this, the rank algorithm should be updated to make it possible to compute a rank over multiple non-contiguous column views without materializing their concatenation.

Tasks

Give feedback

Fully support lexicographic comparison for nested types in table self-comparator #12879

3 - Ready for Review CMake Spark feature request libcudf non-breaking
Support lists of structs in row lexicographic comparator #12953

3 - Ready for Review CMake Spark feature request libcudf non-breaking
Options

The text was updated successfully, but these errors were encountered:

jrhemstad · 2022-07-07T23:18:58Z

Additionally, this places a new requirement on the self-comparator (two-table comparator) to know the nullability and physical element comparator (#10870) at construction so that the child ranks of list<struct<...>> data are correctly preprocessed. Currently these arguments are given when constructing the device row comparator from the self-comparator (two-table comparator).

Given this requirement, I'd be less inclined to make this change: #11040 (comment)

github-actions · 2022-08-07T00:13:12Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

bdice added proposal Change current process or code libcudf Affects libcudf (C++/CUDA) code. labels Jul 7, 2022

bdice added this to the List and Struct data types and operations milestone Jul 7, 2022

NVnavkumar mentioned this issue Jul 14, 2022

[FEA] SortArray supports nested types NVIDIA/spark-rapids#3715

Open

vyasr mentioned this issue Jul 22, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

bdice mentioned this issue Jul 28, 2022

List lexicographic comparator #11129

Merged

github-actions bot added the inactive-30d label Aug 7, 2022

GregoryKimball mentioned this issue Oct 3, 2022

[FEA] Implement full support for nested types #11844

Closed

PointKernel mentioned this issue Nov 8, 2022

Support nested types as groupby keys in libcudf #11792

Merged

3 tasks

ttnghia mentioned this issue Nov 30, 2022

[FEA] GroupBy on an arbitrary nested array(with struct/array inside) NVIDIA/spark-rapids#5109

Open

GregoryKimball added this to libcudf Jan 6, 2023

ttnghia mentioned this issue Feb 7, 2023

Add support for arrays in hashaggregate [databricks] NVIDIA/spark-rapids#7465

Merged

ttnghia mentioned this issue Feb 22, 2023

[BUG] Cannot lexicographic compare a table with a LIST of STRUCT column at ai.rapids.cudf.Table.sortOrder NVIDIA/spark-rapids#7799

Closed

ttnghia self-assigned this Mar 5, 2023

ttnghia mentioned this issue Mar 5, 2023

Fully support lexicographic comparison for nested types in table self-comparator #12879

Closed

ttnghia linked a pull request Mar 7, 2023 that will close this issue

Fully support lexicographic comparison for nested types in table self-comparator #12879

Closed

ttnghia mentioned this issue Mar 15, 2023

Support lists of structs in row lexicographic comparator #12953

Merged

ttnghia removed a link to a pull request Mar 24, 2023

Fully support lexicographic comparison for nested types in table self-comparator #12879

Closed

ttnghia removed the inactive-30d label Mar 31, 2023

ttnghia mentioned this issue Apr 6, 2023

[FEA] Support order by on multi-level array NVIDIA/spark-rapids#7230

Open

rapids-bot bot closed this as completed in #12953 May 3, 2023

rapids-bot bot closed this as completed in d0a7dec May 3, 2023

GregoryKimball removed this from libcudf Jun 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal for lexicographic comparators for lists of structs #11222

Proposal for lexicographic comparators for lists of structs #11222

bdice commented Jul 7, 2022 •

edited by ttnghia

Loading

Tasks

jrhemstad commented Jul 7, 2022

github-actions bot commented Aug 7, 2022

Proposal for lexicographic comparators for lists of structs #11222

Proposal for lexicographic comparators for lists of structs #11222

Comments

bdice commented Jul 7, 2022 • edited by ttnghia Loading

Background

Proposed approach

Notes

Examples

Example 1

Example 2

Example 3

Tasks

jrhemstad commented Jul 7, 2022

github-actions bot commented Aug 7, 2022

bdice commented Jul 7, 2022 •

edited by ttnghia

Loading