[FEA] Support sorting on keys that are structs #7226

kuhushukla · 2021-01-27T14:38:30Z

Is your feature request related to a problem? Please describe.
Allow cudf sort function to take one or more sorting keys that are nested types like lists and/or structs.
Describe the solution you'd like
sort(col(my_struct_col))) should return a column vector that is sorted ascending or descending with nulls first or last.

The text was updated successfully, but these errors were encountered:

jrhemstad · 2021-01-27T15:17:50Z

Struct columns of non-nested types won't be too bad as it's just like adding extra columns to a multi-column sort.

Arbitrarily nested lists on the other-hand are going to be extremely difficult to the point of being effectively impossible.

kuhushukla · 2021-01-28T23:36:58Z

I think structs of non-nested types would be a great start, given our user story!

revans2 · 2021-01-29T15:31:41Z

I think structs of non-nested types would be a great start, given our user story!

To be clear structs of strings would be needed.

kkraus14 · 2021-02-02T23:44:13Z

Updated title to capture that that this is for structs and not lists

jrhemstad · 2021-02-02T23:47:47Z

The way to make this non-intrusive is to have a struct to table conversion function, something like:

table_view struct_to_table(structs_column_view);

This will flatten any of internal columns (and likewise with nested struct columns) into a table_view. You can then insert the table_view input to sort with the expanded struct column.

jlowe · 2021-02-03T02:06:44Z

This will flatten any of internal columns (and likewise with nested struct columns) into a table_view. You can then insert the table_view input to sort with the expanded struct column.

What about the validity vector of the parent struct column (and nested versions)?

jrhemstad · 2021-02-03T02:30:34Z

This will flatten any of internal columns (and likewise with nested struct columns) into a table_view. You can then insert the table_view input to sort with the expanded struct column.

What about the validity vector of the parent struct column (and nested versions)?

Since they're just views, you can set the validity mask of the expanded child column to be the same as the parent struct column. I believe that should preserve the desired semantics. In fact, I thought we already pushed the parent null mask down into the children for structs?

jlowe · 2021-02-03T02:37:03Z

In fact, I thought we already pushed the parent null mask down into the children for structs?

We do, but after the sort we still need the parent struct validity to be updated so it is accurately reflected in subsequent operations (e.g.: write out to Parquet). The parent struct may say an entry is valid yet all the children are null, and that's semantically different than a null struct. Pushing the parent validity down to the children is very useful when working with the children in isolation, but we're not doing that here. We're sorting the struct column itself, not the individual children within it. The two are similar but distinct operations.

I don't know how we can reconstruct the parent struct validity from the resulting children validity after the fact. AFAICT either the sort method will have to support this update directly somehow (and therefore take the struct column directly) or we'd have to use the child columns (if they are sort keys) to generate the gather map then use that to update the struct validity vectors.

jlowe · 2021-02-03T02:41:57Z

The parent struct may say an entry is valid yet all the children are null, and that's semantically different than a null struct.

Hmm, I wonder if there's a sort order difference between a null struct vs. a valid struct containing all null children. If so then we can't use the individual child columns to proxy for the parent struct column.

jrhemstad · 2021-02-03T02:42:55Z

In fact, I thought we already pushed the parent null mask down into the children for structs?

We do, but after the sort we still need the parent struct validity to be updated so it is accurately reflected in subsequent operations (e.g.: write out to Parquet). The parent struct may say an entry is valid yet all the children are null, and that's semantically different than a null struct. Pushing the parent validity down to the children is very useful when working with the children in isolation, but we're not doing that here. We're sorting the struct column itself, not the individual children within it. The two are similar but distinct operations.

I don't know how we can reconstruct the parent struct validity from the resulting children validity after the fact. AFAICT either the sort method will have to support this update directly somehow (and therefore take the struct column directly) or we'd have to use the child columns (if they are sort keys) to generate the gather map then use that to update the struct validity vectors.

Flattening the struct column into a table is just for establishing the lexicographical ordering (i.e., the gather map). We still have the original column and so when we call gather on the original column with the gather map, it'll preserve whatever the original state was of the original input struct column.

revans2 · 2021-02-03T16:36:58Z

Just to be clear in Spark when sorting structs the order of null first vs null last first applies to the struct itself and then applies to the fields in the struct.
desc_nulls_first

+------------+
|           s|
+------------+
|        null|
|        null|
|   {1, null}|
|   {0, null}|
|   {null, 1}|
|   {null, 0}|
|{null, null}|
|{null, null}|
+------------+

desc_nulls_last

+------------+
|           s|
+------------+
|   {1, null}|
|   {0, null}|
|   {null, 1}|
|   {null, 0}|
|{null, null}|
|{null, null}|
|        null|
|        null|
+------------+

The columns within the struct are sorted in the order that they appear in the struct.
Ascending order

+-----------+
|          s|
+-----------+
|  {0, 4, 1}|
|  {0, 4, 2}|
|  {1, 2, 5}|
|  {1, 3, 3}|
|  {1, 3, 4}|
|  {2, 1, 7}|
|  {2, 1, 8}|
|  {2, 2, 6}|
|  {3, 0, 9}|
| {3, 0, 10}|
| {3, 0, 11}|
|{4, -2, 14}|
|{4, -1, 12}|
|{4, -1, 13}|
|{5, -3, 16}|
|{5, -3, 17}|
|{5, -2, 15}|
|{6, -4, 18}|
|{6, -4, 19}|
+-----------+

Hopefully all of that was intuitive, but I wanted to be sure that the requirement is 100% clear

jlowe · 2021-02-03T16:43:58Z

Just to be clear in Spark when sorting structs the order of null first vs null last first applies to the struct itself and then applies to the fields in the struct.

That means we won't be able to use only the child columns to proxy for the struct. A column to proxy as the struct validity vector will need to be prepended to the child columns when building the sort keys. The struct validity vector can be expanded as a BOOL8 column to meet that requirement.

revans2 · 2021-02-03T16:47:01Z

The struct validity vector can be expanded as a BOOL8 column to meet that requirement.

Or the comparator when sorting includes the base column of the struct (with the validity in it) and ignores the value or just produces a constant throw away value, and the null handling logic already in place just works.

jlowe · 2021-02-03T16:50:00Z

Or the comparator when sorting includes the base column of the struct (with the validity in it) and ignores the value or just produces a constant throw away value, and the null handling logic already in place just works.

Agreed that would be preferable. I was thinking of ways to accomplish it with the existing comparator functionality.

jrhemstad · 2021-02-03T17:52:37Z

This example doesn't make any sense to me:

desc_nulls_first

+------------+
|           s|
+------------+
|        null|
|        null|
|   {1, null}|
|   {0, null}|
|   {null, 1}|
|   {null, 0}|
|{null, null}|
|{null, null}|
+------------+

In a descending sort where nulls are considered larger than all other values ("nulls first") then shouldn't it be:

+------------+
|           s|
+------------+
|        null|
|        null|
|{null, null}|
|{null, null}|
|   {null, 1}|
|   {null, 0}|
|   {1, null}|
|   {0, null}|
+------------+

revans2 · 2021-02-04T16:45:35Z

@jrhemstad you are 100% right. I had to go back and read the code. In Spark the null ordering within a struct is separate from the null ordering on the main columns. It always goes with the default value of nulls first on ascending and nulls last on descending. That is counter intuitive to me, but it is what they do.

jrhemstad · 2021-02-04T17:26:05Z

@jrhemstad you are 100% right. I had to go back and read the code. In Spark the null ordering within a struct is separate from the null ordering on the main columns. It always goes with the default value of nulls first on ascending and nulls last on descending. That is counter intuitive to me, but it is what they do.

That's not something we'll be able to support natively. The way to accomplish having struct members have different null ordering from the struct column itself is to explode the struct column children into their own columns much like I've already described, but additionally inserting a BOOL8 column that represents that parent column's bitmask (like @jlowe described above). Then you can specify whatever per-column (and therefore per-struct member) null ordering you wish.

If the null ordering is the same, we can support the fact that a null struct is different from a valid struct with all null members, i.e., null is ordered differently from {null, null}, we'll just have to do something different than I originally thought.

kkraus14 · 2021-02-04T20:10:55Z

If the null ordering is the same, we can support the fact that a null struct is different from a valid struct with all null members, i.e., null is ordered differently from {null, null}, we'll just have to do something different than I originally thought.

This is the behavior we'd want from the Python side.

revans2 · 2021-02-04T22:29:19Z

That is kind of what I expected.

kkraus14 · 2021-02-04T23:24:44Z

Is this something that would make sense to push to change the behavior in Spark?

revans2 · 2021-02-05T15:04:28Z

Is this something that would make sense to push to change the behavior in Spark?

We could try to change the Spark behavior, but my guess is that we would still have to support both ways. At least while the change rolls out as this would be a breaking change and likely have to go into Spark 4.0. Also, in enterprise software, like this, we would likely have to include a flag so users can get the old behavior if they rely on it. Because the default values are self consistent I am hoping that we can cover 99% of the use cases without having to split apart the struct.

github-actions · 2021-03-21T18:25:26Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

jlowe · 2021-03-22T13:50:24Z

This is being actively worked on in #7422.

@karthikeyann

closes #7226 Add struct column support to `cudf::sort`, `cudf::sorted_order` struct is supported by flattening the struct into individual columns in table_view, null mask of struct is converted to boolean column with same null_mask. Authors: - Karthikeyan (@karthikeyann) Approvers: - Gera Shegalov (@gerashegalov) - David (@davidwendt) - Nghia Truong (@ttnghia) - Jake Hemstad (@jrhemstad) - Conor Hoekstra (@codereport) URL: #7422

kuhushukla added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Jan 27, 2021

This was referenced Jan 27, 2021

[FEA] Support sorting on struct type keys NVIDIA/spark-rapids#1605

Closed

[FEA] Support features for user query NVIDIA/spark-rapids#1608

Closed

kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 27, 2021

kkraus14 changed the title ~~[FEA] Support sorting on keys that are nested types~~ [FEA] Support sorting on keys that are structs Feb 2, 2021

kkraus14 assigned karthikeyann and unassigned karthikeyann Feb 16, 2021

karthikeyann self-assigned this Feb 19, 2021

karthikeyann mentioned this issue Feb 22, 2021

Add struct column support to cudf::sort and cudf::sorted_order #7422

Merged

jrhemstad mentioned this issue Mar 10, 2021

[FEA] Add equi-join support for structs #7543

Closed

github-actions bot added the inactive-30d label Mar 21, 2021

github-actions bot removed the inactive-30d label Mar 22, 2021

rapids-bot bot closed this as completed in #7422 Mar 26, 2021

jrhemstad mentioned this issue Apr 12, 2021

[BUG] Null equality of structs doesn't match to spark #7934

Closed

revans2 mentioned this issue May 7, 2021

[BUG] null ordering for nulls in children appears to be ignored #8189

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Support sorting on keys that are structs #7226

[FEA] Support sorting on keys that are structs #7226

kuhushukla commented Jan 27, 2021

jrhemstad commented Jan 27, 2021

kuhushukla commented Jan 28, 2021 •

edited

Loading

revans2 commented Jan 29, 2021

kkraus14 commented Feb 2, 2021

jrhemstad commented Feb 2, 2021

jlowe commented Feb 3, 2021

jrhemstad commented Feb 3, 2021

jlowe commented Feb 3, 2021

jlowe commented Feb 3, 2021

jrhemstad commented Feb 3, 2021 •

edited

Loading

revans2 commented Feb 3, 2021

jlowe commented Feb 3, 2021

revans2 commented Feb 3, 2021 •

edited

Loading

jlowe commented Feb 3, 2021

jrhemstad commented Feb 3, 2021

revans2 commented Feb 4, 2021

jrhemstad commented Feb 4, 2021 •

edited

Loading

kkraus14 commented Feb 4, 2021

revans2 commented Feb 4, 2021

kkraus14 commented Feb 4, 2021

revans2 commented Feb 5, 2021

github-actions bot commented Mar 21, 2021

jlowe commented Mar 22, 2021

[FEA] Support sorting on keys that are structs #7226

[FEA] Support sorting on keys that are structs #7226

Comments

kuhushukla commented Jan 27, 2021

jrhemstad commented Jan 27, 2021

kuhushukla commented Jan 28, 2021 • edited Loading

revans2 commented Jan 29, 2021

kkraus14 commented Feb 2, 2021

jrhemstad commented Feb 2, 2021

jlowe commented Feb 3, 2021

jrhemstad commented Feb 3, 2021

jlowe commented Feb 3, 2021

jlowe commented Feb 3, 2021

jrhemstad commented Feb 3, 2021 • edited Loading

revans2 commented Feb 3, 2021

jlowe commented Feb 3, 2021

revans2 commented Feb 3, 2021 • edited Loading

jlowe commented Feb 3, 2021

jrhemstad commented Feb 3, 2021

revans2 commented Feb 4, 2021

jrhemstad commented Feb 4, 2021 • edited Loading

kkraus14 commented Feb 4, 2021

revans2 commented Feb 4, 2021

kkraus14 commented Feb 4, 2021

revans2 commented Feb 5, 2021

github-actions bot commented Mar 21, 2021

jlowe commented Mar 22, 2021

kuhushukla commented Jan 28, 2021 •

edited

Loading

jrhemstad commented Feb 3, 2021 •

edited

Loading

revans2 commented Feb 3, 2021 •

edited

Loading

jrhemstad commented Feb 4, 2021 •

edited

Loading