-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Support sorting on keys that are structs #7226
Comments
Struct columns of non-nested types won't be too bad as it's just like adding extra columns to a multi-column sort. Arbitrarily nested lists on the other-hand are going to be extremely difficult to the point of being effectively impossible. |
I think structs of non-nested types would be a great start, given our user story! |
To be clear structs of strings would be needed. |
Updated title to capture that that this is for |
The way to make this non-intrusive is to have a struct to table conversion function, something like:
This will flatten any of internal columns (and likewise with nested struct columns) into a |
What about the validity vector of the parent struct column (and nested versions)? |
Since they're just views, you can set the validity mask of the expanded child column to be the same as the parent struct column. I believe that should preserve the desired semantics. In fact, I thought we already pushed the parent null mask down into the children for structs? |
We do, but after the sort we still need the parent struct validity to be updated so it is accurately reflected in subsequent operations (e.g.: write out to Parquet). The parent struct may say an entry is valid yet all the children are null, and that's semantically different than a null struct. Pushing the parent validity down to the children is very useful when working with the children in isolation, but we're not doing that here. We're sorting the struct column itself, not the individual children within it. The two are similar but distinct operations. I don't know how we can reconstruct the parent struct validity from the resulting children validity after the fact. AFAICT either the sort method will have to support this update directly somehow (and therefore take the struct column directly) or we'd have to use the child columns (if they are sort keys) to generate the gather map then use that to update the struct validity vectors. |
Hmm, I wonder if there's a sort order difference between a null struct vs. a valid struct containing all null children. If so then we can't use the individual child columns to proxy for the parent struct column. |
Flattening the struct column into a table is just for establishing the lexicographical ordering (i.e., the gather map). We still have the original column and so when we call |
Just to be clear in Spark when sorting structs the order of null first vs null last first applies to the struct itself and then applies to the fields in the struct.
desc_nulls_last
The columns within the struct are sorted in the order that they appear in the struct.
Hopefully all of that was intuitive, but I wanted to be sure that the requirement is 100% clear |
That means we won't be able to use only the child columns to proxy for the struct. A column to proxy as the struct validity vector will need to be prepended to the child columns when building the sort keys. The struct validity vector can be expanded as a |
Or the comparator when sorting includes the base column of the struct (with the validity in it) and ignores the value or just produces a constant throw away value, and the null handling logic already in place just works. |
Agreed that would be preferable. I was thinking of ways to accomplish it with the existing comparator functionality. |
This example doesn't make any sense to me:
In a descending sort where nulls are considered larger than all other values ("nulls first") then shouldn't it be:
|
@jrhemstad you are 100% right. I had to go back and read the code. In Spark the null ordering within a struct is separate from the null ordering on the main columns. It always goes with the default value of nulls first on ascending and nulls last on descending. That is counter intuitive to me, but it is what they do. |
That's not something we'll be able to support natively. The way to accomplish having struct members have different null ordering from the struct column itself is to explode the struct column children into their own columns much like I've already described, but additionally inserting a If the null ordering is the same, we can support the fact that a null struct is different from a valid struct with all null members, i.e., |
This is the behavior we'd want from the Python side. |
That is kind of what I expected. |
Is this something that would make sense to push to change the behavior in Spark? |
We could try to change the Spark behavior, but my guess is that we would still have to support both ways. At least while the change rolls out as this would be a breaking change and likely have to go into Spark 4.0. Also, in enterprise software, like this, we would likely have to include a flag so users can get the old behavior if they rely on it. Because the default values are self consistent I am hoping that we can cover 99% of the use cases without having to split apart the struct. |
This issue has been labeled |
This is being actively worked on in #7422. |
closes #7226 Add struct column support to `cudf::sort`, `cudf::sorted_order` struct is supported by flattening the struct into individual columns in table_view, null mask of struct is converted to boolean column with same null_mask. Authors: - Karthikeyan (@karthikeyann) Approvers: - Gera Shegalov (@gerashegalov) - David (@davidwendt) - Nghia Truong (@ttnghia) - Jake Hemstad (@jrhemstad) - Conor Hoekstra (@codereport) URL: #7422
Is your feature request related to a problem? Please describe.
Allow cudf sort function to take one or more sorting keys that are nested types like lists and/or structs.
Describe the solution you'd like
sort(col(my_struct_col)))
should return a column vector that is sorted ascending or descending with nulls first or last.The text was updated successfully, but these errors were encountered: