Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Sort list columns #5890

Closed
shwina opened this issue Aug 7, 2020 · 7 comments
Closed

[FEA] Sort list columns #5890

shwina opened this issue Aug 7, 2020 · 7 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@shwina
Copy link
Contributor

shwina commented Aug 7, 2020

It would be nice to be able to sort list columns. The use case I have encountered so far is in testing equality of two list columns with identical elements, but ordered differently, as occurs in a groupby COLLECT.

What does it mean to sort a list column?

Python does a "lexical" comparison between lists:

>>> compare([1, 2], [1, 2])
[1, 2] == [1, 2]

>>> compare([1, 2], [1, 2, 3])
[1, 2] < [1, 2, 3]

>>> compare([1, 2], [1, 1, 3])
[1, 2] > [1, 1, 3]

Accepting the above, it becomes easy to reason about comparing arbitrarily nested lists:

>>> compare([[[1, 2], [1, 3]], [[4, 5]]], [[[1, 2], [1, 3]], [[4, 6]]])
[[[1, 2], [1, 3]], [[4, 5]]] < [[[1, 2], [1, 3]], [[4, 6]]]

>>> compare([[[1, 2], [1, 4]], [[4, 5]]], [[[1, 2], [1, 3]], [[4, 6]]])
[[[1, 2], [1, 4]], [[4, 5]]] > [[[1, 2], [1, 3]], [[4, 6]]]

I wonder if Spark users would have similar expectations for sorting list columns. cc: @nvdbaranec @jlowe @jrhemstad @kkraus14 @harrism

@shwina shwina added feature request New feature or request Needs Triage Need team to review and classify labels Aug 7, 2020
@shwina shwina added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 7, 2020
@kkraus14
Copy link
Collaborator

kkraus14 commented Aug 8, 2020

cc @revans2 from the Spark side as well
cc @felipeblazing @williamBlazing for if you have thoughts / opinions from the BlazingSQL side

@jrhemstad
Copy link
Contributor

So even if we decide on what it means to sort a list of columns, implementing this in the context of a multi-column sort is going to be very difficult. What's more, I'm concerned adding the logic to the row_comparator could cause a performance regression in multi-column sorting even in the absence of list columns.

@jlowe
Copy link
Member

jlowe commented Aug 10, 2020

Spark has a similar comparison for list types. There may be some discrepancies when null entries within a list are encountered. Spark treats a null entry within a list as less than any other value in the other list except another null which is treated as equal.

I'm concerned adding the logic to the row_comparator could cause a performance regression in multi-column sorting even in the absence of list columns.

Can you elaborate? I'm curious if this will be like NaN handling, where the performance hit was negligible unless NaNs appeared in the data.

@jrhemstad
Copy link
Contributor

Can you elaborate? I'm curious if this will be like NaN handling, where the performance hit was negligible unless NaNs appeared in the data.

It will be similar to the AST interpretation Brad is doing where adding increasingly complex and nested switch statements eventually causes the compiler to balk and starts putting things on the stack frame or refusing to inline things, which significantly reduces performance. Even if list columns aren't present, the increased complexity of the comparator could very well reduce performance.

@github-actions
Copy link

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@nvdbaranec
Copy link
Contributor

Still relevant. And I believe we will be starting to investigate in the somewhat-near future.

@GregoryKimball
Copy link
Contributor

Closed by #11129

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

6 participants