Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Segmented Sorting #4603

Closed
jrhemstad opened this issue Mar 19, 2020 · 6 comments · Fixed by #7122
Closed

[FEA] Segmented Sorting #4603

jrhemstad opened this issue Mar 19, 2020 · 6 comments · Fixed by #7122
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Copy link
Contributor

Is your feature request related to a problem? Please describe.

Given a table and a set of offsets that demarcate "segments" of the table, I would like to be able to sort each segment.

Describe the solution you'd like

Implement a segmented_sort API like:

unique_ptr<table> segmented_sort(table_view table_to_sort, vector<size_type> key_columns, vector<size_type> segment_offsets, vector<order> column_order, vector<null_order> null_orders);

Describe alternatives you've considered
Alternatively, you can use slice to get a set of views and sort each view separately, but that's inefficient.

Additional context
I thought of this while testing partition in #4472. Since the ordering per partition is non-deterministic, it is difficult to unit test when there are multiple values per partition.

Unfortunately, CUB's segmented sort can't be used because we need a comparison sort for the lexicographic row ordering.

Instead, a straightforward way to implement this:

  1. Prepend an integral dummy column of 0s to table_to_sort
  2. Scatter 1s to every location indicated by segment_offsets
  3. Compute exclusive_scan of dummy column
  4. Do a lexicographic sort of dummy column + table_to_sort
@jrhemstad jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Mar 19, 2020
@harrism
Copy link
Member

harrism commented Mar 20, 2020

Sorry, can you clarify: is the present use case just for unit testing?

@jrhemstad
Copy link
Contributor Author

Sorry, can you clarify: is the present use case just for unit testing?

Yep, not high priority. I imagine it will be useful outside of unit testing since it's a somewhat common operation, but I don't have any other immediate need.

I just wanted to capture the idea while I was thinking of it.

@kkraus14
Copy link
Collaborator

I believe there's a desire to sort as a groupby operation in Python. I.E. something like groupby columns 0 and 1 of the input Table and then sort by columns 2 and 4 within each group.

@harrism
Copy link
Member

harrism commented Mar 20, 2020

But isn't that case just a lexicographic sort by columns 0, 1, 2, 4?

@devavret
Copy link
Contributor

Feels like a generalization of sort_helper's sorted_values but for table values instead.

@kkraus14
Copy link
Collaborator

But isn't that case just a lexicographic sort by columns 0, 1, 2, 4?

Yes it is, but there's no easy way with the current APIs to compose that because I may be asking for additional aggregations as well where currently I'd have to redo the work of sorting by columns 0 and 1 in order to merge the two.

@jrhemstad jrhemstad mentioned this issue Jan 26, 2021
6 tasks
@karthikeyann karthikeyann self-assigned this Jan 30, 2021
@rapids-bot rapids-bot bot closed this as completed in #7122 Feb 4, 2021
rapids-bot bot pushed a commit that referenced this issue Feb 4, 2021
addresses part of #6541 Segment sort of lists

- [x] lists_column_view segmented_sort
- [x] numerical types (cub segmented sort limitation)
- [x] sort_lists(table_view)
- [x] unit tests

closes  #4603 Segmented sort
- [x] segmented_sort
- [x] unit tests.

Authors:
  - Karthikeyan (@karthikeyann)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - Keith Kraus (@kkraus14)
  - Jake Hemstad (@jrhemstad)
  - Conor Hoekstra (@codereport)

URL: #7122
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants