Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Segmented sort #7122

Merged
merged 35 commits into from
Feb 4, 2021

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Jan 12, 2021

addresses part of #6541 Segment sort of lists

  • lists_column_view segmented_sort
  • numerical types (cub segmented sort limitation)
  • sort_lists(table_view)
  • unit tests

closes #4603 Segmented sort

  • segmented_sort
  • unit tests.

@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Jan 12, 2021
@karthikeyann karthikeyann self-assigned this Jan 12, 2021
cpp/include/cudf/lists/sorting.hpp Outdated Show resolved Hide resolved
cpp/src/lists/segmented_sort.cu Show resolved Hide resolved
CUDF_EXPECTS(std::all_of(keys.begin(),
keys.end(),
[](column_view const& col) { return col.type().id() == type_id::LIST; }),
"segmented_sort only supports lists columns");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I'd think I'd should be able to do a segmented sort of a normal column. In fact, I filed an issue asking for exactly that ages ago: #4603

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#6541 requested for list columns. This will be renamed to sort_lists and this check will remain.
segmented_sort of normal column/table will be exposed as per #4603 specifications.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implemented segmented_sort as well.

@codecov
Copy link

codecov bot commented Jan 27, 2021

Codecov Report

Merging #7122 (2761626) into branch-0.18 (8860baf) will increase coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #7122      +/-   ##
===============================================
+ Coverage        82.09%   82.20%   +0.11%     
===============================================
  Files               97      100       +3     
  Lines            16474    16952     +478     
===============================================
+ Hits             13524    13936     +412     
- Misses            2950     3016      +66     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/avro.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/csv.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/json.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/orc.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/parquet.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/utils.py 0.00% <ø> (ø)
... and 83 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ecde9d...e0696ef. Read the comment docs.

@karthikeyann karthikeyann marked this pull request as ready for review January 28, 2021 20:29
@karthikeyann karthikeyann requested review from a team as code owners January 28, 2021 20:29
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Performs a lexicographic segmented sort of the list in each row of a table of list columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this function. How do you do a lexicographic sort of a table of list columns?

Copy link
Contributor Author

@karthikeyann karthikeyann Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this clear?
Performs a lexicographic sort of each list in a table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is it the same as calling sort independently on each column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. It will be similar to calling sort on each row of the table. Each row is a list.
The relative row order in the table remains same. In each row, the elements in list are sorted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's the same as calling sort_lists on each column in the table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. it's similar to treating each row as table, then sort it. then repeat it for all rows.

Copying the example from #6541

row# a b c
0 [21, 22, 23, 22] [13, 14, 12, 11] ["a", "b", "c", "d"]
1 [22, 21, 23, 22] [14, 13, 12, 11] ["a", "b", "c", "d"]

Here [...] is a list.

row# a b c
0
  • 21
  • 22
  • 23
  • 22
  • 13
  • 14
  • 12
  • 11
  • "a"
  • "b"
  • "c"
  • "d"
1
  • 22
  • 21
  • 23
  • 22
  • 14
  • 13
  • 12
  • 11
  • "a"
  • "b"
  • "c"
  • "d"

sort_lists(values={a,b,c}, keys={a,b}) output will be

row# a b c
0
  • 21
  • 22
  • 22
  • 23
  • 13
  • 11
  • 14
  • 12
  • "a"
  • "d"
  • "b"
  • "c"
1
  • 21
  • 22
  • 22
  • 23
  • 13
  • 11
  • 14
  • 12
  • "b"
  • "d"
  • "a"
  • "c"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that require all of the lists in a given row to have the same number of elements? What would the result be for this input?

row# a b c
0 [21, 22, 23, 22] [13, 14] ["a", "c", "d"]
1 [22, 21, 23, 22] [14] ["c", "d"]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
For this input, it will throw a logic error.

cpp/src/sort/segmented_sort.cu Outdated Show resolved Hide resolved
cpp/src/sort/segmented_sort.cu Outdated Show resolved Hide resolved
cpp/src/sort/segmented_sort.cu Outdated Show resolved Hide resolved
@karthikeyann karthikeyann changed the title Add Segmented sort for lists Add Segmented sort Feb 2, 2021
@karthikeyann
Copy link
Contributor Author

Unrelated style check failures.
FAILED: mypy style check
python/cudf/cudf/core/column/categorical.py:818: error: Item "dtype[Any]" of "Union[Any, dtype[Any]]" has no attribute "serialize"

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 2, 2021

Unrelated style check failures.
FAILED: mypy style check
python/cudf/cudf/core/column/categorical.py:818: error: Item "dtype[Any]" of "Union[Any, dtype[Any]]" has no attribute "serialize"

Being fixed in #7279

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level questions:

  • Why is sort_lists different from segmented_sort? Couldn't sort_lists just be implemented by calling segmented_sort on the data/offsets children of a list column?
  • This brings up the question, do we even need a sort_lists API? Couldn't a caller trivially just do segmented_sort(list.data, list.offsets)? This had the advantage of implicitly enforcing the "lists of depth 1" requirement.
  • The sort_lists table API still sticks out as very odd to me. I think from [FEA] Segmented sort / sorted_order (argsort) #6541 what @kkraus14 really wanted is just a table level segmented sort, not a lexicographic sort of lists of all the same length.

In summary, all I would think we need is:

  • segmented_sort_by_key(table_view values, table_view keys, column_view segment_offsets, ...)
  • segmented_sorted_order(table_view keys, column_view segment_offsets)

@karthikeyann
Copy link
Contributor Author

Had discussion with @jrhemstad

  • sort_lists has 2 functions/implementations.
    • sort_lists(column_view) for single column (faster method, uses cub segmented radix sort, fixed_width columns)
    • sort_lists(table_view) (this is implemented by calling segmented_sort_by_key(table_view) on the data/offsets children of a list column)
  • Caller can call segmented_sort_by_key, but caller needs to add offset to list.offsets, have to reconstruct the list columns of table after getting result, which is what sort_lists(table_view) does.
  • discussed with @jrhemstad

functions for segmented sorts and sorted_order (argsort) to sort the list elements within a column

sort_lists(table_view) is not needed and will be removed. sort_lists(column_view) is required and will be kept.

@harrism
Copy link
Member

harrism commented Feb 3, 2021

@karthikeyann can you get the requested changes done by tomorrow? If not we should push to 0.19.

@karthikeyann
Copy link
Contributor Author

@karthikeyann can you get the requested changes done by tomorrow? If not we should push to 0.19.

Yes. It will be done by today. By tomorrow, I will address re-review comments as well.

Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Feb 3, 2021
@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 4, 2021

rerun tests

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 4, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 369ec98 into rapidsai:branch-0.18 Feb 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Segmented Sorting
7 participants