-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement lists::drop_list_duplicates
for keys-values lists columns
#9345
Reimplement lists::drop_list_duplicates
for keys-values lists columns
#9345
Conversation
lists::drop_list_duplicates
for keys-values lists columnslists::drop_list_duplicates
for keys-values lists columns [skip ci]
Currently, stream compaction API `drop_duplicates` uses unstable sort for all of its internal sorting. This is wrong since we may want to have stable sorting results so we can choose to keep the first or the last duplicate element from the repeated sequences. This PR does two things: * Fixes the issue mentioned above by using stable sort if the input option is to keep the first or last duplicate element, and * Adds a new keep option into the enum class `duplicate_keep_option`: `KEEP_ANY`. This option allows the user to choose to keep one element from the repeated sequence at any position. Note that the issue did not show up by any failed test because thrust default (unstable) sort, which is called internally in `drop_duplicate`, still produces the same results as thrust stable sort most of the time (but this is not guaranteed). As such, the current `drop_duplicate` still produces correct results in its tests. This PR blocks #9345. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Bradley Dice (https://github.com/bdice) - MithunR (https://github.com/mythrocks) URL: #9417
…ithin each row of lists column (#9425) This PR adds `lists::stable_sort_lists` that can sort elements within rows of lists column using stable sort. This is necessary for implementing `lists::drop_list_duplicates` that operates on keys-values columns input when we want to remove the values corresponding to duplicate keys with `KEEP_FIRST` or `KEEP_LAST` option. In order to implement `lists::stable_sort_lists`, stable sort versions for the `segmented_sorted_order` and `segmented_sort_by_key` have also been implemented, which can maintain the order of equally-compared elements within segments. This PR blocks #9345. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Conor Hoekstra (https://github.com/codereport) - MithunR (https://github.com/mythrocks) URL: #9425
# Conflicts: # cpp/include/cudf/stream_compaction.hpp # cpp/src/groupby/sort/aggregate.cpp
} | ||
}; | ||
|
||
/** | ||
* @brief Struct used in type_dispatcher for comparing two entries in a lists column. | ||
*/ | ||
struct column_row_comparator_dispatch { | ||
offset_type const* const list_offsets; | ||
size_type const* const list_indices; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to be redundant with element_equality_comparator
. Likewise, the table_row_comparator_fn
looks to be redundant with row_equality_comparator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column_row_comparator_dispatch
is not the same as element_equality_comparator
. In particular, it allows to have different output depending on the nans_equal
parameter. element_equality_comparator
does not offer such feature. In addition,
Sorry that I didn't clarify it in @brief doxygen (now added).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a relatively minor thing to duplicate such a large bit of complicated code. Remind me why nans_equal
is necessary? For the most part, we've resigned to just always treat NaNs as equal in most libcudf functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a discussion before regarding that with Keith and decided to have an option allowing 2 different behaviors because:
- Pandas wants to have NaNs treated as equal
- Spark wants to have NaNs considered unequal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark wants to have NaNs considered unequal
Seriously? Most of the code in libcudf that's specialized to treat NaNs as equal was because of Spark. How wonderfully inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, Spark is "consistently inconsistent" thus in groupby it treats NaNs as equal for NaN keys (but not values). That's why we didn't need to modify groupby code for floating-point keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future work I'd like to see adding a static/dynamic option to the row/element comparator for controlling NaN equality so we don't have to maintain this code that is 99% redundant.
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9345 +/- ##
================================================
- Coverage 10.79% 10.68% -0.11%
================================================
Files 116 117 +1
Lines 18869 19872 +1003
================================================
+ Hits 2036 2123 +87
- Misses 16833 17749 +916
Continue to review full report at Codecov.
|
@gpucibot merge |
…mn (#9553) This PR adds JNI work for the new interface of `lists::drop_list_duplicates` that operates on keys-values input columns. It also does a small fix to remove an unused variable in `drop_list_duplicates.cu`. Blocked by #9345. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - MithunR (https://github.com/mythrocks) - Robert (Bobby) Evans (https://github.com/revans2) - David Wendt (https://github.com/davidwendt) URL: #9553
This PR changes the interface of
lists::drop_list_duplicates
such that it may accept a second (optional) inputvalues
lists column, and returns a pairs of lists columns containing the results of copying the input column without duplicate entries.If the optional
values
column is given, the users are responsible to have the keys-values columns having the same number of entries in each row. Otherwise, the results will be undefined.When copying the key entries, the corresponding value entries are also copied at the same time. A parameter
duplicate_keep_option
reused from stream compaction is used to specify which duplicate keys will be copying.This closes #9124, and blocked by #9425.