[FEA] Implement drop_list_duplicates #7494
Comments
I don't understand this issue. It's marked "doc", but seems to be an implementation question or request for review. If it's a feature request, please make it one, and ask for review in a PR.
Hello, @harrism. Yep, this should be an [FEA].

@ttnghia, it would be good to make mention of the expected behaviour for corner cases such as:
Sure. I've updated those in the post description. BTW, how do we efficiently detect if a list row is a nested list?
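(For what it's worth, one way to answer the nested-list question is to check the column type rather than individual rows. The sketch below is not from this thread; it assumes libcudf's `lists_column_view` API.)

```cpp
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/types.hpp>

// A lists column is "nested" when its child column is itself a LIST column,
// so the check can be done once per column instead of once per row.
bool has_nested_lists(cudf::lists_column_view const& lists)
{
  return lists.child().type().id() == cudf::type_id::LIST;
}
```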
I have taken a closer look at the proposed operation. The requirement from Spark is to drop all but one copy of each repeated element. A couple of questions on that front:
Again, I'm not sure I've understood the question. The type of …

Should we add a boolean flag …?
I detest boolean flags in APIs. If you want, you can default …
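The comment above is cut off, but the general suggestion (prefer a defaulted, strongly-typed enum over a raw boolean flag) could look roughly like the sketch below; the enum and parameter names here are illustrative, not from the actual proposal:

```cpp
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>

#include <memory>

// Hypothetical enum standing in for a bool flag; callers read
// output_order::SORTED instead of a bare `true` at the call site.
enum class output_order { UNSORTED, SORTED };

std::unique_ptr<cudf::column> drop_list_duplicates(
  cudf::lists_column_view const& lists_column,
  output_order order = output_order::SORTED);  // defaulted, so most callers can omit it
```

At a call site this reads as `drop_list_duplicates(input, output_order::UNSORTED)` rather than `drop_list_duplicates(input, false)`, and callers happy with the default can omit the argument entirely.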
Thanks for clarifying, @ttnghia. I'm trying to reason out whether consistency between …
Appreciated, @jrhemstad. The issue, though, is that if the order does not matter, we might go faster since we don't need to track the indices.
A PR for …
Closes #7494 and partially addresses #7414. This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from a lists column. The result is a new lists column in which each list row contains only unique entries. In the current implementation, the output lists have their entries sorted in ascending order (nulls last).

Example with `null_equality=EQUAL`:

```
input:  { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
```

Example with `null_equality=UNEQUAL`:

```
input:  { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} }
```

Authors:
- Nghia Truong (@ttnghia)

Approvers:
- AJ Schmidt (@ajschmidt8)
- @nvdbaranec
- David (@davidwendt)
- Keith Kraus (@kkraus14)

URL: #7528
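As a hedged usage sketch (not part of the PR text), assuming the merged API is `cudf::lists::drop_list_duplicates(lists_column_view, null_equality, mr)` declared in `cudf/lists/drop_list_duplicates.hpp` (the header path and default arguments are assumptions here), and using libcudf's test wrappers to build the input:

```cpp
#include <cudf_test/column_wrapper.hpp>

#include <cudf/lists/drop_list_duplicates.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/types.hpp>

void drop_duplicates_example()
{
  using lists_col = cudf::test::lists_column_wrapper<int32_t>;

  // input: { {1, 1, 2, 1, 3}, {4} }
  auto const input = lists_col{{1, 1, 2, 1, 3}, {4}};

  // Each output row keeps only unique entries, sorted ascending:
  // { {1, 2, 3}, {4} }
  auto const result = cudf::lists::drop_list_duplicates(
    cudf::lists_column_view{input}, cudf::null_equality::EQUAL);
}
```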
Hi there!

I'm implementing `drop_list_duplicates` for removing duplicated entries in list columns. It is at an early stage with my idea below. If anybody has any idea, please let me know. Thanks in advance :)

This FEA also addresses #7414.
Signature:
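The code block that originally followed did not survive being copied here. Based on the parameters discussed below (the `nulls_equal` and `keep` policies), a guessed reconstruction, not the actual proposed signature, might look like:

```cpp
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/stream_compaction.hpp>  // duplicate_keep_option
#include <cudf/types.hpp>              // null_equality

#include <rmm/mr/device/per_device_resource.hpp>

#include <memory>

// Hypothetical reconstruction; parameter names and defaults are guesses.
std::unique_ptr<cudf::column> drop_list_duplicates(
  cudf::lists_column_view const& lists_column,
  cudf::null_equality nulls_equal     = cudf::null_equality::EQUAL,
  cudf::duplicate_keep_option keep    = cudf::duplicate_keep_option::KEEP_FIRST,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```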
And algorithm:
From step 2, if we generate `sorted_lists`, we will consume more memory but step 3 will be faster. If we don't generate it, we can access the original lists column in step 3, but the access pattern may be bad and step 3 will be slower.

For comparing list entries, I propose to use `row_lexicographic_comparator`.
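The numbered step list that originally appeared under "And algorithm:" was lost in this copy, but the sort-then-deduplicate idea it describes can be sketched with plain Thrust as below. This is only an illustration under simplifying assumptions (fixed-width entries, no nulls, no `keep` policy); `row_ids` is a hypothetical flattened representation mapping each entry to its list row:

```cpp
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>
#include <thrust/unique.h>

// Sort the entries within each list row, then drop adjacent duplicates per row.
// `entries` holds all list elements flattened into one array; `row_ids[i]` is
// the index of the list row that entries[i] belongs to.
void sort_and_drop_duplicates(thrust::device_vector<int>& row_ids,
                              thrust::device_vector<int>& entries)
{
  // Two stable sorts emulate a segmented sort: first order by value,
  // then (stably) by row id, so values end up sorted within each row.
  thrust::stable_sort_by_key(entries.begin(), entries.end(), row_ids.begin());
  thrust::stable_sort_by_key(row_ids.begin(), row_ids.end(), entries.begin());

  // Duplicates within a row are now adjacent and share the same
  // (row_id, entry) pair, so a single pass of unique removes them.
  auto zipped  = thrust::make_zip_iterator(thrust::make_tuple(row_ids.begin(), entries.begin()));
  auto new_end = thrust::unique(zipped, zipped + entries.size());

  auto const num_unique = thrust::distance(zipped, new_end);
  row_ids.resize(num_unique);
  entries.resize(num_unique);
  // A real implementation would also recompute the list offsets from row_ids
  // to rebuild the output lists column.
}
```

Whether the sorted copy is materialized (as `sorted_lists`) or recomputed on the fly is exactly the memory-versus-access-pattern trade-off described above.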
Corner cases:
- Null entries: depending on the `nulls_equal` policy, if it is set to `EQUAL` then only one `null` entry is kept. Which null entry to keep is specified by the `keep` policy.
- … the `keep` policy.
- Nested lists: a `logic_error` is thrown.

Note that, if we don't care about the ordering of the entries, then the `keep` policy may be dropped completely from the API.