[FEA] Implement drop_list_duplicates #7494

ttnghia · 2021-03-03T00:42:08Z

Hi there!

I'm implementing drop_list_duplicates for removing duplicated entries in the list columns. It is at an early stage with my idea below. If anybody has any idea, please let me know. Thanks in advance :)

This FEA also addresses #7414.

Signature:

/**
 * @brief Create a new column by removing duplicated elements from each list in the given
 * lists_column
 *
 * Given an `input` list_column_view, the list elements in the column are copied to an output column
 * such that their duplicated entries are dropped. Adopted from stream compaction, the definition of
 * uniqueness depends on the value of @p keep:
 * - KEEP_FIRST: only the first of a sequence of duplicate entries is copied
 * - KEEP_LAST: only the last of a sequence of duplicate entries is copied
 * - KEEP_NONE: all duplicate entries are dropped out
 *
 * @param[in] lists_column    the input lists_column_view
 * @param[in] keep            keep first entry, last entry, or no entries if duplicates found
 * @param[in] nulls_equal     flag to denote nulls are considered equal
 * @param[in] mr              Device memory resource used to allocate the returned column
 *
 *  * @code{.pseudo}
 * input = { {1, 1, 2, 1, 3}, {4}, {5, 6, 6, 6, 5} }
 * if keep is KEEP_FIRST, the output will be { {1, 2, 3}, {4}, {5, 6} }
 * if keep is KEEP_LAST, the output will be  { {2, 1, 3}, {4}, {6, 5} }
 * if keep is KEEP_NONE, the output will be  { {2, 3}, {4}, {} }
 * @endcode
 *
 * @return A list column with lists having unique entries as specified by the `keep` policy.
 */
std::unique_ptr<column> drop_list_duplicates(
  lists_column_view const& lists_column,
  duplicate_keep_option keep,
  null_equality nulls_equal           = null_equality::EQUAL,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

And algorithm:

   /*
     * Given input = { {1, 1, 2, 1, 3}, {4}, {5, 6, 6, 6, 5} }
     *  - if keep is KEEP_FIRST, the output will be { {1, 2, 3}, {4}, {5, 6} }
     *  - if keep is KEEP_LAST, the output will be  { {2, 1, 3}, {4}, {6, 5} }
     *  - if keep is KEEP_NONE, the output will be  { {2, 3}, {4}, {} }
     *
     * 1. Generate ordered indices for each list element
     *   ordered_indices = { {0, 1, 2, 3, 4}, {5}, {6, 7, 8, 9, 10} }
     *
     * 2. Call sort_list on the lists and indices using the list entries as keys
     *   sorted_lists = { {1, 1, 1, 2, 3}, {4}, {5, 5, 6, 6, 6} }, and
     *   sorted_indices = { {0, 1, 3, 2, 4}, {5}, {6, 10, 7, 8, 9} }
     *
     * 3. Remove list indices if the list entries are duplicated
     *  - with keep is KEEP_FIRST: sorted_unique_indices = { {0, 2, 4}, {5}, {6, 7} }
     *  - with keep is KEEP_LAST:  sorted_unique_indices = { {3, 2, 4}, {5}, {10, 9} }
     *  - with keep is KEEP_NONE:  sorted_unique_indices = { {2, 4}, {5}, {} }
     *
     * 4. Call sort_lists on the sorted_unique_indices to obtain the final list indices
     *  - with keep is KEEP_FIRST: sorted_unique_indices = { {0, 2, 4}, {5}, {6, 7} }
     *  - with keep is KEEP_LAST:  sorted_unique_indices = { {2, 3, 4}, {5}, {9, 10} }
     *  - with keep is KEEP_NONE:  sorted_unique_indices = { {2, 4}, {5}, {} }
     *
     * 5. Gather list entries using the sorted_unique_indices as gather map
     *   (remember to deal with null elements)
     */

From step 2, if we generate sorted_lists, we will consume more memory but the step 3 will be faster. If we don't generate it, we can access the original list column in step 3 but the access pattern may be bad and the step 3 will be slower.

For comparing list entries, I propose to use row_lexicographic_comparator.

Corner cases:

null entries in a list: depending on the nulls_equal policy, if it is set to EQUAL then only one null entry is kept. Which null entry to keep---it is specified by the keep policy.
null rows: just return null rows, nothing changes.
NaN entries in a list: NaNs should be treated as equal, thus only one NaN value is kept. Again, which value to keep depends on the keep policy.
Nested types are not supported---the function should throw logic_error.

Note that, if we don't care about ordering of the entries, then the keep policy may be dropped out completely from the API.

The text was updated successfully, but these errors were encountered:

harrism · 2021-03-03T00:47:18Z

I don't understand this issue. It's marked "doc", but seems to be an implementation question or request for review.

If it's a feature request, please make it one, and ask for review in a PR.

mythrocks · 2021-03-03T07:23:47Z

Hello, @harrism. Yep, this should be an [FEA] issue.
The idea was to document the behaviour of drop_list_duplicates() and possibly the approach, before actually implementing it.

@ttnghia, it would be good to make mention of the expected behaviour for corner cases such as:

null list elements
null list rows (probably just return a null row)
NaN elements for float/double
Nested types (which are likely out of scope for the moment)

ttnghia · 2021-03-03T16:21:47Z

Sure. I've updated those in the post description.

BTW, how to efficiently detect if a list row is a nested list?

mythrocks · 2021-03-03T18:22:34Z

I have taken a closer look at the proposed operation. The requirement from Spark is to drop all but one copy of the repeated element. A couple of questions on that front:

KEEP_NONE seems very niche. I can't think of an array/set operation in Spark that would use this feature. Is there something I should know about here that Pandas offers?
KEEP_FIRST vs KEEP_LAST: Why are these semantically different? If two array elements are deemed equal, what difference would it make to return the first or the last?

mythrocks · 2021-03-03T18:24:37Z

BTW, how to efficiently detect if a list row is a nested list?

Again, I'm not sure I've understood the question. The type of lists_column_view.child() would inform you of this. What do we mean by "efficiently" here?

ttnghia · 2021-03-03T18:25:37Z

The keep policy here is taken directly from stream compaction. Thus I think we should enforce consistency. Of course, we can remove that KEEP_NONE option so we always have one entry always being kept.
As you can see from my example, keep first and keep last would produce different results if the list order matters.

ttnghia · 2021-03-03T18:43:38Z

Should we add a boolean flag order_matters to the API?

Only if order matters, then we will need the additional keep policy.
If order doesn't matter, we can ignore the keep policy and also don't need to do the step 4 above. The result will be just a list of unique entries from the original list in sorted order.

jrhemstad · 2021-03-03T18:47:54Z

Should we add a boolean flag order_matters to the API?

Only if order matters, then we will need the additional keep policy.

If order doesn't matter, we can ignore the keep policy and also don't need to do the step 4 above.

I detest boolean flags in APIs. If you want, you can default duplicate_keep_option to one of the options

mythrocks · 2021-03-03T19:13:35Z

Thanks for clarifying, @ttnghia. I'm trying to reason out whether consistency between drop_duplicates(table_view...) and drop_list_duplicates(list_column_view) is necessary. The former may produce fewer rows than were input; the latter does not. But I see the resemblance. Hmm.

If you want, you can default duplicate_keep_option to one of the options

Appreciated, @jrhemstad. The issue, though, is that if the order does not matter, we might go faster since we don't need to track the indices.
I wonder if it makes sense to split this into two APIs/overloads:

One that does not take a duplicate_keep_option, and does not guarantee output order: Spark will be among those that use this.
One that takes duplicate_keep_option, and guarantees output order via index tracking: I'm not clear who might use this.

ttnghia · 2021-03-07T22:07:56Z

A PR for drop_list_duplicates has been submitted. Note that the actual implementation is very different from the algorithm proposed above.

@ttnghia

Closes #7494 and partially addresses #7414. This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from lists column. The result is a new lists column in which each list row contains only unique entries. By current implementation, the output lists will have entries sorted by ascending order (null(s) last). Example with null_equality=EQUAL: ``` input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} } output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} } ``` Example with null_equality=UNEQUAL: ``` input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} } output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} } ``` Authors: - Nghia Truong (@ttnghia) Approvers: - AJ Schmidt (@ajschmidt8) - @nvdbaranec - David (@davidwendt) - Keith Kraus (@kkraus14) URL: #7528

@ttnghia

Closes rapidsai#7494 and partially addresses rapidsai#7414. This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from lists column. The result is a new lists column in which each list row contains only unique entries. By current implementation, the output lists will have entries sorted by ascending order (null(s) last). Example with null_equality=EQUAL: ``` input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} } output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} } ``` Example with null_equality=UNEQUAL: ``` input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} } output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} } ``` Authors: - Nghia Truong (@ttnghia) Approvers: - AJ Schmidt (@ajschmidt8) - @nvdbaranec - David (@davidwendt) - Keith Kraus (@kkraus14) URL: rapidsai#7528

ttnghia added doc Documentation Needs Triage Need team to review and classify labels Mar 3, 2021

ttnghia added libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request labels Mar 3, 2021

ttnghia changed the title ~~[DOC] Implement drop_list_duplicate~~ [DOC] Implement drop_list_duplicates Mar 3, 2021

harrism removed 4 - Needs Review Waiting for reviewer to review or respond doc Documentation labels Mar 3, 2021

harrism changed the title ~~[DOC] Implement drop_list_duplicates~~ [FEA] Implement drop_list_duplicates Mar 3, 2021

kkraus14 removed the Needs Triage Need team to review and classify label Mar 5, 2021

ttnghia mentioned this issue Mar 7, 2021

Implement drop_list_duplicates #7528

Merged

kkraus14 added the Python Affects Python cuDF API. label Mar 10, 2021

rapids-bot bot closed this as completed in #7528 Mar 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Implement drop_list_duplicates #7494

[FEA] Implement drop_list_duplicates #7494

ttnghia commented Mar 3, 2021 •

edited

Loading

harrism commented Mar 3, 2021

mythrocks commented Mar 3, 2021

ttnghia commented Mar 3, 2021 •

edited

Loading

mythrocks commented Mar 3, 2021

mythrocks commented Mar 3, 2021

ttnghia commented Mar 3, 2021 •

edited

Loading

ttnghia commented Mar 3, 2021 •

edited

Loading

jrhemstad commented Mar 3, 2021

mythrocks commented Mar 3, 2021 •

edited

Loading

ttnghia commented Mar 7, 2021

[FEA] Implement drop_list_duplicates #7494

[FEA] Implement drop_list_duplicates #7494

Comments

ttnghia commented Mar 3, 2021 • edited Loading

harrism commented Mar 3, 2021

mythrocks commented Mar 3, 2021

ttnghia commented Mar 3, 2021 • edited Loading

mythrocks commented Mar 3, 2021

mythrocks commented Mar 3, 2021

ttnghia commented Mar 3, 2021 • edited Loading

ttnghia commented Mar 3, 2021 • edited Loading

jrhemstad commented Mar 3, 2021

mythrocks commented Mar 3, 2021 • edited Loading

ttnghia commented Mar 7, 2021

ttnghia commented Mar 3, 2021 •

edited

Loading

ttnghia commented Mar 3, 2021 •

edited

Loading

ttnghia commented Mar 3, 2021 •

edited

Loading

ttnghia commented Mar 3, 2021 •

edited

Loading

mythrocks commented Mar 3, 2021 •

edited

Loading