Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Implement drop_list_duplicates #7494

Closed
ttnghia opened this issue Mar 3, 2021 · 10 comments · Fixed by #7528
Closed

[FEA] Implement drop_list_duplicates #7494

ttnghia opened this issue Mar 3, 2021 · 10 comments · Fixed by #7528
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Mar 3, 2021

Hi there!

I'm implementing drop_list_duplicates for removing duplicated entries in the list columns. It is at an early stage with my idea below. If anybody has any idea, please let me know. Thanks in advance :)

This FEA also addresses #7414.

Signature:

/**
 * @brief Create a new column by removing duplicated elements from each list in the given
 * lists_column
 *
 * Given an `input` list_column_view, the list elements in the column are copied to an output column
 * such that their duplicated entries are dropped. Adopted from stream compaction, the definition of
 * uniqueness depends on the value of @p keep:
 * - KEEP_FIRST: only the first of a sequence of duplicate entries is copied
 * - KEEP_LAST: only the last of a sequence of duplicate entries is copied
 * - KEEP_NONE: all duplicate entries are dropped out
 *
 * @param[in] lists_column    the input lists_column_view
 * @param[in] keep            keep first entry, last entry, or no entries if duplicates found
 * @param[in] nulls_equal     flag to denote nulls are considered equal
 * @param[in] mr              Device memory resource used to allocate the returned column
 *
 *  * @code{.pseudo}
 * input = { {1, 1, 2, 1, 3}, {4}, {5, 6, 6, 6, 5} }
 * if keep is KEEP_FIRST, the output will be { {1, 2, 3}, {4}, {5, 6} }
 * if keep is KEEP_LAST, the output will be  { {2, 1, 3}, {4}, {6, 5} }
 * if keep is KEEP_NONE, the output will be  { {2, 3}, {4}, {} }
 * @endcode
 *
 * @return A list column with lists having unique entries as specified by the `keep` policy.
 */
std::unique_ptr<column> drop_list_duplicates(
  lists_column_view const& lists_column,
  duplicate_keep_option keep,
  null_equality nulls_equal           = null_equality::EQUAL,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

And algorithm:

   /*
     * Given input = { {1, 1, 2, 1, 3}, {4}, {5, 6, 6, 6, 5} }
     *  - if keep is KEEP_FIRST, the output will be { {1, 2, 3}, {4}, {5, 6} }
     *  - if keep is KEEP_LAST, the output will be  { {2, 1, 3}, {4}, {6, 5} }
     *  - if keep is KEEP_NONE, the output will be  { {2, 3}, {4}, {} }
     *
     * 1. Generate ordered indices for each list element
     *   ordered_indices = { {0, 1, 2, 3, 4}, {5}, {6, 7, 8, 9, 10} }
     *
     * 2. Call sort_list on the lists and indices using the list entries as keys
     *   sorted_lists = { {1, 1, 1, 2, 3}, {4}, {5, 5, 6, 6, 6} }, and
     *   sorted_indices = { {0, 1, 3, 2, 4}, {5}, {6, 10, 7, 8, 9} }
     *
     * 3. Remove list indices if the list entries are duplicated
     *  - with keep is KEEP_FIRST: sorted_unique_indices = { {0, 2, 4}, {5}, {6, 7} }
     *  - with keep is KEEP_LAST:  sorted_unique_indices = { {3, 2, 4}, {5}, {10, 9} }
     *  - with keep is KEEP_NONE:  sorted_unique_indices = { {2, 4}, {5}, {} }
     *
     * 4. Call sort_lists on the sorted_unique_indices to obtain the final list indices
     *  - with keep is KEEP_FIRST: sorted_unique_indices = { {0, 2, 4}, {5}, {6, 7} }
     *  - with keep is KEEP_LAST:  sorted_unique_indices = { {2, 3, 4}, {5}, {9, 10} }
     *  - with keep is KEEP_NONE:  sorted_unique_indices = { {2, 4}, {5}, {} }
     *
     * 5. Gather list entries using the sorted_unique_indices as gather map
     *   (remember to deal with null elements)
     */

From step 2, if we generate sorted_lists, we will consume more memory but the step 3 will be faster. If we don't generate it, we can access the original list column in step 3 but the access pattern may be bad and the step 3 will be slower.

For comparing list entries, I propose to use row_lexicographic_comparator.

Corner cases:

  • null entries in a list: depending on the nulls_equal policy, if it is set to EQUAL then only one null entry is kept. Which null entry to keep---it is specified by the keep policy.
  • null rows: just return null rows, nothing changes.
  • NaN entries in a list: NaNs should be treated as equal, thus only one NaN value is kept. Again, which value to keep depends on the keep policy.
  • Nested types are not supported---the function should throw logic_error.

Note that, if we don't care about ordering of the entries, then the keep policy may be dropped out completely from the API.

@ttnghia ttnghia added doc Documentation Needs Triage Need team to review and classify labels Mar 3, 2021
@ttnghia ttnghia added libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request labels Mar 3, 2021
@ttnghia ttnghia changed the title [DOC] Implement drop_list_duplicate [DOC] Implement drop_list_duplicates Mar 3, 2021
@harrism
Copy link
Member

harrism commented Mar 3, 2021

I don't understand this issue. It's marked "doc", but seems to be an implementation question or request for review.

If it's a feature request, please make it one, and ask for review in a PR.

@harrism harrism removed 4 - Needs Review Waiting for reviewer to review or respond doc Documentation labels Mar 3, 2021
@harrism harrism changed the title [DOC] Implement drop_list_duplicates [FEA] Implement drop_list_duplicates Mar 3, 2021
@mythrocks
Copy link
Contributor

Hello, @harrism. Yep, this should be an [FEA] issue.
The idea was to document the behaviour of drop_list_duplicates() and possibly the approach, before actually implementing it.

@ttnghia, it would be good to make mention of the expected behaviour for corner cases such as:

  1. null list elements
  2. null list rows (probably just return a null row)
  3. NaN elements for float/double
  4. Nested types (which are likely out of scope for the moment)

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 3, 2021

Sure. I've updated those in the post description.

BTW, how to efficiently detect if a list row is a nested list?

@mythrocks
Copy link
Contributor

I have taken a closer look at the proposed operation. The requirement from Spark is to drop all but one copy of the repeated element. A couple of questions on that front:

  1. KEEP_NONE seems very niche. I can't think of an array/set operation in Spark that would use this feature. Is there something I should know about here that Pandas offers?
  2. KEEP_FIRST vs KEEP_LAST: Why are these semantically different? If two array elements are deemed equal, what difference would it make to return the first or the last?

@mythrocks
Copy link
Contributor

BTW, how to efficiently detect if a list row is a nested list?

Again, I'm not sure I've understood the question. The type of lists_column_view.child() would inform you of this. What do we mean by "efficiently" here?

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 3, 2021

  1. The keep policy here is taken directly from stream compaction. Thus I think we should enforce consistency. Of course, we can remove that KEEP_NONE option so we always have one entry always being kept.
  2. As you can see from my example, keep first and keep last would produce different results if the list order matters.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 3, 2021

Should we add a boolean flag order_matters to the API?

  • Only if order matters, then we will need the additional keep policy.
  • If order doesn't matter, we can ignore the keep policy and also don't need to do the step 4 above. The result will be just a list of unique entries from the original list in sorted order.

@jrhemstad
Copy link
Contributor

Should we add a boolean flag order_matters to the API?

  • Only if order matters, then we will need the additional keep policy.
  • If order doesn't matter, we can ignore the keep policy and also don't need to do the step 4 above.

I detest boolean flags in APIs. If you want, you can default duplicate_keep_option to one of the options

@mythrocks
Copy link
Contributor

mythrocks commented Mar 3, 2021

Thanks for clarifying, @ttnghia. I'm trying to reason out whether consistency between drop_duplicates(table_view...) and drop_list_duplicates(list_column_view) is necessary. The former may produce fewer rows than were input; the latter does not. But I see the resemblance. Hmm.

If you want, you can default duplicate_keep_option to one of the options

Appreciated, @jrhemstad. The issue, though, is that if the order does not matter, we might go faster since we don't need to track the indices.
I wonder if it makes sense to split this into two APIs/overloads:

  1. One that does not take a duplicate_keep_option, and does not guarantee output order: Spark will be among those that use this.
  2. One that takes duplicate_keep_option, and guarantees output order via index tracking: I'm not clear who might use this.

@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Mar 5, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 7, 2021

A PR for drop_list_duplicates has been submitted. Note that the actual implementation is very different from the algorithm proposed above.

@kkraus14 kkraus14 added the Python Affects Python cuDF API. label Mar 10, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 12, 2021
Closes #7494 and partially addresses #7414.

This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from lists column. The result is a new lists column in which each list row contains only unique entries. By current implementation, the output lists will have entries sorted by ascending order (null(s) last).

Example with null_equality=EQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }

```

Example with null_equality=UNEQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} }

```

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - @nvdbaranec
  - David (@davidwendt)
  - Keith Kraus (@kkraus14)

URL: #7528
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue Mar 25, 2021
Closes rapidsai#7494 and partially addresses rapidsai#7414.

This is the new implementation for `drop_list_duplicates`, which removes duplicated entries from lists column. The result is a new lists column in which each list row contains only unique entries. By current implementation, the output lists will have entries sorted by ascending order (null(s) last).

Example with null_equality=EQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }

```

Example with null_equality=UNEQUAL:
```
input: { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
output: { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL, NULL, NULL} }

```

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - @nvdbaranec
  - David (@davidwendt)
  - Keith Kraus (@kkraus14)

URL: rapidsai#7528
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants