Reimplement lists::drop_list_duplicates for keys-values lists columns (#9345)

This PR changes the interface of `lists::drop_list_duplicates` so that it accepts an optional second input, a `values` lists column, and returns a pair of lists columns containing copies of the input columns with duplicate entries removed.

If the optional `values` column is given, the caller is responsible for ensuring that the keys and values columns have the same number of entries in each row. Otherwise, the results are undefined.
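Since mismatched rows lead to undefined results, a caller may want to validate inputs before calling the API. The following is only a host-side sketch of such a check, assuming the lists are available as nested `std::vector`s; `rows_have_matching_sizes` is a hypothetical helper and not part of cudf.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of cudf): returns true when every
// row of `keys` has the same number of entries as the matching row of `values`.
template <typename K, typename V>
bool rows_have_matching_sizes(std::vector<std::vector<K>> const& keys,
                              std::vector<std::vector<V>> const& values)
{
  // The two columns must have the same number of rows to begin with.
  if (keys.size() != values.size()) { return false; }

  // Each pair of rows must then have the same number of entries.
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (keys[i].size() != values[i].size()) { return false; }
  }
  return true;
}
```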

When key entries are copied, their corresponding value entries are copied along with them. A `duplicate_keep_option` parameter, reused from stream compaction, specifies which of the duplicate keys will be copied.
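The keep semantics for a single list row can be sketched on the host as follows. This is an illustration only, not the GPU implementation in this PR: the `keep_option` enum and `drop_duplicates_in_row` are hypothetical stand-ins, nulls and NaNs are not modeled, and the ascending output order comes from `std::map` rather than from any guarantee of the real API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Illustrative stand-in for cudf's duplicate_keep_option; not the library type.
enum class keep_option { KEEP_FIRST, KEEP_LAST, KEEP_NONE };

// Host-side reference for one list row: returns the unique keys (ascending)
// together with the value selected for each key according to `keep`.
template <typename K, typename V>
std::pair<std::vector<K>, std::vector<V>> drop_duplicates_in_row(
  std::vector<K> const& keys, std::vector<V> const& values, keep_option keep)
{
  // Precondition from the PR description: one value per key in the row.
  assert(keys.size() == values.size());

  // For each distinct key, record its first index, last index and count.
  struct slot {
    std::size_t first_idx;
    std::size_t last_idx;
    std::size_t count;
  };
  std::map<K, slot> seen;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto it = seen.find(keys[i]);
    if (it == seen.end()) {
      seen.emplace(keys[i], slot{i, i, 1});
    } else {
      it->second.last_idx = i;
      ++it->second.count;
    }
  }

  std::pair<std::vector<K>, std::vector<V>> out;
  for (auto const& entry : seen) {
    slot const& s = entry.second;
    // KEEP_NONE drops every key that occurs more than once.
    if (keep == keep_option::KEEP_NONE && s.count > 1) { continue; }
    std::size_t const idx = (keep == keep_option::KEEP_LAST) ? s.last_idx : s.first_idx;
    out.first.push_back(entry.first);
    out.second.push_back(values[idx]);
  }
  return out;
}
```

For the row `keys = {1, 1, 2, 3}` with `values = {a, b, c, d}`, this sketch yields values `{a, c, d}` for `KEEP_FIRST`, `{b, c, d}` for `KEEP_LAST`, and keys `{2, 3}` with values `{c, d}` for `KEEP_NONE`.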

This closes #9124 and is blocked by #9425.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - https://github.com/nvdbaranec

URL: #9345
ttnghia authored Nov 11, 2021
1 parent 77dc477 commit 544643c
Showing 8 changed files with 862 additions and 536 deletions.
10 changes: 10 additions & 0 deletions cpp/include/cudf/detail/replace.hpp
@@ -96,5 +96,15 @@ std::unique_ptr<column> find_and_replace_all(
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @copydoc cudf::normalize_nans_and_zeros
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> normalize_nans_and_zeros(
column_view const& input,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace detail
} // namespace cudf
28 changes: 24 additions & 4 deletions cpp/include/cudf/lists/detail/drop_list_duplicates.hpp
@@ -15,25 +15,45 @@
*/
#pragma once

#include <cudf/lists/lists_column_view.hpp>
#include <cudf/lists/drop_list_duplicates.hpp>

#include <rmm/cuda_stream_view.hpp>

namespace cudf {
namespace lists {
namespace detail {
/**
* @copydoc cudf::lists::drop_list_duplicates(lists_column_view const&,
* lists_column_view const&,
* duplicate_keep_option,
* null_equality,
* nan_equality,
* rmm::mr::device_memory_resource*)
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> drop_list_duplicates(
lists_column_view const& keys,
lists_column_view const& values,
duplicate_keep_option keep_option,
null_equality nulls_equal,
nan_equality nans_equal,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @copydoc cudf::lists::drop_list_duplicates
*
* @copydoc cudf::lists::drop_list_duplicates(lists_column_view const&,
* null_equality,
* nan_equality,
* rmm::mr::device_memory_resource*)
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> drop_list_duplicates(
lists_column_view const& lists_column,
lists_column_view const& input,
null_equality nulls_equal,
nan_equality nans_equal,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace detail
} // namespace lists
} // namespace cudf
94 changes: 73 additions & 21 deletions cpp/include/cudf/lists/drop_list_duplicates.hpp
@@ -28,35 +28,87 @@ namespace lists {
*/

/**
* @brief Create a new lists column by extracting unique entries from list elements in the given
* lists column.
*
* Given an input lists column, the list elements in the column are copied to an output lists
* column such that their duplicated entries are dropped out to keep only the unique ones. The
* order of those entries within each list are not guaranteed to be preserved as in the input. In
* the current implementation, entries in the output lists are sorted by ascending order (nulls
* last), but this is not guaranteed in future implementation.
*
* @throw cudf::logic_error if the child column of the input lists column contains nested type other
* than struct.
*
* @param lists_column The input lists column to extract lists with unique entries.
* @param nulls_equal Flag to specify whether null entries should be considered equal.
* @param nans_equal Flag to specify whether NaN entries should be considered as equal value (only
* applicable for floating point data column).
* @brief Copy the elements from the lists in `keys` and associated `values` columns according to
* the unique elements in `keys`.
*
* For each list in `keys` and associated `values`, according to the parameter `keep_option`, copy
* the unique elements from the list in `keys` and their corresponding elements in `values` to new
* lists. Order of the output elements within each list are not guaranteed to be preserved as in the
* input.
*
* Behavior is undefined if `count_elements(keys)[i] != count_elements(values)[i]` for all `i` in
* `[0, keys.size())`.
*
* @throw cudf::logic_error If the child column of the input keys column contains nested type other
* than STRUCT.
* @throw cudf::logic_error If `keys.size() != values.size()`.
*
* @param keys The input keys lists column to check for uniqueness and copy unique elements.
* @param values The values lists column in which the elements are mapped to elements in the key
* column.
* @param nulls_equal Flag to specify whether null key elements should be considered as equal.
* @param nans_equal Flag to specify whether NaN key elements should be considered as equal
* (only applicable for floating point keys elements).
* @param keep_option Flag to specify which elements will be copied from the input to the output.
* @param mr Device resource used to allocate memory.
*
* @code{.pseudo}
* input = { {1, 1, 2, 1, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
* output = { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
* keys = { {1, 1, 2, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
* values = { {"a", "b", "c", "d"}, {"e"}, NULL, {}, {"N0", "N1", "N2", "f", "g", "h", "i", "j"} }
*
* [out_keys, out_values] = drop_list_duplicates(keys, values, duplicate_keep_option::KEEP_FIRST)
* out_keys = { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
* out_values = { {"a", "c", "d"}, {"e"}, NULL, {}, {"f", "g", "N0"} }
*
* [out_keys, out_values] = drop_list_duplicates(keys, values, duplicate_keep_option::KEEP_LAST)
* out_keys = { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
* out_values = { {"b", "c", "d"}, {"e"}, NULL, {}, {"j", "i", "N2"} }
*
* Note that permuting the entries of each list in this output also produces another valid output.
* [out_keys, out_values] = drop_list_duplicates(keys, values, duplicate_keep_option::KEEP_NONE)
* out_keys = { {2, 3}, {4}, NULL, {}, {} }
* out_values = { {"c", "d"}, {"e"}, NULL, {}, {} }
* @endcode
*
* @return A pair of lists columns storing the results from extracting unique key elements and their
* corresponding values elements from the input.
*/
std::pair<std::unique_ptr<column>, std::unique_ptr<column>> drop_list_duplicates(
lists_column_view const& keys,
lists_column_view const& values,
duplicate_keep_option keep_option = duplicate_keep_option::KEEP_FIRST,
null_equality nulls_equal = null_equality::EQUAL,
nan_equality nans_equal = nan_equality::UNEQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Create a new list column by copying elements from the input lists column ignoring
* duplicate list elements.
*
* Given a lists column, an output lists column is generated by copying elements from the input
* lists column in a way such that the duplicate elements in each list are ignored, producing only
* unique list elements.
*
* Order of the output elements are not guaranteed to be preserved as in the input.
*
* @throw cudf::logic_error If the child column of the input lists column contains nested type other
* than STRUCT.
*
* @param input The input lists column to check and copy unique elements.
* @param nulls_equal Flag to specify whether null key elements should be considered as equal.
* @param nans_equal Flag to specify whether NaN key elements should be considered as equal
* (only applicable for floating point keys column).
* @param keep_option Flag to specify which elements will be copied from the input to the output.
* @param mr Device resource used to allocate memory.
*
* @code{.pseudo}
* input = { {1, 1, 2, 3}, {4}, NULL, {}, {NULL, NULL, NULL, 5, 6, 6, 6, 5} }
* drop_list_duplicates(input) = { {1, 2, 3}, {4}, NULL, {}, {5, 6, NULL} }
* @endcode
*
* @return A lists column with list elements having unique entries.
* @return A lists column storing the results from extracting unique list elements from the input.
*/
std::unique_ptr<column> drop_list_duplicates(
lists_column_view const& lists_column,
lists_column_view const& input,
null_equality nulls_equal = null_equality::EQUAL,
nan_equality nans_equal = nan_equality::UNEQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());