Implement Python drop_duplicates with cudf::stable_distinct. #11656

Merged
Changes from 1 commit
35 commits
84f70da
expose stable_distinct
brandon-b-miller Sep 6, 2022
a550a2e
Merge branch 'branch-22.10' into fea-expose-stable-distinct
brandon-b-miller Sep 26, 2022
3271fb4
Merge branch 'branch-22.12' into fea-expose-stable-distinct
brandon-b-miller Oct 26, 2022
4932c2c
update for api changes
brandon-b-miller Oct 26, 2022
4871363
Merge branch 'branch-22.12' into fea-expose-stable-distinct
brandon-b-miller Nov 7, 2022
0fc9713
Merge branch 'branch-23.04' into fea-expose-stable-distinct
bdice Apr 17, 2023
15965ed
Merge branch 'branch-23.06' into fea-expose-stable-distinct
bdice Apr 17, 2023
b695de2
Style compliance.
bdice Apr 17, 2023
67ce906
Merge branch 'branch-23.06' into fea-expose-stable-distinct
brandon-b-miller May 12, 2023
bdd14a5
refactor cython, resolve python issues
brandon-b-miller May 12, 2023
cf2361e
Intermediate work.
bdice May 16, 2023
68fee3b
Add note about KEEP_ANY.
bdice May 18, 2023
a798266
Revise inputs in distinct tests.
bdice May 18, 2023
217c006
Make stable distinct tests pass.
bdice May 18, 2023
f7ca10e
Use key indices that are in-bounds.
bdice May 19, 2023
a765f99
Use stable ordering for KEEP_ANY.
bdice May 19, 2023
16fd32d
Apply clang-format.
bdice May 19, 2023
4514303
Use apply_boolean_mask to reuse existing kernels rather than compilin…
bdice May 19, 2023
98535b1
Merge branch 'fea-expose-stable-distinct-bdice' into fea-expose-stabl…
bdice May 19, 2023
45f497f
Clean up docs.
bdice May 19, 2023
403b24d
Fix drop_duplicates implementation and always use stable_distinct.
bdice May 19, 2023
5452c91
Merge remote-tracking branch 'upstream/branch-23.06' into fea-expose-…
bdice May 19, 2023
245783a
Clean up distinct / keep Cython.
bdice May 19, 2023
d3b6f1f
Remove preserve_order.
bdice May 19, 2023
71a5908
Fix CMake.
bdice May 19, 2023
668b1e5
Deprecate preserve_order.
bdice May 19, 2023
75028b9
Fix some tests that require sorting (no longer implicit in drop_dupli…
bdice May 19, 2023
a437c17
Fix bugs where sorting is required.
bdice May 19, 2023
f0d3e70
Merge branch 'branch-23.06' into fea-expose-stable-distinct
bdice May 22, 2023
2163c26
Merge branch 'branch-23.06' into fea-expose-stable-distinct
bdice May 22, 2023
fcaa8f4
Merge branch 'fea-expose-stable-distinct' of github.com:brandon-b-mil…
bdice May 22, 2023
ce6bf14
Renumber ngroups to accommodate sorting order of <NA>.
bdice May 22, 2023
99d506c
Sort in np.unique dispatch.
bdice May 22, 2023
7f3d0e5
Improve onehot testing.
bdice May 22, 2023
c3a3bf7
Small changes from self-review.
bdice May 22, 2023
19 changes: 2 additions & 17 deletions cpp/include/cudf/detail/stream_compaction.hpp
@@ -91,24 +91,9 @@ std::unique_ptr<table> distinct(
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

 /**
- * @brief Create a new table without duplicate rows.
+ * @copydoc cudf::stable_distinct
  *
- * Given an `input` table_view, each row is copied to the output table to create a set of distinct
- * rows. The row order is guaranteed to be preserved as in the input.
- *
- * If there are duplicate rows, which row to be copied depends on the specified value of the `keep`
- * parameter.
- *
- * This API produces exactly the same set of output rows as `cudf::distinct`.
- *
- * @param input The input table
- * @param keys Vector of indices indicating key columns in the `input` table
- * @param keep Copy any, first, last, or none of the found duplicates
- * @param nulls_equal Flag to specify whether null elements should be considered as equal
- * @param nans_equal Flag to specify whether NaN elements should be considered as equal
- * @param stream CUDA stream used for device memory operations and kernel launches
- * @param mr Device memory resource used to allocate the returned table
- * @return A table containing the resulting distinct rows
+ * @param[in] stream CUDA stream used for device memory operations and kernel launches.
  */
 std::unique_ptr<table> stable_distinct(
   table_view const& input,
27 changes: 27 additions & 0 deletions cpp/include/cudf/stream_compaction.hpp
@@ -280,6 +280,33 @@ std::unique_ptr<table> distinct(
   nan_equality nans_equal = nan_equality::ALL_EQUAL,
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

+/**
+ * @brief Create a new table without duplicate rows.
+ *
+ * Given an `input` table_view, each row is copied to the output table to create a set of distinct
+ * rows. The row order is guaranteed to be preserved as in the input.
+ *
+ * If there are duplicate rows, which row to be copied depends on the specified value of the `keep`
+ * parameter.
+ *
+ * This API produces exactly the same set of output rows as `cudf::distinct`.
+ *
+ * @param input The input table
+ * @param keys Vector of indices indicating key columns in the `input` table
+ * @param keep Copy any, first, last, or none of the found duplicates
+ * @param nulls_equal Flag to specify whether null elements should be considered as equal
+ * @param nans_equal Flag to specify whether NaN elements should be considered as equal
+ * @param mr Device memory resource used to allocate the returned table
+ * @return A table containing the resulting distinct rows
+ */
+std::unique_ptr<table> stable_distinct(
+  table_view const& input,
+  std::vector<size_type> const& keys,
+  duplicate_keep_option keep = duplicate_keep_option::KEEP_ANY,
+  null_equality nulls_equal = null_equality::EQUAL,
+  nan_equality nans_equal = nan_equality::ALL_EQUAL,
+  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
+
 /**
  * @brief Count the number of consecutive groups of equivalent rows in a column.
  *
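The contract documented above (distinct rows, input order preserved, `keep` selecting which duplicate survives) can be modeled on the host in a few lines. This is a minimal pure-Python sketch of those semantics, not the libcudf implementation; the name `stable_distinct_model` and the string values accepted for `keep` are hypothetical stand-ins for `duplicate_keep_option`.

```python
from collections import defaultdict


def stable_distinct_model(rows, keys, keep="any"):
    """Host-side model: drop rows with duplicate keys, keeping input order.

    rows: list of tuples; keys: column indices that define row equality;
    keep: "any", "first", "last", or "none" (hypothetical names mirroring
    duplicate_keep_option).
    """
    # Group row positions by their key tuple.
    positions = defaultdict(list)
    for i, row in enumerate(rows):
        positions[tuple(row[k] for k in keys)].append(i)

    kept = set()
    for idxs in positions.values():
        if keep in ("any", "first"):
            # "any" may pick any one duplicate; this model picks the first.
            kept.add(idxs[0])
        elif keep == "last":
            kept.add(idxs[-1])
        elif keep == "none":
            # Only rows whose key occurs exactly once survive.
            if len(idxs) == 1:
                kept.add(idxs[0])
        else:
            raise ValueError(f"unknown keep option: {keep!r}")

    # Emit surviving rows in their original relative order.
    return [row for i, row in enumerate(rows) if i in kept]


rows = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
print(stable_distinct_model(rows, keys=[0], keep="first"))
print(stable_distinct_model(rows, keys=[0], keep="last"))
print(stable_distinct_model(rows, keys=[0], keep="none"))
```

Note that with `keep="last"` the surviving rows still appear in input order, so the result is not simply the reverse of the `keep="first"` result.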
12 changes: 12 additions & 0 deletions cpp/src/stream_compaction/distinct.cu
@@ -162,4 +162,16 @@ std::unique_ptr<table> distinct(table_view const& input,
     input, keys, keep, nulls_equal, nans_equal, cudf::default_stream_value, mr);
 }

+std::unique_ptr<table> stable_distinct(table_view const& input,
+                                       std::vector<size_type> const& keys,
+                                       duplicate_keep_option keep,
+                                       null_equality nulls_equal,
+                                       nan_equality nans_equal,
+                                       rmm::mr::device_memory_resource* mr)
+{
+  CUDF_FUNC_RANGE();
+  return detail::stable_distinct(
+    input, keys, keep, nulls_equal, nans_equal, cudf::default_stream_value, mr);
+}
+
 }  // namespace cudf
7 changes: 7 additions & 0 deletions python/cudf/cudf/_lib/cpp/stream_compaction.pxd
@@ -43,3 +43,10 @@ cdef extern from "cudf/stream_compaction.hpp" namespace "cudf" \
         column_view source_table,
         null_policy null_handling,
         nan_policy nan_handling) except +
+
+    cdef unique_ptr[table] stable_distinct(
+        table_view input,
+        vector[size_type] keys,
+        duplicate_keep_option keep,
+        null_equality nulls_equal,
+    ) except +
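The `nulls_equal` argument in the declaration above decides whether two nulls count as the same key. As a rough host-side illustration, assuming `None` stands in for a null and using the hypothetical name `distinct_with_nulls`, the two settings could behave as follows. This is a sketch of the intended semantics, not code from libcudf.

```python
def distinct_with_nulls(values, nulls_equal=True):
    """Keep the first occurrence of each value, in input order.

    None models a null. With nulls_equal=False every null is treated as
    different from every other null, so none of them is dropped.
    """
    seen = set()
    out = []
    for v in values:
        if v is None and not nulls_equal:
            # UNEQUAL: each null is its own distinct value.
            out.append(v)
            continue
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


values = [1, None, 1, None, 2]
print(distinct_with_nulls(values, nulls_equal=True))
print(distinct_with_nulls(values, nulls_equal=False))
```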
61 changes: 61 additions & 0 deletions python/cudf/cudf/_lib/stream_compaction.pyx
@@ -15,6 +15,7 @@ from cudf._lib.cpp.stream_compaction cimport (
     distinct_count as cpp_distinct_count,
     drop_nulls as cpp_drop_nulls,
     duplicate_keep_option,
+    stable_distinct as cpp_stable_distinct,
     unique as cpp_unique,
 )
 from cudf._lib.cpp.table.table cimport table
@@ -107,6 +108,66 @@ def apply_boolean_mask(list columns, Column boolean_mask):
     return columns_from_unique_ptr(move(c_result))


+def stable_distinct(list columns,
+                    object keys,
+                    object keep='first',
+                    bool nulls_are_equal=True):
+
+    """
+    Drop duplicate rows in provided columns. Retain original order.
+
+    Parameters
+    ----------
+    columns : List of columns
+    keys : List of column indices. If set, then these columns are checked for
+        duplicates rather than all of columns (optional)
+    keep : keep 'first' or 'last' or none of the duplicate rows
+    nulls_are_equal : if True, nulls are treated equal else not.
+
+    Returns
+    -------
+    List of columns with duplicates dropped
+
+    """
+    cdef vector[size_type] cpp_keys = (
+        keys if keys is not None else range(len(columns))
+    )
+    cdef duplicate_keep_option cpp_keep_option
+
+    if keep == 'first':
+        cpp_keep_option = duplicate_keep_option.KEEP_FIRST
+    elif keep == 'last':
+        cpp_keep_option = duplicate_keep_option.KEEP_LAST
+    elif keep is False:
+        cpp_keep_option = duplicate_keep_option.KEEP_NONE
+    else:
+        raise ValueError('keep must be either "first", "last" or False')
+
+    # shifting the index number by number of index columns
+    cdef null_equality cpp_nulls_equal = (
+        null_equality.EQUAL
+        if nulls_are_equal
+        else null_equality.UNEQUAL
+    )
+
+    cdef table_view source_table_view = table_view_from_columns(columns)
+    cdef table_view keys_view = source_table_view.select(cpp_keys)
+
+    cdef unique_ptr[table] c_result
+
+    with nogil:
+        c_result = move(
+            cpp_stable_distinct(
+                source_table_view,
+                cpp_keys,
+                cpp_keep_option,
+                cpp_nulls_equal
+            )
+        )
+
+    return columns_from_unique_ptr(move(c_result))
+
+
 def drop_duplicates(list columns,
                     object keys=None,
                     object keep='first',
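The `keep` handling in the Cython wrapper above maps a pandas-style argument (`'first'`, `'last'`, or `False`) onto the C++ enum. A standalone sketch of that mapping, using a hypothetical `DuplicateKeepOption` enum and `keep_to_option` helper in place of the real Cython types, might look like this:

```python
from enum import Enum, auto


class DuplicateKeepOption(Enum):
    """Hypothetical stand-in for libcudf's duplicate_keep_option."""
    KEEP_ANY = auto()
    KEEP_FIRST = auto()
    KEEP_LAST = auto()
    KEEP_NONE = auto()


def keep_to_option(keep):
    """Translate a pandas-style `keep` argument into an enum value."""
    if keep == "first":
        return DuplicateKeepOption.KEEP_FIRST
    if keep == "last":
        return DuplicateKeepOption.KEEP_LAST
    # Identity check: only the literal False selects KEEP_NONE, not 0 or None.
    if keep is False:
        return DuplicateKeepOption.KEEP_NONE
    raise ValueError('keep must be either "first", "last" or False')


print(keep_to_option("first"))
print(keep_to_option(False))
```

An `if` chain with `keep is False` is used here rather than a dictionary lookup because `0 == False` in Python, so a dictionary keyed on `False` would silently accept `keep=0`.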
12 changes: 10 additions & 2 deletions python/cudf/cudf/core/dataframe.py
@@ -2946,7 +2946,12 @@ def diff(self, periods=1, axis=0):

     @_cudf_nvtx_annotate
     def drop_duplicates(
-        self, subset=None, keep="first", inplace=False, ignore_index=False
+        self,
+        subset=None,
+        keep="first",
+        inplace=False,
+        ignore_index=False,
+        preserve_order=False,
     ):
         """
         Return DataFrame with duplicate rows removed, optionally only
@@ -3017,7 +3022,10 @@ def drop_duplicates(
         1 Yum Yum cup 4.0
         """ # noqa: E501
         outdf = super().drop_duplicates(
-            subset=subset, keep=keep, ignore_index=ignore_index
+            subset=subset,
+            keep=keep,
+            ignore_index=ignore_index,
+            preserve_order=preserve_order,
         )

         return self._mimic_inplace(outdf, inplace=inplace)
39 changes: 27 additions & 12 deletions python/cudf/cudf/core/indexed_frame.py
@@ -1536,6 +1536,7 @@ def drop_duplicates(
         keep="first",
         nulls_are_equal=True,
         ignore_index=False,
+        preserve_order=False,
     ):
         """
         Drop duplicate rows in frame.
@@ -1569,18 +1570,32 @@ def drop_duplicates(
         keys = self._positions_from_column_names(
             subset, offset_by_index_columns=not ignore_index
         )
-        return self._from_columns_like_self(
-            libcudf.stream_compaction.drop_duplicates(
-                list(self._columns)
-                if ignore_index
-                else list(self._index._columns + self._columns),
-                keys=keys,
-                keep=keep,
-                nulls_are_equal=nulls_are_equal,
-            ),
-            self._column_names,
-            self._index.names if not ignore_index else None,
-        )
+        if preserve_order:
+            return self._from_columns_like_self(
+                libcudf.stream_compaction.stable_distinct(
+                    list(self._columns)
+                    if ignore_index
+                    else list(self._index._columns + self._columns),
+                    keys=keys,
+                    keep=keep,
+                    nulls_are_equal=nulls_are_equal,
+                ),
+                self._column_names,
+                self._index.names if not ignore_index else None,
+            )
+        else:
+            return self._from_columns_like_self(
+                libcudf.stream_compaction.drop_duplicates(
+                    list(self._columns)
+                    if ignore_index
+                    else list(self._index._columns + self._columns),
+                    keys=keys,
+                    keep=keep,
+                    nulls_are_equal=nulls_are_equal,
+                ),
+                self._column_names,
+                self._index.names if not ignore_index else None,
+            )

     @_cudf_nvtx_annotate
     def _empty_like(self, keep_index=True):
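Both branches above pass the index columns ahead of the data columns unless `ignore_index` is set, which is presumably why the key positions are computed with `offset_by_index_columns=not ignore_index`. The sketch below illustrates that offset with plain lists. The helper `key_positions` is hypothetical and only approximates what `_positions_from_column_names` appears to do in this diff.

```python
def key_positions(column_names, subset, num_index_columns=0):
    """Map column names in `subset` to positions in the combined column list.

    Hypothetical stand-in for the frame helper used in the diff: when index
    columns are prepended to the data columns, every data-column position
    shifts right by the number of index columns.
    """
    return [
        pos + num_index_columns
        for pos, name in enumerate(column_names)
        if name in subset
    ]


column_names = ["a", "b", "c"]
index_columns = [["x", "y", "z"]]  # one index level
data_columns = [[1, 1, 2], [5, 5, 6], [7, 8, 9]]

# ignore_index=True: only data columns are passed, so no offset is applied.
keys_no_index = key_positions(column_names, ["b"])

# ignore_index=False: index columns come first, so keys shift by their count.
keys_with_index = key_positions(
    column_names, ["b"], num_index_columns=len(index_columns)
)

combined = index_columns + data_columns
print(keys_no_index, data_columns[keys_no_index[0]])
print(keys_with_index, combined[keys_with_index[0]])
```

In both cases the key position resolves to the same underlying column `b`; only its offset within the list handed to the compaction routine differs.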