Commit

Add performance hints into doc
PointKernel committed Mar 12, 2022
1 parent 56f0a56 commit 2c3c02a
Showing 1 changed file with 16 additions and 4 deletions.
20 changes: 16 additions & 4 deletions cpp/include/cudf/stream_compaction.hpp
@@ -216,16 +216,22 @@ enum class duplicate_keep_option {
 /**
  * @brief Create a new table with consecutive duplicate rows removed.
  *
- * A row is distinct if there are no equivalent rows in the table. A row is unique if there is no
- * adjacent equivalent row. That is, keeping distinct rows removes all duplicates in the
- * table/column, while keeping unique rows only removes duplicates from consecutive groupings.
- *
  * Given an `input` table_view, one specific row from a group of equivalent elements is copied to
  * output table depending on the value of @p keep:
  * - KEEP_FIRST: only the first of a sequence of duplicate rows is copied
  * - KEEP_LAST: only the last of a sequence of duplicate rows is copied
  * - KEEP_NONE: no duplicate rows are copied
  *
+ * A row is distinct if there are no equivalent rows in the table. A row is unique if there is no
+ * adjacent equivalent row. That is, keeping distinct rows removes all duplicates in the
+ * table/column, while keeping unique rows only removes duplicates from consecutive groupings.
+ *
+ * Performance hints:
+ * - Always use `cudf::unique` instead of `cudf::distinct` if the input is pre-sorted
+ * - If the input is not pre-sorted and the behavior of pandas.DataFrame.drop_duplicates is desired:
+ *   - If `keep` is not relevant, use `cudf::distinct`
+ *   - If `keep` control is required, stable sort the input then `cudf::unique`
+ *
  * @throws cudf::logic_error if the `keys` column indices are out of bounds in the `input` table.
  *
  * @param[in] input input table_view to copy only unique rows
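
The `keep` semantics documented in this hunk can be illustrated with a small host-side analogue. This is a minimal sketch in standard C++, not libcudf code: `keep_option` and `unique_consecutive` are hypothetical names, a single `int` key column stands in for the table, and the function returns the indices of the rows that would be kept rather than a new table.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the documented KEEP_FIRST / KEEP_LAST / KEEP_NONE options.
enum class keep_option { first, last, none };

// Return the indices of the rows that survive when only *consecutive*
// duplicates are removed from a single key column.
std::vector<std::size_t> unique_consecutive(std::vector<int> const& keys, keep_option keep)
{
  std::vector<std::size_t> kept;
  std::size_t i = 0;
  while (i < keys.size()) {
    // [i, j) is the current run of adjacent equal keys.
    std::size_t j = i + 1;
    while (j < keys.size() && keys[j] == keys[i]) { ++j; }

    if (keep == keep_option::first) {
      kept.push_back(i);      // first row of the run
    } else if (keep == keep_option::last) {
      kept.push_back(j - 1);  // last row of the run
    } else if (j - i == 1) {
      kept.push_back(i);      // none: only rows with no adjacent duplicate survive
    }
    i = j;
  }
  return kept;
}
```

For the keys `{1, 1, 2, 3, 3, 1}`, the sketch keeps the trailing `1` under every option, because it is not adjacent to the earlier `1`s. That is the difference between "unique" and "distinct" that the comment describes.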
@@ -254,6 +260,12 @@ std::unique_ptr<table> unique(
  *
  * The order of elements in the output table is not specified.
  *
+ * Performance hints:
+ * - Always use `cudf::unique` instead of `cudf::distinct` if the input is pre-sorted
+ * - If the input is not pre-sorted and the behavior of pandas.DataFrame.drop_duplicates is desired:
+ *   - If `keep` is not relevant, use `cudf::distinct`
+ *   - If `keep` control is required, stable sort the input then `cudf::unique`
+ *
  * @param[in] input input table_view to copy only distinct rows
  * @param[in] keys vector of indices representing key columns from `input`
  * @param[in] nulls_equal flag to denote nulls are equal if null_equality::EQUAL, nulls are not
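
The two strategies named in the performance hints can be compared with a host-side analogue. The sketch below uses only the C++ standard library and is not libcudf code: `distinct_indices` and `sort_then_unique` are hypothetical helpers over a single `int` key column, and the hash-based version happens to keep the first occurrence, whereas the documentation above leaves the output order of `cudf::distinct` unspecified.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <unordered_set>
#include <vector>

// Hash-based analogue of "distinct": one row index per key, no sorting and
// no control over which duplicate is kept beyond what this loop happens to do.
std::vector<std::size_t> distinct_indices(std::vector<int> const& keys)
{
  std::unordered_set<int> seen;
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (seen.insert(keys[i]).second) { kept.push_back(i); }
  }
  return kept;
}

// Analogue of "stable sort the input, then unique": the stable sort keeps
// equal keys in their original relative order, so the first (or last) row
// of each run is the first (or last) occurrence in the unsorted input.
std::vector<std::size_t> sort_then_unique(std::vector<int> const& keys, bool keep_last)
{
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  std::vector<std::size_t> kept;
  std::size_t i = 0;
  while (i < order.size()) {
    // [i, j) is a run of row indices that share the same key.
    std::size_t j = i + 1;
    while (j < order.size() && keys[order[j]] == keys[order[i]]) { ++j; }
    kept.push_back(keep_last ? order[j - 1] : order[i]);
    i = j;
  }
  return kept;
}
```

Note that the sort-based result comes back in key order rather than input order. If the original row order matters, as it does for pandas.DataFrame.drop_duplicates, the kept indices would need to be sorted again afterwards; that extra step is an assumption of this sketch and is not stated in the hints.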