Preserve order if necessary when deduping categoricals internally #11597

brandon-b-miller · 2022-08-25T15:48:11Z

codecov · 2022-08-25T17:41:33Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@8ad0290).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff               @@
##             branch-22.10   #11597   +/-   ##
===============================================
  Coverage                ?   86.41%           
===============================================
  Files                   ?      145           
  Lines                   ?    22998           
  Branches                ?        0           
===============================================
  Hits                    ?    19874           
  Misses                  ?     3124           
  Partials                ?        0


brandon-b-miller · 2022-08-26T14:45:17Z

rerun tests

shwina · 2022-08-30T14:21:58Z

python/cudf/cudf/core/column/categorical.py

-                .drop_duplicates(ignore_index=True)
-                ._column
-            )
+            new_cats = dedup_preserve_order(cudf.Series(new_cats))._column


Can we have dedup_preserve_order operate on a Column rather than a Series? As much as possible, we want to avoid this pattern of constructing a Series out of a column, invoking a Series operation on it, and then getting the column back out of the result.

Having it operate on a column would have the added benefit that we can then make it a method of CategoricalColumn.

@shwina those are all good points, I'll update as such.

In the end I added this to ColumnBase, since I couldn't see any reason why all columns shouldn't share the capability. Let me know if this should be changed.

shwina

LGTM!
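
For readers following the thread, here is a minimal pure-Python sketch of why first-seen order matters when the categories of several categorical columns are merged. It is not cuDF code: `merge_categories` and `remap_codes` are hypothetical helpers, and plain lists stand in for columns. The assumption being illustrated is that integer codes index into the category list, so a dedup step that keeps first-seen order leaves the first column's codes valid without remapping.

```python
from itertools import chain


def merge_categories(*category_lists):
    """Concatenate category lists and drop repeats, keeping first-seen order.

    Hypothetical helper for illustration only, not part of cuDF.
    """
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(chain.from_iterable(category_lists)))


def remap_codes(codes, old_categories, new_categories):
    """Translate codes that index old_categories into codes for new_categories."""
    position = {cat: i for i, cat in enumerate(new_categories)}
    return [position[old_categories[c]] for c in codes]


# Two toy "categorical columns": a category list plus integer codes.
left_cats, left_codes = ["b", "a"], [0, 1, 0]    # decodes to b, a, b
right_cats, right_codes = ["a", "c"], [1, 0]     # decodes to c, a

merged = merge_categories(left_cats, right_cats)

# left_cats is a prefix of merged, so left_codes can be reused unchanged;
# only the right-hand codes need to be translated.
combined_codes = left_codes + remap_codes(right_codes, right_cats, merged)
decoded = [merged[c] for c in combined_codes]
```

If the dedup step returned the categories in some other order, reusing `left_codes` unchanged would decode to the wrong values, which is the kind of failure described in #11486.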

shwina · 2022-08-30T16:53:22Z

python/cudf/cudf/core/column/column.py

@@ -217,6 +217,16 @@ def dropna(self, drop_nan: bool = False) -> ColumnBase:
        # The drop_nan argument is only used for numerical columns.
        return drop_nulls([self])[0]

+    def _dedup_preserve_order(self):


Should this just be merged with the unique() method? I'm thinking col.unique() returns unique values in arbitrary order, and col.unique(preserve_order=True) does this.

Discussed this offline and came to the conclusion that long term we should implement #11638 to back this.

Can we still collapse this _dedup_preserve_order() method into unique()?
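
As a rough illustration of the API shape proposed in this thread, the sketch below folds the order-preserving path into a single `unique` function behind a `preserve_order` flag. This is a pure-Python stand-in operating on lists, not the cuDF `ColumnBase.unique` implementation; only the flag name is taken from the discussion above, and everything else is assumed.

```python
def unique(values, preserve_order=False):
    """Return the distinct items of values.

    With preserve_order=False the result order is arbitrary.
    With preserve_order=True items appear in order of first occurrence.
    Illustrative stand-in only, not cuDF's implementation.
    """
    if not preserve_order:
        # A set gives no ordering guarantee, mirroring "arbitrary order".
        return list(set(values))

    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


data = ["b", "a", "b", "c", "a"]
ordered = unique(data, preserve_order=True)
unordered = unique(data)
```

Callers that do not care about ordering keep the default, while callers such as the categorical concat path opt in with the flag, so no separate `_dedup_preserve_order()` method is needed.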

brandon-b-miller · 2022-09-02T16:47:34Z

@gpucibot merge

updates, tests

87a90f4

github-actions bot added the Python (Affects Python cuDF API) label Aug 25, 2022

brandon-b-miller mentioned this pull request Aug 25, 2022

[BUG] Handling of categoricals assumes drop_duplicates preserves order (it doesn't) #11486

Closed

brandon-b-miller marked this pull request as ready for review August 26, 2022 14:44

brandon-b-miller requested a review from a team as a code owner August 26, 2022 14:44

brandon-b-miller requested review from mroeschke and isVoid August 26, 2022 14:44

brandon-b-miller added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Aug 26, 2022

shwina reviewed Aug 30, 2022


address reviews

9a78e66

shwina approved these changes Aug 30, 2022


shwina reviewed Aug 30, 2022


shwina self-requested a review August 30, 2022 16:54

add tests for failing concat cases

4ea9d5e

brandon-b-miller mentioned this pull request Sep 1, 2022

[FEA] Expose stable_distinct to python #11638

Closed

brandon-b-miller added 4 commits September 1, 2022 08:30

dedupe and preserve order in CategoricalColumn._concat

04cd189

Merge branch 'branch-22.10' into fix-categorical-ordered-dedup

6270cdb

cleanup

987aa26

use unique(preserve_order=True) instead

6552922

mroeschke approved these changes Sep 1, 2022


minor comment update

0624a3c

rapids-bot bot merged commit 488c7ad into rapidsai:branch-22.10 Sep 2, 2022

wence- mentioned this pull request Mar 14, 2023

Preserve integer dtype of hive-partitioned column containing nulls #12930

Merged

3 tasks
