Add `Series.drop` api #7304

isVoid · 2021-02-04T01:19:58Z

This PR introduces Series.drop API. Series.drop allows users to drop certain elements in the series specified labels or index parameter.

Example:

>>> s = cudf.Series([1, 2, 3], index=['x', 'y', 'z'])
>>> s.drop(labels=['y'])
x    1
z    3
dtype: int64

Add series test case
Move common code path from DataFrame.drop to helper function
Add typing annotation
Add docstring

codecov · 2021-02-04T05:58:14Z

Codecov Report

Merging #7304 (c5af5ca) into branch-0.19 (7871e7a) will increase coverage by 0.51%.
The diff coverage is 91.93%.

@@               Coverage Diff               @@
##           branch-0.19    #7304      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.51%     
===============================================
  Files              101      101              
  Lines            16884    17328     +444     
===============================================
+ Hits             13822    14275     +453     
+ Misses            3062     3053       -9

Impacted Files	Coverage Δ
python/cudf/cudf/core/index.py	`93.34% <ø> (+0.48%)`	⬆️
python/cudf/cudf/core/column/column.py	`87.80% <75.00%> (+0.04%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`94.85% <85.71%> (-0.17%)`	⬇️
python/cudf/cudf/core/frame.py	`89.12% <89.47%> (+0.10%)`	⬆️
python/cudf/cudf/core/column/decimal.py	`93.33% <90.47%> (-1.54%)`	⬇️
python/cudf/cudf/core/dataframe.py	`90.58% <95.00%> (+0.11%)`	⬆️
python/cudf/cudf/core/series.py	`91.57% <96.77%> (+0.78%)`	⬆️
python/cudf/cudf/core/column/string.py	`86.76% <100.00%> (+0.26%)`	⬆️
python/cudf/cudf/io/feather.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/comm/serialize.py	`0.00% <0.00%> (ø)`
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc104ef...a4f9007. Read the comment docs.

python/cudf/cudf/core/dataframe.py

shwina · 2021-02-04T22:24:01Z

python/cudf/cudf/core/dataframe.py

+
+        labels: a list of labels specifying the rows to drop
+        """
+        return self.join(cudf.DataFrame(index=labels), how="leftanti")


Nice - very elegant!

Since the join parameters are fixed here, it might be worth jumping straight to the cython without first creating a dataframe. Something like

idx = Table(index=labels) res = cudf._lib.join.join(self, idx, 'leftanti', None, left_on=[], right_on=[], left_index=True, right_index=True) return self.__class__._from_table(res)

This way you avoid all the python overhead of constructing a dataframe and then a bunch of unecessary checking and typecasting that the merge codepath uses, which isn't necessarily needed in this case. The downside is you might want to throw if the labels passed dont match the dtype of the target series index.

Is there a fast path that could exist in the merge code that we're not currently enabling?

Not sure how useful it would be for users, but after the upcoming join improvements it might be useful as an internal utility for join at least, since join is just a narrow case of merge with left_index=True, right_index=True and no on parameters.

A lot of the merge code is there just to validate all the combinations of arguments a user could pass and fix their dtypes if necessary. We should be able to get around that internally if we know something about what our cython API expects and are sure what we are passing to it is right. It would also help if cudf._lib.join.join had a better "developer facing" API, so the actual call to it from didn't seem so weird.

The counterargument is that to avoid libcudf itself erroring in an invalid case, we have to do a dtype check, to make sure the labels and the index are the same type. It could be argued that we might as well let the full user facing merge code error since it's already built to error in that case. But then, if someone passes something invalid, it's almost just as weird that the error comes from inside join.

A lot of the merge code is there just to validate all the combinations of arguments a user could pass and fix their dtypes if necessary. We should be able to get around that internally...

Won't we still need dtype validation, etc., here? The user-provided inputs to drop() could be of different dtype from the index columns.

It would also help if cudf._lib.join.join had a better "developer facing" API, so the actual call to it from didn't seem so weird.

Agree that we need a better developer-facing join API. FWIW, that won't be coming from Cython though, as the Cython API is going to be very simple (just returning a tuple of gathermaps).

But then, if someone passes something invalid, it's almost just as weird that the error comes from inside join

Maybe we could wrap the call to join here in a try...except and raise a more informative error message?

I'm fine with that, I guess my only question is in general, should we avoid using our own user facing API entrypoints internally in favor of a lower level API, just to avoid all the extra overhead we put in the user facing APIs that are meant to check every edge case of everything they could possibly be doing wrong?

I think we're in 100% agreement that we should use internal APIs that bypass Pandas semantics for efficiency. My only point is that doesn't always have to be calling all the way down to Cython. For example _gather is an internal API at the Python level.

Here specifically, it looks like we need a join API without all the bells and whistles of the top-level merge() method, and we currently don't really have that.

btw in drop there is this line:

if errors == "raise" and not labels.isin(levels_index).all(): raise KeyError("One or more values not found in axis")

which I think guards against the dtype mismatch issue? @brandon-b-miller

Offline sync with @brandon-b-miller : There is room to optimize for join parameter logic checks, but should defer until the join refactor is finished.

For now, can we add a TODO here to optimize later by calling a lower-level join API? Ideally, the TODO should document exactly what logic checks we would like to avoid here.

python/cudf/cudf/core/frame.py

shwina · 2021-02-04T22:38:16Z

Overall looks great - few minor suggestions and one question about implementation for MultiIndex

brandon-b-miller · 2021-02-05T21:16:00Z

Had one question/idea otherwise looks great :)

python/cudf/cudf/core/series.py

isVoid · 2021-02-19T17:55:00Z

rerun tests

python/cudf/cudf/core/frame.py

shwina

LGTM!

galipremsagar · 2021-03-05T17:57:17Z

@gpucibot merge

galipremsagar · 2021-03-05T23:06:23Z

rerun tests

isVoid · 2021-03-08T20:27:35Z

rerun tests

galipremsagar · 2021-03-10T15:48:16Z

rerun tests

@isVoid

Closes rapidsai#7045 This PR introduces `Series.drop` API. `Series.drop` allows users to drop certain elements in the series specified `labels` or `index` parameter. Example: ```python3 >>> s = cudf.Series([1, 2, 3], index=['x', 'y', 'z']) >>> s.drop(labels=['y']) x 1 z 3 dtype: int64 ``` - [x] Add series test case - [x] Move common code path from `DataFrame.drop` to helper function - [x] Add typing annotation - [x] Add docstring Authors: - Michael Wang (@isVoid) Approvers: - Ashwin Srinath (@shwina) - GALI PREM SAGAR (@galipremsagar) URL: rapidsai#7304

isVoid added 3 commits February 3, 2021 13:51

Moving logics into Frame.py

d3be46f

Adding series.drop tests except errors param test

85bf4f8

Add errors test

b301cd8

isVoid added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. feature request New feature or request non-breaking Non-breaking change labels Feb 4, 2021

isVoid added 2 commits February 3, 2021 20:43

Move common code path to helper func

818a8bb

Add typing

bc0f5f2

isVoid changed the title ~~Add Series.drop~~ Add Series.drop api Feb 4, 2021

isVoid added 2 commits February 4, 2021 12:08

Updated docstring

1c38141

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into 7045

a1abaa1

isVoid marked this pull request as ready for review February 4, 2021 20:26

isVoid requested a review from a team as a code owner February 4, 2021 20:26

isVoid requested review from shwina and brandon-b-miller February 4, 2021 20:26

isVoid added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 4, 2021

shwina reviewed Feb 4, 2021

View reviewed changes

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved

shwina reviewed Feb 4, 2021

View reviewed changes

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved

shwina reviewed Feb 4, 2021

View reviewed changes

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved

shwina reviewed Feb 4, 2021

View reviewed changes

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved

isVoid added 4 commits February 5, 2021 11:19

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into 7045

188e11d

Use __future__

fd70f2b

Use DataFrameOrSeries

2765eab

Check levels

5325fcc

isVoid added 2 commits February 5, 2021 13:48

Better doc

a61119f

Initial impl midx improve

e8d494a

isVoid commented Feb 6, 2021

View reviewed changes

python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved

isVoid added 3 commits February 8, 2021 10:16

Use column index to rename

43acf56

Better docstrings

b72ac3e

Merge S/DF code paths

3cb9ede

shwina reviewed Feb 8, 2021

View reviewed changes

python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved

More tests, update TODO

bef984a

isVoid requested a review from shwina February 9, 2021 17:49

Add todo notes

3145f2b

isVoid added 2 commits February 19, 2021 13:05

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into 7045

f4b72d3

Consolidated code logic

1c2e8d0

shwina reviewed Feb 19, 2021

View reviewed changes

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved

removed typed res var

2b893fa

shwina approved these changes Feb 19, 2021

View reviewed changes

shwina added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Feb 19, 2021

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into 7045

a4f9007

galipremsagar approved these changes Mar 5, 2021

View reviewed changes

rapids-bot bot merged commit e628101 into rapidsai:branch-0.19 Mar 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `Series.drop` api #7304

Add `Series.drop` api #7304

isVoid commented Feb 4, 2021 •

edited

Loading

codecov bot commented Feb 4, 2021 •

edited

Loading

shwina Feb 4, 2021

brandon-b-miller Feb 5, 2021

shwina Feb 5, 2021

brandon-b-miller Feb 5, 2021 •

edited

Loading

shwina Feb 8, 2021

brandon-b-miller Feb 8, 2021

shwina Feb 8, 2021

isVoid Feb 8, 2021 •

edited

Loading

isVoid Feb 8, 2021

shwina Feb 12, 2021

shwina commented Feb 4, 2021

brandon-b-miller commented Feb 5, 2021

isVoid commented Feb 19, 2021

shwina left a comment

galipremsagar commented Mar 5, 2021

galipremsagar commented Mar 5, 2021

isVoid commented Mar 8, 2021

galipremsagar commented Mar 10, 2021

Add Series.drop api #7304

Add Series.drop api #7304

Conversation

isVoid commented Feb 4, 2021 • edited Loading

codecov bot commented Feb 4, 2021 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brandon-b-miller Feb 5, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isVoid Feb 8, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shwina commented Feb 4, 2021

brandon-b-miller commented Feb 5, 2021

isVoid commented Feb 19, 2021

shwina left a comment

Choose a reason for hiding this comment

galipremsagar commented Mar 5, 2021

galipremsagar commented Mar 5, 2021

isVoid commented Mar 8, 2021

galipremsagar commented Mar 10, 2021

Add `Series.drop` api #7304

Add `Series.drop` api #7304

isVoid commented Feb 4, 2021 •

edited

Loading

codecov bot commented Feb 4, 2021 •

edited

Loading

brandon-b-miller Feb 5, 2021 •

edited

Loading

isVoid Feb 8, 2021 •

edited

Loading