Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Update numeric_only behavior in reduction APIs #12847

Merged

Conversation

galipremsagar
Copy link
Contributor

Description

  • This PR removes the deprecation of numeric_only=None and defaults to numeric_only=False.
  • Removes level parameter from reduction APIs to match pandas-2.0
  • Change axis defaults to match pandas-2.0 APIs.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer breaking Breaking change labels Feb 24, 2023
@galipremsagar galipremsagar self-assigned this Feb 24, 2023
@galipremsagar galipremsagar requested a review from a team as a code owner February 24, 2023 21:09
@galipremsagar galipremsagar requested review from brandon-b-miller and vyasr and removed request for a team February 24, 2023 21:09
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
Comment on lines 5885 to 5893
try:
result = [
getattr(source._data[col], op)(**kwargs)
for col in source._data.names
]
except AttributeError:
raise TypeError(
f"Not all column dtypes support op {op}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this try-except is still required because of decimal dtypes, but I could be wrong and it may be worth trying it out to see if it is now removable. I'm pretty sure it will fail for numeric dtypes on the statistical moments though (skew/kurt/etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this would still be needed for decimal types when numeric_only=True:

In [1]: import cudf

In [2]: from decimal import Decimal

In [3]: s = cudf.Series([Decimal(1), Decimal(10), Decimal(2)])

In [4]: df = cudf.DataFrame({'a':s})


In [7]: df.skew(numeric_only=True)
> /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py(5856)_reduce()
-> result = [
(Pdb) c
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:5856, in DataFrame._reduce(self, op, axis, numeric_only, **kwargs)
   5855     import pdb;pdb.set_trace()
-> 5856     result = [
   5857         getattr(source._data[col], op)(**kwargs)
   5858         for col in source._data.names
   5859     ]
   5860 except AttributeError:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:5857, in <listcomp>(.0)
   5855     import pdb;pdb.set_trace()
   5856     result = [
-> 5857         getattr(source._data[col], op)(**kwargs)
   5858         for col in source._data.names
   5859     ]
   5860 except AttributeError:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/numerical_base.py:77, in NumericalBaseColumn.skew(self, skipna)
     75     return cudf.utils.dtypes._get_nan_for_dtype(self.dtype)
---> 77 self = self.nans_to_nulls().dropna()  # type: ignore
     79 if len(self) < 3:

AttributeError: 'Decimal128Column' object has no attribute 'nans_to_nulls'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:5878, in DataFrame._reduce(self, op, axis, numeric_only, **kwargs)
   5877 try:
-> 5878     result = [
   5879         getattr(source._data[col], op)(**kwargs)
   5880         for col in source._data.names
   5881     ]
   5882 except AttributeError:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:5879, in <listcomp>(.0)
   5877 try:
   5878     result = [
-> 5879         getattr(source._data[col], op)(**kwargs)
   5880         for col in source._data.names
   5881     ]
   5882 except AttributeError:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/numerical_base.py:77, in NumericalBaseColumn.skew(self, skipna)
     75     return cudf.utils.dtypes._get_nan_for_dtype(self.dtype)
---> 77 self = self.nans_to_nulls().dropna()  # type: ignore
     79 if len(self) < 3:

AttributeError: 'Decimal128Column' object has no attribute 'nans_to_nulls'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 df.skew(numeric_only=True)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/frame.py:2346, in Frame.skew(self, axis, skipna, numeric_only, **kwargs)
   2343 if axis not in (0, "index", None):
   2344     raise NotImplementedError("Only axis=0 is currently supported.")
-> 2346 return self._reduce(
   2347     "skew",
   2348     axis=axis,
   2349     skipna=skipna,
   2350     numeric_only=numeric_only,
   2351     **kwargs,
   2352 )

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:5883, in DataFrame._reduce(self, op, axis, numeric_only, **kwargs)
   5878         result = [
   5879             getattr(source._data[col], op)(**kwargs)
   5880             for col in source._data.names
   5881         ]
   5882     except AttributeError:
-> 5883         raise TypeError(
   5884             f"Not all column dtypes support op {op}"
   5885         )
   5886 elif any(
   5887     not is_numeric_dtype(self._data[name])
   5888     for name in self._data.names
   5889 ):
   5890     raise TypeError(
   5891         "Non numeric columns passed with "
   5892         "`numeric_only=False`, pass `numeric_only=True` "
   5893         f"to perform DataFrame.{op}"
   5894     )

TypeError: Not all column dtypes support op skew

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a NotImplementedError instead of a TypeError? These operations are well-defined for decimal columns, we just haven't implemented them, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, agree. Changed to NotImplementedError 👍

@galipremsagar galipremsagar requested a review from vyasr March 10, 2023 18:49
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor non-blocking question. Thanks!

Comment on lines 5885 to 5893
try:
result = [
getattr(source._data[col], op)(**kwargs)
for col in source._data.names
]
except AttributeError:
raise TypeError(
f"Not all column dtypes support op {op}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a NotImplementedError instead of a TypeError? These operations are well-defined for decimal columns, we just haven't implemented them, right?

@galipremsagar galipremsagar merged commit 58b9acb into rapidsai:pandas_2.0_feature_branch Mar 10, 2023
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Mar 10, 2023
galipremsagar added a commit that referenced this pull request Apr 27, 2023
#12847 introduced support for `numeric_only`, this PR cleans up a `kurt` related pytest that was relying on the old behavior.

This PR fixes 18 pytests :
```
= 413 failed, 88257 passed, 2044 skipped, 932 xfailed, 165 xpassed in 463.03s (0:07:43) =
```

On `pandas_2.0_feature_branch`:
```
= 431 failed, 88221 passed, 2044 skipped, 932 xfailed, 165 xpassed in 456.25s (0:07:36) =
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants