[BUG] Groupby.fillna is not raising when a non-categorical value is passed #15666

Closed
galipremsagar opened this issue May 6, 2024 · 0 comments · Fixed by #15683
Labels: bug (Something isn't working), cudf.pandas (Issues specific to cudf.pandas)

@galipremsagar (Contributor)

Describe the bug
When a scalar that is not one of the column's categories is passed to `Groupby.fillna` on a categorical column, cudf silently returns the data unchanged instead of raising. pandas raises a `TypeError` for the same call.

Steps/Code to reproduce bug

In [30]: import cudf

In [31]: s = cudf.Series(['a', 'b', 'c', 'f', 'ew', 'lk'], dtype='category')

In [32]: ps = s.to_pandas()

In [33]: s
Out[33]: 
0     a
1     b
2     c
3     f
4    ew
5    lk
dtype: category
Categories (6, object): ['a', 'b', 'c', 'ew', 'f', 'lk']

In [34]: ps
Out[34]: 
0     a
1     b
2     c
3     f
4    ew
5    lk
dtype: category
Categories (6, object): ['a', 'b', 'c', 'ew', 'f', 'lk']

In [35]: s.groupby(s).fillna(1)
/nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/cudf/core/groupby/groupby.py:2293: FutureWarning: groupby fillna is deprecated and will be removed in a future version. Use groupby ffill or groupby bfill for forward or backward filling instead.
  warnings.warn(
Out[35]: 
0     a
1     b
2     c
3     f
4    ew
5    lk
dtype: category
Categories (6, object): ['a', 'b', 'c', 'ew', 'f', 'lk']

In [36]: ps.groupby(ps).fillna(1)
<ipython-input-36-8ffdaad8b8ae>:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  ps.groupby(ps).fillna(1)
<ipython-input-36-8ffdaad8b8ae>:1: FutureWarning: SeriesGroupBy.fillna is deprecated and will be removed in a future version. Use obj.ffill() or obj.bfill() for forward or backward filling instead. If you want to fill with a single value, use Series.fillna instead
  ps.groupby(ps).fillna(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 ps.groupby(ps).fillna(1)

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/groupby/generic.py:965, in SeriesGroupBy.fillna(self, value, method, axis, inplace, limit, downcast)
    887 """
    888 Fill NA/NaN values using the specified method within groups.
    889 
   (...)
    955 dtype: float64
    956 """
    957 warnings.warn(
    958     f"{type(self).__name__}.fillna is deprecated and "
    959     "will be removed in a future version. Use obj.ffill() or obj.bfill() "
   (...)
    963     stacklevel=find_stack_level(),
    964 )
--> 965 result = self._op_via_apply(
    966     "fillna",
    967     value=value,
    968     method=method,
    969     axis=axis,
    970     inplace=inplace,
    971     limit=limit,
    972     downcast=downcast,
    973 )
    974 return result

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/groupby/groupby.py:1425, in GroupBy._op_via_apply(self, name, *args, **kwargs)
   1422     return self._python_apply_general(curried, self._selected_obj)
   1424 is_transform = name in base.transformation_kernels
-> 1425 result = self._python_apply_general(
   1426     curried,
   1427     self._obj_with_exclusions,
   1428     is_transform=is_transform,
   1429     not_indexed_same=not is_transform,
   1430 )
   1432 if self._grouper.has_dropped_na and is_transform:
   1433     # result will have dropped rows due to nans, fill with null
   1434     # and ensure index is ordered same as the input
   1435     result = self._set_result_index_ordered(result)

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/groupby/groupby.py:1885, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1850 @final
   1851 def _python_apply_general(
   1852     self,
   (...)
   1857     is_agg: bool = False,
   1858 ) -> NDFrameT:
   1859     """
   1860     Apply function f in python space
   1861 
   (...)
   1883         data after applying f
   1884     """
-> 1885     values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
   1886     if not_indexed_same is None:
   1887         not_indexed_same = mutated

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/groupby/ops.py:919, in BaseGrouper.apply_groupwise(self, f, data, axis)
    917 # group might be modified
    918 group_axes = group.axes
--> 919 res = f(group)
    920 if not mutated and not _is_indexed_like(res, group_axes, axis):
    921     mutated = True

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/groupby/groupby.py:1413, in GroupBy._op_via_apply.<locals>.curried(x)
   1412 def curried(x):
-> 1413     return f(x, *args, **kwargs)

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/generic.py:7349, in NDFrame.fillna(self, value, method, axis, inplace, limit, downcast)
   7342     else:
   7343         raise TypeError(
   7344             '"value" parameter must be a scalar, dict '
   7345             "or Series, but you passed a "
   7346             f'"{type(value).__name__}"'
   7347         )
-> 7349     new_data = self._mgr.fillna(
   7350         value=value, limit=limit, inplace=inplace, downcast=downcast
   7351     )
   7353 elif isinstance(value, (dict, ABCSeries)):
   7354     if axis == 1:

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/internals/base.py:186, in DataManager.fillna(self, value, limit, inplace, downcast)
    182 if limit is not None:
    183     # Do this validation even if we go through one of the no-op paths
    184     limit = libalgos.validate_limit(None, limit=limit)
--> 186 return self.apply_with_block(
    187     "fillna",
    188     value=value,
    189     limit=limit,
    190     inplace=inplace,
    191     downcast=downcast,
    192     using_cow=using_copy_on_write(),
    193     already_warned=_AlreadyWarned(),
    194 )

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/internals/managers.py:363, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    361         applied = b.apply(f, **kwargs)
    362     else:
--> 363         applied = getattr(b, f)(**kwargs)
    364     result_blocks = extend_blocks(applied, result_blocks)
    366 out = type(self).from_blocks(result_blocks, self.axes)

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2334, in ExtensionBlock.fillna(self, value, limit, inplace, downcast, using_cow, already_warned)
   2331 except TypeError:
   2332     # 3rd party EA that has not implemented copy keyword yet
   2333     refs = None
-> 2334     new_values = self.values.fillna(value=value, method=None, limit=limit)
   2335     # issue the warning *after* retrying, in case the TypeError
   2336     #  was caused by an invalid fill_value
   2337     warnings.warn(
   2338         # GH#53278
   2339         "ExtensionArray.fillna added a 'copy' keyword in pandas "
   (...)
   2345         stacklevel=find_stack_level(),
   2346     )

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/arrays/_mixins.py:376, in NDArrayBackedExtensionArray.fillna(self, value, method, limit, copy)
    373 else:
    374     # We validate the fill_value even if there is nothing to fill
    375     if value is not None:
--> 376         self._validate_setitem_value(value)
    378     if not copy:
    379         new_values = self[:]

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/arrays/categorical.py:1589, in Categorical._validate_setitem_value(self, value)
   1587     return self._validate_listlike(value)
   1588 else:
-> 1589     return self._validate_scalar(value)

File /nvme/0/pgali/envs/cudfdev/lib/python3.11/site-packages/pandas/core/arrays/categorical.py:1614, in Categorical._validate_scalar(self, fill_value)
   1612     fill_value = self._unbox_scalar(fill_value)
   1613 else:
-> 1614     raise TypeError(
   1615         "Cannot setitem on a Categorical with a new "
   1616         f"category ({fill_value}), set the categories first"
   1617     ) from None
   1618 return fill_value

TypeError: Cannot setitem on a Categorical with a new category (1), set the categories first

Expected behavior
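cudf should raise the same `TypeError` as pandas when the fill value is not one of the existing categories, even if the column contains no null values.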

@galipremsagar galipremsagar added bug Something isn't working cudf.pandas Issues specific to cudf.pandas labels May 6, 2024
@galipremsagar galipremsagar added this to the cudf.pandas API coverage milestone May 6, 2024
@galipremsagar galipremsagar self-assigned this May 6, 2024
rapids-bot bot pushed a commit that referenced this issue May 7, 2024
Fixes: #15666 

This PR validates values passed to `fillna` even if there are no null values in a categorical column.

Forks from #14534

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #15683
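
The behavior described in the commit message (validate the fill value even when nothing needs filling) can be sketched in plain Python. This is only an illustration, not cudf's or pandas' actual implementation: `validate_fill_value` and `fillna_categorical` are hypothetical names, a column is modeled as a list, and `None` stands in for a null. The error text mirrors the pandas message shown in the traceback above.

```python
from typing import Hashable, List, Optional, Sequence


def validate_fill_value(categories: Sequence[Hashable], value: Hashable) -> None:
    """Hypothetical helper: reject a fill value that is not an existing category."""
    if value not in categories:
        raise TypeError(
            "Cannot setitem on a Categorical with a new "
            f"category ({value}), set the categories first"
        )


def fillna_categorical(
    values: List[Optional[Hashable]],
    categories: Sequence[Hashable],
    fill_value: Hashable,
) -> List[Hashable]:
    """Hypothetical stand-in for a categorical fillna; None models a null."""
    # Validate before looking for nulls, so an invalid fill value
    # raises even when there is nothing to fill.
    validate_fill_value(categories, fill_value)
    return [fill_value if v is None else v for v in values]


if __name__ == "__main__":
    cats = ["a", "b", "c", "ew", "f", "lk"]
    vals = ["a", "b", "c", "f", "ew", "lk"]  # no nulls, as in the reproducer
    try:
        fillna_categorical(vals, cats, 1)
    except TypeError as exc:
        print(f"TypeError: {exc}")
```

The point of the ordering is that the check does not depend on the data: an early return on "no nulls present" placed before the validation would reproduce the silent pass-through reported in this issue.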