pandas' recommendation on inplace deprecation and categorical column #57104

Open
adrinjalali opened this issue Jan 27, 2024 · 5 comments
Labels: Categorical (Categorical Data Type), inplace (Relating to inplace parameter or equivalent), Needs Discussion (Requires discussion from core team before further action)

@adrinjalali

While working on making scikit-learn's code compatible with pandas 2.2.0, I started from this minimal reproducer:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"].replace(to_replace="a", value="b", inplace=True)

which results in:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"].replace(to_replace="a", value="b", inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 7963, in replace
    warnings.warn(
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

The first pattern doesn't apply here, so from this message, I understand I should do:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].replace(to_replace="a", value="b")

But this also fails with:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].replace(to_replace="a", value="b")
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/generic.py", line 8135, in replace
    new_data = self._mgr.replace(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/base.py", line 249, in replace
    return self.apply_with_block(
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 364, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 854, in replace
    values._replace(to_replace=to_replace, value=value, inplace=True)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2665, in _replace
    warnings.warn(
FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.

With a bit of reading docs, it seems I need to do:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df["col"] = df["col"].cat.rename_categories({"a": "b"})

which fails with:

$ python -Werror::FutureWarning /tmp/4.py
/tmp/4.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Traceback (most recent call last):
  File "/tmp/4.py", line 4, in <module>
    df["col"] = df["col"].cat.rename_categories({"a": "b"})
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/accessor.py", line 112, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 2939, in _delegate_method
    res = method(*args, **kwargs)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 1205, in rename_categories
    cat._set_categories(new_categories)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 924, in _set_categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 221, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 378, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/adrin/miniforge3/envs/sklearn/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 579, in validate_categories
    raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique
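
The failure above follows from how rename_categories works: it relabels categories one-to-one, so mapping "a" onto the already existing "b" would produce two identical categories. A minimal sketch of that contrast (the label "z" is only an illustrative name, not something from the original report):

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")

# Renaming "a" to a label that is not yet a category works:
# the categories become ["z", "b", "c"].
renamed = ser.cat.rename_categories({"a": "z"})
print(list(renamed.cat.categories))

# Renaming "a" to the existing category "b" would create a duplicate
# category, which pandas rejects with a ValueError.
try:
    ser.cat.rename_categories({"a": "b"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

In other words, rename_categories can relabel a category but cannot merge two categories into one.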

So rename_categories is apparently not what I want. Reading through its "See also" section:

reorder_categories: Reorder categories.
add_categories: Add new categories.
remove_categories: Remove the specified categories.
remove_unused_categories: Remove categories which are not used.
set_categories: Set the categories to the specified ones.

None of them seems to do what I need.

So it seems the way to go would be:

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df.loc[df["col"] == "a", "col"] = "b"
df["col"] = df["col"].astype("category").cat.remove_unused_categories()

Which is far from what the warning message suggests.
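
For reuse, the same idea could be wrapped in a small helper. This is only a sketch: merge_category is a hypothetical name, not a pandas API, and it assumes the input Series already has a categorical dtype.

```python
import pandas as pd


def merge_category(ser: pd.Series, old, new) -> pd.Series:
    """Hypothetical helper: fold category `old` into `new` (not a pandas API)."""
    out = ser.copy()
    # Assigning a value is only allowed if it is already a category,
    # so register `new` first when it is missing.
    if new not in out.cat.categories:
        out = out.cat.add_categories([new])
    # Plain boolean-mask assignment on our own copy (no chained assignment).
    out[out == old] = new
    # Drop `old`, which no longer has any values.
    return out.cat.remove_unused_categories()


df = pd.DataFrame({"col": ["a", "b", "c"]}, dtype="category")
df["col"] = merge_category(df["col"], "a", "b")
print(list(df["col"]), list(df["col"].cat.categories))
```

The helper avoids both Series.replace and inplace=True, which are the two calls that triggered the warnings shown earlier.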

So, in the end:

  • Did I arrive at the right conclusion about what the code should look like now?
  • I think the warning message could be more concrete about where users should go.
  • Should there be a method on Series.cat to make this easier?
@phofl
Member

phofl commented Jan 30, 2024

I think making rename_categories accept this might make the most sense. The solution you arrived at is probably the best option at the moment, but it is obviously not great.

cc @jbrockmendel

@rhshadrach added the Categorical, Needs Discussion, and inplace labels Jan 30, 2024
@lesteve
Contributor

lesteve commented Jan 30, 2024

FWIW, I ended up with this (not great either), which I find a bit more readable (though this may depend on the reader 😉):

import pandas as pd

df = pd.DataFrame({'col': ["a", "b", "c"]}, dtype="category")
df['col'] = df['col'].astype(object).replace(to_replace="a", value="b").astype("category")
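
One thing to keep in mind with the object round trip is that astype("category") re-infers the categories from the values, so an ordered flag or unused categories on the original column do not survive. A small sketch, using a made-up ordered column with an extra unused category "d" purely for illustration:

```python
import pandas as pd

# Illustrative input: ordered categories, with "d" declared but unused.
ser = pd.Series(
    pd.Categorical(["a", "b", "c"], categories=["c", "b", "a", "d"], ordered=True)
)

# Round trip through object dtype, as in the snippet above.
result = ser.astype(object).replace(to_replace="a", value="b").astype("category")

print(list(result))                 # values after the replacement
print(list(result.cat.categories))  # categories re-inferred from the values
print(result.cat.ordered)           # the ordered flag is not carried over
```

For a plain unordered column like the one in the reproducer this makes no difference, but it may matter when the category order carries meaning.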

@phofl added this to the 2.2.1 milestone Jan 30, 2024
@jbrockmendel
Member

I think eventually we want users to do obj.replace('a', 'b').cat.remove_unused_categories(). That works now, but the .replace issues a warning. I guess we could update the warning message to suggest this pattern for that particular use case.
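
Spelled out on the reproducer's data, that pattern might look like the sketch below. Because pandas 2.2 emits a FutureWarning from .replace here, the sketch silences FutureWarning locally; that suppression is an addition for illustration, not part of the suggestion itself.

```python
import warnings

import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")

with warnings.catch_warnings():
    # Assumption for this sketch: ignore the FutureWarning that
    # Series.replace raises for categorical data on pandas 2.2.
    warnings.simplefilter("ignore", FutureWarning)
    result = ser.replace("a", "b").cat.remove_unused_categories()

print(list(result), list(result.cat.categories))
```

Silencing the warning is not an option when running with -Werror::FutureWarning on the command line unless the filter is applied in code as above, which is why the wording of the warning matters for downstream projects.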

@adrinjalali
Author

@jbrockmendel your code gives this warning now:

FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.

I'm not sure if you want to remove the warning in this case, or to suggest a different solution?

@lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
RomanB22 added a commit to RomanB22/netpyne-dev that referenced this issue Mar 27, 2024
The inplace=True keyword was removed from newer Pandas versions, thus the inplace option is replaced by re-assigning the variable, following Pandas recs (pandas-dev/pandas#57104)
Changing df['popInd'].cat.set_categories(sim.net.pops.keys(), inplace=True) by df['popInd'] = df['popInd'].cat.set_categories(sim.net.pops.keys()) in analysis/spikes.py
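
The reassignment described in that commit message can be sketched as follows. The population names and the pops dictionary are made up for illustration and only stand in for sim.net.pops:

```python
import pandas as pd

# Hypothetical stand-in for sim.net.pops: population name -> population object.
pops = {"E": object(), "I": object(), "M": object()}

df = pd.DataFrame({"popInd": pd.Categorical(["E", "I", "E"])})

# No inplace keyword: set_categories returns a new Series,
# which is assigned back to the column.
df["popInd"] = df["popInd"].cat.set_categories(list(pops.keys()))

print(list(df["popInd"].cat.categories))
```

Here "M" becomes a declared category even though no row uses it yet.
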
@lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 10, 2024
@stuarteberg
Contributor

stuarteberg commented Sep 13, 2024

Thanks for this thread.

> That works now, but the .replace issues a warning. i guess we could update the warning message to suggest this pattern for that particular use case.

The warning says:

> In a future version, replace will only be used for cases that preserve the categories.

I would have expected a warning only if I were introducing NEW categories. If I'm just consolidating existing categories, there is no need for the dtype to change (thus, the categories can be preserved, even if some are now unused). Why is a warning necessary at all?
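
The point about consolidating existing categories can be illustrated with plain mask assignment, which leaves the dtype untouched. This is a minimal sketch of that observation, not a statement about how Series.replace behaves:

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")
original_dtype = ser.dtype

# "b" is already a category, so assigning it needs no dtype change.
ser[ser == "a"] = "b"

print(list(ser))                    # values after consolidation
print(list(ser.cat.categories))     # "a" is still a category, now unused
print(ser.dtype == original_dtype)  # the categorical dtype is preserved
```

Dropping the now unused "a" remains a separate, explicit step via cat.remove_unused_categories().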

@lithomas1 modified the milestones: 2.2.3, 2.3 Sep 21, 2024