BUG: reorder_categories with inplace=True is not changing the dtype.categories #43232

Closed
2 of 3 tasks
galipremsagar opened this issue Aug 26, 2021 · 16 comments · Fixed by #43597
Labels: Bug, Categorical (Categorical Data Type), Regression (Functionality that used to work in a prior pandas version)

galipremsagar commented Aug 26, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.3.2'

In [3]: sr = pd.Series(['a', 'b', 'c'], dtype='category')

In [4]: sr
Out[4]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [5]: new = sr.cat.reorder_categories(['c', 'b', 'a'])

In [6]: new
Out[6]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): ['c', 'b', 'a']

In [7]: new.dtype.categories
Out[7]: Index(['c', 'b', 'a'], dtype='object')

In [8]: sr.cat.reorder_categories(['c', 'b', 'a'], inplace=True)
pandas/core/arrays/categorical.py:2630: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  res = method(*args, **kwargs)

In [9]: sr
Out[9]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): ['c', 'b', 'a']

In [10]: sr.dtype.categories
Out[10]: Index(['a', 'b', 'c'], dtype='object')

Problem description

When we perform reorder_categories operation as an inplace op, the categories attribute of CategoricalDtype is not updated.

Expected Output

In [7]: sr.dtype.categories
Out[7]: Index(['c', 'b', 'a'], dtype='object')
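A possible workaround, going by In [5] to In [7] above where the non-inplace call already yields the reordered dtype: rebind the result instead of passing inplace=True. This is only a sketch of that pattern; the variable name `reordered_categories` is introduced here for illustration.

```python
import pandas as pd

sr = pd.Series(["a", "b", "c"], dtype="category")

# Rebind the name to the returned Series instead of mutating in place.
# The FutureWarning quoted above says `inplace` is deprecated anyway.
sr = sr.cat.reorder_categories(["c", "b", "a"])

# The dtype now carries the new category order.
reordered_categories = list(sr.dtype.categories)
print(reordered_categories)
```

The values of the Series are unchanged; only the category order on the dtype differs.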

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-27-generic
Version : #29~20.04.1-Ubuntu SMP Wed Aug 11 15:58:17 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 57.4.0
Cython : 0.29.24
pytest : 6.2.4
hypothesis : 6.15.0
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

@galipremsagar galipremsagar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 26, 2021

galipremsagar commented Aug 26, 2021

This seems to be the same for inplace operation with remove_categories as well:

In [1]: import pandas as pd

In [2]: sr = pd.Series(['a', 'b', 'c'], dtype='category')

In [3]: sr
Out[3]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [4]: new = sr.cat.remove_categories(['a'])

In [5]: new
Out[5]: 
0    NaN
1      b
2      c
dtype: category
Categories (2, object): ['b', 'c']

In [6]: new.dtype.categories
Out[6]: Index(['b', 'c'], dtype='object')

In [7]: sr.cat.remove_categories(['a'], inplace=True)
pandas/core/arrays/categorical.py:2630: FutureWarning: The `inplace` parameter in pandas.Categorical.remove_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  res = method(*args, **kwargs)

In [8]: sr
Out[8]: 
0    NaN
1      b
2      c
dtype: category
Categories (2, object): ['b', 'c']

In [9]: sr.dtype.categories
Out[9]: Index(['a', 'b', 'c'], dtype='object')

@PurnashisHazra

take

PurnashisHazra commented Aug 27, 2021

seems to be working as expected in pandas v'1.2.4'

>>> import pandas as pd
>>> sr = pd.Series(['a', 'b', 'c'], dtype='category')
>>> sr
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> new = sr.cat.reorder_categories(['c', 'b', 'a'])
>>> new
0    a
1    b
2    c
dtype: category
Categories (3, object): ['c', 'b', 'a']
>>> new.dtype.categories
Index(['c', 'b', 'a'], dtype='object')

>>> sr.cat.reorder_categories(['c', 'b', 'a'], inplace=True)
>>> sr
0    a
1    b
2    c
dtype: category
Categories (3, object): ['c', 'b', 'a']
>>> sr.dtype.categories
Index(['c', 'b', 'a'], dtype='object')

>>> pd.__version__
'1.2.4'

@simonjayhawkins simonjayhawkins added Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 27, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 27, 2021
@simonjayhawkins
Member

On 1.2.5 and earlier, sr.dtype is CategoricalDtype(categories=['c', 'b', 'a'], ordered=False) after the inplace call.

Maybe it should return an ordered categorical dtype, since CategoricalDtype(categories=['c', 'b', 'a'], ordered=False) compares equal to CategoricalDtype(categories=['a', 'b', 'c'], ordered=False).

will bisect shortly for further context.
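To illustrate the equality point in this comment, here is a small sketch using the public CategoricalDtype constructor. The variable names are made up for the example. Unordered dtypes with the same set of categories compare equal even though their `.categories` order differs, which is why the stale dtype is easy to miss in an `==` check.

```python
from pandas import CategoricalDtype

unordered_cba = CategoricalDtype(categories=["c", "b", "a"], ordered=False)
unordered_abc = CategoricalDtype(categories=["a", "b", "c"], ordered=False)
ordered_cba = CategoricalDtype(categories=["c", "b", "a"], ordered=True)
ordered_abc = CategoricalDtype(categories=["a", "b", "c"], ordered=True)

# Unordered: category order is ignored by ==.
unordered_equal = unordered_cba == unordered_abc

# Ordered: category order is part of the comparison.
ordered_equal = ordered_cba == ordered_abc

# The category order itself still differs in the unordered case.
categories_differ = list(unordered_cba.categories) != list(unordered_abc.categories)

print(unordered_equal, ordered_equal, categories_differ)
```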

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 27, 2021
@simonjayhawkins
Member

will bisect shortly for further context.

first bad commit: [c68b605] PERF: cache_readonly for Block properties (#40620)

cc @jbrockmendel

@PurnashisHazra

will bisect shortly for further context.

first bad commit: [c68b605] PERF: cache_readonly for Block properties (#40620)

cc @jbrockmendel

could you lead me to the files that I need to look at? blocks.py? any others?

@PurnashisHazra

changing and reverting blocks.py still gives the same error

@jbrockmendel
Member

Also looks like the warning message was copy/pasted from remove_unused_categories and needs to be updated

@jbrockmendel
Member

changing and reverting blocks.py still gives the same error

When I change Block.dtype from cache_readonly to property, that seems to fix it

>>> sr = pd.Series(['a', 'b', 'c'], dtype='category')
>>> sr.dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

>>> sr.cat.reorder_categories(['c', 'b', 'a'], inplace=True)
>>> sr.dtype
CategoricalDtype(categories=['c', 'b', 'a'], ordered=False)

Are you seeing something different?

Note: if you want to make a PR to do this, only override it on CategoricalBlock instead of on Block.
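For readers following along, here is a toy sketch of why a cached attribute can go stale after an in-place change while a plain property does not. It does not use pandas internals: `Values`, `CachedBlock`, `LiveBlock` and `mutate_in_place` are hypothetical names, and `functools.cached_property` stands in for pandas' `cache_readonly` on the assumption that both compute once and then reuse the stored value.

```python
from functools import cached_property


class Values:
    """Hypothetical stand-in for an array whose dtype is swapped in place."""

    def __init__(self, dtype):
        self.dtype = dtype


class CachedBlock:
    """Computes dtype on first access, then reuses the stored result."""

    def __init__(self, values):
        self.values = values

    @cached_property
    def dtype(self):
        return self.values.dtype


class LiveBlock:
    """Looks dtype up on the underlying values on every access."""

    def __init__(self, values):
        self.values = values

    @property
    def dtype(self):
        return self.values.dtype


def mutate_in_place(block, new_dtype):
    before = block.dtype             # first access; CachedBlock stores it
    block.values.dtype = new_dtype   # in-place change on the wrapped values
    return before, block.dtype


cached_result = mutate_in_place(CachedBlock(Values("abc")), "cba")
live_result = mutate_in_place(LiveBlock(Values("abc")), "cba")
print(cached_result)  # ('abc', 'abc'): the cached value is stale
print(live_result)    # ('abc', 'cba'): the property sees the change
```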

@simonjayhawkins
Member

@jbrockmendel xref #43334 (comment)

Is that what you are suggesting? I'm not an advocate of 11th-hour changes, as the longer things sit in master, the more chance of catching any problems. I'll move this to 1.3.4.

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@jbrockmendel
Member

Is that what you are suggesting?

Yes.

@debnathshoham
Member

Hi @LobRockyl - just checking if you're still working on this..

@PurnashisHazra

Hi @LobRockyl - just checking if you're still working on this..

Yeah was a lil caught up. Will raise a PR asap

@PurnashisHazra

changing and reverting blocks.py still gives the same error

When I change Block.dtype from cache_readonly to property, that seems to fix it

>>> sr = pd.Series(['a', 'b', 'c'], dtype='category')
>>> sr.dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

>>> sr.cat.reorder_categories(['c', 'b', 'a'], inplace=True)
>>> sr.dtype
CategoricalDtype(categories=['c', 'b', 'a'], ordered=False)

Are you seeing something different?

Note: if you want to make a PR to do this, only override it on CategoricalBlock instead of on Block.

Hey changing blocks does fix it but didn't get what u meant by "override it on CategoricalBlock instead of on Block."

@jbrockmendel
Member

Hey changing blocks does fix it but didn't get what u meant by "override it on CategoricalBlock instead of on Block."

Instead of changing Block.dtype, define CategoricalBlock.dtype
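A rough illustration of what "define it on the subclass" could look like, using toy classes rather than the real pandas ones. The class names mirror the ones mentioned in this thread, but the bodies are hypothetical, and `functools.cached_property` is assumed as a stand-in for `cache_readonly`. The base class keeps its cached attribute, and only the categorical subclass overrides it with a plain property.

```python
from functools import cached_property
from types import SimpleNamespace


class Block:
    """Toy base class: dtype is computed once and cached."""

    def __init__(self, values):
        self.values = values

    @cached_property
    def dtype(self):
        return self.values.dtype


class CategoricalBlock(Block):
    """Toy subclass: dtype is re-read on every access."""

    @property
    def dtype(self):
        return self.values.dtype


generic = Block(SimpleNamespace(dtype="abc"))
categorical = CategoricalBlock(SimpleNamespace(dtype="abc"))

# Touch dtype once, then change the underlying values in place.
generic.dtype
categorical.dtype
generic.values.dtype = "cba"
categorical.values.dtype = "cba"

generic_dtype = generic.dtype          # still the cached value
categorical_dtype = categorical.dtype  # reflects the in-place change
print(generic_dtype, categorical_dtype)
```

Keeping the override on the subclass means other block types keep the caching that the bisected commit introduced for performance.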

@PurnashisHazra

Hey changing blocks does fix it but didn't get what u meant by "override it on CategoricalBlock instead of on Block."

Instead of changing Block.dtype, define CategoricalBlock.dtype

So basically making the changes in categorical.py right? There is no dtype class in categorical.py so should I have to include it and change it?
