BUG: DataFrame.where with category dtype #16979

rhaps0dy · 2017-07-16T12:36:47Z

Code Sample (it is copy-pastable)

import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(2*3).reshape(2,3), columns=list('abc'))
mask = np.random.rand(*df.shape) < 0.5
df.where(mask)
# Output is correct:
#      a   b    c
# 0  NaN NaN  2.0
# 1  3.0 NaN  NaN

df.a = df.a.astype('category')
df.b = df.b.astype('category')
df.c = df.c.astype('category')
df.where(mask)
# ValueError: Wrong number of items passed 2, placement implies 1
# Expected output: the same as before, but now with dtype `category`.

df.a.where(mask[:,0])
# 0    NaN
# 1    3.0
# Name: a, dtype: float64
# should stay in dtype category

df.a.where(mask[:,0], other=None)
# 0    None
# 1    3
# Name: a, dtype: object
# Expected output: should stay in dtype category

Problem description

df.where should work with all dtypes; the documentation doesn't say it works only for some of them. Also, NaNs are already handled correctly as missing data in a pd.Series of dtype 'category', so one should be able to assign NaNs to them. The same applies to converting the dtype.
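
As a small illustration of the point about missing data, the sketch below assigns NaN to a categorical Series and inspects how it is stored. The variable name `s` is made up for this example, and the behaviour shown is that of recent pandas releases, not necessarily 0.20.2.

```python
import numpy as np
import pandas as pd

# A categorical Series with two categories, 0 and 3.
s = pd.Series([0, 3], dtype="category")

# NaN is not a category; it is stored as the sentinel code -1,
# so assigning it does not require changing the dtype.
s.iloc[0] = np.nan

print(s.dtype)           # still a categorical dtype
print(s.cat.categories)  # the categories themselves are unchanged
print(s.cat.codes)       # the missing entry is encoded as -1
```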

While writing this report I found that doing it column-by-column works correctly, so I'll use that as a workaround.
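
The column-by-column workaround might look roughly like the sketch below. The helper name `where_by_column` and the fixed mask are assumptions made for illustration; the report does not show the exact code used. Note that on pandas 0.20.2 the per-column call still lost the category dtype (see the Series examples above), so this only avoids the ValueError there.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("abc"))
for col in df.columns:
    df[col] = df[col].astype("category")

# Fixed mask instead of a random one, so the result is reproducible.
mask = np.array([[True, False, True], [False, True, True]])


def where_by_column(frame, mask):
    """Hypothetical helper: apply Series.where to each column separately."""
    return pd.DataFrame(
        {col: frame[col].where(mask[:, i]) for i, col in enumerate(frame.columns)},
        index=frame.index,
    )


result = where_by_column(df, mask)
print(result)
```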

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-81-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Ubuntu `lsb_release -a`:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

jreback · 2017-07-16T15:26:34Z

Can you open a separate issue about the astype problem (and remove it from the top of this one)?

jreback · 2017-07-16T15:29:57Z

this could be taken up after #16821

mroeschke · 2019-10-15T03:52:37Z

This looks fixed in master (category dtype is maintained). Could use a test.

In [46]: df.a = df.a.astype('category')
    ...: df.b = df.b.astype('category')
    ...: df.c = df.c.astype('category')
    ...: df.where(mask)
Out[46]:
     a    b    c
0  NaN  NaN  NaN
1  NaN    4    5

In [47]: df.a.where(mask[:,0])
    ...:
Out[47]:
0    NaN
1    NaN
Name: a, dtype: category
Categories (2, int64): [0, 3]

In [48]: df.a.where(mask[:,0], other=None)
    ...:
Out[48]:
0    NaN
1    NaN
Name: a, dtype: category
Categories (2, int64): [0, 3]

In [49]: pd.__version__
Out[49]: '0.26.0.dev0+565.g8c5941cd5'
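
A test along the lines suggested here could be sketched as follows. This is only an illustration with a fixed mask and plain assertions; it is not the test that was eventually merged, and it assumes a pandas version in which the fix is present.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("abc"))
for col in df.columns:
    df[col] = df[col].astype("category")

# Deterministic mask so the check does not depend on random numbers.
mask = np.array([[True, False, True], [False, True, True]])

result = df.where(mask)

# Every column should still be categorical after masking.
all_categorical = all(
    isinstance(dtype, pd.CategoricalDtype) for dtype in result.dtypes
)
# Cells where the mask is False should have become missing.
nan_pattern = result.isna().to_numpy()

assert all_categorical
assert (nan_pattern == ~mask).all()
```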

gfyoung · 2019-11-07T04:20:48Z

The .astype error also seems to have been resolved on master.

xref #16979

ganevgv · 2019-11-08T20:09:16Z

Thanks for the guidance, @gfyoung!

While trying to add a test for the .astype error, I encountered another issue:

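import numpy as np
import pandas._testing as tm  # pandas.util.testing before pandas 1.0
from pandas import DataFrame

df = DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("ABC"))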
mask = np.array([[True, False, True], [False, True, True]])

result = df.where(mask)
expected = DataFrame([[0, np.nan, 2], [np.nan, 4, 5]], columns=list("ABC"))

tm.assert_frame_equal(result, expected)

When the code above runs on some platforms (linux -- Python 3.6.7, win32 -- Python 3.6.7, win32 -- Python 3.7.5), applying the boolean mask appears to change the dtype of the last column [2, 5] from int64 to int32. This is the message from the logs:

>       tm.assert_frame_equal(result, expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 2] are different
E       
E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

I cannot reproduce this behaviour locally.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 35e91f9
python : 3.6.5.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.26.0.dev0+734.g0de99558b.dirty
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 39.0.1
Cython : 0.29.14
pytest : 5.2.2
hypothesis : 4.42.6
sphinx : 2.2.1
blosc : 1.8.1
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : 0.3.5
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.6.1
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2

gfyoung · 2019-11-08T20:29:59Z

@ganevgv : I would try opening a PR with this test, but add print statements to confirm whether the dtype is actually changing. It might actually be a platform thing where the dtype is already int32.
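
The print-based check suggested here could be as simple as the sketch below, which records the column dtypes before and after calling `where`. The names `before` and `after` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("ABC"))
mask = np.array([[True, False, True], [False, True, True]])

# dtypes of the input frame; the integer width may differ by platform.
before = df.dtypes
# dtypes after masking; columns that received NaN become float.
after = df.where(mask).dtypes

print("before where:")
print(before)
print("after where:")
print(after)
```

If column C already prints as int32 in the "before" output, the dtype is coming from the DataFrame construction and not from `where`.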

ganevgv · 2019-11-09T18:25:43Z

@gfyoung : After further investigation, I identified where the problem is coming from.

When initialising df with int data w/o nans, the default dtype for all columns on some platforms (linux -- Python 3.6.7, win32 -- Python 3.6.7 and win32 -- Python 3.7.5) is int32:

>       result = DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("ABC"))
>       expected = DataFrame( np.arange(2 * 3).reshape(2, 3), columns=list("ABC"), dtype=np.int64)
>       tm.assert_frame_equal(result, expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 0] are different
E
E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

However, if you initialise df with int data w/ nans the dtype for columns w/o nans (same platforms) is int64:

>       result = DataFrame([[0, np.nan, 2], [np.nan, 4, 5]], columns=list("ABC"))
>       expected = DataFrame([[0, np.nan, 2], [np.nan, 4, 5]], columns=list("ABC"), dtype=np.int32 )
>       tm.assert_frame_equal(result, expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 2] are different
E       
E       Attribute "dtype" are different
E       [left]:  int64
E       [right]: int32

On the other platforms (linux -- Python 3.6.1, linux -- Python 3.6.9, linux -- Python 3.7.5 and darwin -- Python 3.6.9), both df with int data w/o nans and w/ nans are initialised with int64 so there's no problem.

I believe this behaviour is unrelated to this issue as it's not testing the category preservation when using .where(). Furthermore, all the examples in the discussion have missed this behaviour (the columns have been [int, nan]/[nan, int] or [nan, nan] but not [int, int] alongside a column containing nan). I plan to change the input data in #29498 to the examples given in this issue and open a new issue describing the encountered behaviour. Do you agree with that?
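
One likely explanation, offered here as an assumption and not something confirmed in the thread, is that `np.arange` produces NumPy's platform default integer type, whereas pandas infers int64 when it builds columns from lists of Python ints. Under that assumption, pinning the dtype in the input data keeps such a test platform-independent, as in this sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Integer width follows NumPy's default integer on the current platform.
platform_default = pd.DataFrame(
    np.arange(2 * 3).reshape(2, 3), columns=list("ABC")
)

# Integer width is pinned explicitly, so it is the same everywhere.
pinned = pd.DataFrame(
    np.arange(2 * 3, dtype=np.int64).reshape(2, 3), columns=list("ABC")
)

# Built from lists of Python objects: pandas infers the dtype per column.
from_lists = pd.DataFrame([[0, np.nan, 2], [np.nan, 4, 5]], columns=list("ABC"))

print(platform_default.dtypes)
print(pinned.dtypes)
print(from_lists.dtypes)
```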

xref pandas-dev#16979

jreback added the Bug, Categorical, Indexing, and Reshaping labels Jul 16, 2017

jreback added Difficulty Intermediate labels Jul 16, 2017

jreback added this to the 0.21.0 milestone Jul 16, 2017

jreback changed the title ~~Errors in DataFrame.where and DataFrame.astype in DataFrames with 'category'~~ BUG: DataFrame.where with category dtype Jul 16, 2017

jreback modified the milestones: Next Major Release, 0.21.0 Jul 16, 2017

rhaps0dy mentioned this issue Jul 16, 2017

BUG: DataFrame.astype with category dtype #16983

Closed

jbrockmendel added the Categorical label Oct 16, 2019

ganevgv mentioned this issue Nov 7, 2019

TST: add test for df.where() with category dtype #29454

Merged

5 tasks

gfyoung modified the milestones: Contributions Welcome, 1.0 Nov 7, 2019

gfyoung pushed a commit that referenced this issue Nov 8, 2019

TST: add test for df.where() with category dtype (#29454)

6dbd2b1

xref #16979

ganevgv mentioned this issue Nov 9, 2019

TST: add test for df.where() with int dtype #29498

Merged

5 tasks

jreback closed this as completed in #29498 Nov 12, 2019

Reksbril pushed a commit to Reksbril/pandas that referenced this issue Nov 18, 2019

TST: add test for df.where() with category dtype (pandas-dev#29454)

3e6b970

xref pandas-dev#16979

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

TST: add test for df.where() with category dtype (pandas-dev#29454)

95f5dd2

xref pandas-dev#16979

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

TST: add test for df.where() with category dtype (pandas-dev#29454)

c798631

xref pandas-dev#16979
