BUG: DataFrame.where with category dtype #16979

Closed
rhaps0dy opened this issue Jul 16, 2017 · 7 comments · Fixed by #29498
Labels
Categorical, good first issue, Needs Tests

@rhaps0dy

rhaps0dy commented Jul 16, 2017

Code Sample (it is copy-pastable)

import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(2*3).reshape(2,3), columns=list('abc'))
mask = np.random.rand(*df.shape) < 0.5
df.where(mask)
# Output is correct:
#      a   b    c
# 0  NaN NaN  2.0
# 1  3.0 NaN  NaN

df.a = df.a.astype('category')
df.b = df.b.astype('category')
df.c = df.c.astype('category')
df.where(mask)
# ValueError: Wrong number of items passed 2, placement implies 1
# Expected output: the same as before, but now with dtype `category`.

df.a.where(mask[:,0])
# 0    NaN
# 1    3.0
# Name: a, dtype: float64
# should stay in dtype category

df.a.where(mask[:,0], other=None)
# 0    None
# 1    3
# Name: a, dtype: object
# Expected output: should stay in dtype category

Problem description

df.where should work with all dtypes; the documentation doesn't say that it only works for some of them. Also, NaNs are already handled correctly as missing data in a pd.Series of dtype 'category', so it should be possible to assign NaNs to such columns. The same applies to converting the dtype.

While writing this report I found that doing it column-by-column works correctly, so I'll use that as a workaround.
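
A minimal sketch of that column-by-column workaround, for anyone hitting the same error. It assumes a pandas version where Series.where keeps the category dtype (on 0.20.2, as shown above, it only avoids the ValueError and falls back to float64). The fixed mask and the `result` name are illustrative and not part of the original report.

```python
import numpy as np
import pandas as pd

# Same frame as in the report, with every column cast to category.
df = pd.DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("abc"))
df = df.astype("category")

# Fixed mask (instead of a random one) so the result is reproducible.
mask = np.array([[False, False, True], [True, False, False]])

# Apply Series.where one column at a time, using the matching mask column.
result = pd.DataFrame(
    {name: df[name].where(mask[:, i]) for i, name in enumerate(df.columns)},
    index=df.index,
)

print(result)
print(result.dtypes)
```

Values are kept where the mask is True and become NaN elsewhere, which is the behaviour expected from df.where(mask).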

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-81-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Ubuntu lsb_release -a:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

@jreback added the Bug, Categorical, Indexing, and Reshaping labels Jul 16, 2017
@jreback
Contributor

jreback commented Jul 16, 2017

Can you open a separate issue about the astype problem (and remove it from the top post here)?

@jreback added this to the 0.21.0 milestone Jul 16, 2017
@jreback changed the title from "Errors in DataFrame.where and DataFrame.astype in DataFrames with 'category'" to "BUG: DataFrame.where with category dtype" Jul 16, 2017
@jreback
Contributor

jreback commented Jul 16, 2017

this could be taken up after #16821

@jreback modified the milestones: Next Major Release, 0.21.0 Jul 16, 2017
@mroeschke
Member

This looks fixed in master (category dtype is maintained). Could use a test.

In [46]: df.a = df.a.astype('category')
    ...: df.b = df.b.astype('category')
    ...: df.c = df.c.astype('category')
    ...: df.where(mask)
Out[46]:
     a    b    c
0  NaN  NaN  NaN
1  NaN    4    5

In [47]: df.a.where(mask[:,0])
    ...:
Out[47]:
0    NaN
1    NaN
Name: a, dtype: category
Categories (2, int64): [0, 3]

In [48]: df.a.where(mask[:,0], other=None)
    ...:
Out[48]:
0    NaN
1    NaN
Name: a, dtype: category
Categories (2, int64): [0, 3]

In [49]: pd.__version__
Out[49]: '0.26.0.dev0+565.g8c5941cd5'
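
A regression test along these lines might look like the sketch below. It is only an illustration, not the test that was eventually merged: the test name, the fixed mask (chosen to match the output above), and the use of the public pandas.testing module are all assumptions. The input dtype is pinned to int64 so the categories do not depend on the platform's default integer.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def test_where_preserves_category_dtype():
    # Hypothetical test name; int64 is pinned so categories match everywhere.
    data = np.arange(2 * 3, dtype="int64").reshape(2, 3)
    df = pd.DataFrame(data, columns=list("abc")).astype("category")
    mask = np.array([[False, False, False], [False, True, True]])

    result = df.where(mask)

    # Masked positions become NaN; the categories of each column are unchanged.
    expected = pd.DataFrame(
        {
            "a": pd.Categorical([np.nan, np.nan], categories=[0, 3]),
            "b": pd.Categorical([np.nan, 4], categories=[1, 4]),
            "c": pd.Categorical([np.nan, 5], categories=[2, 5]),
        }
    )
    tm.assert_frame_equal(result, expected)


test_where_preserves_category_dtype()
```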

@mroeschke added the good first issue and Needs Tests labels and removed the Bug, Categorical, Difficulty Intermediate, Indexing, and Reshaping labels Oct 15, 2019
@jbrockmendel added the Categorical label Oct 16, 2019
@gfyoung
Member

gfyoung commented Nov 7, 2019

The .astype error also seems to have been resolved on master.

@gfyoung modified the milestones: Contributions Welcome, 1.0 Nov 7, 2019
@ganevgv
Contributor

ganevgv commented Nov 8, 2019

Thanks for the guidance, @gfyoung!

While trying to add a test for the .astype error, I encountered another issue:

df = DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("ABC"))
mask = np.array([[True, False, True], [False, True, True]])

result = df.where(mask)
expected = DataFrame([[0, np.nan, 2], [np.nan, 4, 5]], columns=list("ABC"))

tm.assert_frame_equal(result, expected)

After running the code above, on some platforms (linux -- Python 3.6.7, win32 -- Python 3.6.7, win32 -- Python 3.7.5) applying the boolean mask appears to change the dtype of the last column [2, 5] from int64 to int32. This is the message from the logs:

>       tm.assert_frame_equal(result, expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 2] are different
E       
E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

I cannot reproduce this behaviour locally.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 35e91f9
python : 3.6.5.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.26.0.dev0+734.g0de99558b.dirty
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 39.0.1
Cython : 0.29.14
pytest : 5.2.2
hypothesis : 4.42.6
sphinx : 2.2.1
blosc : 1.8.1
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : 0.3.5
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.6.1
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2

@gfyoung
Member

gfyoung commented Nov 8, 2019

@ganevgv : I would try opening a PR with this test, but add print statements to confirm whether the dtype is actually changing. It might actually be a platform thing where the dtype is already int32.
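
A rough sketch of the kind of print-based check suggested here, assuming the same frame and mask as in the failing test. The `dtype_changed` name is illustrative. No output is shown because the dtypes depend on the platform.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("ABC"))
mask = np.array([[True, False, True], [False, True, True]])

# What NumPy picks by default on this platform (this is what the frame inherits).
print("NumPy default integer dtype:", np.arange(1).dtype)

print("dtypes before where:")
print(df.dtypes)

result = df.where(mask)

print("dtypes after where:")
print(result.dtypes)

# Column C is never masked, so this tells us whether where() itself
# changed its dtype or whether it was already int32 going in.
dtype_changed = result["C"].dtype != df["C"].dtype
print("column C dtype changed by where:", dtype_changed)
```

If the last line prints False on the failing builds, the int32 comes from how the frame was constructed rather than from where().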

@ganevgv
Contributor

ganevgv commented Nov 9, 2019

@gfyoung : After further investigation, I identified where the problem is coming from.

When initialising df with int data w/o nans, the default dtype for all columns on some platforms (linux -- Python 3.6.7, win32 -- Python 3.6.7 and win32 -- Python 3.7.5) is int32:

>       result = DataFrame(np.arange(2 * 3).reshape(2, 3), columns=list("ABC"))
>       expected = DataFrame( np.arange(2 * 3).reshape(2, 3), columns=list("ABC"), dtype=np.int64)
>       tm.assert_frame_equal(result, expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 0] are different
E
E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

However, if you initialise df with int data w/ nans the dtype for columns w/o nans (same platforms) is int64:

>       result = DataFrame([[0, np.nan, 2], [np.nan, 4, 5]], columns=list("ABC"))
>       expected = DataFrame([[0, np.nan, 2], [np.nan, 4, 5]], columns=list("ABC"), dtype=np.int32 )
>       tm.assert_frame_equal(result, expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 2] are different
E       
E       Attribute "dtype" are different
E       [left]:  int64
E       [right]: int32

On the other platforms (linux -- Python 3.6.1, linux -- Python 3.6.9, linux -- Python 3.7.5 and darwin -- Python 3.6.9), both df with int data w/o nans and w/ nans are initialised with int64 so there's no problem.

I believe this behaviour is unrelated to this issue as it's not testing the category preservation when using .where(). Furthermore, all the examples in the discussion have missed this behaviour (the columns have been [int, nan]/[nan, int] or [nan, nan] but not [int, int] alongside a column containing nan). I plan to change the input data in #29498 to the examples given in this issue and open a new issue describing the encountered behaviour. Do you agree with that?
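
One possible way to make such a comparison independent of the platform, sketched under the assumption that the int32 comes from NumPy's default integer on those builds: pin the integer dtype explicitly on both sides. This is only an illustration, not necessarily what #29498 ended up doing, and it uses the public pandas.testing module as `tm`.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm

# Pin int64 on the input so the platform default integer plays no role.
df = pd.DataFrame(
    np.arange(2 * 3, dtype=np.int64).reshape(2, 3), columns=list("ABC")
)
mask = np.array([[True, False, True], [False, True, True]])

result = df.where(mask)

# Columns that receive a NaN become float64; the untouched column stays int64.
expected = pd.DataFrame(
    {
        "A": np.array([0, np.nan]),
        "B": np.array([np.nan, 4]),
        "C": np.array([2, 5], dtype=np.int64),
    }
)

tm.assert_frame_equal(result, expected)
```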

6 participants