
pandas.DataFrame.where not replacing NaTs properly #15613

Closed
grechut opened this issue Mar 8, 2017 · 9 comments · Fixed by #41607
Labels
good first issue · Needs Tests (Unit test(s) needed to prevent regressions)
Milestone

Comments

@grechut

grechut commented Mar 8, 2017

Problem description

pandas.DataFrame.where does not seem to replace NaT values properly.

As the example below shows, NaT values remain in the DataFrame after applying .where(pd.notnull(df), None).

Code sample

In [26]: pd.__version__
Out[26]: '0.19.2'

In [27]: df
Out[27]: 
           d     v
0 2015-01-01  30.0
1        NaT  40.0
2 2015-01-03   NaN

In [28]: pd.notnull(df)
Out[28]: 
       d      v
0   True   True
1  False   True
2   True  False

In [29]: df.where((pd.notnull(df)), None)
Out[29]: 
           d     v
0 2015-01-01    30
1        NaT    40
2 2015-01-03  None
In [30]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.3.1
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Mar 8, 2017

here's a copy-pastable example

In [13]: df = pd.DataFrame({'A':[pd.Timestamp('20130101'),pd.NaT,pd.Timestamp('20130103')],'B':[1,2,np.nan]})

In [14]: df
Out[14]: 
           A    B
0 2013-01-01  1.0
1        NaT  2.0
2 2013-01-03  NaN

In [15]: df[~df.isnull()] = None
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value

In [16]: df.where(df.notnull(), None)
Out[16]: 
           A     B
0 2013-01-01     1
1        NaT     2
2 2013-01-03  None

note that [15] is not allowed; [16] is not in-place, but is the same operation.

The issue is that when we reconstruct A we always infer datetimes; IOW, we don't allow np.nan, None, or any other null value to exist in a datetime dtype; instead these are coerced to NaT.

@grechut why exactly are you doing this, and what is the utility?
Note I even find [16].B odd, where we actually replace with None even though np.nan is our numeric missing value marker.

The entire issue is that setting things to None forces object dtype, which is rarely what one wants.
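A minimal sketch of the coercion described above (constructed for illustration, not part of the original report): assigning any null marker into a datetime64 Series yields NaT, never None, and the dtype is unchanged.

```python
import numpy as np
import pandas as pd

# Any null assigned into a datetime64 column is coerced to NaT;
# the dtype stays datetime64 rather than becoming object.
s = pd.Series(pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]))
s.iloc[0] = None
s.iloc[1] = np.nan

print(s.dtype)         # datetime64[ns]
print(s.isna().sum())  # 2
```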

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Mar 8, 2017
@grechut
Author

grechut commented Mar 8, 2017

Thanks for the response! :)

Sorry for the non-copy-pastable example. (pd.read_clipboard would handle it, but that's not a convenient way :) )

Our use case: we have a very blunt method that sanitizes all None-like values (np.nan etc.) to None. It runs before sending data to a database or before exposing data via API endpoints. We need it because SQLAlchemy does no extra handling of None-like values.

So what is unclear/confusing is that a float64 series is changed to object and gets None, while a datetime64[ns] series is silently handled in a different way.

Regarding "Note I even find [16].B odd": I can assume that dropping this pattern would be a very breaking change, where people would hit lots of weird bugs.

I thought that maybe for our case, we should serialize before sending values to the database:

In [29]: json.loads(df.to_json(orient='records'))
Out[29]: 
[{u'A': 1356998400000, u'B': 1.0},
 {u'A': None, u'B': 2.0},
 {u'A': 1357171200000, u'B': None}]

But that's an extra step to perform, and with large datasets it can be a significant one.

Also thought about using to_dict, but it does not convert to None:

In [32]: df.to_dict('r')
Out[32]: 
[{'A': Timestamp('2013-01-01 00:00:00'), 'B': 1.0},
 {'A': NaT, 'B': 2.0},
 {'A': Timestamp('2013-01-03 00:00:00'), 'B': nan}]

...and I feel it would be more intuitive to return None here instead of NaT and nan.
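The "blunt" sanitize step described above could be sketched as a post-processing pass over to_dict output (sanitize_records is an illustrative name, not an actual pandas or SQLAlchemy API):

```python
import pandas as pd

def sanitize_records(df):
    """Convert a DataFrame to records, mapping every pandas null
    value (NaN, NaT, ...) to plain None for downstream consumers."""
    return [
        {key: (None if pd.isnull(value) else value) for key, value in row.items()}
        for row in df.to_dict("records")
    ]
```

It iterates over every cell, so on large frames it costs an extra full pass, which is exactly the overhead being discussed.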

@jreback
Contributor

jreback commented Mar 9, 2017

@grechut IIRC the way this is handled in to_sql is: you first cast the entire frame to object, then use where to replace things.

In [1]: df = pd.DataFrame({'A':[pd.Timestamp('20130101'),pd.NaT,pd.Timestamp('20130103')],'B':[1,2,np.nan]})
   ...: 

In [2]: df
Out[2]: 
           A    B
0 2013-01-01  1.0
1        NaT  2.0
2 2013-01-03  NaN

In [3]: df.dtypes
Out[3]: 
A    datetime64[ns]
B           float64
dtype: object

In [4]: df.astype(object)
Out[4]: 
                     A    B
0  2013-01-01 00:00:00    1
1                  NaT    2
2  2013-01-03 00:00:00  NaN

In [5]: df2 = df.astype(object)

In [8]: df2.where(df2.notnull(), None)
Out[8]: 
                     A     B
0  2013-01-01 00:00:00     1
1                 None     2
2  2013-01-03 00:00:00  None
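Combining this cast-to-object pattern with the database use case above, a hedged sketch producing None-sanitized records in one chain:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [pd.Timestamp("2013-01-01"), pd.NaT, pd.Timestamp("2013-01-03")],
    "B": [1, 2, np.nan],
})

# Cast to object first so the columns can hold None, then replace every
# null (NaT and NaN alike) with None and emit plain-Python records.
records = df.astype(object).where(df.notnull(), None).to_dict("records")
```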

@jreback
Contributor

jreback commented Mar 9, 2017

see also this comment: #15533 (comment) which is a similar issue. we have to come up with a good API for this.

@grechut
Author

grechut commented Mar 24, 2017

So maybe pandas.DataFrame.where's raise_on_error should inform you that you're trying to perform an operation whose result might be different from what you'd expect.

According to the docs, raise_on_error means "Whether to raise on invalid data types (e.g. trying to where on strings)". So in this case it's trying to where on a datetime column, whose type implies that null-like values are forced to be NaTs.

Another note: after reading the docs, I thought that pandas.DataFrame.where's try_cast=False should allow implicit conversion of the type. So my thinking was:

  • try_cast == True : cast back to datetime
  • try_cast == False : cast to the type implied by the other values.

But it didn't work this way.

All those remarks are API-wise. Implementation-wise they might be hard, with little payoff. So maybe just raise a warning/error (partially pseudocode):

if (column.dtype == 'datetime64[ns]' and column_has_nats
        and other is not pd.NaT and pd.isnull(other)):
    raise ValueError(
        f"Trying to replace NaT with {other!r} would require changing the type of {column.name}."
    )

@jreback
Contributor

jreback commented Mar 24, 2017

note #14968 .

So this is coerced here:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L2277

This is correct, though I understand you want a different result. You could disambiguate None from other nulls here: if you detect exactly a None null, change the block to object and repeat.

This would work in this case, but will likely break other things. You can see what breaks and we can go from there.

Note this same change would also apply to TimedeltaBlock.

@mroeschke
Member

The NaT is expected, but now we're coercing None to nan which looks preferable. Could use a test

In [82]: df = pd.DataFrame({'A': [pd.Timestamp('20130101'), pd.NaT, pd.Timestamp('20130103')],
    ...:                    'B': [1, 2, np.nan]})

In [83]: df.where(df.notnull(), None)
Out[83]:
           A    B
0 2013-01-01  1.0
1        NaT  2.0
2 2013-01-03  NaN
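A small sketch of the behavior described above (assuming pandas >= 1.3, where the None passed to where is coerced to each column's native missing value rather than forcing object dtype):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [pd.Timestamp("2013-01-01"), pd.NaT, pd.Timestamp("2013-01-03")],
    "B": [1, 2, np.nan],
})

out = df.where(df.notnull(), None)

# The None is coerced per column: NaT in the datetime column,
# NaN in the float column, so the original dtypes are preserved.
print(out.dtypes)
```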

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels May 8, 2021
@mroeschke mroeschke mentioned this issue May 21, 2021
10 tasks
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 21, 2021
@Pysion-lin

pandas==1.0.3:

In [16]: df.where(df.notnull(), None)
Out[16]:
           A     B
0 2013-01-01     1
1        NaT     2
2 2013-01-03  None

but pandas==1.3.1:

In [16]: df.where(df.notnull(), None)
Out[16]:
           A    B
0 2013-01-01    1
1        NaT    2
2 2013-01-03  NaN

why?

@MarcoGorelli
Member

MarcoGorelli commented Aug 12, 2021

@Pysion-lin please open a new issue with a reproducible example if the above

> but now we're coercing None to nan which looks preferable

doesn't answer your question

6 participants