
pandas.DataFrame.where not replacing NaTs properly #15613

Closed
grechut opened this issue Mar 8, 2017 · 9 comments · Fixed by #41607
Labels
good first issue · Needs Tests (Unit test(s) needed to prevent regressions)
Milestone

Comments

@grechut

grechut commented Mar 8, 2017

Problem description

pandas.DataFrame.where does not seem to replace NaT values properly.

As the example below shows, NaT values remain in the DataFrame after applying .where(pd.notnull(df), None).

Code sample

In [26]: pd.__version__
Out[26]: '0.19.2'

In [27]: df
Out[27]: 
           d     v
0 2015-01-01  30.0
1        NaT  40.0
2 2015-01-03   NaN

In [28]: pd.notnull(df)
Out[28]: 
       d      v
0   True   True
1  False   True
2   True  False

In [29]: df.where((pd.notnull(df)), None)
Out[29]: 
           d     v
0 2015-01-01    30
1        NaT    40
2 2015-01-03  None
In [30]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.3.1
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Mar 8, 2017

here's a copy-pastable example

In [13]: df = pd.DataFrame({'A':[pd.Timestamp('20130101'),pd.NaT,pd.Timestamp('20130103')],'B':[1,2,np.nan]})

In [14]: df
Out[14]: 
           A    B
0 2013-01-01  1.0
1        NaT  2.0
2 2013-01-03  NaN

In [15]: df[~df.isnull()] = None
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value

In [16]: df.where(df.notnull(), None)
Out[16]: 
           A     B
0 2013-01-01     1
1        NaT     2
2 2013-01-03  None

note that [15] is not allowed; [16] is not in-place, but is the same operation.

The issue is that when we reconstruct A we always infer datetimes; IOW, we don't allow np.nan, None, or any other null value to exist in a datetime dtype; instead these are coerced to NaT.

@grechut why exactly are you doing this, and what is the utility?
Note I even find [16].B odd, where we actually replace with None even though np.nan is our numeric missing value marker.

The entire issue is that setting things to None forces object dtype, which is rarely what one wants.
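A minimal sketch of the coercion described above (constructed for illustration, not part of the original report): assigning any null marker into a datetime64 Series yields NaT, never None, and the dtype is unchanged.

```python
import numpy as np
import pandas as pd

# Any null assigned into a datetime64 column is coerced to NaT;
# the dtype stays datetime64 rather than becoming object.
s = pd.Series(pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]))
s.iloc[0] = None
s.iloc[1] = np.nan

print(s.dtype)         # datetime64[ns]
print(s.isna().sum())  # 2
```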

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Mar 8, 2017
@grechut
Author

grechut commented Mar 8, 2017

Thanks for the response! :)

Sorry for the non-copy-pastable example. (pd.read_clipboard would handle it, but that's not a convenient way :) )

Our use case: we have a very blunt method that sanitizes all None-like values (np.nan etc.) to None. It runs before sending data to a database or before exposing data via API endpoints. We need it because SQLAlchemy does no extra handling of None-like values.

So what is unclear/confusing is that a float64 series is changed to object and gets None, while a datetime64[ns] series is silently handled in a different way.

Regarding "Note I even find [16].B odd": I can assume that dropping this pattern would be a very breaking change, where people would hit lots of weird bugs.

I thought that maybe for our case, we should serialize before sending values to the database:

In [29]: json.loads(df.to_json(orient='records'))
Out[29]: 
[{u'A': 1356998400000, u'B': 1.0},
 {u'A': None, u'B': 2.0},
 {u'A': 1357171200000, u'B': None}]

But that's an extra step to perform, and with large datasets it can be a significant one.

Also thought about using to_dict, but it does not convert to None:

In [32]: df.to_dict('r')
Out[32]: 
[{'A': Timestamp('2013-01-01 00:00:00'), 'B': 1.0},
 {'A': NaT, 'B': 2.0},
 {'A': Timestamp('2013-01-03 00:00:00'), 'B': nan}]

...and I feel it would be more intuitive to return None here instead of NaT and nan.
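The "blunt" sanitize step described above could be sketched as a post-processing pass over to_dict output (sanitize_records is an illustrative name, not an actual pandas or SQLAlchemy API):

```python
import pandas as pd

def sanitize_records(df):
    """Convert a DataFrame to records, mapping every pandas null
    value (NaN, NaT, ...) to plain None for downstream consumers."""
    return [
        {key: (None if pd.isnull(value) else value) for key, value in row.items()}
        for row in df.to_dict("records")
    ]
```

It iterates over every cell, so on large frames it costs an extra full pass, which is exactly the overhead being discussed.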

@jreback
Contributor

jreback commented Mar 9, 2017

@grechut IIRC the way this is handled in to_sql is: you first cast the entire frame to object, then use where to replace things.

In [1]: df = pd.DataFrame({'A':[pd.Timestamp('20130101'),pd.NaT,pd.Timestamp('20130103')],'B':[1,2,np.nan]})
   ...: 

In [2]: df
Out[2]: 
           A    B
0 2013-01-01  1.0
1        NaT  2.0
2 2013-01-03  NaN

In [3]: df.dtypes
Out[3]: 
A    datetime64[ns]
B           float64
dtype: object

In [4]: df.astype(object)
Out[4]: 
                     A    B
0  2013-01-01 00:00:00    1
1                  NaT    2
2  2013-01-03 00:00:00  NaN

In [5]: df2 = df.astype(object)

In [8]: df2.where(df2.notnull(), None)
Out[8]: 
                     A     B
0  2013-01-01 00:00:00     1
1                 None     2
2  2013-01-03 00:00:00  None
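Combining this cast-to-object pattern with the database use case above, a hedged sketch producing None-sanitized records in one chain:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [pd.Timestamp("2013-01-01"), pd.NaT, pd.Timestamp("2013-01-03")],
    "B": [1, 2, np.nan],
})

# Cast to object first so the columns can hold None, then replace every
# null (NaT and NaN alike) with None and emit plain-Python records.
records = df.astype(object).where(df.notnull(), None).to_dict("records")
```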

@jreback
Contributor

jreback commented Mar 9, 2017

see also this comment: #15533 (comment) which is a similar issue. we have to come up with a good API for this.

@grechut
Author

grechut commented Mar 24, 2017

So maybe pandas.DataFrame.where's raise_on_error should inform you that you're trying to perform an operation whose result might be different from what you'd expect.

According to the docs, raise_on_error means "Whether to raise on invalid data types (e.g. trying to where on strings)". So in this case it's trying to where on a datetime column, whose type implies that null-like values are forced to be NaTs.

Another note: after reading the docs, I thought that pandas.DataFrame.where's try_cast=False should allow implicit conversion of the type. So my thinking was:

  • try_cast == True : cast back to datetime
  • try_cast == False : cast to the type implied by the other values.

But it didn't work this way.

All those remarks are API-wise. Implementation-wise they might be hard, with little payoff. So maybe just raise a warning/error (partially pseudocode):

if (column.dtype == 'datetime64[ns]' and column_has_nats
        and other is not pd.NaT and pd.isnull(other)):
    raise ValueError(
        f"Trying to replace NaT with {other!r} would require changing the type of {column.name}."
    )

@jreback
Contributor

jreback commented Mar 24, 2017

note #14968 .

So this is coerced here:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L2277

This is correct, though I understand you want a different result. You could disambiguate None from other nulls here: if you detect exactly a None null, change the block to object and repeat.

This would work in this case, but will likely break other things. You can see what breaks and we can go from there.

Note this same change would also apply to TimedeltaBlock.

@mroeschke
Member

The NaT is expected, but now we're coercing None to nan which looks preferable. Could use a test

In [82]: df = pd.DataFrame({'A': [pd.Timestamp('20130101'), pd.NaT, pd.Timestamp('20130103')],
    ...:                    'B': [1, 2, np.nan]})

In [83]: df.where(df.notnull(), None)
Out[83]:
           A    B
0 2013-01-01  1.0
1        NaT  2.0
2 2013-01-03  NaN
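A small sketch of the behavior described above (assuming pandas >= 1.3, where the None passed to where is coerced to each column's native missing value rather than forcing object dtype):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [pd.Timestamp("2013-01-01"), pd.NaT, pd.Timestamp("2013-01-03")],
    "B": [1, 2, np.nan],
})

out = df.where(df.notnull(), None)

# The None is coerced per column: NaT in the datetime column,
# NaN in the float column, so the original dtypes are preserved.
print(out.dtypes)
```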

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels May 8, 2021
@mroeschke mroeschke mentioned this issue May 21, 2021
10 tasks
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 21, 2021
@Pysion-lin

pandas==1.0.3:

In [16]: df.where(df.notnull(), None)
Out[16]:
           A     B
0 2013-01-01     1
1        NaT     2
2 2013-01-03  None

but pandas==1.3.1:

In [16]: df.where(df.notnull(), None)
Out[16]:
           A    B
0 2013-01-01    1
1        NaT    2
2 2013-01-03  NaN

why?

@MarcoGorelli
Member

MarcoGorelli commented Aug 12, 2021

@Pysion-lin please open a new issue with a reproducible example if the above

> but now we're coercing None to nan which looks preferable

doesn't answer your question

6 participants