Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datetime.date no longer coerced to datetime64 for comparison operations #21152

Closed
innominate227 opened this issue May 21, 2018 · 23 comments · Fixed by #21361
Closed

datetime.date no longer coerced to datetime64 for comparison operations #21152

innominate227 opened this issue May 21, 2018 · 23 comments · Fixed by #21361
Labels
Blocker Blocking issue or pull request for an upcoming release Datetime Datetime data dtype Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@innominate227
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
import datetime

pydate = datetime.date(2018,1,1)
df = pd.DataFrame({'date':[np.datetime64('2018-01-01')]})

sel = df[df['date'] == pydate]
print(len(sel))

Problem description

In pandas 0.22 the code above printed "1", in 0.23 it prints "0".

Expected Output

1

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 2.9.2
pip: 8.1.2
setuptools: 39.0.1
Cython: 0.24.1
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.3.2
xlrd: 1.1.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.9999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger TomAugspurger changed the title date comparison change in 0.23 datetime.date no longer coerced to datetime64 for comparison operations May 21, 2018
@TomAugspurger
Copy link
Contributor

cc @jbrockmendel intentional, or regression?

@TomAugspurger TomAugspurger added Datetime Datetime data dtype Numeric Operations Arithmetic, Comparison, and Logical operations labels May 21, 2018
@TomAugspurger
Copy link
Contributor

Interesting, Python considers them different:

In [25]: pydate == datetime.datetime(2018, 1, 1)
Out[25]: False

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 21, 2018

Hmm, #18188 contained a test for this, but it was removed in 87fefe2#diff-790bb1655c69fe970376e710435edb80R352

relevant bit is https://github.com/pandas-dev/pandas/pull/19800/files#r169681238

So, I'm saying this was an intentional change to make Series[datetime64] behave the same as DatetimeIndex here. But we forgot to update the release notes in #19800

Does that sound right?

@jbrockmendel
Copy link
Member

intentional, or regression?

Intentional. The Series/DataFrame behavior was changed to match the DatetimeIndex behavior, which is based on the Timestamp/datetime behavior. If a release note was missed that was likely my mistake.

@changhsinlee
Copy link

I had the same problem. Please add this change to the release note, as it breaks a lot of code :( ... Is it possible to add this comparison back in the next release?

@TomAugspurger
Copy link
Contributor

@changhsinlee a PR adding it to the changelog would be welcome. We can rebuild and re-upload the docs.

Not sure about reverting the change. The fact that we're now consistent

  1. with NumPy and Python
  2. between DatetimeIndex and Series

is attractive

In [1]: import datetime

In [2]: import pandas as pd
im
In [3]: import numpy as np

In [4]: datetime.date(2017, 1, 1) == datetime.datetime(2017, 1, 1)
Out[4]: False

In [5]: np.array(['2017-01-01'], dtype='M8[s]') == datetime.date(2017, 1, 1)
Out[5]: array([False])

In [6]: pd.Series(pd.to_datetime(['2017-01-01'])) == datetime.date(2017, 1, 1)
Out[6]:
0    False
dtype: bool

In [7]: pd.to_datetime(['2017-01-01']) == datetime.date(2017, 1, 1)
Out[7]: array([False])

Previously Out[6] was True and In[7] raised. I'm not sure how much effort it would be to implement a proper deprecation here, though I suspect it will be tricky.

as it breaks a lot of code

What sorts of workloads does this break? Are you able to wrap your datetime.dates in a pd.Timestamp before comparing?

@innominate227
Copy link
Author

innominate227 commented May 22, 2018

Most frustrating to me is how [6] silently became false. Its going to be hard to find all the places in our codebase where comparisons like this were done. Does it make since for [6] to write some warning?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 22, 2018 via email

@jorisvandenbossche jorisvandenbossche added this to the 0.23.1 milestone May 23, 2018
@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label May 23, 2018
@jorisvandenbossche
Copy link
Member

So to recap. Behaviour on 0.22:

In [1]: dtidx = pd.date_range("2012-01-01", periods=3)

In [3]: import datetime

In [4]: pydate = datetime.date(2012, 1, 1)

In [5]: dtidx == pydate
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: <class 'datetime.date'> type object 2012-01-01

In [6]: pd.Series(dtidx) == pydate
Out[6]: 
0     True
1    False
2    False
dtype: bool

Behaviour on 0.23:

In [21]: dtidx == pydate
Out[21]: array([False, False, False], dtype=bool)

In [22]: pd.Series(dtidx) == pydate
Out[22]: 
0    False
1    False
2    False
dtype: bool

I personally think silently changing True to False was a bad idea of us (if it started raising an error as in 0.22 dtidx, it would at least not give silent changes in behaviour). This is a serious regression IMO.

So personally I would revert this change. And then we can decide to keep it (and eg change DatetimeIndex to be consistent with Series behaviour), or raise a deprecation warning we will change this in the future.

@jorisvandenbossche
Copy link
Member

On which behaviour we would ideally like to have (coerce or not):

Not sure about reverting the change. The fact that we're now consistent

  • with NumPy and Python
  • between DatetimeIndex and Series

is attractive

That is true, but on the other hand, in pandas we generally do much more coercing to the type it is compared with. For example, we also coerce strings (using the same example from above):

In [23]: pd.Series(dtidx) == "2012-01-01"
Out[23]: 
0     True
1    False
2    False
dtype: bool

(which also works for DatetimeIndex)

@jreback
Copy link
Contributor

jreback commented May 23, 2018

This is a good deliberate change. datetime.date is not a first class pandas type. Folks have been using at risk for years. We fully match numpy / python semantics. Changing to broken semantics is unreasonable. -1 on any further change here.

@jreback
Copy link
Contributor

jreback commented May 23, 2018

@jorisvandenbossche your examples [23] is flawed. String are coerced to partial string indexing which gives much flexibility. datetime.data / datetime.date / Timestamps are never coerced in comparisons and are exact matches.

@TomAugspurger
Copy link
Contributor

Folks have been using at risk for years.

"At risk" implies that people know they were doing something "wrong" :)

w.r.t. Out[23], I don't see how partial string indexing comes in. With partial string indexing, I would expect the following to be [True, True, False].

In [3]: pd.Series(pd.to_datetime(['2017-01-01', '2017-01-02', '2018'])) == '2017'
Out[3]:
0     True
1    False
2    False
dtype: bool

But anyway, I'm not sure coercing the string makes sense here, so I'm not going to argue for that :)

@jorisvandenbossche
Copy link
Member

Yes, IMO you could also say that folks have been using this because it was well established behaviour of pandas, no matter how python or numpy did it. And we silently changed the result of their calculation.

And wanted to give the same example as Tom, that I don't think this the example I gave was wrong or flawed. The string comparison is also an exact match.
That is not necessarily an argument for allowing matches with datetime.date, but it is at least a precedent that shows we coerce "datetime-like values".

@jbrockmendel
Copy link
Member

It looks like there are 3 issues here:

  1. A missing release note, almost certainly my mistake
  2. A change in how datetime.date is treated -- the new behavior is correct and should not be reverted
  3. A discussion of how string comparisons are treated, which can be a minefield of its own (e.g. DatetimeIndex lookups tz inconsistency #18435)

Unless I'm mistaken about 3) being tangential to the OP, I think it confuses more than it clarifies.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 23, 2018 via email

@jorisvandenbossche
Copy link
Member

Indeed, I don't agree with how you state 2, but I don't necessarily disagree with the change itself.

Coercing a datetime.date to an exact timestamp to do comparison ops is not "wrong" or "correct", it is a design choice. We can discuss which we think is most consistent, or will lead to the least confusion, is the clearest, ... But in the end: in the past we made the choice to explicitly allow this for Series, and given the inconsistency with DatetimeIndex, for 0.23 we made the choice to change this behaviour.

I am not fully sure this was the best choice (we could also have opted for changing the DatetimeIndex behaviour), but I don't really object the choice (since there is not really a good reason people would want to do this, if they can also use datetime).

I would add a 2a): How the change was made. Should we have done a deprecation cycle? (yes) And is it worth reinstating with a deprecation? (not sure).

Yep, I agree we should ideally have done this with a deprecation cycle. And I personally think it is still worth to do it for 0.23.1

@jbrockmendel
Copy link
Member

Indeed, I don't agree with how you state 2, but I don't necessarily disagree with the change itself.

I evidently communicated poorly. The bit before the "--" was intended as the issue description and the bit after the "--" my opinion on it.

we could also have opted for changing the DatetimeIndex behaviour

There was some discussion about this. The general thrust of it is that DatetimeIndex ops should behave like vectorized versions of Timestamp ops. Timestamp follows the convention set by datetime when it comes to comparisons with date. The analogy I think of is datetime:date :: Timestamp:Period[D]

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 24, 2018

The general thrust of it is that DatetimeIndex ops should behave like vectorized versions of Timestamp ops

OK, but we don't really follow that if it comes to strings anyway ... (as that does not work on Timestamp, but does work for DatetimeIndex or Series).
Which doesn't mean this discrepancy is a problem. For me it is fine to not support datetime.date, but support strings (and maybe we should add it to Timestamp for consistency?). I just think users could not have guessed this that it was "wrong".

@jbrockmendel
Copy link
Member

OK, but we don't really follow that if it comes to strings anyway

This came up in #18435 and the conclusion IIRC is that the string special-casing was very specifically a convenience feature for indexing purposes.

@TomAugspurger
Copy link
Contributor

I think we'll keep the new behavior and issue a warning that users probably don't want to compare to a datetime.date, since it's always false.

@h-vetinari
Copy link
Contributor

It's not just about equality - arguably, datetime >= date makes sense in all scenarios, and this change broke my code.

But even if consistency is ultimately decided to be the higher good, at least the error warning should be better in one of the following cases -- i.e. not TypeError: <class 'datetime.date'> type object 2013-01-01 but consistent with TypeError: can't compare datetime.datetime to datetime.date.

import pandas as pd
from datetime import date, datetime

v.0.22:

pd.Series([date(2014, 1, 1), date(2013, 1, 1)]) >= date(2013, 1, 1)
# 0    True
# 1    True
# dtype: bool
pd.Series([date(2014, 1, 1), datetime(2013, 1, 1)]) >= date(2013, 1, 1)
# 0    True
# 1    True
# dtype: bool
pd.Series([date(2014, 1, 1), date(2013, 1, 1)]) >= datetime(2013, 1, 1)
# TypeError: can't compare datetime.datetime to datetime.date
pd.Series([date(2014, 1, 1), datetime(2013, 1, 1)]) >= datetime(2013, 1, 1)
# 0    True
# 1    True
# dtype: bool

v.0.23:

pd.Series([date(2014, 1, 1), date(2013, 1, 1)]) >= date(2013, 1, 1)
# 0    True
# 1    True
# dtype: bool
pd.Series([date(2014, 1, 1), datetime(2013, 1, 1)]) >= date(2013, 1, 1)
# TypeError: <class 'datetime.date'> type object 2013-01-01
pd.Series([date(2014, 1, 1), date(2013, 1, 1)]) >= datetime(2013, 1, 1)
# TypeError: can't compare datetime.datetime to datetime.date
pd.Series([date(2014, 1, 1), datetime(2013, 1, 1)]) >= datetime(2013, 1, 1)
# 0    True
# 1    True
# dtype: bool

@jreback jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018
@TomAugspurger TomAugspurger removed this from the 0.23.2 milestone Jun 7, 2018
@TomAugspurger TomAugspurger added this to the 0.23.1 milestone Jun 7, 2018
@TomAugspurger TomAugspurger added the Blocker Blocking issue or pull request for an upcoming release label Jun 7, 2018
@TomAugspurger
Copy link
Contributor

I think we agreed this was a blocker for 0.23.1.

Will try to get to it today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Datetime Datetime data dtype Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants