Unexpected results in Numpy datetime64 comparison within DataFrame #16831

Closed
gmatheus95 opened this issue Jul 5, 2017 · 8 comments
Labels: Bug, Datetime, Duplicate Report


@gmatheus95

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from datetime import datetime,timedelta

threshold = np.datetime64(datetime.today()+timedelta(weeks=3))
df[threshold < df['date']]
df[df['date'] < threshold]

Problem description

The two comparisons above should produce opposite results. Instead, both return the same result, as if df['date'] were always the first operand of the comparison.

Expected Output

The picture below illustrates the issue. I expected df[threshold < df['date']] to return an empty DataFrame.
[Screenshot from 2017-07-05 15-17-58]

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-83-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.4.1
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: None
httplib2: 0.10.3
apiclient: 1.6.2
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: 2.38.0
pandas_datareader: None

@cristianornelas

Same issue here.

@TomAugspurger
Contributor

Can you make a complete example? Your df is undefined.

@gmatheus95
Author

I cannot give you the actual dataframe I'm working with since it's private data, but here's a very simple executable example that illustrates the issue.

import pandas as pd
import numpy as np
from datetime import datetime,timedelta

today = datetime.today()
x = [today] * 10000
df = pd.DataFrame({'date':x})

threshold = np.datetime64(datetime.today()+timedelta(weeks=3))

#and then the comparisons:
df[threshold < df['date']]
df[df['date'] < threshold]

Thanks.

@TomAugspurger
Contributor

TomAugspurger commented Jul 5, 2017

Seems to be related to the numpy timestamp being microsecond precision:

In [111]: np.datetime64(today + timedelta(weeks=3)) < pd.Series([today])
Out[111]:
0    True
dtype: bool

In [112]: np.datetime64(today + timedelta(weeks=3)).astype("<M8[ns]") < pd.Series([today])
Out[112]:
0    False
dtype: bool

I'm not sure what the desired outcome is here. pandas only deals with nanosecond precision timestamps, so do we silently change the precision of the input, or raise an error?

Either way, we need to fix things to be consistent between Series and DataFrame here.

TomAugspurger added this to the 0.21.0 milestone Jul 5, 2017
TomAugspurger modified the milestones: Next Major Release, 0.21.0 Jul 5, 2017
@gmatheus95
Author

I'm sorry, I don't get it. Even after adding three weeks of delta, why do I still get True in Out[111] because of timestamp precision? I'm sorry if it's a naive question, I'm not really experienced in numpy.

Thank you anyway for addressing the issue so fast!

@TomAugspurger
Contributor

I'm sorry, I don't get it. Even after adding three weeks of delta, why do I still get True in Out[111] because of timestamp precision?

Sorry if I wasn't clear, it's definitely a bug. It should be False (or maybe an exception).

Numpy stores datetimes as int64s, where the exact datetime of an integer depends on the resolution.

In [119]: np.datetime64(today).view('i8')
Out[119]: 1499262896667864

In [120]: np.datetime64(today).astype('<M8[ns]').view('i8')
Out[120]: 1499262896667864000

It's possible (haven't confirmed yet) that when you do threshold < df['date'], pandas looks at those integers without checking that they're at the same resolution. And since your threshold is in microseconds, its integer value is going to be smaller than the pandas one (which is in nanoseconds). This is just a guess though.
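To make that guess concrete, here is a rough sketch of what comparing the raw integers would look like next to a unit-aware comparison. It only illustrates the hypothesis and is not a trace of what pandas actually does internally; the fixed date and the variable names are made up for the example.

```python
import numpy as np
from datetime import datetime, timedelta

today = datetime(2017, 7, 5, 15, 0, 0)  # fixed, made-up date for reproducibility

# np.datetime64 built from a datetime object has microsecond resolution
threshold_us = np.datetime64(today + timedelta(weeks=3))
# pandas stores its datetimes at nanosecond resolution
date_ns = np.datetime64(today).astype('<M8[ns]')

# Underlying int64 counts since the epoch, each in its own unit
raw_threshold = threshold_us.view('i8')  # microseconds
raw_date = date_ns.view('i8')            # nanoseconds

# Comparing the raw integers ignores the unit mismatch: the microsecond
# count is roughly 1000x smaller, so the later date looks "earlier"
raw_compare = bool(raw_threshold < raw_date)

# Comparing the datetime64 values lets numpy align the units first
proper_compare = bool(threshold_us < date_ns)

print(raw_compare, proper_compare)
```

If the guess is right, the first result matches the surprising True in Out[111], while the second is what the comparison should return.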

@TomAugspurger
Contributor

Ah, indeed this seems to be a duplicate of #7996.

For now, you can work around this by converting threshold to a pd.Timestamp, which will ensure that you have nanosecond-precision datetimes everywhere.
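A minimal sketch of that workaround, reusing the toy frame from earlier in the thread with fewer rows. The names `later` and `earlier` are only for illustration.

```python
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

today = datetime.today()
df = pd.DataFrame({'date': [today] * 5})

# Wrap the numpy scalar in pd.Timestamp so that pandas, not the raw
# microsecond-resolution numpy value, takes part in the comparison
threshold = pd.Timestamp(np.datetime64(today + timedelta(weeks=3)))

later = df[threshold < df['date']]    # rows dated after the threshold
earlier = df[df['date'] < threshold]  # rows dated before the threshold

print(len(later), len(earlier))
```

With the threshold three weeks in the future, the first selection should be empty and the second should contain every row, regardless of which side of the operator the threshold is on.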

TomAugspurger added the Duplicate Report label Jul 5, 2017
TomAugspurger modified the milestones: No action, Next Major Release Jul 5, 2017
@gmatheus95
Author

Oh now I get it, thanks again!!
