Unexpected results in Numpy datetime64 comparison within DataFrame #16831

gmatheus95 · 2017-07-05T18:20:47Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from datetime import datetime,timedelta

threshold = np.datetime64(datetime.today()+timedelta(weeks=3))
df[threshold < df['date']]
df[df['date'] < threshold]

Problem description

The two comparisons above should produce opposite results. Instead, both return the same result, as if df['date'] were always the first comparison operand.

Expected Output

The picture below illustrates the issue. The line df[threshold < df['date']] was expected to return an empty DataFrame.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-83-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.4.1
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: None
httplib2: 0.10.3
apiclient: 1.6.2
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: 2.38.0
pandas_datareader: None


cristianornelas · 2017-07-05T18:23:36Z

Same issue here.

TomAugspurger · 2017-07-05T18:28:41Z

Can you make a complete example? Your df is undefined.

gmatheus95 · 2017-07-05T18:38:15Z

I cannot give you the actual dataframe I'm working with since it's private data, but here's a very naive executable code to illustrate the issue.

import pandas as pd
import numpy as np
from datetime import datetime,timedelta

today = datetime.today()
x = [today] * 10000
df = pd.DataFrame({'date':x})

threshold = np.datetime64(datetime.today()+timedelta(weeks=3))

#and then the comparisons:
df[threshold < df['date']]
df[df['date'] < threshold]

Thanks.

TomAugspurger · 2017-07-05T18:59:18Z

Seems to be related to the numpy timestamp being microsecond precision:

In [111]: np.datetime64(today + timedelta(weeks=3)) < pd.Series([today])
Out[111]:
0    True
dtype: bool

In [112]: np.datetime64(today + timedelta(weeks=3)).astype("<M8[ns]") < pd.Series([today])
Out[112]:
0    False
dtype: bool

I'm not sure what the desired outcome is here. pandas only deals with nanosecond precision timestamps, so do we silently change the precision of the input, or raise an error?

Either way, we need to fix things to be consistent between Series and DataFrame here.

gmatheus95 · 2017-07-05T19:06:59Z

I'm sorry, I don't get it. Even after adding three weeks of delta, why do I still get True in Out[111] because of timestamp precision? I'm sorry if it's a naive question, I'm not really experienced in numpy.

Thank you anyway for addressing the issue so fast!

TomAugspurger · 2017-07-05T19:14:31Z

> I'm sorry, I don't get it. Even after adding three weeks of delta, why do I still get True in Out[111] because of timestamp precision?

Sorry if I wasn't clear, it's definitely a bug. It should be False (or maybe an exception).

NumPy stores datetimes as int64 values, and which datetime a given integer represents depends on the resolution (the unit).

In [119]: np.datetime64(today).view('i8')
Out[119]: 1499262896667864

In [120]: np.datetime64(today).astype('<M8[ns]').view('i8')
Out[120]: 1499262896667864000

It's possible (haven't confirmed yet) that when you do threshold < df['date'], pandas compares those integers without checking that they're at the same resolution. Since your threshold is in microseconds, its integer is smaller than the pandas one (which is in nanoseconds). This is just a guess, though.
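To make the guess above concrete, here is a minimal sketch using NumPy only. It does not show what pandas does internally; it only shows how comparing raw integers across units can flip the result. The fixed date and all variable names are made up for illustration.

```python
import numpy as np

# A fixed instant and one three weeks later, both at microsecond resolution.
earlier_us = np.datetime64("2017-07-05T18:20:47", "us")
later_us = earlier_us + np.timedelta64(21, "D")

# The same earlier instant at nanosecond resolution (the unit pandas uses).
earlier_ns = earlier_us.astype("datetime64[ns]")

# Raw integers: ticks since the epoch, each in its own unit.
earlier_us_raw = int(earlier_us.astype("int64"))
later_us_raw = int(later_us.astype("int64"))
earlier_ns_raw = int(earlier_ns.astype("int64"))

# Comparing raw integers ignores the units, so the later microsecond
# value looks smaller than the earlier nanosecond value.
naive_result = later_us_raw < earlier_ns_raw

# Comparing the datetime64 values lets NumPy reconcile the units first.
unit_aware_result = bool(later_us < earlier_ns)

print(naive_result, unit_aware_result)
```

The naive integer comparison claims the later date is smaller, which matches the symptom in Out[111], while the unit-aware comparison gives the expected answer.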

TomAugspurger · 2017-07-05T19:17:36Z

Ah, indeed this seems to be a duplicate of #7996.

For now, you can work around this by converting threshold to a pd.Timestamp, which will ensure that you have nanosecond-precision datetimes everywhere.
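A minimal sketch of that workaround, reusing the shape of the example from earlier in the thread. The fixed date and the five-row frame are stand-ins chosen for illustration, not the reporter's data.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# Stand-in data: every row holds the same fixed timestamp.
today = datetime(2017, 7, 5, 18, 20, 47)
df = pd.DataFrame({"date": [today] * 5})

# Build the threshold as in the report, then wrap it in pd.Timestamp
# so pandas handles the comparison with its own timestamp type.
threshold = pd.Timestamp(np.datetime64(today + timedelta(weeks=3)))

after = df[threshold < df["date"]]   # rows later than the threshold
before = df[df["date"] < threshold]  # rows earlier than the threshold

print(len(after), len(before))
```

With the threshold three weeks in the future, the first selection should be empty and the second should contain every row, so the two comparisons give opposite results as expected.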

gmatheus95 · 2017-07-05T19:40:56Z

Oh now I get it, thanks again!!

TomAugspurger added this to the 0.21.0 milestone Jul 5, 2017

TomAugspurger added the Bug, Difficulty Intermediate, and Datetime labels Jul 5, 2017

TomAugspurger modified the milestones: Next Major Release, 0.21.0 Jul 5, 2017

TomAugspurger closed this as completed Jul 5, 2017

TomAugspurger added the Duplicate Report label Jul 5, 2017

TomAugspurger modified the milestones: No action, Next Major Release Jul 5, 2017
