
BUG: nonexistent Timestamp pre-summer/winter DST change with dateutil timezone #31043

Closed
AlexKirko opened this issue Jan 15, 2020 · 4 comments · Fixed by #31155
Labels: Bug, Timezones

@AlexKirko (Member)

Code Sample, a copy-pastable example if possible

This is fine:

>>> pd.__version__
'0.26.0.dev0+1790.gdd94e0db9'
>>> epoch =  1552211999999999871
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 01:59:59.999999871-0800', tz='dateutil/US/Pacific')
>>> t.value
1552211999999999871
>>> pd.Timestamp(t)
Timestamp('2019-03-10 01:59:59.999999871-0800', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552211999999999871

This is also fine:

>>> epoch =  1552212000000000000
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 03:00:00-0700', tz='dateutil/US/Pacific')
>>> t.value
1552212000000000000
>>> pd.Timestamp(t)
Timestamp('2019-03-10 03:00:00-0700', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552212000000000000

Meanwhile, this breaks the representation and gives us a nonexistent time:

>>> epoch =  1552211999999999872
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 01:59:59.999999872-0700', tz='dateutil/US/Pacific')
>>> t.value
1552211999999999872
>>> pd.Timestamp(t)
Timestamp('2019-03-10 01:59:59.999999872-0800', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552208399999999872

And right on the cusp, the value breaks too:

>>> epoch =  1552211999999999999
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 01:59:59.999999999-0700', tz='dateutil/US/Pacific')
>>> t.value
1552211999999999999
>>> pd.Timestamp(t)
Timestamp('2019-03-10 01:59:59.999999999-0800', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552208399999999999

Problem description

When we use dateutil timezones and create a Timestamp right on the cusp of the winter-to-summer DST transition, we can get nonexistent times (the clock is supposed to jump from 2 A.M. straight to 3 A.M., and yet we get 2:59:59).

I've investigated this, and it appears that starting 128 nanoseconds before the clock jump, the DST offset and utcoffset that dateutil reports already change to their post-jump values. We end up in a situation where the offsets are what they are supposed to be after the jump, but the wall time hasn't jumped yet, so the constructor returns a nonexistent time. Calling the constructor again on the result moves the clock back 1 hour.

This can be checked out with:

>>> epoch =  1552211999999999872
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t.tz.dst(t)
datetime.timedelta(seconds=3600)
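
For completeness, the utcoffset reported for the same instant should have flipped to the post-transition value as well (expected output under the same environment; PDT is UTC-7, i.e. -25200 seconds):

>>> t.tz.utcoffset(t)
datetime.timedelta(days=-1, seconds=61200)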

My assumption is that when we determine the UTC offset, rounding happens at some point: we round to epoch=1552212000000000000, fetch the offset for that instant, and then apply it to a time that is still pre-jump.
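
This hypothesis is easy to check in plain Python, without pandas involved: converting the integer epoch to a float (as an intermediate double conversion would) rounds it up across the second boundary, while one nanosecond earlier rounds down:

>>> epoch = 1552211999999999872
>>> int(float(epoch))        # rounds up to the transition instant
1552212000000000000
>>> int(float(epoch - 1))    # rounds down to the previous representable double
1552211999999999744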

I'd like to try to fix this one.

Expected Output

>>> epoch =  1552211999999999872
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 01:59:59.999999872-0800', tz='dateutil/US/Pacific')
>>> t.value
1552211999999999872
>>> pd.Timestamp(t)
Timestamp('2019-03-10 01:59:59.999999872-0800', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552211999999999872

Notes

This was thought to be part of #24329 but turned out to be a separate bug, as I discovered while working on closing that issue in PR #30995.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : dd94e0d
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+1790.gdd94e0db9
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 5.1.5
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.2
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0

@jreback (Contributor) commented Jan 15, 2020

cc @pganssle

@pganssle (Contributor)

I am not sure why the same thing is not happening with pytz - it's probably just going through a different code path. I believe what's happening here is that at some point in the process pandas (or, less likely, dateutil) is converting the integer timestamp into a float before truncating to the nearest microsecond, which is why the transition point happens at 1552211999999999872. Note:

>>> print(f"{float(1552211999999999872):0.0f}")
1552212000000000000
>>> print(f"{float(1552211999999999871):0.0f}") 
1552211999999999744

It seems that in that region of the number line, doubles are spaced 256 apart, and the behavior for converting an integer that is not exactly representable into a float is to round to the nearest representable value (which in this case crosses the second boundary).

Very interesting situation. I think that if you track down where the int-to-double conversion is happening and prevent it, or otherwise make sure it doesn't round up, you should be fine. Presumably it's in one of the dateutil-specific code branches. The current version of dateutil.tz.tz does not contain any instances of float or any division operations (plus, it wouldn't have any use for a datetime specified in nanoseconds-since-epoch anyway), so the conversion is almost certainly happening in pandas.
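
The 256 spacing is easy to confirm directly (an illustrative check, not part of the original comment; math.ulp requires Python 3.9+):

>>> import math
>>> math.ulp(1552212000000000000.0)   # gap to the next representable double
256.0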

@AlexKirko (Member, Author)

@pganssle Thanks, this narrows it down a lot. I'll investigate this issue further and try to find where the rounding up happens.

@AlexKirko (Member, Author)

take
