BUG: nonexistent Timestamp pre-summer/winter DST w/dateutil timezone #31155

Merged: 26 commits into pandas-dev:master, Jan 24, 2020

Conversation

AlexKirko (Member) opened this pull request:

This PR adds rounding down to microsecond precision in Timedelta.total_seconds(). Without that rounding, dateutil.tz.tzinfo.utcoffset and dst broke for timestamps less than 128 nanoseconds before a winter/summer DST switch. The cause was an unintended cast to float in Timedelta.total_seconds(), compounded by Timedelta.value being an np.int64, which doesn't support arbitrary-precision arithmetic. The resulting loss of precision rounded the value up, so the seconds-since-epoch figure implied a time on the DST side of the transition even though Timedelta.value had not yet reached it.

Details and code examples below.

This was quite a journey, but I found out what's going on. Let's say we have a dateutil.tz.tzinfo object named du_tz and want to find out the DST-aware UTC offset.

  1. We call du_tz.utcoffset(dt), which calls du_tz._find_ttinfo(dt), which in turn calls du_tz._resolve_ambiguous_time(dt) to find the index of the last transition; that index is then used to return the correct offset.
  2. du_tz._resolve_ambiguous_time(dt) calls du_tz._find_last_transition(dt), which calls dateutil's _datetime_to_timestamp(dt) function.
  3. Here is what that function does:
def _datetime_to_timestamp(dt):
    """
    Convert a :class:`datetime.datetime` object to an epoch timestamp in
    seconds since January 1, 1970, ignoring the time zone.
    """
    return (dt.replace(tzinfo=None) - EPOCH).total_seconds()

The problem is dateutil's reliance on Timedelta.total_seconds, which casts to float:

    def total_seconds(self):
        """
        Total duration of timedelta in seconds (to ns precision).
        """
        return self.value / 1e9
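
To see the precision limit concretely, here is a small, self-contained check (plain Python, no pandas; the constant is the delta.value from the demonstration below). Near 1.55e18, adjacent float64 values are 256 apart, so an integer nanosecond count up to 128 ns below a whole second can round up onto it:

    # naive wall-clock nanoseconds since epoch, 128 ns before the
    # 2019-03-10 02:00 US/Pacific spring-forward transition
    value = 1552183199999999872
    # the int -> float conversion rounds to the nearest representable float64
    # (spacing 256 here), and the division then lands exactly on the transition
    print(value / 1e9)                  # 1552183200.0
    print(value / 1e9 == 1552183200.0)  # True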

Demonstration:

IN:
import datetime
import pandas as pd

epoch =  1552211999999999872
ts = pd.Timestamp(epoch, tz='dateutil/US/Pacific')

EPOCH = datetime.datetime.utcfromtimestamp(0)

delta = ts.replace(tzinfo=None) - EPOCH
print(delta.value)
OUT:
1552183199999999872
IN:
print(delta.total_seconds())
OUT:
1552183200.0
IN:
print(ts.tz.dst(ts))
OUT:
datetime.timedelta(seconds=3600)

The same precision loss happens with a pytz timezone, but pytz uses a different mechanism to look up DST transitions, so it still returns the correct offset:

IN:
import datetime
import pandas as pd

epoch =  1552211999999999872
ts = pd.Timestamp(epoch, tz='US/Pacific')

EPOCH = datetime.datetime.utcfromtimestamp(0)

delta = ts.replace(tzinfo=None) - EPOCH
print(delta.value)
OUT:
1552183199999999872
IN:
print(delta.total_seconds())
OUT:
1552183200.0
IN:
print(ts.tz.dst(ts))
OUT:
datetime.timedelta(0)

So the Timedelta value is fine, but Timedelta.total_seconds() hits the precision limit of float64, and with a dateutil timezone that rounding yields an incorrect DST offset.
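
For reference, a minimal sketch of the rounding-down approach described at the top of this PR (illustrative only; the exact merged implementation may differ):

    def total_seconds(self):
        """
        Total duration of timedelta in seconds (to us precision).
        """
        # Truncate to whole microseconds before the float division. The
        # truncated value sits at least 1000 ns below any whole second it has
        # not yet reached, which is more than the worst-case 128 ns error of
        # the int64 -> float64 conversion at these magnitudes, so the result
        # can no longer round up past a DST transition second.
        return (self.value - self.value % 1000) / 1e9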

@jreback added the Datetime, Timezones, and Bug labels on Jan 20, 2020
@jreback added this to the 1.1 milestone on Jan 20, 2020

jreback (Contributor) left a comment:
lgtm. ping on green.

@AlexKirko force-pushed the fix-nonexistent-time branch from cee72ed to c9a87bd on January 20, 2020.
@@ -1092,3 +1092,15 @@ def test_constructor_ambigous_dst():
    expected = ts.value
    result = Timestamp(ts).value
    assert result == expected


def test_constructor_before_dst_switch():
@mroeschke (Member) commented on the diff:
Can we test the example in the issue here? (i.e. that Timestamp(Timestamp(epoch_time, tz=..)) doesn't change value)

AlexKirko (Member, Author) replied:
@mroeschke Sure!
But the value breaks only 1 nanosecond before the switch. Currently, I'm struggling to make everything else work on every Azure pipeline (see the comment below).

AlexKirko (Member, Author) replied:
@mroeschke
Added to the test. Pinging you on change.

AlexKirko (Member, Author) commented on Jan 21, 2020:

@jreback
Okay, I found why this breaks on minimum versions and macOS. Both specify python-dateutil == 2.6.1.
Before version 2.7.0, dateutil didn't call total_seconds() on the timedelta instance itself. Instead, it used this:

def _total_seconds(td):
    # Python 2.6 doesn't have a total_seconds() method on timedelta objects
    return ((td.seconds + td.days * 86400) * 1000000 +
            td.microseconds) // 1000000


_total_seconds = getattr(timedelta, 'total_seconds', _total_seconds)

They had this for Python 2.6 datetime support and dropped it in a later dateutil PR.
This implementation grabs datetime.timedelta.total_seconds() with getattr, and we can't influence this from the pandas side.
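
A toy illustration (not pandas or dateutil code) of why this can't be worked around from the pandas side: once the method is captured from the datetime.timedelta class itself, calling it bypasses any override defined on a subclass such as pandas.Timedelta:

    import datetime

    # what dateutil < 2.7.0 effectively does at import time
    _total_seconds = getattr(datetime.timedelta, "total_seconds")

    class PatchedTimedelta(datetime.timedelta):
        def total_seconds(self):
            # stand-in for a fixed, higher-precision override
            return 42.0

    td = PatchedTimedelta(seconds=1)
    print(td.total_seconds())   # 42.0 -- the subclass override is used
    print(_total_seconds(td))   # 1.0  -- the captured base-class method ignores it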

I set up a Python 3.6 environment and tested this hypothesis. The tests fail with dateutil 2.6.1 and pass with dateutil 2.7.0.

Basically, I want to know how I should proceed.
I don't see a way to work around this without overriding dateutil's tzinfo methods, and this seems like a horrible idea.
Or we could change our minimum supported version of dateutil to 2.7.0 if that's acceptable. We don't support Python 2 in pandas, so there is really no need to support a version of dateutil this low.
I wonder why we specify version 2.6.1 for macOS though.

jreback (Contributor) commented on Jan 21, 2020:

So just change the test to have different cases for >= 2.7 and < 2.7; we already moved to 2.6.1 as the min (for 1.0.0).

@@ -2,10 +2,12 @@
Tests for DatetimeIndex timezone-related methods
"""
from datetime import date, datetime, time, timedelta, tzinfo
from distutils.version import LooseVersion
AlexKirko (Member, Author) commented on Jan 21, 2020:
Need to parse version properly, so have to import this.

@AlexKirko requested a review from jreback on January 21, 2020.
    ts = Timestamp(epoch, tz="dateutil/US/Pacific")
    result = ts.tz.dst(ts)
    expected = timedelta(seconds=0)
    assert Timestamp(ts).value == epoch
AlexKirko (Member, Author) commented on Jan 22, 2020:
@mroeschke
This tests that the value no longer shifts when we call the constructor again. In the issue, this failed for epoch = 1552211999999999999

AlexKirko (Member, Author) commented on Jan 22, 2020:

Because dateutil.__version__ breaks the mypy check, I'm using pandas.compat._optional._get_version. Seems like it was written precisely for something like this, and it's certainly a better idea than importing pkg_resources.get_distribution.
Once the numpy dev PR is merged, I'll merge master into my branch, and we should be green.
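
A sketch of the kind of version gate being discussed in the last two comments (illustrative only, not the merged test code; it assumes _get_version accepts the imported module and returns its version string):

    from distutils.version import LooseVersion

    import dateutil

    from pandas.compat._optional import _get_version

    # dateutil < 2.7.0 calls datetime.timedelta.total_seconds directly, so the
    # rounding fix in pandas' Timedelta.total_seconds has no effect there; the
    # test asserts the pre-fix or post-fix expectation accordingly
    dateutil_has_fix = LooseVersion(_get_version(dateutil)) >= LooseVersion("2.7.0")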

AlexKirko (Member, Author):
@jreback
All green, please review.

@jreback merged commit 3b2c8f6 into pandas-dev:master on Jan 24, 2020.
jreback (Contributor) commented on Jan 24, 2020:

thanks @AlexKirko very nice

@AlexKirko deleted the fix-nonexistent-time branch on January 24, 2020.
    # Make sure that calling Timestamp constructor
    # on time just before DST switch doesn't lead to
    # nonexistent time or value change
    # Works only with dateutil >= 2.7.0 as dateutil overrid
A reviewer (Contributor) commented:
s/overrid/overrode

    # nonexistent time or value change
    # Works only with dateutil >= 2.7.0 as dateutil overrid
    # pandas.Timedelta.total_seconds with
    # datetime.timedelta.total_seconds before
A reviewer (Contributor) commented:
This does not make sense to me - datetime.timedelta.total_seconds should succeed, because it's equivalent to:

def total_seconds(td):
  useconds = td.days * 86400
  useconds += td.seconds 
  useconds *= 1000000
  useconds += td.microseconds
  return useconds / 1e6

I think there's actually a deeper issue here, which is that td.microseconds and td.seconds are rounded rather than truncated. Consider this:

def to_ns(td):
  ns = td.days * 86400
  ns += td.seconds
  ns *= 1000000
  ns += td.microseconds
  ns *= 1000
  ns += td.nanoseconds
  return ns

td = Timedelta(1552211999999999872, unit="ns")
print(td.value)  # 1552211999999999872
print(to_ns(td))  # 1552212000000000872

That seems to be the actual root cause of this issue and should probably be fixed.
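
For comparison, a sketch of extracting the components by pure integer truncation (illustrative; to_ns above reassembles them in exactly this order), which round-trips the nanosecond value without error:

    value = 1552211999999999872          # nanoseconds
    secs, ns_rem = divmod(value, 1_000_000_000)
    us, ns = divmod(ns_rem, 1_000)
    days, secs = divmod(secs, 86_400)
    # days=17965, secs=35999, us=999999, ns=872
    assert ((days * 86_400 + secs) * 1_000_000 + us) * 1_000 + ns == value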

    # Works only with dateutil >= 2.7.0 as dateutil overrid
    # pandas.Timedelta.total_seconds with
    # datetime.timedelta.total_seconds before
    ts = Timestamp(epoch, tz="dateutil/US/Pacific")
A reviewer (Contributor) commented:
Forgot to mention earlier: you should probably use dateutil/America/Los_Angeles, as that is the canonical name for this zone. The US/... zones are symlinks for backwards compatibility.
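
For example, the construction in the test would become (illustrative only):

    ts = Timestamp(epoch, tz="dateutil/America/Los_Angeles")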

Linked issue (may be closed by merging): BUG: nonexistent Timestamp pre-summer/winter DST change with dateutil timezone