Make datetime_to_numeric more robust to overflow errors #3642

Closed

huard wants to merge 2 commits

Conversation

@huard (Contributor) commented on Dec 18, 2019:

This is likely only safe with NumPy>=1.17 though.

    if offset is None:
        if array.dtype.kind in "Mm":
            offset = _datetime_nanmin(array)
        else:
            offset = min(array)

    # Compute timedelta object.
    # For np.datetime64, this can silently yield garbage due to overflow.
spencerkclark (Member):

Ah good catch; this is another bug (we end up with a NaN in this example):

    In [1]: import numpy as np; import xarray as xr

    In [2]: times = [np.datetime64('1690-01-01', 'ns'), np.datetime64('2200-01-01', 'ns')]

    In [3]: da = xr.DataArray(range(2), dims=['time'], coords=[times])

    In [4]: da
    Out[4]:
    <xarray.DataArray (time: 2)>
    array([0, 1])
    Coordinates:
      * time     (time) datetime64[ns] 1690-01-01 2200-01-01

    In [5]: da.interp(time=['1700-01-01'])
    Out[5]:
    <xarray.DataArray (time: 1)>
    array([nan])
    Coordinates:
      * time     (time) datetime64[ns] 1700-01-01

This is another potential plus of using a universal offset. In the case of NumPy dates, I think if we always use 1970-01-01T00:00:00 as the offset we are guaranteed this will never return garbage.
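A minimal illustration of both the failure mode and the epoch-anchored fix:

    import numpy as np

    # timedelta64[ns] values are int64 nanoseconds, so spans beyond ~292 years
    # silently wrap around instead of raising:
    a = np.datetime64("1690-01-01", "ns")
    b = np.datetime64("2200-01-01", "ns")
    print(b - a)  # garbage: the ~510-year span overflows int64 nanoseconds

    # Anchored at the epoch, each delta stays in range, because datetime64[ns]
    # can only represent dates within ~292 years of 1970-01-01 to begin with:
    epoch = np.datetime64("1970-01-01", "ns")
    print(a - epoch, b - epoch)  # both representable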

huard (Contributor Author):

We'd also have to enforce a microsecond resolution, which could break some code.
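For context, Python's datetime.timedelta itself only resolves microseconds, so anything routed through it is capped at "us" precision:

    import datetime
    import numpy as np

    # datetime.timedelta stores days/seconds/microseconds internally, so
    # NumPy coerces it to microsecond resolution:
    print(np.timedelta64(datetime.timedelta(microseconds=1)))  # 1 microseconds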

spencerkclark (Member):

For the issue I raised in the comment above, I think switching to 1970-01-01T00:00:00 as the offset would work for NumPy dates at nanosecond resolution (maybe there is something I'm missing?). Let me think a little more about the implications of casting the datetime.timedelta objects to timedelta64[us] objects, though. I think you're right that we should think carefully about that.

if array.dtype.kind in "O":
# possibly convert object array containing datetime.timedelta
array = np.asarray(pd.Series(array.ravel())).reshape(array.shape)
array = array.astype("timedelta64[ms]")
spencerkclark (Member):

Here we may want to use microsecond units ("timedelta64[us]"); using millisecond resolution may lead to unnecessary truncation of precision.
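For example, a datetime.timedelta with sub-millisecond content loses precision at "ms" but survives at "us":

    import datetime
    import numpy as np

    td = np.timedelta64(datetime.timedelta(microseconds=1500))
    print(td.astype("timedelta64[ms]"))  # 1 milliseconds (truncated)
    print(td.astype("timedelta64[us]"))  # 1500 microseconds (exact)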

huard (Contributor Author):

Will do. In get_clean_interp_index, should we also use datetime_to_numeric for pd.DatetimeIndex with the "us" resolution?

    if isinstance(index, (CFTimeIndex, pd.DatetimeIndex)):
        offset = type(index[0])(1970, 1, 1)
        index = Variable(
            data=datetime_to_numeric(index, offset=offset, datetime_unit="us"),
            dims=(dim,),
        )

huard (Contributor Author):

Switching to "us" breaks interpolate_na because max_gap is ns by default.

Throughout xarray, "ns" is the default resolution. I'm a bit worried that switching this for "us" in a couple of places it a recipe for future bugs.
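For example, a sub-microsecond max_gap would silently truncate to zero at "us" resolution:

    import numpy as np

    print(np.timedelta64(500, "ns").astype("timedelta64[us]"))  # 0 microseconds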

spencerkclark (Member):

It occurs to me that we could use nanosecond units (or even finer units) for cftime dates, so long as we do the unit-conversion step after converting to floats. E.g., one could write a function like the following to aid in that:

    import numpy as np

    # Factors to scale float microseconds up to sub-microsecond units.
    SUB_MICROSECOND_UNIT_CONVERSION_FACTORS = {
        "ns": 1e3,
        "ps": 1e6,
        "fs": 1e9,
        "as": 1e12,
    }

    def _py_timedelta_to_float(array, datetime_unit="ns"):
        # Cast to microseconds first (exact for datetime.timedelta objects),
        # then convert to float so finer output units cannot overflow int64.
        array = array.astype("timedelta64[us]")
        if datetime_unit in SUB_MICROSECOND_UNIT_CONVERSION_FACTORS:
            factor = SUB_MICROSECOND_UNIT_CONVERSION_FACTORS[datetime_unit]
            return factor * array.astype(np.float64)
        else:
            return array / np.timedelta64(1, datetime_unit)
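For instance, an object array of datetime.timedelta spanning well over 292 years still converts cleanly to float nanoseconds, because the cast to float happens at microsecond resolution:

    import datetime

    deltas = np.array([datetime.timedelta(0), datetime.timedelta(days=300 * 365)])
    print(_py_timedelta_to_float(deltas, "ns"))  # ~[0.0e+00 9.4608e+18], no overflow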

I have not thought carefully yet about how to handle max_gap in interpolate_na in the case of cftime dates, but this seems like a possible compromise to allow us to continue to default to nanosecond resolution internally for these functions. @huard do you think that could help?

spencerkclark (Member):

Here's perhaps a cleaner version:

    def _py_timedelta_to_float(array, datetime_unit="ns"):
        array = array.astype("timedelta64[us]").astype(np.float64)
        conversion_factor = np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit)
        return conversion_factor * array
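The conversion factor here is an ordinary float (e.g. 1000.0 for nanosecond output), so this agrees with the dict-based version above:

    import numpy as np

    # Dividing two timedelta64 values yields a plain float ratio:
    print(np.timedelta64(1, "us") / np.timedelta64(1, "ns"))  # 1000.0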

spencerkclark (Member):

Yes. Keep in mind, though, that the default units for differentiate and integrate have been nanoseconds, and that is user-facing; should we preserve that? Also, should we preserve the ability to specify sub-microsecond max_gap values in interpolate_na?

dcherian (Contributor):

Ah, I see. Thanks @spencerkclark.

> Keep in mind, though, that the default units for differentiate and integrate have been nanoseconds, and that is user-facing; should we preserve that?

Yes, I think we should; otherwise we have to go through a deprecation cycle.

> Also, should we preserve the ability to specify sub-microsecond max_gap values in interpolate_na?

From a user's perspective, it seems like we should allow all units that may be used for the time coordinate. I admit I don't understand the subtleties of precision, though.

spencerkclark (Member):

Thanks @dcherian -- I agree on all points.

So @huard, I think it makes sense to do the following (feel free to merge your PRs or not as you see fit):

  1. Maintain the ability to output floats with nanosecond units (or finer) from datetime_to_numeric by converting timedelta-like objects to floats before doing the unit conversion.
  2. Use a similar unit-conversion approach when converting max_gap to a float, so that we again get the best of both worlds: continue to support nanosecond-resolution max_gap timedeltas while also allowing > 292-year max_gap timedeltas (specified, e.g., using datetime.timedelta objects or np.timedelta64 objects of microsecond or coarser resolution).

(2) might require some careful logic, but in principle I think it would be doable; a rough sketch follows below. Does that make sense? Thanks for your help in identifying and sorting out all these issues. Sorry for taking a while to think this all through.
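For (2), here is a rough sketch of the kind of conversion I have in mind (the helper name is hypothetical, not existing xarray API):

    import datetime
    import numpy as np

    def _timedelta_to_float(value, datetime_unit="ns"):
        # Hypothetical sketch: cast to microsecond precision first, so that
        # spans longer than ~292 years never pass through int64 nanoseconds,
        # then rescale to the requested unit as a float.
        value = np.timedelta64(value).astype("timedelta64[us]").astype(np.float64)
        return value * (np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit))

    # A 500-year max_gap is not representable as timedelta64[ns], but converts
    # cleanly here (~1.5768e19 float nanoseconds):
    print(_timedelta_to_float(datetime.timedelta(days=500 * 365), "ns"))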

huard (Contributor Author):

I've merged my two PRs into #3631 and implemented the py_timedelta_to_float function. It seems to work except on earlier NumPy versions, which I've special-cased since I don't think I can fix them.

spencerkclark (Member):

Thanks a lot @huard -- I'll have a closer look over the coming days.

@dcherian mentioned this pull request on Jan 17, 2020.
@huard (Contributor Author) commented on Jan 20, 2020:

Merged with #3631.

@huard closed this on Jan 20, 2020.
Linked issue: interp with long cftime coordinates raises an error