Make datetime_to_numeric more robust to overflow errors #3642
Conversation
…ype conversion. add overflow tests
```python
if offset is None:
    if array.dtype.kind in "Mm":
        offset = _datetime_nanmin(array)
    else:
        offset = min(array)

# Compute timedelta object.
# For np.datetime64, this can silently yield garbage due to overflow.
```
Ah good catch; this is another bug (we end up with a NaN in this example):
```python
In [1]: import numpy as np; import xarray as xr

In [2]: times = [np.datetime64('1690-01-01', 'ns'), np.datetime64('2200-01-01', 'ns')]

In [3]: da = xr.DataArray(range(2), dims=['time'], coords=[times])

In [4]: da
Out[4]:
<xarray.DataArray (time: 2)>
array([0, 1])
Coordinates:
  * time     (time) datetime64[ns] 1690-01-01 2200-01-01

In [5]: da.interp(time=['1700-01-01'])
Out[5]:
<xarray.DataArray (time: 1)>
array([nan])
Coordinates:
  * time     (time) datetime64[ns] 1700-01-01
```
This is another potential plus of using a universal offset. In the case of NumPy dates, I think if we always use 1970-01-01T00:00:00 as the offset we are guaranteed this will never return garbage.
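A minimal illustration of the overflow in plain NumPy (my own demonstration, not code from the PR):

```python
import numpy as np

times = np.array(["1690-01-01", "2200-01-01"], dtype="datetime64[ns]")

# Using the array minimum as the offset, the 510-year span exceeds the
# ~292-year range of timedelta64[ns] and silently wraps around:
print(times - times.min())

# Using 1970-01-01 as the offset, both dates are within ~292 years of
# the epoch, so the differences are representable:
print(times - np.datetime64("1970-01-01", "ns"))
```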
We'd also have to enforce a microsecond resolution, which could break some code.
For the issue I raised in the comment above, I think switching to 1970-01-01T00:00:00 as the offset would work for NumPy dates at nanosecond resolution (maybe there is something I'm missing?). Let me think a little more about the implications of casting the `datetime.timedelta` objects to `timedelta64[us]` objects, though. I think you're right that we should think carefully about that.
```python
if array.dtype.kind in "O":
    # possibly convert object array containing datetime.timedelta
    array = np.asarray(pd.Series(array.ravel())).reshape(array.shape)
    array = array.astype("timedelta64[ms]")
```
Here we may want to use microsecond units (`"timedelta64[us]"`); using millisecond resolution may lead to unnecessary truncation of precision.
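A small illustration of that truncation, assuming the pandas-backed conversion from the snippet above:

```python
import datetime
import numpy as np
import pandas as pd

td = datetime.timedelta(microseconds=1500)
array = np.asarray(pd.Series([td]))  # converted to timedelta64[ns]

print(array.astype("timedelta64[ms]"))  # [1]    -- sub-millisecond part truncated
print(array.astype("timedelta64[us]"))  # [1500] -- exact
```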
Will do. In `get_clean_interp_index`, should we also use `datetime_to_numeric` for `pd.DatetimeIndex` with the `"us"` resolution?
```python
if isinstance(index, (CFTimeIndex, pd.DatetimeIndex)):
    offset = type(index[0])(1970, 1, 1)
    index = Variable(
        data=datetime_to_numeric(index, offset=offset, datetime_unit="us"),
        dims=(dim,),
    )
```
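As an aside, the `type(index[0])(1970, 1, 1)` idiom builds the epoch in whichever date type the index holds; a quick demonstration (assuming `cftime` is installed):

```python
import pandas as pd
import xarray as xr

idx = pd.DatetimeIndex(["2000-01-01"])
print(type(idx[0])(1970, 1, 1))  # Timestamp('1970-01-01 00:00:00')

cf_idx = xr.cftime_range("2000-01-01", periods=1, calendar="noleap")
print(type(cf_idx[0])(1970, 1, 1))  # a cftime.DatetimeNoLeap at the epoch
```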
Switching to `"us"` breaks `interpolate_na` because `max_gap` is in nanoseconds by default.
Throughout xarray, `"ns"` is the default resolution. I'm a bit worried that switching this to `"us"` in a couple of places is a recipe for future bugs.
It occurs to me that we could use nanosecond units (or units even finer) for cftime dates, so long as we do the unit conversion step after converting to floats. E.g. one could write a function like the following to aid in that:
```python
SUB_MICROSECOND_UNIT_CONVERSION_FACTORS = {
    "ns": 1e3,
    "ps": 1e6,
    "fs": 1e9,
    "as": 1e12,
}

def _py_timedelta_to_float(array, datetime_unit="ns"):
    array = array.astype("timedelta64[us]")
    if datetime_unit in SUB_MICROSECOND_UNIT_CONVERSION_FACTORS:
        factor = SUB_MICROSECOND_UNIT_CONVERSION_FACTORS[datetime_unit]
        return factor * array.astype(np.float64)
    else:
        return array / np.timedelta64(1, datetime_unit)
```
I have not thought carefully yet about how to handle `max_gap` in `interpolate_na` in the case of cftime dates, but this seems like a possible compromise to allow us to continue to default to nanosecond resolution internally for these functions. @huard do you think that could help?
Here's perhaps a cleaner version:
```python
def _py_timedelta_to_float(array, datetime_unit="ns"):
    array = array.astype("timedelta64[us]").astype(np.float64)
    conversion_factor = np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit)
    return conversion_factor * array
```
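For instance, hypothetical usage of the sketch above:

```python
import numpy as np

array = np.array([np.timedelta64(1500, "us")])
print(_py_timedelta_to_float(array, "ns"))  # [1500000.]
print(_py_timedelta_to_float(array, "s"))   # [0.0015]
```

The nice property here is that dividing two `np.timedelta64` scalars yields a plain float ratio between the units, so the rescaling step happens in floating point and cannot overflow an integer.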
Yes. Keep in mind, though, that the default units for `differentiate` and `integrate` have been nanoseconds, and that is user-facing; should we preserve that? Also, should we preserve the ability to specify sub-microsecond `max_gap`'s in `interpolate_na`?
Ah, I see. Thanks @spencerkclark.

> Keep in mind though the default units for differentiate and integrate have been nanoseconds, and that is user-facing; should we preserve that?

Yes, I think we should. Otherwise we have to go through a deprecation cycle.

> Also should we preserve the ability to specify sub-microsecond max_gap's in interpolate_na?

From a user perspective, it seems like we should allow all units that may be used for the time coordinate. I admit I don't understand the subtleties of precision, though.
Thanks @dcherian -- I agree on all points.
So @huard, I think it makes sense to do the following (feel free to merge your PRs or not as you see fit):
1. Maintain the ability to output floats with nanosecond units (or finer) from `datetime_to_numeric` by converting timedelta-like objects to floats before doing the unit conversion.
2. Use a similar unit conversion approach when converting `max_gap` to a float, so that we again get the best of both worlds. I.e. continue to support nanosecond-resolution `max_gap` timedeltas, but at the same time allow for > 292-year `max_gap` timedeltas (specified, e.g., using `datetime.timedelta` objects or `np.timedelta64` objects of microsecond or coarser resolution).
(2) might require some careful logic, but in principle I think it would be doable. Does that make sense? Thanks already for your help in identifying/sorting all these issues out. Sorry for taking a while to think all this through.
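A rough sketch of what (2) could look like (the helper name and details here are hypothetical, not the PR's implementation):

```python
import datetime
import numpy as np

def _max_gap_to_float(max_gap, datetime_unit="ns"):
    # Hypothetical helper: normalize to microsecond precision first,
    # then rescale as a float. Sub-microsecond units keep working, and
    # > 292-year gaps no longer overflow, because the rescaling happens
    # in floating point rather than in int64 nanoseconds.
    if isinstance(max_gap, datetime.timedelta):
        max_gap = np.timedelta64(max_gap)
    us = max_gap.astype("timedelta64[us]").astype(np.float64)
    return us * (np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit))
```

For example, `_max_gap_to_float(datetime.timedelta(days=365 * 400), "ns")` would return a finite float even though a 400-year gap is not representable as a `timedelta64[ns]`.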
I've merged my two PRs into #3631 and implemented the `py_timedelta_to_float` function.
It seems to work except on earlier NumPy versions; I've special-cased those, since I don't think I can fix that.
Thanks a lot @huard -- I'll have a closer look over the coming days.
Merged with #3631.
- Passes `black . && mypy . && flake8`
- Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

This is likely only safe with NumPy >= 1.17, though.