Make datetime_to_numeric more robust to overflow errors #3642
Conversation
…ype conversion. add overflow tests
```python
if offset is None:
    if array.dtype.kind in "Mm":
        offset = _datetime_nanmin(array)
    else:
        offset = min(array)

# Compute timedelta object.
# For np.datetime64, this can silently yield garbage due to overflow.
```
Ah good catch; this is another bug (we end up with a NaN in this example):
```python
In [1]: import numpy as np; import xarray as xr

In [2]: times = [np.datetime64('1690-01-01', 'ns'), np.datetime64('2200-01-01', 'ns')]

In [3]: da = xr.DataArray(range(2), dims=['time'], coords=[times])

In [4]: da
Out[4]:
<xarray.DataArray (time: 2)>
array([0, 1])
Coordinates:
  * time     (time) datetime64[ns] 1690-01-01 2200-01-01

In [5]: da.interp(time=['1700-01-01'])
Out[5]:
<xarray.DataArray (time: 1)>
array([nan])
Coordinates:
  * time     (time) datetime64[ns] 1700-01-01
```
This is another potential plus of using a universal offset. In the case of NumPy dates, I think if we always use 1970-01-01T00:00:00 as the offset we are guaranteed this will never return garbage.
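A minimal illustration of the overflow in plain NumPy (my own demonstration, not code from the PR):

```python
import numpy as np

times = np.array(["1690-01-01", "2200-01-01"], dtype="datetime64[ns]")

# Using the array minimum as the offset, the 510-year span exceeds the
# ~292-year range of timedelta64[ns] and silently wraps around:
print(times - times.min())

# Using 1970-01-01 as the offset, both dates are within ~292 years of
# the epoch, so the differences are representable:
print(times - np.datetime64("1970-01-01", "ns"))
```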
We'd also have to enforce a microsecond resolution, which could break some code.
For the issue I raised in the comment above, I think switching to 1970-01-01T00:00:00 as the offset would work for NumPy dates at nanosecond resolution (maybe there is something I'm missing?). Let me think a little more about the implications of casting the `datetime.timedelta` objects to `timedelta64[us]` objects, though. I think you're right that we should think carefully about that.
```python
if array.dtype.kind in "O":
    # possibly convert object array containing datetime.timedelta
    array = np.asarray(pd.Series(array.ravel())).reshape(array.shape)
    array = array.astype("timedelta64[ms]")
```
Here we may want to use microsecond units (`"timedelta64[us]"`); using millisecond resolution may lead to unnecessary truncation of precision.
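A small illustration of that truncation, assuming the pandas-backed conversion from the snippet above:

```python
import datetime
import numpy as np
import pandas as pd

td = datetime.timedelta(microseconds=1500)
array = np.asarray(pd.Series([td]))  # converted to timedelta64[ns]

print(array.astype("timedelta64[ms]"))  # [1]    -- sub-millisecond part truncated
print(array.astype("timedelta64[us]"))  # [1500] -- exact
```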
Will do. In `get_clean_interp_index`, should we also use `datetime_to_numeric` for `pd.DatetimeIndex` with the `"us"` resolution?
```python
if isinstance(index, (CFTimeIndex, pd.DatetimeIndex)):
    offset = type(index[0])(1970, 1, 1)
    index = Variable(
        data=datetime_to_numeric(index, offset=offset, datetime_unit="us"),
        dims=(dim,),
    )
```
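As an aside, the `type(index[0])(1970, 1, 1)` idiom builds the epoch in whichever date type the index holds; a quick demonstration (assuming `cftime` is installed):

```python
import pandas as pd
import xarray as xr

idx = pd.DatetimeIndex(["2000-01-01"])
print(type(idx[0])(1970, 1, 1))  # Timestamp('1970-01-01 00:00:00')

cf_idx = xr.cftime_range("2000-01-01", periods=1, calendar="noleap")
print(type(cf_idx[0])(1970, 1, 1))  # a cftime.DatetimeNoLeap at the epoch
```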
Switching to `"us"` breaks `interpolate_na` because `max_gap` is in nanoseconds by default.
Throughout xarray, `"ns"` is the default resolution. I'm a bit worried that switching this to `"us"` in a couple of places is a recipe for future bugs.
It occurs to me that we could use nanosecond units (or units even finer) for cftime dates, so long as we do the unit conversion step after converting to floats. E.g. one could write a function like the following to aid in that:
```python
SUB_MICROSECOND_UNIT_CONVERSION_FACTORS = {
    "ns": 1e3,
    "ps": 1e6,
    "fs": 1e9,
    "as": 1e12,
}

def _py_timedelta_to_float(array, datetime_unit="ns"):
    array = array.astype("timedelta64[us]")
    if datetime_unit in SUB_MICROSECOND_UNIT_CONVERSION_FACTORS:
        factor = SUB_MICROSECOND_UNIT_CONVERSION_FACTORS[datetime_unit]
        return factor * array.astype(np.float64)
    else:
        return array / np.timedelta64(1, datetime_unit)
```
I have not thought carefully yet about how to handle `max_gap` in `interpolate_na` in the case of cftime dates, but this seems like a possible compromise to allow us to continue to default to nanosecond resolution internally for these functions. @huard do you think that could help?
Here's perhaps a cleaner version:
```python
def _py_timedelta_to_float(array, datetime_unit="ns"):
    array = array.astype("timedelta64[us]").astype(np.float64)
    conversion_factor = np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit)
    return conversion_factor * array
```
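For instance, hypothetical usage of the sketch above:

```python
import numpy as np

array = np.array([np.timedelta64(1500, "us")])
print(_py_timedelta_to_float(array, "ns"))  # [1500000.]
print(_py_timedelta_to_float(array, "s"))   # [0.0015]
```

The nice property here is that dividing two `np.timedelta64` scalars yields a plain float ratio between the units, so the rescaling step happens in floating point and cannot overflow an integer.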
Yes. Keep in mind, though, that the default units for `differentiate` and `integrate` have been nanoseconds, and that is user-facing; should we preserve that? Also, should we preserve the ability to specify sub-microsecond `max_gap`'s in `interpolate_na`?
Ah, I see. Thanks @spencerkclark.

> Keep in mind though the default units for differentiate and integrate have been nanoseconds, and that is user-facing; should we preserve that?

Yes, I think we should. Otherwise we have to go through a deprecation cycle.

> Also should we preserve the ability to specify sub-microsecond max_gap's in interpolate_na?

From a user perspective, it seems like we should allow all units that may be used for the time coordinate. I admit I don't understand the subtleties of precision, though.
Thanks @dcherian -- I agree on all points.
So @huard, I think it makes sense to do the following (feel free to merge your PRs or not as you see fit):
1. Maintain the ability to output floats with nanosecond units (or finer) from `datetime_to_numeric` by converting timedelta-like objects to floats before doing the unit conversion.
2. Use a similar unit conversion approach when converting `max_gap` to a float, so that we again get the best of both worlds. I.e. continue to support nanosecond-resolution `max_gap` timedeltas, but at the same time allow for > 292-year `max_gap` timedeltas (specified, e.g., using `datetime.timedelta` objects or `np.timedelta64` objects of microsecond or coarser resolution).
(2) might require some careful logic, but in principle I think it would be doable. Does that make sense? Thanks already for your help in identifying/sorting all these issues out. Sorry for taking a while to think all this through.
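A rough sketch of what (2) could look like (the helper name and details here are hypothetical, not the PR's implementation):

```python
import datetime
import numpy as np

def _max_gap_to_float(max_gap, datetime_unit="ns"):
    # Hypothetical helper: normalize to microsecond precision first,
    # then rescale as a float. Sub-microsecond units keep working, and
    # > 292-year gaps no longer overflow, because the rescaling happens
    # in floating point rather than in int64 nanoseconds.
    if isinstance(max_gap, datetime.timedelta):
        max_gap = np.timedelta64(max_gap)
    us = max_gap.astype("timedelta64[us]").astype(np.float64)
    return us * (np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit))
```

For example, `_max_gap_to_float(datetime.timedelta(days=365 * 400), "ns")` would return a finite float even though a 400-year gap is not representable as a `timedelta64[ns]`.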
I've merged my two PRs into #3631 and implemented the `py_timedelta_to_float` function.
It seems to work except on earlier NumPy versions; I've special-cased those, since I don't think I can fix that.
Thanks a lot @huard -- I'll have a closer look over the coming days.
Merged with #3631.
- Passes `black . && mypy . && flake8`
- Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

This is likely only safe with NumPy >= 1.17, though.