Make datetime_to_numeric more robust to overflow errors #3642
@@ -372,43 +372,62 @@ def _datetime_nanmin(array):
 
 
 def datetime_to_numeric(array, offset=None, datetime_unit=None, dtype=float):
-    """Convert an array containing datetime-like data to an array of floats.
+    """Return an array containing datetime-like data to numerical values.
+
+    Convert the datetime array to a timedelta relative to an offset.
 
     Parameters
     ----------
-    da : np.array
-        Input data
-    offset: Scalar with the same type of array or None
-        If None, subtract minimum values to reduce round off error
-    datetime_unit: None or any of {'Y', 'M', 'W', 'D', 'h', 'm', 's', 'ms',
-        'us', 'ns', 'ps', 'fs', 'as'}
-    dtype: target dtype
+    da : array-like
+        Input data
+    offset: None, datetime or cftime.datetime
+        Datetime offset. If None, this is set by default to the array's minimum
+        value to reduce round off errors.
+    datetime_unit: {None, Y, M, W, D, h, m, s, ms, us, ns, ps, fs, as}
+        If not None, convert output to a given datetime unit. Note that some
+        conversions are not allowed due to non-linear relationships between units.
+    dtype: dtype
+        Output dtype.
 
     Returns
     -------
     array
+        Numerical representation of datetime object relative to an offset.
+
+    Notes
+    -----
+    Some datetime unit conversions won't work, for example from days to years, even
+    though some calendars would allow for them (e.g. no_leap). This is because there
+    is no `cftime.timedelta` object.
     """
     # TODO: make this function dask-compatible?
+    # Set offset to minimum if not given
     if offset is None:
         if array.dtype.kind in "Mm":
             offset = _datetime_nanmin(array)
         else:
             offset = min(array)
+
+    # Compute timedelta object.
+    # For np.datetime64, this can silently yield garbage due to overflow.
     array = array - offset
 
-    if not hasattr(array, "dtype"):  # scalar is converted to 0d-array
+    # Scalar is converted to 0d-array
+    if not hasattr(array, "dtype"):
         array = np.array(array)
 
+    # Convert timedelta objects to timedelta64[ms] dtype.
     if array.dtype.kind in "O":
-        # possibly convert object array containing datetime.timedelta
-        array = np.asarray(pd.Series(array.ravel())).reshape(array.shape)
+        array = array.astype("timedelta64[ms]")
Here we may want to use microsecond units (…).

Will do. In …

Switching to "us" breaks …

Throughout xarray, "ns" is the default resolution. I'm a bit worried that switching this to "us" in a couple of places is a recipe for future bugs.

It occurs to me that we could use nanosecond units (or units even finer) for cftime dates, so long as we do the unit conversion step after converting to floats. E.g. one could write a function like the following to aid in that:

    SUB_MICROSECOND_UNIT_CONVERSION_FACTORS = {
        "ns": 1e3,
        "ps": 1e6,
        "fs": 1e9,
        "as": 1e12
    }


    def _py_timedelta_to_float(array, datetime_unit="ns"):
        array = array.astype("timedelta64[us]")
        if datetime_unit in SUB_MICROSECOND_UNIT_CONVERSION_FACTORS:
            factor = SUB_MICROSECOND_UNIT_CONVERSION_FACTORS[datetime_unit]
            return factor * array.astype(np.float64)
        else:
            return array / np.timedelta64(1, datetime_unit)

I have not thought carefully yet about how to handle …

Here's perhaps a cleaner version:

    def _py_timedelta_to_float(array, datetime_unit="ns"):
        array = array.astype("timedelta64[us]").astype(np.float64)
        conversion_factor = np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit)
        return conversion_factor * array

Yes. Keep in mind though the default units for …

ah I see. Thanks @spencerkclark

Yes I think we should. Or we have to go through a deprecation cycle.

From a user perspective, it seems like we should allow all units that may be used for the time coordinate. I admit I don't understand the subtleties of precision though.

Thanks @dcherian -- I agree on all points. So @huard, I think it makes sense to do the following (feel free to merge your PRs or not as you see fit): …

(2) might require some careful logic, but in principle I think it would be doable. Does that make sense? Thanks already for your help in identifying/sorting all these issues out. Sorry for taking a while to think all this through.

I've merged my two PRs into #3631 and implemented the …

Thanks a lot @huard -- I'll have a closer look over the coming days.
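For concreteness, here is a small usage sketch of the `_py_timedelta_to_float` helper proposed above; the helper body is the reviewer's "cleaner version", and the example array and expected magnitudes are my own illustration.

```python
import datetime
import numpy as np


def _py_timedelta_to_float(array, datetime_unit="ns"):
    # Reviewer's sketch: go through microsecond resolution (exact for
    # datetime.timedelta), then rescale as floats to the requested unit.
    array = array.astype("timedelta64[us]").astype(np.float64)
    conversion_factor = np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit)
    return conversion_factor * array


deltas = np.array([datetime.timedelta(days=1, microseconds=5)], dtype=object)

print(_py_timedelta_to_float(deltas, "ns"))  # roughly 8.64e+13 nanoseconds
print(_py_timedelta_to_float(deltas, "h"))   # roughly 24.0 hours
```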
+    # Convert to specified timedelta units.
     if datetime_unit:
         array = array / np.timedelta64(1, datetime_unit)
 
-    # convert np.NaT to np.nan
+    # Convert np.NaT to np.nan
     if array.dtype.kind in "mM":
         return np.where(isnull(array), np.nan, array.astype(dtype))
+
     return array.astype(dtype)
Ah good catch; this is another bug (we end up with a NaN in this example): …

This is another potential plus of using a universal offset. In the case of NumPy dates, I think if we always use 1970-01-01T00:00:00 as the offset we are guaranteed this will never return garbage.

We'd also have to enforce a microsecond resolution, which could break some code.

For the issue I raised in the comment above, I think switching to 1970-01-01T00:00:00 as the offset would work for NumPy dates at nanosecond resolution (maybe there is something I'm missing?). Let me think a little more about the implications of casting the `datetime.timedelta` objects to `timedelta64[us]` objects, though. I think you're right that we should think carefully about that.
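To illustrate the overflow being discussed above (a sketch using my own example dates, not code from the PR): `datetime64[ns]` values are stored as int64 nanoseconds since 1970-01-01, so differences between values more than roughly 292 years apart overflow, while differences from the epoch itself cannot.

```python
import numpy as np

times = np.array(["1680-01-01", "2260-01-01"], dtype="datetime64[ns]")

# Subtracting the array minimum: the ~580-year span does not fit in int64
# nanoseconds, so the second difference silently wraps around into a
# nonsensical negative timedelta.
print(times - times.min())

# Subtracting the epoch instead: every representable datetime64[ns] value lies
# within ~292 years of 1970-01-01, so this difference cannot overflow.
print(times - np.datetime64("1970-01-01"))
```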