Don't cast NaN to integer #7098

Closed · wants to merge 1 commit
8 changes: 1 addition & 7 deletions xarray/coding/times.py
@@ -237,17 +237,11 @@ def _decode_datetime_with_pandas(flat_num_dates, units, calendar):
 if flat_num_dates.dtype.kind in "iu":
     flat_num_dates = flat_num_dates.astype(np.int64)

-    # Cast input ordinals to integers of nanoseconds because pd.to_timedelta
-    # works much faster when dealing with integers (GH 1399).
-    flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
-        np.int64
-    )
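For context on why the deleted cast is problematic: converting NaN to int64 is undefined behavior at the C level, and NumPy silently produces a sentinel integer instead of raising. A minimal standalone demonstration (not part of the diff):

```python
import warnings

import numpy as np

flat_num_dates = np.array([0.0, 60.0, np.nan])

# Casting NaN to int64 is undefined; recent NumPy versions emit a
# RuntimeWarning and produce an arbitrary integer (typically INT64_MIN),
# silently corrupting the missing value.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    as_int = flat_num_dates.astype(np.int64)

print(as_int.dtype)  # int64 -- the NaN is gone, replaced by a garbage value
```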
Comment on lines -240 to -244

Member commented:
Could we maybe preserve this optimization (#1399) in the case that NaNs are not present in flat_num_dates?

Contributor commented:
Are there any ASV benchmarks for this, so we can see whether pd.to_timedelta is still slow with non-integer inputs?

Member (@spencerkclark, May 2, 2023) commented:

True, the issue that motivated that PR is now quite old. It would be good to revisit this. If we use the test cases in the notebook linked to in #1399, we do see that pd.to_timedelta has improved a lot in this regard, though still trails the current approach by about a factor of four:

In [1]: import numpy as np; import pandas as pd

In [2]: t_minutes = np.arange(1.0,100000.0, 0.13, dtype=np.float64)

In [3]: %%timeit
   ...: pd.to_timedelta(t_minutes, unit='m')
   ...:
   ...:
10 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
   ...: pd.to_timedelta((t_minutes * 60 * 1e9).astype(np.int64), unit='ns')
   ...:
   ...:
2.26 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: pd.__version__
Out[5]: '2.0.0'

In [6]: np.__version__
Out[6]: '1.24.3'

Previous timings were 5.2 seconds and 10.6 milliseconds, respectively.
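One way to preserve the optimization only when it is safe, as suggested above, would be to branch on the presence of NaNs. A rough sketch, standalone for illustration (the helper name and `ns_per_unit` argument are hypothetical; xarray's actual code uses `_NS_PER_TIME_DELTA[delta]`):

```python
import numpy as np
import pandas as pd


def to_timedelta_maybe_fast(flat_num_dates, ns_per_unit):
    """Hypothetical helper: keep the int64 fast path (GH 1399) when safe."""
    if np.isnan(flat_num_dates).any():
        # NaNs would be corrupted by an int64 cast; let pandas map them to NaT.
        return pd.to_timedelta(flat_num_dates * ns_per_unit, unit="ns")
    # No NaNs present: the faster integer path is safe.
    return pd.to_timedelta(
        (flat_num_dates * ns_per_unit).astype(np.int64), unit="ns"
    )


# Minutes expressed as nanoseconds-per-unit.
with_nan = to_timedelta_maybe_fast(np.array([1.0, np.nan]), 60 * 10**9)
without_nan = to_timedelta_maybe_fast(np.array([1.0, 2.0]), 60 * 10**9)
```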

Contributor commented:

I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?

Currently, if times are in numpy datetime64 representation and contain NaT, that NaT has a well-defined int64 representation. We would only need to skip CFMaskCoder and do the masking in CFDatetimeCoder, at least for those cases. Does that make sense?
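For reference, the int64 representation of NaT mentioned here can be checked directly; it is the minimum int64 value:

```python
import numpy as np

nat = np.datetime64("NaT", "ns")
# NaT's internal sentinel is the minimum int64 value.
nat_as_int = nat.astype(np.int64)
print(nat_as_int == np.iinfo(np.int64).min)  # True
```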

Member commented:

> I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?

This is generally up to how xarray, the user, or whoever created the data we are reading in, configured the encoding. Some files indeed do have times encoded as floats with a fill value of NaN. Some of those files were created by xarray (maybe we have some control over this, but not over past decisions); some may have been created with other tools (unfortunately we have no control over this).

I think one could argue that xarray should not create these files automatically (as you've noted, it currently does if NaT is present in the data, which I agree should be fixed), but I'm not sure how we would guard against the case where someone explicitly sets the units encoding of the times in a way that requires floating point values (raising feels a little extreme).

Contributor commented:

@spencerkclark I very much agree, especially on the point about raising an error. A warning would be sufficient.

My real concern here, and this should really be fixed, is that on decoding, any time array (even one we receive as int64) with an associated _FillValue will be transformed to floating point in CFMaskCoder. I think it would be better to open a new issue for that.
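The upcast described here is easy to reproduce outside xarray: masking an integer array with NaN forces a float64 result, since int64 has no NaN. A minimal sketch of the pattern (the fill value is arbitrary):

```python
import numpy as np

times = np.array([0, 60, -999], dtype=np.int64)  # -999 plays the _FillValue role
masked = np.where(times == -999, np.nan, times)

# The int64 input has been upcast to float64 to accommodate NaN.
print(masked.dtype)  # float64
```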

Member commented:

Ah I see what you mean. I agree it would be good to open a new issue for that.

Contributor commented:

In my oceanographic experience, units of "days since X" with floating point time values are very common; Argo comes immediately to mind. Fill values are encountered when the time data is not part of a dimension/coordinate variable, which is also considered valid, especially in observational data.
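With the change in this PR, floating point ordinals like these decode naturally, since pd.to_timedelta maps NaN to NaT. A standalone sketch of the new code path (the "D" unit and reference date are illustrative, standing in for the decoded CF units):

```python
import numpy as np
import pandas as pd

flat_num_dates = np.array([0.5, 1.0, np.nan])  # "days since 2000-01-01"
ref_date = pd.Timestamp("2000-01-01")

# Mirrors the new return expression: NaN becomes NaT rather than a bogus date.
decoded = (pd.to_timedelta(flat_num_dates, "D") + ref_date).values
print(decoded)
```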


# Use pd.to_timedelta to safely cast integer values to timedeltas,
# and add those to a Timestamp to safely produce a DatetimeIndex. This
# ensures that we do not encounter integer overflow at any point in the
# process without raising OutOfBoundsDatetime.
-    return (pd.to_timedelta(flat_num_dates_ns_int, "ns") + ref_date).values
+    return (pd.to_timedelta(flat_num_dates, delta) + ref_date).values


def decode_cf_datetime(num_dates, units, calendar=None, use_cftime=None):