decode_cf_datetime() slow because pd.to_timedelta() is slow if floats are passed #1399
Hi Christian!
This sounds much less error-prone to me. In particular, I am getting a bit nervous when I hear "nanoseconds" ;-) (see #789)
Good catch! We should definitely speed this up.
Yes, very much agreed. For units such as months or years, we are already giving the wrong result when we use pandas:
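The inline example from this comment did not survive the export; below is a minimal reconstruction of the point. The average-month length is an assumption (it matches what old pandas versions used for `unit="M"`):

```python
import pandas as pd

# A "month" has no fixed length, so decoding "months since ..." with any
# fixed timedelta gives a wrong calendar date.
ref = pd.Timestamp("2000-01-01")

# Fixed-length month (~30.44 days, the average Gregorian month):
print(ref + pd.Timedelta(days=30.436875))  # 2000-01-31 10:29:06
# Calendar-aware month arithmetic:
print(ref + pd.DateOffset(months=1))       # 2000-02-01 00:00:00
```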
In these cases, we should fall back to using netCDF4/netcdftime instead. We may also need to add more checks for numeric overflow.
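A minimal sketch of such a fallback via `netCDF4.num2date`; the units and calendar here are illustrative:

```python
import netCDF4

# num2date does real calendar arithmetic, so it handles calendars and units
# that datetime64[ns] cannot represent.
dates = netCDF4.num2date([0, 1, 2], units="days since 2000-02-27",
                         calendar="noleap")
print(dates)  # the noleap calendar skips Feb 29, unlike the standard one
```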
Yes, this might also work. I no longer recall why we cast all inputs to floats (maybe just for consistency), but I suspect that one of our time conversion libraries (probably netCDF4/netcdftime) expects a float array. Certainly we will still need to support floating point times saved in netCDF files, which are pretty common in my experience.
Hmm... The "nanosecond" issue seems to need a fix very much at the foundation, and as long as pandas and xarray rely on `datetime64[ns]`, there is not much to be done about it here. An intermediate fix (@shoyer, do you actually want one?) that I could think of for the performance issue right now would be to do the conversion to integer nanoseconds before handing the values to `pd.to_timedelta`.
The only thing that bothers me is that I am not sure whether the "number of nanoseconds" is always the same in every day or hour in the view of `datetime64` (think leap seconds). @shoyer: does this sound reasonable, or did I forget to take any side effects into account?
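A hypothetical sketch of that intermediate fix, assuming seconds-based units; the helper name is made up, and NaN/fill-value handling and overflow checks are omitted:

```python
import numpy as np
import pandas as pd

NS_PER_SECOND = 1_000_000_000

def floats_to_timedelta(raw, ns_per_unit=NS_PER_SECOND):
    """Take the fast integer path through pd.to_timedelta when possible."""
    raw = np.asarray(raw, dtype=np.float64)
    as_int = raw.astype(np.int64)
    if np.array_equal(as_int, raw):  # values are exactly integral
        # Only safe for fixed-length units such as seconds or hours.
        return pd.to_timedelta(as_int * ns_per_unit, unit="ns")
    return pd.to_timedelta(raw, unit="s")  # slow float fallback

print(floats_to_timedelta([0.0, 1.0, 86400.0]))
```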
yes, you can ignore my comment!
@spencerkclark has been working on a patch to natively support other datetime precisions in xarray (see #1252).
For better or worse, NumPy's datetime64 ignores leap seconds.
This sounds pretty reasonable to me. The main challenge here will be guarding against integer overflow -- you might need to do the math twice, once with floats (to check for overflow) and then with integers. You could also experiment with doing the conversion with NumPy instead of pandas.
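A rough sketch of that two-pass overflow guard, plus the NumPy-only conversion mentioned above; the helper name and unit factor are illustrative:

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max

def safe_scale_to_ns(values, ns_per_unit):
    values = np.asarray(values)
    # First pass in floating point: a cheap probe for int64 overflow.
    approx = values.astype(np.float64) * ns_per_unit
    if np.any(np.abs(approx) > INT64_MAX):
        raise OverflowError("values do not fit into int64 nanoseconds")
    # Second pass with exact integer arithmetic.
    return values.astype(np.int64) * ns_per_unit

# Converting with NumPy instead of pandas: int64 counts become timedelta64.
ns = safe_scale_to_ns([1, 2, 3], 1_000_000_000)
print(ns.astype("timedelta64[ns]"))
```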
Okay, I will try to come up with a PR within the next few days.
Hi,
`decode_cf_datetime` is slowed down because it always passes floats to `pd.to_timedelta`, while `pd.to_timedelta` is much faster when working on integers. Here is a notebook that shows the differences: working with integers is approximately one order of magnitude faster.
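The linked notebook is not reproduced here; a rough stand-in benchmark of the same comparison (absolute timings are machine-dependent):

```python
import time

import numpy as np
import pandas as pd

floats = np.arange(1_000_000, dtype=np.float64)
ints = floats.astype(np.int64)

# Identical values, different dtypes: the integer path avoids the slow
# float-handling branch inside pd.to_timedelta.
for label, data in [("float64", floats), ("int64", ints)]:
    start = time.perf_counter()
    pd.to_timedelta(data, unit="s")
    print(label, time.perf_counter() - start)
```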
Hence, it would be great to automatically convert the raw float time values to integers in nanoseconds where possible (likely limited to resolutions below days or hours, to avoid coping with differing numbers of nanoseconds within, e.g., different months).
As an alternative, maybe avoid forcing the cast to floats and note in the docstring that passing integers as raw values speeds up the conversion.
This could possibly also be resolved in `pd.to_timedelta` itself, but I assume it would be more complicated to deal with all the edge cases there.