Speed up decode_cf_datetime
#1414
Conversation
Instead of casting the input numeric dates to float, they are cast to nanoseconds as integer, which makes `pd.to_timedelta()` work much faster (x100 speedup on my machine).
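For readers skimming the thread, here is a minimal standalone sketch of the idea. The function name, simplified signature and example values below are illustrative only; the real `decode_cf_datetime` in `conventions.py` also handles calendars, missing values and fallback paths that are omitted here.

```python
import numpy as np
import pandas as pd

def decode_numeric_dates_sketch(flat_num_dates, delta, ref_date):
    """Illustrative sketch only, not the actual xarray function.

    `flat_num_dates` holds the raw numeric values from the file, `delta`
    is the time unit ('s', 'm', 'h' or 'D') and `ref_date` is the parsed
    reference date from the CF "units" attribute.
    """
    # Nanoseconds per time unit (the dict shown in the diff below).
    ns_per_time_delta = {'s': int(1e9),
                         'm': int(1e9) * 60,
                         'h': int(1e9) * 60 * 60,
                         'D': int(1e9) * 60 * 60 * 24}
    # Casting to int64 nanoseconds lets pd.to_timedelta take its fast
    # integer path; passing floats is what made the old code slow.
    flat_num_dates_ns_int = (flat_num_dates *
                             ns_per_time_delta[delta]).astype(np.int64)
    return (pd.to_timedelta(flat_num_dates_ns_int, unit='ns') + ref_date).values

# e.g. values stored as "days since 2000-01-01"
print(decode_numeric_dates_sketch(np.array([0, 1, 2]), 'D',
                                  pd.Timestamp('2000-01-01')))
```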
xarray/conventions.py
Outdated
's': 1e9, | ||
'm': 1e9 * 60, | ||
'h': 1e9 * 60 * 60, | ||
'D': 1e9 * 60 * 60 * 24} |
Should we put this in module-level scope? Not sure we need to create this dictionary each time the `decode_cf_datetime` function is called.
I left it there because the performance impact should be negligible, but it could of course be placed somewhere outside to create it only once.
Yes, I would suggest saving this as a module-level constant, `_NS_PER_TIME_DELTA`.
This looks good to me. I don't think you need new tests (unless the existing test coverage looks lacking in any way), but a brief note in "what's new" would be appreciated.
Thanks @shoyer and @jhamman for the feedback. I will change things accordingly. Concerning tests, I will think again about additional checks for correct handling of overflow. I must admit that I am not 100% sure that every case is handled correctly by the current code and checked by the current tests. I will have to think about it a little when I find time in the next few days...
Just a short notice, sorry for the delay: I am still working on this PR, but I am too busy right now to finish the overflow testing. I think I found some edge cases which have to be handled. I will provide more details soon.
@cchwala - Any update on the testing / overflow issue you mentioned?
@jhamman - Sorry, I was away from the office (and everything related to work) for more than a month and had to catch up with a lot of things. I will sum up my findings and post here, hopefully after today's lunch break.
@jhamman - I found some differences between the old code in master and my code when decoding values close to the np.datetime64 overflow. My code produces different results there. First, I wanted to test and fix that. However, I may have found that the old implementation did not behave correctly when crossing the "overflow" line just slightly. I have summed that up in a notebook here. My conclusion would be that the code in this PR is not only faster, but also more correct than the old one. However, since it is quite late in the evening and my head needs some rest, I would like to get a second (or third) opinion...
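For context on where these edge cases live: pandas datetimes are stored as int64 nanoseconds, so the representable range is fixed, and decoded dates just outside it are where old and new code can diverge. A quick illustration (not from the notebook; exact exception behaviour depends on the pandas version):

```python
import pandas as pd

# pandas stores datetimes as int64 nanoseconds since 1970-01-01, which
# fixes the representable range:
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# Anything outside that range cannot become a nanosecond timestamp:
try:
    pd.to_datetime('2263-01-01')
except pd.errors.OutOfBoundsDatetime as err:
    print('out of bounds:', err)
```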
...but wait: I should still add a check for the overflow.
@cchwala - thanks for keeping this moving. Once you've taken another pass at the code and added a whats-new note, I'll give it a final review.
hmm... it's still complicated. To avoid the overflow, I check the minimum and maximum of the input dates like this:

    (pd.to_timedelta(flat_num_dates.min(), delta) -
     pd.to_timedelta(1, 'd') +
     ref_date)
    (pd.to_timedelta(flat_num_dates.max(), delta) +
     pd.to_timedelta(1, 'd') +
     ref_date)

But unfortunately, as shown in my notebook above, this check runs into a problem in pandas itself. Since I do not think this will be fixed soon (I would gladly look at it, but have no time and probably not enough knowledge of the pandas internals): do you want to merge this PR, knowing that there still is the overflow issue that was in the code before? Or should I continue to try to fix the current overflow bug in this PR?
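To make the fragment above concrete, here is a self-contained sketch of how such a min/max guard could be wired up; `_bounds_check`, the example values and the fallback message are illustrative, not the actual xarray code:

```python
import numpy as np
import pandas as pd

def _bounds_check(flat_num_dates, delta, ref_date):
    """Evaluate the earliest and latest decoded date (with a one-day
    margin); out-of-range inputs make pandas raise here, before the whole
    array is converted."""
    (pd.to_timedelta(flat_num_dates.min(), delta) -
     pd.to_timedelta(1, 'd') + ref_date)
    (pd.to_timedelta(flat_num_dates.max(), delta) +
     pd.to_timedelta(1, 'd') + ref_date)

# 100000 days after 2000-01-01 is past the nanosecond-timestamp range.
num_dates = np.array([0.0, 1.0, 100000.0])
try:
    _bounds_check(num_dates, 'd', pd.Timestamp('2000-01-01'))
except (OverflowError, ValueError) as err:
    # OutOfBoundsDatetime is a ValueError subclass; older pandas raised a
    # plain OverflowError for this addition instead.
    print('cannot decode with nanosecond precision:', err)
```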
This still sounds like an improvement to me!
Agreed. Let's leave the overflow fix for later. I'll give this one more review and we'll try to get this merged.
This looks good. Let's still move the `_NS_PER_TIME_DELTA` dict to the top of the `conventions.py` module and add a note to `whats-new.rst`. I think our existing test coverage is sufficient here.
@jhamman @shoyer Should I open an xarray issue concerning the overflow bug discussed above?
Thanks @cchwala! I think the Pandas issue is sufficient on this one.
- Closes: `decode_cf_datetime()` slow because `pd.to_timedelta()` is slow if floats are passed #1399
- Passes `git diff upstream/master | flake8 --diff`
- `whats-new.rst` entry for all changes and `api.rst` for new API

Instead of casting the input numeric dates to float, they are now cast to nanoseconds as int64, which makes `pd.to_timedelta()` work much faster (x100 speedup on my machine).

On my machine all existing tests for `conventions.py` pass. Overflows should be handled by these two already existing lines, since everything in the valid range of `pd.to_datetime` should be safe.
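For anyone who wants to check the speedup claim locally, a minimal timing comparison of the two `pd.to_timedelta` input styles; this measures `pd.to_timedelta` directly rather than `decode_cf_datetime`, and the numbers will of course vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
seconds_float = np.arange(n, dtype=np.float64)        # old path: float seconds
ns_int = (seconds_float * int(1e9)).astype(np.int64)  # new path: int64 nanoseconds

t_float = timeit.timeit(lambda: pd.to_timedelta(seconds_float, unit='s'), number=10)
t_int = timeit.timeit(lambda: pd.to_timedelta(ns_int, unit='ns'), number=10)
print(f'float seconds: {t_float:.3f}s   int64 ns: {t_int:.3f}s')
```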