Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up decode_cf_datetime #1414

Merged
merged 4 commits into from
Jul 25, 2017
Merged

Conversation

cchwala
Copy link
Contributor

@cchwala cchwala commented May 18, 2017

Instead of casting the input numeric dates to float, they are now
casted to nanoseconds as int64 which makes pd.to_timedelta()
work much faster (x100 speedup on my machine).

On my machine all existing tests for conventions.py pass. Overflows should be handled by these two already existing lines since everything in the valid range of pd.to_datetime should be save.

Instead of casting the input numeric dates to float, they are
casted to nanoseconds as integer which makes `pd.to_timedelta()`
work much faster (x100 speedup on my machine)
's': 1e9,
'm': 1e9 * 60,
'h': 1e9 * 60 * 60,
'D': 1e9 * 60 * 60 * 24}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we put this in a module level scope? Not sure we need to create this dictionary each time the decode_cf_datetime function is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it there because the performance impact should be negligible, but it could of course be placed somewhere outside to create it only once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would suggest saving this as a module level constant, _NS_PER_TIME_DELTA.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I don't think you need new tests (unless the existing test coverage looks lacking in any way), but a brief note in "what's new" would be appreciated.

's': 1e9,
'm': 1e9 * 60,
'h': 1e9 * 60 * 60,
'D': 1e9 * 60 * 60 * 24}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would suggest saving this as a module level constant, _NS_PER_TIME_DELTA.

@cchwala
Copy link
Contributor Author

cchwala commented May 21, 2017

Thanks @shoyer and @jhamman for the feedback. I will change things accordingly.

Concerning tests, I will think again about additional checking for correct handling of overflow. I must admit, that I am not 100% sure that every case is handled correctly by the current code and checked by the current tests. Will have to think about it a little when I find time within the next days...

@cchwala
Copy link
Contributor Author

cchwala commented Jun 1, 2017

Just a short notice. Sorry, for the delay. I am still working on this PR, but I am too busy right now to finish the overflow testing. I think I found some edge cases which have to be handled. I will provide more details soon.

@jhamman
Copy link
Member

jhamman commented Jul 13, 2017

@cchwala - Any update on the testing / overflow issue you mentioned?

@cchwala
Copy link
Contributor Author

cchwala commented Jul 14, 2017

@jhamman - Sorry. I was away from office (and everything related to work) for more than a month and had to catchup with a lot of things. I will sum up my stuff and post here, hopefully after todays lunch break.

@cchwala
Copy link
Contributor Author

cchwala commented Jul 16, 2017

@jhamman - I found some differences between the old code in master an my code when decoding values close to the np.datetime64 overflow. My code produces NaT where the old code returned some date.

First, I wanted to test and fix that. However, I may have found that the old implementation did not behave correctly when crossing the "overflow" line just slightly.

I have summed that up in a notebook here.

My conclusion would be, that the code in this PR here is not only faster, but also more correct than the old one. However, since it is quite late in the evening and my head needs some rest, I would like to get a second (or third) opinion...

@cchwala
Copy link
Contributor Author

cchwala commented Jul 16, 2017

...but wait. The NaTs that my code produces beyond the int64 overflow should be valid dates, produced using _decode_datetime_with_netcdf4, right?

Hence, I should still add a check for NaT results and fall back to the netcdf version then.

@jhamman
Copy link
Member

jhamman commented Jul 17, 2017

@cchwala - thanks for keeping this moving. Once you've taking another pass at the code and added a Whatsnew note, I'll give it a final review.

@cchwala
Copy link
Contributor Author

cchwala commented Jul 21, 2017

hmm... it's still complicated. To avoid the NaTs in my code, I tried to extend the current overflow check so that it switches to _decode_datetime_with_netcdf4() earlier. This was my attempt:

(pd.to_timedelta(flat_num_dates.min(), delta) -
 pd.to_timedelta(1, 'd') +
 ref_date)
(pd.to_timedelta(flat_num_dates.max(), delta) +
 pd.to_timedelta(1, 'd') +
 ref_date)

But unfortunately, as shown in my notebook above, pandas.to_timedelta() has a bug and does not detect the overflow in those esoteric cases that I have identified... I have filed this Issue pandas-dev/pandas/issues/17037 because it should be solved there.

Since I do not think this will be fixed soon (I would gladly look at it, but have no time and probably not enough knowledge about the pandas core stuff), I am not sure what to do.

Do you want to merge this PR, knowing that there still is the overflow issue that was in the code before? Or should I continue to try to fix the current overflow bug in this PR?

@shoyer
Copy link
Member

shoyer commented Jul 21, 2017

Do you want to merge this PR, knowing that there still is the overflow issue that was in the code before?

This still sounds like an improvement to me!

@jhamman
Copy link
Member

jhamman commented Jul 21, 2017

Agreed. Let's leave the overflow fix for later. I'll give this one more review and we'll try to get this merged.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Let's still move the _NS_PER_TIME_DELTA dict to the top of the conventions.py module and add a note to the whatsnew.rst.

I think our existing test coverage is sufficient here.

@cchwala
Copy link
Contributor Author

cchwala commented Jul 25, 2017

@jhamman @shoyer
This should be ready to merge.

Should I open an xarray issue concerning the bug with pandas.to_timedelta() or is it enough to have the issue I submitted for pandas? I think the bug should be resolved in xarray when it is resolved in pandas because then the overflow check here should catch the cases I discovered.

@jhamman jhamman merged commit d275ad6 into pydata:master Jul 25, 2017
@jhamman
Copy link
Member

jhamman commented Jul 25, 2017

Thanks @cchwala! I think the Pandas issue is sufficient on this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants