Don't cast NaN to integer #7098
Closed
Could we maybe preserve this optimization (#1399) in the case that NaNs are not present in `flat_num_dates`?
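For illustration, here is a minimal sketch of the kind of branching this question suggests. It is not the code in this PR: `decode_offsets_ns` and `ns_per_unit` are hypothetical names, and it uses plain NumPy rather than `pd.to_timedelta`. The fast integer cast is taken only when no NaN is present; otherwise NaN entries are replaced by a placeholder before the cast and restored as NaT afterwards, so NaN itself is never cast to an integer.

```python
import numpy as np


def decode_offsets_ns(flat_num_dates, ns_per_unit):
    """Hypothetical helper: convert numeric offsets to timedelta64[ns]."""
    values = np.asarray(flat_num_dates, dtype=np.float64) * ns_per_unit
    nan_mask = np.isnan(values)
    if not nan_mask.any():
        # Fast path: no NaN present, so the integer cast is well defined.
        return values.astype(np.int64).astype("timedelta64[ns]")
    # Slow path: substitute a placeholder for NaN before casting,
    # then mark those positions as NaT in the result.
    filled = np.where(nan_mask, 0.0, values).astype(np.int64)
    result = filled.astype("timedelta64[ns]")
    result[nan_mask] = np.timedelta64("NaT")
    return result


# Offsets in seconds, converted to nanoseconds.
no_nan = decode_offsets_ns([1.0, 2.0], 1_000_000_000)
with_nan = decode_offsets_ns([1.0, np.nan], 1_000_000_000)
```

Whether the extra `isnan` pass is cheap enough to keep the fast path worthwhile would need to be benchmarked.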
Are there any ASV tests for this, so we can see if `pd.to_timedelta` is still slow with non-ints?
True, the issue that motivated that PR is now quite old. It would be good to revisit this. If we use the test cases in the notebook linked to in #1399, we do see that `pd.to_timedelta` has improved a lot in this regard, though it still trails the current approach by about a factor of four:
Previous timings were 5.2 seconds and 10.6 milliseconds, respectively.
I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?
Currently if times are in numpy datetime64 representation and they have NaT, that NaT has a certain int64 representation. We would only need to skip CFMaskCoder and do the masking in CFDatetimeCoder at least for those cases. Does that make sense?
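To make the point about the int64 representation concrete, here is a small NumPy-only sketch. It is an illustration, not xarray code, and the variable names are made up. NaT in a `datetime64` array is stored as the minimum int64 value, so missing entries can be found from the integer view without going through floating point.

```python
import numpy as np

times = np.array(["2000-01-01", "NaT"], dtype="datetime64[ns]")

# Reinterpret the same 8-byte values as int64 (no copy, no conversion).
as_int = times.view("int64")

# NaT is stored as the smallest representable int64.
nat_sentinel = np.iinfo(np.int64).min
is_nat_from_int = as_int == nat_sentinel
```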
This is generally up to how xarray, the user, or whoever created the data we are reading in, configured the encoding. Some files indeed do have times encoded as floats with a fill value of NaN. Some of those files were created by xarray (maybe we have some control over this, but not over past decisions); some may have been created with other tools (unfortunately we have no control over this).
I think one could argue that xarray should not create these files automatically (as you've noted, it currently does if NaT is present in the data, which I agree should be fixed), but I'm not sure how we would guard against the case where someone explicitly set the units encoding of the times in a way that requires floating point values (raising feels a little extreme).
@spencerkclark I very much agree here, especially on the point about raising an error. A warning would be sufficient.
My real concern here, and this should really be fixed, is that on decoding, any time array (even if we get it as int64) with an associated `_FillValue` will be transformed to floating point in CFMaskCoder. I think it's better to open a new issue for that.
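A rough NumPy-only illustration of why this matters (this imitates NaN-based masking in general and is not the actual CFMaskCoder code; the fill value and array are made up): replacing a fill value with NaN forces an int64 array to float64, and integers above 2**53 no longer round-trip exactly.

```python
import numpy as np

fill_value = -9999  # assumed fill value, for illustration only
raw = np.array([2**53 + 1, fill_value], dtype=np.int64)

# NaN-based masking promotes the whole array to float64.
masked = np.where(raw == fill_value, np.nan, raw)

# The large integer is no longer represented exactly after promotion.
lost_precision = int(masked[0]) != int(raw[0])
```

Nanosecond-resolution times stored as int64 routinely exceed 2**53, so this kind of promotion can silently change decoded values.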
Ah I see what you mean. I agree it would be good to open a new issue for that.
In my oceanographic experience, units of "days since X" with floating point time values are very common. Argo comes immediately to mind. Fill values are encountered when the time data is not part of a dimension/coordinate variable; this is also considered valid, especially in observational data.