Don't cast NaN to integer #7098
Hi @andreas-schwab, and welcome to xarray! Can you tell us what specific failures this change fixed for you? If there was something failing we want to capture it in our test suite, but I am not sure what failure you are referring to.
It's already perfectly covered by the testsuite.
=========================== short test summary info ============================
FAILED xarray/tests/test_backends.py::TestScipyInMemoryData::test_roundtrip_numpy_datetime_data
FAILED xarray/tests/test_backends.py::TestScipyFileObject::test_roundtrip_numpy_datetime_data
FAILED xarray/tests/test_backends.py::TestGenericNetCDFData::test_roundtrip_numpy_datetime_data
FAILED xarray/tests/test_backends.py::TestScipyFilePath::test_roundtrip_numpy_datetime_data
= 4 failed, 4636 passed, 5632 skipped, 19 xfailed, 22 xpassed, 38 warnings in 266.18s (0:04:26) =
Okay great, but then why are these tests failing for you (locally?) and not in our CI runs? I'm asking because if our current CI test runs don't reproduce this error, then we (a) have no way to check that your change fixed the error and (b) will not know if some regression causes this error to resurface in the future. Standard practice would be to raise an issue that demonstrates the problem, then link the fix PR to that issue.
That's the beauty of undefined behaviour. It comes in many flavors, and
you never know which one you get. Which also means that it is
impossible to reliably test for it.
@TomNicholas Something different will need to happen with that cast eventually. See #6191 for something that is failing on some systems that users have but currently cannot be captured in the tests. NumPy has already added runtime warnings about doing this, and is "thinking about" making NaN-to-int casts raise (numpy/numpy#14412). Xarray's own @shoyer has hit issues like this before as well (numpy/numpy#6109).
Thank you very much for the context @DocOtak!
I think the real solution here is to explicitly handle NaNs during the decoding step. We do want these to be NaT in the output.
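To make the idea of handling NaNs explicitly a bit more concrete, here is a minimal sketch. It is not xarray's decoding code: `NS_PER_DAY` and `days_to_timedelta_ns` are hypothetical names, and the sketch assumes float ordinals in units of days. The point is only that NaN never reaches the integer cast and that the masked positions come out as NaT.

```python
import numpy as np

# Hypothetical constant for this sketch: nanoseconds in one day.
NS_PER_DAY = 24 * 60 * 60 * 10**9


def days_to_timedelta_ns(num_days):
    """Convert float days to timedelta64[ns], mapping NaN to NaT."""
    num_days = np.asarray(num_days, dtype=np.float64)
    nan_mask = np.isnan(num_days)
    # Substitute a harmless placeholder so no NaN is ever cast to int64.
    safe = np.where(nan_mask, 0.0, num_days)
    ns = (safe * NS_PER_DAY).astype(np.int64)
    out = ns.astype("timedelta64[ns]")
    # Restore the missing values as NaT after the cast.
    out[nan_mask] = np.timedelta64("NaT", "ns")
    return out


result = days_to_timedelta_ns([1.0, np.nan, 0.5])
```

Here `result[1]` is NaT, while the other two entries are ordinary nanosecond timedeltas.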
Do not try to convert NaN to integer type, as the operation is undefined and results in random values. This fixes all testsuite failures.
Hi, is there anything keeping this PR from being merged? It fixes some failing tests on riscv64.
Thanks @andreas-schwab -- sorry that we let this sit for so long. This is indeed an important issue to address. The solution is good. I just think we might be able to preserve the optimization in the case that NaNs are not present in the input data, which might be nice.
It looks like we've been getting the warnings that @DocOtak mentioned when running the test_roundtrip_numpy_datetime_data tests:
RuntimeWarning: invalid value encountered in cast
flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
# Cast input ordinals to integers of nanoseconds because pd.to_timedelta
# works much faster when dealing with integers (GH 1399).
flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
    np.int64
)
Could we maybe preserve this optimization (#1399) in the case that NaNs are not present in flat_num_dates?
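One way this could look, as a rough sketch rather than the change that was eventually merged: take the integer fast path only when no NaN is present, and otherwise let `pd.to_timedelta` handle the floats, which maps NaN to NaT. The function name `to_timedelta_ns` and its `ns_per_unit` parameter are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def to_timedelta_ns(flat_num_dates, ns_per_unit):
    """Convert float ordinals to timedeltas without ever casting NaN to int."""
    flat_num_dates = np.asarray(flat_num_dates, dtype=np.float64)
    scaled = flat_num_dates * ns_per_unit
    if not np.isnan(scaled).any():
        # Fast path (GH 1399): integer nanoseconds, safe because no NaN is present.
        return pd.to_timedelta(scaled.astype(np.int64), unit="ns")
    # Slow path: keep floats so pandas converts NaN to NaT itself.
    return pd.to_timedelta(scaled, unit="ns")


no_nan = to_timedelta_ns([1.0, 2.0], 60 * 10**9)
with_nan = to_timedelta_ns([1.0, np.nan], 60 * 10**9)
```

The NaN check is a single pass over the data, so it should cost little compared with the conversion itself, though that would need to be benchmarked.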
Are there any ASV benchmarks for this, so we can see if pd.to_timedelta is still slow with non-integer input?
True, the issue that motivated that PR is now quite old. It would be good to revisit this. If we use the test cases in the notebook linked to in #1399, we do see that pd.to_timedelta has improved a lot in this regard, though it still trails the current approach by about a factor of four:
In [1]: import numpy as np; import pandas as pd
In [2]: t_minutes = np.arange(1.0,100000.0, 0.13, dtype=np.float64)
In [3]: %%timeit
...: pd.to_timedelta(t_minutes, unit='m')
...:
...:
10 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %%timeit
...: pd.to_timedelta((t_minutes * 60 * 1e9).astype(np.int64), unit='ns')
...:
...:
2.26 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: pd.__version__
Out[5]: '2.0.0'
In [6]: np.__version__
Out[6]: '1.24.3'
Previous timings were 5.2 seconds and 10.6 milliseconds, respectively.
I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?
Currently if times are in numpy datetime64 representation and they have NaT, that NaT has a certain int64 representation. We would only need to skip CFMaskCoder and do the masking in CFDatetimeCoder at least for those cases. Does that make sense?
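As a small illustration of the int64 representation mentioned here: in NumPy, NaT is stored as the minimum int64 value, so it can be inspected or produced through a view without going through floating point. The variable names below are only for this example.

```python
import numpy as np

# View a NaT value as its underlying 64-bit integer.
nat = np.array(["NaT"], dtype="datetime64[ns]")
nat_as_int = nat.view(np.int64)[0]

# The sentinel NumPy uses for NaT is the smallest int64.
int64_min = np.iinfo(np.int64).min

# Going the other way: that integer viewed as datetime64 is NaT again.
round_trip = np.array([int64_min], dtype=np.int64).view("datetime64[ns]")
```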
> I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?
This is generally up to how xarray, the user, or whoever created the data we are reading in, configured the encoding. Some files indeed do have times encoded as floats with a fill value of NaN. Some of those files were created by xarray (maybe we have some control over this, but not over past decisions); some may have been created with other tools (unfortunately we have no control over this).
I think one could argue that xarray should not create these files automatically (as you've noted it currently does if NaT is present in the data, which I agree should be fixed), but I'm not sure how we would control against the case that someone explicitly set the units encoding of the times in such a way that would require floating point values (raising feels a little extreme).
@spencerkclark I very much agree here, especially on the point about raising an error. A warning would be sufficient.
My real concern here, and this should really be fixed, is that on decoding, any time array with an associated _FillValue (even if we get it as int64) will be transformed to floating point in CFMaskCoder. I think it would be better to open a new issue for that.
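A minimal sketch of what masking on the integer side could look like, assuming times stored as int64 seconds with an integer fill value. The function `decode_int_times` and the fill value `-9999` are hypothetical, and this is not how CFMaskCoder or CFDatetimeCoder are implemented; it only shows that the fill value can become NaT without a detour through floats.

```python
import numpy as np


def decode_int_times(raw, fill_value):
    """Turn int64 seconds into timedelta64[s], mapping fill_value to NaT."""
    raw = np.asarray(raw, dtype=np.int64)
    out = raw.astype("timedelta64[s]")
    # Mask by comparing integers directly; no float conversion is involved.
    out[raw == fill_value] = np.timedelta64("NaT")
    return out


decoded = decode_int_times([0, -9999, 60], fill_value=-9999)
```

Staying in int64 also avoids the precision loss that float64 has for integers above 2**53.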
Ah I see what you mean. I agree it would be good to open a new issue for that.
In my oceanographic experience, units of "days since X" with floating point time values are very common. Argo comes immediately to mind. Fill values are encountered when the time data is not part of a dimension/coordinate variable; this is also considered valid, especially in observational data.
@andreas-schwab Thanks again for this PR. It turned out to be a bit more involved to make this work, and we hope #7827 solves the underlying issue.