Don't cast NaN to integer #7098

andreas-schwab · 2022-09-28T11:13:44Z

Do not try to convert NaN to integer type, as the operation is undefined
and results in random values. This fixes all testsuite failures.
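
For readers who want to see the problem in isolation, here is a minimal sketch, assuming only NumPy. The array `values` is made up for illustration and is not taken from xarray. The finite entries cast predictably; the entry produced from NaN is whatever the platform happens to return, so the sketch never relies on it.

```python
import warnings

import numpy as np

# Illustrative input only: one NaN between two finite values.
values = np.array([1.5, np.nan, 3.0])

# Newer NumPy releases emit "RuntimeWarning: invalid value encountered in cast"
# here; older ones stay silent. Either way the NaN slot is not portable.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    unsafe = values.astype(np.int64)

finite = ~np.isnan(values)

# Finite values truncate toward zero on every platform.
print("finite slots:", unsafe[finite].tolist())
# The value below may differ between architectures; do not depend on it.
print("NaN slot (platform-dependent):", unsafe[~finite].tolist())
```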

TomNicholas · 2022-09-29T16:44:38Z

Hi @andreas-schwab, and welcome to xarray!

Can you tell us which specific failures this change fixed for you? If something was failing, we want to capture it in our test suite, but I am not sure which failure you are referring to.

andreas-schwab · 2022-09-29T17:12:43Z

It's already perfectly covered by the testsuite.

=========================== short test summary info ============================
FAILED xarray/tests/test_backends.py::TestScipyInMemoryData::test_roundtrip_numpy_datetime_data
FAILED xarray/tests/test_backends.py::TestScipyFileObject::test_roundtrip_numpy_datetime_data
FAILED xarray/tests/test_backends.py::TestGenericNetCDFData::test_roundtrip_numpy_datetime_data
FAILED xarray/tests/test_backends.py::TestScipyFilePath::test_roundtrip_numpy_datetime_data
= 4 failed, 4636 passed, 5632 skipped, 19 xfailed, 22 xpassed, 38 warnings in 266.18s (0:04:26) =

TomNicholas · 2022-09-29T17:20:48Z

> It's already perfectly covered by the testsuite.

Okay great, but then why are these tests failing for you (locally?) and not in our CI runs? (Our main branch passed all automated tests just now.) Do you have a different version of some package that we aren't testing against?

I'm asking because if our current CI test runs don't reproduce this error, then we (a) have no way to check that your change fixed the error and (b) will not know if some regression causes this error to resurface in future. Standard practice would be to raise an issue that demonstrates the problem, then link the fix PR to that issue.

andreas-schwab · 2022-09-29T18:08:29Z

That's the beauty of undefined behaviour. It comes in many flavors, and you never know which one you get. Which also means that it is impossible to reliably test for it.

DocOtak · 2022-09-29T21:22:32Z

@TomNicholas Something different will need to happen with that cast eventually. See #6191 for something that is failing on some users' systems but currently cannot be captured in the tests. Numpy has already added runtime warnings about doing this, and is "thinking about" making NaN-to-int casts raise (numpy/numpy#14412). Xarray's own @shoyer has hit issues like this before as well (numpy/numpy#6109).

TomNicholas · 2022-09-29T21:28:31Z

Thank you very much for the context @DocOtak !

dcherian · 2022-09-30T14:47:06Z

I think the real solution here is to explicitly handle NaNs during the decoding step. We do want these to be NaT in the output.

cc @spencerkclark
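
One way to picture "explicitly handle NaNs during the decoding step" is the sketch below. It is only an illustration under stated assumptions: `decode_ns_with_nat` and its `ns_per_unit` parameter are hypothetical names, not xarray functions, and the real fix may look different. The idea is to mask NaN first, cast only well-defined values, and then write NaT into the masked slots.

```python
import numpy as np


def decode_ns_with_nat(num_dates, ns_per_unit):
    """Hypothetical helper: float offsets -> timedelta64[ns], with NaN -> NaT."""
    num_dates = np.asarray(num_dates, dtype=np.float64)
    nan_mask = np.isnan(num_dates)

    # Substitute a harmless finite value so the integer cast is well defined.
    safe = np.where(nan_mask, 0.0, num_dates)
    ns = (safe * ns_per_unit).astype(np.int64)

    # Reinterpret the nanosecond counts as timedeltas, then restore the gaps.
    deltas = ns.astype("timedelta64[ns]")
    deltas[nan_mask] = np.timedelta64("NaT")
    return deltas


# Offsets in seconds, with one missing value.
deltas = decode_ns_with_nat([0.5, np.nan, 2.0], ns_per_unit=1_000_000_000)
print(deltas)
```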

kxxt · 2023-04-16T15:01:24Z

Hi, is there any reason keeping this PR from being merged? This PR is a fix for some failing tests on riscv64.

spencerkclark

Thanks @andreas-schwab -- sorry that we let this sit for so long. This is indeed an important issue to address. The solution is good. I just think we might be able to preserve the optimization in the case that NaNs are not present in the input data, which might be nice.

It looks like we've been getting the warnings that @DocOtak mentioned when running the test_roundtrip_numpy_datetime_data tests:

RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(

spencerkclark · 2023-05-02T22:08:27Z

xarray/coding/times.py

-    # Cast input ordinals to integers of nanoseconds because pd.to_timedelta
-    # works much faster when dealing with integers (GH 1399).
-    flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
-        np.int64
-    )


Could we maybe preserve this optimization (#1399) in the case that NaNs are not present in flat_num_dates?
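
A rough sketch of what keeping the fast path could look like, assuming NumPy and pandas. The function name `to_timedelta_seconds` and the fixed unit of seconds are assumptions made for this illustration; this is not the code in `xarray/coding/times.py`. It only guards against NaN, not against infinities or values that overflow int64.

```python
import numpy as np
import pandas as pd

NS_PER_SECOND = 1_000_000_000  # assumption: inputs are offsets in seconds


def to_timedelta_seconds(flat_num_dates):
    """Hypothetical helper: use the integer fast path only when no NaN is present."""
    flat_num_dates = np.asarray(flat_num_dates, dtype=np.float64)

    if np.isnan(flat_num_dates).any():
        # Slower path: hand the floats to pandas, which maps NaN to NaT.
        return pd.to_timedelta(flat_num_dates, unit="s")

    # Fast path: without NaN the cast to int64 nanoseconds is well defined.
    ns = (flat_num_dates * NS_PER_SECOND).astype(np.int64)
    return pd.to_timedelta(ns, unit="ns")


fast = to_timedelta_seconds([1.0, 2.5])
slow = to_timedelta_seconds([1.0, np.nan])
print(fast)
print(slow)
```

The extra `np.isnan(...).any()` pass costs one scan of the input, which is why it is worth benchmarking against simply passing floats to `pd.to_timedelta`.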

DocOtak · 2023-05-02

Are there any ASV tests for this so we can see if pd.to_timedelta is still slow with non-ints?

spencerkclark · 2023-05-02

True, the issue that motivated that PR is now quite old. It would be good to revisit this. If we use the test cases in the notebook linked to in #1399, we do see that pd.to_timedelta has improved a lot in this regard, though it still trails the current approach by about a factor of four:

In [1]: import numpy as np; import pandas as pd

In [2]: t_minutes = np.arange(1.0,100000.0, 0.13, dtype=np.float64)

In [3]: %%timeit
   ...: pd.to_timedelta(t_minutes, unit='m')
   ...:
   ...:
10 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
   ...: pd.to_timedelta((t_minutes * 60 * 1e9).astype(np.int64), unit='ns')
   ...:
   ...:
2.26 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: pd.__version__
Out[5]: '2.0.0'

In [6]: np.__version__
Out[6]: '1.24.3'

Previous timings were 5.2 seconds and 10.6 milliseconds, respectively.

kmuehlbauer · 2023-05-03

I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?

Currently if times are in numpy datetime64 representation and they have NaT, that NaT has a certain int64 representation. We would only need to skip CFMaskCoder and do the masking in CFDatetimeCoder at least for those cases. Does that make sense?
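
To make the "certain int64 representation" concrete, here is a small sketch using only NumPy. The sample array is invented for illustration. NaT is stored as the minimum int64 value, so it can be detected on the integer view without ever going through floating point.

```python
import numpy as np

# Illustrative data: one valid timestamp and one missing value.
times = np.array(["2000-01-01", "NaT"], dtype="datetime64[ns]")

# Reinterpret the same 8 bytes per element as int64 (no copy, no conversion).
as_int = times.view("int64")

# NaT is encoded as the smallest representable int64.
nat_sentinel = np.iinfo(np.int64).min
is_nat = as_int == nat_sentinel

print(as_int)
print(is_nat)
```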

spencerkclark · 2023-05-03

> I'm wondering why there should be any floating point time values at all when decoding. Are there backends which save times as floating point?

This is generally up to how xarray, the user, or whoever created the data we are reading in, configured the encoding. Some files indeed do have times encoded as floats with a fill value of NaN. Some of those files were created by xarray (maybe we have some control over this, but not over past decisions); some may have been created with other tools (unfortunately we have no control over this).

I think one could argue that xarray should not create these files automatically (as you've noted it currently does if NaT is present in the data, which I agree should be fixed), but I'm not sure how we would control against the case that someone explicitly set the units encoding of the times in such a way that would require floating point values (raising feels a little extreme).

kmuehlbauer · 2023-05-03

@spencerkclark I very much agree here, especially on the point about raising an error. A warning would be sufficient.

My real concern here, and this should really be fixed, is that on decoding, any time array with an associated _FillValue (even if we get it as int64) will be transformed to floating point in CFMaskCoder. I think I'd better open a new issue for that.
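
Why a detour through floating point matters can be shown with plain NumPy. The sketch below does not call CFMaskCoder; it only demonstrates the general float64 property at stake, using a made-up nanosecond count. At this magnitude float64 can only represent multiples of 256, so a single nanosecond does not survive the round trip.

```python
import numpy as np

# Illustrative int64 nanosecond count (about 50 years), ending in ...001.
ns = np.array([1_600_000_000_000_000_001], dtype=np.int64)

# int64 -> float64 -> int64, as would happen if masking promoted to float.
roundtrip = ns.astype(np.float64).astype(np.int64)

# float64 has a 53-bit significand, so the trailing nanosecond is rounded away.
lost = int(ns[0] - roundtrip[0])
print("original: ", ns[0])
print("roundtrip:", roundtrip[0])
print("lost nanoseconds:", lost)
```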

spencerkclark · 2023-05-04

Ah I see what you mean. I agree it would be good to open a new issue for that.

DocOtak · 2023-05-04

In my oceanographic experience, units of "days since X" with floating point values are very common for indicating time. Argo comes immediately to mind. Fill values are encountered when the time data is not part of a dimension/coordinate variable; this is also considered valid, especially in observational data.

kmuehlbauer · 2023-09-17T08:20:28Z

@andreas-schwab Thanks again for this PR. It turned out it was a bit more involved to make this work and we hope #7827 solves the underlying issue.

github-actions bot added the topic-cftime label Sep 28, 2022

dcherian mentioned this pull request Oct 3, 2022

Handle NaNs when decoding times (failures on riscv64) #7096

Closed

4 tasks

headtr1ck added the needs discussion and needs review labels Oct 14, 2022

Don't cast NaN to integer

c847762

kmuehlbauer mentioned this pull request May 2, 2023

Fill values in time arrays (numpy.datetime64) are lost in zarr #7790

Closed

4 tasks

dcherian requested a review from spencerkclark May 2, 2023 21:10

spencerkclark reviewed May 2, 2023

This was referenced May 4, 2023

nanosecond precision lost when reading time data #7817

Closed

Preserve nanosecond resolution when encoding/decoding times #7827

Merged

Use numpy.can_cast instead of casting and checking #7834

Closed

tovogt mentioned this pull request Jun 29, 2023

Flood of cast warnings after improved hdf5 I/O CLIMADA-project/climada_petals#84

Open

kmuehlbauer closed this in #7827 Sep 17, 2023
