Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve nanosecond resolution when encoding/decoding times #7827

Merged
merged 26 commits into from
Sep 17, 2023

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented May 8, 2023

Closes #7098 as not needed, but worked around to preserve the fast int-based timedelta calculation.

@kmuehlbauer
Copy link
Contributor Author

@spencerkclark I'd appreciate if you could have a look here. All but one test pass, but I can't immediately see what that test is doing. Looks like mismatched dtypes on the attributes. If you have any suggestions how to possibly improve, please let me know. I've not added tests here, yet.

@kmuehlbauer kmuehlbauer force-pushed the nanoseconds-resolution branch from 848ac09 to bb99536 Compare May 8, 2023 19:57
@kmuehlbauer
Copy link
Contributor Author

I've reset the order of coders to the initial behaviour. Instead the times are special cased in the CFMaskCoder. Locally it works, but I'll only trust the CI.

@kmuehlbauer kmuehlbauer force-pushed the nanoseconds-resolution branch from 85c318c to b304aa0 Compare May 8, 2023 20:18
@kmuehlbauer
Copy link
Contributor Author

All tests have passed. Rebased now on latest main. The issue described in #7817 is resolved. Ready for first reviews.

@kmuehlbauer kmuehlbauer marked this pull request as ready for review May 8, 2023 20:22
@spencerkclark
Copy link
Member

Thanks @kmuehlbauer -- I just wanted to give you a heads up that I'm pretty busy this week. Hopefully I'll get a free moment to look at this more closely next week.

@kmuehlbauer
Copy link
Contributor Author

Thanks for the heads-up, @spencerkclark. No worries, I need to apply some changes anyway as it turns out.

@kmuehlbauer kmuehlbauer force-pushed the nanoseconds-resolution branch from 5648c0f to 584b46e Compare May 9, 2023 14:01
@kmuehlbauer kmuehlbauer force-pushed the nanoseconds-resolution branch from 584b46e to 8750475 Compare May 10, 2023 20:17
@kmuehlbauer
Copy link
Contributor Author

@dcherian You were right from the beginning, changing order for decoding and handling _FillValue in CFDatetimeCoder seems to be one working solution with minimal code changes.

If the CI is happy I'll add tests to cover for the nanosecond issues in #7817.

xarray/coding/times.py Outdated Show resolved Hide resolved
Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kmuehlbauer -- I'm catching up here a little bit. I think if we want to swap the order of decoding datetimes / timedeltas with decoding the mask information, we will also need to decode the _FillValue into something that has a datetime or timedelta type. We may not have great test coverage in this area.

As a concrete example I'm thinking about something like the following:

>>> import numpy as np; import xarray as xr
>>> times = [np.datetime64("2000-01-01"), np.datetime64("NaT")]
>>> da = xr.DataArray(times, dims=["time"], name="foo")
>>> da.encoding["dtype"] = np.float64
>>> da.encoding["_FillValue"] = 20.0
>>> da.to_dataset().to_netcdf("test-encode.nc")

On the current main branch we get the following behavior:

>>> xr.open_dataset("test-encode.nc").foo.values
array(['2000-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
>>> xr.open_dataset("test-encode.nc", decode_cf=False).foo.values
array([ 0., 20.])

With this branch we get:

>>> xr.open_dataset("test-encode.nc").foo.values
array(['2000-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
>>> xr.open_dataset("test-encode.nc", decode_cf=False).foo.values
array([ 0., nan])

Note the difference in the case where decode_cf=False. It seems like with this branch NaN is written out to disk when the fill value prescribes 20.0, which is incorrect. I think maybe the fact that NaN is being written out to disk is hiding the issue that the _FillValue needs to be decoded?

(There is probably a way to test this without doing I/O, but this was the quickest example that came to mind).

@kmuehlbauer
Copy link
Contributor Author

Thanks @spencerkclark for taking the time. NaN has been written to disk (as you assumed). Let's have another try next week.

@kmuehlbauer
Copy link
Contributor Author

@spencerkclark With current master I get the following RuntimeWarning running your code example:

  • on encoding (calling to_netcdf()):
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:618: RuntimeWarning: invalid value encountered in cast
  int_num = np.asarray(num, dtype=np.int64)
  • on decoding (calling open_dataset()):
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(

The latter was discussed in #7098 (casting float64 to int64), the former was aimed to be resolved with this PR.

I'll try to create a test case using Variable and the respective encoding/decoding functions without involving IO (per your suggestion @spencerkclark).

@kmuehlbauer
Copy link
Contributor Author

The example below is only based on Variable and the cf encode/decode variable functions.

import xarray as xr
import numpy as np

# create DataArray
times = [np.datetime64("2000-01-01", "ns"), np.datetime64("NaT")]
da = xr.DataArray(times, dims=["time"], name="foo")
da.encoding["dtype"] = np.float64
da.encoding["_FillValue"] = 20.0

# extract Variable
source_var = da.variable
print("---------- source_var ------------------")
print(source_var)
print(source_var.encoding)

# encode Variable
encoded_var = xr.conventions.encode_cf_variable(source_var)
print("\n---------- encoded_var ------------------")
print(encoded_var)

# decode Variable
decoded_var = xr.conventions.decode_cf_variable("foo", encoded_var)
print("\n---------- decoded_var ------------------")
print(decoded_var.load())
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:618: RuntimeWarning: invalid value encountered in cast
  int_num = np.asarray(num, dtype=np.int64)
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
/home/kai/miniconda/envs/xarray_311/lib/python3.11/site-packages/xarray/coding/times.py:254: RuntimeWarning: invalid value encountered in cast
  flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(

---------- source_var ------------------
<xarray.Variable (time: 2)>
array(['2000-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
{'dtype': <class 'numpy.float64'>, '_FillValue': 20.0}
dtype num float64

---------- encoded_var ------------------
<xarray.Variable (time: 2)>
array([ 0., 20.])
Attributes:
    units:       days since 2000-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20.0

---------- decoded_var ------------------
<xarray.Variable (time: 2)>
array(['2000-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
{'_FillValue': 20.0, 'units': 'days since 2000-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('float64')}

@spencerkclark
Copy link
Member

Great, yeah, that's a nice example without writing to disk. Indeed I saw those warnings too, but omitted them in my earlier message to focus on the encoding issue (sorry about that). I agree that these are something we should address.

@spencerkclark
Copy link
Member

One other tricky edge case that occurs to me is one where an extreme fill value (e.g. 1e30) is used for floating point fields. If we decode the times first, it might appear that the dates cannot be represented as nanosecond-precision values, but in reality they would be. We may need to think more about how to handle this edge case in addition to #7817.

Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful @kmuehlbauer -- I appreciate those additional tests, the additional warning, and the further cleanup. One more suggestion for a minor tweak to a test, and I think this is ready for a what's new entry. Huge thanks for taking on this gnarly issue!

xarray/tests/test_coding_times.py Outdated Show resolved Hide resolved
xarray/tests/test_coding_times.py Outdated Show resolved Hide resolved
xarray/backends/netcdf3.py Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

Thanks all for pushing this to a mergeable state. @spencerkclark 👍 💯

@kmuehlbauer kmuehlbauer added run-upstream Run upstream CI run-benchmark Run the ASV benchmark workflow labels Sep 13, 2023
doc/whats-new.rst Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
@kmuehlbauer kmuehlbauer added plan to merge Final call for comments and removed run-benchmark Run the ASV benchmark workflow labels Sep 16, 2023
@dcherian
Copy link
Contributor

Amazing work, @kmuehlbauer and @spencerkclark . Thanks!

@kmuehlbauer kmuehlbauer merged commit b6d4bf7 into pydata:main Sep 17, 2023
22 of 24 checks passed
@kmuehlbauer kmuehlbauer deleted the nanoseconds-resolution branch September 17, 2023 08:15
max-sixty pushed a commit to max-sixty/xarray that referenced this pull request Sep 17, 2023
)

* preserve nanosecond resolution when encoding/decoding times.

* Apply suggestions from code review

Co-authored-by: Spencer Clark <[email protected]>

* use emit_user_level_warning

* move time alignment for nc3 to encode_nc3_variable

* fix test for encode_cf_timedelta

* fix CFMaskCoder for time-like (also allow timedelta64), add first tests

* rename to _unpack_time_units_and_ref_date as suggested in review

* refactor delta -> time_units as suggested in review

* refactor out function _time_units_to_timedelta64, reorder flow and remove unneeded checks, apply filterwarnings, adapt tests

* import _is_time_like from coding.variables

* adapt tests, add _numpy_to_netcdf_timeunit-conversion function

* adapt tests, add _numpy_to_netcdf_timeunit-conversion function

* adapt test as per review, remove arm_xfail for backend test

* add whats-new.rst entry

* Update doc/whats-new.rst

Co-authored-by: Spencer Clark <[email protected]>

* Update doc/whats-new.rst

Co-authored-by: Spencer Clark <[email protected]>

* fix whats-new.rst

---------

Co-authored-by: Spencer Clark <[email protected]>
@meteoDaniel
Copy link

Awesome contribute @kmuehlbauer :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment