
Millisecond precision is lost on datetime64 during IO roundtrip #4045

Closed
half-adder opened this issue May 7, 2020 · 9 comments · Fixed by #4684


@half-adder

I have millisecond-resolution time data as a coordinate on a DataArray. That data loses precision when round-tripped to disk and back.

MCVE Code Sample

bug_data.p.zip

Unzip the data; this yields a pickle file.

import os
import pickle

import xarray as xr

bug_data_path = '/path/to/unzipped/bug_data.p'
tmp_path = os.path.expanduser('~/Desktop/test.nc')  # '~' is not expanded automatically

with open(bug_data_path, 'rb') as f:
    data = pickle.load(f)

selector = dict(animal=0, timepoint=0, wavelength='410', pair=0)

before_disk_ts = data.time.sel(**selector).values[()]

data.time.encoding = {'units': 'microseconds since 1900-01-01', 'calendar': 'proleptic_gregorian'}

data.to_netcdf(tmp_path)
after_disk_ts = xr.load_dataarray(tmp_path).time.sel(**selector).values[()]

print(f'before roundtrip: {before_disk_ts}')
print(f' after roundtrip: {after_disk_ts}')

output:

before roundtrip: 2017-02-22T16:24:10.586000000
after roundtrip:  2017-02-22T16:24:10.585999872

Expected Output

Before: 2017-02-22T16:24:10.586000000
After:  2017-02-22T16:24:10.586000000

Problem Description

As you can see, millisecond precision is lost in this data. (The same happens when I use milliseconds in the encoding.)

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:05:27)
[Clang 9.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 19.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.15.1
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.11.0
distributed: 2.14.0
matplotlib: 3.1.3
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: None
pytest: 5.3.5
IPython: 7.12.0
sphinx: 2.4.3

@half-adder half-adder changed the title Losing precision on datetime64 during IO roundtrip Millisecond precision is lost on datetime64 during IO roundtrip May 7, 2020
@DocOtak
Contributor

DocOtak commented May 7, 2020

This has something to do with the time values at some point being a float:

>>> import numpy as np
>>> np.datetime64("2017-02-22T16:24:10.586000000").astype("float64").astype(np.dtype('<M8[ns]'))
numpy.datetime64('2017-02-22T16:24:10.585999872')

It looks like this is happening somewhere in the cftime library.
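
For illustration, the problem is generic float64 rounding (this is not xarray's exact code path): float64 has a 53-bit significand, and a nanosecond count this far from the epoch needs about 61 bits, so the cast above cannot be exact:

>>> ns = np.datetime64("2017-02-22T16:24:10.586000000").astype("int64")
>>> ns                # nanoseconds since the epoch; needs ~61 bits
1487780650586000000
>>> 2**53             # integers above this are not all representable in float64
9007199254740992
>>> int(float(ns))    # nearest representable float64 value
1487780650585999872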

@spencerkclark
Member

Thanks for the report @half-adder.

This is indeed related to times being encoded as floats, but it is actually not cftime-related (the times here are not encoded using cftime; we only use cftime for non-standard calendars and for dates outside the nanosecond-resolution bounds).

Here's a minimal working example that illustrates the issue with the current logic in coding.times.encode_cf_datetime:

In [1]: import numpy as np; import pandas as pd

In [2]: times = pd.DatetimeIndex([np.datetime64("2017-02-22T16:27:08.732000000")])

In [3]: reference = pd.Timestamp("1900-01-01")

In [4]: units = np.timedelta64(1, "us")

In [5]: (times - reference).values[0]
Out[5]: numpy.timedelta64(3696769628732000000,'ns')

In [6]: ((times - reference) / units).values[0]
Out[6]: 3696769628732000.5

In principle, we should be able to represent the difference between this date and the reference date as an integer number of microseconds, but timedelta division produces a float. We currently try to cast these floats to integers when possible, but that's not always safe to do, e.g. in the case above.

It would be great to make roundtripping times -- particularly standard calendar datetimes like these -- more robust. It's possible we could now leverage floor division (i.e. //) of timedeltas within NumPy for this (assuming we first check that the unit conversion divisor exactly divides each timedelta; if it doesn't we'd fall back to using floats):

In [7]: ((times - reference) // units).values[0]
Out[7]: 3696769628732000

These precision issues can be tricky, however, so we'd need to think things through carefully. Even if we fixed this on the encoding side, things are converted to floats during decoding, so we'd need to make a change there too.
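
A rough sketch of that floor-division approach (encode_times_exactly is a hypothetical helper for illustration, not xarray's implementation):

import numpy as np
import pandas as pd

def encode_times_exactly(times, reference, unit):
    # Encode datetimes as integer counts of `unit` since `reference` when the
    # deltas divide exactly; otherwise fall back to (lossy) float division.
    deltas = (times - reference).values  # timedelta64[ns]
    if (deltas % unit == np.timedelta64(0, "ns")).all():
        return deltas // unit  # exact int64 counts
    return deltas / unit  # float64 fallback, may round

times = pd.DatetimeIndex([np.datetime64("2017-02-22T16:27:08.732000000")])
print(encode_times_exactly(times, pd.Timestamp("1900-01-01"), np.timedelta64(1, "us")))
# [3696769628732000]  -- exact int64, instead of the float 3696769628732000.5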

@aldanor

aldanor commented Nov 27, 2020

Just stumbled upon this as well. Internally, datetime64[ns] is simply an 8-byte int. Why on earth would it be serialized in a lossy way as a float64?...
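
For illustration, the underlying int64 does round-trip exactly in plain NumPy (generic NumPy behavior, not an xarray code path):

>>> import numpy as np
>>> i = np.datetime64("2017-02-22T16:24:10.586000000").astype("int64")
>>> i  # the underlying 8-byte integer, ns since epoch
1487780650586000000
>>> i.astype("datetime64[ns]")  # int64 round-trip is lossless
numpy.datetime64('2017-02-22T16:24:10.586000000')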

Simply setting encoding={...: {'dtype': 'int64'}} won't work either, since it then complains about serializing a float as an int.

Is there a way out of this, other than not using M8[ns] dtypes at all with xarray?

This is a huge issue, as anyone using nanosecond-precision timestamps with xarray would unknowingly and silently read wrong data after deserializing.

@spencerkclark
Member

spencerkclark commented Nov 30, 2020

> Internally, datetime64[ns] is simply an 8-byte int. Why on earth would it be serialized in a lossy way as a float64?...

The short answer is that CF conventions allow for dates to be encoded with floating point values, so we encounter that in data that xarray ingests from other sources (i.e. files that were not even produced with Python, let alone xarray). If we didn't have to worry about roundtripping files that followed those conventions, I agree we would just encode everything with nanosecond units as int64 values.

> This is a huge issue, as anyone using nanosecond-precision timestamps with xarray would unknowingly and silently read wrong data after deserializing.

Yes, I can see why this would be quite frustrating. In principle we should be able to handle this (contributions are welcome); it just has not been a priority up to this point. In my experience xarray's current encoding and decoding methods for standard calendar times work well up to at least second precision.

@dcherian
Contributor

Can we use the encoding["dtype"] field to solve this? i.e. use int64 when encoding["dtype"] is not set and use the specified value when available?

@aldanor

aldanor commented Nov 30, 2020

> In principle we should be able to handle this (contributions are welcome)

I don't mind contributing, but not knowing the netCDF stuff inside out, I'm not sure I have a good vision of the proper way to do it. My use case is very simple: I have an in-memory xr.Dataset that I want to save() and then load() without losses.

Should it just be an xr.save(..., m8=True) (or whatever that flag would be called), so that all of NumPy's M8[...] and m8[...] values are serialized transparently (as int64, that is) without passing through the whole cftime pipeline? It would then be nice, of course, if xr.load were also aware of this convention (via some special attribute or otherwise) and could convert them back, e.g. via .view('M8[ns]'), when loading. I think xarray should also throw an exception if it detects timestamps/timedeltas of nanosecond precision that it can't serialize without going through the int-float-int routine (or automatically fall back to this transparent but netCDF-incompatible mode).

Maybe this is not the proper way to do it - ideas welcome (there's also an open PR - #4400 - mind checking that out?)

@aldanor

aldanor commented Nov 30, 2020

> Can we use the encoding["dtype"] field to solve this? i.e. use int64 when encoding["dtype"] is not set and use the specified value when available?

I think a lot of logic needs to be reshuffled, because as of right now it complains "you can't store a float64 in int64" (or something along those lines) when you try that with a nanosecond timestamp.

@dcherian
Contributor

I would look here:

class CFDatetimeCoder(VariableCoder):
    def __init__(self, use_cftime=None):
        self.use_cftime = use_cftime

    def encode(self, variable, name=None):
        dims, data, attrs, encoding = unpack_for_encoding(variable)
        if np.issubdtype(data.dtype, np.datetime64) or contains_cftime_datetimes(
            variable
        ):
            (data, units, calendar) = encode_cf_datetime(
                data, encoding.pop("units", None), encoding.pop("calendar", None)
            )
            safe_setitem(attrs, "units", units, name=name)
            safe_setitem(attrs, "calendar", calendar, name=name)

        return Variable(dims, data, attrs, encoding)

    def decode(self, variable, name=None):
        dims, data, attrs, encoding = unpack_for_decoding(variable)

        if "units" in attrs and "since" in attrs["units"]:
            units = pop_to(attrs, encoding, "units")
            calendar = pop_to(attrs, encoding, "calendar")
            dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
            transform = partial(
                decode_cf_datetime,
                units=units,
                calendar=calendar,
                use_cftime=self.use_cftime,
            )
            data = lazy_elemwise_func(data, transform, dtype)

        return Variable(dims, data, attrs, encoding)

@spencerkclark
Member

@half-adder I've verified that #4684 fixes your initial issue. Note, however, that outside of the time you referenced, your Dataset contained times that required nanosecond precision, e.g.:

>>> data.time.isel(animal=0, timepoint=0, pair=-1, wavelength=0)
<xarray.DataArray 'time' ()>
array('2017-02-22T16:24:14.722999999', dtype='datetime64[ns]')
Coordinates:
    wavelength         <U3 '410'
    strain             object 'HD233'
    stage_x            float64 1.64e+04
    stage_y            float64 -429.0
    stage_z            float64 2.155e+04
    bin_x              float64 4.0
    bin_y              float64 4.0
    exposure           float64 90.0
    mvmt-anterior      uint8 0
    mvmt-posterior     uint8 0
    mvmt-sides_of_tip  uint8 0
    mvmt-tip           uint8 0
    experiment_id      object '2017_02_22-HD233_SAY47'
    time               datetime64[ns] 2017-02-22T16:24:14.722999999
    animal_            uint64 0

So in order for things to be round-tripped accurately, you will need to override the original units in the dataset with nanoseconds instead of microseconds. This was not possible before, but it is now with #4684.

>>> data.time.encoding["units"] = "nanoseconds since 1900-01-01"

With #4684 you could also simply delete the original units, and xarray will then automatically choose appropriate units so that the datetimes can be serialized with int64 values (and hence round-tripped exactly).

>>> del data.time.encoding["units"]
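
A minimal end-to-end check (a sketch assuming an xarray version that includes #4684; the file name and variable are arbitrary):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"time": ("x", np.array(["2017-02-22T16:24:14.722999999"], dtype="datetime64[ns]"))}
)
ds.time.encoding["units"] = "nanoseconds since 1900-01-01"  # or omit units entirely
ds.to_netcdf("roundtrip.nc")
assert (xr.load_dataset("roundtrip.nc").time.values == ds.time.values).all()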
