combine_by_coords with loaded cftime variable demotes dtype #168

Closed
TomNicholas opened this issue Jul 1, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@TomNicholas
Member

TomNicholas commented Jul 1, 2024

I realized that xr.combine_by_coords should actually already work fine - if you are willing to load the relevant coordinates into memory (and therefore also have those values saved into the resultant references on-disk).

In [1]: from virtualizarr import open_virtual_dataset

In [2]: import xarray as xr

In [3]: ds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [4]: ds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [5]: xr.combine_by_coords([ds2, ds1], coords='minimal', compat='override')
Out[5]: 
<xarray.Dataset> Size: 8MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
Data variables:
    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

The only issue with this result is that the dtype of time has somehow been changed from datetime64[ns] to float32.
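A quick way to confirm this kind of demotion is to inspect the coordinate's dtype directly. The sketch below uses only NumPy; the helper name `is_datetime_like` and the sample values are illustrative assumptions and are not part of virtualizarr or xarray.

```python
import numpy as np


def is_datetime_like(values) -> bool:
    """Return True if `values` has a NumPy datetime64 dtype (hypothetical helper)."""
    return bool(np.issubdtype(np.asarray(values).dtype, np.datetime64))


# A decoded time coordinate, as expected after loading with cftime decoding.
decoded = np.array(["2013-01-01T00:00", "2013-01-01T06:00"], dtype="datetime64[ns]")

# A demoted coordinate, resembling the float32 values in the repr above
# (the numbers are made up for illustration).
demoted = np.array([1.867e6, 1.867e6], dtype="float32")

print(is_datetime_like(decoded))
print(is_datetime_like(demoted))
```

With a real dataset, the same check could presumably be applied to `ds["time"].values` before and after combining.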


This approach doesn't solve the original issue, but it might be fine in a lot of cases. xr.combine_by_coords can only auto-order along dimensions that have a 1D coordinate, and 1D variables are small, so if they are split across many files it's likely that you wanted to include them in loadable_variables anyway.
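For reference, the auto-ordering behaviour can be seen with ordinary in-memory datasets, where the time coordinate keeps its datetime64 dtype. This is a minimal sketch, not a reproduction of the bug: `make_ds` is a hypothetical helper, the dates and values are invented, and no virtual (ManifestArray-backed) variables are involved.

```python
import numpy as np
import xarray as xr


def make_ds(start: str, periods: int = 4) -> xr.Dataset:
    """Build a tiny in-memory dataset with a 6-hourly datetime64 time coordinate."""
    time = np.datetime64(start, "ns") + np.arange(periods) * np.timedelta64(6, "h")
    air = np.arange(periods, dtype="int16")
    return xr.Dataset({"air": (("time",), air)}, coords={"time": time})


first = make_ds("2013-01-01")
second = make_ds("2013-01-02")

# Pass the datasets in the "wrong" order; combine_by_coords sorts them
# using the 1D time coordinate.
combined = xr.combine_by_coords([second, first], coords="minimal", compat="override")

print(combined["time"].dtype)
print(combined["time"].values[[0, -1]])
```

Here the combined time coordinate should come out monotonically increasing and still datetime-typed, which is the behaviour one would also expect for loaded coordinates of virtual datasets.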

Originally posted by @TomNicholas in #18 (comment)

@TomNicholas added the bug label (Something isn't working) on Jul 1, 2024
@TomNicholas
Member Author

@jsignell not sure if this is a bug with the loading of cftime variables, the implementation of xr.combine_by_coords, or somewhere else.

@TomNicholas changed the title from "combine_by_coords with loaded cftime variable" to "combine_by_coords with loaded cftime variable demotes dtype" on Jul 1, 2024
@TomNicholas
Member Author

Okay so this happens with xr.concat too

In [11]: ds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [12]: ds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [13]: xr.concat([ds1, ds2], coords='minimal', compat='override', dim='time')
Out[13]: 
<xarray.Dataset> Size: 8MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
Data variables:
    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

but only because I didn't pass indexes={}. Concat works fine if you don't create indexes:

In [8]: ds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'], indexes={})

In [9]: ds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'], indexes={})

In [10]: xr.concat([ds1, ds2], coords='minimal', compat='override', dim='time')
Out[10]: 
<xarray.Dataset> Size: 8MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
    time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
Data variables:
    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

This almost certainly does just link back to #18 then.

@norlandrhagen
Collaborator

Ran into this same issue:

from virtualizarr import open_virtual_dataset

cf_time_var = open_virtual_dataset('https://climate.northwestknowledge.net/TERRACLIMATE-DATA/TerraClimate_tmax_1958.nc', loadable_variables=['time','lat','lon','crs'], cftime_variables=['time'], indexes={})

no_cf_time_var = open_virtual_dataset('https://climate.northwestknowledge.net/TERRACLIMATE-DATA/TerraClimate_tmax_1958.nc', loadable_variables=['time','lat','lon','crs'], indexes={})
print(cf_time_var.xindexes)

Indexes:
    lat      PandasIndex
    lon      PandasIndex
    crs      PandasIndex
print(no_cf_time_var.xindexes)

Indexes:
    lat      PandasIndex
    lon      PandasIndex
    time     PandasIndex
    crs      PandasIndex

It seems like specifying cftime_variables=['time'] drops the time index.
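A comparison like the one above can be automated by diffing the index names of two datasets. The following is a sketch under stated assumptions: `missing_indexes` is a hypothetical helper, and a small in-memory dataset with its time index dropped via `drop_indexes` stands in for the two virtual datasets.

```python
import numpy as np
import xarray as xr


def missing_indexes(reference: xr.Dataset, candidate: xr.Dataset) -> list[str]:
    """Names that have an index in `reference` but not in `candidate`."""
    return sorted(set(reference.xindexes) - set(candidate.xindexes))


# Illustrative coordinates only; these are not the TerraClimate values.
time = np.datetime64("1958-01-01", "ns") + np.arange(3) * np.timedelta64(1, "D")
with_index = xr.Dataset(coords={"time": time, "lat": [10.0, 20.0]})

# Simulate the reported symptom: the time coordinate remains, its index does not.
without_index = with_index.drop_indexes("time")

print(missing_indexes(with_index, without_index))
```

Applied to the two datasets opened above, `missing_indexes(no_cf_time_var, cf_time_var)` would be expected to report the time index if the behaviour is as described.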

@norlandrhagen
Collaborator

#232
