trouble slicing on sub-minute dataset #5012

Open
russ-schumacher opened this issue Mar 8, 2021 · 2 comments

@russ-schumacher

I have two netCDF datasets whose primary difference is the time resolution: one has data every minute, the other every 30 seconds. When I subset the 1-minute dataset with a time slice, it works fine, but when I do the same for the 30-second dataset it throws an error.

Two minimal netCDF files are in the attached archive: the one with "works" in the name has the 1-minute frequency, and the other has the 30-second frequency.
parcels.zip

What happened:

import xarray as xr
ds = xr.open_dataset("parcel_example.nc")
print(ds)
<xarray.Dataset>
Dimensions:  (time: 293, yh: 1, zh: 1)
Coordinates:
    xh       float32 ...
  * yh       (yh) float32 0.0
  * zh       (zh) float32 0.0
  * time     (time) timedelta64[ns] 00:00:00 00:00:30 ... 02:14:30 02:15:00
Data variables: (12/26)
    x        (time) float32 ...
    y        (time) float32 ...
    z        (time) float32 ...
    u        (time) float32 ...
    v        (time) float32 ...
    w        (time) float32 ...
    ...       ...
print(ds.sel(time=slice("01:00:00","01:15:00")))

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/cm1/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.DatetimeEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._maybe_get_bool_indexer()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._unpack_bool_indexer()

KeyError: 4559999999999

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-16-c208cd704d0a> in <module>
----> 1 print(ds.sel(time=slice("01:00:00","01:15:00")))

~/miniconda3/envs/cm1/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   2228         """
   2229         indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2230         pos_indexers, new_indexes = remap_label_indexers(
   2231             self, indexers=indexers, method=method, tolerance=tolerance
   2232         )

~/miniconda3/envs/cm1/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
    414     }
    415 
--> 416     pos_indexers, new_indexes = indexing.remap_label_indexers(
    417         obj, v_indexers, method=method, tolerance=tolerance
    418     )

~/miniconda3/envs/cm1/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
    268             coords_dtype = data_obj.coords[dim].dtype
    269             label = maybe_cast_to_coords_dtype(label, coords_dtype)
--> 270             idxr, new_idx = convert_label_indexer(index, label, dim, method, tolerance)
    271             pos_indexers[dim] = idxr
    272             if new_idx is not None:

~/miniconda3/envs/cm1/lib/python3.8/site-packages/xarray/core/indexing.py in convert_label_indexer(index, label, index_name, method, tolerance)
    119                 "cannot use ``method`` argument if any indexers are slice objects"
    120             )
--> 121         indexer = index.slice_indexer(
    122             _sanitize_slice_element(label.start),
    123             _sanitize_slice_element(label.stop),

~/miniconda3/envs/cm1/lib/python3.8/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
   5275         slice(1, 3, None)
   5276         """
-> 5277         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   5278 
   5279         # return a slice

~/miniconda3/envs/cm1/lib/python3.8/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
   5480         end_slice = None
   5481         if end is not None:
-> 5482             end_slice = self.get_slice_bound(end, "right", kind)
   5483         if end_slice is None:
   5484             end_slice = len(self)

~/miniconda3/envs/cm1/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   5394             except ValueError:
   5395                 # raise the original KeyError
-> 5396                 raise err
   5397 
   5398         if isinstance(slc, np.ndarray):

~/miniconda3/envs/cm1/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   5388         # we need to look up the label
   5389         try:
-> 5390             slc = self.get_loc(label)
   5391         except KeyError as err:
   5392             try:

~/miniconda3/envs/cm1/lib/python3.8/site-packages/pandas/core/indexes/timedeltas.py in get_loc(self, key, method, tolerance)
    198             raise KeyError(key) from err
    199 
--> 200         return Index.get_loc(self, key, method, tolerance)
    201 
    202     def _maybe_cast_slice_bound(self, label, side: str, kind):

~/miniconda3/envs/cm1/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: Timedelta('0 days 01:15:59.999999999')
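The two keys in the traceback are the same instant: 4559999999999 ns is 01:15:59.999999999, one nanosecond short of 01:16:00. A minimal check of that arithmetic is sketched below. It suggests that the string "01:15:00" was expanded to the end of its minute before the lookup, but that reading is an assumption about pandas internals and is not stated anywhere in this thread.

```python
import pandas as pd

# Integer key from the inner KeyError, interpreted as nanoseconds
raw_key = pd.Timedelta(4559999999999, unit="ns")

# Key reported by the outer KeyError
reported = pd.Timedelta("0 days 01:15:59.999999999")

# Requested stop bound, pushed to the last nanosecond of its minute
# (assumed interpretation, see the note above)
end_of_minute = (
    pd.Timedelta("01:15:00") + pd.Timedelta(minutes=1) - pd.Timedelta(1, unit="ns")
)

print(raw_key == reported)
print(reported == end_of_minute)
```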

What you expected to happen:

ds = xr.open_dataset("parcel_example_works.nc")
print(ds)
<xarray.Dataset>
Dimensions:  (time: 136, yh: 1, zh: 1)
Coordinates:
    xh       float32 ...
  * yh       (yh) float32 0.0
  * zh       (zh) float32 0.0
  * time     (time) timedelta64[ns] 00:00:00 00:01:00 ... 02:14:00 02:15:00
Data variables: (12/28)
    x        (time) float32 ...
    y        (time) float32 ...
    z        (time) float32 ...
    u        (time) float32 ...
    v        (time) float32 ...
    w        (time) float32 ...
    ...       ...
print(ds.sel(time=slice("01:00:00","01:15:00")))

<xarray.Dataset>
Dimensions:  (time: 16, yh: 1, zh: 1)
Coordinates:
    xh       float32 1.0
  * yh       (yh) float32 0.0
  * zh       (zh) float32 0.0
  * time     (time) timedelta64[ns] 01:00:00 01:01:00 ... 01:14:00 01:15:00
Data variables: (12/28)
    x        (time) float32 ...
    y        (time) float32 ...
    z        (time) float32 ...
    u        (time) float32 ...
    v        (time) float32 ...
    w        (time) float32 ...
    ...       ...
    b        (time) float32 ...
    vpg      (time) float32 ...
    zvort    (time) float32 ...
    rho      (time) float32 ...
    qsl      (time) float32 ...
    qsi      (time) float32 ...
Attributes:
    CM1 version:    cm1r20.1
    Conventions:    CF-1.7
    missing_value:  -999999.9

Anything else we need to know?:

I can do what I need to do by using ds.where and other functions, so this isn't critical, but I thought I'd pass it along in case it is a broader issue. At first I thought it was the known timestamp problem described in issues like #4045, but the problem persisted after I upgraded xarray and pandas.
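A minimal sketch of the kind of ds.where workaround mentioned above. It assumes a small synthetic dataset: the variable x and the 30-second time axis with two swapped entries are hypothetical stand-ins, not the contents of the attached file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the 30-second file: two entries are swapped
# so the time coordinate is not sorted.
seconds = np.array([0, 30, 90, 60, 120, 150])
ds = xr.Dataset(
    {"x": ("time", np.arange(seconds.size, dtype="float32"))},
    coords={"time": pd.to_timedelta(seconds, unit="s")},
)

# Window bounds as numpy timedelta64 values
start = pd.Timedelta("00:00:30").to_timedelta64()
stop = pd.Timedelta("00:01:30").to_timedelta64()

# Element-wise comparison on the coordinate values; no index lookup is
# involved, so the order of the time axis does not matter.
in_window = (ds.time >= start) & (ds.time <= stop)
subset = ds.where(in_window, drop=True)

print(subset.time.values)
```

The selected points keep their original (unsorted) order, so a follow-up sort may still be wanted before plotting or differencing.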

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1127.18.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.17.0
pandas: 1.2.3
numpy: 1.20.1
scipy: 1.6.0
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.26.0
distributed: 2.26.0
matplotlib: 3.3.2
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: 0.16.1
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: None
IPython: 7.21.0
sphinx: None

@jthielen
Contributor

jthielen commented Mar 9, 2021

I'll leave the exact issue up to someone more acquainted with string and slice handling on TimedeltaIndex, but in case it is useful, here are two things I noticed (to hopefully speed along the troubleshooting):

  1. Using pandas Timedeltas instead of strings (i.e., ds.sel(time=slice(pd.Timedelta("01:00:00"), pd.Timedelta("01:15:00")))) works, giving
<xarray.Dataset>
Dimensions:  (time: 39, yh: 1, zh: 1)
Coordinates:
    xh       float32 ...
  * yh       (yh) float32 0.0
  * zh       (zh) float32 0.0
  * time     (time) timedelta64[ns] 01:00:00 01:07:30 ... 01:14:30 01:15:00
Data variables: (12/26)
    x        (time) float32 ...
    y        (time) float32 ...
    z        (time) float32 ...
    u        (time) float32 ...
    v        (time) float32 ...
    w        (time) float32 ...
    ...       ...
    b        (time) float32 ...
    vpg      (time) float32 ...
    zvort    (time) float32 ...
    rho      (time) float32 ...
    qsl      (time) float32 ...
    qsi      (time) float32 ...
Attributes:
    CM1 version:    cm1r20.1
    Conventions:    CF-1.7
    missing_value:  -999999.9

However, notice the jump in the time coordinate (01:00:00 followed by 01:07:30), which leads to

  2. In the failing example, the time coordinate is non-monotonic. If we .sortby('time'), then ds.sel(time=slice("01:00:00","01:15:00")) works as expected:
<xarray.Dataset>
Dimensions:  (time: 47, yh: 1, zh: 1)
Coordinates:
    xh       float32 1.0
  * yh       (yh) float32 0.0
  * zh       (zh) float32 0.0
  * time     (time) timedelta64[ns] 01:00:00 01:00:30 ... 01:15:00 01:15:30
Data variables: (12/26)
    x        (time) float32 ...
    y        (time) float32 ...
    z        (time) float32 ...
    u        (time) float32 ...
    v        (time) float32 ...
    w        (time) float32 ...
    ...       ...
    b        (time) float32 ...
    vpg      (time) float32 ...
    zvort    (time) float32 ...
    rho      (time) float32 ...
    qsl      (time) float32 ...
    qsi      (time) float32 ...
Attributes:
    CM1 version:    cm1r20.1
    Conventions:    CF-1.7
    missing_value:  -999999.9

(the working example also has a monotonic time coordinate)

And so, I would guess this is an example of string-based timedelta selection failing on non-monotonic coordinates.
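The diagnosis above can be checked with a short sketch. It assumes a synthetic dataset with a deliberately shuffled 30-second time axis (hypothetical, not the attached file), tests monotonicity through the pandas index, and sorts before slicing. Timedelta objects are used for the bounds so the example does not depend on how strings are parsed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: a 30-second time axis with two entries out of order
seconds = np.array([0, 30, 90, 60, 120, 150])
ds = xr.Dataset(
    {"x": ("time", np.arange(seconds.size, dtype="float32"))},
    coords={"time": pd.to_timedelta(seconds, unit="s")},
)

# The pandas index behind the coordinate reports whether it is sorted
time_index = ds.indexes["time"]
print("monotonic before sort:", time_index.is_monotonic_increasing)

# Sort along time, then select a label-based window
ds_sorted = ds.sortby("time")
print("monotonic after sort:", ds_sorted.indexes["time"].is_monotonic_increasing)

window = slice(pd.Timedelta("00:00:30"), pd.Timedelta("00:01:30"))
subset = ds_sorted.sel(time=window)
print(subset.time.values)
```

Checking is_monotonic_increasing right after opening a file is a cheap way to catch this kind of ordering problem before any slicing is attempted.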

@russ-schumacher
Author

Good catch! I'm not sure why the model would have written it out this way, but the original dataset (along with the small subset I wrote out and attached) is indeed non-monotonic in the non-working example. That is strange, but it is almost certainly the reason for the error. This will help me do what I need to do, and it may be too much of an edge case to be worth a bigger fix.
