What happened?
I expected combine_by_coords to explicitly reject all cases where the coordinates overlap, producing a ValueError in cases like the following with coords [0, 1] and [1, 2]:
a = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [0, 1]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [1, 2]})
xarray.combine_by_coords([a, b])
As well as cases with larger overlaps e.g. [0, 1, 2] and [1, 2, 3].
What did you expect to happen?
It fails to reject the case above with coordinates [0, 1] and [1, 2], despite rejecting cases where the overlap is bigger (e.g. [0, 1, 2] and [1, 2, 3]). Instead it returns a DataArray with duplicate coordinates. (See example below).
Minimal Complete Verifiable Example
import numpy as np
import xarray

# This overlap is caught, as expected
a = xarray.DataArray(dims=('x',), data=np.ones((3,)), coords={'x': [0, 1, 2]})
b = xarray.DataArray(dims=('x',), data=np.ones((3,)), coords={'x': [1, 2, 3]})
xarray.combine_by_coords([a, b])
=> ValueError: Resulting object does not have monotonic global indexes along dimension x

# This overlap is not caught
a = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [0, 1]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [1, 2]})
xarray.combine_by_coords([a, b])
=> <xarray.DataArray (x: 4)>
array([1., 1., 1., 1.])
Coordinates:
  * x        (x) int64 0 1 1 2
MVCE confirmation
Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
As far as I can tell this happens because indexes.is_monotonic_increasing or indexes.is_monotonic_decreasing are not checking for strict monotonicity and allow consecutive values to be the same.
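For reference, xarray's dimension coordinates are backed by pandas indexes, so the non-strict behaviour is easy to see in isolation (a small illustration, not xarray code):

import pandas as pd

idx = pd.Index([0, 1, 1, 2])  # the combined 'x' coordinate from the MVCE above
idx.is_monotonic_increasing   # True  -- the non-strict check allows ties
idx.is_unique                 # False -- a uniqueness check would flag the overlap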
I assume it wasn't intentional to allow overlaps like this. If so, do you think anyone is depending on this (I'd hope not...) and would you take a PR to fix it to produce a ValueError in this case?
If this behaviour is intentional or relied upon, could we have an option to do a strict check instead?
Also for performance reasons I'd propose to do some extra upfront checks to catch index overlap (e.g. by checking for index overlap in _infer_concat_order_from_coords), rather than doing a potentially large concat and only detecting duplicates afterwards.
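For example, something along these lines could run before any concatenation (a rough sketch only; the helper name and exact placement are hypothetical, not existing xarray internals):

import pandas as pd

def _indexes_overlap(indexes):
    # Hypothetical helper: detect overlap between the per-dataset pandas
    # indexes along one dimension before the (potentially large) concat runs.
    seen = indexes[0]
    for idx in indexes[1:]:
        if not seen.intersection(idx).empty:
            return True
        seen = seen.union(idx)
    return False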
Thanks so much for the MVCE, sleuthing, and clear issue, @mjwillson!
As far as I can tell this happens because indexes.is_monotonic_increasing or indexes.is_monotonic_decreasing are not checking for strict monotonicity and allow consecutive values to be the same.
That sounds likely to me.
I assume it wasn't intentional to allow overlaps like this.
Definitely not intentional. I'm not sure this case even occurred to me when I originally wrote combine_by_coords.
If so, do you think anyone is depending on this (I'd hope not...)
I hope not too, because it's sort of an undefined operation. The purpose of combine_by_coords is to order datasets when their coordinates give an unambiguous ordering of all data points. The case you've highlighted isn't really an unambiguous ordering of the data points.
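For contrast, the intended use is something like the following, where the coordinates do determine a single ordering (illustrative values):

import numpy as np
import xarray

a = xarray.DataArray(dims=('x',), data=np.zeros((2,)), coords={'x': [0, 1]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [2, 3]})
xarray.combine_by_coords([a, b])
# unambiguous: the result's 'x' coordinate is [0, 1, 2, 3]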
We do also state clearly in the docstring "Will attempt to order the datasets such that the values in their dimension coordinates are monotonic along all dimensions. If it cannot determine the order in which to concatenate the datasets, it will raise a ValueError." This is just making that stricter to say "monotonic and unique".
and would you take a PR to fix it to produce a ValueError in this case?
A PR would be most welcome 😄
We should also consider the behaviour in the case of repeated elements in the coordinates, e.g. this currently happens:
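A minimal sketch of that kind of case, assuming an input whose own coordinate already contains a repeated value (illustrative, not a confirmed reproduction):

import numpy as np
import xarray

# 'x' in `a` already repeats a value, yet the non-strict monotonicity
# check lets the combine go through rather than raising.
a = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [0, 0]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [1, 2]})
xarray.combine_by_coords([a, b])
# returns a DataArray whose 'x' coordinate is [0, 0, 1, 2]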
I think it's more likely that someone is relying on that behaviour though... 😕
If this behaviour is intentional or relied upon, could we have an option to do a strict check instead?
Alternatively we might consider first changing the code to raise a warning if the resulting coordinate is monotonic but non-unique, so that people are prepared when we change the actual behaviour later.
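Concretely, that transitional check might look roughly like this (a sketch only; the helper name and the combined_index/dim parameters are stand-ins, not xarray API):

import warnings
import pandas as pd

def _warn_if_duplicated(combined_index: pd.Index, dim: str) -> None:
    # Transitional-behaviour sketch: accept monotonic-but-non-unique results
    # for now, but warn that they will become a ValueError later.
    if combined_index.is_monotonic_increasing and not combined_index.is_unique:
        warnings.warn(
            f"combine_by_coords produced duplicate coordinate values along "
            f"dimension {dim!r}; this will raise a ValueError in a future version",
            FutureWarning,
        )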
Also for performance reasons I'd propose to do some extra upfront checks to catch index overlap (e.g. by checking for index overlap in _infer_concat_order_from_coords), rather than doing a potentially large concat and only detecting duplicates afterwards.
That sounds sensible.
As an aside, I wonder if we could create a hypothesis test for combine_by_coords, where we assert that any dataset that is split then combined should remain unchanged... xref #6908 and #1846
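A first cut at that property might look roughly like this (a sketch only; a real test would need a proper strategy for generating datasets rather than the fixed integer coordinate used here):

import numpy as np
import xarray as xr
from hypothesis import given, strategies as st

@given(n=st.integers(min_value=2, max_value=20), cut=st.integers(min_value=1, max_value=19))
def test_split_then_combine_by_coords_roundtrips(n, cut):
    # Property sketch: splitting along a monotonic coordinate and recombining
    # with combine_by_coords should reproduce the original dataset exactly.
    cut = min(cut, n - 1)
    ds = xr.Dataset({'v': ('x', np.arange(n, dtype=float))}, coords={'x': np.arange(n)})
    pieces = [ds.isel(x=slice(cut, None)), ds.isel(x=slice(None, cut))]  # deliberately out of order
    assert xr.combine_by_coords(pieces).identical(ds)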
Environment
xarray: 2022.06.0
pandas: 1.1.5
numpy: 1.23.2
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.2.1
Nio: None
zarr: 2.7.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.3.4
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 0.7.4
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
IPython: 3.2.3
sphinx: None