What happened?
I expected combine_by_coords to explicitly reject all cases where the coordinates overlap, producing a ValueError in cases like the following with coords [0, 1] and [1, 2]:
a = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [0, 1]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [1, 2]})
xarray.combine_by_coords([a, b])
As well as cases with larger overlaps e.g. [0, 1, 2] and [1, 2, 3].
What did you expect to happen?
It fails to reject the case above with coordinates [0, 1] and [1, 2], despite rejecting cases where the overlap is bigger (e.g. [0, 1, 2] and [1, 2, 3]). Instead it returns a DataArray with duplicate coordinates. (See example below).
Minimal Complete Verifiable Example
import numpy as np
import xarray

# This overlap is caught, as expected
a = xarray.DataArray(dims=('x',), data=np.ones((3,)), coords={'x': [0, 1, 2]})
b = xarray.DataArray(dims=('x',), data=np.ones((3,)), coords={'x': [1, 2, 3]})
xarray.combine_by_coords([a, b])
=> ValueError: Resulting object does not have monotonic global indexes along dimension x

# This overlap is not caught
a = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [0, 1]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [1, 2]})
xarray.combine_by_coords([a, b])
=> <xarray.DataArray (x: 4)>
array([1., 1., 1., 1.])
Coordinates:
  * x        (x) int64 0 1 1 2
MVCE confirmation
Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
As far as I can tell this happens because indexes.is_monotonic_increasing or indexes.is_monotonic_decreasing are not checking for strict monotonicity and allow consecutive values to be the same.
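For reference, xarray's dimension coordinates are backed by pandas indexes, so the non-strict behaviour is easy to see in isolation (a small illustration, not xarray code):

import pandas as pd

idx = pd.Index([0, 1, 1, 2])  # the combined 'x' coordinate from the MVCE above
idx.is_monotonic_increasing   # True  -- the non-strict check allows ties
idx.is_unique                 # False -- a uniqueness check would flag the overlap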
I assume it wasn't intentional to allow overlaps like this. If so, do you think anyone is depending on this (I'd hope not...) and would you take a PR to fix it to produce a ValueError in this case?
If this behaviour is intentional or relied upon, could we have an option to do a strict check instead?
Also for performance reasons I'd propose to do some extra upfront checks to catch index overlap (e.g. by checking for index overlap in _infer_concat_order_from_coords), rather than doing a potentially large concat and only detecting duplicates afterwards.
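For example, something along these lines could run before any concatenation (a rough sketch only; the helper name and exact placement are hypothetical, not existing xarray internals):

import pandas as pd

def _indexes_overlap(indexes):
    # Hypothetical helper: detect overlap between the per-dataset pandas
    # indexes along one dimension before the (potentially large) concat runs.
    seen = indexes[0]
    for idx in indexes[1:]:
        if not seen.intersection(idx).empty:
            return True
        seen = seen.union(idx)
    return False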
Thanks so much for the MVCE, sleuthing, and clear issue, @mjwillson!
As far as I can tell this happens because indexes.is_monotonic_increasing or indexes.is_monotonic_decreasing are not checking for strict monotonicity and allow consecutive values to be the same.
That sounds likely to me.
I assume it wasn't intentional to allow overlaps like this.
Definitely not intentional. I'm not sure this case even occurred to me when I originally wrote combine_by_coords.
If so, do you think anyone is depending on this (I'd hope not...)
I hope not too, because it's sort of an undefined operation. The purpose of combine_by_coords is to order datasets when their coordinates give an unambiguous ordering of all data points. The case you've highlighted isn't really an unambiguous ordering of the data points.
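For contrast, the intended use is something like the following, where the coordinates do determine a single ordering (illustrative values):

import numpy as np
import xarray

a = xarray.DataArray(dims=('x',), data=np.zeros((2,)), coords={'x': [0, 1]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [2, 3]})
xarray.combine_by_coords([a, b])
# unambiguous: the result's 'x' coordinate is [0, 1, 2, 3]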
We do also state clearly in the docstring "Will attempt to order the datasets such that the values in their dimension coordinates are monotonic along all dimensions. If it cannot determine the order in which to concatenate the datasets, it will raise a ValueError." This is just making that stricter to say "monotonic and unique".
and would you take a PR to fix it to produce a ValueError in this case?
A PR would be most welcome 😄
We should also consider the behaviour in the case of repeated elements in the coordinates, e.g. this currently happens:
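A minimal sketch of that kind of case, assuming an input whose own coordinate already contains a repeated value (illustrative, not a confirmed reproduction):

import numpy as np
import xarray

# 'x' in `a` already repeats a value, yet the non-strict monotonicity
# check lets the combine go through rather than raising.
a = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [0, 0]})
b = xarray.DataArray(dims=('x',), data=np.ones((2,)), coords={'x': [1, 2]})
xarray.combine_by_coords([a, b])
# returns a DataArray whose 'x' coordinate is [0, 0, 1, 2]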
I think it's more likely that someone is relying on that behaviour though... 😕
If this behaviour is intentional or relied upon, could we have an option to do a strict check instead?
Alternatively we might consider first changing the code to raise a warning if the resulting coordinate is monotonic but non-unique, so that people are prepared when we change the actual behaviour later.
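Concretely, that transitional check might look roughly like this (a sketch only; the helper name and the combined_index/dim parameters are stand-ins, not xarray API):

import warnings
import pandas as pd

def _warn_if_duplicated(combined_index: pd.Index, dim: str) -> None:
    # Transitional-behaviour sketch: accept monotonic-but-non-unique results
    # for now, but warn that they will become a ValueError later.
    if combined_index.is_monotonic_increasing and not combined_index.is_unique:
        warnings.warn(
            f"combine_by_coords produced duplicate coordinate values along "
            f"dimension {dim!r}; this will raise a ValueError in a future version",
            FutureWarning,
        )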
Also for performance reasons I'd propose to do some extra upfront checks to catch index overlap (e.g. by checking for index overlap in _infer_concat_order_from_coords), rather than doing a potentially large concat and only detecting duplicates afterwards.
That sounds sensible.
As an aside, I wonder if we could create a hypothesis test for combine_by_coords, where we assert that any dataset that is split then combined should remain unchanged... xref #6908 and #1846
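A first cut at that property might look roughly like this (a sketch only; a real test would need a proper strategy for generating datasets rather than the fixed integer coordinate used here):

import numpy as np
import xarray as xr
from hypothesis import given, strategies as st

@given(n=st.integers(min_value=2, max_value=20), cut=st.integers(min_value=1, max_value=19))
def test_split_then_combine_by_coords_roundtrips(n, cut):
    # Property sketch: splitting along a monotonic coordinate and recombining
    # with combine_by_coords should reproduce the original dataset exactly.
    cut = min(cut, n - 1)
    ds = xr.Dataset({'v': ('x', np.arange(n, dtype=float))}, coords={'x': np.arange(n)})
    pieces = [ds.isel(x=slice(cut, None)), ds.isel(x=slice(None, cut))]  # deliberately out of order
    assert xr.combine_by_coords(pieces).identical(ds)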
Environment
xarray: 2022.06.0
pandas: 1.1.5
numpy: 1.23.2
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.2.1
Nio: None
zarr: 2.7.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.3.4
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 0.7.4
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
IPython: 3.2.3
sphinx: None