Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assign_coords' behavior depends on input DataArrays #8180

Open
4 tasks done
yichechang opened this issue Sep 13, 2023 · 7 comments
Open
4 tasks done

assign_coords' behavior depends on input DataArrays #8180

yichechang opened this issue Sep 13, 2023 · 7 comments

Comments

@yichechang
Copy link

What happened?

I'm trying to compute masks (from DataArray's data itself) and assign them as coordinates, but it appears that depending on the combination of coords/dims of the computed masks, sometimes .assign_coords will fail.

It seems like

  • it fails when all the mask DataArray's (each is a mask computed, but it probably doesn't matter) to be assigned as coordinates, share a dimension common to the target DataArray, and the dimension contains only a singular value (across all mask DataArray's)
  • it doesn't fail when the shared dimension contains more than one value.

It's a bit hard to describe as I don't know the xarray internal itself, but my self-contained minimal example below should demonstrate the issue much clearer.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import xarray as xr

data = xr.DataArray(
    data=[
        [0, 1, 2], 
        [0, 1, 2]
    ], 
    coords={
        'd1': ['m', 'n'], 
        'd2': ['a', 'b', 'c']
    }
)


# this will fail:
data.assign_coords({'mask_d1_m': data.sel(d1='m')==0})
# ValueError: dimension 'd1' already exists as a scalar variable

# this will fail too:
data.assign_coords({'mask_d1_n': data.sel(d1='n')==0})
# ValueError: dimension 'd1' already exists as a scalar variable

# but this will work:
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
        'mask_d1_n': data.sel(d1='n')==0
    }
)
# <xarray.DataArray (d1: 2, d2: 3)>
# array([[0, 1, 2],
#        [0, 1, 2]])
# Coordinates:
#   * d1         (d1) <U1 'm' 'n'
#   * d2         (d2) <U1 'a' 'b' 'c'
#     mask_d1_m  (d2) bool True False False
#     mask_d1_n  (d2) bool True False False

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 data.assign_coords({'mask_d1_n': data.sel(d1='n')==0})

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/common.py:615, in DataWithCoords.assign_coords(self, coords, **coords_kwargs)
    613 data = self.copy(deep=False)
    614 results: dict[Hashable, Any] = self._calc_assign_results(coords_combined)
--> 615 data.coords.update(results)
    616 return data

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/coordinates.py:177, in Coordinates.update(self, other)
    173 self._maybe_drop_multiindex_coords(set(other_vars))
    174 coords, indexes = merge_coords(
    175     [self.variables, other_vars], priority_arg=1, indexes=self.xindexes
    176 )
--> 177 self._update_coords(coords, indexes)

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/coordinates.py:393, in DataArrayCoordinates._update_coords(self, coords, indexes)
    391 coords_plus_data = coords.copy()
    392 coords_plus_data[_THIS_ARRAY] = self._data.variable
--> 393 dims = calculate_dimensions(coords_plus_data)
    394 if not set(dims) <= set(self.dims):
    395     raise ValueError(
    396         "cannot add coordinates with new dimensions to a DataArray"
    397     )

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/variable.py:3209, in calculate_dimensions(variables)
   3207 for dim, size in zip(var.dims, var.shape):
   3208     if dim in scalar_vars:
-> 3209         raise ValueError(
   3210             f"dimension {dim!r} already exists as a scalar variable"
   3211         )
   3212     if dim not in dims:
   3213         dims[dim] = size

ValueError: dimension 'd1' already exists as a scalar variable

Anything else we need to know?

No response

Environment

~/mambaforge/envs/quickquant/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None
python: 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:11:32)
[Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2023.1.0
pandas: 1.5.3
numpy: 1.24.0
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.15.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2023.5.0
distributed: 2023.5.0
matplotlib: 3.7.2
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.9.0
cupy: None
pint: 0.21
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.1.2
pip: 23.2.1
conda: 23.7.3
pytest: 7.4.1
mypy: None
IPython: 8.12.2
sphinx: 4.5.0

@yichechang yichechang added bug needs triage Issue that has not been reviewed by xarray team member labels Sep 13, 2023
@welcome
Copy link

welcome bot commented Sep 13, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@yichechang yichechang changed the title assign_coords' behavior depends on assign_coords' behavior depends on input DataArray's Sep 13, 2023
@max-sixty
Copy link
Collaborator

max-sixty commented Sep 14, 2023

Thanks for the issue. What would you expect the output to be?

It does seem surprising that passing two arguments succeeds while passing each of them alone succeeds...

A partial look — adding .drop_vars('d1') makes it succeed, because the argument has a d1 dimension.

data.sel(d1='m')==0

<xarray.DataArray (d2: 3)>
array([ True, False, False])
Coordinates:
    d1       <U1 'm'
  * d2       (d2) <U1 'a' 'b' 'c'
[nav] In [36]: data.assign_coords(
          ...:     {
          ...:         'mask_d1_m': (data.sel(d1='n')==0).drop_vars('d1')
          ...: #        'mask_d1_n': data.sel(d1='n')==0
          ...:     }
          ...: )

...though I'm not sure why it succeeds when there are two arguments.

@yichechang
Copy link
Author

yichechang commented Sep 14, 2023

Hi - Thanks for the reply and testing!

I guess I would expect that all my examples would all fail, or all succeed.

Sorry I didn't include what I would expect... because I didn't really know what to expect since I haven't thought about why it failed. (But it seems like it should fail, as now it became clearer to me that my mask_d1_m should not have dimension d1?). From the error message, I would guess that doing .drop_vars('d1') would fix the problem (which is true as you demonstrated above!), but I couldn't wrap my head around why a DataArray of dim ('d1', 'd2') cannot be passed as coordinates for DataArray of dim ('d1', 'd2')?

Edit:
Also just wanted to point out that it's not just having two arguments that prevents it from failing, it has to be that d1 dim has both d1='m' and d1='n'.

import xarray as xr

data = xr.DataArray(
    data=[
        [0, 1, 2], 
        [0, 1, 2]
    ], 
    coords={
        'd1': ['m', 'n'], 
        'd2': ['a', 'b', 'c']
    }
)


# so this will FAIL
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
#                                ^^^
        'mask_d1_n': data.sel(d1='m')==1,
#                                ^^^
    }
)

# but this will SUCCEED
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
#                                ^^^
        'mask_d1_n': data.sel(d1='n')==1,
#                                ^^^
    }
)

@yichechang
Copy link
Author

Just wanted to give an update: For my own application (computing masks for a DataArray using its own data, and then attach resulting masks to the DataArray as its new coords), I should make sure my masks contain only labels for its dimensions, in the first place. So, something like

# compute mask based on my data
# mask = data.sel(d1=...) > whatever

# remove labels for extra dimensions
mask = mask.drop_vars([v for v in mask.coords if v not in mask.dims]

# assign mask as coords for the original DataArray
data = data.assign_coords({'my_mask': mask})

But the inconsistent behaviors still seem like a bug, or there's some magic happening during the process of combining multiple coords assignments, that are not explicitly documented. It is not too much of an issue, as

  1. the error messages kind of give enough hint on how to do things correctly, and
  2. it's more so just a surprise that I can quickly tell myself that "okay, I'm doing something wrong, but there was some magic that hide this when I assign coords in a certain way. fix it and move on."

Therefore, I'll leave this issue open but please feel free to close it if this isn't something to be fixed. From my point of view, the inconsistency is surprising, but not a major issue. Thank you for taking the time testing!

@keewis
Copy link
Collaborator

keewis commented Sep 17, 2023

as far as I can tell, the cause for the surprising/inconsistent behavior is that xr.core.coordinates.create_coords_with_default_indexes (or xr.core.merge.merge_coordinates_without_align, I didn't step through create_coords_with_default_indexes) drops scalar coordinates with conflicting values. cc @benbovy

In your case, I think the easiest way to work around this is to use .sel(..., drop=True) which will not keep the scalar coordinate.

@yichechang
Copy link
Author

yichechang commented Sep 17, 2023

Hi @keewis - Thank you for the explanation!

Yeah, I was digging into the codebase a little bit, but unfortunately -- as it is probably evident that I'm not that familiar with xarray's internals -- I was a bit lost. Not that I completely understand now, but I am grateful for an explanation so I know I'm not crazy 🤣

Also thank you for the recommended route with the drop kwarg in DataArray.sel method. This is more succinct and can convey my intention more explicitly. Good to learn how to do things correctly... I think I need to sit down and re-learn xarray. I've been using it for a long time, but usually I can get it to do what I want while knowing there must be something I'm still missing.

Really appreciate all of your help : )

@benbovy
Copy link
Member

benbovy commented Sep 18, 2023

Since we are relaxing the constraints that are related to dimension coordinates (e.g., #7989), I'm wondering if we couldn't also relax the case where a scalar coordinate has the same name as a dimension.

I don't think that this would help much here, though. Using .assign_coords with a dictionary of DataArray objects extracts all their coordinates and tries to merge them with the current Dataset or DataArray. In many cases this is good but sometimes this could have unwanted side effects. Using drop may help, or alternatively we can ignore all the coordinates (and keep the dimension names) by extracting the variable from the DataArray:

data.assign_coords({'mask_d1_m': (data.sel(d1='m')==0).variable})

@max-sixty max-sixty added topic-indexing and removed needs triage Issue that has not been reviewed by xarray team member labels Feb 26, 2024
@max-sixty max-sixty changed the title assign_coords' behavior depends on input DataArray's assign_coords' behavior depends on input DataArrays Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants