GroupBy of stacked dim with strings renames underlying dims #3287

Closed
chrisroat opened this issue Sep 6, 2019 · 7 comments · Fixed by #3906

Comments

@chrisroat
Contributor

Names for dimensions are lost (renamed) when they are stacked and grouped, if one of the dimensions has string coordinates.

import numpy as np
import xarray as xr

data = np.zeros((2,1,1))
dims = ['c', 'y', 'x']

d1 = xr.DataArray(data, dims=dims)
g1 = d1.stack(f=['c', 'x']).groupby('f').first()
print('Expected dim names:')
print(g1.coords)
print()

d2 = xr.DataArray(data, dims=dims, coords={'c': ['R', 'G']})
g2 = d2.stack(f=['c', 'x']).groupby('f').first()
print('Unexpected dim names:')
print(g2.coords)

Output

The levels named 'f_level_0' and 'f_level_1' in the second part below are expected to be 'c' and 'x', respectively.

Expected dim names:
Coordinates:
  * f        (f) MultiIndex
  - c        (f) int64 0 1
  - x        (f) int64 0 0

Unexpected dim names:
Coordinates:
  * f          (f) MultiIndex
  - f_level_0  (f) object 'G' 'R'
  - f_level_1  (f) int64 0 0

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (default, Jul 9 2019, 18:13:23) [Clang 10.0.1 (clang-1001.0.46.4)]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.6.3

xarray: 0.12.3
pandas: 0.25.1
numpy: 1.17.1
scipy: 1.3.1
netCDF4: 1.5.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: None
IPython: 7.8.0
sphinx: None

@spencerahill
Contributor

I just bumped into this problem as well, on xarray 0.15.0. Is this expected behavior or a bug?

@spencerahill
Contributor

Same or different problem as #1483?

@spencerahill
Contributor

Here's a quick and dirty workaround that works at least for my use case. arr_orig is the original DataArray, and arr_unstacked_bad is what comes out of running it through a stack/groupby/apply/unstack chain, which yields the _level_0 etc. dims. The stack call is assumed to have been arr_orig.stack(**{dim_of_stack: dims_stacked}). Likely excessively convoluted, and YMMV.

def fix_unstacked_dims(arr_unstacked_bad, arr_orig, dim_of_stack, dims_stacked):
    """Workaround for xarray bug involving stacking str-based coords.
    
    C.f. https://github.com/pydata/xarray/issues/3287

    """
    dims_not_stacked = [dim for dim in arr_orig.dims if dim not in dims_stacked]
    stacked_dims_after_unstack = [dim for dim in arr_unstacked_bad.dims 
                                  if dim not in dims_not_stacked]
    dims_mapping = {d1: d2 for d1, d2 in zip(stacked_dims_after_unstack, dims_stacked)}
    arr_unstacked_bad = arr_unstacked_bad.rename(dims_mapping)

    arr_out = arr_orig.copy(deep=True)
    arr_out.values = arr_unstacked_bad.transpose(*arr_orig.dims).values
    return arr_out.assign_coords(arr_orig.coords)
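
The renaming step at the heart of this workaround can be looked at in isolation, using plain tuples of dimension names instead of DataArrays. This is only a sketch: build_dims_mapping is a hypothetical helper name, and the "bad" dimension names are an assumption about what unstacking the example at the top of this issue would produce.

```python
def build_dims_mapping(dims_orig, dims_bad, dims_stacked):
    """Map auto-generated dim names back to the original stacked dim names.

    Assumes the renamed dims appear in the same order as ``dims_stacked``.
    """
    # Dims that were never stacked keep their names.
    dims_not_stacked = [d for d in dims_orig if d not in dims_stacked]
    # Whatever is left over must be the renamed stacked dims.
    stacked_after_unstack = [d for d in dims_bad if d not in dims_not_stacked]
    return dict(zip(stacked_after_unstack, dims_stacked))


mapping = build_dims_mapping(
    dims_orig=("c", "y", "x"),
    dims_bad=("y", "f_level_0", "f_level_1"),  # assumed result of unstacking
    dims_stacked=["c", "x"],
)
print(mapping)  # {'f_level_0': 'c', 'f_level_1': 'x'}
```

The mapping relies purely on position, so it would silently pair the wrong names if the unstacked dims came back in a different order than dims_stacked.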

@max-sixty
Collaborator

This does look weird. A PR would be great.

@spencerahill
Contributor

spencerahill commented Mar 25, 2020

Notice that the string coordinate also gets reordered alphabetically: in @chrisroat 's example above, the coord goes from ['R', 'G'] to ['G', 'R'].

@max-sixty I can't promise a PR anytime soon, but if/when I do manage, where would be a good starting point? Perhaps here where the _level_ names are introduced:

if isinstance(current_index, pd.MultiIndex):
    names.extend(current_index.names)
    codes.extend(current_index.codes)
    levels.extend(current_index.levels)
else:
    names.append("%s_level_0" % dim)

Edit: actually maybe here:

xarray/xarray/core/variable.py

Lines 2237 to 2249 in 9eec56c

def to_index(self):
    """Convert this variable to a pandas.Index"""
    # n.b. creating a new pandas.Index from an old pandas.Index is
    # basically free as pandas.Index objects are immutable
    assert self.ndim == 1
    index = self._data.array
    if isinstance(index, pd.MultiIndex):
        # set default names for multi-index unnamed levels so that
        # we can safely rename dimension / coordinate later
        valid_level_names = [
            name or "{}_level_{}".format(self.dims[0], i)
            for i, name in enumerate(index.names)
        ]
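
The fallback naming in the list comprehension above can be tried without xarray. A minimal sketch, assuming a hypothetical standalone helper default_level_names that takes the dimension name and the level names directly instead of reading them from a Variable:

```python
def default_level_names(dim, level_names):
    """Fill in unnamed MultiIndex levels as '<dim>_level_<i>'."""
    return [
        name or "{}_level_{}".format(dim, i)
        for i, name in enumerate(level_names)
    ]


# Levels that kept their names pass through unchanged.
print(default_level_names("f", ["c", "x"]))    # ['c', 'x']
# Levels whose names were lost (None) get the generic fallback.
print(default_level_names("f", [None, None]))  # ['f_level_0', 'f_level_1']
```

If this reading is right, the _level_ names show up only when the level names are already missing by the time this code runs, which would point to the names being dropped somewhere earlier in the groupby path.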

@max-sixty
Copy link
Collaborator

Re the reordering: that's right, though it reorders the data along the dimension too, not just the coord (i.e. the values are still correctly aligned with their labels). Here is a slight change to the original example to demonstrate.

In [18]: data = np.arange(2).reshape((2,1,1))
    ...: dims = ['c', 'y', 'x']
    ...:
    ...: d1 = xr.DataArray(data, dims=dims)
    ...: g1 = d1.stack(f=['c', 'x']).groupby('f').first()
    ...: print('Expected dim names:')
    ...: print(g1.coords)
    ...: print()
    ...:
    ...: d2 = xr.DataArray(data, dims=dims, coords={'c': ['R', 'G']})
    ...: g2 = d2.stack(f=['c', 'x']).groupby('f').first()
    ...: print('Unexpected dim names:')
    ...: print(g2.coords)
Expected dim names:
Coordinates:
  * f        (f) MultiIndex
  - c        (f) int64 0 1
  - x        (f) int64 0 0

Unexpected dim names:
Coordinates:
  * f          (f) MultiIndex
  - f_level_0  (f) object 'G' 'R'
  - f_level_1  (f) int64 0 0

In [19]: d2
Out[19]:
<xarray.DataArray (c: 2, y: 1, x: 1)>
array([[[0]],

       [[1]]])
Coordinates:
  * c        (c) <U1 'R' 'G'
Dimensions without coordinates: y, x

In [20]: g2
Out[20]:
<xarray.DataArray (y: 1, f: 2)>
array([[1, 0]])
Coordinates:
  * f          (f) MultiIndex
  - f_level_0  (f) object 'G' 'R'
  - f_level_1  (f) int64 0 0
Dimensions without coordinates: y

Yes that second reference looks like the place @spencerahill!

@spencerahill
Contributor

Thanks @max-sixty. Contrary to my warning about not doing a PR, I couldn't help myself and dug in a bit. It turns out that string coordinates aren't the problem; the bug appears when the coordinate isn't in sorted order. For example, @chrisroat's original example keeps the expected dim names if the coordinate is ["G", "R"] instead of ["R", "G"]. A more concrete WIP test:

import pandas as pd
import xarray as xr

def test_stack_groupby_unsorted_coord():
    data = [[0, 1], [2, 3]]
    data_flat = [0, 1, 2, 3]
    dims = ["y", "x"]
    y_vals = [2, 3]

    # "y" coord is in sorted order, and everything works
    arr = xr.DataArray(data, dims=dims, coords={"y": y_vals})
    actual1 = arr.stack(z=["y", "x"]).groupby("z").first()
    midx = pd.MultiIndex.from_product([[2, 3], [0, 1]], names=dims)
    expected1 = xr.DataArray(data_flat, dims=["z"], coords={"z": midx})
    xr.testing.assert_equal(actual1, expected1)
    
    # Now "y" coord is NOT in sorted order, and the bug appears
    arr = xr.DataArray(data, dims=dims, coords={"y": y_vals[::-1]})
    actual2 = arr.stack(z=["y", "x"]).groupby("z").first()
    midx = pd.MultiIndex.from_product([[3, 2], [0, 1]], names=dims)
    expected2 = xr.DataArray(data_flat, dims=["z"], coords={"z": midx})
    xr.testing.assert_equal(actual2, expected2)

test_stack_groupby_unsorted_coord()

yields

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

[...]

AssertionError: Left and right DataArray objects are not equal

Differing values:
L
    array([2, 3, 0, 1])
R
    array([0, 1, 2, 3])
Differing coordinates:
L * z        (z) MultiIndex
  - z_leve...(z) int64 2 2 3 3
  - z_leve...(z) int64 0 1 0 1
R * z        (z) MultiIndex
  - y        (z) int64 3 3 2 2
  - x        (z) int64 0 1 0 1

I'll return to this tomorrow, in the meantime if this triggers any thoughts about the best path forward, that would be much appreciated!
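
The values and level order on the left-hand side of the traceback above are consistent with the groups being sorted by label. A rough pure-Python sketch of that effect, using itertools.product as a stand-in for the stacked index (an illustration only, not how xarray implements groupby):

```python
from itertools import product

data_flat = [0, 1, 2, 3]
# Stacked labels in their original (unsorted) order, as in expected2.
labels = list(product([3, 2], [0, 1]))

# Sorting by label moves the values along with their labels.
sorted_pairs = sorted(zip(labels, data_flat))
sorted_labels = [label for label, _ in sorted_pairs]
sorted_values = [value for _, value in sorted_pairs]

print(sorted_labels)  # [(2, 0), (2, 1), (3, 0), (3, 1)]
print(sorted_values)  # [2, 3, 0, 1]
```

The sorted values match the left-hand array in the assertion error, so the data stays aligned with its labels; what differs from the expected result is the ordering and the lost level names.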
