Dask arrays and DataArray coords that share name with dimensions #1684

Closed
jhamman opened this issue Nov 2, 2017 · 3 comments

@jhamman
Member

jhamman commented Nov 2, 2017

First reported by @mrocklin here.

In [1]: import xarray 

In [2]: import dask.array as da

In [3]: coord = da.arange(8, chunks=(4,))
   ...: data = da.random.random((8, 8), chunks=(4, 4)) + 1
   ...: array = xarray.DataArray(data,
   ...:                          coords={'x': coord, 'y': coord},
   ...:                          dims=['x', 'y'])
   ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-b90a33ebf436> in <module>()
      3 array = xarray.DataArray(data,
      4                   coords={'x': coord, 'y': coord},
----> 5                   dims=['x', 'y'])

/home/mrocklin/workspace/xarray/xarray/core/dataarray.py in __init__(self, data, coords, dims, name, attrs, encoding, fastpath)
    227 
    228             data = as_compatible_data(data)
--> 229             coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
    230             variable = Variable(dims, data, attrs, encoding, fastpath=True)
    231 

/home/mrocklin/workspace/xarray/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
     68     if utils.is_dict_like(coords):
     69         for k, v in coords.items():
---> 70             new_coords[k] = as_variable(v, name=k)
     71     elif coords is not None:
     72         for dim, coord in zip(dims, coords):

/home/mrocklin/workspace/xarray/xarray/core/variable.py in as_variable(obj, name)
     94                             '{}'.format(obj))
     95     elif utils.is_scalar(obj):
---> 96         obj = Variable([], obj)
     97     elif getattr(obj, 'name', None) is not None:
     98         obj = Variable(obj.name, obj)

/home/mrocklin/workspace/xarray/xarray/core/variable.py in __init__(self, dims, data, attrs, encoding, fastpath)
    275         """
    276         self._data = as_compatible_data(data, fastpath=fastpath)
--> 277         self._dims = self._parse_dimensions(dims)
    278         self._attrs = None
    279         self._encoding = None

/home/mrocklin/workspace/xarray/xarray/core/variable.py in _parse_dimensions(self, dims)
    439             raise ValueError('dimensions %s must have the same length as the '
    440                              'number of data dimensions, ndim=%s'
--> 441                              % (dims, self.ndim))
    442         return dims
    443 

ValueError: dimensions () must have the same length as the number of data dimensions, ndim=1

Or a similar setup, which computes the coordinates immediately:

In [18]: x = xr.Variable('x', da.arange(8, chunks=(4,)))
    ...: y = xr.Variable('y', da.arange(8, chunks=(4,)) * 2)
    ...: data = da.random.random((8, 8), chunks=(4, 4)) + 1
    ...: array = xr.DataArray(data,
    ...:                      dims=['x', 'y'])
    ...: array.coords['x'] = x
    ...: array.coords['y'] = y
    ...:

In [19]: array
Out[19]:
<xarray.DataArray 'add-7d8ed340e5dd8fe107ea681573c72e87' (x: 8, y: 8)>
dask.array<shape=(8, 8), dtype=float64, chunksize=(4, 4)>
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7
  * y        (y) int64 0 2 4 6 8 10 12 14

Problem description

I think we have two, possibly related, problems with using dask arrays as DataArray coordinates.

  1. As the first snippet shows, the constructor fails when coordinates are specified as raw dask arrays. This does not occur when coord is a numpy array.
  2. When coordinates are specified as dask arrays via the coords attribute, they are computed immediately.
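
As a possible stopgap for problem 1, here is a minimal sketch of two ways to build the same array without handing a bare dask array to the constructor. It assumes that the `(dims, data)` tuple form of a coordinate and a plain NumPy conversion both avoid the failing code path; this has not been checked against the exact versions listed below. The names `array_a`, `array_b`, and `coord_np` are made up for illustration.

```python
import dask.array as da
import numpy as np
import xarray as xr

coord = da.arange(8, chunks=(4,))
data = da.random.random((8, 8), chunks=(4, 4)) + 1

# Option A (assumed workaround): state the dimension explicitly by passing
# each coordinate as a (dims, data) tuple instead of a bare dask array.
array_a = xr.DataArray(
    data,
    coords={'x': ('x', coord), 'y': ('y', coord)},
    dims=['x', 'y'],
)

# Option B (assumed workaround): convert the coordinate to NumPy up front.
# Dimension coordinates are evaluated eagerly anyway, so no laziness is
# given up for them.
coord_np = np.asarray(coord)
array_b = xr.DataArray(
    data,
    coords={'x': coord_np, 'y': coord_np},
    dims=['x', 'y'],
)
```

In both variants the data itself stays a chunked dask array; only the coordinate values are materialized.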

Expected Output

Output of xr.show_versions()

In [23]: xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.0rc1
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: None
h5netcdf: 0.3.1
Nio: None
bottleneck: 1.2.0
cyordereddict: None
dask: 0.15.4
matplotlib: 2.0.2
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 36.6.0
pip: 9.0.1
conda: 4.3.29
pytest: 3.0.5
IPython: 5.1.0
sphinx: 1.5.1

@shoyer
Member

shoyer commented Nov 2, 2017

Case 1 looks like a bug.

Case 2 is intentional: we always evaluate coordinates corresponding to dimensions and store them as pandas.Index objects.
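
A small sketch of what case 2 means in practice, assuming a recent xarray and dask: a coordinate that shares its name with a dimension is evaluated and backed by a `pandas.Index`, while a coordinate under a different name can stay lazy. The coordinate name `x_aux` and the flag variables are hypothetical, chosen only for this illustration.

```python
import dask
import dask.array as da
import pandas as pd
import xarray as xr

data = da.zeros((8, 8), chunks=(4, 4))
lazy = da.arange(8, chunks=(4,))

array = xr.DataArray(
    data,
    dims=['x', 'y'],
    coords={
        'x': ('x', lazy),          # shares its name with dimension 'x'
        'x_aux': ('x', lazy * 2),  # hypothetical non-dimension coordinate
    },
)

# The dimension coordinate is exposed as a pandas.Index ...
x_is_index = isinstance(array.indexes['x'], pd.Index)
# ... so its values are no longer a dask collection.
x_is_lazy = dask.is_dask_collection(array.coords['x'].data)
# The auxiliary coordinate along the same dimension is not turned into an index.
aux_is_lazy = dask.is_dask_collection(array.coords['x_aux'].data)

print(x_is_index, x_is_lazy, aux_is_lazy)
```

If a coordinate must remain lazy, giving it a name that differs from the dimension is one way to keep it out of the index machinery.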

@jhamman
Member Author

jhamman commented Nov 3, 2017

@shoyer - I had forgotten about case 2. Thanks. I'll look into case 1.

@jhamman
Member Author

jhamman commented Nov 3, 2017

Fix is up in #1685.
