Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset.__repr__ computes dask variables #1522

Closed
crusaderky opened this issue Aug 24, 2017 · 8 comments
Closed

Dataset.__repr__ computes dask variables #1522

crusaderky opened this issue Aug 24, 2017 · 8 comments

Comments

@crusaderky
Copy link
Contributor

DataArray.__repr__ and Variable.__repr__ print a placeholder if the data uses the dask backend.
Not so Dataset.__repr__, which tries computing the data before printing a tiny preview of it.
This issue is extremely annoying when working in Jupyter, and particularly acute if the chunks are very big or are at the end of a very long chain of computation.

For data variables, the expected behaviour is to print a placeholder just like DataArray does.
For coords, we could either

@shoyer
Copy link
Member

shoyer commented Aug 24, 2017

I think we should probably just print the placeholders for dask arrays in all cases. The current behavior should require calling a .preview() method of some sort.

@crusaderky
Copy link
Contributor Author

working on this now

@crusaderky
Copy link
Contributor Author

crusaderky commented Aug 28, 2017

This is in Jupyter:

def kernel():
    print("Kernel invoked!")
    return numpy.array([100, 200])

data = dask.array.Array(name='foo', dask={('foo', 0): (kernel, )}, chunks=((2,),), dtype=float)

ds = xarray.Dataset(
    data_vars={
        'foo': xarray.DataArray(data, dims=['x'], coords={'x': [10, 20]}),
        'bar': xarray.DataArray([1, 2], dims=['x'], coords={'x': [10, 20]}),
    })
ds.coords['y'] = ('y', [3, 4])
ds.coords['x2'] = ('x', data)
ds

Output:

Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Kernel invoked!
Out[2]:
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 10 20
  * y        (y) int64 3 4
    x2       (x) float64 100 200
Data variables:
    foo      (x) float64 100 200
    bar      (x) int64 1 2

YIKES!
Here you have two separate issues: one is with repr() as expected. The other is the fact that Jupyter invokes getattr() on a plethora of _ipython_* attributes that xarray doesn't define, and there was an issue where all non-index coords were blindly being converted to indexVariable (which loads the data) every. single. time.

@crusaderky
Copy link
Contributor Author

There is a separate problem where index coords are computed twice. Didn't fix it yet and I am afraid of a domino effect. The problem is in merge.py:merge_coords():

    _assert_compat_valid(compat)
    coerced = coerce_pandas_values(objs)
    aligned = deep_align(coerced, join=join, copy=False, indexes=indexes)
    expanded = expand_variable_dicts(aligned)
    priority_vars = _get_priority_vars(aligned, priority_arg, compat=compat)
    variables = merge_variables(expanded, priority_vars, compat=compat)
    assert_unique_multiindex_level_names(variables)

Here, both expand_variable_dicts() and _get_priority_vars() compute the dask array.

@shoyer
Copy link
Member

shoyer commented Aug 28, 2017

There is a separate problem where index coords are computed twice.

Index coordinates (i.e., IndexVariable) are supposed to be calculated once, when they are added to an xarray object.

If they are getting computed via as_variable in expand_variable_dicts (which is called from _get_priority_vars()) then something has already gone wrong.

@crusaderky
Copy link
Contributor Author

crusaderky commented Aug 28, 2017

Given this:

def kernel():
    print("Kernel invoked!")
    return numpy.array([100, 200])

data = dask.array.Array(name='foo', dask={('foo', 0): (kernel, )}, chunks=((2,),), dtype=float)

This correctly computes the coord once:

ds = xarray.Dataset(coords={'z': ('z', data)})

While this computes it twice:

ds = xarray.Dataset()
ds.coords['z'] = ('z', data)

@shoyer
Copy link
Member

shoyer commented Aug 28, 2017

Hmm. My guess (untested) is that ds.coords['z'] = ('z', data) may be creating a Variable object instead of an IndexVariable by mistake. I opened another issue for this one: #1533

@crusaderky
Copy link
Contributor Author

Travis is failing in a few environments, but I just tested that I get the exact same errors in the master branch

crusaderky pushed a commit to crusaderky/xarray that referenced this issue Sep 21, 2017
jhamman pushed a commit that referenced this issue Oct 9, 2017
* Load non-index coords to memory ahead of concat

* Update unit test after #1522

* Minimise loads on concat. Extend new concat logic to data_vars.

* Trivial tweaks

* Added unit tests

Fix loads when vars are found different halfway through

* Add xfail for #1586

* Revert "Add xfail for #1586"

This reverts commit f99313c.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants