
Avoid computing dask variables on __repr__ and __getattr__ #1532

Merged — 16 commits into pydata:master, Sep 21, 2017

Conversation

@crusaderky (Contributor) commented Aug 28, 2017:

Stop dataset data vars and non-index dataset/dataarray coords from being loaded by repr() and getattr(). The latter is particularly acute when working in Jupyter, which performs a dozen or so getattr() calls when printing an object.
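
A rough sketch of the symptom this fixes (the dataset below is illustrative, not from the PR):

    import numpy as np
    import xarray as xr

    # A small dask-backed dataset; chunking keeps the data lazy.
    ds = xr.Dataset({'a': ('x', np.arange(1000))}).chunk({'x': 100})

    # Before this change, either of these could compute the dask arrays:
    repr(ds)                     # repr() summarized values eagerly
    hasattr(ds, 'no_such_attr')  # attribute probes go through __getattr__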

@crusaderky (Contributor Author):

Fixes #1522

@jhamman modified the milestone: 0.10 (Aug 28, 2017)
@jhamman mentioned this pull request (Aug 28, 2017)
@jhamman (Member) left a comment:

This looks good to me. Glad you were able to track down what was going wrong here.

@rabernat (Contributor) commented Sep 1, 2017:

Just wanted to say that I'm so glad someone took this on!

Will this also fix the very slow tab-autocomplete on dask-backed xarray objects? This is something I hit frequently in interactive work with big datasets. I assume it's related.

Review thread on a test's expected repr, where unloaded variables print as ...:

    y (x) int64 ...
    Dimensions without coordinates: x
    Data variables:
        a (x) int64 ...""")
@shoyer (Member):

Something to consider: could we show an abbreviated version of the dask array repr instead of ...?

e.g., if the dask repr is dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>, maybe dask.array<add, chunksize=(5,)> or dask.array<add, shape=(10,) chunksize=(5,)>?

@crusaderky (Contributor Author):

@shoyer fixed. Now it's the same as in Variable and in the DataArray data var.

Review thread on the diff contrasting the old _not_remote() heuristic with the new var._in_memory check:

        show_values = _not_remote(var)
        return _summarize_var_or_coord(name, var, col_width, show_values)

    def summarize_datavar(name, var, col_width):
        show_values = var._in_memory
@shoyer (Member):

Our current heuristic uses the _not_remote() helper function, so it doesn't display arrays loaded over a network (via opendap), which can often be quite slow. But it does display a summary of values from netCDF files on disk, which I do think is generally helpful and for which I haven't noticed any performance issues.

Based on the current definition of _in_memory, we wouldn't display any of these arrays:

    @property
    def _in_memory(self):
        return (isinstance(self._data, (np.ndarray, PandasIndexAdapter)) or
                (isinstance(self._data, indexing.MemoryCachedArray) and
                 isinstance(self._data.array, np.ndarray)))

So instead of using _in_memory, I would suggest something like _not_remote(var) and not isinstance(var._data, dask_array_type) as the condition for showing values.
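
In code, the suggested condition would look something like this (a sketch assuming the helpers quoted above are in scope, not the final implementation):

    def summarize_datavar(name, var, col_width):
        # Preview values for local sources, but never for remote (opendap)
        # arrays or dask arrays, so repr() stays cheap.
        show_values = (_not_remote(var) and
                       not isinstance(var._data, dask_array_type))
        return _summarize_var_or_coord(name, var, col_width, show_values)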

@crusaderky (Contributor Author) commented Sep 2, 2017:

@shoyer loading a NetCDF variable from disk every time you do __repr__ is a terrible idea if that variable has been compressed without chunking. If the variable is a single block of 100MB of zlib-compressed data, you will have to read it and decompress it every time.
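
For example (file name, sizes, and encoding below are illustrative):

    import numpy as np
    import xarray as xr

    # ~200 MB variable, zlib-compressed, with no explicit chunking requested.
    ds = xr.Dataset({'a': (('x', 'y'), np.random.rand(5000, 5000))})
    ds.to_netcdf('big.nc', encoding={'a': {'zlib': True, 'complevel': 9}})

    ds2 = xr.open_dataset('big.nc')
    repr(ds2)  # an eager preview reads and decompresses 'a' just to print a summary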

@crusaderky (Contributor Author):

@shoyer also, your netCDF array might be sitting on a network file system on the opposite side of a narrowband VPN.

@shoyer (Member):

That's certainly possible, but in my experience very few people write 100MB chunks -- those are very large.

Let's summarize our options:

  1. Always show a preview of data from netCDF files with Dataset.__repr__
  2. Never show a preview for data if it isn't already in memory
  3. Show a preview depending on a global option (with default choice TBD).

Reasons to show data from disk in __repr__:

  • It's what we've always done.
  • "Most" of the time it's fast and convenient.
  • It provides a good experience for new users, who don't need to hunt for a separate preview() or load() command to see what's in a Dataset. You can simply print it at a console.

Reasons not to show data from disk in __repr__:

  • IO can be slow/expensive, especially if compression or networks are involved.
  • Heuristics to detect expensive IO are unreliable and somewhat distasteful.

Maybe we should solicit a few more opinions here before we change the default behavior?

Another possibility is to try loading data in a separate thread and time out if it takes too long (say more than 0.5 seconds), but that might open up its own set of performance issues (it's not easy to kill a thread, short of terminating a process).
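
A sketch of that idea (hypothetical helper, not proposed code -- it also shows the caveat, since the worker thread outlives the timeout):

    import concurrent.futures

    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def preview_with_timeout(var, timeout=0.5):
        # Try to load values for a repr preview; give up if IO is slow.
        future = _pool.submit(lambda: var.values)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The thread keeps running -- Python threads can't be killed --
            # so a stuck read lingers in the background.
            return None  # caller falls back to a placeholder such as '...'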

Member:

I think my vote would be to only print a preview of data that is in memory. For my uses, I typically have fill values in the first 10-20 data points so the previous __repr__ didn't give me any information.

@pydata/xarray - anyone else have thoughts on this?

Member:

@shoyer - do we have results from your google poll on this issue yet?

@shoyer (Member):

Sounds like I was wrong -- the consensus is pretty clear that we should go ahead with this.

[Four screenshots of the poll results, Sep 20, 2017]

Member:

I'm not sure this sample size is going to give us statistically significant results but I'm glad to see @delgadom and I are in agreement.

@crusaderky - are you up for implementing this?

Member:

I think the current implementation (in this PR) is actually already correct.

@crusaderky (Contributor Author):

Yep - data is eagerly loaded from disk only for index coords on __init__ now.

@crusaderky (Contributor Author) commented Sep 2, 2017:

@rabernat could you give an example? I never noticed this. If the delay was caused by a miss on __getattr__, then yes, it will be solved.

@crusaderky (Contributor Author) commented Sep 2, 2017:

@rabernat I reproduced your problem with tab completion and I'm happy to confirm that this fixes it!

@shoyer (Member) left a comment:

I have two small requests, but generally this looks good to me now.

@@ -208,6 +208,9 @@ def _summarize_var_or_coord(name, var, col_width, show_values=True,
     front_str = u'%s%s%s ' % (first_col, dims_str, var.dtype)
     if show_values:
         values_str = format_array_flat(var, max_width - len(front_str))
+    elif isinstance(var.data, dask_array_type):
+        chunksize = tuple(c[0] for c in var.chunks)
+        values_str = 'dask.array<shape=%s, chunksize=%s>' % (var.shape, chunksize)
@shoyer (Member):

can you make a little helper function for this, e.g., dask_short_repr()?
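
Something like the following, with the name from the suggestion and the body lifted from the diff above:

    def dask_short_repr(array):
        # One-line dask summary for variables that aren't loaded.
        chunksize = tuple(c[0] for c in array.chunks)
        return 'dask.array<shape=%s, chunksize=%s>' % (array.shape, chunksize)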

@@ -131,7 +131,11 @@ Bug fixes
   ``rtol`` arguments when called on ``DataArray`` objects.
   By `Stephan Hoyer <https://github.com/shoyer>`_.

 - Xarray ``quantile`` methods now properly raise a ``TypeError`` when applied to
+- Stop ``repr`` and the Jupyter Notebook from automatically computing dask
@shoyer (Member):

This should probably go under "breaking changes" instead, given that it changes existing behavior.

@crusaderky (Contributor Author):

Will finalise over the weekend

@crusaderky (Contributor Author):

All done

@shoyer merged commit 7611ed9 into pydata:master on Sep 21, 2017
@shoyer (Member) commented Sep 21, 2017:

Thanks @crusaderky !

@crusaderky (Contributor Author):
You're welcome 👍
