
Avoid computing dask variables on __repr__ and __getattr__ #1532

Merged — 16 commits into pydata:master, Sep 21, 2017

Conversation

@crusaderky (Contributor) commented Aug 28, 2017:

Stop dataset data vars and non-index dataset/dataarray coords from being loaded by repr() and getattr(). The latter is particularly acute when working in Jupyter, which performs a dozen or so getattr() calls when printing an object.
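
A rough sketch of the symptom this fixes (the dataset below is illustrative, not from the PR):

    import numpy as np
    import xarray as xr

    # A small dask-backed dataset; chunking keeps the data lazy.
    ds = xr.Dataset({'a': ('x', np.arange(1000))}).chunk({'x': 100})

    # Before this change, either of these could compute the dask arrays:
    repr(ds)                     # repr() summarized values eagerly
    hasattr(ds, 'no_such_attr')  # attribute probes go through __getattr__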

@crusaderky (Contributor Author):

Fixes #1522

@jhamman modified the milestone: 0.10 (Aug 28, 2017)
@jhamman mentioned this pull request (Aug 28, 2017)
@jhamman (Member) left a comment:

This looks good to me. Glad you were able to track down what was going wrong here.

@rabernat (Contributor) commented Sep 1, 2017:

Just wanted to say that I'm so glad someone took this on!

Will this also fix the very slow tab-autocomplete on dask-backed xarray objects? This is something I hit frequently in interactive work with big datasets. I assume it's related.

Review thread on a test's expected repr, where unloaded variables print as ...:

    y (x) int64 ...
    Dimensions without coordinates: x
    Data variables:
        a (x) int64 ...""")
@shoyer (Member):

Something to consider: could we show an abbreviated version of the dask array repr instead of ...?

e.g., if the dask repr is dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>, maybe dask.array<add, chunksize=(5,)> or dask.array<add, shape=(10,) chunksize=(5,)>?

@crusaderky (Contributor Author):

@shoyer fixed. Now it's the same as in Variable and in the DataArray data var.

Review thread on the diff contrasting the old _not_remote() heuristic with the new var._in_memory check:

        show_values = _not_remote(var)
        return _summarize_var_or_coord(name, var, col_width, show_values)

    def summarize_datavar(name, var, col_width):
        show_values = var._in_memory
@shoyer (Member):

Our current heuristic uses the _not_remote() helper function, so it doesn't display arrays loaded over a network (via opendap), which can often be quite slow. But it does display a summary of values from netCDF files on disk, which I do think is generally helpful and for which I haven't noticed any performance issues.

Based on the current definition of _in_memory, we wouldn't display any of these arrays:

    @property
    def _in_memory(self):
        return (isinstance(self._data, (np.ndarray, PandasIndexAdapter)) or
                (isinstance(self._data, indexing.MemoryCachedArray) and
                 isinstance(self._data.array, np.ndarray)))

So instead of using _in_memory, I would suggest something like _not_remote(var) and not isinstance(var._data, dask_array_type) as the condition for showing values.
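
In code, the suggested condition would look something like this (a sketch assuming the helpers quoted above are in scope, not the final implementation):

    def summarize_datavar(name, var, col_width):
        # Preview values for local sources, but never for remote (opendap)
        # arrays or dask arrays, so repr() stays cheap.
        show_values = (_not_remote(var) and
                       not isinstance(var._data, dask_array_type))
        return _summarize_var_or_coord(name, var, col_width, show_values)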

@crusaderky (Contributor Author) commented Sep 2, 2017:

@shoyer loading a NetCDF variable from disk every time you do __repr__ is a terrible idea if that variable has been compressed without chunking. If the variable is a single block of 100MB of zlib-compressed data, you will have to read it and decompress it every time.
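
For example (file name, sizes, and encoding below are illustrative):

    import numpy as np
    import xarray as xr

    # ~200 MB variable, zlib-compressed, with no explicit chunking requested.
    ds = xr.Dataset({'a': (('x', 'y'), np.random.rand(5000, 5000))})
    ds.to_netcdf('big.nc', encoding={'a': {'zlib': True, 'complevel': 9}})

    ds2 = xr.open_dataset('big.nc')
    repr(ds2)  # an eager preview reads and decompresses 'a' just to print a summary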

@crusaderky (Contributor Author):

@shoyer also, your netCDF array might be sitting on a network file system on the opposite side of a narrowband VPN.

@shoyer (Member):

That's certainly possible, but in my experience very few people write 100MB chunks -- those are very large.

Let's summarize our options:

  1. Always show a preview of data from netCDF files with Dataset.__repr__
  2. Never show a preview for data if it isn't already in memory
  3. Show a preview depending on a global option (with default choice TBD).

Reasons to show data from disk in __repr__:

  • It's what we've always done.
  • "Most" of the time it's fast and convenient.
  • It provides a good experience for new users, who don't need to hunt for a separate preview() or load() command to see what's in a Dataset. You can simply print it at a console.

Reasons not to show data from disk in __repr__:

  • IO can be slow/expensive, especially if compression or networks are involved.
  • Heuristics to detect expensive IO are unreliable and somewhat distasteful.

Maybe we should solicit a few more opinions here before we change the default behavior?

Another possibility is to try loading data in a separate thread and time out if it takes too long (say more than 0.5 seconds), but that might open up its own set of performance issues (it's not easy to kill a thread, short of terminating a process).
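
A sketch of that idea (hypothetical helper, not proposed code -- it also shows the caveat, since the worker thread outlives the timeout):

    import concurrent.futures

    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def preview_with_timeout(var, timeout=0.5):
        # Try to load values for a repr preview; give up if IO is slow.
        future = _pool.submit(lambda: var.values)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The thread keeps running -- Python threads can't be killed --
            # so a stuck read lingers in the background.
            return None  # caller falls back to a placeholder such as '...'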

Member:

I think my vote would be to only print a preview of data that is in memory. For my uses, I typically have fill values in the first 10-20 data points so the previous __repr__ didn't give me any information.

@pydata/xarray - anyone else have thoughts on this?

Member:

@shoyer - do we have results from your google poll on this issue yet?

@shoyer (Member):

Sounds like I was wrong -- the consensus is pretty clear that we should go ahead with this.

[Four screenshots of the poll results, Sep 20, 2017]

Member:

I'm not sure this sample size is going to give us statistically significant results but I'm glad to see @delgadom and I are in agreement.

@crusaderky - are you up for implementing this?

Member:

I think the current implementation (in this PR) is actually already correct.

@crusaderky (Contributor Author):

Yep - data is eagerly loaded from disk only for index coords on __init__ now.

@crusaderky (Contributor Author) commented Sep 2, 2017:

@rabernat could you give an example? I never noticed this. If the delay was caused by a miss on __getattr__, then yes, it will be solved.

@crusaderky (Contributor Author) commented Sep 2, 2017:

@rabernat I reproduced your problem with tab completion and I'm happy to confirm that this fixes it!

@shoyer (Member) left a comment:

I have two small requests, but generally this looks good to me now.

@@ -208,6 +208,9 @@ def _summarize_var_or_coord(name, var, col_width, show_values=True,
     front_str = u'%s%s%s ' % (first_col, dims_str, var.dtype)
     if show_values:
         values_str = format_array_flat(var, max_width - len(front_str))
+    elif isinstance(var.data, dask_array_type):
+        chunksize = tuple(c[0] for c in var.chunks)
+        values_str = 'dask.array<shape=%s, chunksize=%s>' % (var.shape, chunksize)
@shoyer (Member):

can you make a little helper function for this, e.g., dask_short_repr()?
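
Something like the following, with the name from the suggestion and the body lifted from the diff above:

    def dask_short_repr(array):
        # One-line dask summary for variables that aren't loaded.
        chunksize = tuple(c[0] for c in array.chunks)
        return 'dask.array<shape=%s, chunksize=%s>' % (array.shape, chunksize)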

@@ -131,7 +131,11 @@ Bug fixes
   ``rtol`` arguments when called on ``DataArray`` objects.
   By `Stephan Hoyer <https://github.com/shoyer>`_.

 - Xarray ``quantile`` methods now properly raise a ``TypeError`` when applied to
+- Stop ``repr`` and the Jupyter Notebook from automatically computing dask
@shoyer (Member):

This should probably go under "breaking changes" instead, given that it changes existing behavior.

@crusaderky (Contributor Author):

Will finalise over the weekend

@crusaderky (Contributor Author):

All done

@shoyer merged commit 7611ed9 into pydata:master on Sep 21, 2017
@shoyer (Member) commented Sep 21, 2017:

Thanks @crusaderky !

@crusaderky (Contributor Author):
You're welcome 👍
