Avoid computing dask variables on __repr__ and __getattr__ #1532
Conversation
Also stop resolving non-index coords in Datasets
Fixes #1522
This looks good to me. Glad you were able to track down what was going wrong here.
Just wanted to say that I'm so glad someone took this on! Will this also fix the very slow tab-autocomplete on dask-backed xarray objects? This is something I hit frequently in interactive work with big datasets. I assume it's related.
xarray/tests/test_dask.py (Outdated)

    y        (x) int64 ...
Dimensions without coordinates: x
Data variables:
    a        (x) int64 ...""")
Something to consider: could we show an abbreviated version of the dask array repr instead of `...`? E.g., if the dask repr is `dask.array<add, shape=(10,), dtype=float64, chunksize=(5,)>`, maybe `dask.array<add, chunksize=(5,)>` or `dask.array<add, shape=(10,), chunksize=(5,)>`?
@shoyer fixed. Now it's the same as in Variable and in the DataArray data var.
show_values = _not_remote(var)
return _summarize_var_or_coord(name, var, col_width, show_values)

def summarize_datavar(name, var, col_width):
    show_values = var._in_memory
Our current heuristic uses the `_not_remote()` helper function, so it doesn't display arrays loaded over a network (via opendap), which can often be quite slow. But it does display a summary of values from netCDF files on disk, which I do think is generally helpful and for which I haven't noticed any performance issues.
Based on the current definition of `_in_memory`, we wouldn't display any of these arrays:
xarray/xarray/core/variable.py, lines 285 to 289 in 4a15cfa:

    @property
    def _in_memory(self):
        return (isinstance(self._data, (np.ndarray, PandasIndexAdapter)) or
                (isinstance(self._data, indexing.MemoryCachedArray) and
                 isinstance(self._data.array, np.ndarray)))
So instead of using `_in_memory`, I would suggest something like `_not_remote(var) and not isinstance(var._data, dask_array_type)` as the condition for showing values.
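A minimal, self-contained sketch of that suggested condition. The stub class and `_not_remote` body below are hypothetical stand-ins for the xarray internals (`dask_array_type`, the real `_not_remote` helper), just to make the logic runnable in isolation:

```python
import numpy as np

class FakeDaskArray:
    """Hypothetical stand-in for dask.array.Array (xarray's dask_array_type)."""

def _not_remote(data):
    # The real helper returns False for arrays loaded over a network
    # (e.g. opendap); here we assume everything is local.
    return True

def should_show_values(data):
    """The suggested condition: preview values only for local, non-dask data."""
    return _not_remote(data) and not isinstance(data, FakeDaskArray)

print(should_show_values(np.arange(3)))     # in-memory numpy data -> True
print(should_show_values(FakeDaskArray()))  # dask-backed data -> False
```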
@shoyer loading a NetCDF variable from disk every time you do __repr__ is a terrible idea if that variable has been compressed without chunking. If the variable is a single block of 100MB of zlib-compressed data, you will have to read it and decompress it every time.
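The cost is easy to demonstrate with a scaled-down, stdlib-only sketch (the sizes here are illustrative, not from the PR): previewing even four values from a single zlib-compressed block requires decompressing the whole block, on every repr:

```python
import zlib
import numpy as np

# One unchunked, zlib-compressed block (scaled down from the 100 MB
# example in the discussion):
data = np.arange(1_000_000, dtype=np.float64)
block = zlib.compress(data.tobytes(), level=4)

def preview(compressed_block, n=4):
    """What a value-printing __repr__ would have to do on each call:
    decompress the entire block just to show the first n values."""
    raw = zlib.decompress(compressed_block)
    return np.frombuffer(raw, dtype=np.float64)[:n]

print(preview(block))  # decompresses all ~8 MB for 4 numbers
```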
@shoyer also, your netcdf array might be sitting on a network file system on the opposite side of a narrowband VPN.
That's certainly possible, but in my experience very few people write 100MB chunks -- those are very large.
Let's summarize our options:
- Always show a preview of data from netCDF files with `Dataset.__repr__`.
- Never show a preview for data if it isn't already in memory.
- Show a preview depending on a global option (with default choice TBD).
Reasons to show data from disk in `__repr__`:
- It's what we've always done.
- "Most" of the time it's fast and convenient.
- It provides a good experience for new users, who don't need to hunt for a separate `preview()` or `load()` command to see what's in a Dataset. You can simply print it at a console.
Reasons not to show data from disk in `__repr__`:
- IO can be slow/expensive, especially if compression or networks are involved.
- Heuristics to detect expensive IO are unreliable and somewhat distasteful.
Maybe we should solicit a few more opinions here before we change the default behavior?
Another possibility is to try loading data in a separate thread and time out if it takes too long (say, more than 0.5 seconds), but that might open up its own set of performance issues (it's not easy to kill a thread, short of terminating a process).
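A rough sketch of that thread-plus-timeout idea (a hypothetical helper, not from the PR), which also exhibits exactly the caveat mentioned: on timeout the worker thread is not killed, we merely stop waiting for it:

```python
import concurrent.futures
import time

def preview_with_timeout(load, timeout=0.5, placeholder='...'):
    """Try to load data for __repr__, falling back to a placeholder if
    loading takes longer than `timeout` seconds. The slow load is NOT
    cancelled -- threads can't be killed -- it just finishes in the
    background."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(load)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return placeholder
    finally:
        # wait=False so __repr__ returns promptly even on timeout
        pool.shutdown(wait=False)

print(preview_with_timeout(lambda: [1, 2, 3]))                      # fast load
print(preview_with_timeout(lambda: time.sleep(1.0), timeout=0.05))  # slow load
```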
I think my vote would be to only print a preview of data that is in memory. For my uses, I typically have fill values in the first 10-20 data points, so the previous `__repr__` didn't give me any information.
@pydata/xarray - anyone else have thoughts on this?
@shoyer - do we have results from your google poll on this issue yet?
I'm not sure this sample size is going to give us statistically significant results but I'm glad to see @delgadom and I are in agreement.
@crusaderky - are you up for implementing this?
I think the current implementation (in this PR) is actually already correct.
Yep - data is eagerly loaded from disk only for index coords on __init__ now.
@rabernat could you give an example? I never noticed. If the delay was caused by a miss on __getattr__, then yes, it will be solved.
@rabernat I reproduced your problem with tab completion and I'm happy to confirm that this fixes it!
I have two small requests, but generally this looks good to me now.
xarray/core/formatting.py (Outdated)

@@ -208,6 +208,9 @@ def _summarize_var_or_coord(name, var, col_width, show_values=True,
     front_str = u'%s%s%s ' % (first_col, dims_str, var.dtype)
     if show_values:
         values_str = format_array_flat(var, max_width - len(front_str))
+    elif isinstance(var.data, dask_array_type):
+        chunksize = tuple(c[0] for c in var.chunks)
+        values_str = 'dask.array<shape=%s, chunksize=%s>' % (var.shape, chunksize)
Can you make a little helper function for this, e.g., `dask_short_repr()`?
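One possible shape for such a helper (hypothetical; the real one would live in xarray/core/formatting.py and likely take the variable itself), built from the shape and per-dimension chunk tuples the diff already uses:

```python
def dask_short_repr(shape, chunks):
    """Abbreviated repr for a dask-backed variable in a Dataset summary:
    shape plus the first chunk size per dimension, as in the diff above."""
    chunksize = tuple(c[0] for c in chunks)
    return 'dask.array<shape=%s, chunksize=%s>' % (shape, chunksize)

# A 1-D array of 10 elements stored as two chunks of 5:
print(dask_short_repr((10,), ((5, 5),)))  # dask.array<shape=(10,), chunksize=(5,)>
```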
doc/whats-new.rst (Outdated)

@@ -131,7 +131,11 @@ Bug fixes
   ``rtol`` arguments when called on ``DataArray`` objects.
   By `Stephan Hoyer <https://github.com/shoyer>`_.

- Xarray ``quantile`` methods now properly raise a ``TypeError`` when applied to
- Stop ``repr`` and the Jupyter Notebook from automatically computing dask
This should probably go under "breaking changes" instead, given that it changes existing behavior.
Will finalise over the weekend
All done
Thanks @crusaderky !
You're welcome 👍
- ``git diff upstream/master | flake8 --diff``
- ``whats-new.rst`` for all changes and ``api.rst`` for new API

Stop dataset data vars and non-index dataset/dataarray coords from being loaded by repr() and getattr(). The latter is particularly acute when working in Jupyter, which makes a dozen or so getattr() calls when printing an object.