Avoid loading entire dataset by getting the nbytes in an array #7356
Conversation
I personally do not even think the
Using `.data` accidentally tries to load the whole lazy array into memory. Sad.
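From the discussion, the fix amounts to a small change to `Variable.nbytes`: prefer the wrapped array's own `nbytes` when it exists, and otherwise compute the value from metadata rather than touching `.data`. A minimal sketch of that shape of fix, paraphrased from the thread rather than copied from the diff:

```python
# Sketch of the nbytes fix on Variable (paraphrased, not the exact merged code):
@property
def nbytes(self) -> int:
    if hasattr(self._data, "nbytes"):
        # lazy/duck arrays often expose nbytes without loading any values
        return self._data.nbytes
    else:
        # fall back to metadata: no element values are ever read
        return self.size * self.dtype.itemsize
```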
Looking into the history a little more, I seem to be proposing a revert. I think this is important, since many users have arrays that are larger than memory. For me, I found this bug when trying to access the number of bytes in a 16 GB dataset that I'm trying to load on my wimpy laptop. Not fun to start swapping. I feel like others might be hitting this too.
I think that, at the very least, the current implementation works as well as the old one for arrays that are defined by the
It seems that checking hasattr on the
Is that test targeting your issue with RAM crashing the laptop? Shouldn't there be some check that the values were loaded? How did you import your data? `self.data` looks like this: `xarray/core/variable.py`, lines 420 to 435 at ed60c6c.
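For reference, the permalinked `data` property looks roughly like the following; this is a paraphrase of that era of `variable.py`, not an exact copy of lines 420 to 435:

```python
# Rough paraphrase of Variable.data (not an exact copy of the permalink):
@property
def data(self):
    if is_duck_array(self._data):
        # already an in-memory or dask duck array; return it as-is
        return self._data
    else:
        # lazy backend wrappers fall through to self.values, which
        # calls np.asarray on the wrapper and reads the data from disk
        return self.values
```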
I was expecting your data to be a `duck_array`?
No explicit test was added to ensure that the data isn't loaded. I just experienced this bug enough (we would accidentally load 100 GB files in our code base) that I knew exactly how to fix it. If you want, I can add a test to ensure that future optimizations to nbytes do not trigger a data load. I was hoping the one-line fix would be a shoo-in.
The data is loaded from a NetCDF store through `open_dataset`.
I'm not really opposed to this change; `shape` and `dtype` use `self._data` as well, without using `.data`. This test just looked so similar to the tests in #6797. I think you can do a similar lazy test, taking inspiration from `xarray/tests/test_formatting.py`, lines 715 to 727 at ed60c6c.
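A minimal sketch of such a lazy test, assuming a hypothetical `RaiseOnLoad` duck array; the class and test names are illustrative, not from the xarray test suite, and this assumes xarray's duck-array check accepts a class with these attributes:

```python
import math

import numpy as np
import xarray as xr


class RaiseOnLoad:
    """Hypothetical duck array that errors if anything reads its values."""

    def __init__(self, shape, dtype=np.dtype("float64")):
        self.shape = shape
        self.dtype = dtype

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def size(self):
        return math.prod(self.shape)

    def __array_function__(self, *args, **kwargs):
        raise AssertionError("values were computed")

    def __array_ufunc__(self, *args, **kwargs):
        raise AssertionError("values were computed")

    def __array__(self, dtype=None):
        raise AssertionError("values were loaded into memory")


def test_nbytes_stays_lazy():
    var = xr.Variable(("x", "y"), RaiseOnLoad((3, 4)))
    # nbytes should come from metadata (size * itemsize), never from .data
    assert var.nbytes == 3 * 4 * 8
```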
Very smart test!
Yes, without chunks or anything.
LGTM. thanks!
👍🏾
Any chance of a release? This is quite breaking for large datasets that can only be handled out of memory.
* main: (41 commits)
  * v2023.01.0 whats-new (pydata#7440)
  * explain keep_attrs in docstring of apply_ufunc (pydata#7445)
  * Add sentence to open_dataset docstring (pydata#7438)
  * pin scipy version in doc environment (pydata#7436)
  * Improve performance for backend datetime handling (pydata#7374)
  * fix typo (pydata#7433)
  * Add lazy backend ASV test (pydata#7426)
  * Pull Request Labeler - Workaround sync-labels bug (pydata#7431)
  * see also: groupby in resample doc and vice-versa (pydata#7425)
  * Some alignment optimizations (pydata#7382)
  * Make `broadcast` and `concat` work with the Array API (pydata#7387)
  * remove `numbagg` and `numba` from the upstream-dev CI (pydata#7416)
  * [pre-commit.ci] pre-commit autoupdate (pydata#7402)
  * Preserve original dtype when accessing MultiIndex levels (pydata#7393)
  * [pre-commit.ci] pre-commit autoupdate (pydata#7389)
  * [pre-commit.ci] pre-commit autoupdate (pydata#7360)
  * COMPAT: Adjust CFTimeIndex.get_loc for pandas 2.0 deprecation enforcement (pydata#7361)
  * Avoid loading entire dataset by getting the nbytes in an array (pydata#7356)
  * `keep_attrs` for pad (pydata#7267)
  * Bump pypa/gh-action-pypi-publish from 1.5.1 to 1.6.4 (pydata#7375)
  * ...
This came up in the xarray office hours today, and I'm confused why this PR made any difference to the behavior at all? The
Because we have lazy data reading functionality:

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
var = ds.air.variable

print(type(var._data))              # memory cached array
print(type(var._data.array.array))  # ah, that's wrapping a lazy array, no data read in yet
print(var._data.size)               # can access size
print(type(var._data.array.array))  # still a lazy array

# .data forces a disk load
print(type(var.data))               # oops, disk load
print(type(var._data))              # "still memory cached array"
print(type(var._data.array.array))  # but that's wrapping numpy data in memory
```
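If the `data` property sketched earlier is accurate, this behavior follows directly from it: `size`, `shape`, and `dtype` are plain metadata on the wrapper, so reading them never touches the file, while `.data` falls through to `self.values` (effectively `np.asarray` on the lazy wrapper) whenever `_data` is not already a duck array, which is exactly the disk load this PR avoids.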