-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update contains_cftime_datetimes to avoid loading entire variable array #7494
Update contains_cftime_datetimes to avoid loading entire variable array #7494
Conversation
Thanks for the PR. However, does that actually make a difference? To me it looks like Lines 1779 to 1780 in b451558
|
This isn't actually the line of code that's causing the performance bottleneck, it's the access to import numpy as np
import xarray as xr
str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds = ds.chunk(x=10000)
ds['str_array'] = ds.str_array.astype('O') # Needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')
%time xr.open_zarr('str_array.zarr') |
@agoodm, what you think of this version? Using xr.Variable directly seems a little easier to work with than trying to guess which type of array (cupy, dask, pint, backendarray, etc) is in the variable. |
@Illviljan I gave your update a quick test, it seems to work well enough and still maintains the performance improvement. It looks fine to me though I guess it looks like you still need to fix this failing mypy stuff now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought da.data
passes the array along as is - but you can learn something everyday. Thanks @Illviljan for taking over and sorry @agoodm for not properly reading the issue...
Co-authored-by: Mathias Hauser <[email protected]>
xarray/core/common.py
Outdated
if var.dtype == np.dtype("O") and var.size > 0: | ||
first_idx = (0,) * var.ndim | ||
sample = var[first_idx] | ||
return isinstance(sample.to_numpy().item(), cftime.datetime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very clean. It'd be nice to add some sort of test like DuckBackendArrayWrapper in https://github.com/pydata/xarray/pull/6874/files . __getitem__
should raise if it will return more than one value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look at this. I am a little confused for what you are suggesting here. Are you looking for a simple test in test_variable.py
that applies the same logic in this block to extract the very first element via Variable.__getitem__
here and check that it returns one value, a more general contains_cftime_datetimes
test, or both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sotry that was a bit complicated and intended for IIlviljan.
I pushed a commit with a test. I also changed the code to account for those lazily indexed backend arrays explicitly.
remove _variable_contains_cftime_datetimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Some simple test of the new functionality would be nice.
If you are really motivated, you can think about adding an asv benchmark here: https://github.com/pydata/xarray/blob/main/asv_bench/benchmarks/dataset_io.py
Seems like we're passing a DataArray instead of a Variable somewhere. |
I think that's in the tests themselves xarray/xarray/tests/test_coding_times.py Line 778 in 6531b57
|
Thanks @agoodm this work prompted a bunch of internal cleanup! |
Thanks @Illviljan and @dcherian for helping to see this through. |
* main: Preserve `base` and `loffset` arguments in `resample` (pydata#7444) ignore the `pkg_resources` deprecation warning (pydata#7594) Update contains_cftime_datetimes to avoid loading entire variable array (pydata#7494) Support first, last with dask arrays (pydata#7562) update the docs environment (pydata#7442) Add xCDAT to list of Xarray related projects (pydata#7579) [pre-commit.ci] pre-commit autoupdate (pydata#7565) fix nczarr when libnetcdf>4.8.1 (pydata#7575) use numpys SupportsDtype (pydata#7521)
whats-new.rst
This PR greatly improves the performance for opening datasets with large arrays of object type (typically string arrays) since
contains_cftime_datetimes
was triggering the entire array to be read from the file just to check the very first element in the entire array.@Illviljan continuing our discussion from the issue thread, I did try to pass in
var._data
to_contains_cftime_datetimes
, but I had a lot of trouble finding a way to generalize how to index the first array element. The best I could do wasvar._data.array.get_array()
, but I don't thinkget_array
is implemented for every backend. So for now I am leaving my original proposed solution.