Opening datasets with large object dtype arrays is very slow #7484

agoodm · 2023-01-29T23:31:51Z

What is your issue?

Opening a dataset with a very large array with object dtype is much slower than it should be. I initially noticed this when working with a dataset spanned by around 24000 netcdf files. I have been using kerchunk references to load them with a consolidated metadata key so I was expecting it to be fairly quick to open, but it actually took several minutes. I realized that all this time was spent on one variable consisting of strings, so when dropping it the whole dataset opens up in seconds. Sharing this would be a bit difficult so instead I will illustrate this with a simple easy to reproduce example with the latest released versions of xarray and zarr installed:

str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds['str_array'] = ds.str_array.astype('O') # Needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')

%time xr.open_zarr('str_array.zarr/')
CPU times: user 8.24 s, sys: 5.23 s, total: 13.5 s
Wall time: 12.9 s

I did some digging and found that pretty much all the time was spent on the check being done by contains_cftime_datetimes in

xarray/xarray/core/common.py

Line 1793 in d385e20

return _contains_cftime_datetimes(var.data)

This operation is not lazy and ends up requiring every single chunk for this variable to be opened, all for the sake of checking the very first element in the entire array. A quick fix I tried is updating contains_cftime_datetimes to do the following:

def contains_cftime_datetimes(var) -> bool:
    """Check if an xarray.Variable contains cftime.datetime objects"""
    if var.dtype == np.dtype("O") and var.size > 0:
        ndims = len(var.shape)
        first_idx = np.zeros(ndims, dtype='int32')
        array = var[*first_idx].data
        return _contains_cftime_datetimes(array)
    else:
        return False

This drastically reduced the time to open the dataset as expected:

%time xr.open_zarr('str_array.zarr/')
CPU times: user 384 ms, sys: 502 ms, total: 887 ms
Wall time: 985 ms

I would like to make a PR with this change but I realize that this change could effect every backend, and although I have been using xarray for many years this would be my first contribution and so I would like to briefly discuss it in case there are better ways to address the issue. Thanks!

The text was updated successfully, but these errors were encountered:

Illviljan · 2023-01-30T19:53:03Z

Feel free to start working on that PR. 👍 It looks like _contains_cftime_datetimes tries to do a similar thing as your solution so I think the change should be there.

agoodm · 2023-01-30T20:36:46Z

Great, thanks! It's actually the var.data attribute access itself that's triggering the loading so that's why I needed to put the change there, but I see your point that I should probably update contains_cftime_datetimes as well since selecting the first element again is stylistically redundant. In any case, I'll go ahead and quickly get to work at preparing a PR for this.

Illviljan · 2023-01-30T20:59:36Z

You can do var._data instead of var.data. There's been a few cases recently where self.data doesn't play so nice when reading from a file.

agoodm added the needs triage Issue that has not been reviewed by xarray team member label Jan 29, 2023

TomNicholas added topic-backends topic-performance topic-cftime and removed needs triage Issue that has not been reviewed by xarray team member labels Jan 30, 2023

agoodm mentioned this issue Jan 30, 2023

Update contains_cftime_datetimes to avoid loading entire variable array #7494

Merged

2 tasks

dcherian closed this as completed in #7494 Mar 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Opening datasets with large object dtype arrays is very slow #7484

Opening datasets with large object dtype arrays is very slow #7484

agoodm commented Jan 29, 2023 •

edited

Loading

Illviljan commented Jan 30, 2023

agoodm commented Jan 30, 2023

Illviljan commented Jan 30, 2023 •

edited

Loading

Opening datasets with large object dtype arrays is very slow #7484

Opening datasets with large object dtype arrays is very slow #7484

Comments

agoodm commented Jan 29, 2023 • edited Loading

What is your issue?

Illviljan commented Jan 30, 2023

agoodm commented Jan 30, 2023

Illviljan commented Jan 30, 2023 • edited Loading

agoodm commented Jan 29, 2023 •

edited

Loading

Illviljan commented Jan 30, 2023 •

edited

Loading