Follow-up to #507: Lazy loading of chunked xarray Datasets #544

Closed
peanutfun opened this issue Oct 7, 2022 · 5 comments · Fixed by #578

Comments

@peanutfun (Member) commented Oct 7, 2022

During the review of #507, the missing support for chunked (lazily evaluated) datasets was considered a major issue because the entire dataset currently has to fit into memory (see #507 (comment)).

I am currently investigating a way to load the data lazily with dask arrays (a feature xarray supports out of the box, see https://docs.xarray.dev/en/stable/user-guide/dask.html). Dask arrays are split into smaller blocks, or "chunks", which are designed to fit into memory. By default, each chunk is implemented as a numpy.ndarray.

Xarray offers functions that are agnostic to whether the underlying array type is a dask.array or a numpy.ndarray. This enables writing a function that takes the dask chunks and returns "sparse" array chunks:

import xarray as xr
import scipy.sparse  # 'import scipy' alone does not reliably expose scipy.sparse
da = xr.open_dataarray("file.nc", chunks="auto")  # Chunks are dask.array blocks
# Apply the conversion chunk-wise to each dask block
da = xr.apply_ufunc(scipy.sparse.csr_matrix, da, dask="parallelized", output_dtypes=[da.dtype])
da = da.compute().data  # Load into memory; da would now be a scipy.sparse.csr_matrix

However, the above example does not work because scipy.sparse.csr_matrix does not support the required numpy.ndarray API for this operation. One workaround would be to use the sparse library:

import xarray as xr
import sparse
da = xr.open_dataarray("file.nc", chunks="auto")  # Chunks are dask.array
da = xr.apply_ufunc(sparse.COO.from_numpy, da, dask="parallelized", output_dtypes=[da.dtype])
da = da.compute().data  # Load into memory, da is now a sparse.COO array
da = da.tocsr()  # Convert sparse.COO to scipy.sparse.csr_matrix

If we don't want a new dependency, we would need to operate on the xarray data types directly, which probably means stitching the chunks together ourselves (see the sketch below). Is it okay to add sparse as a new dependency for this particular use case, or should I dig further to see if we can work around it?
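For reference, a minimal sketch of what the manual stitching could look like, assuming a 2-D DataArray and using dask's to_delayed() so that only one dense chunk is in memory at a time (the file name is a placeholder):

import xarray as xr
import scipy.sparse

da = xr.open_dataarray("file.nc", chunks="auto")  # placeholder file, 2-D variable
blocks = da.data.to_delayed()  # grid of delayed numpy blocks matching the chunk layout
# Compute one block at a time, convert it to CSR, and stitch the pieces together
rows = []
for block_row in blocks:
    row = [scipy.sparse.csr_matrix(block.compute()) for block in block_row]
    rows.append(scipy.sparse.hstack(row, format="csr"))
matrix = scipy.sparse.vstack(rows, format="csr")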

@peanutfun (Member Author)

@emanuel-schmid, @chahank, how do you feel about adding sparse as a dependency?

@chahank (Member) commented Nov 18, 2022

Hmm, I do not know this package. The call is with @emanuel-schmid.

But if yes, we should be careful about how it is imported, as we often import scipy.sparse too. Having the name sparse twice could cause conflicts / unclear code.

@peanutfun (Member Author)

> Having the name sparse twice could cause conflicts / unclear code.

Yes, but it only affects climada/hazard/base.py. Importing should not be an issue:

import sparse as sp  # the pydata/sparse package
import scipy.sparse as sparse  # SciPy's sparse module, as used elsewhere

@peanutfun (Member Author)

To hopefully ease your mind a bit: Using the sparse package will only be a temporary solution. scipy.sparse has added array classes in recent versions, which supersede the matrix classes and will be fully compatible with NumPy arrays in the future. At that point, we will likely be able to remove the dependency on sparse again.

See the note in the latest scipy.sparse docs.
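For illustration, a minimal sketch of the array interface, assuming SciPy >= 1.8 (where scipy.sparse.csr_array was introduced):

import numpy as np
import scipy.sparse

arr = scipy.sparse.csr_array(np.eye(3))  # a sparse *array*, not a matrix
print(arr.shape, arr.nnz)  # follows NumPy array semantics more closely than csr_matrix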

peanutfun linked a pull request Jan 16, 2023 that will close this issue
@emanuel-schmid (Collaborator)

Closed by #578.
