Follow-up to #507: Lazy loading of chunked xarray Datasets #544

Closed
peanutfun opened this issue Oct 7, 2022 · 5 comments · Fixed by #578

Comments

@peanutfun (Member) commented Oct 7, 2022

During the review of #507, the missing support for chunked (lazily evaluated) datasets was considered a major issue because the entire dataset currently has to fit into memory (see #507 (comment)).

I am currently investigating a way to load the data lazily with dask arrays (a feature xarray supports out of the box, see https://docs.xarray.dev/en/stable/user-guide/dask.html). Dask arrays are split into smaller blocks, or "chunks", which are designed to fit into memory. By default, each chunk is implemented as a numpy.ndarray.

Xarray offers functions that are agnostic to whether the underlying array type is a dask.array or a numpy.ndarray. This enables writing a function that takes the dask chunks and returns "sparse" array chunks:

import xarray as xr
import scipy.sparse  # 'import scipy' alone does not reliably expose scipy.sparse
da = xr.open_dataarray("file.nc", chunks="auto")  # Chunks are dask.array blocks
# Apply the conversion chunk-wise to each dask block
da = xr.apply_ufunc(scipy.sparse.csr_matrix, da, dask="parallelized", output_dtypes=[da.dtype])
da = da.compute().data  # Load into memory; da would now be a scipy.sparse.csr_matrix

However, the above example does not work because scipy.sparse.csr_matrix does not support the required numpy.ndarray API for this operation. One workaround would be to use the sparse library:

import xarray as xr
import sparse
da = xr.open_dataarray("file.nc", chunks="auto")  # Chunks are dask.array
da = xr.apply_ufunc(sparse.COO.from_numpy, da, dask="parallelized", output_dtypes=[da.dtype])
da = da.compute().data  # Load into memory, da is now a sparse.COO array
da = da.tocsr()  # Convert sparse.COO to scipy.sparse.csr_matrix

If we don't want a new dependency, we would need to operate on the xarray data types directly, which probably means stitching the chunks together ourselves (see the sketch below). Is it okay to add sparse as a new dependency for this particular use case, or should I dig further to see if we can work around it?
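For reference, a minimal sketch of what the manual stitching could look like, assuming a 2-D DataArray and using dask's to_delayed() so that only one dense chunk is in memory at a time (the file name is a placeholder):

import xarray as xr
import scipy.sparse

da = xr.open_dataarray("file.nc", chunks="auto")  # placeholder file, 2-D variable
blocks = da.data.to_delayed()  # grid of delayed numpy blocks matching the chunk layout
# Compute one block at a time, convert it to CSR, and stitch the pieces together
rows = []
for block_row in blocks:
    row = [scipy.sparse.csr_matrix(block.compute()) for block in block_row]
    rows.append(scipy.sparse.hstack(row, format="csr"))
matrix = scipy.sparse.vstack(rows, format="csr")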

@peanutfun (Member Author)

@emanuel-schmid, @chahank, how do you feel about adding sparse as a dependency?

@chahank (Member) commented Nov 18, 2022

Hmm, I do not know this package. The call is with @emanuel-schmid.

But if yes, we should be careful about how it is imported, as we often import scipy.sparse too. Having the name sparse twice could cause conflicts / unclear code.

@peanutfun (Member Author)

> Having the name sparse twice could cause conflicts / unclear code.

Yes, but it only affects climada/hazard/base.py. Importing should not be an issue:

import sparse as sp  # the pydata/sparse package
import scipy.sparse as sparse  # SciPy's sparse module, as used elsewhere

@peanutfun (Member Author)

To hopefully ease your mind a bit: Using the sparse package will only be a temporary solution. scipy.sparse has added array classes in recent versions, which supersede the matrix classes and will be fully compatible with NumPy arrays in the future. At that point, we will likely be able to remove the dependency on sparse again.

See the note in the latest scipy.sparse docs.
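For illustration, a minimal sketch of the array interface, assuming SciPy >= 1.8 (where scipy.sparse.csr_array was introduced):

import numpy as np
import scipy.sparse

arr = scipy.sparse.csr_array(np.eye(3))  # a sparse *array*, not a matrix
print(arr.shape, arr.nnz)  # follows NumPy array semantics more closely than csr_matrix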

peanutfun linked a pull request Jan 16, 2023 that will close this issue
@emanuel-schmid (Collaborator)

Closed by #578.
