Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load hazard raster data lazily #578

Merged
merged 14 commits into from
Jan 17, 2023
Merged

Conversation

peanutfun
Copy link
Member

@peanutfun peanutfun commented Nov 7, 2022

Changes proposed in this PR:

  • Use dask array chunks to load hazard raster data lazily
  • Add sparse as new dependency for that task

This PR fixes #544, an issue raised during the review of #507

PR Author Checklist

PR Reviewer Checklist

* Always load hazard raster data into chunked dask arrays.
* Support both dask and numpy DataArrays being passed as hazard data.
@timschmi95
Copy link
Collaborator

This change does enable to use the .from_raster_xarray method instead of my custom method for my hazard file without running into memory issues 👍
My hazard is a xarray dataset created from multiple netcdf files, and the following dimensions: time: 3660, chy: 225, chx: 352

@timschmi95
Copy link
Collaborator

This change does enable to use the .from_raster_xarray method instead of my custom method for my hazard file without running into memory issues 👍 My hazard is a xarray dataset created from multiple netcdf files, and the following dimensions: time: 3660, chy: 225, chx: 352

Actually, I realize that depending on the previous memory allocation I still run into a memory error. I'm not sure if the chunking could be made more efficient, or if it's just too much for my laptop anyways and I either pre-slice the data to a smaller extent (like in my custom function), or run it on euler if I need the full extent.

@peanutfun
Copy link
Member Author

Actually, I realize that depending on the previous memory allocation I still run into a memory error.

Xarray does not check how much memory in your system is already in use. I currently set chunks="auto", which simply ensures that one chunk can fit into the entire system RAM. However, from_raster_xarray explicitly supports passing Datasets that are already opened, so you can open it with custom chunking. The following would be a somewhat conservative choice:

ds = xr.open_dataset("file.nc", chunks=dict(time=100, chy=-1, chx=-1))
hazard = Hazard.from_raster_xarray(ds, **kwargs)

@peanutfun peanutfun marked this pull request as ready for review January 13, 2023 10:16
@peanutfun
Copy link
Member Author

@emanuel-schmid The sparse package still seems to be missing from the testing environment

@emanuel-schmid
Copy link
Collaborator

@emanuel-schmid The sparse package still seems to be missing from the testing environment

rather again than still ;)
But it's been re-installed just now.

@emanuel-schmid
Copy link
Collaborator

tests are running and I'm gonna include sparse in the develop branch

Co-authored-by: Emanuel Schmid <[email protected]>
climada/hazard/base.py Outdated Show resolved Hide resolved
@emanuel-schmid
Copy link
Collaborator

Let me quickly come back to a topic related to a previous PR, not really related to this one but by the method concerned:
Why did we chose 'time' as the default 'event' coordinate name, and not 'event'? 🧐

@peanutfun
Copy link
Member Author

Why did we chose 'time' as the default 'event' coordinate name, and not 'event'?

@emanuel-schmid In the Hazard class, we distinguish certain events, that can occur at the same or different times. However, most hazard datasets contain a time variable, but not an event variable. So the choice was made to use the most reasonable default value, but to still call the key event, and not time, to make clear that the variable will relate to different events in the Hazard class context. Does that make sense? 😵

@peanutfun peanutfun linked an issue Jan 16, 2023 that may be closed by this pull request
@emanuel-schmid
Copy link
Collaborator

Yes, that makes sense. 👍 I've added a note to this respect in the pydoc string in case somebody else is wondering. Feel free to revert if it isn't appropriate.

@peanutfun
Copy link
Member Author

Very good! 🤩

@peanutfun
Copy link
Member Author

@emanuel-schmid Is this ready to go from your perspective?

@emanuel-schmid
Copy link
Collaborator

🎉 Ready to go, from my perspective.

@emanuel-schmid
Copy link
Collaborator

🙇‍♂️ many thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Follow-up to #507: Lazy loading of chunked xarray Datasets
3 participants