Move to new cds #62
Conversation
…version xarray_regrid v0.4.0
…n ingest function of land cover to avoid memory problems
@BSchilperoort when you have time, can you please review this PR? I explained most of the changes in this comment.
I did not try to run it yet, but here are some things I noticed in the changes.
src/zampy/datasets/fapar_lai.py (outdated diff):

```diff
@@ -150,8 +150,7 @@ def load(
         variable_names: list[str],
     ) -> xr.Dataset:
         files = list((ingest_dir / self.name).glob("*.nc"))

-        ds = xr.open_mfdataset(files, parallel=True)
+        ds = xr.open_mfdataset(files)
```
The parallel kwarg was there to ensure the dataset is opened as Dask arrays, to avoid any memory issues. Will that still go OK now?
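For reference, a minimal sketch (reusing `files` from the diff above) to check whether the opened dataset is still backed by lazy Dask arrays rather than loaded eagerly into memory:

```python
import xarray as xr

# open_mfdataset normally returns Dask-backed (lazy) arrays;
# parallel=True only parallelizes the per-file open/preprocess step
ds = xr.open_mfdataset(files)

# chunks is None for any variable that was loaded eagerly into memory
print({name: var.chunks for name, var in ds.data_vars.items()})
```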
If the parallel kwarg is True, the open and preprocess steps are performed in parallel using Dask, so we need to configure Dask with the number of jobs or CPUs to use. Otherwise dask.distributed.Client() uses all available cores by default, and this causes a segmentation fault error. This is the case in cli.py, where n_workers is not set, see here.
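A minimal sketch of the kind of explicit configuration meant here; the worker count is illustrative, and zampy's cli.py would need to pass it through:

```python
from dask.distributed import Client

# cap the number of workers instead of letting Client() spawn
# one worker per available core (values here are illustrative)
client = Client(n_workers=4, threads_per_worker=1)
```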
OK. I do think that making use of dask.distributed is quite important for performance. You can configure the defaults with a config file: https://docs.dask.org/en/latest/configuration.html#specify-configuration
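For example, a sketch of setting such defaults programmatically; the keys are real Dask settings, but the values are placeholders, and the same keys can live in a YAML file under `~/.config/dask/`:

```python
import dask

# placeholder thresholds for when distributed workers start spilling data to disk
dask.config.set({
    "distributed.worker.memory.target": 0.80,
    "distributed.worker.memory.spill": 0.90,
})
```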
> dask.distributed.Client() will use all available cores by default and this causes a segmentation fault error

I guess that happens on some systems? It seems like something else is causing the segfault. On my laptop it's no problem to start as many workers as there are CPU threads, and that is the default behavior of Dask for a reason.
Thanks for explaining. I did some tests, and it looks like the Dask configuration didn't fix the issue. When I added parallel=True back, the segmentation fault errors showed up again on macOS and Linux, but not on Windows; see the GA workflows below. We only use parallel=True for fapar_lai.py and not for the other datasets. Do you know why it's needed? After refactoring the code as below, the tests are passing locally. What do you think?
```python
import xarray as xr
from dask.distributed import Client

client = Client()  # local cluster; the parallel open/preprocess runs on its workers
ds = xr.open_mfdataset(files, parallel=True)
client.close()
```
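A possible alternative, sketched under the same assumptions, is a context manager so the client is closed even if opening the files fails:

```python
from dask.distributed import Client

# the client only needs to live for the parallel open/preprocess step
with Client(n_workers=2, threads_per_worker=1):
    ds = xr.open_mfdataset(files, parallel=True)
```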
Issue #65 has been submitted.
> Do you know why it's needed? After refactoring the code as below, the tests are passing locally. What do you think?

It probably sped up the load step, or at least forced Dask usage. If you instead make sure a Dask client is always available, the loading should also be parallelized.
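A hypothetical sketch of that idea; `run_recipe` and its parameters are illustrative, not zampy's actual API:

```python
from dask.distributed import Client


def run_recipe(recipe_path: str, n_workers: int | None = None) -> None:
    """Run a recipe with one Dask client available for every dataset's load()."""
    # a single client started at the entry point, so load() never has to
    # create (or forget to create) its own
    with Client(n_workers=n_workers, threads_per_worker=1):
        ...  # parse the recipe and call each dataset's download/ingest/load here
```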
Co-authored-by: Bart Schilperoort <[email protected]>
@BSchilperoort Thanks for your comments. If there are no major issues, can we merge this PR?
Looks good now, nice work 👍
This PR introduces the main changes:

The STEMMUS_SCOPE_input.yml recipe run was successful. Although I reduced the time span in the recipe (1 month instead of 6 months), it seems that the CDS is fast now.

Closes #59
Closes #63