
Move to new cds #62

Merged: 42 commits merged into main from fix_59 on Nov 25, 2024
Conversation

@SarahAlidoost (Member) commented on Oct 16, 2024:

This PR introduces the following main changes:

  • support the new CDS API
  • support the new changes in xarray_regrid
  • add support for selecting spatial bounds for land cover
  • update the docs
  • fix a few issues

The recipe STEMMUS_SCOPE_input.yml ran successfully. Although I reduced the time range in the recipe (1 month instead of 6 months), the CDS seems to be fast now.

closes #59
closes #63

@SarahAlidoost marked this pull request as ready for review on October 22, 2024 at 14:23
@SarahAlidoost (Member Author) commented:

@BSchilperoort when you have time, can you please review this PR? I explained most of the changes in this comment.

@BSchilperoort (Contributor) left a comment:

I did not try to run it yet, but here are some things I noticed in the changes.

```diff
@@ -150,8 +150,7 @@ def load(
         variable_names: list[str],
     ) -> xr.Dataset:
         files = list((ingest_dir / self.name).glob("*.nc"))

-        ds = xr.open_mfdataset(files, parallel=True)
+        ds = xr.open_mfdataset(files)
```
@BSchilperoort (Contributor):

The parallel kwarg was there to ensure the dataset is opened as Dask arrays, to avoid memory issues. Will that still go OK now?

@SarahAlidoost (Member Author):

If the parallel kwarg is True, the open and preprocess steps are performed in parallel using Dask, so we need to configure Dask with the number of jobs or CPUs to use. Otherwise dask.distributed.Client() uses all available cores by default, and that causes a segmentation fault error. This is the case in cli.py, where n_workers is not set, see here.
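For illustration, a minimal sketch of starting a client with an explicit number of workers so the parallel open does not spawn one process per core; the worker counts and glob pattern are placeholders, not this project's actual settings:

```python
# Illustrative sketch only; n_workers, threads_per_worker, and the glob
# pattern are placeholder values, not the project's configuration.
import xarray as xr
from dask.distributed import Client

# Start a local cluster with a bounded number of workers instead of one per core.
client = Client(n_workers=2, threads_per_worker=2)

# With this client active, parallel=True opens and preprocesses the files
# as Dask-backed arrays on the bounded cluster.
ds = xr.open_mfdataset("ingest/fapar_lai/*.nc", parallel=True)

client.close()
```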

@BSchilperoort (Contributor):

OK. I do think that using dask.distributed is quite important for performance. You can configure the defaults with a config file: https://docs.dask.org/en/latest/configuration.html#specify-configuration
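For example, defaults can also be set programmatically via dask.config; the key and value below are illustrative, not settings this project uses:

```python
# Illustrative only: set a Dask default in-process instead of via the YAML config file.
import dask

# Override a configuration value for this Python process.
dask.config.set({"array.chunk-size": "128MiB"})

# Read back the active value to confirm it took effect.
print(dask.config.get("array.chunk-size"))
```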

> dask.distributed.Client() uses all available cores by default, and that causes a segmentation fault error

I guess that happens on some systems? It seems like something else must be causing the segfault. On my laptop there is no problem starting as many workers as there are CPU threads, and that is Dask's default behavior for a reason.

@SarahAlidoost (Member Author):

Thanks for explaining. I did some tests, and it looks like the Dask configuration didn't fix the issue. When I added parallel=True back, the segmentation fault errors showed up again on macOS and Linux, but not on Windows; see the GA workflows below. We only use parallel=True in fapar_lai.py and not in the other datasets. Do you know why it's needed there? After refactoring the code as below, the tests pass locally. What do you think?

```python
from dask.distributed import Client

# Start a temporary local cluster so open_mfdataset can load `files` in parallel.
client = Client()
ds = xr.open_mfdataset(files, parallel=True)
client.close()
```

@SarahAlidoost (Member Author):

issue #65 submitted.

@BSchilperoort (Contributor):

> Do you know why it's needed there? After refactoring the code as below, the tests pass locally. What do you think?

It probably sped up the load step, or at least forced Dask usage.

If you instead make sure a Dask client is always available, the loading should also be parallelized.
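For illustration, a minimal sketch of keeping a client available around the load step so parallel=True always has a cluster to run on; the helper name, worker counts, and glob pattern are made up for the example:

```python
# Hypothetical helper, not the project's code: keep a Dask client alive while
# xr.open_mfdataset(parallel=True) opens and loads the files.
import xarray as xr
from dask.distributed import Client

def load_parallel(pattern: str) -> xr.Dataset:
    # The context manager guarantees the cluster is shut down after loading.
    with Client(n_workers=2, threads_per_worker=2):
        ds = xr.open_mfdataset(pattern, parallel=True)
        # Materialize the data while the cluster is still running.
        return ds.load()

ds = load_parallel("ingest/fapar_lai/*.nc")  # placeholder path
```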

@SarahAlidoost (Member Author) commented:

@BSchilperoort Thanks for your comments. If there are no major issues, can we merge this PR?

@BSchilperoort (Contributor) left a comment:

Looks good now, nice work 👍

@SarahAlidoost merged commit d95d0a5 into main on Nov 25, 2024
10 checks passed
@SarahAlidoost deleted the fix_59 branch on November 25, 2024 at 16:25
Successfully merging this pull request may close these issues: "spatial_bounds in cds_utils.cds_request_land_cover" and "Move to new CDS and ADS".