Move to new cds #62
Conversation
…version xarray_regrid v0.4.0
…n ingest function of land cover to avoid memory problems
@BSchilperoort when you have time, can you please review this PR? I explained most of the changes in this comment.
I did not try to run it yet, but here are some things I noticed in the changes.
src/zampy/datasets/fapar_lai.py (outdated diff):

```diff
@@ -150,8 +150,7 @@ def load(
         variable_names: list[str],
     ) -> xr.Dataset:
         files = list((ingest_dir / self.name).glob("*.nc"))

-        ds = xr.open_mfdataset(files, parallel=True)
+        ds = xr.open_mfdataset(files)
```
The parallel kwarg was there to ensure the dataset is opened as Dask arrays, to avoid any memory issues. Will that still go OK now?
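For reference, a minimal sketch (reusing `files` from the diff above) to check whether the opened dataset is still backed by lazy Dask arrays rather than loaded eagerly into memory:

```python
import xarray as xr

# open_mfdataset normally returns Dask-backed (lazy) arrays;
# parallel=True only parallelizes the per-file open/preprocess step
ds = xr.open_mfdataset(files)

# chunks is None for any variable that was loaded eagerly into memory
print({name: var.chunks for name, var in ds.data_vars.items()})
```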
If the parallel kwarg is True, the open and preprocess steps are performed in parallel using Dask, so we need to configure Dask with the number of jobs or CPUs to use. Otherwise dask.distributed.Client() uses all available cores by default, and this causes a segmentation fault error. This is the case in cli.py, where n_workers is not set, see here.
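A minimal sketch of the kind of explicit configuration meant here; the worker count is illustrative, and zampy's cli.py would need to pass it through:

```python
from dask.distributed import Client

# cap the number of workers instead of letting Client() spawn
# one worker per available core (values here are illustrative)
client = Client(n_workers=4, threads_per_worker=1)
```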
OK. I do think that making use of dask.distributed is quite important for performance. You can configure the defaults with a config file: https://docs.dask.org/en/latest/configuration.html#specify-configuration
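For example, a sketch of setting such defaults programmatically; the keys are real Dask settings, but the values are placeholders, and the same keys can live in a YAML file under `~/.config/dask/`:

```python
import dask

# placeholder thresholds for when distributed workers start spilling data to disk
dask.config.set({
    "distributed.worker.memory.target": 0.80,
    "distributed.worker.memory.spill": 0.90,
})
```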
> dask.distributed.Client() will use all available cores by default and this causes a segmentation fault error

I guess that happens on some systems? It seems like something else is causing the segfault. On my laptop it's no problem to start as many workers as there are CPU threads, and that is the default behavior of Dask for a reason.
Thanks for explaining. I did some tests, and it looks like the Dask configuration didn't fix the issue. When I added parallel=True back, the segmentation fault errors showed up again on macOS and Linux, but not on Windows; see the GA workflows below. We only use parallel=True for fapar_lai.py and not for the other datasets. Do you know why it's needed? After refactoring the code as below, the tests are passing locally. What do you think?
```python
import xarray as xr
from dask.distributed import Client

client = Client()  # local cluster; the parallel open/preprocess runs on its workers
ds = xr.open_mfdataset(files, parallel=True)
client.close()
```
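A possible alternative, sketched under the same assumptions, is a context manager so the client is closed even if opening the files fails:

```python
from dask.distributed import Client

# the client only needs to live for the parallel open/preprocess step
with Client(n_workers=2, threads_per_worker=1):
    ds = xr.open_mfdataset(files, parallel=True)
```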
Issue #65 has been submitted.
> Do you know why it's needed? After refactoring the code as below, the tests are passing locally. What do you think?

It probably sped up the load step, or at least forced Dask usage. If you instead make sure a Dask client is always available, the loading should also be parallelized.
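A hypothetical sketch of that idea; `run_recipe` and its parameters are illustrative, not zampy's actual API:

```python
from dask.distributed import Client


def run_recipe(recipe_path: str, n_workers: int | None = None) -> None:
    """Run a recipe with one Dask client available for every dataset's load()."""
    # a single client started at the entry point, so load() never has to
    # create (or forget to create) its own
    with Client(n_workers=n_workers, threads_per_worker=1):
        ...  # parse the recipe and call each dataset's download/ingest/load here
```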
Co-authored-by: Bart Schilperoort <[email protected]>
@BSchilperoort Thanks for your comments. If there are no major issues, can we merge this PR?
Looks good now, nice work 👍
This PR introduces the main changes:

The STEMMUS_SCOPE_input.yml recipe run was successful. Although I reduced the time span in the recipe (1 month instead of 6 months), it seems that the CDS is fast now.

Closes #59
Closes #63