Multiple Zarr to Rule them All #54

sadamov · 2024-06-06T14:04:52Z

Describe your changes

This PR is massive and was heavily discussed in the dev-team. It introduces a shift towards .zarr based data storage and .yaml based interaction with them using config files. The model itself should now be fully domain agnostic. All user input is either defined in aforementioned yaml files or via the train_model.py flags.

Please have a look at #24 to learn more about the main reasoning behind these changes. In short:

Simplify setup and collaboration across met services, academia and other insitutions with different input data
Fast, lazy and parallel datahandling out-of-the-box with xarray, zarr and dask
Easier combination of multiple data-sources for state, forcing and boundaries
Efficient compression choices for data storages using zarr

New dependencies: xarray, dask, zarr, pandas

BREAKING
While this PR tries to implement all functionalities present in main there is one major setup that is not supported anylonger. The training is based on (re)-analysis now and the reforecast based training previously used with e.g. MEPS is not supported. Especially the subsample_step logic has been removed from the code. This functionality can be re-introduced in "zarr-workflow" with a later PR. This will also break the tests!

MISSING
The whole handling of the boundary data is only implemented in the dataloading part but not actually used in the model. This will be added in a later PR. Parameter Weights are currently not used in the model since @joeloskarsson added probabilistic loss functions.

Issue Links

Solves #24

File by file changes

neural_lam/config.py This is the main workhorse and since I am not an expert in OOP, please double-check the decorators carefully (e.g. @leifdenby). I tried to set this up in a modular way where each function can be called independently. The main method called process_dataset() combines most of the individual steps and makes the data ready for the Torch Dataset Class. I tried to capture some errors and print some useful information without going all in on perfect software engineering. Since that seemed to match with the repo in other places. I am certain that other people will have different needs and am more than happy to add/change functionality already in this PR.
docs/download_danra.py This should be executed to run the tests and by anyone using the default config_data.py
neural_lam/models/ar_model.py Update the handling of static fields that are now supplied in zarr archives. Also, handling of batch_times mostly for plotting.
neural_lam/data_config.yaml Zarr based dataloading and specifications fully implemented. This is now documented in detail in README.md
neural_lam/utils.py Loader functions not required anylonger as this is now handled by the config.Config class.
neural_lam/weather_dataset.py This is now a classical Torch Dataset and PL DataLoader class based on xarray. All additional features like forcing calculation and windowing were placed in separate scripts.
tests The tests are broken, as mentioned below I will need some help here to fix them. I added my own test based on the zarr-yaml workflow. Currently 2/3 work okay.
calculate_statistics.py is now purely based on xarray and therefore natively fast, lazy and executed in parallel. This replaces create_parameter_weights.py. @leifdenby Which makes my midnight-merge last week kind of useless ^^. @joeloskarsson Is my understanding right that we don't require parameter_weights anylonger, given that people are using the right loss functions? I have removed all weighting (even vertical levels) from the code should these be re-introduced somewhere?
create_boundary_mask.py is one of many "helper"-scripts to generate static and forcing features if the user doesn't provide them themselves. As the name states, this one creates the old type of "in-domain" bordermask.
create_forcings.py stores datetime forcings previously calculated in the WeatherDataset in a zarr archive. This and previous helper-scripts replace create_grid_features.py that was previously used to load pre-computed and stored numpy arrays.
create_mesh.py and plot_graph.py the loading of xy-grid is now streamlined (and hopefully in the right order @bet20ICL) using the config.Config.get_xy()
train_model.py with the new dataloader class everything is neat and tidy

Remarks

I did some thorough testing of this code, but given the sheer size of the PR it was not easy. I am open to both in-depth reviews and/or splitting the PR up into multiple smaller ones.

@SimonKamuk the tests don't work anymore and I tried my best to re-implement them for DANRA. But the pooch does not seem to work with Danra. I am always getting access rights issues, unless I access them directly with xr.open_zarr("..."). Could you help me fix them? Currently, your tests are disabled as well as my last test.

That being said, I am pretty proud to say that it is now possible to fully train Neural-LAM on DANRA data (and COSMO btw) and here I want to show some technical results as proof:

This is the graph created by create_mesh.py and the new get_xy() logic:

This is the validation error_map after very few training steps and a subset of variables defined in data_config.yaml

This is the resulting map of a 10step * 3hour forecast using the model in --eval "test" mode:

Please, test this code with your own data over your own domain, this should help confirming that the model is truly domain-agnostic now.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist before requesting a review

My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

the code readable
the code well tested
the code documented
the code easy to maintain

Author checklist after completed review

I have added a line to the CHANGELOG describing this change (in section
reflecting type of change, for example "bug fixes", add section where
missing)

Checklist for assignee

PR is up to date with the base branch
the tests passing
author has added an entry to the changelog (and designated the change as added, changed or fixed)
Once the PR ready to be merged, squash commits and merge the PR.

user configs are retrieved either from data_config.yaml or they are set as flags to train_model.py Capabilities of ConfigLoader class extended

lat lon specifications make the code more flexible

Zarr is registered to model buffer Normalization happens on device on_after_batch_transfer

…on/neural-lam into feature_dataset_yaml

joeloskarsson · 2024-06-07T09:34:59Z

This is amazing stuff Simon! 😄 Looking forward to sink my teeth into all of this. I don't mind this being one big PR, as these changes are so interconnected.

I'll do a proper review of this later, but here's a short answer to the question about variable weights for now: I think we really want to keep the option to have a fixed variable weighting. There are in many cases good reasons to want to specify such weights manually rather than letting the model handle the balancing. I liked how it was handled before, with everything converted to standard-deviations that were either output by the model or derived from manual weights. That being said, I don't mind if we get rid of it with this PR and just re-introduce it later. As you rightly point out the logic for how to create and use parameter weights has to be reworked anyhow. We can make a new issue about that, but my initial thoughts would be to have such weights specified in the config yaml, either per-variable or using some generic strategy names (e.g. variable_weighting: uniform or variable_weighting: graphcast/pressure_proportional).

the behavior of xr.Dataset.dims will change soon

leifdenby · 2024-06-10T08:30:37Z

Yes, amazing work @sadamov! I will read through this over the next days and then we can may schedule a call to discuss in more detail? There is some of the reasoning in the structure I don't quite follow, but that is probably just because I have thought about some things slightly differently :)

mpvginde · 2024-06-13T07:20:44Z

HI @sadamov,
I was wondering if, with the current implementation here, it possible to specify different levels and/or level types for different upper air variables?
For example if I want u and v at 850 and 500 hPa but t at 925, 500 and 250 hPa.
Or more extreme, if I wanted u and v at 500 hPa and 100 m and t at 925 hPa and 150 m.

It seems that your example yaml-file, the levels keyword is defined at the category -level, but maybe it is possible and I need to read the code in more detail.

Kind regards,
Michiel

sadamov · 2024-06-13T10:06:17Z

@mpvginde That is currently not possible. I just discussed it with @leifdenby and this functionality will be added back before the PR is merged. Good point, thanks!

mpvginde · 2024-06-13T10:48:08Z

Thanks for the response.
Let me know if I can help!

sadamov · 2024-09-06T09:17:44Z

This PR is superceded by #66

Simon Adamov and others added 30 commits May 5, 2024 20:21

yaml_config for cosmo data

e5c245f

initial version of single zarr dataset

33e7ecf

handling None zarrs

9936e3b

removed all dependencies on constants.py

774d16a

user configs are retrieved either from data_config.yaml or they are set as flags to train_model.py Capabilities of ConfigLoader class extended

fix linter

7bb139b

Fixed calls to new WeatherDataModule Class

af076fe

fix linter

147caec

upload data config to wandb for history logs

2b65416

Improved handling of static data

ed9ed69

dask and zarr are required backends to xarray

0b69f4e

Implements windowed forcing and boundary

b76d078

Some project related stuff (simple setup to pip install -e .)

5d27a4c

introducing realistic boundaries

4dadf29

Adapted nwp_xy related code to new data loading procedure

7524c4d

only state requires units for plotting

45fd375

lat lon specifications make the code more flexible

small bugfixes and improvements

812323d

Calculate stats and store in zarr archive

500f2fb

Zarr is registered to model buffer Normalization happens on device on_after_batch_transfer

latex support

9293fe1

ar_steps for training and eval

e80aa58

smaller ammendments

a86fc07

Dummy mask was inverted - fixed

7ae9c87

replace hardcoded normalization path

93674a2

create 3D mesh objects for schol AR

f2cbc44

fixed math writing

c0e7529

Merge branch 'feature_dataset_yaml' of https://github.com/joeloskarss…

af7751a

…on/neural-lam into feature_dataset_yaml

cherry-pick with main

5f538f9

bugfixes

6685e94

pre_commits

6423fdf

Merge remote-tracking branch 'origin/main' into feature_dataset_yaml

59c4947

config.py is ready for danra

4e457ed

sadamov self-assigned this Jun 6, 2024

sadamov linked an issue Jun 6, 2024 that may be closed by this pull request

Pytorch Dataset Class that Reads From Zarr Archive #24

Closed

sadamov mentioned this pull request Jun 6, 2024

Creation of Static and Forcing Fields #34

Closed

sadamov modified the milestones: v0.2.0, v0.3.0 Jun 6, 2024

sadamov added 9 commits June 7, 2024 13:13

static dataset doesn't have time dim

124541b

making two complex functions more modular

6140fdb

chunk dataset by time

db6a912

create list first for performance

1aaa8dc

converting to_array is very slow

81856b2

the behavior of xr.Dataset.dims will change soon

allow for forcings to not be normalized

b3da818

allow non_normalized_vars to be null

7ee5398

fixed coastlines using new xy_extent function

4782103

Some projections return inverted axes (rotatedPole)

e0ffc5b

sadamov mentioned this pull request Jun 9, 2024

25 normalize data on gpu #39

Open

leifdenby mentioned this pull request Jun 10, 2024

Designing a Datastore class (for handling different storage backends for datasets) #58

Closed

Docstrings added

c1f43b7

joeloskarsson mentioned this pull request Jun 18, 2024

Merge global forecasting capabilities from prob_model_global #63

Open

11 tasks

leifdenby mentioned this pull request Jun 24, 2024

Modifying WeatherDataset.__getitem__ to return timestamps #64

Closed

leifdenby mentioned this pull request Jul 17, 2024

Add "datastores" to represent input data from zarr, npy, etc #66

Merged

20 tasks

sadamov added 2 commits August 7, 2024 09:19

Merge branch 'main' into feature_dataset_yaml

1a1f5b2

Bugfix training without forcing data

5f72fe3

sadamov closed this Sep 6, 2024

sadamov deleted the feature_dataset_yaml branch September 6, 2024 12:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple Zarr to Rule them All #54

Multiple Zarr to Rule them All #54

sadamov commented Jun 6, 2024 •

edited by joeloskarsson

Loading

joeloskarsson commented Jun 7, 2024

leifdenby commented Jun 10, 2024

mpvginde commented Jun 13, 2024

sadamov commented Jun 13, 2024

mpvginde commented Jun 13, 2024

sadamov commented Sep 6, 2024

Multiple Zarr to Rule them All #54

Multiple Zarr to Rule them All #54

Conversation

sadamov commented Jun 6, 2024 • edited by joeloskarsson Loading

Describe your changes

Issue Links

File by file changes

Remarks

Type of change

Checklist before requesting a review

Checklist for reviewers

Author checklist after completed review

Checklist for assignee

joeloskarsson commented Jun 7, 2024

leifdenby commented Jun 10, 2024

mpvginde commented Jun 13, 2024

sadamov commented Jun 13, 2024

mpvginde commented Jun 13, 2024

sadamov commented Sep 6, 2024

sadamov commented Jun 6, 2024 •

edited by joeloskarsson

Loading