Replace constants.py with data_config.yaml #31

sadamov · 2024-05-14T15:31:16Z

Summary
This PR replaces the constants.py file with a data_config.yaml file. Dataset-related settings can be defined by the user in the new yaml file. Training-specific settings were added as additional flags to the train_model.py routine. All references to the old file were updated accordingly.

Rationale

Using a YAML file for the data config gives much more flexibility for the various datasets used in the community. It also facilitates the future use of forcing and boundary datasets. In a follow-up PR the dataset paths will be defined in the YAML file, removing the dependency on a pre-structured /data folder.
It is best practice to define user input in a YAML file; using Python scripts for that purpose is uncommon.
The old constants.py combined both constants and variables; many of the "constants" should rather be flags to train_model.py.
The introduction of a new ConfigClass in utils.py allows for very specific queries of the YAML file and calculations based on them. This branch shows future possibilities of such a class: https://github.com/joeloskarsson/neural-lam/tree/feature_dataset_yaml
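
To illustrate the kind of query such a config class makes possible, here is a minimal sketch. It is not the class from this PR: the method `num_state_vars` and the keys `dataset`, `name`, `var_names` and `num_forcing_dim` are assumptions chosen for illustration, and the dictionary stands in for whatever a YAML parser would return for data_config.yaml.

```python
from typing import Any


class ConfigLoader:
    """Attribute-style, read-only view of an already parsed YAML mapping."""

    def __init__(self, values: dict):
        self._values = values

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._values[name]
        except KeyError as err:
            raise AttributeError(f"no config entry named {name!r}") from err
        # Wrap nested mappings so chained access such as cfg.dataset.name works.
        return ConfigLoader(value) if isinstance(value, dict) else value

    def num_state_vars(self) -> int:
        """Example of a derived query computed from the config contents."""
        return len(self.dataset.var_names)


# Stand-in for the parsed contents of data_config.yaml (keys are hypothetical).
raw_config = {
    "dataset": {
        "name": "meps_example",
        "var_names": [
            "z_isobaricInhPa_1000_instant",
            "z_isobaricInhPa_500_instant",
        ],
        "num_forcing_dim": 16,
    }
}

config = ConfigLoader(raw_config)
print(config.dataset.name, config.num_state_vars(), config.dataset.num_forcing_dim)
```

Keeping derived quantities as methods means the YAML file only has to list the variables; counts and similar values are computed rather than maintained by hand.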

Testing
Both training and evaluation of the model were successfully tested with the meps_example dataset.

Note
@leifdenby Could you invite Thomas R. to this repo, in case he wants to give his input on the yaml file? This PR should mostly serve as a basis for discussion. Maybe we should add more information to the yaml file as you outline in https://github.com/mllam/mllam-data-prep. I think we should always keep in mind what the repository will look like with realistic boundary conditions and zarr-archives as data-input.

This PR solves parts of #23



leifdenby · 2024-05-15T09:13:26Z

From a quick glance this looks simply amazing @sadamov! Thanks for doing this work. I will give a thorough review later today/tomorrow. Just tagging @SimonKamuk to have a read and give your thoughts too. I've added @ThomasRieutord to the organisation too. I'll also send Thomas an email so that he definitely sees the PR.

leifdenby

I really like this work @sadamov! You've really caught all the bits here (which I'm impressed you've done considering we don't have any tests right now!)

I have just made a few comments/suggestions. Let me know what you think :)


leifdenby · 2024-05-16T18:50:47Z

neural_lam/data_config.yaml

+    - wvint_entireAtmosphere_0_instant
+    - z_isobaricInhPa_1000_instant
+    - z_isobaricInhPa_500_instant
+  forcing_dim: 16


What does forcing_dim refer to? Is it the number of forcing features? In that case maybe we should call this num_forcing_features instead? The current name implies to me that "dimension 16" is used for forcing or that there are 16 forcing dimensions :)

Shouldn't the "forcing variables" be named too actually? We don't have to do this in this PR, but maybe we should consider that in future

In a future PR the forcings will be provided by a path to a zarr archive containing forcing features. Since in the current MEPS implementation the calculation of forcings is heavily integrated into the Dataset/Dataloader, I suggest changing the name to num_forcing_dim for now and implementing the fundamental change of naming the forcing variables once the zarr-based approach has been merged into main. See https://github.com/mllam/neural-lam/tree/feature_dataset_yaml

Sounds good to me! Happy to have the only change here be the rename to num_forcing_dim


sadamov · 2024-05-21T10:26:37Z

I implemented most requested changes in the latest commit and requested one more review. From my side we are clear to merge. The latest changes were again tested for model training and evaluation.

leifdenby · 2024-05-21T14:34:03Z

neural_lam/config.py

I'm not saying we should do this now, but I learnt more about the Meteo-France work on neural-lam this morning and they make quite heavy use of python dataclasses for configuration storage and schema. This could be something to consider when we want to make the config content more explicit.
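
As a rough idea of what a dataclass-based schema could look like, here is a minimal sketch. It is not taken from the Meteo-France code or from this PR; the class `DatasetConfig` and its fields are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical explicit schema for the dataset section of the config."""

    name: str
    var_names: list = field(default_factory=list)
    num_forcing_dim: int = 0

    def __post_init__(self):
        # Validate values once, at construction time.
        if self.num_forcing_dim < 0:
            raise ValueError("num_forcing_dim must be non-negative")

    @classmethod
    def from_dict(cls, raw: dict) -> "DatasetConfig":
        # Unknown keys raise TypeError, so typos in the config surface early.
        return cls(**raw)


dataset_cfg = DatasetConfig.from_dict(
    {
        "name": "meps_example",
        "var_names": ["z_isobaricInhPa_500_instant"],
        "num_forcing_dim": 16,
    }
)
print(dataset_cfg)
```

The schema then lives in one place: field names, defaults and validation are visible in the class definition instead of being implied by how the config is accessed throughout the code.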

leifdenby

Looks great! Thanks again @sadamov !

leifdenby · 2024-05-21T14:48:15Z

Remember to update the changelog before you merge @sadamov! I think I would call this a new (very useful!) feature. A few things also change here (where constants are stored and thus how they are accessed in the code)

leifdenby · 2024-05-22T10:13:02Z

Hurraay! 🥳

joeloskarsson · 2024-05-24T14:46:32Z

train_model.py

+    parser.add_argument(
+        "--var_leads_metrics_watch",
+        type=dict,
+        default={},


@sadamov Can you pass a dict as input on the command line? I could not figure out a way to use this option

Ah no, you can't. I'll make a short PR to fix three bugs that I introduced in this PR. One of them is the dictionary here.
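
The reason `type=dict` fails is that argparse calls the `type` callable on the raw command-line string, and `dict` cannot build a dictionary from such a string. One possible way to accept a dictionary is to pass it as a JSON string, sketched below. This is an illustration under assumptions, not necessarily the exact fix that was applied later; the help text and the example value are invented.

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument(
    "--var_leads_metrics_watch",
    type=json.loads,  # parse the raw string as JSON
    default="{}",  # string defaults are also passed through `type`
    help="JSON string mapping variable index to a list of lead times",
)

# Simulated command line; in a shell the JSON must be quoted as one argument.
args = parser.parse_args(["--var_leads_metrics_watch", '{"1": [1, 2], "2": [1]}'])

# JSON object keys are always strings, so convert them back to integers.
var_leads = {int(key): leads for key, leads in args.var_leads_metrics_watch.items()}
print(var_leads)
```

Invalid JSON raises a `ValueError` subclass inside `json.loads`, which argparse reports as a normal usage error.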

joeloskarsson · 2024-05-24T14:47:07Z

Wonderful job with this @sadamov!

### Summary
#31 introduced three minor bugs that are fixed with this PR:
- r"" strings are not required in the units of `data_config.yaml`
- dictionaries cannot be passed as argparse arguments; JSON strings are used instead. This bug is related to the flag `var_leads_metrics_watch`

---------

Co-authored-by: joeloskarsson <[email protected]>

Simon Adamov added 23 commits May 5, 2024 20:21

yaml_config for cosmo data

e5c245f

initial version of single zarr dataset

33e7ecf

handling None zarrs

9936e3b

removed all dependencies on constants.py

774d16a

user configs are retrieved either from data_config.yaml or they are set as flags to train_model.py Capabilities of ConfigLoader class extended

fix linter

7bb139b

Fixed calls to new WeatherDataModule Class

af076fe

fix linter

147caec

upload data config to wandb for history logs

2b65416

Improved handling of static data

ed9ed69

dask and zarr are required backends to xarray

0b69f4e

Implements windowed forcing and boundary

b76d078

Some project related stuff (simple setup to pip install -e .)

5d27a4c

introducing realistic boundaries

4dadf29

Adapted nwp_xy related code to new data loading procedure

7524c4d

only state requires units for plotting

45fd375

lat lon specifications make the code more flexible

small bugfixes and improvements

812323d

Calculate stats and store in zarr archive

500f2fb

Zarr is registered to model buffer Normalization happens on device on_after_batch_transfer

latex support

9293fe1

ar_steps for training and eval

e80aa58

smaller ammendments

a86fc07

Dummy mask was inverted - fixed

7ae9c87

replace hardcoded normalization path

93674a2

constants.py converted into yaml-file

244284c

test-case based on meps-example

sadamov added the enhancement New feature or request label May 14, 2024

sadamov requested a review from leifdenby May 14, 2024 15:31

sadamov self-assigned this May 14, 2024

joeloskarsson reviewed May 14, 2024


leifdenby added this to the v0.2.0 milestone May 15, 2024

sadamov mentioned this pull request May 15, 2024

Replace constants.py with data + region specification from yaml-file #23

Closed

leifdenby reviewed May 16, 2024


leifdenby mentioned this pull request May 16, 2024

Refactor codebase into a python package #32

Merged

Implementation PR-review feedback

0ba441b

sadamov requested a review from leifdenby May 21, 2024 10:25

fix linter

37bdf8f

leifdenby reviewed May 21, 2024


leifdenby approved these changes May 21, 2024


Simon Adamov added 2 commits May 22, 2024 09:58

Merge remote-tracking branch 'origin/main' into feature_yaml

a435382

Updated changelog for future references

5d10591

sadamov merged commit 4a97a12 into main May 22, 2024
1 check passed

sadamov deleted the feature_yaml branch May 22, 2024 08:22

joeloskarsson reviewed May 24, 2024


sadamov mentioned this pull request May 25, 2024

Three minor bugfixes for data_config.yaml workflow #40

Merged
