add section for extra info in config (#18)

Add optional section to config called `extra` that allows the user to include arbitrary information that is ignored by mllam-data-prep. This is added so that this extra information can be used in downstream applications (for example neural-lam will use this for projection info for now, until we are able to complete setting projection info correctly in a CF-compliant manner).
mllam · Nov 20, 2024 · 1a01bc0
1 parent 4eceee8
commit 1a01bc0

Showing 10 changed files with 284 additions and 416 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,7 +5,22 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v0.4.0
+## Unreleased
+
+[All changes](https://github.com/mllam/mllam-data-prep/compare/v0.4.0...HEAD)
+
+### Added
+
+- Add optional section called `extra` to config file to allow for user-defined extra information that is ignored by `mllam-data-prep` but can be used by downstream applications. [\#18](https://github.com/mllam/mllam-data-prep/pull/18), @leifdenby
+
+### Changed
+
+- Schema version bumped to `v0.5.0` to match next expected release that will support optional `extra` section in config [\#18](https://github.com/mllam/mllam-data-prep/pull/18)
+
+
+## [v0.4.0](https://github.com/mllam/mllam-data-prep/releases/tag/v0.4.0)
+
+[All changes](https://github.com/mllam/mllam-data-prep/compare/v0.3.0...v0.4.0)
 
 This release adds support for defining the output path in the command line
 interface and addresses bugs around optional dependencies for

diff --git a/README.md b/README.md
@@ -112,7 +112,7 @@ ds = mdp.create_dataset(config=config)
 A full example configuration file is given in [example.danra.yaml](example.danra.yaml), and reproduced here for completeness:
 
 ```yaml
-schema_version: v0.2.0
+schema_version: v0.5.0
 dataset_version: v0.1.0
 
 output:
@@ -201,13 +201,25 @@ inputs:
         name_format: "{var_name}"
     target_output_variable: static
 
+extra:
+  projection:
+    class_name: LambertConformal
+    kwargs:
+      central_longitude: 25.0
+      central_latitude: 56.7
+      standard_parallels: [56.7, 56.7]
+      globe:
+        semimajor_axis: 6367470.0
+        semiminor_axis: 6367470.0
 ```
 
 Apart from identifiers to keep track of the configuration file format version and the dataset version (for you to keep track of changes that you make to the dataset), the configuration file is divided into two main sections:
 
 - `output`: defines the variables and dimensions of the output dataset produced by `mllam-data-prep`. These are the variables and dimensions that the input datasets will be mapped to. These output variables and dimensions should match the input variables and dimensions expected by the model architecture you are training.
 - `inputs`: a list of source datasets to extract data from. These are the datasets that will be mapped to the variables and dimensions defined in the `output` section.
 
+If you want to add any extra information to the configuration file you can add it to the `extra` section. This section is not used or validated by `mllam-data-prep` but can be used to store any extra information you want to keep track of (for example when using `mllam-data-prep` with [neural-lam](https://github.com/mllam/neural-lam) the `extra` section is used to store the projection information).
+
 ### The `output` section
 
 ```yaml
@@ -308,3 +320,10 @@ The `inputs` section defines the source datasets to extract data from. Each sour
   - `rename`: simply rename the dimension to the new name
   - `stack`: stack the listed dimension to create the dimension in the output
   - `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.
+
+
+### Config schema versioning
+
+The schema version of the configuration file is defined by the `schema_version` attribute at the top of the configuration file. This is used to keep track of changes to the configuration file format. The schema version is used to check that the configuration file is compatible with the version of `mllam-data-prep` that you are using. If the schema version of the configuration file is not compatible with the version of `mllam-data-prep` that you are using you will get an error message telling you that the schema version is not compatible.
+
+The schema version is updated whenever the configuration format changes, with the new schema version matching the minimum version of `mllam-data-prep` that is required to use the new configuration format. As `mllam-data-prep` is still in rapid development (and hasn't reached version `v1.0.0` yet) we unfortunately make no guarantee about backward compatibility. However, the [CHANGELOG.md](CHANGELOG.md) will always contain migration instructions when the config format changes.
diff --git a/example.danra.yaml b/example.danra.yaml
@@ -1,4 +1,4 @@
-schema_version: v0.2.0
+schema_version: v0.5.0
 dataset_version: v0.1.0
 
 output:
@@ -86,3 +86,14 @@ inputs:
         method: stack_variables_by_var_name
         name_format: "{var_name}"
     target_output_variable: static
+
+extra:
+  projection:
+    class_name: LambertConformal
+    kwargs:
+      central_longitude: 25.0
+      central_latitude: 56.7
+      standard_parallels: [56.7, 56.7]
+      globe:
+        semimajor_axis: 6367470.0
+        semiminor_axis: 6367470.0
diff --git a/mllam_data_prep/config.py b/mllam_data_prep/config.py
@@ -282,6 +282,10 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
     inputs: Dict[str, InputDataset]
         Input datasets for the model. The keys are the names of the datasets and the values are
         the input dataset configurations.
+    extra: Dict[str, Any]
+        Extra information to include in the config file. This will be ignored by the
+        `mllam_data_prep` library, but can be used to include additional information
+        that is useful for the user.
     schema_version: str
         Version string for the config file schema.
     dataset_version: str
@@ -292,6 +296,7 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
     inputs: Dict[str, InputDataset]
     schema_version: str
     dataset_version: str
+    extra: Dict[str, Any] = None
 
     class _(JSONWizard.Meta):
         raise_on_unknown_json_key = True

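Because `extra` defaults to `None` and its contents are not validated, a downstream consumer has to guard against a missing section or missing keys itself. A minimal sketch of how that could look, assuming the section has already been parsed into a plain dict; `get_projection_kwargs` is a hypothetical helper and not part of `mllam-data-prep` or neural-lam:

```python
from typing import Any, Dict, Optional


def get_projection_kwargs(extra: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the projection kwargs stored in `extra`, or {} if absent.

    Hypothetical helper for illustration only. It tolerates a missing
    `extra` section, a missing `projection` entry and missing `kwargs`.
    """
    if extra is None:
        return {}
    projection = extra.get("projection") or {}
    # copy so that callers cannot mutate the parsed config by accident
    return dict(projection.get("kwargs") or {})


# mirrors part of the `extra` section in example.danra.yaml
example_extra = {
    "projection": {
        "class_name": "LambertConformal",
        "kwargs": {"central_longitude": 25.0, "central_latitude": 56.7},
    }
}

projection_kwargs = get_projection_kwargs(example_extra)
```

Returning an empty dict rather than raising keeps configs without an `extra` section usable; a stricter consumer might prefer to raise an error instead.
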
diff --git a/mllam_data_prep/create_dataset.py b/mllam_data_prep/create_dataset.py
@@ -15,6 +15,10 @@
 from .ops.selection import select_by_kwargs
 from .ops.statistics import calc_stats
 
+# the `extra` field in the config that was added between v0.2.0 and v0.5.0 is
+# optional, so we can support both v0.2.0 and v0.5.0
+SUPPORTED_CONFIG_VERSIONS = ["v0.2.0", "v0.5.0"]
+
 
 def _check_dataset_attributes(ds, expected_attributes, dataset_name):
     # check that the dataset has the expected attributes with the expected values
@@ -102,6 +106,18 @@ def create_dataset(config: Config):
         The dataset created from the input datasets with a variable for each output
         as defined in the config file.
     """
+    if config.schema_version not in SUPPORTED_CONFIG_VERSIONS:
+        raise ValueError(
+            f"Unsupported schema version {config.schema_version}. Only schema versions "
+            f"{', '.join(SUPPORTED_CONFIG_VERSIONS)} are supported by mllam-data-prep "
+            f"v{__version__}."
+        )
+    if config.schema_version == "v0.2.0" and config.extra is not None:
+        raise ValueError(
+            "Config schema version v0.2.0 does not support the `extra` field. Please "
+            "update the schema version used in your config to v0.5.0."
+        )
+
     output_config = config.output
     output_coord_ranges = output_config.coord_ranges
 
@@ -241,10 +257,6 @@ def create_dataset_zarr(fp_config, fp_zarr: str = None):
     """
     config = Config.from_yaml_file(file=fp_config)
 
-    assert (
-        config.schema_version == "v0.2.0"
-    ), f"Expected schema version v0.2.0, got {config.schema_version}"
-
     ds = create_dataset(config=config)
 
     logger.info("Writing dataset to zarr")
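
The version check added to `create_dataset` can be exercised in isolation. The sketch below restates the same two rules as a standalone function so that they can be tested without building a full `Config`; `check_schema_version` is a hypothetical name that does not exist in `mllam-data-prep`, and the error messages are shortened:

```python
from typing import Any, Dict, Optional

# same list as in the diff above
SUPPORTED_CONFIG_VERSIONS = ["v0.2.0", "v0.5.0"]


def check_schema_version(
    schema_version: str, extra: Optional[Dict[str, Any]] = None
) -> None:
    """Raise ValueError if the schema version or its use of `extra` is invalid.

    Hypothetical standalone restatement of the checks in `create_dataset`.
    """
    if schema_version not in SUPPORTED_CONFIG_VERSIONS:
        raise ValueError(
            f"Unsupported schema version {schema_version}. Only schema versions "
            f"{', '.join(SUPPORTED_CONFIG_VERSIONS)} are supported."
        )
    # `extra` only exists from v0.5.0 onwards, so reject it in older configs
    if schema_version == "v0.2.0" and extra is not None:
        raise ValueError(
            "Config schema version v0.2.0 does not support the `extra` field."
        )


# a v0.5.0 config may carry an `extra` section, a v0.2.0 config may omit it
check_schema_version("v0.5.0", extra={"projection": {}})
check_schema_version("v0.2.0")
```

Note that the second check compares against `None`, so an empty `extra: {}` section in a `v0.2.0` config is also rejected.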