add section for extra info in config (#18)
Add optional section to config called `extra` that allows the user to include arbitrary information that is ignored by mllam-data-prep. This is added so that this extra information can be used in downstream applications (for example neural-lam will use this for projection info for now, until we are able to complete setting projection info correctly in a CF-compliant manner).
leifdenby authored Nov 20, 2024
1 parent 4eceee8 commit 1a01bc0
Showing 10 changed files with 284 additions and 416 deletions.
17 changes: 16 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,22 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v0.4.0
## Unreleased

[All changes](https://github.com/mllam/mllam-data-prep/compare/v0.4.0...HEAD)

### Added

- Add optional section called `extra` to config file to allow for user-defined extra information that is ignored by `mllam-data-prep` but can be used by downstream applications. [\#18](https://github.com/mllam/mllam-data-prep/pull/18), @leifdenby

### Changed

- Schema version bumped to `v0.5.0` to match the next expected release, which will support the optional `extra` section in the config [\#18](https://github.com/mllam/mllam-data-prep/pull/18)


## [v0.4.0](https://github.com/mllam/mllam-data-prep/releases/tag/v0.4.0)

[All changes](https://github.com/mllam/mllam-data-prep/compare/v0.3.0...v0.4.0)

This release adds support for defining the output path in the command line
interface and addresses bugs around optional dependencies for
21 changes: 20 additions & 1 deletion README.md
@@ -112,7 +112,7 @@ ds = mdp.create_dataset(config=config)
A full example configuration file is given in [example.danra.yaml](example.danra.yaml), and reproduced here for completeness:

```yaml
schema_version: v0.2.0
schema_version: v0.5.0
dataset_version: v0.1.0

output:
@@ -201,13 +201,25 @@ inputs:
name_format: "{var_name}"
target_output_variable: static

extra:
  projection:
    class_name: LambertConformal
    kwargs:
      central_longitude: 25.0
      central_latitude: 56.7
      standard_parallels: [56.7, 56.7]
      globe:
        semimajor_axis: 6367470.0
        semiminor_axis: 6367470.0
```
Apart from identifiers to keep track of the configuration file format version and the dataset version (for you to keep track of changes that you make to the dataset), the configuration file is divided into two main sections:
- `output`: defines the variables and dimensions of the output dataset produced by `mllam-data-prep`. These are the variables and dimensions that the input datasets will be mapped to. These output variables and dimensions should match the input variables and dimensions expected by the model architecture you are training.
- `inputs`: a list of source datasets to extract data from. These are the datasets that will be mapped to the variables and dimensions defined in the `output` section.

If you want to add any extra information to the configuration file, you can put it in the optional `extra` section. This section is not used or validated by `mllam-data-prep`, but can be used to store any extra information you want to keep track of (for example, when using `mllam-data-prep` with [neural-lam](https://github.com/mllam/neural-lam) the `extra` section is used to store the projection information).
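
A downstream application could read the `extra` section roughly as follows. This is a minimal sketch, not part of `mllam-data-prep`: `get_projection_config` is a hypothetical helper, and the dictionary is a hand-written stand-in for the parsed `extra` section of the example config (assuming `globe` is nested under `kwargs`).

```python
from typing import Any, Dict, Optional


def get_projection_config(extra: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the `projection` entry of the `extra` section, or None if absent.

    Hypothetical helper for illustration; not part of mllam-data-prep.
    """
    # `extra` is optional, so it is None when the section is left out of the config
    if extra is None:
        return None
    return extra.get("projection")


# hand-written stand-in for the parsed `extra` section of the example config
extra = {
    "projection": {
        "class_name": "LambertConformal",
        "kwargs": {
            "central_longitude": 25.0,
            "central_latitude": 56.7,
            "standard_parallels": [56.7, 56.7],
            "globe": {"semimajor_axis": 6367470.0, "semiminor_axis": 6367470.0},
        },
    }
}

projection = get_projection_config(extra)
print(projection["class_name"], projection["kwargs"]["central_longitude"])
```

Because `mllam-data-prep` does not validate this section, the consuming application has to check for missing or malformed entries itself.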

### The `output` section

```yaml
@@ -308,3 +320,10 @@ The `inputs` section defines the source datasets to extract data from. Each sour
- `rename`: simply rename the dimension to the new name
- `stack`: stack the listed dimension to create the dimension in the output
- `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.


### Config schema versioning

The schema version of the configuration file is set by the `schema_version` attribute at the top of the file and tracks changes to the configuration file format. `mllam-data-prep` uses it to check that the configuration file is compatible with the version of `mllam-data-prep` you are using; if it is not, you will get an error message telling you that the schema version is not supported.

The schema version is updated whenever the configuration format changes, with the new schema version matching the minimum version of `mllam-data-prep` that is required to use the new configuration format. As `mllam-data-prep` is still in rapid development (and hasn't reached version `v1.0.0` yet) we unfortunately make no guarantee about backward compatibility. However, the [CHANGELOG.md](CHANGELOG.md) will always contain migration instructions when the config format changes.
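
The compatibility check can be pictured as a small standalone function. The sketch below is illustrative only: `check_schema_version` is a hypothetical name, and the two version strings are the ones this commit accepts in `create_dataset.py`, where `v0.5.0` is the first schema version to allow `extra`.

```python
from typing import Any, Dict, Optional, Sequence

# versions taken from this commit; "v0.5.0" is the first to allow `extra`
SUPPORTED_CONFIG_VERSIONS = ("v0.2.0", "v0.5.0")


def check_schema_version(
    schema_version: str,
    extra: Optional[Dict[str, Any]] = None,
    supported: Sequence[str] = SUPPORTED_CONFIG_VERSIONS,
) -> None:
    """Raise ValueError if the config cannot be used (hypothetical helper)."""
    if schema_version not in supported:
        raise ValueError(
            f"Unsupported schema version {schema_version}. "
            f"Supported versions: {', '.join(supported)}."
        )
    # older schema versions predate the `extra` section, so reject the combination
    if schema_version == "v0.2.0" and extra is not None:
        raise ValueError(
            "Schema version v0.2.0 does not support the `extra` section; "
            "set `schema_version: v0.5.0` instead."
        )


check_schema_version("v0.5.0", extra={"projection": {}})  # passes silently
check_schema_version("v0.2.0")  # passes: no `extra` section present
```

Under this scheme, an existing `v0.2.0` config without an `extra` section keeps working unchanged, and bumping `schema_version` is only needed once `extra` is added.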
13 changes: 12 additions & 1 deletion example.danra.yaml
@@ -1,4 +1,4 @@
schema_version: v0.2.0
schema_version: v0.5.0
dataset_version: v0.1.0

output:
@@ -86,3 +86,14 @@ inputs:
method: stack_variables_by_var_name
name_format: "{var_name}"
target_output_variable: static

extra:
  projection:
    class_name: LambertConformal
    kwargs:
      central_longitude: 25.0
      central_latitude: 56.7
      standard_parallels: [56.7, 56.7]
      globe:
        semimajor_axis: 6367470.0
        semiminor_axis: 6367470.0
5 changes: 5 additions & 0 deletions mllam_data_prep/config.py
@@ -282,6 +282,10 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
inputs: Dict[str, InputDataset]
Input datasets for the model. The keys are the names of the datasets and the values are
the input dataset configurations.
extra: Dict[str, Any]
Extra information to include in the config file. This will be ignored by the
`mllam_data_prep` library, but can be used to include additional information
that is useful for the user.
schema_version: str
Version string for the config file schema.
dataset_version: str
@@ -292,6 +296,7 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
inputs: Dict[str, InputDataset]
schema_version: str
dataset_version: str
extra: Dict[str, Any] = None

class _(JSONWizard.Meta):
raise_on_unknown_json_key = True
20 changes: 16 additions & 4 deletions mllam_data_prep/create_dataset.py
@@ -15,6 +15,10 @@
from .ops.selection import select_by_kwargs
from .ops.statistics import calc_stats

# the `extra` field in the config that was added between v0.2.0 and v0.5.0 is
# optional, so we can support both v0.2.0 and v0.5.0
SUPPORTED_CONFIG_VERSIONS = ["v0.2.0", "v0.5.0"]


def _check_dataset_attributes(ds, expected_attributes, dataset_name):
# check that the dataset has the expected attributes with the expected values
@@ -102,6 +106,18 @@ def create_dataset(config: Config):
The dataset created from the input datasets with a variable for each output
as defined in the config file.
"""
    if config.schema_version not in SUPPORTED_CONFIG_VERSIONS:
        raise ValueError(
            f"Unsupported schema version {config.schema_version}. Only schema versions "
            f"{', '.join(SUPPORTED_CONFIG_VERSIONS)} are supported by mllam-data-prep "
            f"v{__version__}."
        )
    if config.schema_version == "v0.2.0" and config.extra is not None:
        raise ValueError(
            "Config schema version v0.2.0 does not support the `extra` field. Please "
            "update the schema version used in your config to v0.5.0."
        )

output_config = config.output
output_coord_ranges = output_config.coord_ranges

@@ -241,10 +257,6 @@ def create_dataset_zarr(fp_config, fp_zarr: str = None):
"""
config = Config.from_yaml_file(file=fp_config)

assert (
config.schema_version == "v0.2.0"
), f"Expected schema version v0.2.0, got {config.schema_version}"

ds = create_dataset(config=config)

logger.info("Writing dataset to zarr")