
add section for extra info in config #18

Merged · 14 commits · Nov 20, 2024
17 changes: 16 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,22 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v0.4.0
## Unreleased

[All changes](https://github.com/mllam/mllam-data-prep/compare/v0.4.0...HEAD)

### Added

- Add optional section called `extra` to the config file to allow for user-defined extra information that is ignored by `mllam-data-prep` but can be used by downstream applications. [\#18](https://github.com/mllam/mllam-data-prep/pull/18), @leifdenby

### Changed

- Schema version bumped to `v0.5.0` to match next expected release that will support optional `extra` section in config [\#18](https://github.com/mllam/mllam-data-prep/pull/18)


## [v0.4.0](https://github.com/mllam/mllam-data-prep/releases/tag/v0.4.0)

[All changes](https://github.com/mllam/mllam-data-prep/compare/v0.3.0...v0.4.0)

This release adds support for defining the output path in the command line
interface and addresses bugs around optional dependencies for
21 changes: 20 additions & 1 deletion README.md
@@ -112,7 +112,7 @@ ds = mdp.create_dataset(config=config)
A full example configuration file is given in [example.danra.yaml](example.danra.yaml), and reproduced here for completeness:

```yaml
schema_version: v0.2.0
schema_version: v0.5.0
dataset_version: v0.1.0

output:
@@ -201,13 +201,25 @@ inputs:
name_format: "{var_name}"
target_output_variable: static

extra:
projection:
class_name: LambertConformal
kwargs:
central_longitude: 25.0
central_latitude: 56.7
standard_parallels: [56.7, 56.7]
globe:
semimajor_axis: 6367470.0
semiminor_axis: 6367470.0
```

Apart from identifiers to keep track of the configuration file format version and the dataset version (for you to keep track of changes that you make to the dataset), the configuration file is divided into two main sections:

- `output`: defines the variables and dimensions of the output dataset produced by `mllam-data-prep`. These are the variables and dimensions that the input datasets will be mapped to. These output variables and dimensions should match the input variables and dimensions expected by the model architecture you are training.
- `inputs`: a list of source datasets to extract data from. These are the datasets that will be mapped to the variables and dimensions defined in the `output` section.

If you want to add any extra information to the configuration file you can add it to the `extra` section. This section is not used or validated by `mllam-data-prep` but can be used to store any extra information you want to keep track of (for example when using `mllam-data-prep` with [neural-lam](https://github.com/mllam/neural-lam) the `extra` section is used to store the projection information).
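Since `mllam-data-prep` ignores the `extra` section, a downstream application has to read it itself. A minimal sketch of what that might look like — the dict below mirrors the parsed `extra` block from the example config above, and the `cartopy` call in the comment is only an assumption about how a consumer such as neural-lam might use it:

```python
# The `extra` block as it would appear after parsing the YAML config
# (values copied from the example config above).
config_extra = {
    "projection": {
        "class_name": "LambertConformal",
        "kwargs": {
            "central_longitude": 25.0,
            "central_latitude": 56.7,
            "standard_parallels": [56.7, 56.7],
        },
    },
}

proj_cfg = config_extra["projection"]
projection_class = proj_cfg["class_name"]
projection_kwargs = proj_cfg["kwargs"]
# a downstream consumer might then do something like (hypothetical):
#   crs = getattr(cartopy.crs, projection_class)(**projection_kwargs)
print(projection_class)  # → LambertConformal
```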

### The `output` section

```yaml
@@ -308,3 +320,10 @@ The `inputs` section defines the source datasets to extract data from. Each sour
- `rename`: simply rename the dimension to the new name
- `stack`: stack the listed dimension to create the dimension in the output
- `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.
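Conceptually, `stack_variables_by_var_name` can be sketched in plain Python — this is NOT the actual implementation (which operates on xarray datasets), just an illustration of how the `name_format` template builds the labels for the new stacked dimension:

```python
# Illustrative sketch only: variables sharing the same dimensions are
# concatenated along a new dimension, and `name_format` produces the
# coordinate label identifying each variable within the stack.
variables = {"u10m": [1.0, 2.0], "v10m": [3.0, 4.0]}
name_format = "{var_name}"

stacked_labels = [name_format.format(var_name=name) for name in variables]
stacked_values = [value for values in variables.values() for value in values]

print(stacked_labels)  # → ['u10m', 'v10m']
print(stacked_values)  # → [1.0, 2.0, 3.0, 4.0]
```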


### Config schema versioning

The schema version of the configuration file is defined by the `schema_version` attribute at the top of the configuration file. It tracks changes to the configuration file format and is used to check that a given configuration file is compatible with the version of `mllam-data-prep` you are using; if it is not, you will get an error message saying so.

The schema version is updated whenever the configuration format changes, with the new schema version matching the minimum version of `mllam-data-prep` that is required to use the new configuration format. As `mllam-data-prep` is still in rapid development (and hasn't reached version `v1.0.0` yet) we unfortunately make no guarantee about backward compatibility. However, the [CHANGELOG.md](CHANGELOG.md) will always contain migration instructions when the config format changes.
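The compatibility check described above can be sketched as follows (this mirrors the behaviour, but the exact error message is illustrative):

```python
# Schema versions this sketch accepts (matches the list used in this PR).
SUPPORTED_CONFIG_VERSIONS = ["v0.2.0", "v0.5.0"]

def check_schema_version(schema_version: str) -> None:
    """Raise ValueError if the config's schema_version is not supported."""
    if schema_version not in SUPPORTED_CONFIG_VERSIONS:
        raise ValueError(
            f"Unsupported schema version {schema_version}. Supported versions: "
            f"{', '.join(SUPPORTED_CONFIG_VERSIONS)}"
        )

check_schema_version("v0.5.0")  # passes silently
```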
13 changes: 12 additions & 1 deletion example.danra.yaml
@@ -1,4 +1,4 @@
schema_version: v0.2.0
schema_version: v0.5.0
dataset_version: v0.1.0

output:
@@ -86,3 +86,14 @@ inputs:
method: stack_variables_by_var_name
name_format: "{var_name}"
target_output_variable: static

extra:
projection:
class_name: LambertConformal
kwargs:
central_longitude: 25.0
central_latitude: 56.7
standard_parallels: [56.7, 56.7]
globe:
semimajor_axis: 6367470.0
semiminor_axis: 6367470.0
5 changes: 5 additions & 0 deletions mllam_data_prep/config.py
@@ -282,6 +282,10 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
inputs: Dict[str, InputDataset]
Input datasets for the model. The keys are the names of the datasets and the values are
the input dataset configurations.
extra: Dict[str, Any]
Extra information to include in the config file. This will be ignored by the
`mllam_data_prep` library, but can be used to include additional information
that is useful for the user.
schema_version: str
Version string for the config file schema.
dataset_version: str
Expand All @@ -292,6 +296,7 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
inputs: Dict[str, InputDataset]
schema_version: str
dataset_version: str
extra: Dict[str, Any] = None

class _(JSONWizard.Meta):
raise_on_unknown_json_key = True
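The effect of the `= None` default on `extra` can be sketched with a plain dataclass — a simplified stand-in, not the real `Config` class:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Simplified stand-in for Config (illustrative only): because `extra`
# defaults to None, configs written against the older v0.2.0 schema,
# which have no `extra` section, still construct without error.
@dataclass
class MiniConfig:
    output: Dict[str, Any]
    inputs: Dict[str, Any]
    schema_version: str
    dataset_version: str
    extra: Optional[Dict[str, Any]] = None

cfg = MiniConfig(output={}, inputs={}, schema_version="v0.5.0", dataset_version="v0.1.0")
print(cfg.extra)  # → None
```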
20 changes: 16 additions & 4 deletions mllam_data_prep/create_dataset.py
@@ -15,6 +15,10 @@
from .ops.selection import select_by_kwargs
from .ops.statistics import calc_stats

# the `extra` field in the config that was added between v0.2.0 and v0.5.0 is
# optional, so we can support both v0.2.0 and v0.5.0
SUPPORTED_CONFIG_VERSIONS = ["v0.2.0", "v0.5.0"]


def _check_dataset_attributes(ds, expected_attributes, dataset_name):
# check that the dataset has the expected attributes with the expected values
@@ -102,6 +106,18 @@ def create_dataset(config: Config):
The dataset created from the input datasets with a variable for each output
as defined in the config file.
"""
if config.schema_version not in SUPPORTED_CONFIG_VERSIONS:
raise ValueError(
f"Unsupported schema version {config.schema_version}. Only schema versions "
f"{', '.join(SUPPORTED_CONFIG_VERSIONS)} are supported by mllam-data-prep "
f"v{__version__}."
)
if config.schema_version == "v0.2.0" and config.extra is not None:
raise ValueError(
"Config schema version v0.2.0 does not support the `extra` field. Please "
"update the schema version used in your config to v0.5.0."
)

output_config = config.output
output_coord_ranges = output_config.coord_ranges

@@ -241,10 +257,6 @@ def create_dataset_zarr(fp_config, fp_zarr: str = None):
"""
config = Config.from_yaml_file(file=fp_config)

assert (
config.schema_version == "v0.2.0"
), f"Expected schema version v0.2.0, got {config.schema_version}"

ds = create_dataset(config=config)

logger.info("Writing dataset to zarr")