Add ability to derive variables and add selected derived forcings #34

Open · wants to merge 70 commits into main from feature/derive_forcings

Commits (70)
981d676
First attempt at adding derived forcings
ealerskans Oct 28, 2024
79a94db
Re-structure approach
ealerskans Nov 6, 2024
f37161c
Add derivation of cyclic encoded hour of day and day of year
ealerskans Nov 6, 2024
71afd3a
Add derivation of cyclic encoded time of year
ealerskans Nov 6, 2024
abb626b
Update and add docstrings
ealerskans Nov 6, 2024
8b1f18e
Remove time_of_year
ealerskans Nov 12, 2024
7854013
Provide the full namespace of the function
ealerskans Nov 12, 2024
7fa90bf
Rename the module with derived variables
ealerskans Nov 12, 2024
48c9e3e
Rename the function used for deriving variables
ealerskans Nov 12, 2024
8de9404
Redefine the config file for derived variables and how they are calcu…
ealerskans Nov 15, 2024
ffc030c
Remove derived variables from 'load_and_subset_dataset'
ealerskans Nov 15, 2024
692cdd3
Add try/except for derived variables when loading the dataset
ealerskans Nov 15, 2024
c0cd875
Chunk the input data with the defined output chunks
ealerskans Dec 5, 2024
55224f3
Update toa_radiation function name
ealerskans Dec 5, 2024
678ea52
Correct kwargs usage, add back dropped coordinates and return correct…
ealerskans Dec 5, 2024
9d2db07
Prepare for hour_of_day and day_of_year
ealerskans Dec 5, 2024
26455bc
Add optional 'attributes' to the config of 'derived_variables' and ch…
ealerskans Dec 6, 2024
fbb6065
Add dummy function for getting lat,lon (preparation for #33)
ealerskans Dec 9, 2024
3a12f48
Add function for chunking data and checking the chunk size
ealerskans Dec 9, 2024
3ace219
Add back coordinates on the subset instead of for each derived variab…
ealerskans Dec 9, 2024
a6b61b0
Add 'hour_of_day' to example config
ealerskans Dec 9, 2024
1814297
Merge branch 'main' into feature/derive_forcings
ealerskans Dec 9, 2024
9dcace6
Rename derived variables dataset section in the example config
ealerskans Dec 9, 2024
aba6757
Remove f-string from 'name_format'
ealerskans Dec 10, 2024
143edb6
Update README
ealerskans Dec 10, 2024
6aad6d7
Merge branch 'main' into feature/derive_forcings
ealerskans Dec 11, 2024
12e0575
Update CHANGELOG
ealerskans Dec 11, 2024
000ce92
Make functions for deriving toa_radiation and datetime forcings actua…
ealerskans Dec 11, 2024
0af6319
Update docstring and variable names in 'cyclic_encoding'
ealerskans Dec 11, 2024
284db91
Add ranges to lat and lon in docstring
ealerskans Dec 12, 2024
ba161d2
Add github username to CHANGELOG entry
ealerskans Dec 12, 2024
e3d590c
Update DerivedVariable attributes to be Dict[str, str]
ealerskans Dec 12, 2024
f8cae4f
Add missing attribute to docstring
ealerskans Dec 12, 2024
8470c82
Change var names in 'calculate_toa_radiation'
ealerskans Dec 12, 2024
69afdd3
Remove unnecessary 'or None'
ealerskans Dec 12, 2024
e17ed8b
Use var name 'dim' instead of 'd'
ealerskans Dec 12, 2024
23b119f
Use var names 'key, val' instead of 'k, v'
ealerskans Dec 12, 2024
2ce53c7
Move '_check_dataset_attributes' outside if statement
ealerskans Dec 12, 2024
f1e3d77
Set '{}' as default for 'attributes' and 'chunking'
ealerskans Dec 12, 2024
2afbb35
Make types more explicit
ealerskans Dec 13, 2024
75797a2
Rename 'ds_subset' to 'ds_derived_vars' and update comment for 'ds_in…
ealerskans Dec 13, 2024
31578e8
Add 'Optional[...]' to optional attributes
ealerskans Dec 13, 2024
90e4cf2
Move loading of dataset to a separate function
ealerskans Dec 13, 2024
717c6a5
Simplify if loops
ealerskans Dec 13, 2024
2856c6b
Update '_get_derived_variable_function'
ealerskans Dec 13, 2024
98673ee
Simplify checks of the derived fields
ealerskans Dec 13, 2024
8940e82
Issue warning saying that we assume coordinates are named 'lat' and '…
ealerskans Dec 13, 2024
e12e328
Update README to make it clear that 'attributes' is associated with '…
ealerskans Dec 13, 2024
ecdea30
Indicate that 'variables' and 'derived_variables' are mutually exclusive
ealerskans Dec 13, 2024
e3c0f22
Update docstring of 'InputDataset' class
ealerskans Dec 13, 2024
e907a6d
Correct types in '_check_attributes' docstring
ealerskans Dec 13, 2024
bb9be13
Use 'rpartition' to get 'module_name' and 'function_name'
ealerskans Dec 13, 2024
49de0b3
Add some initial tests for 'derived_variables'
ealerskans Dec 13, 2024
b268f01
Update docstrings and rename 'DerivedVariable.attributes' to 'Derived…
ealerskans Dec 17, 2024
dbd5bfd
Do not add 'attributes' to docstring
ealerskans Dec 17, 2024
474a83d
Remove unnecessary exception handling
ealerskans Dec 17, 2024
1da66e2
Move 'subset_dataset' to 'ops.subsetting'
ealerskans Dec 17, 2024
dc7dc5e
Move 'derived_variables' to 'ops'
ealerskans Dec 17, 2024
c9e96af
Move chunk size check to 'chunking' module
ealerskans Dec 17, 2024
47b8411
Add module docstring
ealerskans Dec 17, 2024
5ae772f
Update tests
ealerskans Dec 17, 2024
2c0bdf8
Add global REQUIRED_FIELD_ATTRIBUTES var and updated check for requir…
ealerskans Dec 18, 2024
f1ce6d1
Update long name for toa_radiation
ealerskans Dec 18, 2024
58d8af6
Update README
ealerskans Dec 18, 2024
f87b954
Return dropped coordinates to the data-arrays instead
ealerskans Dec 19, 2024
80cf058
Adds dims to the dataset to make it work with derived variables that …
ealerskans Dec 19, 2024
da0c171
Add ability to have 'variables' and 'derived_variables' in the same
ealerskans Dec 19, 2024
f61a3b6
Update README
ealerskans Dec 19, 2024
554f869
Add 'load_config' function, which wraps 'from_yaml_file' and checks t…
ealerskans Dec 20, 2024
085aae3
Update README
ealerskans Dec 20, 2024
CHANGELOG.md: 1 change (1 addition, 0 deletions)

@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- add ability to derive variables from input datasets [\#34](https://github.com/mllam/mllam-data-prep/pull/34), @ealerskans
- add github PR template to guide development process on github [\#44](https://github.com/mllam/mllam-data-prep/pull/44), @leifdenby

## [v0.5.0](https://github.com/mllam/mllam-data-prep/releases/tag/v0.5.0)
README.md: 70 changes (63 additions, 7 deletions)

@@ -103,7 +103,7 @@ The package can also be used as a python module to create datasets directly, for
import mllam_data_prep as mdp

config_path = "example.danra.yaml"
config = mdp.Config.from_yaml_file(config_path)
config = mdp.Config.load_config(config_path)
ds = mdp.create_dataset(config=config)
```

@@ -175,6 +175,18 @@ inputs:
variables:
# use surface incoming shortwave radiation as forcing
- swavr0m
derived_variables:
# derive variables to be used as forcings
toa_radiation:
kwargs:
time: time
lat: lat
lon: lon
function: mllam_data_prep.ops.derived_variables.calculate_toa_radiation
hour_of_day:
kwargs:
time: time
function: mllam_data_prep.ops.derived_variables.calculate_hour_of_day
dim_mapping:
time:
method: rename
@@ -286,15 +298,26 @@ inputs:
grid_index:
method: stack
dims: [x, y]
target_architecture_variable: state
target_output_variable: state

danra_surface:
path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
dims: [time, x, y]
variables:
# shouldn't really be using sea-surface pressure as "forcing", but don't
# have radiation varibles in danra yet
- pres_seasurface
# use surface incoming shortwave radiation as forcing
- swavr0m
derived_variables:
# derive variables to be used as forcings
toa_radiation:
kwargs:
time: time
lat: lat
lon: lon
function: mllam_data_prep.ops.derived_variables.calculate_toa_radiation
hour_of_day:
kwargs:
time: time
function: mllam_data_prep.ops.derived_variables.calculate_hour_of_day
dim_mapping:
time:
method: rename
@@ -305,7 +328,7 @@
forcing_feature:
method: stack_variables_by_var_name
name_format: "{var_name}"
target_architecture_variable: forcing
target_output_variable: forcing

...
```
@@ -315,11 +338,44 @@ The `inputs` section defines the source datasets to extract data from. Each source dataset
- `path`: the path to the source dataset. This can be a local path or a URL to e.g. a zarr dataset or netCDF file, anything that can be read by `xarray.open_dataset(...)`.
- `dims`: the dimensions that the source dataset is expected to have. This is used to check that the source dataset has the expected dimensions and also makes it clearer in the config file what the dimensions of the source dataset are.
- `variables`: selects which variables to extract from the source dataset. This may either be a list of variable names, or a dictionary where each key is the variable name and the value defines a dictionary of coordinates to select on. When doing selection you may also optionally define the units of the variable, which are then checked against the units of the corresponding variable in the model architecture.
- `target_architecture_variable`: the variable in the model architecture that the source dataset should be mapped to.
- `target_output_variable`: the variable in the model architecture that the source dataset should be mapped to.
- `dim_mapping`: defines how the dimensions of the source dataset should be mapped to the dimensions of the model architecture. This is done by defining a method to apply to each dimension. The methods are:
- `rename`: simply rename the dimension to the new name
  - `stack`: stack the listed dimensions to create the dimension in the output
- `stack_variables_by_var_name`: stack the dimension into the new dimension, and also stack the variable name into the new variable name. This is useful when you have multiple variables with the same dimensions that you want to stack into a single variable.
- `derived_variables`: defines the variables to be derived from the variables available in the source dataset. This should be a dictionary where each key is the name of the variable to be derived and the value defines a dictionary with the following additional information. See the 'Derived Variables' section below for more details.
  - `function`: the function used to derive the variable. This should be a string and may either be the full namespace of the function (e.g. `mllam_data_prep.ops.derived_variables.calculate_toa_radiation`) or, if the function is included in the `mllam_data_prep.ops.derived_variables` module, just the function name.
  - `kwargs`: the arguments to the function used to derive the variable. This is a dictionary where each key is the name of a variable to select from the source dataset and each value is the name of the corresponding argument to `function`.

#### Derived Variables
Variables that are not part of the source dataset but can be derived from variables in the source dataset can also be included. They should be defined in their own section, called `derived_variables` as illustrated in the example config above and in the `example.danra.yaml` config file.

To derive a variable, the function to use (`function`) and the arguments to that function (`kwargs`) need to be specified, as explained above. In addition, an optional section called `attrs` can be added, in which the user can set attributes on the derived variable, as illustrated below.
```yaml
derived_variables:
toa_radiation:
kwargs:
time: time
lat: lat
lon: lon
function: mllam_data_prep.ops.derived_variables.calculate_toa_radiation
attrs:
units: W*m**-2
long_name: top-of-atmosphere incoming radiation
```

Note that the attributes `units` and `long_name` are required. This means that if the function used to derive a variable does not set these attributes, they are **required** to be set in the config file. If using a function defined in `mllam_data_prep.ops.derived_variables`, the `attrs` section is optional, as the attributes should already be defined. In this case, adding the `units` and `long_name` attributes to the `attrs` section of the derived variable in the config file will overwrite the attributes already defined by the function.
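
If a needed variable isn't covered by the functions shipped with the package, any importable function can be referenced by its full namespace. As a minimal sketch (the module name `my_derivation_module` and the variable names `u10m`/`v10m` here are hypothetical, not part of `mllam-data-prep`), a derivation function simply receives the data-arrays named in `kwargs` and returns an `xr.DataArray` with the required attributes set:

```python
import numpy as np
import xarray as xr


def calculate_wind_speed(u: xr.DataArray, v: xr.DataArray) -> xr.DataArray:
    """Derive wind speed from the u and v wind components."""
    wind_speed = np.sqrt(u**2 + v**2)
    # setting the required attributes here means no `attrs` section
    # is needed for this variable in the config file
    wind_speed.attrs["units"] = "m/s"
    wind_speed.attrs["long_name"] = "wind speed"
    return wind_speed
```

which could then be referenced in the config as:

```yaml
derived_variables:
  wind_speed:
    kwargs:
      u10m: u
      v10m: v
    function: my_derivation_module.calculate_wind_speed
```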

Currently, the following derived variables are included as part of `mllam-data-prep`:
- `toa_radiation`:
- Top-of-atmosphere incoming radiation
- function: `mllam_data_prep.ops.derived_variables.calculate_toa_radiation`
- `hour_of_day`:
- Hour of day (cyclically encoded)
- function: `mllam_data_prep.ops.derived_variables.calculate_hour_of_day`
- `day_of_year`:
- Day of year (cyclically encoded)
- function: `mllam_data_prep.ops.derived_variables.calculate_day_of_year`
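
For context, "cyclically encoded" means that a periodic quantity is mapped onto a sine/cosine pair, so that values at the two ends of the period (e.g. hour 23 and hour 0) end up close together in feature space. A minimal sketch of the idea, not necessarily the package's exact implementation:

```python
import numpy as np


def cyclic_encode(values, period):
    """Encode a periodic quantity as a (sin, cos) pair on the unit circle."""
    angle = 2 * np.pi * np.asarray(values) / period
    return np.sin(angle), np.cos(angle)


# hour 23 and hour 0 land at neighbouring points on the unit circle
print(cyclic_encode([23.0, 0.0], period=24))
```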


### Config schema versioning
example.danra.yaml: 12 changes (12 additions, 0 deletions)

@@ -61,6 +61,18 @@ inputs:
variables:
# use surface incoming shortwave radiation as forcing
- swavr0m
derived_variables:
# derive variables to be used as forcings
toa_radiation:
kwargs:
time: time
lat: lat
lon: lon
function: mllam_data_prep.ops.derived_variables.calculate_toa_radiation
hour_of_day:
kwargs:
time: time
function: mllam_data_prep.ops.derived_variables.calculate_hour_of_day
dim_mapping:
time:
method: rename
mllam_data_prep/config.py: 95 changes (85 additions, 10 deletions)

@@ -52,6 +52,28 @@ class ValueSelection:
units: str = None


@dataclass
class DerivedVariable:
"""
Defines a derived variable, specifying the kwargs (the variables required
for the calculation) and the function (used to calculate the variable).
Optionally, in case the function does not return an `xr.DataArray` with
the required attributes (`units` and `long_name`) set, these should be
specified in `attrs`, e.g.
{"attrs": {"units": "W*m**-2", "long_name": "top-of-atmosphere radiation"}}.
Additional attributes can also be set if desired.

Attributes:
kwargs: Variables required for calculating the derived variable.
function: Function used to calculate the derived variable.
attrs: Attributes (e.g. `units` and `long_name`) to set for the derived variable.
"""

kwargs: Dict[str, str]
function: str
attrs: Optional[Dict[str, str]] = field(default_factory=dict)


@dataclass
class DimMapping:
"""
@@ -120,7 +142,8 @@ class InputDataset:
1) the path to the dataset,
2) the expected dimensions of the dataset,
3) the variables to select from the dataset (and optionally subsection
along the coordinates for each variable) and finally
along the coordinates for each variable) or the variables to derive
from the dataset, and finally
4) the method by which the dimensions and variables of the dataset are
mapped to one of the output variables (this includes stacking of all
the selected variables into a new single variable along a new coordinate,
@@ -134,11 +157,6 @@
dims: List[str]
List of the expected dimensions of the dataset. E.g. `["time", "x", "y"]`.
These will be checked to ensure consistency of the dataset being read.
variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]]
List of the variables to select from the dataset. E.g. `["temperature", "precipitation"]`
or a dictionary where the keys are the variable names and the values are dictionaries
defining the selection for each variable. E.g. `{"temperature": {"levels": {"values": [1000, 950, 900]}}}`
would select the "temperature" variable and only the levels 1000, 950, and 900.
dim_mapping: Dict[str, DimMapping]
Mapping of the variables and dimensions in the input dataset to the dimensions of the
output variable (`target_output_variable`). The key is the name of the output dimension to map to
@@ -151,14 +169,23 @@
(e.g. two datasets that coincide in space and time will only differ in the feature dimension,
so the two will be combined by concatenating along the feature dimension).
If a single shared coordinate cannot be found then an exception will be raised.
variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]]
List of the variables to select from the dataset. E.g. `["temperature", "precipitation"]`
or a dictionary where the keys are the variable names and the values are dictionaries
defining the selection for each variable. E.g. `{"temperature": {"levels": {"values": [1000, 950, 900]}}}`
would select the "temperature" variable and only the levels 1000, 950, and 900.
derived_variables: Dict[str, DerivedVariable]
Dictionary of variables to derive from the dataset, where the keys are the names the variables will be given and
the values are `DerivedVariable` definitions that specify how to derive a variable.
"""

path: str
dims: List[str]
variables: Union[List[str], Dict[str, Dict[str, ValueSelection]]]
dim_mapping: Dict[str, DimMapping]
target_output_variable: str
attributes: Dict[str, Any] = None
variables: Optional[Union[List[str], Dict[str, Dict[str, ValueSelection]]]] = None
derived_variables: Optional[Dict[str, DerivedVariable]] = None
attributes: Optional[Dict[str, Any]] = field(default_factory=dict)


@dataclass
@@ -258,7 +285,7 @@ class Output:

variables: Dict[str, List[str]]
coord_ranges: Dict[str, Range] = None
chunking: Dict[str, int] = None
chunking: Dict[str, int] = field(default_factory=dict)
splitting: Splitting = None


@@ -301,6 +328,54 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
class _(JSONWizard.Meta):
raise_on_unknown_json_key = True

@staticmethod
def load_config(*args, **kwargs):
"""
Wrapper function for `from_yaml_file` to load a config file and validate that:
- either `variables` or `derived_variables` is present in the config
- if both `variables` and `derived_variables` are present, they don't
add the same variables to the dataset

Parameters
----------
*args: Positional arguments for `from_yaml_file`
**kwargs: Keyword arguments for `from_yaml_file`

Returns
-------
config: Config
"""

# Load the config
config = Config.from_yaml_file(*args, **kwargs)

for input_dataset in config.inputs.values():
if not input_dataset.variables and not input_dataset.derived_variables:
raise InvalidConfigException(
"At least one of the keys `variables` and `derived_variables` must be included"
" in the input dataset."
)
elif input_dataset.variables and input_dataset.derived_variables:
# Check so that there are no overlapping variables
if isinstance(input_dataset.variables, list):
variable_vars = input_dataset.variables
elif isinstance(input_dataset.variables, dict):
variable_vars = input_dataset.variables.keys()
else:
raise TypeError(
f"Expected an instance of list or dict, but got {type(input_dataset.variables)}."
)
derived_variable_vars = input_dataset.derived_variables.keys()
common_vars = list(set(variable_vars) & set(derived_variable_vars))
if len(common_vars) > 0:
raise InvalidConfigException(
"Both `variables` and `derived_variables` include the following variables name(s):"
f" '{', '.join(common_vars)}'. This is not allowed. Make sure that there"
" are no overlapping variable names between `variables` and `derived_variables`,"
f" either by renaming or removing '{', '.join(common_vars)}' from one of them."
)
return config


if __name__ == "__main__":
import argparse
@@ -311,7 +386,7 @@
)
args = argparser.parse_args()

config = Config.from_yaml_file(args.f)
config = Config.load_config(args.f)
import rich

rich.print(config)
mllam_data_prep/create_dataset.py: 47 changes (39 additions, 8 deletions)

@@ -10,10 +12,12 @@

from . import __version__
from .config import Config, InvalidConfigException
from .ops.loading import load_and_subset_dataset
from .ops.derived_variables import derive_variables
from .ops.loading import load_input_dataset
from .ops.mapping import map_dims_and_variables
from .ops.selection import select_by_kwargs
from .ops.statistics import calc_stats
from .ops.subsetting import subset_dataset

# the `extra` field in the config that was added between v0.2.0 and v0.5.0 is
# optional, so we can support both v0.2.0 and v0.5.0
@@ -30,11 +32,14 @@ def _check_dataset_attributes(ds, expected_attributes, dataset_name):

# check for attributes having the wrong value
incorrect_attributes = {
k: v for k, v in expected_attributes.items() if ds.attrs[k] != v
key: val for key, val in expected_attributes.items() if ds.attrs[key] != val
}
if len(incorrect_attributes) > 0:
s_list = "\n".join(
[f"{k}: {v} != {ds.attrs[k]}" for k, v in incorrect_attributes.items()]
[
f"{key}: {val} != {ds.attrs[key]}"
for key, val in incorrect_attributes.items()
]
)
raise ValueError(
f"Dataset {dataset_name} has the following incorrect attributes: {s_list}"
@@ -120,23 +125,50 @@

output_config = config.output
output_coord_ranges = output_config.coord_ranges
chunking_config = config.output.chunking

dataarrays_by_target = defaultdict(list)

for dataset_name, input_config in config.inputs.items():
path = input_config.path
variables = input_config.variables
derived_variables = input_config.derived_variables
target_output_var = input_config.target_output_variable
expected_input_attributes = input_config.attributes or {}
expected_input_attributes = input_config.attributes
expected_input_var_dims = input_config.dims

output_dims = output_config.variables[target_output_var]

logger.info(f"Loading dataset {dataset_name} from {path}")
try:
ds = load_and_subset_dataset(fp=path, variables=variables)
ds_input = load_input_dataset(fp=path)
except Exception as ex:
raise Exception(f"Error loading dataset {dataset_name} from {path}") from ex

# Initialize the output dataset and add dimensions
ds = xr.Dataset()
ds.attrs.update(ds_input.attrs)
for dim in ds_input.dims:
ds = ds.assign_coords({dim: ds_input.coords[dim]})

if variables:
logger.info(f"Subsetting dataset {dataset_name}")
ds = subset_dataset(
ds_subset=ds,
ds_input=ds_input,
variables=variables,
chunking=chunking_config,
)

if derived_variables:
logger.info(f"Deriving variables from {dataset_name}")
ds = derive_variables(
ds=ds,
ds_input=ds_input,
derived_variables=derived_variables,
chunking=chunking_config,
)

_check_dataset_attributes(
ds=ds,
expected_attributes=expected_input_attributes,
@@ -191,9 +223,8 @@

# default to making a single chunk for each dimension if chunksize is not specified
# in the config
chunking_config = config.output.chunking or {}
logger.info(f"Chunking dataset with {chunking_config}")
chunks = {d: chunking_config.get(d, int(ds[d].count())) for d in ds.dims}
chunks = {dim: chunking_config.get(dim, int(ds[dim].count())) for dim in ds.dims}
ds = ds.chunk(chunks)

splitting = config.output.splitting
@@ -255,7 +286,7 @@ def create_dataset_zarr(fp_config, fp_zarr: str = None):
The path to the zarr file to write the dataset to. If not provided, the zarr file will be written
to the same directory as the config file with the extension changed to '.zarr'.
"""
config = Config.from_yaml_file(file=fp_config)
config = Config.load_config(file=fp_config)

ds = create_dataset(config=config)
