Add ability to derive variables and add selected derived forcings #34

ealerskans · 2024-11-06T12:20:52Z

Implements the ability to derive fields from the input datasets, as discussed in Deriving forcings #29.

At the moment, I have only added the possibility to derive the following forcings:

top-of-atmosphere radiation
hour of day (cyclically encoded)
day of year (cyclically encoded)
~~time of year (cyclically encoded)~~

Additional variables, such as boundary and land-sea masks, should be added as well, but I think that is for another PR.
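For readers unfamiliar with cyclic encoding, here is a minimal sketch of the idea behind the hour-of-day and day-of-year forcings. The body of the PR's `cyclic_encoding(data, data_max)` is not shown in this conversation, so the formula below (cosine and sine of `2π · value / value_max`) is the standard technique and an assumption about the implementation, not a copy of it. It uses plain floats rather than data-arrays.

```python
import math


def cyclic_encoding(value, value_max):
    """Map a periodic value onto the unit circle.

    Returns (cos, sin), so that value == 0 and value == value_max
    are encoded as the same point.
    """
    angle = 2.0 * math.pi * value / value_max
    return math.cos(angle), math.sin(angle)


# Hour of day: 23:00 and 00:00 end up next to each other on the circle,
# which a raw 0..23 integer would not express.
hour_cos, hour_sin = cyclic_encoding(23, 24)
midnight_cos, midnight_sin = cyclic_encoding(0, 24)
```

Two components are needed because a single sine (or cosine) value maps two different hours to the same number.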

- Update the configuration file so that we list the dependencies and the method used to calculate the derived variable instead of having a flag to say that the variables should be derived. This approach is temporary and might be revised soon.
- Add a new class in mllam_data_prep/config.py for derived variables to distinguish them from non-derived variables.
- Update mllam_data_prep/ops/loading.py to distinguish between derived and non-derived variables.
- Move all functions related to forcing derivations to a new and renamed module (mllam_data_prep/ops/forcings.py).


ealerskans · 2024-12-13T12:22:49Z

> Consider to generalize this into a load_dataset function and re-used in the load_and_subset_dataset function - maybe not very important.

I think this is a good idea and have actually done that. I have added a load_dataset function, and the dataset is now loaded outside both load_and_subset_dataset (which I have renamed to subset_dataset) and derived_variables.

However, I have not used error handling a lot, so I just added the Exception raises to load_dataset, subset_dataset and derive_variables. It looks kind of messy now, I think... So I would really like some feedback/suggestions on how to do this instead :)

mafdmi · 2024-12-13T13:30:23Z

mllam_data_prep/derived_variables.py

+
+    # Get module and function names
+    function_namespace_list = function_namespace.rsplit(".")
+    if len(function_namespace_list) > 1:


You can actually avoid this whole if-statement if you do
module_name, _, function_name = function_namespace.rpartition(".")

I didn't know of .rpartition() on strings before, cool!

Rather than using globals() which would require that the module is already imported I think it might be better to use the importlib package.

I think you would write:

module = importlib.import_module(module_name)
fn = getattr(module, function_name)

The if-statement is there because if the module name is empty, i.e. the functions are included in the same module as the rest of the workflow for deriving variables (which they are at the moment), then the importlib approach doesn't work. Also, I didn't think it made sense to import mllam_data_prep.ops.derived_variables, since this is the module the call is being made from, so that is why I added the globals() part for these 2 cases. For all other cases (the else statement) I am already using the importlib approach.

However, since we're anyway splitting up the functions from the rest (in one way or another) this approach will need to change as well and then I will just go with the importlib package solution, as I have for the else statement part here.
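To make the combined suggestion concrete, here is a small sketch that puts `rpartition` and `importlib` together. The helper name `get_function` and the choice to raise an error when no module is given are assumptions for illustration; the PR's own `_get_derived_variable_function` may handle the empty-module case differently (for example via `globals()`, as described above).

```python
import importlib


def get_function(function_namespace):
    """Resolve a dotted path such as 'math.sqrt' to a callable."""
    # rpartition splits on the LAST dot and always returns three parts,
    # so no length check on a list is needed.
    module_name, _, function_name = function_namespace.rpartition(".")
    if not module_name:
        # Without a dot the module name is empty and cannot be imported,
        # so ask for the full namespace instead (assumed behaviour).
        raise ValueError(
            "Expected a full namespace like 'package.module.function', "
            f"got '{function_namespace}'"
        )
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


sqrt = get_function("math.sqrt")
```

Requiring the full namespace keeps the lookup independent of which module happens to make the call.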

mafdmi · 2024-12-13T13:34:29Z

> Consider to generalize this into a load_dataset function and re-used in the load_and_subset_dataset function - maybe not very important.

> I think this is a good idea and have actually done that. So I have added a load_dataset function and now the dataset is loaded outside both load_and_subset_dataset (which I have renamed to subset_dataset) and derived_variables.

> However, I have not used error handling a lot so I just added the Exception raises to load_dataset, subset_dataset and derive_variables. It looks kind of messy now I think... So I would really like some feedback/suggestion for how to do this instead :)

Very nice with the load_dataset function. Yes, agree it becomes a bit messy with all the exceptions. I'll wait to see what @leifdenby says about the reason for having them.

leifdenby

I've made a few more suggestions. Hope it makes sense. Looking good! 🚀

README.md

mllam_data_prep/config.py

mllam_data_prep/create_dataset.py

leifdenby · 2024-12-12T21:17:54Z

mllam_data_prep/derived_variables.py

+    try:
+        ds = xr.open_zarr(fp)
+    except ValueError:
+        ds = xr.open_dataset(fp)
+
+    ds_subset = xr.Dataset()
+    ds_subset.attrs.update(ds.attrs)


Yes, I think that would be better too. So maybe the "load" part of load_and_subset_dataset should be split out into its own function, load_input_dataset, which is called once in create_dataset in https://github.com/ealerskans/mllam-data-prep/blob/feature/derive_forcings/mllam_data_prep/create_dataset.py#L137. Then rename the current load_and_subset_dataset function to just subset_dataset. When calling load_input_dataset, maybe call what is returned ds_input, which can then be passed into both subset_dataset and derive_variables.

leifdenby · 2024-12-13T14:06:50Z

mllam_data_prep/derived_variables.py

+        Dataset with chunking applied
+    """
+    # Define the memory limit check
+    memory_limit_check = 1 * 1024**3  # 1 GB


Yes. I would put this constant into a global variable, something like CHUNK_MAX_SIZE_WARNING, but I would also put this chunk size check into its own function somewhere outside of ops, maybe mllam_data_prep.chunking, or maybe mllam_data_prep.ops.chunking makes sense? I think this is something we might want to reuse :)
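A rough sketch of what such a reusable check could look like. Only the constant name `CHUNK_MAX_SIZE_WARNING` and the 1 GB value come from this thread; the function name `check_chunk_size` and its arguments (a chunk shape and a per-element byte size instead of a dataset) are hypothetical simplifications.

```python
import math
import warnings

CHUNK_MAX_SIZE_WARNING = 1 * 1024**3  # 1 GB, the limit used in the diff above


def check_chunk_size(chunk_shape, itemsize):
    """Warn if a single chunk is larger than CHUNK_MAX_SIZE_WARNING.

    chunk_shape: number of elements per chunk along each dimension
    itemsize: bytes per element (8 for float64)
    Returns the chunk size in bytes.
    """
    chunk_bytes = math.prod(chunk_shape) * itemsize
    if chunk_bytes > CHUNK_MAX_SIZE_WARNING:
        warnings.warn(
            f"Chunk of shape {tuple(chunk_shape)} is "
            f"{chunk_bytes / 1024**3:.2f} GB, above the "
            f"{CHUNK_MAX_SIZE_WARNING / 1024**3:.0f} GB limit"
        )
    return chunk_bytes
```

Keeping the limit in a module-level constant means other callers can reuse or override it without touching the check itself.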

leifdenby · 2024-12-13T14:11:02Z

mllam_data_prep/derived_variables.py

+
+    # Get module and function names
+    function_namespace_list = function_namespace.rsplit(".")
+    if len(function_namespace_list) > 1:


Rather than using globals() which would require that the module is already imported I think it might be better to use the importlib package.

I think you would write:

module = importlib.import_module(module_name)
fn = getattr(module, function_name)

leifdenby · 2024-12-13T14:12:22Z

mllam_data_prep/derived_variables.py

+    return function
+
+
+def _check_attributes(field, field_attributes):


Maybe call this _check_for_required_attributes and pass the "expected_attributes" as an argument? I think the exception text could be clearer too :) Maybe you could guide the user on what they should do to resolve the issue?

I have updated this now. Hopefully the exception message is clearer now and tells the user what to do :)

Also, I updated the README with an example and some more text in its own "Derived Variables" section.

leifdenby · 2024-12-13T14:17:45Z

mllam_data_prep/derived_variables.py

+    return day_of_year_cos, day_of_year_sin
+
+
+def cyclic_encoding(data, data_max):


This file is getting quite long. Should we maybe split the implementation of the individual functions into submodules? I also think this should maybe sit in mllam_data_prep.ops instead of in the root of the package.

We could maybe do something like

mllam_data_prep
    .ops
        .derive_variable
            .__init__ <-- you could import the functions that actually carry out the work from the modules mentioned below:
            .dispatcher <-- this would contain the function to derive single variable
            .physical_field <-- this could contain the toa field for example
            .time_components <-- the time components could go here

Does this make sense? :)

I agree that it is getting long and it would be nice to split it up. The main reason for not adding it to mllam_data_prep.ops is that I don't know what "ops" stands for and what can/should be included here.

The idea and structure makes sense but the implementation is not very clear, probably because I am not very familiar with object oriented programming.

I guess it is mostly in connection with the other suggestion that we should loop over the derived variables in a call external to derive_variable (in create_dataset() I am assuming then, or somewhere else?) and what should be called from there.

I have also been a bit puzzled about what "ops" stands for. I think I have interpreted it as "operations", but I am not sure if this is correct. Do we document the meaning somewhere? Otherwise, we could maybe consider renaming it to make it more self-explanatory (for another PR then).

mllam_data_prep/ops/loading.py

dataset
- Output dataset is created in 'create_dataset' instead of in the 'subset_dataset' and 'derive_variables' functions.
- Rename dataset variables to make it clearer what they are and also make them more consistent between 'subset_dataset' and 'derive_variables'.
- Add function for aligning the derived variables to the correct output dimensions.
- Move the 'derived_variables' from their own dataset in the example config file to the 'danra_surface' dataset, as it is now possible to combine them.


ealerskans added 12 commits October 28, 2024 14:34

First attempt at adding derived forcings

981d676

Add derivation of cyclic encoded hour of day and day of year

f37161c

Add derivation of cyclic encoded time of year

71afd3a

Update and add docstrings

abb626b

Remove time_of_year

8b1f18e

Provide the full namespace of the function

7854013

Rename the module with derived variables

7fa90bf

Rename the function used for deriving variables

48c9e3e

Redefine the config file for derived variables and how they are calculated

8de9404

Remove derived variables from 'load_and_subset_dataset'

ffc030c

Add try/except for derived variables when loading the dataset

692cdd3

leifdenby mentioned this pull request Nov 18, 2024

Roadmap #5

Open

13 tasks

leifdenby modified the milestones: v0.4.0, v0.6.0 Nov 18, 2024

ealerskans added 13 commits December 5, 2024 08:54

Chunk the input data with the defined output chunks

c0cd875

Update toa_radiation function name

55224f3

Correct kwargs usage, add back dropped coordinates and return correct dataset

678ea52

Prepare for hour_of_day and day_of_year

9d2db07

Add optional 'attributes' to the config of 'derived_variables' and check the attributes of the derived variable data-array

26455bc

Add dummy function for getting lat,lon (preparation for mllam#33)

fbb6065

Add function for chunking data and checking the chunk size

3a12f48

Add back coordinates on the subset instead of for each derived variable individually

3ace219

Add 'hour_of_day' to example config

a6b61b0

Merge branch 'main' into feature/derive_forcings

1814297

Rename derived variables dataset section in the example config

9dcace6

Remove f-string from 'name_format'

aba6757

Update README

143edb6

ealerskans changed the title ~~WIP: Add selected derived forcings~~ Add ability to derive variables and add selected derived forcings Dec 10, 2024

ealerskans added 5 commits December 13, 2024 09:24

Update '_get_derived_variable_function'

2856c6b

Simplify checks of the derived fields

98673ee

Issue warning saying that we assume coordinates are named 'lat' and 'lon'

8940e82

Update README to make it clear that 'attributes' is associated with 'derived_variables'

e12e328

Indicate that 'variables' and 'derived_variables' are mutually exclusive

ecdea30

Update docstring of 'InputDataset' class

e3c0f22

mafdmi reviewed Dec 13, 2024


ealerskans added 2 commits December 13, 2024 14:10

Correct types in '_check_attributes' docstring

e907a6d

Use 'rpartition' to get 'module_name' and 'function_name'

bb9be13

leifdenby reviewed Dec 13, 2024


ealerskans added 18 commits December 13, 2024 14:23

Add some initial tests for 'derived_variables'

49de0b3

Update docstrings and rename 'DerivedVariable.attributes' to 'DerivedVariable.attrs'

b268f01

Do not add 'attributes' to docstring

dbd5bfd

Remove unnecessary exception handling

474a83d

Move 'subset_dataset' to 'ops.subsetting'

1da66e2

Move 'derived_variables' to 'ops'

dc7dc5e

Move chunk size check to 'chunking' module

c9e96af

Add module docstring

47b8411

Update tests

5ae772f

Add global REQUIRED_FIELD_ATTRIBUTES var and updated check for required attributes

2c0bdf8

Update long name for toa_radiation

f1ce6d1

Update README

58d8af6

Return dropped coordinates to the data-arrays instead

f87b954

Adds dims to the dataset to make it work with derived variables that doesn't have all dimensions. This way we don't need to broadcast these variables explicitly to all dimensions.

80cf058

Update README

f61a3b6

Add 'load_config' function, which wraps 'from_yaml_file' and checks that either 'variables' or 'derived_variables' are included and that if both are included, they don't contain the same variable names

554f869

Update README

085aae3
