Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to derive variables and add selected derived forcings #34

Open
wants to merge 70 commits into
base: main
Choose a base branch
from

Conversation

ealerskans
Copy link
Contributor

@ealerskans ealerskans commented Nov 6, 2024

Implements the ability to derive fields from the input datasets, as discussed in Deriving forcings #29.

At the moment, I have only added the possibility to derive the following forcings:

  • top-of-atmosphere radiation
  • hour of day (cyclically encoded)
  • day of year (cyclically encoded)
  • time of year (cyclically encoded)

But additional variables, such as boundary and land-sea masks, should be added. But I think that is for another PR.

- Update the configuration file so that we list the dependencies
and the method used to calculate the derived variable instead
of having a flag to say that the variables should be derived.
This approach is temporary and might be revised soon.
- Add a new class in mllam_data_prep/config.py for derived
variables to distinguish them from non-derived variables.
- Updates to mllam_data_prep/ops/loading.py to distinguish
between derived and non-derived variables.
- Move all functions related to forcing derivations to a new
and renamed function (mllam_data_prep/ops/forcings.py).
@leifdenby leifdenby mentioned this pull request Nov 18, 2024
13 tasks
@leifdenby leifdenby modified the milestones: v0.4.0, v0.6.0 Nov 18, 2024
@ealerskans ealerskans changed the title WIP: Add selected derived forcings Add ability to derive variables and add selected derived forcings Dec 10, 2024
@ealerskans
Copy link
Contributor Author

Consider to generalize this into a load_dataset function and re-used in the load_and_subset_dataset function - maybe not very important.

I think this is a good idea and have actually done that. So I have added a load_datasetfunction and now the dataset is loaded outside both load_and_subset_dataset (which I have renamed to subset_dataset) and derived_variables.

However, I have not used error handling a lot so I just added the Exception raises to load_dataset, subset_dataset and derive_variables. It looks kind of messy now I think... So I would really like some feedback/suggestion for how to do this instead :)


# Get module and function names
function_namespace_list = function_namespace.rsplit(".")
if len(function_namespace_list) > 1:
Copy link

@mafdmi mafdmi Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can actually avoid this whole if-statement if you do
module_name, _, function_name = function_namespace.rpartition(".")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt know of .rpartition() on strings before, cool!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using globals() which would require that the module is already imported I think it might be better to use the importlib package.

I think you would write:

module = importlib.import_module(module_name)
fn = getattr(module, function_name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if loop is because if the module name is empty, i.e. the functions are included in the same module as the rest of the workflow for deriving variables (which they are at the moment), then the use of the importlib package doesn't work. Also, I didn't think it made sense to import the mllam_data_prep.ops.derived_variables since this is the module the call is being made from so that is why I added the globals() part for these 2 cases. For all other cases (the else statement) I am using the importlib package approach already.

However, since we're anyway splitting up the functions from the rest (in one way or another) this approach will need to change as well and then I will just go with the importlib package solution, as I have for the else statement part here.

@mafdmi
Copy link

mafdmi commented Dec 13, 2024

Consider to generalize this into a load_dataset function and re-used in the load_and_subset_dataset function - maybe not very important.

I think this is a good idea and have actually done that. So I have added a load_datasetfunction and now the dataset is loaded outside both load_and_subset_dataset (which I have renamed to subset_dataset) and derived_variables.

However, I have not used error handling a lot so I just added the Exception raises to load_dataset, subset_dataset and derive_variables. It looks kind of messy now I think... So I would really like some feedback/suggestion for how to do this instead :)

Very nice with the load_dataset function. Yes, agree it becomes a bit messy with all the exceptions. I'll wait to see what @leifdenby says about the reason for having them.

Copy link
Member

@leifdenby leifdenby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a few more suggestions. Hope it makes sense. Looking good! 🚀

README.md Outdated Show resolved Hide resolved
mllam_data_prep/config.py Outdated Show resolved Hide resolved
mllam_data_prep/config.py Outdated Show resolved Hide resolved
mllam_data_prep/create_dataset.py Outdated Show resolved Hide resolved
Comment on lines 33 to 39
try:
ds = xr.open_zarr(fp)
except ValueError:
ds = xr.open_dataset(fp)

ds_subset = xr.Dataset()
ds_subset.attrs.update(ds.attrs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that would be better too. So maybe the "load" part of load_and_subset_dataset should be split out into its own function so that we have load_input_dataset which is called once in create_dataset in https://github.com/ealerskans/mllam-data-prep/blob/feature/derive_forcings/mllam_data_prep/create_dataset.py#L137. And then rename the current load_and_subset_dataset function just subset_dataset. When calling load_input_dataset, maybe call what is returned ds_input and that can then be passed into both subset_dataset and derive_variables

Dataset with chunking applied
"""
# Define the memory limit check
memory_limit_check = 1 * 1024**3 # 1 GB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I would put this constant into a global variable, something like CHUNK_MAX_SIZE_WARNING, but I would also put this "chunk size check into its own function somewhere outside of ops, maybe mllam_data_prep.chunking, or maybe mllam_data_prep.ops.chunking makes sense? I think this is something we might want to reuse :)


# Get module and function names
function_namespace_list = function_namespace.rsplit(".")
if len(function_namespace_list) > 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using globals() which would require that the module is already imported I think it might be better to use the importlib package.

I think you would write:

module = importlib.import_module(module_name)
fn = getattr(module, function_name)

return function


def _check_attributes(field, field_attributes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this _check_for_required_attributes and pass the "expected_attributes" as an argument? I think the exception text could be clearer too :) Maybe you could guide the user to what they should to resolve the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this now. Hopefully the exception message is clearer now and tells the user what to do :)

Also, I updated the README with an example and some more text in its own "Derived Variables" section.

return day_of_year_cos, day_of_year_sin


def cyclic_encoding(data, data_max):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is getting quite long. Should we maybe split the implementation of the individual functions into submodules? I also think this should maybe sit in mllam_data_prep.ops instead of in the root of the package.

We could maybe do something like

mllam_data_prep
   .ops
       .derive_variable
           .__init__  <-- you could import the functions that actually carry out the work from the modules mentioned below:
           .dispatcher <-- this would contain the function to derive single variable
           .physical_field <-- this could contain the toa field for example
           .time_components <-- the time components could go here

Does this makes sense? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is getting long and it would be nice to split it up. The main reason for not adding it to mllam_data_prep.ops is that I don't know what "ops" stands for and what can/should be included here.

The idea and structure makes sense but the implementation is not very clear, probably because I am not very familiar with object oriented programming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is mostly in connection with the other suggestion that we should loop over the derived variables in a call external to derive_variable (in create_dataset() I am assuming then, or somewhere else?) and what should be called from there.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also been a bit puzzeled about what the "ops" stands for. I think I have interpreted is as "operations", but not sure if this is correct. Do we document the meaning somewhere? Otherwise, we could maybe consider to rename it to make it more self-explainable (for another PR then).

mllam_data_prep/ops/loading.py Outdated Show resolved Hide resolved
…doesn't have all dimensions. This way we don't need to broadcast these variables explicitly to all dimensions.
dataset

- Output dataset is created in 'create_dataset' instead of in the
  'subset_dataset' and 'derive_variables' functions.
- Rename dataset variables to make it clearer what they are and also
  make them more consistent between 'subset_dataset' and
  'derive_variables'.
- Add function for aligning the derived variables to the correct
  output dimensions.
- Move the 'derived_variables' from their own dataset in the example
  config file to the 'danra_surface' dataset, as it is now possible to
  combine them.
…hat either 'variables' or 'derived_variables' are included and that if both are included, they don't contain the same variable names
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants