Add support for writing more composite statistics (e.g. grid-point based mean of time-step differences) #42

Draft
wants to merge 9 commits into
base: main

@mafdmi mafdmi commented Nov 21, 2024

Reworked the way we define statistical methods and the way we invoke them from the *.yaml file.

Added new statistical methods to calculate:

  • mean and std across time
  • mean and std across time and grid_index of difference in time (diff step is 1)
  • mean and std across time of difference in time (diff step is 1)
  • mean and std across grid_index and diurnal cycles of difference in time (diff step is 1)
  • mean and std across diurnal cycles of difference in time (diff step is 1)

Fixed typos in README. Removed duplicate and obsolete attributes section in docstrings.

The new way to specify which statistics to compute in the *.yaml file looks like this:

compute_statistics:
  - mean
  - mean_per_gridpoint
  - std
  - std_per_gridpoint
  ...

i.e. you simply specify the statistical methods you want to calculate. The statistical methods are documented in the docstrings of the respective functions.
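
As a rough illustration of how a list of names from the config could be mapped to functions, here is a minimal sketch. It is not the code in this PR: the registry `STATISTICS`, the helper `compute_requested`, and the plain-list stand-ins for `mean` and `std` are all hypothetical, and the real functions operate on xarray datasets rather than lists.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for functions in statistics.py. The real ones take
# and return an xarray Dataset; plain lists keep this sketch dependency-free.
def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def std(values: List[float]) -> float:
    m = mean(values)
    # Population standard deviation (divides by N).
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5


# Hypothetical registry: config name -> function.
STATISTICS: Dict[str, Callable[[List[float]], float]] = {"mean": mean, "std": std}


def compute_requested(names: List[str], values: List[float]) -> Dict[str, float]:
    """Look up each requested statistic by name and apply it."""
    results = {}
    for name in names:
        if name not in STATISTICS:
            raise ValueError(
                f"Unknown statistic '{name}'. Available: {sorted(STATISTICS)}"
            )
        results[name] = STATISTICS[name](values)
    return results


# The list a YAML loader would return for the `compute_statistics` key.
requested = ["mean", "std"]
print(compute_requested(requested, [1.0, 2.0, 3.0, 4.0]))
```

Failing early on an unknown name means a typo in the config surfaces as a clear error rather than a silently missing variable.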

Every statistical function is defined in the statistics.py file. Each function takes an xarray dataset, calls a general compute_statistic function, and returns a dataset containing the statistical variables.

Depending on the parameters supplied to the compute_statistic function, different operations are applied in a specific order:

  1. Difference along the diff_dim dimension
  2. Grouping by the groupby_dim index
  3. The xarray operation stats_op
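
The order of the three steps can be sketched on a single time series without xarray. This is only an illustration under stated assumptions: `diff_groupby_mean` is a hypothetical helper, the difference step is fixed at 1, and each difference is labelled with the later timestamp (which I believe matches xarray's default `label="upper"` for `diff`).

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean
from typing import Dict, List


def diff_groupby_mean(times: List[datetime], values: List[float]) -> Dict[int, float]:
    """Apply diff -> groupby(hour) -> mean, in that order."""
    # 1. Difference along time with a step of 1; the result has one element
    #    fewer than the input and is labelled with the later timestamp.
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    diff_times = times[1:]

    # 2. Group the differences by hour of day.
    groups = defaultdict(list)
    for t, d in zip(diff_times, diffs):
        groups[t.hour].append(d)

    # 3. Reduce each group with the statistical operation (here: mean).
    return {hour: fmean(group) for hour, group in sorted(groups.items())}


# Two days of made-up 3-hourly data (plus the final midnight).
start = datetime(2024, 1, 1, 0)
times = [start + timedelta(hours=3 * i) for i in range(17)]
values = [float(i * i) for i in range(17)]

result = diff_groupby_mean(times, values)
print(result)
```

Swapping steps 1 and 2 would difference within each hour-of-day group instead of between consecutive time steps, which is a different statistic, so the order matters.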

Take the new diurnal_diff_mean_per_gridpoint function as an example:

def diurnal_diff_mean_per_gridpoint(ds: xr.Dataset):
    """Compute the diurnal mean across time of the difference in time for all
    variables.

    The data is grouped by time.hour so that the operator is applied across
    diurnal cycles.
    The difference in time is computed over 1 time step.

    Args:
        ds (xr.Dataset): Input dataset

    Returns:
        xr.Dataset: Dataset with the computed statistical variables
    """
    return compute_pipeline_statistic(
        ds,
        groupby="time.hour",
        stats_op="mean",
        stats_dims="time",
        diff_dim="time",
        n_diff_steps=1,
    )

The function calls compute_pipeline_statistic, which first computes the difference in time (with step size 1), then groups the data by "time.hour", and finally computes the mean over the "time" dimension.

To make it clear how a statistical variable was computed, a "cell_methods" attribute is added to every statistical variable. The attribute is a string that lists the operations applied to the specific dimensions of the data, in the order they were applied. It is semi-compliant with the CF conventions (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis); I'm not sure whether e.g. the "groupby" and the "(interval: 3 hours)" parts are compliant.
E.g. the diurnal_diff_std_per_gridpoint function adds the attribute 'cell_methods' = 'time: diff (interval: 3 hours) time.hour: groupby time: std', which translates to:

  1. Compute the difference in time with a step size of 1 (which corresponds to 3 hours in this dataset)
  2. Group the data by the hour of the day
  3. Compute the standard deviation over the time dimension.
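
For illustration, a string of this shape could be assembled as follows. This is a sketch rather than the PR's implementation: `build_cell_methods` and its parameter names are hypothetical, and only the output format is taken from the example above.

```python
from typing import Optional


def build_cell_methods(
    stats_op: str,
    stats_dim: str,
    diff_dim: Optional[str] = None,
    interval: Optional[str] = None,
    groupby: Optional[str] = None,
) -> str:
    """Join "<dim>: <operation>" entries in the order they were applied."""
    parts = []
    if diff_dim is not None:
        part = f"{diff_dim}: diff"
        if interval is not None:
            part += f" (interval: {interval})"
        parts.append(part)
    if groupby is not None:
        parts.append(f"{groupby}: groupby")
    # The reducing operation is always applied last.
    parts.append(f"{stats_dim}: {stats_op}")
    return " ".join(parts)


print(
    build_cell_methods(
        "std", "time", diff_dim="time", interval="3 hours", groupby="time.hour"
    )
)
# time: diff (interval: 3 hours) time.hour: groupby time: std
```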

Martin Frølund and others added 6 commits November 20, 2024 08:39
Each statistical method now has its own class with a calc_stats method. In the config yaml file, one specifies the names of the classes to use when computing statistics of the dataset. If the class is not defined in the statistics.py

Added classes for calculating Diurnal diff mean, and Diurnal diff std, and Diff time mean (where only the time is averaged - not the grid_index)
Reworked the compute statistics config and operators to be able to compute different versions of the same statistical operator, e.g. applying the mean over different dimensions.
@mafdmi
Author

mafdmi commented Nov 21, 2024

Maybe var_name is not the best name, since it is really a string that is appended to the full variable name under which the statistic is saved in the dataset. What do you think?

@mafdmi mafdmi marked this pull request as ready for review November 21, 2024 15:03
@mafdmi mafdmi marked this pull request as draft November 21, 2024 15:03
@SimonKamuk

Great functionality to add! I agree that var_name is not the most descriptive, maybe something more along the lines of statistic_name? Or it could be inferred from the supplied arguments? In any case, there still seem to be some tests failing.

I guess this will be a breaking change, since existing yaml files will need to be modified. What about the datasets output by mllam-data-prep, will they change? As long as var_name (or whatever you end up calling it) matches, I guess the output stays the same?

@observingClouds
Contributor

One might also check out https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis and add the additional cell_methods attribute, so one knows over which dimensions the statistics have been taken.

@observingClouds
Contributor

var_name is not the best, since it is actually more a string that is appended to the full variable name

The question for me is how we can read these statistics values in neural-lam. Variable names with an appended string can be tricky to disentangle again. A delimiter like . might help. In that case I would suggest varname_suffix or statistics_suffix instead of var_name.
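
A minimal sketch of what the delimiter idea could look like in practice. The helpers `add_suffix` and `split_suffix` and the example names are hypothetical, and the approach assumes the suffix itself never contains the delimiter:

```python
from typing import Tuple

DELIMITER = "."


def add_suffix(varname: str, statistics_suffix: str) -> str:
    """Build the name under which a statistic is stored."""
    return f"{varname}{DELIMITER}{statistics_suffix}"


def split_suffix(full_name: str) -> Tuple[str, str]:
    """Recover (varname, statistics_suffix) by splitting on the last delimiter."""
    varname, sep, suffix = full_name.rpartition(DELIMITER)
    if not sep:
        raise ValueError(f"'{full_name}' contains no '{DELIMITER}' delimiter")
    return varname, suffix


full_name = add_suffix("t2m", "diff_mean")
print(full_name)
print(split_suffix(full_name))
```

Splitting on the last delimiter keeps this working even if a variable name contains the delimiter itself.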

If the output dimensions of the statistics are the same across datasets, I could also imagine statistic variables that have the actual variable names as a dimension:

float mean(time, varname):
    ....
float RMSE(time, varname):
    ....

@mafdmi
Author

mafdmi commented Dec 3, 2024

One might also check out https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis and add the additional cell_methods attribute, so one knows over which dimensions the statistics have been taken.

Good suggestion! Would be good to use standards for this.

@observingClouds
Contributor

@mafdmi here are even more useful resources on how to create the variable names: https://cfconventions.org/Data/cf-standard-names/docs/guidelines.html#id2798144

@leifdenby leifdenby changed the title Feature/add more statistics Add support for writing more composite statistics (e.g. grid-point based mean of time-step differences) Dec 10, 2024
- Made statistical operators into functions instead of classes.
- Added extensive docstrings of the individual statistical operators.
- Added "cell_methods" attribute (semi-compliant with cfconventions) to each statistical variable, to make it clear how the variable was calculated.
3 participants