Add support for writing more composite statistics (e.g. grid-point based mean of time-step differences) #42

mafdmi · 2024-11-21T15:01:55Z

Reworked the way we define statistical methods and the way we invoke them from the *.yaml file.

Added new statistical methods to calculate:

- mean and std across time
- mean and std across time and grid_index of the difference in time (diff step is 1)
- mean and std across time of the difference in time (diff step is 1)
- mean and std across grid_index and diurnal cycles of the difference in time (diff step is 1)
- mean and std across diurnal cycles of the difference in time (diff step is 1)

Fixed typos in README. Removed duplicate and obsolete attributes section in docstrings.

The new way to specify which statistics to compute in the *.yaml file looks like this:

compute_statistics:
  - mean
  - mean_per_gridpoint
  - std
  - std_per_gridpoint
  ...

i.e. you simply specify the statistical methods you want to calculate. The statistical methods are documented in the docstrings of the respective functions.
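As a rough illustration of how a list of names like the one above could be mapped to functions, here is a minimal sketch. It is not the code in this PR: the registry STATISTIC_FUNCTIONS, the helper run_statistics, and the bodies of the two stand-in functions are all hypothetical and only show the lookup-by-name idea.

```python
from typing import Callable, Dict, List

import numpy as np
import xarray as xr


def mean(ds: xr.Dataset) -> xr.Dataset:
    """Illustrative stand-in (assumed behaviour): mean over every dimension."""
    return ds.mean()


def mean_per_gridpoint(ds: xr.Dataset) -> xr.Dataset:
    """Illustrative stand-in (assumed behaviour): mean over time only."""
    return ds.mean(dim="time")


# Hypothetical registry: config name -> statistical function.
STATISTIC_FUNCTIONS: Dict[str, Callable[[xr.Dataset], xr.Dataset]] = {
    "mean": mean,
    "mean_per_gridpoint": mean_per_gridpoint,
}


def run_statistics(ds: xr.Dataset, method_names: List[str]) -> Dict[str, xr.Dataset]:
    """Call every requested statistical function, failing early on unknown names."""
    results = {}
    for name in method_names:
        if name not in STATISTIC_FUNCTIONS:
            raise ValueError(f"Unknown statistical method: {name!r}")
        results[name] = STATISTIC_FUNCTIONS[name](ds)
    return results


# Tiny dataset with dims (time, grid_index); the names would come from the yaml list.
ds = xr.Dataset({"t": (("time", "grid_index"), np.array([[1.0, 2.0], [3.0, 4.0]]))})
stats = run_statistics(ds, ["mean", "mean_per_gridpoint"])
```

Failing on an unknown name, rather than skipping it silently, makes a typo in the config visible immediately.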

Every statistical function is defined in the statistics.py file. Each function takes an xarray dataset, calls the general compute_pipeline_statistic function, and returns a dataset with the resulting statistical variables.

Depending on the parameters supplied to compute_pipeline_statistic, different operations are applied in a specific order:

1. Difference along the diff_dim dimension
2. Grouping by the groupby_dim index
3. The xarray operation stats_op

Take the new diurnal_diff_mean_per_gridpoint function as an example:

def diurnal_diff_mean_per_gridpoint(ds: xr.Dataset):
    """Compute the diurnal mean across time of the difference in time for all
    variables.

    The data is grouped by time.hour so that the operator is applied across
    diurnal cycles.
    The difference in time is computed over 1 time step.

    Args:
        ds (xr.Dataset): Input dataset

    Returns:
        xr.Dataset: Dataset with the computed statistical variables
    """
    return compute_pipeline_statistic(
        ds,
        groupby="time.hour",
        stats_op="mean",
        stats_dims="time",
        diff_dim="time",
        n_diff_steps=1,
    )

The function calls compute_pipeline_statistic, which first computes the difference in time (with step size 1), then groups the data by "time.hour", and finally computes the mean over the "time" dimension.
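The order of operations described above (difference, then grouping, then reduction) can be illustrated with a minimal sketch. This is not the implementation from this PR: pipeline_statistic_sketch is a hypothetical name, its parameters simply mirror those in the example, and computing the difference as a lag of n_diff_steps via shift is an assumption about what "diff step" means.

```python
from typing import Optional, Sequence, Union

import numpy as np
import pandas as pd
import xarray as xr


def pipeline_statistic_sketch(
    ds: xr.Dataset,
    stats_op: str,
    stats_dims: Union[str, Sequence[str]],
    diff_dim: Optional[str] = None,
    n_diff_steps: int = 1,
    groupby: Optional[str] = None,
) -> xr.Dataset:
    """Hypothetical stand-in: diff -> groupby -> reduce, in that order."""
    # Step 1: lagged difference x[i] - x[i - n] along diff_dim (assumed meaning
    # of "diff step"). The first n entries have no predecessor and are dropped.
    if diff_dim is not None:
        lagged = ds.shift({diff_dim: n_diff_steps})
        ds = (ds - lagged).isel({diff_dim: slice(n_diff_steps, None)})

    # Step 2: optional grouping, e.g. "time.hour" for diurnal cycles.
    target = ds.groupby(groupby) if groupby is not None else ds

    # Step 3: look up the reduction by name ("mean", "std", ...) and apply it.
    return getattr(target, stats_op)(dim=stats_dims)


# Example: 3-hourly data that grows linearly in time at two grid points.
time = pd.date_range("2024-01-01", periods=16, freq="3h")
values = np.outer(np.arange(16.0), [1.0, 2.0])  # shape (time, grid_index)
ds = xr.Dataset({"t": (("time", "grid_index"), values)}, coords={"time": time})

# Diurnal mean of the 1-step time difference, kept per grid point.
diurnal = pipeline_statistic_sketch(
    ds, stats_op="mean", stats_dims="time", diff_dim="time", groupby="time.hour"
)
```

Because the example data increase by a constant amount per time step, every hour-of-day group ends up with the same mean difference per grid point, which makes the result easy to check by hand.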

To make it possible to understand how a statistical variable was computed, a "cell_methods" attribute is added to every statistical variable. The attribute is a string that lists the operations applied to specific dimensions of the data, in the order they were applied. It is semi-compliant with the CF conventions (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis); I'm not sure whether e.g. the "groupby" and the "(interval: 3 hours)" parts are compliant.
E.g. the diurnal_diff_std_per_gridpoint function adds the attribute 'cell_methods' = 'time: diff (interval: 3 hours) time.hour: groupby time: std', which translates to:

1. Compute the difference in time with a step size of 1 (which corresponds to 3 hours in this dataset)
2. Group the data by the hour of the day
3. Compute the standard deviation over the time dimension.
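To show how such a string can be assembled from the pipeline parameters, here is a small sketch. The helper build_cell_methods and its parameter names are hypothetical and not taken from this PR; it only reproduces the format of the example string above.

```python
from typing import List, Optional, Sequence


def build_cell_methods(
    stats_op: str,
    stats_dims: Sequence[str],
    diff_dim: Optional[str] = None,
    diff_interval: Optional[str] = None,
    groupby: Optional[str] = None,
) -> str:
    """Hypothetical helper: list the operations in the order they are applied."""
    parts: List[str] = []
    if diff_dim is not None:
        entry = f"{diff_dim}: diff"
        if diff_interval is not None:
            # e.g. "3 hours" when one time step corresponds to 3 hours
            entry += f" (interval: {diff_interval})"
        parts.append(entry)
    if groupby is not None:
        parts.append(f"{groupby}: groupby")
    # Several dimensions reduced together are written as "dim1: dim2: op".
    dims = " ".join(f"{dim}:" for dim in stats_dims)
    parts.append(f"{dims} {stats_op}")
    return " ".join(parts)


cell_methods = build_cell_methods(
    "std", ["time"], diff_dim="time", diff_interval="3 hours", groupby="time.hour"
)
print(cell_methods)
# time: diff (interval: 3 hours) time.hour: groupby time: std
```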

Each statistical method now has its own class with a calc_stats method. In the config yaml file, one specifies the names of the classes to use when computing statistics of the dataset. If the class is not defined in the statistics.py […] Added classes for calculating diurnal diff mean, diurnal diff std, and diff time mean (where only the time is averaged, not the grid_index).

Reworked the compute statistics config and operators to be able to compute different versions of the same statistical operator, e.g. applying the mean over different dimensions.

mafdmi · 2024-11-21T15:02:41Z

Maybe var_name is not the best name, since it is really a string that is appended to the full variable name under which the statistic is saved in the dataset. What do you think?

SimonKamuk · 2024-11-25T13:11:39Z

Great functionality to add! I agree that var_name is not the most descriptive name. Maybe something more along the lines of statistic_name? Or it could be inferred from the supplied arguments? In any case, there still seem to be some tests failing.

I guess this will be a breaking change, since existing yaml files will need to be modified. What about the datasets output by mllam-data-prep, will they change? As long as var_name (or whatever you end up calling it) matches, I guess the output stays the same?

observingClouds · 2024-11-26T09:29:50Z

One might also check out https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis and add the additional cell_methods attribute, so one knows over which dimensions the statistics have been taken.

observingClouds · 2024-11-26T09:41:23Z

> var_name is not the best, since it is actually more a string that is appended to the full variable name

The question for me would be how we can read these statistics values in neural-lam. A string appended to the variable names can be tricky to disentangle again. A delimiter like "." might help. In that case I would suggest varname_suffix or statistics_suffix instead of var_name.

If the output dimensions of the statistics are the same across datasets, I could also imagine statistic variables that have the actual variable names as a dimension:

float mean(time, varname):
    ....
float RMSE(time, varname):
    ....
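One way such a layout could be produced with xarray is sketched below. This only illustrates the idea and is not part of this PR: stack_statistics is a hypothetical helper, the example variables t2m and u10 are made up, and std is used here in place of RMSE.

```python
import numpy as np
import xarray as xr


def stack_statistics(ds: xr.Dataset, reduce_dim: str) -> xr.Dataset:
    """Hypothetical helper: one variable per statistic, with a 'varname' dimension."""
    # Stack all data variables along a new "varname" dimension.
    stacked = ds.to_array(dim="varname")
    stats = xr.Dataset(
        {
            "mean": stacked.mean(dim=reduce_dim),
            "std": stacked.std(dim=reduce_dim),
        }
    )
    # Put "varname" last so that each statistic has dims (..., varname).
    return stats.transpose(..., "varname")


# Two made-up variables on dims (time, grid_index).
ds = xr.Dataset(
    {
        "t2m": (("time", "grid_index"), np.array([[1.0, 3.0], [5.0, 7.0]])),
        "u10": (("time", "grid_index"), np.array([[0.0, 2.0], [4.0, 6.0]])),
    }
)
stats = stack_statistics(ds, reduce_dim="grid_index")
```

With this layout a reader selects a statistic for one variable with stats["mean"].sel(varname="t2m") instead of parsing a suffix out of a variable name.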

mafdmi · 2024-12-03T11:00:46Z

> One might also check out https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis and add the additional cell_methods attribute, so one knows over which dimensions the statistics have been taken.

Good suggestion! Would be good to use standards for this.

observingClouds · 2024-12-06T11:24:34Z

@mafdmi here are even more useful resources on how to create the variable names: https://cfconventions.org/Data/cf-standard-names/docs/guidelines.html#id2798144

- Made statistical operators into functions instead of classes.
- Added extensive docstrings for the individual statistical operators.
- Added a "cell_methods" attribute (semi-compliant with the CF conventions) to each statistical variable, to make it clear how the variable was calculated.

Martin Frølund and others added 6 commits November 20, 2024 08:39

Fixed typos in README. Removed duplicate and obsolete attributes section in docstrings

8979225

Formatting

e257056

Allow for computing multiple versions of the same statistical operator

36cb5d9


Ignore .vscode dir

fa128ee

Changed name to var_name

7abcfbb

mafdmi marked this pull request as ready for review November 21, 2024 15:03

mafdmi marked this pull request as draft November 21, 2024 15:03

leifdenby assigned mafdmi Dec 10, 2024

leifdenby changed the title ~~Feature/add more statistics~~ Add support for writing more composite statistics (e.g. grid-point based mean of time-step differences) Dec 10, 2024

mafdmi added 3 commits December 10, 2024 14:07

Fixed tests

668adfb

Merge remote-tracking branch 'upstream/main' into feature/add-more-statistics

68ae3be
