Add support for writing more composite statistics (e.g. grid-point based mean of time-step differences) #42
…ion in docstrings
Each statistical method now has its own class with a calc_stats method. In the config yaml file, one specifies the names of the classes to use when computing statistics of the dataset. If the class is not defined in the statistics.py
Added classes for calculating diurnal diff mean, diurnal diff std, and diff time mean (where only the time is averaged, not the grid_index).
Reworked the compute statistics config and operators to be able to compute different versions of the same statistical operator, e.g. applying the mean over different dimensions.
Maybe |
Great functionality to add! I agree that var_name is not the most descriptive, maybe something more along the lines of statistic_name? Or it could be inferred from the supplied arguments? In any case, there still seem to be some tests failing. I guess this will be a breaking change, since existing yaml files will need to be modified. What about the datasets output by mllam-data-prep, will they be changed? As long as var_name (or whatever you end up calling it) matches, then I guess the output stays the same? |
One might also check out https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis and add the additional |
The question for me would be how we can read these statistics values in neural-lam. Variable names with an appended string can be tricky to disentangle again. A delimiter like If the output dimensions of the statistics are the same across datasets, I could also imagine statistic variables that have the actual variable names as a dimension:
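A minimal sketch of such a layout, assuming xarray. The dimension name `feature`, the variable names, and all values are hypothetical and only illustrate the idea of one statistic variable per operator, indexed by variable name:

```python
import numpy as np
import xarray as xr

# Hypothetical variable names and statistic values, for illustration only.
variable_names = ["t2m", "u10", "v10"]

ds_stats = xr.Dataset(
    data_vars={
        # One data variable per statistic, indexed by the original variable name.
        "mean": ("feature", np.array([280.0, 1.5, -0.5])),
        "std": ("feature", np.array([10.0, 3.0, 2.5])),
    },
    coords={"feature": variable_names},
)

# Look up a statistic by label instead of parsing a composite variable name.
t2m_mean = float(ds_stats["mean"].sel(feature="t2m"))
```

With this layout a reader selects by coordinate label, so no delimiter has to be parsed out of the variable name.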
|
Good suggestion! Would be good to use standards for this. |
@mafdmi here are even more useful resources on how to create the variable names: https://cfconventions.org/Data/cf-standard-names/docs/guidelines.html#id2798144 |
- Made statistical operators into functions instead of classes.
- Added extensive docstrings for the individual statistical operators.
- Added a "cell_methods" attribute (semi-compliant with cfconventions) to each statistical variable, to make it clear how the variable was calculated.
Reworked the way we define statistical methods and the way we invoke them from the *.yaml file.
Added new statistical methods to calculate:
Fixed typos in README. Removed duplicate and obsolete attributes section in docstrings.
The new way to specify compute statistics in the *.yaml file looks like the following:
i.e. you simply specify the statistical methods you want to calculate. The statistical methods are documented in the docstrings of the respective functions.
Every statistical function is defined in the statistics.py file. Each function takes in an xarray dataset, calls a general compute_statistic function, and returns a resulting dataset with the statistical variables.
Depending on the parameters supplied to the compute_statistic function, different operations are applied in a specific order: first differencing along the diff_dim dimension, then grouping by the groupby_dim index, and finally applying stats_op.
Take the new diurnal_diff_mean_per_gridpoint function as an example. The function calls compute_pipeline_statistic, which first computes the difference in time (with step size 1), then groups the data by "time.hour", and finally computes the mean over the "time" dimension.
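The three steps described above can be sketched with plain xarray calls. This is an illustration under assumptions, not the PR's implementation: the helper name `diurnal_diff_mean_sketch`, the variable `t2m`, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr


def diurnal_diff_mean_sketch(ds, time_dim="time"):
    """Illustrative only: diff over time, group by hour of day, then average."""
    ds_diff = ds.diff(dim=time_dim, n=1)           # 1. time-step differences (step size 1)
    grouped = ds_diff.groupby(f"{time_dim}.hour")  # 2. group by hour of day
    return grouped.mean(dim=time_dim)              # 3. mean over time within each group


# Synthetic 12-hourly data on three grid points (hypothetical values).
times = pd.date_range("2024-01-01", periods=4, freq="12h")
values = np.array(
    [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [3.0, 3.0, 3.0], [6.0, 7.0, 8.0]]
)
ds = xr.Dataset({"t2m": (("time", "grid_index"), values)}, coords={"time": times})

# The result keeps grid_index and replaces time with an "hour" dimension.
result = diurnal_diff_mean_sketch(ds)
```

Because the mean is taken over "time" only, each grid point keeps its own value per hour of day, which is what the "per_gridpoint" suffix suggests.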
To make it possible to understand how a statistical variable has been computed, an attribute "cell_methods" is added to every statistical variable. The cell_methods attribute is a string that lists the operations applied to the specific dimensions of the data, in the order they were applied (semi-compliant with cfconventions: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#statistics-more-than-one-axis - [I'm not sure if e.g. the "groupby" and the "(interval: 3 hours)" parts are compliant]).
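How such a string could be assembled from the pipeline parameters can be sketched as follows. The helper `build_cell_methods` and its parameter names are hypothetical and not taken from the PR; only the output format follows the description here.

```python
def build_cell_methods(stats_dim, stats_op, diff_dim=None, diff_interval=None, groupby=None):
    """Hypothetical helper: list the applied operations in the order they were applied."""
    parts = []
    if diff_dim is not None:
        entry = f"{diff_dim}: diff"
        if diff_interval is not None:
            entry += f" (interval: {diff_interval})"
        parts.append(entry)
    if groupby is not None:
        parts.append(f"{groupby}: groupby")
    # The final statistical operation always comes last.
    parts.append(f"{stats_dim}: {stats_op}")
    return " ".join(parts)


cell_methods = build_cell_methods(
    "time", "std", diff_dim="time", diff_interval="3 hours", groupby="time.hour"
)
plain_mean = build_cell_methods("time", "mean")
```

Keeping the entries in application order means the attribute can be read left to right as a description of the pipeline.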
E.g. the diurnal_diff_std_per_gridpoint function adds the 'cell_methods' = 'time: diff (interval: 3 hours) time.hour: groupby time: std' attribute, which translates to