Adding custom reduction methods #208
I agree, supporting custom reduction methods would be extremely useful; @spencerkclark and I were recently talking offline about exactly that. We could also consider adding a stacked average directly. I don't immediately understand the operation as you've described it though...maybe you'd be willing to share e.g. your code.

Initial thoughts re: supporting custom methods:
```python
calc_suite_specs = dict(
    ...,
    output_time_regional_reductions=['av', 'std', ('my_reduc', my_custom_reduction)],
    ...)
```

As the code currently stands, the function would have to accept an xr.DataArray as its sole argument (and one on which the calculation and within-year averaging have already been performed).
```python
class Reduction(object):
    def __init__(self, name, function):
        self.name = name
        self.function = function
```

So these would essentially become part of the user's object library and would be passed in as e.g.
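For concreteness, here is a minimal sketch of what that could look like, using the `Reduction` class just sketched; the reduction function and the `output_time_regional_reductions` entry are illustrative guesses, not settled syntax:

```python
class Reduction(object):
    """Minimal stand-in for the class sketched above."""
    def __init__(self, name, function):
        self.name = name
        self.function = function


def my_custom_reduction(arr):
    """Any function accepting the computed xr.DataArray as its sole argument.

    Assumes a dimension named 'time' for illustration.
    """
    return arr.std('time')


# Defined once in the user's object library...
my_reduc_obj = Reduction('my_reduc', my_custom_reduction)

# ...and passed into the main script alongside the built-in string labels.
calc_suite_specs = dict(
    output_time_regional_reductions=['av', 'std', my_reduc_obj],
)
```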
This is the function I use for selecting variables corresponding to specified percentiles of precipitation:

```python
def cond_pct_prcp(prcp_nc, prcp_c, var2cond):
    """Condition on precip percentile.

    Parameters
    ----------
    prcp_nc, prcp_c : precipitation components (combined via ``total_precip``)
    var2cond : variable to condition on the precipitation percentiles
    """
    prcp_tot = total_precip(prcp_nc, prcp_c)
    prcp_pct = pct_prcp(prcp_nc, prcp_c)  # percentiles of precipitation
    # Keep only points where total precip falls between successive percentile bounds
    var2cond = var2cond.where(prcp_tot > prcp_pct)
    var2cond = var2cond.where(prcp_tot < prcp_pct.shift(quantile=-1))
    return var2cond
```

If I'm understanding you right, 'my_reduc' is meant only as the label for the output file. 'my_reduc_obj' would be a way of tying 'my_reduc' and 'my_custom_reduction' together for the sake of consistency. Either seems fine, and it would be nice if the syntax were standardized between 'av'/'std' and the custom methods.

The issue I faced is with the requirement that within-year averaging be performed. In my particular case, averaging within the year would be a time average, which defeats the purpose of stacking. In the workaround I mentioned, there are some changes to _make_full_mean_eddy_ts so that it also returns the raw data for _apply_all_time_reductions to use in the stacked case. Barring a more general solution, I would suggest adding this extra case within the code.
Yes, exactly.
Yes, the Calc pipeline inherently centers on within-year averaging first, followed by across-year time reductions, cf. #207. I agree that this is something we should relax: we should support arbitrary time and spatial reductions within each year (including no reduction, which is what you need), and then arbitrary time and spatial reductions across all years.
Your workaround could be a useful starting point, if you don't mind sharing it. We'll have to think about how best to go about it though, amidst a wider refactoring of Calc that is becoming more and more necessary.
Well, more accurately, it's @spencerkclark's workaround, so I assume he wouldn't object to sharing it. It's in /nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/calc.py, which you should both have read access to.
I agree; I lean towards the object-based approach too for the same reason.
The cleanest intermediate fix would probably be to rewrite the pipeline slightly such that all time reductions start from the full time series. This would introduce some duplication of work (say, if one wanted to compute the mean over annual means (
It is decided then :)
I think we should avoid this if possible. One idea would be for the Calc to retain the full, unreduced computed timeseries in memory, so that each requested reduction can start from it without recomputation.

(This is analogous to the eventual goal of retaining loaded data from disk across Calcs, i.e. #4, but I think it is even simpler (no across-process communication to deal with). So it might be useful for that reason also.)
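A schematic sketch of that memoization idea; the attribute and method names here are made up for illustration and are not aospy's actual Calc internals:

```python
class Calc(object):
    # ... (rest of the class elided)

    def _full_ts(self):
        """Hypothetical: compute the unreduced timeseries once and cache it,
        so every requested reduction starts from it without recomputation."""
        if getattr(self, '_full_ts_cache', None) is None:
            self._full_ts_cache = self._compute_unreduced_timeseries()
        return self._full_ts_cache
```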
FYI I'm planning on (finally) putting in a few cycles on this at some point this month. The initial sketch of the implementation I think still seems like a good starting point.
Awesome! I'd be happy to devote some time as well to help push this forward. This would be a major improvement to aospy.
OK, I have sketched out an initial design but could use some input on a few decisions.

**How to register reduction methods**

Cf. the comments above: I want to register all of the Reductions using their labels as keys, so that we can ensure that there are no duplicate labels. Otherwise it is possible that filenames would become ambiguous.
```python
class Reduction(object):
    instances = {}

    def __init__(self, func, label):
        self.instances.update({label: self})
        ...

...

time_avg = Reduction(lambda x: x.mean(YEAR_STR), 'av')
```

Does that seem reasonable?
A registration hook on the class could then enforce that labels are unique:

```python
@classmethod
def register_reduction(cls, reduction):
    label = reduction.label
    if label in cls.instances:
        raise KeyError("A Reduction object already exists with the desired "
                       "label '{0}', and duplicates are not allowed. "
                       "The existing object is: "
                       "{1}".format(label, cls.instances[label]))
    else:
        cls.instances[label] = reduction
```

**Call signature of reduction methods**

What should the call signature of reduction functions be? The simplest would be for them to have a DataArray as their lone argument. But all but the most naive reductions also require the array of time weights (dt); currently these get converted to units of days within Calc. My initial thought is to continue doing this, with the time reductions occurring immediately afterward, so that the user can just count on their reduction function being applied to a DataArray whose TIME_WEIGHTS_STR coord has units of days. They can convert back to higher frequency if needed, but otherwise won't have to worry about overflow or have to pass in the dt array separately.

Are there time reductions that would require additional arguments? I personally don't have any, but @chuaxr's example above seems to require additional data. That becomes much more complicated; we'd have to think about how to support it.

**Nested reductions**

The whole point of this endeavor is to relax the current, rigid, first-within-year-average, then-across-years-reduction approach to allow more general reductions. But what is the cleanest way of implementing multi-step reductions such as these? My initial idea for e.g. 'av':

```python
from .utils.times import yearly_average

def _in_year_avg(arr):
    return yearly_average(arr, arr[TIME_WEIGHTS_STR])

time_avg = Reduction(lambda x: _in_year_avg(x).mean(YEAR_STR), 'av')
```

EDIT: for a linear operation such as an average, this two-step procedure is unnecessary; the above could be replaced with a single step.

I think that's enough for now...there are other decisions to be made w/r/t regional vs. point-by-point reductions that we can return to.
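To make the string-vs-object lookup concrete, here is a rough sketch (mine, not from the thread) of how Calc might resolve whatever the user supplies against the registry proposed above; the helper name is hypothetical and assumes the `Reduction.instances` dict sketched earlier:

```python
def resolve_reduction(spec):
    """Map a user-supplied reduction spec to a Reduction instance.

    `spec` may be a registered label (str, e.g. 'av') or a Reduction
    object itself.
    """
    if isinstance(spec, Reduction):
        return spec
    try:
        return Reduction.instances[spec]
    except KeyError:
        raise KeyError("No Reduction registered under label {0!r}; "
                       "registered labels: {1}".format(
                           spec, sorted(Reduction.instances)))
```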
The assumption seems to be that reduction operations will be performed on a time dimension of the data. At least for my use case, I often wish to simultaneously reduce both in time and space: perhaps by taking the 99th percentile of some value over the domain and the last x days. It's not clear if this is something that is meant to be included, although it would be nice :)
@chuaxr So long as you can express this logic in a function that accepts the DataArray containing the Calc's value computed at every timestep, this is fine. In other words, if you can express the reduction like this:

```python
def percentile99(arr):
    arr_out = ...  # insert desired logic here, which could include
                   # both spatial and temporal operations
    return arr_out
```

then, based on what I've proposed above, you should be able to represent it via e.g. a `Reduction` object wrapping that function.
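For illustration, a possible concrete body for `percentile99`, assuming the DataArray has dims named 'time', 'lat', and 'lon' (my sketch, not @chuaxr's actual code); it could then be registered as e.g. `Reduction(percentile99, 'p99')`, with the label again purely illustrative:

```python
def percentile99(arr):
    """99th percentile over time and space jointly.

    Stacking first means the percentile is taken over all samples at once,
    rather than dimension by dimension.
    """
    stacked = arr.stack(sample=('time', 'lat', 'lon'))
    return stacked.quantile(0.99, dim='sample')
```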
Thanks for putting together these initial thoughts @spencerahill! I like your idea for registering reduction instances; it seems very clean. One question I have though is: will the user need to import their custom Reduction objects for them to be registered?

Regarding call signatures: I agree that we should just convert the time weights to units of days as soon as we load them. The general call signature topic is going to be a challenge though. I'll describe perhaps my most tricky example below. Maybe we can try discussing things a bit to see if we can come up with a sane solution, but I'm fine if we elect to punt on some of the more complicated issues for now. I don't want to heap tons more work onto your plate and would be happy to contribute PR(s) of my own if we can figure out how we'd like to proceed.
This is a really tough one. Let me describe an example of a calculation that I would like to be able to do in aospy to illustrate the challenge. It may require re-thinking parts of the current pipeline (but could open the door to making things massively more flexible).

**Example**

A type of calculation that I find myself doing quite a bit these days is computing a regression pattern. This is a form of a time reduction (i.e. it removes the time dimension from the data). Without going too much into the details, this reduction requires a couple of steps:
Let's say I wanted to compute regression patterns for the zonal and meridional winds onto a precipitation index (i.e. two separate calculations). At the minimum, this would require reduction operations to be able to take multiple arguments, because the reduction requires two variables (the index, and the variable we regress onto it).

**Thoughts**
You might say this isn't so bad. Essentially we would need some way for the user to be able to specify how they would like the name of the reduction to appear in file names (that would be a function of the input arguments), and we would need some way for the user to specify the arguments to the reduction in the main script. Maybe the reduction class could look like this:

```python
class Reduction(object):
    instances = {}

    def __init__(self, func, label, label_append=None):
        """
        Parameters
        ----------
        func : function
            Reduction operation
        label : str
            Label for the reduction used in filenames
        label_append : function, optional
            Function to extend the label based on the input arguments
        """
        self.instances.update({label: self})
        self.func = func
        # Stored under a private name so it doesn't shadow the label() method.
        self._label = label
        self.label_append = label_append

    def label(self, *args, **kwargs):
        if self.label_append is None:
            return self._label
        else:
            return '{}-{}'.format(self._label,
                                  self.label_append(*args, **kwargs))
```

We would call the `label` method when constructing the output file names. For the regression example:

```python
def regress(da, index):
"""
Parameters
------------
da : DataArray
Variable to regress onto index
index : DataArray
1D index in time
"""
# Compute the regression pattern ...
return result
def regress_label(da, index):
return 'onto-{}'.format(index.name)
Regression = Reduction(regress, 'regression', label_append=regress_label) And maybe in the main script we could allow for the specification of reductions as strings (which is the standard now) for reductions that don't require input arguments, or tuples, e.g. from custom_reductions import Regression
from variables import precipitation_index, ucomp, vcomp

calc_suite_specs = dict(
    ...,
    output_reductions = ['av', ('regression', [precipitation_index], {})],
    variables = [ucomp, vcomp],
    ...
)
```

Here the implicit assumption is that if an entry of `output_reductions` is a tuple, its first element is the registered label and the remaining elements are the positional and keyword arguments to pass to the reduction. Perhaps there could be some tweaks to that design, but that maybe doesn't seem too bad, right? There is a complication though...

**Complication**

What if the variable needed to compute the index requires slightly different load parameters than the other variables? This is perhaps somewhat provocative, but it would be cool if I could write something like:

```python
from custom_reductions import Regression
from variables import precipitation_index, ucomp, vcomp

calc_suite_specs = dict(
    ...,
    output_reductions = ['av',
                         ('regression', [precipitation_index.with_load_params(dtype_in_vert='sigma')], {})],
    variables = [ucomp, vcomp],
    ...
)
```

Here the hypothetical `with_load_params` method would signal that `precipitation_index` should be loaded with `dtype_in_vert='sigma'`, regardless of what is specified for the other variables. I'll leave things there for now, but let me know your thoughts and if you have any questions regarding my example.
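For reference, a minimal illustration of the regression-pattern reduction itself (my sketch, not @spencerkclark's actual code), assuming a dimension named 'time' shared by both arrays:

```python
def regress(da, index):
    """Least-squares regression pattern of `da` onto a 1-D time `index`.

    slope = cov(da, index) / var(index), computed pointwise over the
    spatial dims; the result has no time dimension.
    """
    da_anom = da - da.mean('time')
    index_anom = index - index.mean('time')
    cov = (da_anom * index_anom).mean('time')
    var = (index_anom ** 2).mean('time')
    return cov / var
```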
I think we're on the same page; here's a more precise example: let's say I have a 10 by 10 (x, t) grid containing the numbers 1 to 100. The 90th percentile of the whole grid is 90 (the 90th number from the start). But if we take the 90th percentile over x and over t separately, we get 89 (the (9, 9) entry of the grid).
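A quick numpy illustration of this (mine, not from the thread); the numbers come out as described when the percentile is constrained to pick an actual grid value (numpy's `method='lower'`, spelled `interpolation='lower'` in numpy < 1.22):

```python
import numpy as np

grid = np.arange(1, 101).reshape(10, 10)  # 10 x 10 (t, x) grid holding 1..100

# 90th percentile over the whole (stacked) grid:
whole = np.percentile(grid, 90, method='lower')           # -> 90

# 90th percentile along x first, then along t:
per_x = np.percentile(grid, 90, axis=1, method='lower')   # -> [9, 19, ..., 99]
per_dim = np.percentile(per_x, 90, method='lower')        # -> 89
```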
Thanks @spencerkclark for the detailed response. I understand your use case.
Regarding whether users need to import their Reduction objects for them to be registered: I think the answer is yes, but I'll check. But they only need to import the ones they're using; i.e., if they're not being used, it's irrelevant whether they're registered or not. Right? Although I guess, cf. further below in your comment, that if we want to retain the ability to pass them in as strings, this does become a little counter-intuitive, since then you'd have to import the object even though you don't ever explicitly use it (and thus would get warnings from your linter).

Another idea: specify the module where your reductions are defined as an aospy config setting, e.g. in your main script, akin to xarray's set_options.

And yet another option: drop string-based support; use the registry and require that Reduction objects themselves be passed in, except perhaps for aospy's built-in reductions. Need to think more about all this.
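A hypothetical sketch of that config-setting idea (nothing here exists in aospy; the function and option names are invented for illustration):

```python
import importlib


def set_options(reductions_module=None):
    """Import the named module so that the Reduction objects defined in it
    register themselves (via Reduction.instances) as a side effect."""
    if reductions_module is not None:
        importlib.import_module(reductions_module)


# In the user's main script, pointing at their object library, e.g.:
# set_options(reductions_module='my_obj_lib.reductions')
```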
Yes, I think that's a solid way of handling it. But your subsequent comment is an important one:
I have been frequently overestimating how difficult things will be to change in aospy as of late, but nevertheless I suspect this will be quite a challenge.
For this use case, would resolving #84 be sufficient? In that case, I am definitely convinced that having reductions accept other variables as args/kwargs is something we should do, and hopefully soon. But I can tell this is going to be a huge change even without that. Therefore, I'm suggesting we proceed with the single-argument version of Reductions at present, and I'll open Issues to track generalizing this once we've implemented it.
@chuaxr thanks for this, although I must confess I'm still not sure which page we're on! Just to be abundantly clear: are you saying that (1) yes, the reductions you want to perform only require the data already in the Calc, or (2) no, the reductions you want to perform would require additional data that would somehow need to be passed in? If the answer is (1), that's further motivation to implement the simple version first, and then probably do a 0.3 release, and then return to more involved cases like @spencerkclark's above.
By all means; since I'm the one who needs this complexity, the burden shouldn't be on you to implement it! I agree let's hash out the single argument version first and then move forward with multiple arguments.
Exactly, this was the concern I had in mind. As I think about it more, anything that doesn't automatically load all reductions at once somewhat defeats the purpose of the registration process. It will guarantee that no one overwrites the built-in reductions, but will not guarantee that someone won't overwrite one of their own (admittedly I don't think this is likely). For that reason, maybe I lean toward your suggested config-setting approach.

Some more comments on multi-argument reductions/complications are below, but we can save those for future discussion.
I actually don't see any particular reason why it would be a major challenge 😄. We already do this for each variable specified in a computed Var (see lines 391 to 394 at commit 9f9a1bf).
The trickier case is if the variable needs different load parameters.
We could potentially build in an implicit solution via that route. For example, see lines 578 to 580 at commit 9f9a1bf.
I suppose for non-vertically-defined variables we could look in all possible dtype_in_vert locations?
@spencerahill yes, I mean (1) :)
Thanks @chuaxr and @spencerkclark. All sounds good. We'll continue the discussion of multi-argument reductions in #286. Meanwhile I'll respond again here and/or open a PR once I've got more to share. |
For my current purposes, I am frequently interested in computing averages of a variable, subject to some condition, across both time and the domain, treating each point as an independent sample. For example, computing the average of vertical velocity over regions with precipitation exceeding a certain threshold. Since masking is required to isolate the samples of interest, stacking the data in (time, lat, lon) and then taking the mean (the 'stacked average') is not the same as taking the mean in time, then in (lat, lon).
One workaround is to simply define the stacked average as a user function. However, this leads to duplication of variables such as w_conditioned_on_precip and stacked_average_of_w_conditioned_on_precip. To avoid that, @spencerkclark helped me add a custom reduction method in my copy of aospy.
It would be ideal for a user to add custom reduction methods as desired. If that is unlikely to happen in the near future, though, it would be nice if this 'stacked average' were added in the next release.
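To make the distinction concrete, a minimal xarray sketch (assuming dims named 'time', 'lat', and 'lon' and a precomputed boolean mask, e.g. precipitation exceeding a threshold; all names are illustrative):

```python
def stacked_average(arr, mask):
    """Average over all (time, lat, lon) samples at once, after masking."""
    return arr.where(mask).stack(sample=('time', 'lat', 'lon')).mean('sample')


def sequential_average(arr, mask):
    """Average in time first, then in space.

    Generally differs from the stacked average when the mask keeps a
    different number of samples at each grid point, since each grid
    point's time mean then gets equal weight in the spatial mean.
    """
    return arr.where(mask).mean('time').mean(['lat', 'lon'])
```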