Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for monthly-resolution predictions #172

Closed
wants to merge 5 commits into from
Closed

Conversation

bradyrx
Copy link
Collaborator

@bradyrx bradyrx commented Jun 7, 2019

Description

This PR adds support for prediction ensembles at monthly resolution.

Fixes https://github.com/bradyrx/climpred/issues/44

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

# Load in global average SST for monthly DPLE 
hind = xr.open_dataset('/glade/u/home/rbrady/scratch/monthly_SST_global_avg.nc')
hind = hind.rename({'anom': 'SST'})
print(hind)
>>> <xarray.Dataset>
>>> Dimensions:  (init: 62, lead: 122)
>>> Coordinates:
>>>   * init     (init) int32 1955 1956 1957 1958 1959 ... 2012 2013 2014 2015 2016
>>>   * lead     (lead) int32 1 2 3 4 5 6 7 8 9 ... 115 116 117 118 119 120 121 122
>>>     member   int32 ...
>>> Data variables:
>>>     SST      (init, lead) float32 ...

reference = xr.open_dataset('/glade/u/home/rbrady/scratch/FOSI_global_SST.nc')
# Convert FOSI to anomalies
reference = reference - reference.sel(time=slice('1964', '2014')).mean('time')
reference = reference.groupby('time.month') - reference.groupby('time.month').mean('time')

>>>  <xarray.Dataset>
>>>  Dimensions:  (time: 840)
>>>  Coordinates:
>>>      z_t      float32 500.0
>>>    * time     (time) datetime64[ns] 1948-01-01 1948-02-01 ... 2017-12-01
>>>      month    (time) int64 1 2 3 4 5 6 7 8 9 10 11 ... 2 3 4 5 6 7 8 9 10 11 12
>>>  Data variables:
>>>      SST      (time) float32 0.099155456 0.042655587 ... 0.29147333 0.17652681

# Add actual initialization dates to DPLE. November initializations.
hind['init'] = np.arange('1954-11', '2016-11', dtype='datetime64[M]')[::12]
reference['time'] = np.arange('1948-01', '2018-01', dtype='datetime64[M]')

# Subtract one from hindcast lead. The first timestep is actually the November comparison, 
# since initialization was November 1 and we're looking at the full month of November.
hind['lead'] = hind['lead'] - 1
cp.prediction.compute_hindcast(hind, reference, resolution='M').SST.plot()

Screen Shot 2019-06-07 at 1 51 40 PM

Checklist (while developing)

  • I have added docstrings to all new functions.
  • I have commented my code, particularly in hard-to-understand areas
  • Tests added for pytest, if necessary.

Pre-Merge Checklist (final steps)

  • I have rebased onto master or develop (wherever I am merging) and dealt with any conflicts.
  • I have squashed commits to a reasonable amount, and force-pushed the squashed commits.
  • All notebooks run with this branch; tested via treon.
  • pytest runs without breaking.

Todo

  • Add testing for monthly resolution
  • Add temporal resolution to classes
  • Check functions outside of compute_hindcast
  • Add datetime module?

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 7, 2019

The key here is that xarray only works with datetime64 for time coordinates and will only do this in [ns] (nanoseconds).

So, with datetime objects, this line:
imax = min(forecast.time.max(), reference.time.max() - nlags)

Will try to subtract 1 month but instead subtracts 1 nanosecond. So by forcing the following:
imax = min(forecast.time.max(), reference.time.max().values.astype('datetime64[M]') - nlags)

We actually subtract off the correct amount of time. [M] for month, [Y] for year.

This line then has to be changed to accomodate for adding the appropriate time as well:
a['time'] = [t + i for t in a.time.values.astype('datetime64[M]')]

The "real time" approach is great for this. However, I suggest then that we require all objects to have datetime coordinates and do not accept integers for instance. This ensures that we can use the 'real time' approach everywhere and can use datetime64 operations on the output. This means some more pre-processing will have to go in for the user, but it's good for them to have the actual initialization dates in.

We could add a datetime module to do some of what I did in the test code in the main PR text.

@bradyrx bradyrx closed this Jun 7, 2019
@bradyrx bradyrx reopened this Jun 7, 2019
@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 7, 2019

Note -- I plan to just implement monthly resolution with this PR. I think daily would actually be easier than seasonal here. But I want to get it working for monthly, then we can adapt for daily/seasonal?

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 7, 2019

@aaronspring, can you test compute_perfect_model with monthly output? I imagine that this won't be a problem since it's just comparing back to the model/control run itself. And there's nowhere in that function we're currently doing slicing like in the compute_hindcast function.

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 7, 2019

Another issue is that DPLE monthly has a first lead time of zero, truly. Since initialization occurs November 1st, the first lead is the month of November. I.e., a lead of zero in the eyes of this package. This works fine for compute_hindcast, but is an issue for compute_persistence due to this line:

ctrl_inits = reference.isel(time=slice(0, -lag))['time'].values
which returns an empty object with a lag of zero, of course.

@aaronspring
Copy link
Collaborator

aaronspring commented Jun 8, 2019

Regarding the initial comment. When computing monthly skill, what kind of inputs do we require? I naively thought of those:

Hindcast: lead int, init int or datetime
Reference: time same as init?

While it might be more tricky to implement, i think allowing ints is rather easy for the decidable prediction user to provide. When the user should provide datetime s/he has to think about whether to put the November init date there and so on, even if it’s only the decidable prediction thing.

@bradyrx to do you think it’s useful to have ints and datetime? Or rather require one in the calculation but before a conversion if ints are involved?

@aaronspring
Copy link
Collaborator

Can you please show the annual skill in your example also? Thanks

@aaronspring
Copy link
Collaborator

Regarding time format. My perfect model runs are in the year 3000. That is changeable but some control runs might be 1000 years. For that purpose I suggest we use cftime. It has all the groupby features as datetime.


Returns:
skill (xarray object): Predictability with main dimension `lag`.
"""
nlags = hind.lead.size
if resolution not in ['Y', 'M']:
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe NotImplementedError


Returns:
skill (xarray object): Predictability with main dimension `lag`.
"""
nlags = hind.lead.size
if resolution not in ['Y', 'M']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe when constants.py is added in my PR, add TEMPORAL_RES = ['Y', 'M']

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add this if ... to checks.py since I see it used in multiple places

@aaronspring
Copy link
Collaborator

monthly data added: https://github.com/bradyrx/climpred-data

control_mm
<xarray.Dataset>
Dimensions:  (area: 2, time: 3612)
Coordinates:
  * area     (area) object 'global' 'North_Atlantic'
  * time     (time) object 3000-01-31 00:00:00 ... 3300-12-31 00:00:00
Data variables:
    tos      (area, time) float32 ...
    sos      (area, time) float32 ...

ds_mm
<xarray.Dataset>
Dimensions:  (area: 2, init: 12, lead: 252, member: 10)
Coordinates:
  * area     (area) object 'global' 'North_Atlantic'
  * lead     (lead) int64 1 2 3 4 5 6 7 8 9 ... 245 246 247 248 249 250 251 252
  * init     (init) object 3014-01-31 00:00:00 ... 3257-01-31 00:00:00
  * member   (member) int64 0 1 2 3 4 5 6 7 8 9

now I realize, init should maybe be one month earlier. well thats what the discussion is about. but this at least is data to test with. probably we will have to define more clearly how the data needs to be formatted.

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 10, 2019

Will get back to the other comments, but on this:

Regarding time format. My perfect model runs are in the year 3000. That is changeable but some control runs might be 1000 years. For that purpose I suggest we use cftime. It has all the groupby features as datetime.

I agree. I think we can just force all xarray objects to have cftime to only have to anticipate one type of datetime. I'll look into this module.

@ahuang11
Copy link
Member

ahuang11 commented Jun 10, 2019 via email

@aaronspring
Copy link
Collaborator

I agree. I think we can just force all xarray objects to have cftime to only have to anticipate one type of datetime. I'll look into this module.

I am glad to have cftime. Maybe we can convert all ints to cftime. another question that I ask myself now:
If we have inits as cftime, what should be the init and what time of reference would match the first lead?

If initialization took place at the turn of the year jan 1st xxxx-01-01 00:00:00, then lets use this as init and mid-jan xxxx-01-15 00:00:00 matching the first lead? Or rather end of the month, because output is usually rather formatted as this? Better if we can somehow make it independent of timescales lower than a month when calculating monthly skill, so we can worry less.

Its difficult matching the best setup for us internally but also consistent and useful for the user.

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 10, 2019

@aaronspring @ahuang11

using cftime in xarray

So xarray actually has a cftime_range function that we could use here. (http://xarray.pydata.org/en/stable/generated/xarray.cftime_range.html?highlight=cftime)

I loaded in FOSI (spans monthly from 1948-2015) and can add a cftime object by the following:
ds['time'] = xr.cftime_range('1948-01', '2015-12', freq='MS', name='time')

You could also imagine setting up initializations on every November 1st:
xr.cftime_range('1948-11', '2015-11', freq='MS')[::12].

cftime for climpred

If we have inits as cftime, what should be the init and what time of reference would match the first lead? If initialization took place at the turn of the year jan 1st xxxx-01-01 00:00:00, then lets use this as init and mid-jan xxxx-01-15 00:00:00 matching the first lead? Or rather end of the month, because output is usually rather formatted as this? Better if we can somehow make it independent of timescales lower than a month when calculating monthly skill, so we can worry less.

I'm not sure here.

  1. inits should align with their actual initialization in my opinion. So DPLE is initialized on November 1 at midnight, I'd want cftime.datetime(2014, 1, 1).
  2. Other simulations (like a control run) tend to handle their cftime as month-end as you said. For example, FOSI month one comes out as cftime.DatetimeNoLeap(249, 1, 31, 23, 59, 59, 0, 6, 31). This makes sense, since it's the full month average.

I have to play around with your "real time" setup in our functions to see how we can make this align. We should implement a cftime/time/datetime module to handle some of the tricky bits here. Here's a thread on ideas of how to shift time in cftime format (pydata/xarray#2244).

Some questions:
a. How do we shift the init cftime appropriately to align with the referfences?
b. What does lead-1 tend to really mean for modeling centers?

On (b), some comments for DPLE. We initialize Nov. 1 of each year at midnight. At annual resolution, this means lead-1 from Nov. 2014 initialization corresponds to the 2015 average. At monthly resolution, it gets tricky. Since we initialize at midnight on Nov 1., lead-1 monthly is actually the full November average. So that would align with (2) from above.

Initialization: 2015-11-01 00:00:00
Month-average from reference: 2015-11-30 00:00:00

So here, lead-1 should bring 2015-11-01 00:00:00 to 2015-11-30 00:00:00.

Thoughts? Want to get this straightened out before doing any more work here. I should also read some cftime documentation (I mean, outside of the package). How do modeling centers handle netCDF time? There's all sorts of different calendars, but usually their attribute says if it's Gregorian, noleap, etc.

References

  1. cftime API for python: https://unidata.github.io/cftime/api.html
  2. xarray cftime_range: http://xarray.pydata.org/en/stable/generated/xarray.cftime_range.html?highlight=cftime
  3. xarray open_dataset has use_cftime flag: http://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset
  4. xarray example notebook with use of cftime: http://xarray.pydata.org/en/stable/weather-climate.html?highlight=cftime

@aaronspring
Copy link
Collaborator

I agree with all of this.

Inits should be exactly when initialization happened. 2015-11-01 00:00:00.
For MPI prediction systems its the same: init 1st nov and then first annual lead the next calender year. for seasonal, DJF is the first seasonal lead. for monthly probably, november would be the first.

if we put init to the timestep 1 before the first lead, it might work sometimes, but not all. that would have been easier.

I dont care whether we choose end of the month or the mid of the month as timestep for reference and then internally to compare. probably 2015-11-15 is easier because its static. (month end might be tricky for the leap years)

So in principle I am in favor of your suggestion:

Initialization: 2015-11-01 00:00:00
Month-average from reference: 2015-11-30 00:00:00
So here, lead-1 should bring 2015-11-01 00:00:00 to 2015-11-30 00:00:00

Are there functions like endofmonth, startofmonth,addmonths(1)?

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 11, 2019

I agree with your thoughts, but we have to be careful due to sections of the code like this:

Screen Shot 2019-06-11 at 3 27 50 PM

Persistence finds common inits, based on the assumption that the monthly means for a reference are set as the first of the month, just as with inits.

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 12, 2019

@aaronspring , @ahuang11

I have to focus on writing a manuscript over the next couple weeks. I've been putting it off. I won't have time to work on this til then. So if either of you have ideas, feel free to push to this branch and play around with cftime.

@ahuang11
Copy link
Member

ahuang11 commented Jun 19, 2019

Are there functions like endofmonth, startofmonth,addmonths(1)?

Woops, missed your message @aaronspring, but pandas is amazing for timeseries (and already a requirement of xarray)

pd.datetime(2015, 5, 15) + pd.offsets.MonthBegin(1)
pd.datetime(2015, 5, 15) + pd.offsets.MonthEnd(1)
pd.datetime(2015, 5, 15) + pd.DateOffset(months=1)
Timestamp('2015-06-01 00:00:00')
Timestamp('2015-05-31 00:00:00')
Timestamp('2015-06-15 00:00:00')

Unfortunately, it doesn't support cftime.

)
# Some lags force this into a numpy array for some reason.
imax = xr.DataArray(imax).rename('time')
forecast = forecast.where(forecast.time <= imax, drop=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you could do:

forecast.sel(time=slice(imin, imax))

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 19, 2019

Woops, missed your message @aaronspring, but pandas is amazing for timeseries (and already a requirement of xarray)

Great point @ahuang11. I've played with this in the past and those addoffset functions are great.

@ahuang11
Copy link
Member

I think we should force everything to use cftime. Some decorator check to see if integers are provided; if so, convert to cftime. Once we use cftime, the temporal resolution (daily/monthly/yearly) should become irrelevant.

@bradyrx
Copy link
Collaborator Author

bradyrx commented Jun 25, 2019

Closing for now since we will address this as a primary goal of v2.

@bradyrx bradyrx closed this Jun 25, 2019
@bradyrx bradyrx mentioned this pull request Aug 15, 2019
4 tasks
@aaronspring
Copy link
Collaborator

when going for monthly I developed a function that generates uninitialized ensembles from a control, based on a start month to be specified. to be used in my perfect-model framework.

def bootstrap_uninit_pm_ensemble_from_control_mm(ds, control, init_month):
    """
    Create a pseudo-ensemble from control run.
    Note:
        Needed for block bootstrapping confidence intervals of a metric in perfect
        model framework. Takes randomly segments of length of ensemble dataset from
        control and rearranges them into ensemble and member dimensions.
    Args:
        ds (xarray object): ensemble simulation.
        control (xarray object): control simulation.
    Returns:
        ds_e (xarray object): pseudo-ensemble generated from control run.
    """
    nens = ds.init.size
    nmember = ds.member.size
    length = ds.lead.size
    c_start = control.time.dt.year[0]
    c_end = control.time.dt.year[-1]
    lead_time = ds['lead']

    def sel_years(control, year_s, length):
        new = control.sel(time=slice(str(year_s)+'-'+str(init_month).zfill(2),
                                     str(year_s + length//12 + 1)+'-'+str(init_month).zfill(2)))
        new = new.rename({'time': 'lead'}).isel(lead=slice(0, lead_time.size))
        new['lead'] = lead_time
        return new

    def create_pseudo_members(control):
        startlist = np.random.randint(c_start, c_end - length - 1, nmember)
        return xr.concat(
            (sel_years(control, start, length)
             for start in startlist), 'member'
        )

@aaronspring aaronspring reopened this Sep 21, 2019
@bradyrx bradyrx deleted the add_sub_annual branch September 23, 2019 19:54
@aaronspring aaronspring mentioned this pull request Nov 24, 2019
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

adapt for sub-annual predictions
3 participants