Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calendar utilities #5155

Closed
aulemahal opened this issue Apr 14, 2021 · 9 comments · Fixed by #5233
Closed

Calendar utilities #5155

aulemahal opened this issue Apr 14, 2021 · 9 comments · Fixed by #5233

Comments

@aulemahal
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Handling cftime and numpy time coordinates can sometimes be exhausting. Here I am thinking of the following common problems:

  1. Querying the calendar type from a time coordinate.
  2. Converting a dataset from a calendar type to another.
  3. Generating a time coordinate in the correct calendar.

Describe the solution you'd like

  1. ds.time.dt.calendar would be magic.
  2. xr.convert_calendar(ds, "new_cal") could be nice?
  3. xr.date_range(start, stop, calendar=cal), same as pandas' (see context below).

Describe alternatives you've considered
We have implemented all this in (xclim)[https://xclim.readthedocs.io/en/stable/api.html#calendar-handling-utilities] (and more). But it seems to make sense that some of the simplest things there could move to xarray? We had this discussion in xarray-contrib/cf-xarray#193 and suggestion was made to see what fits here before implementing this there.

Additional context
At xclim, to differentiate numpy datetime64 from cftime types, we call the former "default". This way a time coordinate using cftime's "proleptic_gregorian" calendar is distinct from one using numpy's datetime64.

  1. is easy (xclim function). If the datatype is numpy return "default", if cftime, look into the first non-null value and get the calendar.
  2. xclim function The calendar type of each time element is transformed to the new calendar. Our way is to drop any dates that do not exist in the new calendar (like Feb 29th when going to "noleap"). In the other direction, there is an option to either fill with some fill value of simply not include them. It can't be a DataArray method, but could be a Dataset one, or simply a top-level function. Related to Converting cftime.datetime objects to np.datetime64 values through astype #5107.

We also have an interp_calendar function that reinterps data on a yearly basis. This is a bit narrower, because it only makes sense on daily data (or coarser).

  1. With the definition of a "default" calendar, date_range and date_range_like simply chose between pd.date_range and xr.cftime_range according to the target calendar.

What do you think? I have time to move whatever code makes sense to move.

@aaronspring
Copy link
Contributor

  1. exists as ds.time.to_index().calendar
  2. would like to use

@aulemahal
Copy link
Contributor Author

@aaronspring Oh, I didn't think of that trick for 1, thanks! But again, this fails with numpy-backed time coordinates. With the definition of a "default" calendar, there could be a more general way.

@spencerkclark
Copy link
Member

spencerkclark commented Apr 16, 2021

Thanks for opening up this discussion @aulemahal! It's great to see the boundaries being pushed on what can be done with cftime dates in xarray.

  1. ds.time.dt.calendar would be magic

I think something like this would be cool too (see the end of #4092 (comment)). Attributes on the dt accessor normally return DataArrays with the same shape as the original, as opposed to scalar values, but it might be reasonable to make an exception in this case. Xarray, and CF conventions for that matter, are written in a way that assume that all dates in an array have the same calendar type, and therefore returning an array filled with the same calendar name feels far inferior to returning a scalar.

  1. xr.convert_calendar(ds, "new_cal") could be nice?

Both convert_calendar and interp_calendar seem like very nice utilities. I have been fortunate enough not to have been in a situation to need something like those, but both of those methods seem like sensible and general approaches to the problem of converting a dataset from one calendar to another. I have seen this as an issue for others, e.g. this SO question, so I think we could be open to adding both to xarray.

  1. xr.date_range(start, stop, calendar=cal), same as pandas' (see context below).

We actually considered something like this in the initial stages of writing cftime_range, but decided against it, #2301 (comment), #2301 (comment). Maybe it is worth re-opening discussion about a possible more generic xarray.date_range function, though.

Using the calendar argument, as opposed to the range of the dates, to decide what type to return is a slightly different twist than what was discussed previously. I don't know how I feel about that. On one hand I can see why it would be convenient, but on the other hand "default" is not a valid CF calendar name so it feels a little strange to allow that as a special option to signal you want NumPy dates as a result. I wonder if instead we handled this in a way similar to decoding times, where we have a use_cftime argument? Basically you would have xarray.date_range(..., use_cftime=use_cftime), where:

  • If use_cftime were set to False it would only return NumPy dates and error if this was not possible
  • If use_cftime were set to None (default) it would return NumPy dates if the calendar and time range allowed; otherwise it would return cftime dates
  • And if use_cftime were set to True it would only return cftime dates.

Would that work for your use-case?

@aulemahal
Copy link
Contributor Author

aulemahal commented Apr 21, 2021

Cool! I started a branch, will push a PR soon.

I understand the "default" issue and using use_cftime=None makes sense to me!

For dt.calendar and date_range, there remains the question on how we name numpy's calendar:
Python uses what CF conventions call proleptic_gregorian, but the default and most common calendar we see and use is CF's "standard". May be users would expect "standard"?
A solution would be to check if the earliest value in the array is before 1582-10-15. If yes, return "proleptic_gregorian", if not, return "standard".

@aulemahal
Copy link
Contributor Author

Edited the comment above as I realized that numpy /pandas / xarray use nanoseconds by default, so the earliest date possible is 1677-09-21 00:12:43.145225.
Thus, I suggest we return "standard" as the calendar of numpy-backed datetime indexes.

@spencerkclark
Copy link
Member

Great, thanks!

Thus, I suggest we return "standard" as the calendar of numpy-backed datetime indexes.

I might actually lean toward your original thought, i.e. having dt.calendar return "proleptic_gregorian" for NumPy dates. As you noted, this is technically the calendar that numpy.datetime64 values use with coarser precision. It is also what xarray currently chooses for the calendar encoding attribute for NumPy dates by default (i.e. if no "calendar" attribute exists in the variable's encoding).

(There is of course the complication that the NumPy dates could have been decoded from a file with a calendar attribute of "standard" or "gregorian", in which case, if we are being as true to the underlying data as possible, returning "proleptic_gregorian" as the calendar type would then technically be incorrect. I'm not sure if there is really anything we could do about that though, since the encoding attributes can be lost through things like timedelta arithmetic. Arguably too, if someone wanted to quibble about that, I suppose their first issue should be with decoding those values to NumPy dates to begin with.)

@aulemahal aulemahal mentioned this issue Apr 28, 2021
5 tasks
@shoyer
Copy link
Member

shoyer commented Oct 27, 2021

Given the increasing scope of cftime-related utilities, I wonder if it would make sense to try to eventually move the cftime-related utilities outside of Xarray into a separate package? After the current ongoing "explicit indexes" refactor, I believe we will have public APIs for all of the essential functionality.

@aulemahal
Copy link
Contributor Author

aulemahal commented Oct 27, 2021

I agree that it kinda crowds the xarray API. My first idea was to implement these calendar utilities in cf-xarray, but I was redirected here. Seems to me that if that move is made, it's the most logical place for all CF[time] related things.

@spencerkclark
Copy link
Member

Given the increasing scope of cftime-related utilities, I wonder if it would make sense to try to eventually move the cftime-related utilities outside of Xarray into a separate package? After the current ongoing "explicit indexes" refactor, I believe we will have public APIs for all of the essential functionality.

Yeah, I think that could be something good to strive for, and it's exciting that the explicit indexes refactor could facilitate that. To be clear, are you ok with merging #5233 and waiting until after the refactor is complete to discuss future plans there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants