-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Towards a (temporary?) workaround for datetime issues at the xarray-level #1084
Comments
To be clear, the subclassing question is whether we should subclass Index or not. This would entail overriding a lot of methods, and would probably break in the future (with pandas 2.0). I definitely do not recommend subclassing PeriodIndex -- that could be very fragile. |
@shoyer @jhamman this is my first time digging in to the internals of pandas / xarray, so please forgive me if this is off-track, but I've started experimenting with a subclass of As a proof of concept, it currently enables groupby operations through a crude version of the I have yet to look into serialization or how / if this could be integrated into xarray, but does this seem like an approach worth pursuing to address this issue? |
@spencerkclark This looks pretty sane to me, though of course it's still missing a few nice things you can do with |
@shoyer and others, is there a well-defined list of required features that the new index object would need to satisfy in order to be considred for inclusion in xarray? Are the two that you mentioned must-haves? Are there others? Wanting to make sure we're all on the same page in terms of what the target is. |
@spencerahill Those are absolutely not essential -- I think your basic example is probably already enough to be useful. Perhaps the most complete list of time series functionality in xarray can be found on this doc page: |
@shoyer brings up a good point regarding partial datetime string indexing. For instance in my basic example, indexing with truncated string dates of the form This would mean that the same string specification could have different behavior for For instance, using the current setup in my basic example with sub-daily resolution data, selecting a time using
but using a
I think if we were to include string-based indexing, it would be best if it were completely consistent with the So ultimately this raises the question, would we want to add just the field accessors to enable group-by operations for now and add string-based selection (and other features like |
Just to be sure we're not mixing Spencers, this was all @spencerkclark's great work! I had no hand in it.
I agree.
Maybe an interim solution is for NetCDFTimeIndex to only accept full datestrings, issuing an error message for partial strings explaining that this functionality is forthcoming? |
@shoyer, @jhamman , @spencerkclark, @spencerahill , and @darothen - I think I've got netcdftime sufficiently detatched from NetCDF4-Python that I think we can keep moving forward. Along the lines of what you're notebook showed last month, you should just be able to swap out |
Thanks @jhamman, that's great to hear. At this moment I don't have any concrete things I'm waiting on from you. I haven't had the chance to iterate more on this since last month, but as I've thought about it more, for the custom datetime index to really "feel" like the pandas That being said, before pushing too far ahead, I think it's probably important to put things in a form that's more amenable to code review, testing, and collaboration. From a mechanics perspective, do we want this sort of index to live totally inside |
The good news here is that almost all this logic lives in pandas's Python code and should be straightforward to duplicate. I can point you to the relevant locations if you're having trouble figuring this out.
It should be quite straightforward to integrate this into xarray's existing serialization logic.
We could go either way here, but for now I think it makes sense to keep this in xarray proper. Here's why:
|
I agree. |
@shoyer that all sounds good. I'll get to work over the next few weeks and see what I can do. Thanks for your help. |
@shoyer @jhamman outside of an upstream issue in If you think this is a path forward here, I suppose the next order of business would be for me to open up an issue in For instance, I would expect line 5 to produce a date with In [1]: import dateutil
In [2]: dateutil.__version__
Out[2]: '2.5.3'
In [3]: from dateutil import parser
In [4]: p = parser.parser()
In [5]: p._parse('0001-01-16')
Out[5]: (_result(year=16, month=1, day=1), None)
In [6]: p._parse('0016-01-01')
Out[6]: (_result(year=16, month=1, day=1), None) and here I would want line 7 to return a result with In [7]: p._parse('0001')
Out[7]: (_result(day=1), None)
In [8]: p._parse('0100')
Out[8]: (_result(year=100), None) I recognize that I'm using the private API version of the parse function; however this is also what pandas does in its core (to enable picking off the resolution of the string specified). It has the additional use for me here of allowing me to convert the Thanks again for your help. |
@spencerkclark It looks like I'm not super happy with using the private |
Thanks @shoyer. It appears using the In [9]: p._parse('0001-01-16', yearfirst=True, dayfirst=False)
Out[9]: (_result(year=1, month=1, day=16), None)
In [10]: p._parse('0016-01-01', yearfirst=True, dayfirst=False)
Out[10]: (_result(year=16, month=1, day=1), None) but not the second: In [11]: p._parse('0100', yearfirst=True, dayfirst=False)
Out[11]: (_result(year=100), None)
In [12]: p._parse('0001', yearfirst=True, dayfirst=False)
Out[12]: (_result(day=1), None) So we might need to alter the logic a little bit if we continue with the
I'm not particularly tied to one date parsing approach over another (here I was just mimicking pandas). In an ideal world, what would be your preference? |
Honestly, it's probably more reliable to just parse ISO-8601 ourselves, which is intentionally pretty simple. That will give us complete control. Here's something simple I wrote to parse ISO-8601 using a regex: import re
def named(name, pattern):
return '(?P<' + name + '>' + pattern + ')'
def optional(x):
return '(?:' + x + ')?'
def trailing_optional(xs):
if not xs:
return ''
return xs[0] + optional(trailing_optional(xs[1:]))
def build_pattern(date_sep='\-', datetime_sep='T', time_sep='\:'):
pieces = [(None, 'year', '\d{4}'),
(date_sep, 'month', '\d{2}'),
(date_sep, 'day', '\d{2}'),
(datetime_sep, 'hour', '\d{2}'),
(time_sep, 'minute', '\d{2}'),
(time_sep, 'second', '\d{2}' + optional('\.\d+'))]
pattern_list = []
for sep, name, sub_pattern in pieces:
pattern_list.append((sep if sep else '') + named(name, sub_pattern))
# TODO: allow timezone offsets?
return '^' + trailing_optional(pattern_list) + '$'
def parse_iso8601(datetime_string):
basic_pattern = build_pattern(date_sep='', time_sep='')
extended_pattern = build_pattern()
patterns = [basic_pattern, extended_pattern]
for pattern in patterns:
match = re.match(pattern, datetime_string)
if match:
return match.groupdict()
raise ValueError('no ISO-8601 match for string: %s' % datetime_string)
# test cases
print parse_iso8601('1999')
print parse_iso8601('19990101')
print parse_iso8601('1999-01-01')
print parse_iso8601('1999-01-01T12')
print parse_iso8601('1999-01-01T12:34')
print parse_iso8601('1999-01-01T12:34:56.78')
print parse_iso8601('19990101T123456.78')
|
As for the implementation in your gist, it looks pretty solid to me. One possible concern is that we are overwriting It might also be worth testing this index in a pandas DataFrame/Series. I don't expect things to work 100% but it might be enough to be useful. |
@shoyer many thanks for the ISO-8601 date parser! That should be pretty straightforward to swap with the logic I use for date parsing in the gist, and will be much more robust.
Would it be safer to name the implementation of
Yes, this would be interesting to try. I'll test things out tomorrow and see how it goes. |
If you want to take this approach, name it something different and then use to parse inputs to One advantage of having our parser is that it would be pretty easy to extend as necessary, e.g., to handle negative years as specified by Wikipedia:
|
Sure thing -- for now I've left things as they are, but I'll take this advice if we decide otherwise.
I hadn't thought about that; I agree it would be nice to have that flexibility if needed. Just to indicate some more progress here, I've updated the gist with @shoyer's date parser, which eliminates issues for years less than 100 (thanks again!), and some additional modifications needed to add support for use in Series and DataFrame objects. I've started to work on writing some unit tests offline; if there is nothing glaring in the updated gist, I may post a work-in-progress PR in the next week or so, where we can discuss the finer details of the implementation of the NetCDFTimeIndex (and its tests), and how we might want to integrate it into xarray. Does that sound like a reasonable idea? |
@spencerkclark I'm looking forward to the PR! I understand that pandas wants a
see here for discussion: |
just my 2c here. You are going to end up writing a huge amount of code to re-implement essentially |
@jreback - Unless I've totally missed something, |
Is there a way to add a new calendar to |
@jhamman you just need a different frequency, in fact this one is pretty close: https://github.com/pandas-dev/pandas/blob/master/pandas/tseries/offsets.py#L2257 just a matter of defining a fixed-day month frequency (numpy has this by default anyhow). PeriodIndex would then happily take this. |
@shoyer @jhamman I will defer to both of you on this issue. In light of recent discussion, what do you think is the preferred approach? It seems like three avenues have been discussed (here and in pandas-dev/pandas#15258):
It's not immediately obvious to me that avenues 2 and 3 would be clean and straightforward, but if there is a way to easily adapt them for custom calendars — even unusual ones like "all-leap" (29-day Februaries every year) and "360-day" (30-day Februaries every year) calendars, which cannot be fully represented by ordinary Are we back to the drawing board, or should we continue along the path of avenue 1? |
@spencerkclark I still think subclassing pandas.Index is the best path forward. Subclassing DatetimeIndex is a non-starter because it presumes the use of datetime64 internally. Subclassing PeriodIndex is also potentially viable, but would require hooking pretty deeply into pandas's internal machinery with an API that isn't really designed to be extended externally. For example, frequency conversion is done in C. From a data model perspective, it might work to use custom frequencies for different calendars, but it's not the cleanest abstraction -- really the frequency and the calendar are two separate notions. From a practical perspective, it's also less promising because it would require writing much of the logic from netcdftime to reimplement arithmetic and so on. And there would still be issues with datetime strings and converting back and forth to datetimes. The downside of using pandas.Index is that resampling won't work out of the box. But we can work around that in xarray, and potentially even in pandas if we add an a simple dispatching mechanism into |
@spencerahill as I said above, you should not need to subclass at all, just define a new frequency, maybe something like |
Not sure if this is related, but pandas commit pandas-dev/pandas@2310faa triggers xarray issue #1661 . Not sure if there exists an easy workaround for that one. |
Re: #789. The consensus is that upstream fixes in Pandas are not coming anytime soon, and there is an acute need amongst many xarray users for a workaround in the meantime. There are two separate issues: (1) date-range limitations due to nanosecond precision, and (2) support for non-standard calendars.
@shoyer, @jhamman , @spencerkclark, @darothen, and I briefly discussed offline a potential workaround that I am (poorly) summarizing here, with hope that others will correct/extend my snippet.
The idea is to extend either PeriodIndex or (more involved but potentially more robust) Int64Index, either through subclassing or composition, to implement all of the desired functionality: slicing, resampling, groupby, and serialization.
For reference, @spencerkclark nicely summarized the limitations of PeriodIndex and the netCDF4.datetime objects, which are often used as workarounds currently: spencerahill/aospy#98 (comment)
The text was updated successfully, but these errors were encountered: