-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate offset string handling into factory _from_str
#9709
Consolidate offset string handling into factory _from_str
#9709
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9709 +/- ##
===============================================
Coverage ? 10.47%
===============================================
Files ? 119
Lines ? 20334
Branches ? 0
===============================================
Hits ? 2130
Misses ? 18204
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question otherwise lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hang on... pandas date_range
considers lower case m
as "month end frequency" but the current change may make cudf recognize that as "minute frequency".
>>> import cudf, pandas as pd
>>> pd.date_range('2000-01-01', '2000-02-01', freq='m')
DatetimeIndex(['2000-01-31'], dtype='datetime64[ns]', freq='M')
>>> cudf.date_range('2000-01-01', '2000-02-01', freq='m')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
'2000-01-01 00:02:00', '2000-01-01 00:03:00',
'2000-01-01 00:04:00', '2000-01-01 00:05:00',
'2000-01-01 00:06:00', '2000-01-01 00:07:00',
'2000-01-01 00:08:00', '2000-01-01 00:09:00',
...
'2000-01-31 23:50:00', '2000-01-31 23:51:00',
'2000-01-31 23:52:00', '2000-01-31 23:53:00',
'2000-01-31 23:54:00', '2000-01-31 23:55:00',
'2000-01-31 23:56:00', '2000-01-31 23:57:00',
'2000-01-31 23:58:00', '2000-01-31 23:59:00'],
dtype='datetime64[ns]', length=44640)
Investigating..
…pes in date_range
Synced with @brandon-b-miller offline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by the handling of L / ms. @isVoid let me know if I'm overlooking something here.
x in freq.upper() | ||
for x in {"Y", "A", "Q", "B", "SM", "SMS", "CBMS", "M"} | ||
) | ||
or "MS" in freq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like L
, ms
should be supported aliases. They're not listed in the docstring on line 765-766, and it looks like 3ms
would raise an error here because it contains M
when converted to uppercase? I feel like I'm missing something, since those values were added to _offset_alias_to_code
in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - the issue you pointed here is valid. The plan was to allow ms
but reject M
and MS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: ms
is currently not tested in date_range
tests because the tool we used to process datetimes from pandas does not support ms
yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved away from checking the unsupported ones explicitly here - because that's basically just creating the complement set of supported ones in _from_str
. Instead, I'm wrapping _from_str
in try-except blocks.
Co-authored-by: Bradley Dice <[email protected]>
…cudf into improvements/offset_str_handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions about supported strings. String parsing is always complicated -- if you think my suggestions are out of scope, we can merge this as-is and file an issue to document the edge cases I listed.
@@ -458,7 +459,7 @@ class DateOffset: | |||
"Y": "years", | |||
} | |||
|
|||
_FREQSTR_REGEX = re.compile("([0-9]*)([a-zA-Z]+)") | |||
_FREQSTR_REGEX = re.compile("(-)*([0-9]*)([a-zA-Z]+)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this might be the regex pandas uses, which handles a lot of cases this regex may miss. (This regex might not be for exactly the same purpose, but it demonstrates the types of cases we might need to handle.)
r"([+\-]?\d*|[+\-]?\d*\.\d*)\s*([A-Za-z]+([\-][\dA-Za-z\-]+)?)"
https://github.com/pandas-dev/pandas/blob/8fefaa5a9a7c3f3a1c35c36c1140117dab73c9c7/pandas/_libs/tslibs/offsets.pyx#L3506-L3508
Do we need to support frequencies like +10ms
? What about spaces in between like 3 h
? I'd check these cases against pandas to see what is supported -- or find the exact regex that pandas uses to make sure we hit the same cases.
_FREQSTR_REGEX = re.compile("(-)*([0-9]*)([a-zA-Z]+)") | |
_FREQSTR_REGEX = re.compile("([+\-])?(\d*)\s*([A-Za-z]+)") |
raise ValueError(f"Cannot interpret frequency str: {freqstr}") | ||
|
||
return cls(**{cls._CODES_TO_UNITS[freq_part]: int(numeric_part)}) | ||
# Handle sign and numerics | ||
sign = -1 if sign_part else 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed for my other regex suggestion above.
sign = -1 if sign_part else 1 | |
sign = -1 if sign_part.startswith("-") else 1 |
if "M" in freq or "Y" in freq.upper(): | ||
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if
clause is a bit of an awkward special-case for supporting ms
. If the DateOffset._from_str
will raise a ValueError
here, I would just remove the if: raise
logic that exits early here and let it be handled in the except block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DateOffset._from_str
does not raise for M
and Y
. Which is why it needs to get handled ahead here. It feels like _from_str
can only handle so many common paths. M
means months for a DateOffset but "Months End" in date_range
context. I meant to support constructing months with string like 1M
in #9710 but reject it here. The two functions handles M
in conflicting meanings so this is as far as I can do to consolidate the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment that clarifies which paths you're treating as special cases here? Something that specifies how DateOffset and date_range differ in functionality and that you're intending to catch month/year with this statement.
Currently, when
freq
is a string indate_range
, we first map the frequency part to time unit codes, then again map them toDateOffset
object via_from_freqstr
. However,freqstr
relates to frequency string in pandas, and does not always 1-1 map to time unit codes.This PR proposes to unify frequency string handling into factory
DateOffset._from_str
. This function performs 1 regex matching to extract the frequency part and polls through possible sets of abbreviations in_offset_alias_to_codes
and_CODES_TO_UNITS
to find matches.Besides, a test case is added for
date_range
make sure unsupported frequency types are raised with sensible error message.