Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate offset string handling into factory _from_str #9709

Closed

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Nov 17, 2021

Currently, when freq is a string in date_range, we first map the frequency part to time unit codes, then again map them to DateOffset object via _from_freqstr. However, freqstr relates to frequency string in pandas, and does not always 1-1 map to time unit codes.

This PR proposes to unify frequency string handling into factory DateOffset._from_str. This function performs 1 regex matching to extract the frequency part and polls through possible sets of abbreviations in _offset_alias_to_codes and _CODES_TO_UNITS to find matches.

Besides, a test case is added for date_range make sure unsupported frequency types are raised with sensible error message.

@isVoid isVoid requested a review from a team as a code owner November 17, 2021 00:50
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 17, 2021
@isVoid isVoid self-assigned this Nov 17, 2021
@isVoid isVoid added improvement Improvement / enhancement to an existing function 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Nov 17, 2021
@codecov
Copy link

codecov bot commented Nov 17, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.02@31f92d7). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head a4d6307 differs from pull request most recent head 561caca. Consider uploading reports for the commit 561caca to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-22.02    #9709   +/-   ##
===============================================
  Coverage                ?   10.47%           
===============================================
  Files                   ?      119           
  Lines                   ?    20334           
  Branches                ?        0           
===============================================
  Hits                    ?     2130           
  Misses                  ?    18204           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31f92d7...561caca. Read the comment docs.

Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question otherwise lgtm

Copy link
Contributor Author

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on... pandas date_range considers lower case m as "month end frequency" but the current change may make cudf recognize that as "minute frequency".

>>> import cudf, pandas as pd
>>> pd.date_range('2000-01-01', '2000-02-01', freq='m')
DatetimeIndex(['2000-01-31'], dtype='datetime64[ns]', freq='M')
>>> cudf.date_range('2000-01-01', '2000-02-01', freq='m')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
               '2000-01-01 00:02:00', '2000-01-01 00:03:00',
               '2000-01-01 00:04:00', '2000-01-01 00:05:00',
               '2000-01-01 00:06:00', '2000-01-01 00:07:00',
               '2000-01-01 00:08:00', '2000-01-01 00:09:00',
               ...
               '2000-01-31 23:50:00', '2000-01-31 23:51:00',
               '2000-01-31 23:52:00', '2000-01-31 23:53:00',
               '2000-01-31 23:54:00', '2000-01-31 23:55:00',
               '2000-01-31 23:56:00', '2000-01-31 23:57:00',
               '2000-01-31 23:58:00', '2000-01-31 23:59:00'],
              dtype='datetime64[ns]', length=44640)

Investigating..

@isVoid
Copy link
Contributor Author

isVoid commented Nov 18, 2021

Synced with @brandon-b-miller offline. _from_str will reject m input since it's ambiguous. A new test case is added for date_range to make sure unsupported types are properly rejected and sensible error is raised.

Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by the handling of L / ms. @isVoid let me know if I'm overlooking something here.

python/cudf/cudf/tests/test_datetime.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_datetime.py Outdated Show resolved Hide resolved
x in freq.upper()
for x in {"Y", "A", "Q", "B", "SM", "SMS", "CBMS", "M"}
)
or "MS" in freq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like L, ms should be supported aliases. They're not listed in the docstring on line 765-766, and it looks like 3ms would raise an error here because it contains M when converted to uppercase? I feel like I'm missing something, since those values were added to _offset_alias_to_code in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - the issue you pointed here is valid. The plan was to allow ms but reject M and MS.

Copy link
Contributor Author

@isVoid isVoid Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: ms is currently not tested in date_range tests because the tool we used to process datetimes from pandas does not support ms yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved away from checking the unsupported ones explicitly here - because that's basically just creating the complement set of supported ones in _from_str. Instead, I'm wrapping _from_str in try-except blocks.

python/cudf/cudf/core/tools/datetimes.py Outdated Show resolved Hide resolved
@isVoid isVoid requested a review from bdice November 24, 2021 03:57
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions about supported strings. String parsing is always complicated -- if you think my suggestions are out of scope, we can merge this as-is and file an issue to document the edge cases I listed.

@@ -458,7 +459,7 @@ class DateOffset:
"Y": "years",
}

_FREQSTR_REGEX = re.compile("([0-9]*)([a-zA-Z]+)")
_FREQSTR_REGEX = re.compile("(-)*([0-9]*)([a-zA-Z]+)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this might be the regex pandas uses, which handles a lot of cases this regex may miss. (This regex might not be for exactly the same purpose, but it demonstrates the types of cases we might need to handle.)

r"([+\-]?\d*|[+\-]?\d*\.\d*)\s*([A-Za-z]+([\-][\dA-Za-z\-]+)?)"
https://github.com/pandas-dev/pandas/blob/8fefaa5a9a7c3f3a1c35c36c1140117dab73c9c7/pandas/_libs/tslibs/offsets.pyx#L3506-L3508

Do we need to support frequencies like +10ms? What about spaces in between like 3 h? I'd check these cases against pandas to see what is supported -- or find the exact regex that pandas uses to make sure we hit the same cases.

Suggested change
_FREQSTR_REGEX = re.compile("(-)*([0-9]*)([a-zA-Z]+)")
_FREQSTR_REGEX = re.compile("([+\-])?(\d*)\s*([A-Za-z]+)")

raise ValueError(f"Cannot interpret frequency str: {freqstr}")

return cls(**{cls._CODES_TO_UNITS[freq_part]: int(numeric_part)})
# Handle sign and numerics
sign = -1 if sign_part else 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed for my other regex suggestion above.

Suggested change
sign = -1 if sign_part else 1
sign = -1 if sign_part.startswith("-") else 1

Comment on lines +850 to +851
if "M" in freq or "Y" in freq.upper():
raise e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if clause is a bit of an awkward special-case for supporting ms. If the DateOffset._from_str will raise a ValueError here, I would just remove the if: raise logic that exits early here and let it be handled in the except block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DateOffset._from_str does not raise for M and Y. Which is why it needs to get handled ahead here. It feels like _from_str can only handle so many common paths. M means months for a DateOffset but "Months End" in date_range context. I meant to support constructing months with string like 1M in #9710 but reject it here. The two functions handles M in conflicting meanings so this is as far as I can do to consolidate the logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment that clarifies which paths you're treating as special cases here? Something that specifies how DateOffset and date_range differ in functionality and that you're intending to catch month/year with this statement.

@isVoid
Copy link
Contributor Author

isVoid commented Dec 4, 2021

After speaking with @bdice offline I was convinced pd.tseries.frequencies.to_offset is a better option if we want to convert a freqstr into a pd.DateOffset object. Thus I'm closing this and I'm looking to replace custom regex in date_range with the above.

Superseded by #9843

@isVoid isVoid closed this Dec 4, 2021
isVoid added a commit to isVoid/cudf that referenced this pull request Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants