Consolidate offset string handling into factory `_from_str` #9709

isVoid · 2021-11-17T00:50:01Z

Currently, when freq is a string in date_range, we first map the frequency part to time unit codes, then again map them to DateOffset object via _from_freqstr. However, freqstr relates to frequency string in pandas, and does not always 1-1 map to time unit codes.

This PR proposes to unify frequency string handling into factory DateOffset._from_str. This function performs 1 regex matching to extract the frequency part and polls through possible sets of abbreviations in _offset_alias_to_codes and _CODES_TO_UNITS to find matches.

Besides, a test case is added for date_range make sure unsupported frequency types are raised with sensible error message.

codecov · 2021-11-17T03:51:07Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.02@31f92d7). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head a4d6307 differs from pull request most recent head 561caca. Consider uploading reports for the commit 561caca to get more accurate results

@@               Coverage Diff               @@
##             branch-22.02    #9709   +/-   ##
===============================================
  Coverage                ?   10.47%           
===============================================
  Files                   ?      119           
  Lines                   ?    20334           
  Branches                ?        0           
===============================================
  Hits                    ?     2130           
  Misses                  ?    18204           
  Partials                ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31f92d7...561caca. Read the comment docs.

python/cudf/cudf/core/tools/datetimes.py

brandon-b-miller

One question otherwise lgtm

isVoid

Hang on... pandas date_range considers lower case m as "month end frequency" but the current change may make cudf recognize that as "minute frequency".

>>> import cudf, pandas as pd
>>> pd.date_range('2000-01-01', '2000-02-01', freq='m')
DatetimeIndex(['2000-01-31'], dtype='datetime64[ns]', freq='M')
>>> cudf.date_range('2000-01-01', '2000-02-01', freq='m')
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
               '2000-01-01 00:02:00', '2000-01-01 00:03:00',
               '2000-01-01 00:04:00', '2000-01-01 00:05:00',
               '2000-01-01 00:06:00', '2000-01-01 00:07:00',
               '2000-01-01 00:08:00', '2000-01-01 00:09:00',
               ...
               '2000-01-31 23:50:00', '2000-01-31 23:51:00',
               '2000-01-31 23:52:00', '2000-01-31 23:53:00',
               '2000-01-31 23:54:00', '2000-01-31 23:55:00',
               '2000-01-31 23:56:00', '2000-01-31 23:57:00',
               '2000-01-31 23:58:00', '2000-01-31 23:59:00'],
              dtype='datetime64[ns]', length=44640)

Investigating..

…pes in date_range

isVoid · 2021-11-18T01:15:35Z

Synced with @brandon-b-miller offline. _from_str will reject m input since it's ambiguous. A new test case is added for date_range to make sure unsupported types are properly rejected and sensible error is raised.

brandon-b-miller

LGTM.

bdice

I'm a little confused by the handling of L / ms. @isVoid let me know if I'm overlooking something here.

python/cudf/cudf/tests/test_datetime.py

bdice · 2021-11-22T19:03:30Z

python/cudf/cudf/core/tools/datetimes.py

+                x in freq.upper()
+                for x in {"Y", "A", "Q", "B", "SM", "SMS", "CBMS", "M"}
+            )
+            or "MS" in freq


It looks like L, ms should be supported aliases. They're not listed in the docstring on line 765-766, and it looks like 3ms would raise an error here because it contains M when converted to uppercase? I feel like I'm missing something, since those values were added to _offset_alias_to_code in this PR.

Yeah - the issue you pointed here is valid. The plan was to allow ms but reject M and MS.

Side note: ms is currently not tested in date_range tests because the tool we used to process datetimes from pandas does not support ms yet.

I moved away from checking the unsupported ones explicitly here - because that's basically just creating the complement set of supported ones in _from_str. Instead, I'm wrapping _from_str in try-except blocks.

python/cudf/cudf/core/tools/datetimes.py

Co-authored-by: Bradley Dice <[email protected]>

…cudf into improvements/offset_str_handling

bdice

Some questions about supported strings. String parsing is always complicated -- if you think my suggestions are out of scope, we can merge this as-is and file an issue to document the edge cases I listed.

bdice · 2021-11-29T19:26:10Z

python/cudf/cudf/core/tools/datetimes.py

@@ -458,7 +459,7 @@ class DateOffset:
        "Y": "years",
    }

-    _FREQSTR_REGEX = re.compile("([0-9]*)([a-zA-Z]+)")
+    _FREQSTR_REGEX = re.compile("(-)*([0-9]*)([a-zA-Z]+)")


It looks like this might be the regex pandas uses, which handles a lot of cases this regex may miss. (This regex might not be for exactly the same purpose, but it demonstrates the types of cases we might need to handle.)

r"([+\-]?\d*|[+\-]?\d*\.\d*)\s*([A-Za-z]+([\-][\dA-Za-z\-]+)?)"
https://github.com/pandas-dev/pandas/blob/8fefaa5a9a7c3f3a1c35c36c1140117dab73c9c7/pandas/_libs/tslibs/offsets.pyx#L3506-L3508

Do we need to support frequencies like +10ms? What about spaces in between like 3 h? I'd check these cases against pandas to see what is supported -- or find the exact regex that pandas uses to make sure we hit the same cases.

Suggested change

_FREQSTR_REGEX = re.compile("(-)*([0-9]*)([a-zA-Z]+)")

_FREQSTR_REGEX = re.compile("([+\-])?(\d*)\s*([A-Za-z]+)")

bdice · 2021-11-29T19:27:46Z

python/cudf/cudf/core/tools/datetimes.py

            raise ValueError(f"Cannot interpret frequency str: {freqstr}")

-        return cls(**{cls._CODES_TO_UNITS[freq_part]: int(numeric_part)})
+        # Handle sign and numerics
+        sign = -1 if sign_part else 1


Needed for my other regex suggestion above.

Suggested change

sign = -1 if sign_part else 1

sign = -1 if sign_part.startswith("-") else 1

bdice · 2021-11-29T19:30:13Z

python/cudf/cudf/core/tools/datetimes.py

+        if "M" in freq or "Y" in freq.upper():
+            raise e


This if clause is a bit of an awkward special-case for supporting ms. If the DateOffset._from_str will raise a ValueError here, I would just remove the if: raise logic that exits early here and let it be handled in the except block.

DateOffset._from_str does not raise for M and Y. Which is why it needs to get handled ahead here. It feels like _from_str can only handle so many common paths. M means months for a DateOffset but "Months End" in date_range context. I meant to support constructing months with string like 1M in #9710 but reject it here. The two functions handles M in conflicting meanings so this is as far as I can do to consolidate the logic.

Can you add a comment that clarifies which paths you're treating as special cases here? Something that specifies how DateOffset and date_range differ in functionality and that you're intending to catch month/year with this statement.

isVoid · 2021-12-04T05:13:25Z

After speaking with @bdice offline I was convinced pd.tseries.frequencies.to_offset is a better option if we want to convert a freqstr into a pd.DateOffset object. Thus I'm closing this and I'm looking to replace custom regex in date_range with the above.

Superseded by #9843

isVoid added 2 commits November 16, 2021 15:26

consolidate freqstr handling

107c08a

.

25da96e

isVoid requested a review from a team as a code owner November 17, 2021 00:50

isVoid requested review from brandon-b-miller and marlenezw November 17, 2021 00:50

github-actions bot added the Python Affects Python cuDF API. label Nov 17, 2021

isVoid self-assigned this Nov 17, 2021

isVoid mentioned this pull request Nov 17, 2021

Add first and last method to IndexedFrame #9710

Merged

isVoid added improvement Improvement / enhancement to an existing function 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Nov 17, 2021

brandon-b-miller reviewed Nov 17, 2021

View reviewed changes

python/cudf/cudf/core/tools/datetimes.py Outdated Show resolved Hide resolved

brandon-b-miller reviewed Nov 17, 2021

View reviewed changes

isVoid commented Nov 17, 2021

View reviewed changes

reject ambiguous input and add raise error message for unsupported ty…

cbbc061

…pes in date_range

isVoid requested a review from brandon-b-miller November 18, 2021 01:16

brandon-b-miller approved these changes Nov 18, 2021

View reviewed changes

bdice requested changes Nov 22, 2021

View reviewed changes

isVoid and others added 3 commits November 23, 2021 18:45

Apply suggestions from code review

0208eaf

Co-authored-by: Bradley Dice <[email protected]>

do not explicitly test for unsupported types

b3491e8

Merge branch 'improvements/offset_str_handling' of github.com:isVoid/…

561caca

…cudf into improvements/offset_str_handling

isVoid requested a review from bdice November 24, 2021 03:57

bdice requested changes Nov 29, 2021

View reviewed changes

isVoid closed this Dec 4, 2021

isVoid added a commit to isVoid/cudf that referenced this pull request Dec 16, 2021

revert changes introduced from closed PR rapidsai#9709

1076a7c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate offset string handling into factory `_from_str` #9709

Consolidate offset string handling into factory `_from_str` #9709

isVoid commented Nov 17, 2021 •

edited

Loading

codecov bot commented Nov 17, 2021 •

edited

Loading

brandon-b-miller left a comment

isVoid left a comment

isVoid commented Nov 18, 2021

brandon-b-miller left a comment

bdice left a comment

bdice Nov 22, 2021

isVoid Nov 24, 2021

isVoid Nov 24, 2021 •

edited

Loading

isVoid Nov 24, 2021

bdice left a comment •

edited

Loading

bdice Nov 29, 2021

bdice Nov 29, 2021

bdice Nov 29, 2021

isVoid Nov 30, 2021

bdice Nov 30, 2021

isVoid commented Dec 4, 2021

	_FREQSTR_REGEX = re.compile("(-)([0-9])([a-zA-Z]+)")
	_FREQSTR_REGEX = re.compile("([+\-])?(\d)\s([A-Za-z]+)")

	sign = -1 if sign_part else 1
	sign = -1 if sign_part.startswith("-") else 1

Consolidate offset string handling into factory _from_str #9709

Consolidate offset string handling into factory _from_str #9709

Conversation

isVoid commented Nov 17, 2021 • edited Loading

codecov bot commented Nov 17, 2021 • edited Loading

Codecov Report

brandon-b-miller left a comment

Choose a reason for hiding this comment

isVoid left a comment

Choose a reason for hiding this comment

isVoid commented Nov 18, 2021

brandon-b-miller left a comment

Choose a reason for hiding this comment

bdice left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isVoid Nov 24, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bdice left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isVoid commented Dec 4, 2021

Consolidate offset string handling into factory `_from_str` #9709

Consolidate offset string handling into factory `_from_str` #9709

isVoid commented Nov 17, 2021 •

edited

Loading

codecov bot commented Nov 17, 2021 •

edited

Loading

isVoid Nov 24, 2021 •

edited

Loading

bdice left a comment •

edited

Loading