Add support for DatetimeTZDtype and tz_localize #13163

shwina · 2023-04-18T15:31:04Z

Description

TBD

Quick benchmark:

psr = pd.Series(pd.date_range("1970-01-01", "1980-01-01", freq="1T"))
sr = cudf.from_pandas(psr)

In [6]: %timeit psr.dt.tz_localize("America/New_York", ambiguous="NaT", nonexistent="NaT")
1.55 s ± 9.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit sr.dt.tz_localize("America/New_York", ambiguous="NaT", nonexistent="NaT")
9.98 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
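
For readers unfamiliar with the two options used in the benchmark: `ambiguous="NaT"` and `nonexistent="NaT"` request a missing value wherever a wall-clock time occurs twice (clocks moved back) or never (clocks moved forward). The following is a minimal pure-Python sketch of those semantics using the standard-library `zoneinfo` module. The helper names `classify_wall_time` and `localize_or_none` are hypothetical, `None` stands in for `NaT`, and this is not a description of how cuDF or pandas implement the operation.

```python
from datetime import datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo


def classify_wall_time(naive: datetime, tz: ZoneInfo) -> str:
    """Return "ok", "ambiguous", or "nonexistent" for a naive wall-clock time."""
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    if first.utcoffset() == second.utcoffset():
        # Both interpretations agree, so the time is not near a transition.
        return "ok"
    # The offsets differ only inside a gap or a fold. A wall time that really
    # occurred survives a round trip through UTC; a time inside a gap does not.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) != naive:
        return "nonexistent"
    return "ambiguous"


def localize_or_none(naive: datetime, tz_name: str) -> Optional[datetime]:
    """Attach a time zone, or return None (a stand-in for NaT) when unsafe."""
    tz = ZoneInfo(tz_name)
    if classify_wall_time(naive, tz) != "ok":
        return None
    return naive.replace(tzinfo=tz)
```

Under this sketch, `localize_or_none(datetime(2023, 3, 12, 2, 30), "America/New_York")` falls inside the spring-forward gap and `datetime(2023, 11, 5, 1, 30)` falls inside the autumn fold, so both would come back as `None`, provided the system has time zone data installed.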

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

This reverts commit 7c9f66e.

mroeschke · 2023-04-18T16:35:27Z

python/cudf/cudf/core/index.py

-        nanos = self._values.astype("datetime64[ns]")
+        if isinstance(self.dtype, pd.DatetimeTZDtype):
+            nanos = self._values.astype(
+                pd.DatetimeTZDtype("ns", self.dtype.tz)


As a pandas 2.0 note, would be good to keep "s", "ms", or "us" resolution here

Since this currently targets branch-23.06, which doesn't have pandas 2.0 support, this change seems okay. What @mroeschke is suggesting is already taken care of in pandas_2.0_feature_branch, so perhaps a note or TODO here would be useful while I sync these changes with pandas 2.0 support.

python/cudf/cudf/core/column/datetime.py

python/cudf/cudf/core/_internals/timezones.py

python/cudf/cudf/_lib/column.pyx

python/cudf/cudf/core/_internals/timezones.py

python/cudf/cudf/core/column/datetime.py

python/cudf/cudf/core/index.py

bdice · 2023-04-18T17:26:53Z

python/cudf/cudf/tests/series/test_datetimelike.py

+
+
+@pytest.fixture(
+    params=["America/New_York", "Asia/Tokyo", "CET", "Etc/GMT+1", "UTC"]


I'd like to see if we can test every time zone present on the system against pandas behavior for some function(s).

I've updated to include all time zones in the localize tests (except the two problematic ones indicated).

Note that we likely cannot do this for each and every timezone operation (e.g., in the future when we add tz_convert), given the sheer increase in test runtime.

Yes, this is a good solution. I think testing all timezones for a very limited subset of functions is a helpful way to make sure we're not making bad assumptions about the timezone database structure which may change over time or by system.

Co-authored-by: Bradley Dice <[email protected]>

mroeschke

Nice! My comments have been addressed

shwina · 2023-04-21T12:46:28Z

This PR is blocked on dropping Python 3.8 and adding Python 3.9 support in cuDF (the zoneinfo module was introduced in Python 3.9)

bdice · 2023-04-23T20:17:33Z

Waiting for #13196.

bdice

Mostly minor comments. I'll approve once you've taken a look and addressed some of them. This is great work -- nice job @shwina.

docs/cudf/source/api_docs/index_objects.rst

python/cudf/cudf/core/_internals/timezones.py

bdice · 2023-04-26T19:37:14Z

python/cudf/cudf/core/column/datetime.py

+        )
+
+    @property
+    def values(self):


Should this property error implementation be inherited from DateTimeColumn? Or if you think it's possible that CuPy might support time-zone-naive data (the parent class but not this zone-aware class), should we mention "time zone" in the error message?

Nope -- you're right, we should just inherit this
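
The resolution above (drop the override and rely on the parent) can be sketched with two stand-in classes. `BaseDatetimeColumn` and `TZAwareColumn` are hypothetical names used only for illustration, and both the exception type and the message are assumptions; the actual cuDF classes are not reproduced here.

```python
class BaseDatetimeColumn:
    """Hypothetical stand-in for the time-zone-naive parent column."""

    @property
    def values(self):
        # The parent already refuses the conversion, so a subclass that
        # wants the same behaviour does not need to repeat this property.
        raise NotImplementedError(
            "datetime columns cannot be exposed as device arrays"
        )


class TZAwareColumn(BaseDatetimeColumn):
    """Hypothetical time-zone-aware subclass with no `values` override."""

    def __init__(self, tz: str):
        self.tz = tz
```

Accessing `TZAwareColumn("UTC").values` then raises the parent's error without any duplicated code in the subclass.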

bdice · 2023-04-26T19:39:34Z

python/cudf/cudf/core/index.py

+
+        Parameters
+        ----------
+        tz: str


I think there's usually supposed to be a space on both sides of the colon here.

The colon must be preceded by a space, or omitted if the type is absent.

https://numpydoc.readthedocs.io/en/latest/format.html#parameters

Suggested change

tz: str

tz : str
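
As a small illustration of the numpydoc rule quoted above, here is a docstring in the corrected form together with a rough checker that flags parameter lines missing the space before the colon. The function `tz_localize_stub` and the helper `find_tight_colons` are hypothetical, and the regular expression is a deliberate simplification, not a full numpydoc validator (it would also flag ordinary prose lines such as "Note: ...").

```python
import inspect
import re


def tz_localize_stub(tz):
    """
    Illustrative stub only.

    Parameters
    ----------
    tz : str
        Name of the target time zone.
    """
    return tz


# Matches an unindented line of the form "name: type" (no space before ":").
_TIGHT_COLON = re.compile(r"^\w+:\s*\S")


def find_tight_colons(docstring: str) -> list:
    """Return docstring lines that look like "name: type" parameter entries."""
    lines = inspect.cleandoc(docstring).splitlines()
    return [line for line in lines if _TIGHT_COLON.match(line)]
```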

python/cudf/cudf/core/series.py

bdice · 2023-04-26T19:44:51Z

python/cudf/cudf/tests/series/test_datetimelike.py

+
+def _get_all_zones():
+    zones = []
+    for root, dirs, files in os.walk("/usr/share/zoneinfo"):


Just FYI, we ran into issues with /etc/localtime not existing in rapidsai/miniforge-cuda#16. The relation here is that I'm not 100% sure if you can assume /usr/share/zoneinfo is populated with data in super-minimalist Docker containers. What this means is that the result of this function could be an empty list. If you want that to raise an error/warning, you might want to add something explicit for that. I'm content with the current behavior or a warning. I might lean away from raising an error, but I'm not sure.

Do we want the tests to fail if that happens? I'd think not as we can't always control the environment the tests run in. If my understanding is correct, throwing either a warning or an error will cause tests to fail.

Right now, it looks like the empty list will cause these tests to be skipped which seems like the desired behaviour. We could do something like this to be more explicit about skipping (but I don't like it):

    if not len(ALL_TIME_ZONES):
        ALL_TIME_ZONES = [
            pytest.param(
                None,
                marks=pytest.mark.skip(reason="Missing /usr/share/zoneinfo/"),
            )
        ]

No strong feelings. Let's leave it as-is and not overengineer it just to get a skip message.
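
The behaviour discussed in this thread (an absent or empty zoneinfo directory simply yields an empty list) can be sketched as follows. `list_zone_names` is a hypothetical helper and not the PR's `_get_all_zones`; only the `os.walk` traversal and the `/usr/share/zoneinfo` default are taken from the diff above.

```python
import os


def list_zone_names(root: str = "/usr/share/zoneinfo") -> list:
    """Collect file paths under `root` as zone-style names."""
    zones = []
    # os.walk yields nothing when `root` does not exist (errors are ignored
    # by default), so a missing directory produces an empty list, not an error.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(dirpath, name)
            # Store names relative to root, e.g. "America/New_York".
            relative = os.path.relpath(full_path, root)
            zones.append(relative.replace(os.sep, "/"))
    return sorted(zones)
```

A real zoneinfo directory typically also holds non-zone files (for example `zone.tab`), so a test suite would still need to filter the result before treating every entry as a time zone name.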

bdice · 2023-04-26T19:47:48Z

python/cudf/cudf/tests/series/test_datetimelike.py

+
+
+@pytest.fixture(
+    params=["America/New_York", "Asia/Tokyo", "CET", "Etc/GMT+1", "UTC"]


Yes, this is a good solution. I think testing all timezones for a very limited subset of functions is a helpful way to make sure we're not making bad assumptions about the timezone database structure which may change over time or by system.

bdice · 2023-04-26T19:49:53Z

python/cudf/cudf/tests/series/test_datetimelike.py

+    request.applymarker(
+        pytest.mark.xfail(
+            condition=(zone_name == "America/Grand_Turk"),
+            reason="https://www.worldtimezone.com/dst_news/dst_news_turkscaicos03.html",  # noqa: E501


Were we able to identify if there is an inconsistency somewhere for Grand Turk and Metlakatla? Perhaps disagreements between system-provided timezone information and some timezone database in a Python package?

This might also be a candidate for skipping instead of xfailing, if we run tests with xfail_strict. This sounds like a case where we could get an unexpected success (which would cause an error) after Python or system software updates, particularly if my hypothesis above is true.

I honestly haven't gotten to the bottom of it. If we ever do hit an unexpected success here, e.g., if something changes in either zoneinfo or pytz, I would love to be alerted of that! So, should we keep the xfail?

Sounds good to me!

Co-authored-by: Bradley Dice <[email protected]>

bdice · 2023-04-26T22:14:05Z

python/cudf/cudf/core/column/__init__.py

@@ -23,6 +23,7 @@
    serialize_columns,
 )
 from cudf.core.column.datetime import DatetimeColumn  # noqa: F401
+from cudf.core.column.datetime import DatetimeTZColumn  # noqa: F401


FYI: you can fix these F401 "unused import" errors by defining __all__. If it's in __all__, it's "part of the API" of this module and thus flake8 sees it as used.

Isn't this more explicit though? I always thought the only "official" use for __all__ was to define what happens when you do from module import *

I think that defining your module’s public API (rather than not defining it) is the most explicit approach. I’m pro-__all__ in Python libraries.

"I think that defining your module’s public API (rather than not defining it) is the most explicit approach."

Hmm, but __all__ doesn't do that AFAICT. Placing a name in __all__ doesn't preclude it from appearing in e.g., tab completion options (naming it with a leading _ does).

Are there other tools than flake8 that do something with __all__?

I think type checkers (mypy, pyright) will utilize __all__ to see what APIs are "public"

Also, to your question about tab completion in IPython/etc., that's discussed here: ipython/ipykernel#129 (comment)

The community seems split on the issue, but some folks seem to think limiting to __all__ by default would be good. Either way -- it seems that __all__ is viewed as a source of truth for public APIs, and the real question is whether autocompletion should only show public APIs.

It's configurable, with the option IPCompleter.limit_to__all__ (config options docs).

The reason we do this is that we often want to use IPython to poke about and debug modules, so we like to see what's really there, not just what's whitelisted as public API. We should, however, do a better job of prioritising completions, so that things in __all__ come before things that aren't in it.

Thanks, both! I'm on board with defining __all__

I wonder if we can get a linter to enforce that __all__ is defined (even if empty) always.
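
To make the "official" effect of `__all__` raised in this thread concrete, the sketch below builds a throwaway module in memory and shows which names a star import picks up. The module name `demo_mod` and its contents are hypothetical and exist only for this illustration.

```python
import sys
import types

source = '''
__all__ = ["public_fn"]

def public_fn():
    return "public"

def helper():
    return "not exported"
'''

# Build an in-memory module so the example needs no files on disk.
demo_mod = types.ModuleType("demo_mod")
exec(source, demo_mod.__dict__)
sys.modules["demo_mod"] = demo_mod

# Run a star import into a fresh namespace and inspect what arrived.
namespace = {}
exec("from demo_mod import *", namespace)

exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)  # ['public_fn']
```

Note that `demo_mod.helper` is still reachable by attribute access; `__all__` only controls the star import here, which matches the point made above that it does not hide names the way a leading underscore does.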

bdice · 2023-04-26T22:14:48Z

python/cudf/cudf/core/column/column.py

@@ -1554,6 +1554,17 @@ def build_column(
            offset=offset,
            null_count=null_count,
        )
+    elif is_datetime64tz_dtype(dtype):
+        if data is None:
+            raise TypeError("Must specify data buffer")


ValueError seems more natural here unless you're matching something in pandas.

Suggested change

raise TypeError("Must specify data buffer")

raise ValueError("Must specify data buffer")

This is an issue in other places in this method and I just wanted to be consistent. Agree though that ValueError is the more appropriate error type. Let's fix in a follow-up.

python/cudf/cudf/core/column/datetime.py

bdice

Great work. I had a couple small comments/replies but nothing blocking. Thanks.

shwina · 2023-04-27T16:17:55Z

/merge

shwina and others added 15 commits April 18, 2023 11:17

Raw bindings

c2b1968

move declarations to /include

fb293c0

Renames

3457063

Remove extra files

6210054

Final touches

cf67f8b

Revert "move declarations to /include"

82ec81e

This reverts commit 7c9f66e.

Add datetimetzdtype

714116d

tmp

3a09e3d

First pass at localize

8ed31eb

Fix up and add tests for localize

e72b3fc

Add tests for nonexistent

d05e9e0

Improvements to localize

5b2c990

Docs

df3f7bc

Styles

a4f6430

More style

8f4b215

shwina requested a review from a team as a code owner April 18, 2023 15:31

shwina requested review from mroeschke and isVoid April 18, 2023 15:31

github-actions bot added the "Python" label (Affects Python cuDF API) Apr 18, 2023

shwina mentioned this pull request Apr 18, 2023

[Story] Timezone-aware datetime support #12813

Closed

shwina added the "feature request" (New feature or request) and "non-breaking" (Non-breaking change) labels Apr 18, 2023

Use keywords

b6ec602

mroeschke reviewed Apr 18, 2023

python/cudf/cudf/core/column/datetime.py (outdated)

mroeschke reviewed Apr 18, 2023

python/cudf/cudf/core/_internals/timezones.py

bdice reviewed Apr 18, 2023

shwina added 3 commits April 19, 2023 14:00

Run localize tests for all time zones

54dc4d5

Delocalize tests and index tests

fad1c6a

Use assume_timezone, not cast

4e89a5c

shwina and others added 3 commits April 19, 2023 15:27

Apply suggestions from code review

a473e51

Co-authored-by: Bradley Dice <[email protected]>

Doc

1d18fcd

Apply suggestions from code review

99cc4b0

Co-authored-by: Bradley Dice <[email protected]>

shwina requested review from bdice and mroeschke April 19, 2023 19:36

Merge branch 'add-tz-localize' of github.com:shwina/cudf into add-tz-…

6982474

…localize

mroeschke approved these changes Apr 19, 2023

bdice added the "0 - Blocked" label (Cannot progress due to external reasons) Apr 23, 2023

shwina and others added 6 commits April 25, 2023 09:01

Merge branch 'branch-23.06' into add-tz-localize

b7563b8

Merge branch 'branch-23.06' of github.com:rapidsai/cudf into add-tz-l…

a1141d8

…ocalize

Merge branch 'add-tz-localize' of github.com:shwina/cudf into add-tz-…

fc5f497

…localize

Not all platforms have factory.

22040ce

Use Pandas directly to test if we can work with a TZ

9e0f922

Merge branch 'branch-23.06' into add-tz-localize

cfe5005

bdice reviewed Apr 26, 2023

shwina and others added 3 commits April 26, 2023 16:19

Apply suggestions from code review

dcf49f8

Co-authored-by: Bradley Dice <[email protected]>

Apply review suggestions

8cd32f8

Merge branch 'add-tz-localize' of github.com:shwina/cudf into add-tz-…

94aae6e

…localize

shwina requested a review from bdice April 26, 2023 20:21

bdice reviewed Apr 26, 2023

galipremsagar reviewed Apr 26, 2023

python/cudf/cudf/core/column/datetime.py

bdice approved these changes Apr 26, 2023

Type metadata handling

ffbcf42

galipremsagar approved these changes Apr 27, 2023

rapids-bot bot merged commit e13dfec into rapidsai:branch-23.06 Apr 27, 2023
