Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for DatetimeTZDtype and tz_localize #13163

Merged
merged 33 commits into from
Apr 27, 2023

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Apr 18, 2023

Description

TBD

Quick benchmark:

psr = pd.Series(pd.date_range("1970-01-01", "1980-01-01", freq="1T"))
sr = cudf.from_pandas(psr)

In [6]: %timeit psr.dt.tz_localize("America/New_York", ambiguous="NaT", nonexistent="NaT")
1.55 s ± 9.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit sr.dt.tz_localize("America/New_York", ambiguous="NaT", nonexistent="NaT")
9.98 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested a review from a team as a code owner April 18, 2023 15:31
@shwina shwina requested review from mroeschke and isVoid April 18, 2023 15:31
@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 18, 2023
@shwina shwina added feature request New feature or request non-breaking Non-breaking change labels Apr 18, 2023
nanos = self._values.astype("datetime64[ns]")
if isinstance(self.dtype, pd.DatetimeTZDtype):
nanos = self._values.astype(
pd.DatetimeTZDtype("ns", self.dtype.tz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a pandas 2.0 note, would be good to keep "s", "ms", or "us" resolution here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is currently targeting branch-23.06 which doesn't have pandas-2.0 support this change seems okay, what @mroeschke suggesting is taken care of in pandas_2.0_feature_branch, so perhaps a note or todo here would be useful while I sync these changes with pandas-2.0 support.

python/cudf/cudf/_lib/column.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/column.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/core/_internals/timezones.py Show resolved Hide resolved
python/cudf/cudf/core/column/datetime.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/index.py Show resolved Hide resolved


@pytest.fixture(
params=["America/New_York", "Asia/Tokyo", "CET", "Etc/GMT+1", "UTC"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see if we can test every time zone present on the system against pandas behavior for some function(s).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated to include all time zones in the localize tests (except the two problematic ones indicated).

Note that we likely cannot do this for each and every timezone operation (e.g., in the future when we add tz_convert), given the sheer increase in test runtime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a good solution. I think testing all timezones for a very limited subset of functions is a helpful way to make sure we're not making bad assumptions about the timezone database structure which may change over time or by system.

@shwina shwina requested review from bdice and mroeschke April 19, 2023 19:36
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! My comments have been addressed

@shwina
Copy link
Contributor Author

shwina commented Apr 21, 2023

This PR is blocked on dropping Python 3.8 and adding Python 3.9 support in cuDF (the zoneinfo module was introduced in Python 3.9)

@bdice bdice added the 0 - Blocked Cannot progress due to external reasons label Apr 23, 2023
@bdice
Copy link
Contributor

bdice commented Apr 23, 2023

Waiting for #13196.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments. I'll approve once you've taken a look and addressed some of them. This is great work -- nice job @shwina.

docs/cudf/source/api_docs/index_objects.rst Show resolved Hide resolved
python/cudf/cudf/core/_internals/timezones.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/_internals/timezones.py Show resolved Hide resolved
)

@property
def values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this property error implementation be inherited from DateTimeColumn? Or if you think it's possible that CuPy might support time-zone-naive data (the parent class but not this zone-aware class), should we mention "time zone" in the error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope -- you're right, we should just inherit this


Parameters
----------
tz: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's usually supposed to be a space on both sides of the colon here.

The colon must be preceded by a space, or omitted if the type is absent.

https://numpydoc.readthedocs.io/en/latest/format.html#parameters

Suggested change
tz: str
tz : str

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix'd

python/cudf/cudf/core/series.py Show resolved Hide resolved

def _get_all_zones():
zones = []
for root, dirs, files in os.walk("/usr/share/zoneinfo"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, we ran into issues with /etc/localtime not existing in rapidsai/miniforge-cuda#16. The relation here is that I'm not 100% sure if you can assume /usr/share/zoneinfo is populated with data in super-minimalist Docker containers. What this means is that the result of this function could be an empty list. If you want that to raise an error/warning, you might want to add something explicit for that. I'm content with the current behavior or a warning. I might lean away from raising an error, but I'm not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the tests to fail if that happens? I'd think not as we can't always control the environment the tests run in. If my understanding is correct, throwing either a warning or an error will cause tests to fail.

Right now, it looks like the empty list will cause these tests to be skipped which seems like the desired behaviour. We could do something like this to be more explicit about skipping (but I don't like it):

if not len(ALL_TIME_ZONES):
    ALL_TIME_ZONES = [
        pytest.param(
             None,
             marks=pytest.mark.skip(reason="Missing /usr/share/zoneinfo/")
        )
    ]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong feelings. Let's leave it as-is and not overengineer it just to get a skip message.



@pytest.fixture(
params=["America/New_York", "Asia/Tokyo", "CET", "Etc/GMT+1", "UTC"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a good solution. I think testing all timezones for a very limited subset of functions is a helpful way to make sure we're not making bad assumptions about the timezone database structure which may change over time or by system.

request.applymarker(
pytest.mark.xfail(
condition=(zone_name == "America/Grand_Turk"),
reason="https://www.worldtimezone.com/dst_news/dst_news_turkscaicos03.html", # noqa: E501
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were we able to identify if there is an inconsistency somewhere for Grand Turk and Metlakatla? Perhaps disagreements between system-provided timezone information and some timezone database in a Python package?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also might be a candidate for skipping instead of xfailing, if we run tests with xfail_strict. This sounds like something where we could get an unexpected success (which would cause an error) due to Python/system software updates ...particularly if my hypothesis above is true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly haven't gotten to the bottom of it. If we ever do hit an unexpected success here, e.g., if something changes in either zoneinfo or pytz, I would love to be alerted of that! So, should we keep the xfail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

@shwina shwina requested a review from bdice April 26, 2023 20:21
@@ -23,6 +23,7 @@
serialize_columns,
)
from cudf.core.column.datetime import DatetimeColumn # noqa: F401
from cudf.core.column.datetime import DatetimeTZColumn # noqa: F401
Copy link
Contributor

@bdice bdice Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: you can fix these F401 "unused import" errors by defining __all__. If it's in __all__, it's "part of the API" of this module and thus flake8 sees it as used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this more explicit though? I always thought the only "official" use for __all__ was to define what happens when you do from module import *

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that defining your module’s public API (rather than not defining it) is the most explicit approach. I’m pro-__all__ in Python libraries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think that defining your module’s public API (rather than not defining it) is the most explicit approach.

Hmm, but __all__ doesn't do that AFAICT. Placing a name in __all__ doesn't preclude it from appearing in e.g., tab completion options (naming it with a leading _ does).

Are there other tools than flake8 that do something with __all__?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think type checkers (mypy, pyright) will utilize __all__ to see what APIs are "public"

Copy link
Contributor

@bdice bdice Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to your question about tab completion in IPython/etc., that's discussed here: ipython/ipykernel#129 (comment)

The community seems split on the issue, but some folks seem to think limiting to __all__ by default would be good. Either way -- it seems that __all__ is viewed as a source of truth for public APIs, and the real question is whether autocompletion should only show public APIs.

It's configurable, with the option IPCompleter.limit_to__all__ (config options docs).

The reason we do this is that we often want to use IPython to poke about and debug modules, so we like to see what's really there, not just what's whitelisted as public API. We should, however, do a better job of prioritising completions, so that things in __all__ come before things that aren't in it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, both! I'm on board with defining __all__

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can get a linter to enforce that __all__ is defined (even if empty) always.

@@ -1554,6 +1554,17 @@ def build_column(
offset=offset,
null_count=null_count,
)
elif is_datetime64tz_dtype(dtype):
if data is None:
raise TypeError("Must specify data buffer")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueError seems more natural here unless you're matching something in pandas.

Suggested change
raise TypeError("Must specify data buffer")
raise ValueError("Must specify data buffer")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an issue in other places in this method and I just wanted to be consistent. Agree though that ValueError is the more appropriate error type. Let's fix in a follow-up.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. I had a couple small comments/replies but nothing blocking. Thanks.

@shwina
Copy link
Contributor Author

shwina commented Apr 27, 2023

/merge

@rapids-bot rapids-bot bot merged commit e13dfec into rapidsai:branch-23.06 Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Blocked Cannot progress due to external reasons feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants