Add support for DatetimeTZDtype and tz_localize #13163
Conversation
This reverts commit 7c9f66e.
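For context, a minimal sketch of the pandas behaviour that this PR's `tz_localize` support aims to mirror (pandas only, since the cuDF API here is still under review; the example timestamps are chosen to avoid nonexistent DST times):

```python
import pandas as pd

# Reference behaviour in pandas: localizing a naive DatetimeIndex
# attaches a time zone and changes the dtype accordingly.
idx = pd.DatetimeIndex(["2023-03-12 01:30", "2023-03-12 03:30"])
localized = idx.tz_localize("America/New_York")
print(localized.dtype)  # datetime64[ns, America/New_York]
```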
nanos = self._values.astype("datetime64[ns]")
if isinstance(self.dtype, pd.DatetimeTZDtype):
    nanos = self._values.astype(
        pd.DatetimeTZDtype("ns", self.dtype.tz)
As a pandas 2.0 note, would be good to keep "s", "ms", or "us" resolution here
Since this is currently targeting branch-23.06, which doesn't have pandas 2.0 support, this change seems okay. What @mroeschke is suggesting is taken care of in pandas_2.0_feature_branch, so perhaps a note or TODO here would be useful while I sync these changes with pandas 2.0 support.
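A hedged illustration of the pandas 2.0 point above (the non-nanosecond part is guarded because it assumes pandas >= 2.0):

```python
import pandas as pd

values = pd.to_datetime(["2023-01-01 12:00", "2023-06-01 12:00"])
localized = values.tz_localize("UTC")
print(localized.dtype)  # datetime64[ns, UTC]

# In pandas >= 2.0, DatetimeTZDtype also accepts "s", "ms", and "us",
# so a cast like the one in the diff could preserve the source
# resolution instead of always widening to nanoseconds.
if int(pd.__version__.split(".")[0]) >= 2:
    us_dtype = pd.DatetimeTZDtype(unit="us", tz="UTC")
    print(localized.astype(us_dtype).dtype)  # datetime64[us, UTC]
```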
@pytest.fixture(
    params=["America/New_York", "Asia/Tokyo", "CET", "Etc/GMT+1", "UTC"]
I'd like to see if we can test every time zone present on the system against pandas behavior for some function(s).
I've updated to include all time zones in the localize tests (except the two problematic ones indicated). Note that we likely cannot do this for each and every timezone operation (e.g., in the future when we add tz_convert), given the sheer increase in test runtime.
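One possible shape for such an all-zones fixture (a sketch assuming Python >= 3.9's `zoneinfo`; the fixture name and the fallback list are hypothetical, not the actual test code):

```python
import pytest

try:
    from zoneinfo import available_timezones

    # Every IANA zone the interpreter can see (system tzdata or the
    # tzdata wheel), sorted for stable parametrization order.
    ALL_TIME_ZONES = sorted(available_timezones())
except ImportError:  # Python < 3.9 without the backport
    ALL_TIME_ZONES = ["UTC"]


@pytest.fixture(params=ALL_TIME_ZONES)
def any_tz(request):
    return request.param
```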
Yes, this is a good solution. I think testing all timezones for a very limited subset of functions is a helpful way to make sure we're not making bad assumptions about the timezone database structure which may change over time or by system.
Co-authored-by: Bradley Dice <[email protected]>
Nice! My comments have been addressed
This PR is blocked on dropping Python 3.8 and adding Python 3.9 support in cuDF. Waiting for #13196.
Mostly minor comments. I'll approve once you've taken a look and addressed some of them. This is great work -- nice job @shwina.
    )

    @property
    def values(self):
Should this property error implementation be inherited from DateTimeColumn? Or if you think it's possible that CuPy might support time-zone-naive data (the parent class but not this zone-aware class), should we mention "time zone" in the error message?
Nope -- you're right, we should just inherit this
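A toy sketch of the resolution (class names and the error message are illustrative, not the actual cuDF classes):

```python
class DatetimeColumnSketch:
    @property
    def values(self):
        # Illustrative message only; the real column class raises its
        # own error for unsupported host-array access.
        raise NotImplementedError("datetime columns do not support .values")


# The tz-aware subclass simply inherits the erroring property rather
# than redefining it.
class DatetimeTZColumnSketch(DatetimeColumnSketch):
    pass
```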
python/cudf/cudf/core/index.py
Parameters
----------
tz: str
I think there's usually supposed to be a space on both sides of the colon here. Per numpydoc: "The colon must be preceded by a space, or omitted if the type is absent."
https://numpydoc.readthedocs.io/en/latest/format.html#parameters

Suggested change:
- tz: str
+ tz : str
Fix'd
def _get_all_zones():
    zones = []
    for root, dirs, files in os.walk("/usr/share/zoneinfo"):
Just FYI, we ran into issues with /etc/localtime not existing in rapidsai/miniforge-cuda#16. The relation here is that I'm not 100% sure if you can assume /usr/share/zoneinfo is populated with data in super-minimalist Docker containers. What this means is that the result of this function could be an empty list. If you want that to raise an error/warning, you might want to add something explicit for that. I'm content with the current behavior or a warning. I might lean away from raising an error, but I'm not sure.
Do we want the tests to fail if that happens? I'd think not as we can't always control the environment the tests run in. If my understanding is correct, throwing either a warning or an error will cause tests to fail.
Right now, it looks like the empty list will cause these tests to be skipped which seems like the desired behaviour. We could do something like this to be more explicit about skipping (but I don't like it):
if not len(ALL_TIME_ZONES):
    ALL_TIME_ZONES = [
        pytest.param(
            None,
            marks=pytest.mark.skip(reason="Missing /usr/share/zoneinfo/"),
        )
    ]
No strong feelings. Let's leave it as-is and not overengineer it just to get a skip message.
request.applymarker(
    pytest.mark.xfail(
        condition=(zone_name == "America/Grand_Turk"),
        reason="https://www.worldtimezone.com/dst_news/dst_news_turkscaicos03.html",  # noqa: E501
Were we able to identify if there is an inconsistency somewhere for Grand Turk and Metlakatla? Perhaps disagreements between system-provided timezone information and some timezone database in a Python package?
This also might be a candidate for skipping instead of xfailing, if we run tests with xfail_strict. This sounds like something where we could get an unexpected success (which would cause an error) due to Python/system software updates ...particularly if my hypothesis above is true.
I honestly haven't gotten to the bottom of it. If we ever do hit an unexpected success here, e.g., if something changes in either zoneinfo or pytz, I would love to be alerted of that! So, should we keep the xfail?
Sounds good to me!
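A small sketch of the trade-off being settled here: with `strict=True`, an unexpected pass (XPASS) fails the suite, which is exactly the "alert me if tzdata changes" behaviour wanted above. The marker is built standalone for illustration; the condition and reason are placeholders:

```python
import pytest

# strict=True makes an unexpected pass count as a failure instead of
# being silently accepted, so a tzdata/zoneinfo change surfaces in CI.
grand_turk_xfail = pytest.mark.xfail(
    condition=True,
    reason="known tzdata disagreement for America/Grand_Turk",
    strict=True,
)
```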
Co-authored-by: Bradley Dice <[email protected]>
@@ -23,6 +23,7 @@
     serialize_columns,
 )
 from cudf.core.column.datetime import DatetimeColumn  # noqa: F401
+from cudf.core.column.datetime import DatetimeTZColumn  # noqa: F401
FYI: you can fix these F401 "unused import" errors by defining __all__. If it's in __all__, it's "part of the API" of this module and thus flake8 sees it as used.
Isn't this more explicit though? I always thought the only "official" use for __all__ was to define what happens when you do from module import *.
I think that defining your module's public API (rather than not defining it) is the most explicit approach. I'm pro-__all__ in Python libraries.
"I think that defining your module's public API (rather than not defining it) is the most explicit approach."

Hmm, but __all__ doesn't do that AFAICT. Placing a name in __all__ doesn't preclude it from appearing in e.g., tab completion options (naming it with a leading _ does). Are there other tools than flake8 that do something with __all__?
I think type checkers (mypy, pyright) will utilize __all__ to see what APIs are "public".
Also, to your question about tab completion in IPython/etc., that's discussed here: ipython/ipykernel#129 (comment). The community seems split on the issue, but some folks seem to think limiting to __all__ by default would be good. Either way -- it seems that __all__ is viewed as a source of truth for public APIs, and the real question is whether autocompletion should only show public APIs. It's configurable, with the option IPCompleter.limit_to__all__ (config options docs):

"The reason we do this is that we often want to use IPython to poke about and debug modules, so we like to see what's really there, not just what's whitelisted as public API. We should, however, do a better job of prioritising completions, so that things in __all__ come before things that aren't in it."
Thanks, both! I'm on board with defining __all__.
I wonder if we can get a linter to enforce that __all__ is defined (even if empty) always.
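A self-contained demonstration of what __all__ actually controls. The module is built in-memory so the snippet runs anywhere; "mod" and the chosen names are throwaway placeholders, not cuDF modules:

```python
import sys
import types

# Create a throwaway module: it imports two names but exports only one.
mod = types.ModuleType("mod")
exec(
    "from math import sqrt, pi\n"
    "__all__ = ['sqrt']  # pi is imported but deliberately not exported\n",
    mod.__dict__,
)
sys.modules["mod"] = mod

ns = {}
exec("from mod import *", ns)
# Only the __all__-listed name is bound by the star-import.
print(sorted(k for k in ns if not k.startswith("__")))  # ['sqrt']
```

Note that, as discussed above, this only governs star-imports (and what tools like flake8 and type checkers treat as public); it does not hide `pi` from `mod.pi` access or from default tab completion.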
@@ -1554,6 +1554,17 @@ def build_column(
        offset=offset,
        null_count=null_count,
    )
elif is_datetime64tz_dtype(dtype):
    if data is None:
        raise TypeError("Must specify data buffer")
ValueError seems more natural here unless you're matching something in pandas.

Suggested change:
- raise TypeError("Must specify data buffer")
+ raise ValueError("Must specify data buffer")
This is an issue in other places in this method and I just wanted to be consistent. Agree though that ValueError is the more appropriate error type. Let's fix in a follow-up.
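A minimal sketch of the convention being agreed on; the function name and return value are hypothetical, not the actual build_column signature:

```python
def build_tz_column_sketch(data, dtype):
    # A missing required argument is a value problem, not a type
    # problem, so ValueError is the more idiomatic choice here.
    if data is None:
        raise ValueError("Must specify data buffer")
    return (data, dtype)
```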
Great work. I had a couple small comments/replies but nothing blocking. Thanks.
/merge
Description
TBD
Quick benchmark:
Checklist