[Story] Timezone-aware datetime support #12813

Closed
5 of 6 tasks
shwina opened this issue Feb 21, 2023 · 12 comments

@shwina
Contributor

shwina commented Feb 21, 2023

This meta-issue tracks adding support for timezone-aware datetimes to the cuDF Python API.

As discussed in #11592, it's possible to implement timezone-aware operations like tz_localize and tz_convert using algorithms already provided by the libcudf public API - namely, lower_bound, upper_bound, and label_bins (histogramming).
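
To make the idea concrete, here is a minimal pure-Python sketch of a `tz_convert`-style lookup built on a sorted search. It is not cuDF code: the transition table, its values, and the helper name `utc_to_local` are hypothetical, and the standard library's `bisect_right` stands in for libcudf's `upper_bound`.

```python
from bisect import bisect_right

# Hypothetical transition table for a single zone (values are made up):
# UTC instants, in seconds since the epoch, at which the UTC offset changes,
# and the offset (seconds east of UTC) in effect from each instant onward.
# The first entry is a sentinel so that every timestamp has a predecessor.
TRANSITIONS_UTC = [-(2**62), 1_000, 2_000]
OFFSETS = [0, 3_600, 0]


def utc_to_local(utc_seconds):
    """Shift UTC timestamps to local wall time via a sorted search."""
    local = []
    for t in utc_seconds:
        # upper_bound, then step back one slot: the last transition <= t.
        i = bisect_right(TRANSITIONS_UTC, t) - 1
        local.append(t + OFFSETS[i])
    return local


converted = utc_to_local([500, 1_000, 1_500, 2_000, 2_500])
```

On a GPU the per-element loop would be replaced by a single vectorized search over the whole column. Localizing naive wall times is the reverse lookup and additionally has to handle ambiguous and nonexistent times, which this sketch leaves out.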

What has been missing is a way to load the "time zone database" into device memory -- work on this is underway in #12805. Once we have a way to load the UTC offsets corresponding to a given timezone into a cudf.DataFrame, we can add implementations of timezone-aware operations, prototyped here.

Tasks

  1. 5 - Ready to Merge CMake conda cuIO feature request libcudf non-breaking
    vuule
  2. CMake Python conda improvement libcudf non-breaking
  3. 0 - Blocked Python feature request non-breaking
@shwina shwina added the Python Affects Python cuDF API. label Feb 21, 2023
@shwina shwina self-assigned this Feb 21, 2023
@mroeschke
Contributor

Will cuDF have a "default" time zone database (implementation) when this feature is shipped?

In pandas, once we drop Python 3.8 this year, we plan to default to the stdlib's ZoneInfo + tzdata package instead of pytz

@shwina
Contributor Author

shwina commented Feb 21, 2023

Thanks for bringing this up, @mroeschke!

Will cuDF have a "default" time zone database (implementation) when this feature is shipped?

Like ZoneInfo, cuDF will rely on the time zone data found in sysconfig.get_config_var("TZPATH") and fall back to the tzdata package. However, cuDF will use its own TZif reader to read the time zone data into GPU memory.
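
As a rough illustration of that lookup order, the sketch below searches the `TZPATH` directories first and then falls back to the `tzdata` package. The helper names `find_tzif` and `looks_like_tzif` are hypothetical and this is not cuDF's reader; it only locates a file and checks the 4-byte `TZif` magic at its start.

```python
import os
import sysconfig
from importlib import resources


def find_tzif(key):
    """Return a path to the TZif file for zone `key`, or None if not found."""
    # TZPATH is a build-time config variable and may be unset (e.g. on Windows).
    tzpath = sysconfig.get_config_var("TZPATH") or ""
    for root in tzpath.split(os.pathsep):
        candidate = os.path.join(root, *key.split("/"))
        if root and os.path.isfile(candidate):
            return candidate
    # Fall back to the tzdata package, if it is installed.
    try:
        node = resources.files("tzdata.zoneinfo")
    except ModuleNotFoundError:
        return None
    for part in key.split("/"):
        node = node.joinpath(part)
    return str(node) if node.is_file() else None


def looks_like_tzif(path):
    """Check that the file begins with the TZif magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"TZif"


utc_path = find_tzif("UTC")
```

Either source may be missing on a given machine, so callers should be prepared for `None`.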

So, cuDF will depend neither on ZoneInfo nor pytz. But, I'm hoping that by depending on the exact same data source as Pandas, cuDF will be more aligned with Pandas when the latter drops pytz. Does that seem like a reasonable assumption?

@mroeschke
Contributor

Yup the sourcing of the data will align with pandas once pytz is no longer the default.

A small detail question: what object will be returned in cuDF when a user asks for the tz of the dtype? For example:

In [11]: ser = pd.Series([pd.Timestamp("2023", tz="UTC")])

In [12]: ser.dtype.tz
Out[12]: datetime.timezone.utc

In [13]: ser = pd.Series([pd.Timestamp("2023", tz="US/Pacific")])

In [14]: ser.dtype.tz
Out[14]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>

In [15]: type(ser.dtype.tz)
Out[15]: pytz.tzfile.US/Pacific  # Will be zoneinfo.ZoneInfo in the future

@shwina
Contributor Author

shwina commented Feb 21, 2023

We could always return a ZoneInfo, constructed from the zone string? Would that break any use-cases?

@mroeschke
Contributor

That should be sufficient for most timezones.

In pandas we also support fixed-offset timezones: UTC and any other timezone with a "fixed" UTC offset. We plan to return datetime.timezone.utc and datetime.timezone objects respectively, since the latter case cannot be represented by ZoneInfo AFAICT.

@shwina
Contributor Author

shwina commented Feb 21, 2023

I asked @mroeschke offline how the "fixed" UTC offsets in the above comment are different from the Etc/GMT+xx offsets, which can be used with ZoneInfo. Quoting his response:

Yes ZoneInfo("Etc/GMT+3") are essentially similar to the "fixed" offsets I mentioned, but historically pandas also supports custom fixed offsets that don't officially exist. They were originally supported with pytz.FixedOffset(-123) for example and now also include datetime.timezone(datetime.timedelta(hours=11, minutes=10))
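
The distinction can be seen with the standard library alone. This is only an illustration of the two kinds of object, not of cuDF behavior; the variable names are hypothetical, and the `ZoneInfo` lookup is guarded because it needs time zone data to be available on the machine.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

noon_utc = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)

# A custom fixed offset that matches no entry in the time zone database.
custom = timezone(timedelta(hours=11, minutes=10))
local = noon_utc.astimezone(custom)
custom_offset = local.utcoffset()

# A database-backed fixed offset. Etc/GMT+3 uses the POSIX sign convention,
# so it is three hours *behind* UTC.
try:
    etc_offset = noon_utc.astimezone(ZoneInfo("Etc/GMT+3")).utcoffset()
except ZoneInfoNotFoundError:
    etc_offset = None  # no time zone data available here
```

The `Etc/GMT+3` zone can be named by a key and looked up in the database; the 11:10 offset can only be expressed as a `datetime.timezone` object.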


I think it's safe to say that for now, cuDF will not support these kinds of fixed offsets.

@bdice
Contributor

bdice commented Mar 6, 2023

@shwina: This topic came up with @vuule today. Can you describe which parts of this issue you want to implement in libcudf vs. implement as an experiment in pure Python? I think we want to eventually have C++ APIs for most/all of the features like tz_localize so they can serve Spark and C++ users as well.

@shwina
Contributor Author

shwina commented Mar 7, 2023

@bdice: only the timezone table reader needs libcudf support; everything else can be implemented using existing libcudf algorithms.

@GregoryKimball and I agreed that it's reasonable to first implement timezone-aware operations in Python, get some user/community feedback, and eventually implement them in C++.

@vyasr
Contributor

vyasr commented May 15, 2024

@shwina or @GregoryKimball do you have a sense of what remains to be done for this issue?

@shwina
Contributor Author

shwina commented May 15, 2024

It looks like we're mostly done here. I don't think we ever got to supporting binary operations with timezone-aware columns and timedeltas. Perhaps it's fine to just create an issue for that and close this story out.

@vyasr
Contributor

vyasr commented May 17, 2024

That sounds right to me. Thanks @shwina!

@vyasr
Contributor

vyasr commented May 17, 2024

Opened #15774
