Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tz_convert method to convert between timestamps #13328

Merged
merged 18 commits into from
May 18, 2023

Conversation

shwina
Copy link
Contributor

@shwina shwina commented May 10, 2023

Description

Closes #13329

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina added feature request New feature or request Python Affects Python cuDF API. non-breaking Non-breaking change labels May 10, 2023
@github-actions github-actions bot added the conda label May 11, 2023
@shwina
Copy link
Contributor Author

shwina commented May 12, 2023

rerun tests

@shwina shwina marked this pull request as ready for review May 16, 2023 19:27
@shwina shwina requested a review from a team as a code owner May 16, 2023 19:27
@shwina shwina requested review from wence- and vyasr May 16, 2023 19:27
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments have been addressed

@@ -589,6 +601,18 @@ def as_string_column(
) -> "cudf.core.column.StringColumn":
return self._local_time.as_string_column(dtype, format, **kwargs)

def __repr__(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really unrelated to the rest of the PR, but a quality-of-life thing for debugging. Pandas always prints the local timestamps when looking at a tz-aware column and pyarrow always prints the UTC timestamps.

python/cudf/cudf/core/_internals/timezones.py Outdated Show resolved Hide resolved
def __repr__(self):
# Arrow prints the UTC timestamps, but we want to print the
# local timestamps:
arr = self._local_time.to_arrow().cast(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a silly question. Why convert to arrow and then attempt to mirror pandas repr conventions instead of just converting it to pandas and using that repr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly because pyarrow has the convenient to_string() method that lets us assemble the repr with a custom class name.

In [7]: print(dti._column.to_arrow().to_string())
[
  2001-01-01 05:00:00.000000000,
  2001-01-01 06:00:00.000000000,
  2001-01-01 07:00:00.000000000,
  2001-01-01 08:00:00.000000000,
  2001-01-01 09:00:00.000000000,
  2001-01-01 10:00:00.000000000,
  2001-01-01 11:00:00.000000000,
  2001-01-01 12:00:00.000000000,
  2001-01-01 13:00:00.000000000,
  2001-01-01 14:00:00.000000000
]

python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved
Parameters
----------
tz: str
Time zone for time. Corresponding timestamps would be converted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pandas docstring is worded in a weird way. However, I would leave this as-is to match pandas.

'2018-03-03 14:00:00+00:00'],
dtype='datetime64[ns, Europe/London]')
"""
from cudf.core._internals.timezones import convert, localize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot, is there a reason we defer this import in each function rather than doing it once at the top?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circular imports :-(

python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
"to_tz", ["Europe/London", "America/Chicago", "UTC", None]
)
def test_convert(from_tz, to_tz):
ps = pd.Series(pd.date_range("2023-01-01", periods=3, freq="H"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d love it if we could add some complexity to our test inputs. Maybe a data fixture that has some times on either side of a DST change, ambiguous times, pre-1900 times, etc. Include some times that we know have raised issues in the past (issue tracker has a few).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. @mroeschke does Pandas do something like this? Just wondering if there's tooling we can borrow/steal/vendor from Pandas

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. The only related fixture we used is a fixed variety of timezones we use where applicable

https://github.com/pandas-dev/pandas/blob/0fa150016911de08025d82ef6975750278c5ad7b/pandas/conftest.py#L1196-L1214

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I threw the problem at ChatGPT and it generated some edge case tests that I added here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-1900 times

I did find that we return a result different from Pandas for this pre-1900 example:

>>> pd.Series(["1899-01-01 12:00"], dtype="datetime64[s]").dt.tz_localize("Europe/Paris").dt.tz_convert("America/New_York")
0   1899-01-01 06:55:00-04:56
dtype: datetime64[ns, America/New_York]

>>> cudf.Series(["1899-01-01 12:00"], dtype="datetime64[s]").dt.tz_localize("Europe/Paris").dt.tz_convert("America/New_York")
0   1899-01-01 06:50:39-04:56
dtype: datetime64[s, America/New_York]

However, our result is the same as you would get with zoneinfo:

>>> datetime(1899, 1, 1, 12, 0, tzinfo=ZoneInfo("Europe/Paris")).astimezone(ZoneInfo("America/New_York"))
datetime.datetime(1899, 1, 1, 6, 50, 39, tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))

@mroeschke I'm curious if this aligns with your experience with the difference between Pandas (pytz) and ZoneInfo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina If you want to add pre-1900 times in a later PR, that's fine. I think you hit a decent number of edge cases for now. But if we know we disagree with pandas for this specific case, I'd like to document that in an issue. I would consider that a bug.

Copy link
Contributor Author

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion

@shwina shwina requested a review from bdice May 17, 2023 19:23
@shwina
Copy link
Contributor Author

shwina commented May 18, 2023

/merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add tz_convert method to convert timezones
3 participants