Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tz_convert method to convert between timestamps #13328

Merged
merged 18 commits into from
May 18, 2023
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 0 additions & 1 deletion python/cudf/cudf/_lib/column.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,6 @@ cdef class Column:
object null_count=None,
object children=()
):

self._size = size
self._distinct_count = {}
self._dtype = dtype
Expand Down
25 changes: 23 additions & 2 deletions python/cudf/cudf/core/_internals/timezones.py
Original file line number Diff line number Diff line change
Expand Up @@ -193,15 +193,36 @@ def localize(
return cast(
DatetimeTZColumn,
build_column(
data=gmt_data.data,
data=gmt_data.base_data,
dtype=dtype,
mask=localized.mask,
mask=localized.base_mask,
shwina marked this conversation as resolved.
Show resolved Hide resolved
size=gmt_data.size,
offset=gmt_data.offset,
),
)


def convert(data: DatetimeTZColumn, zone_name: str) -> DatetimeTZColumn:
if not isinstance(data, DatetimeTZColumn):
raise TypeError(
"Cannot convert from timezone-naive timestamps to "
"timezone-aware timestamps. For that, "
"use `tz_localize instead."
shwina marked this conversation as resolved.
Show resolved Hide resolved
)
utc_time = data._utc_time
mroeschke marked this conversation as resolved.
Show resolved Hide resolved
out = cast(
DatetimeTZColumn,
build_column(
data=utc_time.base_data,
dtype=pd.DatetimeTZDtype(data._time_unit, zone_name),
mask=utc_time.base_mask,
size=utc_time.size,
offset=utc_time.offset,
),
)
return out


def utc_to_local(data: DatetimeColumn, zone_name: str) -> DatetimeColumn:
tz_data_for_zone = get_tz_data(zone_name)
transition_times, offsets = tz_data_for_zone._columns
Expand Down
16 changes: 14 additions & 2 deletions python/cudf/cudf/core/column/datetime.py
Original file line number Diff line number Diff line change
Expand Up @@ -565,8 +565,20 @@ def to_pandas(
)

def to_arrow(self):
return pa.compute.assume_timezone(
self._local_time.to_arrow(), str(self.dtype.tz)
return self._local_time.to_arrow().cast(
pa.timestamp(self.dtype.unit, str(self.dtype.tz))
)

@property
def _utc_time(self):
"""Return UTC time as naive timestamps."""
return DatetimeColumn(
data=self.base_data,
dtype=_get_base_dtype(self.dtype),
mask=self.base_mask,
size=self.size,
offset=self.offset,
null_count=self.null_count,
)

@property
Expand Down
41 changes: 41 additions & 0 deletions python/cudf/cudf/core/index.py
Original file line number Diff line number Diff line change
Expand Up @@ -2547,6 +2547,47 @@ def tz_localize(self, tz, ambiguous="NaT", nonexistent="NaT"):
result_col = localize(self._column, tz, ambiguous, nonexistent)
return DatetimeIndex._from_data({self.name: result_col})

def tz_convert(self, tz):
"""
Convert tz-aware datetimes from one time zone to another.

Parameters
----------
tz: str
shwina marked this conversation as resolved.
Show resolved Hide resolved
Time zone for time. Corresponding timestamps would be converted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pandas docstring is worded in a weird way. However, I would leave this as-is to match pandas.

to this time zone of the Datetime Array/Index.
A `tz` of None will convert to UTC and remove the timezone
information.

Returns
-------
DatetimeIndex containing timestamps corresponding to the timezone
`tz`.

Examples
--------
>>> import cudf
>>> dti = cudf.date_range('2018-03-01 09:00', periods=3, freq='D')
>>> dti = dti.tz_localize("America/New_York")
>>> dti
DatetimeIndex(['2018-03-01 09:00:00-05:00',
'2018-03-02 09:00:00-05:00',
'2018-03-03 09:00:00-05:00'],
dtype='datetime64[ns, America/New_York]')
>>> dti.tz_convert("Europe/London")
DatetimeIndex(['2018-03-01 14:00:00+00:00',
'2018-03-02 14:00:00+00:00',
'2018-03-03 14:00:00+00:00'],
dtype='datetime64[ns, Europe/London]')
"""
from cudf.core._internals.timezones import convert, localize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot, is there a reason we defer this import in each function rather than doing it once at the top?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circular imports :-(


if tz is None:
result_col = localize(self._column._utc_time, None)
else:
result_col = convert(self._column, tz)
return DatetimeIndex._from_data({self.name: result_col})


class TimedeltaIndex(GenericIndex):
"""
Expand Down
21 changes: 21 additions & 0 deletions python/cudf/cudf/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -4609,6 +4609,27 @@ def tz_localize(self, tz, ambiguous="NaT", nonexistent="NaT"):
index=self.series._index,
)

@copy_docstring(DatetimeIndex.tz_convert)
def tz_convert(self, tz):
"""
Parameters
----------
tz: str
shwina marked this conversation as resolved.
Show resolved Hide resolved
Time zone for time. Corresponding timestamps would be converted
to this time zone of the Datetime Array/Index.
A `tz` of None will convert to UTC and remove the
timezone information.
"""
from cudf.core._internals.timezones import convert

if tz is None:
result_col = self.series._column._utc_time
else:
result_col = convert(self.series._column, tz)
return Series._from_data(
{self.series.name: result_col}, index=self.series._index
)


class TimedeltaProperties:
"""
Expand Down
10 changes: 10 additions & 0 deletions python/cudf/cudf/tests/indexes/datetime/test_time_specific.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,3 +14,13 @@ def test_tz_localize():
pidx.tz_localize("America/New_York"),
idx.tz_localize("America/New_York"),
)


def test_tz_convert():
pidx = pd.date_range("2023-01-01", periods=3, freq="H")
idx = cudf.from_pandas(pidx)
pidx = pidx.tz_localize("UTC")
idx = idx.tz_localize("UTC")
assert_eq(
pidx.tz_convert("America/New_York"), idx.tz_convert("America/New_York")
)
22 changes: 22 additions & 0 deletions python/cudf/cudf/tests/series/test_datetimelike.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,3 +111,25 @@ def test_delocalize(unit, tz):
expect = psr.dt.tz_localize(tz).dt.tz_localize(None)
got = sr.dt.tz_localize(tz).dt.tz_localize(None)
assert_eq(expect, got)


@pytest.mark.parametrize(
"from_tz", ["Europe/London", "America/Chicago", "UTC"]
)
@pytest.mark.parametrize(
"to_tz", ["Europe/London", "America/Chicago", "UTC", None]
)
def test_convert(from_tz, to_tz):
ps = pd.Series(pd.date_range("2023-01-01", periods=3, freq="H"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d love it if we could add some complexity to our test inputs. Maybe a data fixture that has some times on either side of a DST change, ambiguous times, pre-1900 times, etc. Include some times that we know have raised issues in the past (issue tracker has a few).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. @mroeschke does Pandas do something like this? Just wondering if there's tooling we can borrow/steal/vendor from Pandas

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. The only related fixture we used is a fixed variety of timezones we use where applicable

https://github.com/pandas-dev/pandas/blob/0fa150016911de08025d82ef6975750278c5ad7b/pandas/conftest.py#L1196-L1214

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I threw the problem at ChatGPT and it generated some edge case tests that I added here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-1900 times

I did find that we return a result different from Pandas for this pre-1900 example:

>>> pd.Series(["1899-01-01 12:00"], dtype="datetime64[s]").dt.tz_localize("Europe/Paris").dt.tz_convert("America/New_York")
0   1899-01-01 06:55:00-04:56
dtype: datetime64[ns, America/New_York]

>>> cudf.Series(["1899-01-01 12:00"], dtype="datetime64[s]").dt.tz_localize("Europe/Paris").dt.tz_convert("America/New_York")
0   1899-01-01 06:50:39-04:56
dtype: datetime64[s, America/New_York]

However, our result is the same as you would get with zoneinfo:

>>> datetime(1899, 1, 1, 12, 0, tzinfo=ZoneInfo("Europe/Paris")).astimezone(ZoneInfo("America/New_York"))
datetime.datetime(1899, 1, 1, 6, 50, 39, tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))

@mroeschke I'm curious if this aligns with your experience with the difference between Pandas (pytz) and ZoneInfo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina If you want to add pre-1900 times in a later PR, that's fine. I think you hit a decent number of edge cases for now. But if we know we disagree with pandas for this specific case, I'd like to document that in an issue. I would consider that a bug.

gs = cudf.from_pandas(ps)
ps = ps.dt.tz_localize(from_tz)
gs = gs.dt.tz_localize(from_tz)
expect = ps.dt.tz_convert(to_tz)
got = gs.dt.tz_convert(to_tz)
assert_eq(expect, got)


def test_convert_from_naive():
gs = cudf.Series(cudf.date_range("2023-01-01", periods=3, freq="H"))
with pytest.raises(TypeError):
gs.dt.tz_convert("America/New_York")