-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clib.conversion._to_numpy: Add tests for PyArrow's timestamp type #3621
Conversation
@@ -18,6 +18,18 @@ | |||
|
|||
_HAS_PYARROW = True | |||
except ImportError: | |||
|
|||
class pa: # noqa: N801 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pa.timestamp()
with tz
doesn't have string aliases, so we can't use strings like timestamp[s, UTC]
. So we have to define a dummy function here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we avoid this dummy class by starting from a pandas.Timestamp
instead of Python datetime
? The timezone info can be carried over:
import pyarrow as pa
import pandas as pd
from zoneinfo import ZoneInfo
array = pa.array(
[
pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
]
)
print(array.type)
# timestamp[us, tz=America/New_York]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can, but I feel it will make the pytest parametrize more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's keep this dummy class then.
pygmt/tests/test_clib_to_numpy.py
Outdated
pytest.param( | ||
pa.timestamp("s", tz="America/New_York"), | ||
"datetime64[s]", | ||
id="timestamp[s, tz=America/New_York]", | ||
), | ||
pytest.param( | ||
pa.timestamp("s", tz="+07:30"), | ||
"datetime64[s]", | ||
id="timestamp[s, tz=+07:30]", | ||
), | ||
], | ||
) | ||
def test_to_numpy_pyarrow_timestamp(dtype, expected_dtype): | ||
""" | ||
Test the _to_numpy function with PyArrow arrays of PyArrow datetime types. | ||
|
||
pyarrow.timestamp(unit, tz=None) can accept units "s", "ms", "us", and "ns". | ||
|
||
Reference: https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html | ||
""" | ||
data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)] | ||
array = pa.array(data, type=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the timezone part of the pa.timestamp
type actually working?
data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
print(array)
# [
# 2024-01-02 03:04:05Z,
# 2024-01-02 03:04:06Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T03:04:05' '2024-01-02T03:04:06']
If using pd.Timestamp
instead
import pandas as pd
import pyarrow as pa
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from pygmt.clib.conversion import _to_numpy
array = pa.array(
[
pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
]
)
print(array)
# [
# 2024-01-02 08:04:05.000000Z,
# 2024-01-02 08:04:06.000000Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T08:04:05.000000' '2024-01-02T08:04:06.000000']
For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the timezone information is applied but not shown when representing the array.
In [1]: from datetime import datetime
In [2]: import pyarrow as pa
# For pa.scalar, it's shown with timezone offset
In [3]: pa.scalar(datetime(2024, 1, 2, 3, 4, 5), type=pa.timestamp("s", tz="America/New_York"))
Out[3]: <pyarrow.TimestampScalar: '2024-01-01T22:04:05-0500'>
In [4]: array = pa.array([datetime(2024, 1, 2, 3, 4, 5)], type=pa.timestamp("s", tz="America/New_York"))
# For pa.array, it's shown without timezone offset
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f52feace740>
[
2024-01-02 03:04:05Z
]
# When converted to pandas, it's shown with timezone offset
In [6]: array.to_pandas()
Out[6]:
0 2024-01-01 22:04:05-05:00
dtype: datetime64[s, America/New_York]
# When converted to numpy, timezone information is lost, since numpy doesn's have tz support.
In [7]: array.to_numpy()
Out[7]: array(['2024-01-02T03:04:05'], dtype='datetime64[s]')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likely related to apache/arrow#40122.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the timezone part of the
pa.timestamp
type actually working?data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)] array = pa.array(data, type=pa.timestamp("s", tz="America/New_York")) print(array) # [ # 2024-01-02 03:04:05Z, # 2024-01-02 03:04:06Z # ] result = _to_numpy(array) print(result) # ['2024-01-02T03:04:05' '2024-01-02T03:04:06']If using
pd.Timestamp
insteadimport pandas as pd import pyarrow as pa from datetime import datetime, timezone from zoneinfo import ZoneInfo from pygmt.clib.conversion import _to_numpy array = pa.array( [ pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")), pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")), ] ) print(array) # [ # 2024-01-02 08:04:05.000000Z, # 2024-01-02 08:04:06.000000Z # ] result = _to_numpy(array) print(result) # ['2024-01-02T08:04:05.000000' '2024-01-02T08:04:06.000000']
For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours.
Also need to note the difference between datetime
and pd.Timestamp
. datetime
is timezone-unware, while pd.Timestamp
has timezone support.
So, in your first example:
- actual datetime:
2024-01-02T03:04:05+00:00
- In pyarrow.array, the actual datetime is
2024-01-01T22:04:05-05:00
, but UTC time (2024-01-02T03:04:05
) is shown - to_numpy:
2024-01-02T03:04:05
(UTC time)
In your 2nd example:
- the actual datetime:
2024-01-02T03:04:05-05:00
- In pyarrow.array: the UTC time (
2024-01-02 08:04:05
) is shown - to_numpy:
2024-01-02T08:04:05
So, I think there is no inconsistency. The pyarrow.array has timezone stored internally, but its repr always shows UTC time. And when converting to numpy array (using either its own to_numpy()
method or np.ascontinuousarray()), the datetime are always converted to UTC since numpy.datetime64 doesn't have timezone support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://stackoverflow.com/a/73276431
Arrow internally stores datetime as UTC + timezone info and will print it as such
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see now that there is indeed consistency, thanks for clearing this up, the repr on pyarrow.array is indeed confusing 😅 . So in summary:
class/type | has timezone support | Link |
---|---|---|
Python datetime |
✅ | https://docs.python.org/3/library/datetime.html#datetime.datetime.tzinfo |
pandas.Timestamp | ✅ | https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.Timestamp.html |
pyarrow.timestamp (type) | ✅ | https://arrow.apache.org/docs/17.0/python/generated/pyarrow.timestamp.html |
numpy.datetime64 | ❌ | https://numpy.org/doc/2.1/reference/arrays.scalars.html#numpy.datetime64 |
Since we're converting to NumPy, the timezone information will always be lost (everything is converted to UTC). What does this mean for PyGMT/GMT? Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As shown below, GMT itself doesn't have timezone support, similar to numpy.datetime64. The big difference is that GMT simply ignore the TZ infor, while the to_numpy conversion always converts to UTC.
gmt begin map
gmt basemap -R2024-01-01T00:00:00+08:00/2024-01-02T00:00:00+08:00/1/10 -JX10c/4c -Baf
echo 2024-01-01T08:00:00+08:00 5 | gmt plot -Sc1c -Gblack
gmt end show
Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?
I think yes, unless we try hard to drop the TZ information before converting to numpy.datetime64. Below is an example, but if we want to support TZ, we need to find a more general way to do it.
In [1]: import pyarrow as pa
In [2]: from datetime import datetime
In [3]: data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
In [4]: array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f78c07bafe0>
[
2024-01-02 03:04:05Z,
2024-01-02 03:04:06Z
]
In [6]: array.to_pandas().dt.tz_localize(None)
Out[6]:
0 2024-01-01 22:04:05
1 2024-01-01 22:04:06
dtype: datetime64[s]
In [7]: import numpy as np
In [8]: np.ascontiguousarray(array.to_pandas().dt.tz_localize(None))
Out[8]:
array(['2024-01-01T22:04:05', '2024-01-01T22:04:06'],
dtype='datetime64[s]')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to open an issue to discuss this - whether to:
- Follow GMT (allow users to plot data at a non-UTC timezone, by ignoring the timezone offset)
- Follow NumPy, whereby the data will from any non-UTC timezone will be converted to UTC always.
If going with 2, we should at least raise a warning if a non-UTC timezone is used, that a conversion is taking place. If going with 1, we would need to special-case datetime types, that might mean extra logic in the _to_numpy
function, or having to keep array_to_datetime
(so don't merge #3507 yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also debating whether to following option 1 or 2, so a separate issue sounds good.
Since the current behavior is option 2 (as done in array_to_datetime
), I think we should go with option 2 first, then revisit the timezone support later, which will be a breaking change anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll open an issue to discuss (edit: see #3656). And sure, we can go with option 2 for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-approving, but see optional suggestions below.
@@ -18,6 +18,18 @@ | |||
|
|||
_HAS_PYARROW = True | |||
except ImportError: | |||
|
|||
class pa: # noqa: N801 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's keep this dummy class then.
Co-authored-by: Wei Ji <[email protected]>
pyarrow.timestamp
is the datetime dtype in pyarrow. It has two parameters,unit
andtz
.This PR adds tests to ensure
_to_numpy
supportpyarrow.timestamp
. Please note that,unit
can only bes
/ms
/us
/ns
, andActually, when converting to np.datetime64, timezone is corrected to UTC.tz
information is discarded when converting to np.datetime64.Related to #3600.