Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clib.conversion._to_numpy: Add tests for PyArrow's timestamp type #3621

Merged
merged 9 commits into from
Nov 28, 2024

Conversation

seisman
Copy link
Member

@seisman seisman commented Nov 15, 2024

pyarrow.timestamp is the datetime dtype in pyarrow. It has two parameters, unit and tz.

This PR adds tests to ensure _to_numpy support pyarrow.timestamp. Please note that, unit can only be s/ms/us/ns, and tz information is discarded when converting to np.datetime64. Actually, when converting to np.datetime64, timezone is corrected to UTC.

In [1]: import pyarrow as pa

In [2]: import datetime

In [3]: import numpy as np

In [4]: x = pa.array([datetime.datetime(2024, 1, 1, 12, 34, 56)], type=pa.timestamp("s"))

In [5]: np.ascontiguousarray(x)
Out[5]: array(['2024-01-01T12:34:56'], dtype='datetime64[s]')

In [6]: x = pa.array([datetime.datetime(2024, 1, 1, 12, 34, 56)], type=pa.timestamp("ms"))

In [7]: np.ascontiguousarray(x)
Out[7]: array(['2024-01-01T12:34:56.000'], dtype='datetime64[ms]')

In [8]: x = pa.array([datetime.datetime(2024, 1, 1, 12, 34, 56)], type=pa.timestamp("ms", tz="UTZ"))

In [9]: np.ascontiguousarray(x)
Out[9]: array(['2024-01-01T12:34:56.000'], dtype='datetime64[ms]')

Related to #3600.

@seisman seisman added maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog needs review This PR has higher priority and needs review. labels Nov 15, 2024
@seisman seisman added this to the 0.14.0 milestone Nov 15, 2024
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 16, 2024
@seisman seisman marked this pull request as draft November 16, 2024 00:21
@seisman seisman marked this pull request as ready for review November 16, 2024 07:39
@seisman seisman added the needs review This PR has higher priority and needs review. label Nov 16, 2024
@@ -18,6 +18,18 @@

_HAS_PYARROW = True
except ImportError:

class pa: # noqa: N801
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pa.timestamp() with tz doesn't have string aliases, so we can't use strings like timestamp[s, UTC]. So we have to define a dummy function here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid this dummy class by starting from a pandas.Timestamp instead of Python datetime? The timezone info can be carried over:

import pyarrow as pa
import pandas as pd
from zoneinfo import ZoneInfo

array = pa.array(
    [
        pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
        pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
    ]
)
print(array.type)
# timestamp[us, tz=America/New_York]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but I feel it will make the pytest parametrize more complicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's keep this dummy class then.

@seisman seisman requested a review from weiji14 November 21, 2024 04:35
Comment on lines 396 to 417
pytest.param(
pa.timestamp("s", tz="America/New_York"),
"datetime64[s]",
id="timestamp[s, tz=America/New_York]",
),
pytest.param(
pa.timestamp("s", tz="+07:30"),
"datetime64[s]",
id="timestamp[s, tz=+07:30]",
),
],
)
def test_to_numpy_pyarrow_timestamp(dtype, expected_dtype):
"""
Test the _to_numpy function with PyArrow arrays of PyArrow datetime types.

pyarrow.timestamp(unit, tz=None) can accept units "s", "ms", "us", and "ns".

Reference: https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html
"""
data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the timezone part of the pa.timestamp type actually working?

data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
print(array)
# [
#   2024-01-02 03:04:05Z,
#   2024-01-02 03:04:06Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T03:04:05' '2024-01-02T03:04:06']

If using pd.Timestamp instead

import pandas as pd
import pyarrow as pa
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from pygmt.clib.conversion import _to_numpy

array = pa.array(
    [
        pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
        pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
    ]
)
print(array)
# [
#   2024-01-02 08:04:05.000000Z,
#   2024-01-02 08:04:06.000000Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T08:04:05.000000' '2024-01-02T08:04:06.000000']

For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the timezone information is applied but not shown when representing the array.

In [1]: from datetime import datetime

In [2]: import pyarrow as pa

# For pa.scalar, it's shown with timezone offset
In [3]: pa.scalar(datetime(2024, 1, 2, 3, 4, 5), type=pa.timestamp("s", tz="America/New_York"))
Out[3]: <pyarrow.TimestampScalar: '2024-01-01T22:04:05-0500'>

In [4]: array = pa.array([datetime(2024, 1, 2, 3, 4, 5)], type=pa.timestamp("s", tz="America/New_York"))
# For pa.array, it's shown without timezone offset
In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f52feace740>
[
  2024-01-02 03:04:05Z
]

# When converted to pandas, it's shown with timezone offset
In [6]: array.to_pandas()
Out[6]:
0   2024-01-01 22:04:05-05:00
dtype: datetime64[s, America/New_York]

# When converted to numpy, timezone information is lost, since numpy doesn's have tz support.
In [7]: array.to_numpy()
Out[7]: array(['2024-01-02T03:04:05'], dtype='datetime64[s]')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely related to apache/arrow#40122.

Copy link
Member Author

@seisman seisman Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the timezone part of the pa.timestamp type actually working?

data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]
array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))
print(array)
# [
#   2024-01-02 03:04:05Z,
#   2024-01-02 03:04:06Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T03:04:05' '2024-01-02T03:04:06']

If using pd.Timestamp instead

import pandas as pd
import pyarrow as pa
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
from pygmt.clib.conversion import _to_numpy

array = pa.array(
    [
        pd.Timestamp(2024, 1, 2, 3, 4, 5, tzinfo=ZoneInfo("America/New_York")),
        pd.Timestamp(2024, 1, 2, 3, 4, 6, tzinfo=ZoneInfo("America/New_York")),
    ]
)
print(array)
# [
#   2024-01-02 08:04:05.000000Z,
#   2024-01-02 08:04:06.000000Z
# ]
result = _to_numpy(array)
print(result)
# ['2024-01-02T08:04:05.000000' '2024-01-02T08:04:06.000000']

For the first one, the timezone offset for New York (UTC-5) doesn't seem to be applied? Whereas for the second one, it is converting from New York timezone (UTC-5) to UTC by adding 5 hours.

Also need to note the difference between datetime and pd.Timestamp. datetime is timezone-unware, while pd.Timestamp has timezone support.

So, in your first example:

  • actual datetime: 2024-01-02T03:04:05+00:00
  • In pyarrow.array, the actual datetime is 2024-01-01T22:04:05-05:00, but UTC time (2024-01-02T03:04:05) is shown
  • to_numpy: 2024-01-02T03:04:05 (UTC time)

In your 2nd example:

  • the actual datetime: 2024-01-02T03:04:05-05:00
  • In pyarrow.array: the UTC time (2024-01-02 08:04:05) is shown
  • to_numpy: 2024-01-02T08:04:05

So, I think there is no inconsistency. The pyarrow.array has timezone stored internally, but its repr always shows UTC time. And when converting to numpy array (using either its own to_numpy() method or np.ascontinuousarray()), the datetime are always converted to UTC since numpy.datetime64 doesn't have timezone support.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://stackoverflow.com/a/73276431

Arrow internally stores datetime as UTC + timezone info and will print it as such

Copy link
Member

@weiji14 weiji14 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see now that there is indeed consistency, thanks for clearing this up, the repr on pyarrow.array is indeed confusing 😅 . So in summary:

class/type has timezone support Link
Python datetime https://docs.python.org/3/library/datetime.html#datetime.datetime.tzinfo
pandas.Timestamp https://pandas.pydata.org/pandas-docs/version/2.2/reference/api/pandas.Timestamp.html
pyarrow.timestamp (type) https://arrow.apache.org/docs/17.0/python/generated/pyarrow.timestamp.html
numpy.datetime64 https://numpy.org/doc/2.1/reference/arrays.scalars.html#numpy.datetime64

Since we're converting to NumPy, the timezone information will always be lost (everything is converted to UTC). What does this mean for PyGMT/GMT? Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As shown below, GMT itself doesn't have timezone support, similar to numpy.datetime64. The big difference is that GMT simply ignore the TZ infor, while the to_numpy conversion always converts to UTC.

gmt begin map
gmt basemap -R2024-01-01T00:00:00+08:00/2024-01-02T00:00:00+08:00/1/10 -JX10c/4c -Baf 
echo 2024-01-01T08:00:00+08:00 5 | gmt plot -Sc1c -Gblack
gmt end show

map

Does it mean that users cannot plot data in a specific timezone, because it will always convert to UTC?

I think yes, unless we try hard to drop the TZ information before converting to numpy.datetime64. Below is an example, but if we want to support TZ, we need to find a more general way to do it.

In [1]: import pyarrow as pa

In [2]: from datetime import datetime

In [3]: data = [datetime(2024, 1, 2, 3, 4, 5), datetime(2024, 1, 2, 3, 4, 6)]

In [4]: array = pa.array(data, type=pa.timestamp("s", tz="America/New_York"))

In [5]: array
Out[5]:
<pyarrow.lib.TimestampArray object at 0x7f78c07bafe0>
[
  2024-01-02 03:04:05Z,
  2024-01-02 03:04:06Z
]

In [6]: array.to_pandas().dt.tz_localize(None)
Out[6]:
0   2024-01-01 22:04:05
1   2024-01-01 22:04:06
dtype: datetime64[s]

In [7]: import numpy as np

In [8]: np.ascontiguousarray(array.to_pandas().dt.tz_localize(None))
Out[8]:
array(['2024-01-01T22:04:05', '2024-01-01T22:04:06'],
      dtype='datetime64[s]')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to open an issue to discuss this - whether to:

  1. Follow GMT (allow users to plot data at a non-UTC timezone, by ignoring the timezone offset)
  2. Follow NumPy, whereby the data will from any non-UTC timezone will be converted to UTC always.

If going with 2, we should at least raise a warning if a non-UTC timezone is used, that a conversion is taking place. If going with 1, we would need to special-case datetime types, that might mean extra logic in the _to_numpy function, or having to keep array_to_datetime (so don't merge #3507 yet).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also debating whether to following option 1 or 2, so a separate issue sounds good.

Since the current behavior is option 2 (as done in array_to_datetime), I think we should go with option 2 first, then revisit the timezone support later, which will be a breaking change anyway.

Copy link
Member

@weiji14 weiji14 Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll open an issue to discuss (edit: see #3656). And sure, we can go with option 2 for now.

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-approving, but see optional suggestions below.

pygmt/tests/test_clib_to_numpy.py Outdated Show resolved Hide resolved
@@ -18,6 +18,18 @@

_HAS_PYARROW = True
except ImportError:

class pa: # noqa: N801
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's keep this dummy class then.

pygmt/tests/test_clib_to_numpy.py Outdated Show resolved Hide resolved
pygmt/tests/test_clib_to_numpy.py Outdated Show resolved Hide resolved
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 28, 2024
@seisman seisman enabled auto-merge (squash) November 28, 2024 09:50
@seisman seisman merged commit 6766b84 into main Nov 28, 2024
18 checks passed
@seisman seisman deleted the to_numpy/pyarrow_timestamp branch November 28, 2024 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants