clib.conversion._to_numpy: Add tests for pandas.Series with datetime dtypes #3670

Open
wants to merge 9 commits into
base: main

Conversation

seisman
Member

@seisman seisman commented Dec 3, 2024

This PR adds tests for pandas.Series with datetime dtypes. Addresses #3600.

In pandas, datetime dtypes can be specified in the following ways:

  1. Via NumPy dtypes, e.g., "datetime64[s]"
  2. Via pandas.DatetimeTZDtype, e.g., pd.DatetimeTZDtype("s", tz="UTC") or "datetime64[s, UTC]"
  3. Via pyarrow timestamp types, e.g., pd.ArrowDtype(pyarrow.timestamp("s", tz="UTC")) or "timestamp[s, UTC][pyarrow]"

The following code helps us understand the default conversion behavior:

>>> import numpy as np
>>> import pandas as pd

>>> # The sample data
>>> data = [pd.Timestamp("2024-01-02T03:04:05"), pd.Timestamp("2024-01-02T03:04:06")]

Via NumPy dtypes. The conversions are done as expected.

>>> series = pd.Series(data, dtype="datetime64[s]")
>>> series
0   2024-01-02 03:04:05
1   2024-01-02 03:04:06
dtype: datetime64[s]
>>> np.ascontiguousarray(series)
array(['2024-01-02T03:04:05', '2024-01-02T03:04:06'],
      dtype='datetime64[s]')

Via pd.DatetimeTZDtype with a time zone. np.ascontiguousarray converts the pandas.Series object to object dtype, so we need to handle the conversion manually. The expected NumPy dtype and the time zone information can be accessed via series.dtype.base and series.dtype.tz.

>>> series = pd.Series(data, dtype="datetime64[s, America/New_York]")
>>> series
0   2024-01-02 03:04:05-05:00
1   2024-01-02 03:04:06-05:00
dtype: datetime64[s, America/New_York]
>>> np.ascontiguousarray(series)
array([Timestamp('2024-01-02 03:04:05-0500', tz='America/New_York'),
       Timestamp('2024-01-02 03:04:06-0500', tz='America/New_York')],
      dtype=object)

>>> series.dtype.base
dtype('<M8[s]')
>>> series.dtype.tz
<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
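
A minimal sketch of one way to do the manual conversion, assuming pandas 2.1 or later: pass series.dtype.base as the target dtype so that NumPy receives the underlying datetime64 values instead of Timestamp objects. The resulting values are expressed in UTC and the time zone is dropped. Variable names are illustrative, and this is not necessarily how _to_numpy implements it.

```python
import numpy as np
import pandas as pd

data = [pd.Timestamp("2024-01-02T03:04:05"), pd.Timestamp("2024-01-02T03:04:06")]
series = pd.Series(data, dtype="datetime64[s, America/New_York]")

# Request the timezone-naive NumPy dtype explicitly to avoid the object fallback.
array = np.ascontiguousarray(series, dtype=series.dtype.base)

# Cross-check: convert to UTC first, then drop the time zone information.
expected = series.dt.tz_convert("UTC").dt.tz_localize(None).to_numpy()
```

Because the sample wall times are localized to America/New_York (UTC-05:00 in January), 03:04:05 local time corresponds to 08:04:05 in the converted array.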

In pandas 2.0, there was a bug (pandas-dev/pandas#52705) in which data with pd.DatetimeTZDtype were stored in ns resolution regardless of the requested unit. The bug was fixed in pandas 2.1 (pandas-dev/pandas#52706), but there is no workaround on our side, so the related tests are marked as xfail for pandas 2.0.
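
As a rough illustration of such a conditional xfail, the sketch below marks only the timezone-aware parameter as an expected failure on pandas 2.0. The flag IS_PANDAS_2_0, the test name, and the exact assertion are hypothetical and are not taken from the PR's test suite.

```python
import numpy as np
import pandas as pd
import pytest

# Hypothetical flag: True on pandas 2.0.x, where the unit bug is present.
IS_PANDAS_2_0 = pd.__version__.startswith("2.0.")


@pytest.mark.parametrize(
    "dtype",
    [
        "datetime64[s]",
        pytest.param(
            "datetime64[s, UTC]",
            marks=pytest.mark.xfail(
                condition=IS_PANDAS_2_0,
                reason="pandas 2.0 stores DatetimeTZDtype data in ns resolution.",
            ),
        ),
    ],
)
def test_datetime_unit_is_preserved(dtype):
    """The converted array should keep the requested 's' unit."""
    series = pd.Series([pd.Timestamp("2024-01-02T03:04:05")], dtype=dtype)
    array = np.ascontiguousarray(series, dtype=series.dtype.base)
    assert np.datetime_data(array.dtype)[0] == "s"
```

Attaching the mark to a single pytest.param keeps the NumPy dtype case as a regular test on every pandas version.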


Via pyarrow timestamp types. Again, the pandas.Series object is converted to object dtype, so we need to handle the conversion manually. The expected NumPy datetime dtype and the time zone information can be accessed via series.dtype.numpy_dtype and series.dtype.pyarrow_dtype.tz.

>>> series = pd.Series(data, dtype="timestamp[s, America/New_York][pyarrow]")
>>> series
0    2024-01-01 22:04:05-05:00
1    2024-01-01 22:04:06-05:00
dtype: timestamp[s, tz=America/New_York][pyarrow]

>>> np.ascontiguousarray(series)
array([Timestamp('2024-01-01 22:04:05-0500', tz='America/New_York'),
       Timestamp('2024-01-01 22:04:06-0500', tz='America/New_York')],
      dtype=object)
>>> series.dtype.numpy_dtype
dtype('<M8[s]')
>>> series.dtype.pyarrow_dtype.tz
'America/New_York'

In pandas 2.0, series.dtype.numpy_dtype is dtype('O'), and the series.dt accessor doesn't have the tz_convert method.

Base automatically changed from to_numpy/pandas_numeric to main December 12, 2024 01:29
@seisman seisman force-pushed the to_numpy/pandas_datetime branch from 850ac31 to a64e9e3 Compare December 12, 2024 01:31
@seisman seisman force-pushed the to_numpy/pandas_datetime branch from a64e9e3 to 5867999 Compare December 12, 2024 07:22
@seisman seisman force-pushed the to_numpy/pandas_datetime branch from 068c5cb to 56a266d Compare December 12, 2024 07:59
@seisman seisman force-pushed the to_numpy/pandas_datetime branch from 6b7017b to fb14509 Compare December 12, 2024 09:55
@seisman seisman added the maintenance Boring but important stuff for the core devs label Dec 12, 2024
@seisman seisman marked this pull request as ready for review December 12, 2024 10:00
@seisman seisman added the needs review This PR has higher priority and needs review. label Dec 12, 2024
@seisman seisman added this to the 0.14.0 milestone Dec 12, 2024