[Python] Pandas roundtrip doesn't preserve list of datetime objects #19770

asfimport · 2018-10-06T07:44:42Z

Adding the following to the pandas_example.py::dataframe_with_lists function:

datetime_data = [
     [datetime(2015, 1, 5, 12, 0, 0), datetime(2020, 8, 22, 10, 5, 0)],
     [datetime(2024, 5, 5, 5, 49, 1), datetime(2015, 12, 24, 22, 10, 17)],
     [datetime(1996, 4, 30, 2, 38, 11)],
     None,
     [datetime(1987, 1, 27, 8, 21, 59)]
]

type = pa.timestamp(unit)  # for each unit in 's', 'ms', 'us', 'ns'

breaks the test cases, because the roundtrip doesn't preserve the object type.

Reporter: Krisztian Szucs / @kszucs

Related issues:

[Python] timestamp_as_object support for pa.Table.to_pandas in pyarrow (relates to)

PRs and other links:

GitHub Pull Request #10866

Note: This issue was originally created as ARROW-3448. Please see the migration documentation for further details.

asfimport · 2019-05-08T08:00:58Z

Joris Van den Bossche / @jorisvandenbossche:
Is this something we would like to fix?
That is: should we change the behaviour (I am not sure this is desired), provide some way to specify this as an option, or save more information in the pandas metadata?

I think in general ListArrays are always converted to columns of arrays (not preserving the original column of lists), and here in addition the question is whether it should be object dtype or datetime64 (for the nested array dtype).

asfimport · 2021-08-04T10:34:42Z

Antoine Pitrou / @pitrou:
Should we close this issue?

asfimport · 2021-08-04T13:26:34Z

Krisztian Szucs / @kszucs:
This might have been resolved since the creation of the ticket. I quickly checked the test cases with timestamp columns and the roundtrip seems to work - except in the nanosecond case when we use pandas timestamp objects.

asfimport · 2021-08-04T14:12:14Z

Joris Van den Bossche / @jorisvandenbossche:
I don't think this is actually "fixed". When converting back the nested list with timestamp columns, we create numpy arrays with datetime64 dtype, not object dtype arrays with datetime objects:

>>> df = pd.DataFrame({'a': datetime_data})
>>> table = pa.table(df)
>>> table 
pyarrow.Table
a: list<item: timestamp[us]>
  child 0, item: timestamp[us]

>>> table.to_pandas()['a'][0]
array(['2015-01-05T12:00:00.000000', '2020-08-22T10:05:00.000000'],
      dtype='datetime64[us]')

>>> df['a'][0]
[datetime.datetime(2015, 1, 5, 12, 0), datetime.datetime(2020, 8, 22, 10, 5)]

But as I mentioned above, not sure we actually want to change this behaviour.

asfimport · 2021-08-04T14:47:26Z

Krisztian Szucs / @kszucs:
I can't recall the original error, but it's more about keeping the dataframes equal before and after the roundtrip. Perhaps pandas has changed the semantics of checking equality of numpy arrays with datetime objects; for older pandas versions the roundtrip fails: https://github.com/apache/arrow/pull/10866/checks?check_run_id=3242074723#step:8:7795

So assert_series_equal(pa.table(df).to_pandas(), df) wasn't true previously.

AlenkaF · 2023-03-23T11:52:11Z

I think this can be closed as the datetime object can now be preserved with the use of timestamp_as_object=True:

import pyarrow as pa
pa.__version__
# '12.0.0.dev279+gb20734438'

import pandas as pd
from datetime import datetime

datetime_data = [
     [datetime(2015, 1, 5, 12, 0, 0), datetime(2020, 8, 22, 10, 5, 0)],
     [datetime(2024, 5, 5, 5, 49, 1), datetime(2015, 12, 24, 22, 10, 17)],
     [datetime(1996, 4, 30, 2, 38, 11)],
     None,
     [datetime(1987, 1, 27, 8, 21, 59)]
]

df = pd.DataFrame({'a': datetime_data})
table = pa.table(df)
table.to_pandas(timestamp_as_object=True).values
# array([[array([datetime.datetime(2015, 1, 5, 12, 0),
#                datetime.datetime(2020, 8, 22, 10, 5)], dtype=object)],
#        [array([datetime.datetime(2024, 5, 5, 5, 49, 1),
#                datetime.datetime(2015, 12, 24, 22, 10, 17)], dtype=object)],
#        [array([datetime.datetime(1996, 4, 30, 2, 38, 11)], dtype=object)],
#        [None],
#        [array([datetime.datetime(1987, 1, 27, 8, 21, 59)], dtype=object)]],
#       dtype=object)

There is still an issue where the list roundtrips to a numpy array of numpy arrays, but there are other issues tracking this (#34574, #20222) - we could think of supporting an option to preserve list dtype also. But this should come after the optimisation of to_pylist (#28694)

asfimport mentioned this issue Jan 11, 2023

[Python] timestamp_as_object support for pa.Table.to_pandas in pyarrow #21818

Closed

AlenkaF closed this as not planned · Mar 23, 2023
