[Python] Pandas roundtrip doesn't preserve list of datetime objects #19770

Closed
asfimport opened this issue Oct 6, 2018 · 6 comments
@asfimport
Collaborator

asfimport commented Oct 6, 2018

Adding the following to the pandas_example.py::dataframe_with_lists function:

datetime_data = [
     [datetime(2015, 1, 5, 12, 0, 0), datetime(2020, 8, 22, 10, 5, 0)],
     [datetime(2024, 5, 5, 5, 49, 1), datetime(2015, 12, 24, 22, 10, 17)],
     [datetime(1996, 4, 30, 2, 38, 11)],
     None,
     [datetime(1987, 1, 27, 8, 21, 59)]
]

type = pa.timestamp(unit)  # for each unit in 's', 'ms', 'us', 'ns'

breaks the test cases, because the roundtrip doesn't preserve the object type.

Reporter: Krisztian Szucs / @kszucs

Note: This issue was originally created as ARROW-3448. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Is this something we would like to fix?
That is: change the behaviour (though I am not sure this is desired), provide some way to specify this as an option, or save more information in the pandas metadata?

I think in general ListArrays are always converted to columns of arrays (not preserving the original column of lists), and here in addition the question is whether it should be object dtype or datetime64 (for the nested array dtype).

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Should we close this issue?

@asfimport
Collaborator Author

Krisztian Szucs / @kszucs:
This might have been resolved since the creation of the ticket. I quickly checked the test cases with timestamp columns and the roundtrip seems to work - except in the nanosecond case when we use pandas timestamp objects.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
I don't think this is actually "fixed". When converting back the nested list with timestamp columns, we create numpy arrays with datetime64 dtype, not object dtype arrays with datetime objects:

>>> df = pd.DataFrame({'a': datetime_data})
>>> table = pa.table(df)
>>> table 
pyarrow.Table
a: list<item: timestamp[us]>
  child 0, item: timestamp[us]

>>> table.to_pandas()['a'][0]
array(['2015-01-05T12:00:00.000000', '2020-08-22T10:05:00.000000'],
      dtype='datetime64[us]')

>>> df['a'][0]
[datetime.datetime(2015, 1, 5, 12, 0), datetime.datetime(2020, 8, 22, 10, 5)]

But as I mentioned above, not sure we actually want to change this behaviour.

@asfimport
Collaborator Author

Krisztian Szucs / @kszucs:
I can't recall the original error, but it's more about keeping the dataframes equal before and after the roundtrip. Perhaps pandas has changed the semantics of checking equality of numpy arrays with datetime objects; with older pandas versions the roundtrip fails: https://github.com/apache/arrow/pull/10866/checks?check_run_id=3242074723#step:8:7795

So assert_frame_equal(pa.table(df).to_pandas(), df) did not pass previously.

@AlenkaF
Member

AlenkaF commented Mar 23, 2023

I think this can be closed as the datetime object can now be preserved with the use of timestamp_as_object=True:

import pyarrow as pa
pa.__version__
# '12.0.0.dev279+gb20734438'

import pandas as pd
from datetime import datetime

datetime_data = [
     [datetime(2015, 1, 5, 12, 0, 0), datetime(2020, 8, 22, 10, 5, 0)],
     [datetime(2024, 5, 5, 5, 49, 1), datetime(2015, 12, 24, 22, 10, 17)],
     [datetime(1996, 4, 30, 2, 38, 11)],
     None,
     [datetime(1987, 1, 27, 8, 21, 59)]
]

df = pd.DataFrame({'a': datetime_data})
table = pa.table(df)
table.to_pandas(timestamp_as_object=True).values
# array([[array([datetime.datetime(2015, 1, 5, 12, 0),
#                datetime.datetime(2020, 8, 22, 10, 5)], dtype=object)],
#        [array([datetime.datetime(2024, 5, 5, 5, 49, 1),
#                datetime.datetime(2015, 12, 24, 22, 10, 17)], dtype=object)],
#        [array([datetime.datetime(1996, 4, 30, 2, 38, 11)], dtype=object)],
#        [None],
#        [array([datetime.datetime(1987, 1, 27, 8, 21, 59)], dtype=object)]],
#       dtype=object)

There is still an issue where the list roundtrips to a numpy array of numpy arrays, but other issues already track this (#34574, #20222). We could consider supporting an option to preserve the list dtype as well, but that should come after the optimisation of to_pylist (#28694).

@AlenkaF closed this as not planned on Mar 23, 2023