[Python][Parquet] timestamp[s] does not round-trip parquet serialization.
#41382
Comments
I guess this is expected. Please refer to https://github.com/apache/arrow/blob/main/docs/source/python/parquet.rst#storing-timestamps and https://arrow.apache.org/docs/cpp/parquet.html#logical-types. Maybe a "cast" can be applied here?
Of course one can perform a cast; the issue is that the time resolution of the data carries meaning, which is lost. If Bob has a timeseries with second resolution and sends it to Alice as a So, if
ParquetWriter currently has the option
Maybe you can apply a "cast" as a workaround first.
Personally, I think a warning might be annoying for most users, for whom this difference in timestamp unit is not critical. For example, until very recently we were also casting nanoseconds to microseconds by default (because only more recent Parquet versions support nanoseconds, and for compatibility with other readers it was best not to use this feature yet). Warning for this (except if actual data would be truncated) would be rather noisy, especially given that pandas uses nanoseconds by default; this is of course not the case for seconds.

In theory we could actually restore the original seconds resolution upon reading the Parquet file, because we store the original Arrow schema in the Parquet metadata (e.g. to allow restoring the timezone). But that would mean doing an actual cast after reading, incurring a cost for everyone (and not only for those who manually choose to do this).

We could also add a keyword to control this, although I am hesitant to do so given that we already have so many keywords. Also, if this preserved the current behaviour as the default, you would still need to do something manually to get the roundtripping, and then you could just as well cast manually.
Describe the bug, including details regarding any error messages, version, and platform.
Timestamps with second resolution get upcast to millisecond resolution when serializing and deserializing. They should either round-trip, or there should be a warning/error when attempting to serialize them.
Tested with pyarrow 16.0.0
Component(s)
Parquet