Hive with legacy ORC files timestamp semantic issues, unnecessary convertToLocal ? #18844

Open
kocomic opened this issue Aug 29, 2023 · 1 comment

kocomic commented Aug 29, 2023

Trino changed the semantics of timestamps; see issue #37 for background. In Trino, the "timestamp" type refers to a point in time measured in seconds from 1970-01-01 00:00:00 and is not affected by the session's time zone.

In the old Presto 332, the values returned by the ORC reader match the values written by the ORC writer, provided the storage time zone matches the file's time zone and is not UTC. Timestamp values are encoded using three fields:

- Seconds since 2015-01-01 in the file's time zone
- Nanoseconds
- File time zone in the stripe footer

It appears that the file's time zone only affects the encoding and decoding process, with the original value remaining unchanged.
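
As a rough illustration of the three-field layout described above, here is a minimal `java.time` sketch. It is not the ORC or Trino implementation; the class and method names (`OrcTimestampSketch`, `baseEpochSeconds`, `encodeSeconds`, `decode`) are made up for this example, and it assumes the writer and the reader use the same file time zone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical sketch of the encoding described in this issue; not the real ORC code.
final class OrcTimestampSketch {
    // Base of the seconds field: 2015-01-01 00:00:00 interpreted in the file's time zone.
    static long baseEpochSeconds(ZoneId fileZone) {
        return LocalDateTime.of(2015, 1, 1, 0, 0, 0).atZone(fileZone).toEpochSecond();
    }

    // Writer side: wall-clock value -> seconds relative to the 2015 base in the file zone.
    static long encodeSeconds(LocalDateTime wallClock, ZoneId fileZone) {
        return wallClock.atZone(fileZone).toEpochSecond() - baseEpochSeconds(fileZone);
    }

    // Reader side: seconds + nanoseconds + file zone -> wall-clock value.
    static LocalDateTime decode(long seconds, int nanos, ZoneId fileZone) {
        long epochSeconds = baseEpochSeconds(fileZone) + seconds;
        return LocalDateTime.ofInstant(Instant.ofEpochSecond(epochSeconds, nanos), fileZone);
    }
}
```

Under these assumptions, encoding a wall-clock value and decoding it with the same zone returns the value that was written, which matches the observation that the file time zone only takes part in encoding and decoding.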

However, in Trino 424, when reading legacy files, the ORC reader still decodes using the same three fields as mentioned above. After decoding, the value is then converted using fileDateTimeZone.convertUTCToLocal:

File: io.trino.orc.reader.TimestampColumnReader.java, Line 408

if (!isFileUtc()) {
    millis = fileDateTimeZone.convertUTCToLocal(millis);
}

return millis * MICROSECONDS_PER_MILLISECOND;

This means that the original value written by the ORC writer is altered by the reader in Trino 424, leading to incorrect timestamp semantics. Here's an example:

Session Time Zone: UTC
ORC File/Storage Time Zone: Asia/Shanghai

Original value: 2020-01-01 00:00:00+00:00
Presto 332 read: 2020-01-01 00:00:00+00:00
Trino 424 read: 2020-01-01 08:00:00
Desired Trino 424 result: 2020-01-01 00:00:00
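
The eight-hour difference in this example is what one would expect from Joda-Time's `DateTimeZone.convertUTCToLocal`, which adds the zone's UTC offset to a millisecond value. The sketch below reproduces that arithmetic with `java.time` rather than Joda-Time, so `convertUtcToLocal` and `render` are hypothetical stand-ins, not the methods Trino calls, and the input is assumed to be the already-decoded value from the example.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical stand-in for the post-decoding conversion discussed in this issue.
final class ConvertToLocalSketch {
    // Mimics the arithmetic of Joda's convertUTCToLocal: add the zone's offset at that instant.
    static long convertUtcToLocal(long epochMillis, ZoneId zone) {
        int offsetSeconds = zone.getRules()
                .getOffset(Instant.ofEpochMilli(epochMillis))
                .getTotalSeconds();
        return epochMillis + offsetSeconds * 1000L;
    }

    // Renders epoch millis as a wall-clock value without applying any further zone shift.
    static LocalDateTime render(long epochMillis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
    }
}
```

For example, passing the millisecond value of 2020-01-01 00:00:00 together with `ZoneId.of("Asia/Shanghai")` to `convertUtcToLocal` and rendering the result gives 2020-01-01 08:00:00, the same shift as the Trino 424 read above.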

@ericlgoodman

@dain, I know you're quite familiar with ORC timezone semantics based on 53bafb4. Would you be able to take a look at this?