Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use and record writer time zone in ORC files #212

Merged
merged 2 commits into from
Feb 15, 2019

Conversation

electrum
Copy link
Member

Early versions of the Apache ORC writer made the mistake of recording
timestamps from an epoch that was relative to the time zone of the
writer. This was fixed in later versions by recording the writer time
zone in the stripe footer. Hive 3.1 always writes using UTC.

Presto used a global configuration for the writer time zone, which
was needed to handle old files, but was never updated to use the time
zone from the stripe footer.

On read, Presto now uses the stripe value if present, otherwise it
uses the configured value. On write, Presto continues to write
timestamps using the configured time zone, but now records this value
when writing files.

@cla-bot cla-bot bot added the cla-signed label Feb 12, 2019
@electrum electrum requested a review from dain February 12, 2019 09:01
@findepi
Copy link
Member

findepi commented Feb 12, 2019

Hive 3.1 always writes using UTC. ..
On write, Presto continues to write
timestamps using the configured time zone, but now records this value
when writing files.

Can you share rationale for not using UTC when writing?

For timestamp values, using some other zone than UTC may mean that some values cannot be represented, right?

@sopel39
Copy link
Member

sopel39 commented Feb 12, 2019

For timestamp values, using some other zone than UTC may mean that some values cannot be represented, right?

Could you give an example? I don't think there are gaps in time in terms of missing milliseconds from relative epoch.

@electrum
Copy link
Member Author

electrum commented Feb 12, 2019

Can you share rationale for not using UTC when writing?

We should switch in the future, but if we did it now, older versions of Presto would return the wrong answer for these files (if hive.time-zone is not UTC). Maybe switch this in a year, after which we hope no one is using mixed versions over the same set of files.

For timestamp values, using some other zone than UTC may mean that some values cannot be represented, right?

The time zone is for the epoch, not the timestamp itself. The only potential (already existing) problem is if the epoch (2015-01-01 00:00:00) did not exist in that time zone.

@electrum electrum force-pushed the orc-timestamp branch 2 times, most recently from 4fe3800 to 1cb8879 Compare February 12, 2019 23:09
@dain dain self-assigned this Feb 14, 2019
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, but otherwise looks good

@dain dain assigned electrum and unassigned dain Feb 14, 2019
Early versions of the Apache ORC writer made the mistake of recording
timestamps from an epoch that was relative to the time zone of the
writer. This was fixed in later versions by recording the writer time
zone in the stripe footer. Hive 3.1 always writes using UTC.

Presto used a global configuration for the writer time zone, which
was needed to handle old files, but was never updated to use the time
zone from the stripe footer.

On read, Presto now uses the stripe value if present, otherwise it
uses the configured value. On write, Presto continues to write
timestamps using the configured time zone, but now records this value
when writing files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants