Add support for stats on timestamp type in Delta Lake #22159

Merged: ebyhr merged 3 commits into trinodb:master from ebi/delta-stats on Jun 12, 2024

ebyhr (Member) commented May 27, 2024

Description

Example of stats generated by Spark:

  • The transaction log stores timestamp stats with millisecond precision, e.g. 2020-08-26T01:02:03.123
  • The checkpoint stores timestamp stats with millisecond precision, e.g. 2024-05-29T15:08:00.991Z
  • The data file stores timestamps with microsecond precision as INT64, e.g. 2020-08-26T01:02:03.123987

Both INT96 and INT64 are allowed for timestamps in Parquet files: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-data-type-to-parquet-type-mappings
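
To illustrate the precision gap described above, here is a minimal sketch of turning a microsecond value from a data file into a millisecond-precision string of the shape shown for the transaction log. The class name, the `toStatsString` helper, and the formatter pattern are assumptions made for this example; they are not taken from the Trino code in this PR.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class StatsTimestampExample {
    // Hypothetical pattern matching the transaction log example (no zone suffix).
    private static final DateTimeFormatter MILLIS_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSS");

    // Hypothetical helper: epoch microseconds -> millisecond-precision string.
    static String toStatsString(long epochMicros) {
        // floorDiv/floorMod keep the fraction non-negative for pre-1970 values.
        long epochSeconds = Math.floorDiv(epochMicros, 1_000_000L);
        int nanos = (int) Math.floorMod(epochMicros, 1_000_000L) * 1_000;
        LocalDateTime ts = LocalDateTime.ofEpochSecond(epochSeconds, nanos, ZoneOffset.UTC);
        // Drop the microsecond digits, since the stats carry only milliseconds.
        return MILLIS_FORMAT.format(ts.truncatedTo(ChronoUnit.MILLIS));
    }

    public static void main(String[] args) {
        long epochMicros = ChronoUnit.MICROS.between(
                Instant.EPOCH, Instant.parse("2020-08-26T01:02:03.123987Z"));
        System.out.println(toStatsString(epochMicros)); // prints 2020-08-26T01:02:03.123
    }
}
```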

Fixes #21878

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* TBD. ({issue}`21878`)

cla-bot added the cla-signed label May 27, 2024
github-actions added the delta-lake (Delta Lake connector) label May 27, 2024
ebyhr marked this pull request as draft May 28, 2024 01:56
ebyhr force-pushed the ebi/delta-stats branch from 1c432cc to d474d49 May 29, 2024 08:16
ebyhr marked this pull request as ready for review May 29, 2024 09:22
ebyhr force-pushed the ebi/delta-stats branch 3 times, most recently from 91d5e77 to 65e1c3a June 7, 2024 02:09
ebyhr (Member Author) commented Jun 7, 2024

@findepi Addressed comments.

long epochSeconds = floorDiv(epochMicros, MICROSECONDS_PER_SECOND);
int nanoAdjustment = floorMod(epochMicros, MICROSECONDS_PER_SECOND) * NANOSECONDS_PER_MICROSECOND;
Instant ts = Instant.ofEpochSecond(epochSeconds, nanoAdjustment);
Instant truncatedToMillis = ts.truncatedTo(MILLIS);
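
For readers who want to run the excerpt above, the following is a self-contained sketch of the same arithmetic. The two constants are defined locally as stand-ins with the values their names imply, and the class and method names are hypothetical. The negative input shows why `floorDiv` and `floorMod` are used rather than `/` and `%`: they keep the nanosecond adjustment non-negative for timestamps before the epoch.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class FloorDivExample {
    // Local stand-ins for the constants referenced in the excerpt (assumed values).
    static final int MICROSECONDS_PER_SECOND = 1_000_000;
    static final int NANOSECONDS_PER_MICROSECOND = 1_000;

    // Hypothetical wrapper around the four lines quoted in the review.
    static Instant truncateToMillis(long epochMicros) {
        long epochSeconds = Math.floorDiv(epochMicros, MICROSECONDS_PER_SECOND);
        int nanoAdjustment = Math.floorMod(epochMicros, MICROSECONDS_PER_SECOND) * NANOSECONDS_PER_MICROSECOND;
        Instant ts = Instant.ofEpochSecond(epochSeconds, nanoAdjustment);
        return ts.truncatedTo(ChronoUnit.MILLIS);
    }

    public static void main(String[] args) {
        // One microsecond before the epoch: seconds = -1, fraction = 999999 micros.
        System.out.println(truncateToMillis(-1L)); // prints 1969-12-31T23:59:59.999Z
    }
}
```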
Review comment (Member):

  • Add a check that the type has at least millisecond precision.
  • Add a comment explaining why we truncate to millis even for timestamp micros.
    • However, it wouldn't be hard to pick the truncation unit based on the timestamp precision, so maybe let's do that instead of the comment.
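
One way the suggestion above could be read is a small mapping from timestamp precision to a `ChronoUnit`, combined with the requested precision check. This is only a sketch under assumptions: the helper takes a plain `int` precision (3 for millis, 6 for micros) instead of a Trino type, all names are hypothetical, and the thread does not settle whether writing stats at microsecond precision would be appropriate. Later in this thread the author chose to add a code comment instead.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TruncationUnitSketch {
    // Hypothetical helper: choose the truncation unit from the timestamp precision.
    static ChronoUnit truncationUnit(int precision) {
        switch (precision) {
            case 3:
                return ChronoUnit.MILLIS;
            case 6:
                return ChronoUnit.MICROS;
            default:
                // Covers the requested check: precisions below millis are rejected,
                // and so is anything else this sketch does not handle.
                throw new IllegalArgumentException("Unsupported timestamp precision: " + precision);
        }
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2020-08-26T01:02:03.123987Z");
        System.out.println(ts.truncatedTo(truncationUnit(3))); // prints 2020-08-26T01:02:03.123Z
        System.out.println(ts.truncatedTo(truncationUnit(6))); // prints 2020-08-26T01:02:03.123987Z
    }
}
```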

ebyhr (Member Author) replied Jun 11, 2024:

Are you suggesting a conversion from TimestampType.TIMESTAMP_MILLIS to ChronoUnit.MILLIS? Can you share the code snippet?

Member Author replied:

Let me add a code comment instead. I couldn't come up with simple code for such a conversion.

ebyhr force-pushed the ebi/delta-stats branch from 37233d9 to 24b56cc June 12, 2024 10:27
ebyhr merged commit 6966ad1 into trinodb:master Jun 12, 2024
25 checks passed
ebyhr deleted the ebi/delta-stats branch June 12, 2024 21:49
github-actions added this to the 450 milestone Jun 12, 2024
Labels: cla-signed, delta-lake (Delta Lake connector)

Linked issue (may be closed by merging this pull request): Delta Lake connector doesn't write min/max stats on BOOLEAN, TIMESTAMP, and VARBINARY types