Add $file_modified_time column in Iceberg #13082

Merged
merged 1 commit into from
Jul 27, 2022

Conversation

ebyhr
Member

@ebyhr ebyhr commented Jul 5, 2022

Description

Add $file_modified_time column in Iceberg

Documentation

(x) Sufficient documentation is included in this PR.

Release notes

(x) Release notes entries required with the following suggested text:

# Iceberg
* Add support for hidden `$file_modified_time` columns. ({issue}`13082`)

@cla-bot cla-bot bot added the cla-signed label Jul 5, 2022
@github-actions github-actions bot added the docs label Jul 5, 2022
@ebyhr ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch 2 times, most recently from b9c4a00 to 8573440 Compare July 5, 2022 02:42
@ebyhr ebyhr requested review from findinpath and homar July 5, 2022 06:43
@ebyhr ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from 8573440 to 4969f32 Compare July 5, 2022 09:09
@ebyhr ebyhr requested a review from electrum July 5, 2022 09:43
Member

@phd3 phd3 left a comment

Thanks @ebyhr, could you please help me understand the purpose of this change? If someone wants to look at records as of a particular point in time, can't they just use the AS OF syntax?
If the goal is to expose low-level file-status info for every record for debugging, are we also planning to add other fields from FileStatus?

@@ -622,6 +624,12 @@ Retrieve all records that belong to a specific file using ``"$path"`` filter::
FROM iceberg.web.page_views
WHERE "$path" = '/usr/iceberg/table/web.page_views/data/file_01.parquet'

Retrieve all records that belong to a specific file using ``"$file_modified_time"`` filter::
Member

IMO an example with inequality based filtering would be more practical, but nbd
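
For illustration, the inequality-style condition suggested here can be modeled in plain Java. This is only a sketch of the selection logic, not the connector's implementation: `DataFileInfo`, `modifiedBefore`, the paths, and the timestamps are all hypothetical names and values. In SQL the corresponding predicate would presumably look like `WHERE "$file_modified_time" < TIMESTAMP '2022-07-01 00:00:00 UTC'`.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class ModifiedTimeFilterDemo
{
    // Hypothetical stand-in for a data file and its modification time
    static final class DataFileInfo
    {
        final String path;
        final Instant modifiedTime;

        DataFileInfo(String path, Instant modifiedTime)
        {
            this.path = path;
            this.modifiedTime = modifiedTime;
        }
    }

    // Keep only the paths of files modified strictly before the cutoff
    static List<String> modifiedBefore(List<DataFileInfo> files, Instant cutoff)
    {
        return files.stream()
                .filter(file -> file.modifiedTime.isBefore(cutoff))
                .map(file -> file.path)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        // Made-up files and timestamps, for illustration only
        List<DataFileInfo> files = List.of(
                new DataFileInfo("data/file_01.parquet", Instant.parse("2022-06-01T00:00:00Z")),
                new DataFileInfo("data/file_02.parquet", Instant.parse("2022-07-20T00:00:00Z")));

        System.out.println(modifiedBefore(files, Instant.parse("2022-07-01T00:00:00Z")));
    }
}
```

An equality filter on a millisecond timestamp matches at most the files written at that exact instant, whereas a range condition selects every file older (or newer) than a cutoff.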

@ebyhr
Member Author

ebyhr commented Jul 5, 2022

@phd3 The main purpose is to use it as an additional condition in the OPTIMIZE procedure (this requires #13012, though). Debugging is another purpose, as you mentioned, but it's not the primary one.

@ebyhr ebyhr requested a review from phd3 July 8, 2022 03:19
Member

@phd3 phd3 left a comment

If we're adding this specifically for OPTIMIZE, would it make more sense to provide it as an executeProperty rather than an explicit virtual column? However, this seems reasonable for consistency with the Delta and Hive connectors.

@@ -71,6 +72,9 @@ private static Types.NestedField toNestedField(Schema tableSchema, IcebergColumn
if (columnHandle.isPathColumn()) {
    return FILE_PATH;
}
if (columnHandle.isFileModifiedTimeColumn()) {
    return Types.NestedField.of(FILE_MODIFIED_TIME.getId(), false, FILE_MODIFIED_TIME.getColumnName(), Types.StructType.of());
Member

why is StructType.of used here?

import static io.trino.spi.type.VarcharType.VARCHAR;

public enum IcebergMetadataColumn
{
    FILE_PATH(MetadataColumns.FILE_PATH.fieldId(), "$path", VARCHAR, PRIMITIVE),
    FILE_MODIFIED_TIME(Integer.MAX_VALUE - 1001, "$file_modified_time", TIMESTAMP_TZ_MILLIS, PRIMITIVE),
Member

Can we file an issue in the Iceberg repo and link it here (maybe add a code comment too) to standardize the ID for this metadata column, so that we don't forget? It would be good to get general consensus in the Iceberg community before merging this.
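
To make the concern concrete, here is a minimal sketch of why an ad hoc ID near the top of the `int` range calls for coordination. The ID expression is taken from the diff above; the `reserved` set and the `collides` helper are hypothetical and do not reflect the IDs that Iceberg actually reserves.

```java
import java.util.HashSet;
import java.util.Set;

public class MetadataColumnIdDemo
{
    // Same expression as in the diff above
    static final int FILE_MODIFIED_TIME_ID = Integer.MAX_VALUE - 1001;

    // True if the candidate ID is already claimed by another metadata column
    static boolean collides(int candidateId, Set<Integer> reservedIds)
    {
        return reservedIds.contains(candidateId);
    }

    public static void main(String[] args)
    {
        // Hypothetical reserved range: the top 200 int values (an assumption for illustration)
        Set<Integer> reserved = new HashSet<>();
        for (int offset = 1; offset <= 200; offset++) {
            reserved.add(Integer.MAX_VALUE - offset);
        }

        System.out.println(FILE_MODIFIED_TIME_ID); // prints 2147482646
        System.out.println(collides(FILE_MODIFIED_TIME_ID, reserved)); // prints false

        // If the format later reserved this ID for something else, the two would clash
        reserved.add(Integer.MAX_VALUE - 1001);
        System.out.println(collides(FILE_MODIFIED_TIME_ID, reserved)); // prints true
    }
}
```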

Member Author

@@ -493,6 +504,9 @@ else if (partitionKeys.containsKey(column.getId())) {
else if (column.isPathColumn()) {
    columnAdaptations.add(ColumnAdaptation.constantColumn(nativeValueToBlock(FILE_PATH.getType(), utf8Slice(path.toString()))));
}
else if (column.isFileModifiedTimeColumn()) {
    columnAdaptations.add(ColumnAdaptation.constantColumn(nativeValueToBlock(FILE_MODIFIED_TIME.getType(), packDateTimeWithZone(fileModifiedTime.orElseThrow(), DateTimeZone.getDefault().getID()))));
Member

Should the zone be configurable or should it be UTC? I think in Delta we do UTC

Member Author

While Delta uses exactly UTC, the Hive connector uses DateTimeZone.getDefault().

@findepi Do you have any opinion?

Member

It should be UTC.

Hive's should be changed to UTC as well

some ref: #13157 (comment)

Member Author

Thanks for sharing the context. Changed to UTC.
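
As a rough illustration of why the zone choice matters, the sketch below attaches different zones to the same modification time. It uses `java.time` rather than the Joda-Time `DateTimeZone` and the `packDateTimeWithZone` helper from the diff, and the epoch value and the `withZone` helper are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class FileModifiedTimeZoneDemo
{
    // Attach an explicit zone to a file's modification time (epoch milliseconds)
    static ZonedDateTime withZone(long epochMillis, ZoneId zone)
    {
        return Instant.ofEpochMilli(epochMillis).atZone(zone);
    }

    public static void main(String[] args)
    {
        long fileModifiedMillis = 1_658_900_000_000L; // hypothetical modification time

        ZonedDateTime utc = withZone(fileModifiedMillis, ZoneOffset.UTC);
        ZonedDateTime tokyo = withZone(fileModifiedMillis, ZoneId.of("Asia/Tokyo"));
        ZonedDateTime jvmDefault = withZone(fileModifiedMillis, ZoneId.systemDefault());

        // The underlying instant is identical in every case
        System.out.println(utc.toInstant().equals(tokyo.toInstant()));

        // The rendered value differs by zone; the last line depends on the machine's settings
        System.out.println(utc);
        System.out.println(tokyo);
        System.out.println(jvmDefault);
    }
}
```

With the JVM default zone, the same file can render differently depending on where the code runs, while a fixed UTC zone renders identically everywhere.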

@ebyhr
Member Author

ebyhr commented Jul 21, 2022

Just rebased on upstream to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from d51dad5 to cfdd614 Compare July 21, 2022 13:50
@ebyhr ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from aca6d70 to 2c86a9a Compare July 26, 2022 07:11
@ebyhr ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch 2 times, most recently from 2d4aafb to ed4fcc6 Compare July 27, 2022 02:59
@ebyhr ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from ed4fcc6 to 1b4d0cb Compare July 27, 2022 04:44
@ebyhr ebyhr merged commit f99ca9c into trinodb:master Jul 27, 2022
@ebyhr ebyhr deleted the ebi/iceberg-file-modified-time-v2 branch July 27, 2022 06:41
@ebyhr ebyhr mentioned this pull request Jul 27, 2022
@github-actions github-actions bot added this to the 392 milestone Jul 27, 2022
5 participants