Add $file_modified_time column in Iceberg #13082

ebyhr · 2022-07-05T00:41:56Z

Description

Add $file_modified_time column in Iceberg

Documentation

(x) Sufficient documentation is included in this PR.

Release notes

(x) Release notes entries required with the following suggested text:

# Iceberg
* Add support for hidden `$file_modified_time` columns. ({issue}`13082`)

docs/src/main/sphinx/connector/iceberg.rst

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadataColumn.java

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java

phd3

Thanks @ebyhr, could you please help me understand the purpose of this change? If someone wants to look at records as of a particular point in time, can't they just use the AS OF syntax?
If the goal is to expose low-level file-status info for every record for debugging, are we also planning to add other fields from FileStatus?

phd3 · 2022-07-05T15:10:55Z

docs/src/main/sphinx/connector/iceberg.rst

@@ -622,6 +624,12 @@ Retrieve all records that belong to a specific file using ``"$path"`` filter::
    FROM iceberg.web.page_views
    WHERE "$path" = '/usr/iceberg/table/web.page_views/data/file_01.parquet'

+Retrieve all records that belong to a specific file using ``"$file_modified_time"`` filter::


IMO an example with inequality-based filtering would be more practical, but it's not a big deal.

ebyhr · 2022-07-05T23:57:04Z

@phd3 The main purpose is to use it as an additional condition in the OPTIMIZE procedure (this requires #13012, though). Another purpose is debugging, as you mentioned, but that is not the primary one.

phd3

If we're adding this specifically for OPTIMIZE, would it make more sense to provide it as an executeProperty rather than an explicit virtual column? However, this seems reasonable from the perspective of consistency with the Delta and Hive connectors.
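For illustration, the kind of selection that a predicate on the file modified time would enable (for example, picking only files last modified before some cutoff as OPTIMIZE input) can be sketched in plain Java. This is not Trino or Iceberg code: `DataFileInfo`, `modifiedBefore`, the paths, and the timestamps are all hypothetical names and values chosen for the sketch.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

class ModifiedTimeFilterSketch
{
    // Hypothetical stand-in for a data file together with its modification time.
    static final class DataFileInfo
    {
        final String path;
        final Instant modifiedTime;

        DataFileInfo(String path, Instant modifiedTime)
        {
            this.path = path;
            this.modifiedTime = modifiedTime;
        }
    }

    // Return the paths of files last modified strictly before the cutoff,
    // i.e. the effect of an inequality predicate on the modified time.
    static List<String> modifiedBefore(List<DataFileInfo> files, Instant cutoff)
    {
        return files.stream()
                .filter(file -> file.modifiedTime.isBefore(cutoff))
                .map(file -> file.path)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<DataFileInfo> files = List.of(
                new DataFileInfo("data/file_01.parquet", Instant.parse("2022-07-01T00:00:00Z")),
                new DataFileInfo("data/file_02.parquet", Instant.parse("2022-07-20T00:00:00Z")));

        // prints [data/file_01.parquet]
        System.out.println(modifiedBefore(files, Instant.parse("2022-07-10T00:00:00Z")));
    }
}
```

Whether such a condition is better expressed through a virtual column or an execute property is exactly the design question raised in this thread; the sketch only shows the filtering semantics.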

phd3 · 2022-07-11T01:47:31Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/TrinoDeleteFilter.java

@@ -71,6 +72,9 @@ private static Types.NestedField toNestedField(Schema tableSchema, IcebergColumn
        if (columnHandle.isPathColumn()) {
            return FILE_PATH;
        }
+        if (columnHandle.isFileModifiedTimeColumn()) {
+            return Types.NestedField.of(FILE_MODIFIED_TIME.getId(), false, FILE_MODIFIED_TIME.getColumnName(), Types.StructType.of());


Why is StructType.of used here?

phd3 · 2022-07-11T01:57:25Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadataColumn.java

 import static io.trino.spi.type.VarcharType.VARCHAR;

 public enum IcebergMetadataColumn
 {
    FILE_PATH(MetadataColumns.FILE_PATH.fieldId(), "$path", VARCHAR, PRIMITIVE),
+    FILE_MODIFIED_TIME(Integer.MAX_VALUE - 1001, "$file_modified_time", TIMESTAMP_TZ_MILLIS, PRIMITIVE),


Can we file an issue in the Iceberg repo and link it here (maybe add a code comment too) about standardizing the ID for this metadata column, so that we don't forget? It would be good to get general consensus in the Iceberg community before merging this.

Filed apache/iceberg#5240

alexjo2144 · 2022-07-14T16:01:23Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java

@@ -493,6 +504,9 @@ else if (partitionKeys.containsKey(column.getId())) {
                else if (column.isPathColumn()) {
                    columnAdaptations.add(ColumnAdaptation.constantColumn(nativeValueToBlock(FILE_PATH.getType(), utf8Slice(path.toString()))));
                }
+                else if (column.isFileModifiedTimeColumn()) {
+                    columnAdaptations.add(ColumnAdaptation.constantColumn(nativeValueToBlock(FILE_MODIFIED_TIME.getType(), packDateTimeWithZone(fileModifiedTime.orElseThrow(), DateTimeZone.getDefault().getID()))));


Should the zone be configurable, or should it be UTC? I think we use UTC in Delta.

Delta uses UTC, while the Hive connector uses DateTimeZone.getDefault().

@findepi Do you have any opinion?

It should be UTC.

Hive's should be changed to UTC as well.

For reference: #13157 (comment)

Thanks for sharing the context. Changed to UTC.
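To illustrate the zone question above: the modification time is a single instant, and the zone only changes how the resulting timestamp with time zone is rendered. The following is a minimal sketch using only java.time, not Trino's packDateTimeWithZone; the class name, the `withZone` helper, and the sample millisecond value are assumptions made for the example.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

class FileModifiedTimeZoneSketch
{
    // Attach a zone to an epoch-millis file modification time.
    static ZonedDateTime withZone(long epochMillis, ZoneId zone)
    {
        return Instant.ofEpochMilli(epochMillis).atZone(zone);
    }

    public static void main(String[] args)
    {
        long modifiedMillis = 1_657_000_000_000L; // hypothetical modification time

        ZonedDateTime utc = withZone(modifiedMillis, ZoneOffset.UTC);
        ZonedDateTime local = withZone(modifiedMillis, ZoneId.systemDefault());

        // Both values denote the same instant, so this prints true.
        System.out.println(utc.toInstant().equals(local.toInstant()));

        // The UTC rendering is the same on every machine: 2022-07-05T05:46:40Z
        System.out.println(utc);

        // The default-zone rendering depends on the JVM's configuration.
        System.out.println(local);
    }
}
```

Using a fixed zone such as UTC keeps query output independent of where the worker JVM happens to run, which matches the direction taken in this thread.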

ebyhr · 2022-07-21T12:39:15Z

Just rebased on upstream to resolve conflicts.

cla-bot bot added the cla-signed label Jul 5, 2022

github-actions bot added the docs label Jul 5, 2022

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch 2 times, most recently from b9c4a00 to 8573440 Compare July 5, 2022 02:42

ebyhr requested review from findinpath and homar July 5, 2022 06:43

findinpath reviewed Jul 5, 2022

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from 8573440 to 4969f32 Compare July 5, 2022 09:09

findinpath approved these changes Jul 5, 2022

ebyhr requested a review from electrum July 5, 2022 09:43

findinpath assigned ebyhr Jul 5, 2022

phd3 reviewed Jul 5, 2022

ebyhr requested a review from phd3 July 8, 2022 03:19

phd3 approved these changes Jul 11, 2022

ebyhr mentioned this pull request Jul 11, 2022

Add a metadata column for file modified time apache/iceberg#5240

Closed

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from 4969f32 to af0d153 Compare July 11, 2022 02:36

alexjo2144 reviewed Jul 14, 2022

alexjo2144 mentioned this pull request Jul 14, 2022

Iceberg optimize, allow specifying specific partition spec ids #12983

Closed

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from af0d153 to d51dad5 Compare July 21, 2022 12:39

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from d51dad5 to cfdd614 Compare July 21, 2022 13:50

alexjo2144 approved these changes Jul 25, 2022

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from aca6d70 to 2c86a9a Compare July 26, 2022 07:11

findinpath approved these changes Jul 26, 2022

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch 2 times, most recently from 2d4aafb to ed4fcc6 Compare July 27, 2022 02:59

Add $file_modified_time column in Iceberg

1b4d0cb

ebyhr force-pushed the ebi/iceberg-file-modified-time-v2 branch from ed4fcc6 to 1b4d0cb Compare July 27, 2022 04:44

ebyhr merged commit f99ca9c into trinodb:master Jul 27, 2022

ebyhr deleted the ebi/iceberg-file-modified-time-v2 branch July 27, 2022 06:41

ebyhr mentioned this pull request Jul 27, 2022

Release notes for 392 #13320

Closed

github-actions bot added this to the 392 milestone Jul 27, 2022

colebow mentioned this pull request Jul 27, 2022

Add Trino 392 release notes #13342

Closed
