-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add $file_modified_time column in Iceberg #13082
Conversation
b9c4a00
to
8573440
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadataColumn.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java
Show resolved
Hide resolved
8573440
to
4969f32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @ebyhr, could you please help understand the purpose of this change? If someone wants to look at records at a particular point of time, can't they just use AS OF syntax?
If the goal is to expose low-level file-status info for every record for debugging, are we planning to also add other stuff in FileStatus
?
@@ -622,6 +624,12 @@ Retrieve all records that belong to a specific file using ``"$path"`` filter:: | |||
FROM iceberg.web.page_views | |||
WHERE "$path" = '/usr/iceberg/table/web.page_views/data/file_01.parquet' | |||
|
|||
Retrieve all records that belong to a specific file using ``"$file_modified_time"`` filter:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO an example with inequality based filtering would be more practical, but nbd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're specifically adding this for OPTIMIZE, would it make more sense to provide as an executeProperty rather than an explicit virtual column? however, this seems reasonable from the consistency perspective with delta and hive connector.
@@ -71,6 +72,9 @@ private static Types.NestedField toNestedField(Schema tableSchema, IcebergColumn | |||
if (columnHandle.isPathColumn()) { | |||
return FILE_PATH; | |||
} | |||
if (columnHandle.isFileModifiedTimeColumn()) { | |||
return Types.NestedField.of(FILE_MODIFIED_TIME.getId(), false, FILE_MODIFIED_TIME.getColumnName(), Types.StructType.of()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is StructType.of used here?
import static io.trino.spi.type.VarcharType.VARCHAR; | ||
|
||
public enum IcebergMetadataColumn | ||
{ | ||
FILE_PATH(MetadataColumns.FILE_PATH.fieldId(), "$path", VARCHAR, PRIMITIVE), | ||
FILE_MODIFIED_TIME(Integer.MAX_VALUE - 1001, "$file_modified_time", TIMESTAMP_TZ_MILLIS, PRIMITIVE), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we file an issue in iceberg repo and link it here (may be add a comment too) for standardizing on the IDs for this metadata column so that we don't forget? would be good to get a general consensus in the iceberg community before merging this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed apache/iceberg#5240
4969f32
to
af0d153
Compare
@@ -493,6 +504,9 @@ else if (partitionKeys.containsKey(column.getId())) { | |||
else if (column.isPathColumn()) { | |||
columnAdaptations.add(ColumnAdaptation.constantColumn(nativeValueToBlock(FILE_PATH.getType(), utf8Slice(path.toString())))); | |||
} | |||
else if (column.isFileModifiedTimeColumn()) { | |||
columnAdaptations.add(ColumnAdaptation.constantColumn(nativeValueToBlock(FILE_MODIFIED_TIME.getType(), packDateTimeWithZone(fileModifiedTime.orElseThrow(), DateTimeZone.getDefault().getID())))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the zone be configurable or should it be UTC? I think in Delta we do UTC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's exactly UTC in Delta, Hive connector uses DateTimeZone.getDefault()
.
@findepi Do you have any opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the context. Changed to UTC.
af0d153
to
d51dad5
Compare
Just rebased on upstream to resolve conflicts. |
d51dad5
to
cfdd614
Compare
aca6d70
to
2c86a9a
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java
Outdated
Show resolved
Hide resolved
2d4aafb
to
ed4fcc6
Compare
ed4fcc6
to
1b4d0cb
Compare
Description
Add
$file_modified_time
column in IcebergDocumentation
(x) Sufficient documentation is included in this PR.
Release notes
(x) Release notes entries required with the following suggested text: