Support DML operations on Delta Lake tables with id column mapping #16600
Conversation
Force-pushed from 357f148 to f09f502
Force-pushed from 6db5fe5 to 4e6dae7
/test-with-secrets sha=4e6dae7c622f3fffa809c8cce7fff6c587334505
import static org.apache.parquet.schema.Type.Repetition.REQUIRED;

public final class DeltaLakeMetadataSchemas
{
@raunaqmorarka this class is a stripped-down version of ParquetSchemaConverter, used to cope with the need to create Parquet schemas for Delta Lake column handles with nested types (the id information is in the Delta Lake metadata entry schema items).
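For illustration only, a minimal, hypothetical sketch (not this PR's code; column names invented) of the kind of schema such a converter has to produce — Parquet types carrying field ids, built with parquet-mr's Types builder:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;

import static org.apache.parquet.schema.LogicalTypeAnnotation.stringType;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32;

final class FieldIdSchemaSketch
{
    public static void main(String[] args)
    {
        // .id(...) attaches a Delta Lake column mapping id to each field
        MessageType schema = Types.buildMessage()
                .addFields(
                        Types.optional(INT32).id(1).named("a_number"),
                        Types.optional(BINARY).as(stringType()).id(2).named("a_string"))
                .named("trino_schema");
        // Field ids are printed after the field names, e.g. "optional int32 a_number = 1;"
        System.out.println(schema);
    }
}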
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/4468845163
OptionalInt id,
DeltaLakeSchemaSupport.ColumnMappingMode columnMappingMode,
List<String> parent,
BiConsumer<List<String>, Type> primitiveTypesConsumer)
Why not pass in the builder instead of BiConsumer? Not sure why we would need a more generic parameter here.
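Purely as a sketch of the two options being discussed (class and method names are hypothetical, bodies elided):

import java.util.List;
import java.util.OptionalInt;
import java.util.function.BiConsumer;

import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

final class SignatureOptionsSketch
{
    // Option under review: a generic consumer receives each primitive type
    // together with its path from the root, leaving the caller free to
    // collect the types however it likes.
    static void walkSchema(OptionalInt id, List<String> parent, BiConsumer<List<String>, Type> primitiveTypesConsumer)
    {
        // traversal elided
    }

    // Reviewer's suggestion: pass the Parquet group builder directly,
    // since appending converted fields to a schema builder is the only use case.
    static void walkSchema(OptionalInt id, List<String> parent, Types.GroupBuilder<?> schemaBuilder)
    {
        // traversal elided
    }
}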
Force-pushed from 4e6dae7 to b7c07a0
Force-pushed from b7c07a0 to 8f00561
Force-pushed from f00d515 to 4c24030
@ebyhr could you please run this PR with secrets?
Force-pushed from 4c24030 to 11686fc
}
case "timestamp" -> {
    // Spark/DeltaLake stores timestamps in UTC, but renders them in session time zone.
    // For more info, see https://delta-users.slack.com/archives/GKTUWT03T/p1585760533005400
Is this link still accessible? I can't access the page. They use https://linen.delta.io/ to archive Slack messages.
I've taken it from here (Lines 515 to 517 in de4cbc3):
// Spark/DeltaLake stores timestamps in UTC, but renders them in session time zone.
// For more info, see https://delta-users.slack.com/archives/GKTUWT03T/p1585760533005400
// and https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types
Should I remove it from there as well?
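As a side note, a hedged sketch (hypothetical column name) of how the UTC semantics described in that comment surface in parquet-mr: the timestamp logical type's isAdjustedToUTC flag marks stored values as UTC instants, which engines such as Spark then render in the session time zone.

import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Types;

import static org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit.MICROS;
import static org.apache.parquet.schema.LogicalTypeAnnotation.timestampType;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64;

final class TimestampUtcSketch
{
    public static void main(String[] args)
    {
        // isAdjustedToUTC = true: the stored INT64 values are instants in UTC
        PrimitiveType ts = Types.optional(INT64)
                .as(timestampType(true, MICROS))
                .named("a_timestamp");
        System.out.println(ts);
    }
}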
ColumnMappingMode columnMappingMode = getColumnMappingMode(handle.getMetadataEntry());
List<String> partitionColumns = switch (columnMappingMode) {
    case NAME, ID -> getPartitionColumnsForNameOrIdMapping(handle.getMetadataEntry().getOriginalPartitionColumns(), mergeHandle.getInsertTableHandle().getInputColumns());
    case NONE -> handle.getMetadataEntry().getOriginalPartitionColumns();
    case UNKNOWN -> throw new TrinoException(NOT_SUPPORTED, "Unsupported column mapping mode");
};
This part of the code is very similar to the code on line 1370. Can we extract it to a function?
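A minimal sketch of the suggested extraction (the method name is hypothetical, and the sketch assumes the types already in scope in DeltaLakeMetadata):

// Hypothetical helper extracted from the two call sites; relies on
// DeltaLakeColumnHandle, ColumnMappingMode, TrinoException, NOT_SUPPORTED and
// getPartitionColumnsForNameOrIdMapping, which DeltaLakeMetadata already has in scope.
private static List<String> getPartitionColumns(
        List<String> originalPartitionColumns,
        List<DeltaLakeColumnHandle> inputColumns,
        ColumnMappingMode columnMappingMode)
{
    return switch (columnMappingMode) {
        case NAME, ID -> getPartitionColumnsForNameOrIdMapping(originalPartitionColumns, inputColumns);
        case NONE -> originalPartitionColumns;
        case UNKNOWN -> throw new TrinoException(NOT_SUPPORTED, "Unsupported column mapping mode");
    };
}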
I'll need to rethink the Parquet schema conversion for Delta Lake a bit. In any case, Hive's ParquetSchemaConverter builds on Trino types.
Force-pushed from c608b7b to 723ee73
Regarding #16600 (comment), I created the method […]. It feels to me like a patch method, but I didn't find a better alternative that doesn't involve more refactoring.
Force-pushed from 723ee73 to 112dd5f
Force-pushed from 112dd5f to 55e2d90
Given that adding […], I think the PR is now ready for review.
Force-pushed from 613a0a7 to 01cc453
Rebased on master.
@@ -1778,7 +1791,7 @@ private void checkWriteSupported(ConnectorSession session, SchemaTableName schemaTableName)
        checkSupportedWriterVersion(session, schemaTableName);
        checkUnsupportedGeneratedColumns(metadataEntry);
        ColumnMappingMode columnMappingMode = getColumnMappingMode(metadataEntry);
-       if (!(columnMappingMode == NONE || columnMappingMode == ColumnMappingMode.NAME)) {
+       if (!(columnMappingMode == NONE || columnMappingMode == ColumnMappingMode.NAME || columnMappingMode == ColumnMappingMode.ID)) {
Can you add static imports for NAME and ID, similar to NONE?
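With those static imports the check would read roughly like this (a sketch, not the PR's exact code; the exception message follows the one asserted in the tests below):

import io.trino.spi.TrinoException;

import static io.trino.plugin.deltalake.transactionlog.DeltaLakeSchemaSupport.ColumnMappingMode.ID;
import static io.trino.plugin.deltalake.transactionlog.DeltaLakeSchemaSupport.ColumnMappingMode.NAME;
import static io.trino.plugin.deltalake.transactionlog.DeltaLakeSchemaSupport.ColumnMappingMode.NONE;
import static io.trino.spi.StandardErrorCode.NOT_SUPPORTED;

// fragment inside checkWriteSupported
if (!(columnMappingMode == NONE || columnMappingMode == NAME || columnMappingMode == ID)) {
    throw new TrinoException(NOT_SUPPORTED, "Writing with column mapping %s is not supported".formatted(columnMappingMode));
}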
assertQueryFailure(() -> onTrino().executeQuery("UPDATE delta.default." + tableName + " SET a_string = 'test'"))
        .hasMessageContaining("Writing with column mapping id is not supported");
}
assertThat(onTrino().executeQuery("INSERT INTO delta.default." + tableName + " VALUES (1, 'one'), (2, 'two')"))
As the test is named testUnsupportedOperationsColumnMappingMode, can we have a separate test for the write operations and test only unsupported ones here?
Good point. The DML operations here are actually irrelevant now.
I'll remove them.
@@ -840,33 +960,17 @@ public void testUnsupportedColumnMappingModeChangeDataFeed(String mode)
        onDelta().executeQuery("INSERT INTO default." + targetTableName + " VALUES (3, 'nation3', 300)");

        // Column mapping mode 'none' is tested in TestDeltaLakeDatabricksChangeDataFeedCompatibility
        // TODO: Remove these failure checks and update TestDeltaLakeDatabricksChangeDataFeedCompatibility when adding support for the column mapping mode
This TODO is still valid, right?
Force-pushed from 01cc453 to cdd8150
AC
Rebasing on master.
Force-pushed from cdd8150 to b734bdf
Force-pushed from b734bdf to 1e886c9
Rebased again on master.
/test-with-secrets sha=1e886c943c9b3f9f4649cb1da2df9addf0f386a3
Rebasing on top of master.
Force-pushed from 1e886c9 to e23f4b6
/test-with-secrets sha=e23f4b626ac58f8a31c07068fc6ba49c774e9a24
Description
This PR is a follow-up of #16183 and relates to #12638
Additional context and related issues
https://books.japila.pl/delta-lake-internals/DeltaConfigs/#COLUMN_MAPPING_MODE
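For reference, a hedged sketch of enabling id column mapping at table creation, in the product-test style used elsewhere in this PR (table name, schema, and S3 location are hypothetical):

import static io.trino.tests.product.utils.QueryExecutors.onDelta;

final class ColumnMappingSetupSketch
{
    static void createTableWithIdMapping()
    {
        // 'delta.columnMapping.mode' = 'id' turns on id-based column mapping;
        // Delta then tracks columns by field id rather than by physical name.
        onDelta().executeQuery("" +
                "CREATE TABLE default.example_table (a_number INT, a_string STRING) " +
                "USING DELTA " +
                "LOCATION 's3://my-bucket/example_table' " +
                "TBLPROPERTIES ('delta.columnMapping.mode' = 'id')");
    }
}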
I intended to test this column mapping mode for tables migrated from Iceberg and used
https://docs.databricks.com/sql/language-manual/delta-convert-to-delta.html
Applying CLONE on an Iceberg table created by Trino caused internal errors on Databricks, so I used Spark with Iceberg to create the table on AWS and then applied CLONE via Databricks. The outcome was a table with table writer version 7, which can't be written by Trino. :(
Nevertheless, relevant info: the Parquet files written via Iceberg look like:
The new Parquet schema for the files written by Databricks (on INSERT statements after the CLONE statement) looks like:
The Parquet files produced by Trino with the column mapping mode set to id are also consistent with this schema.

Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: