Add Iceberg $properties system table #10480
Conversation
Force-pushed from 8edea11 to 50df67b.
Please follow the guideline https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#commits-and-pull-requests regarding the commit comment.
{
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT) WITH (format='PARQUET')");
    MaterializedResult result = computeActual("SELECT * FROM \"test_iceberg_get_table_props$properties\"");
    assertEquals(result.getRowCount(), 1);
A more concise way of doing these checks would be:
assertThat(query("SELECT * FROM \"test_iceberg_get_table_props$properties\""))
.matches(format("VALUES (VARCHAR 'write.format.default', VARCHAR '%s')", format.name()));
@Test
public void testGetTableProperties()
{
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT) WITH (format='PARQUET')");
Please drop WITH (format='PARQUET'). If you have a look at https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java#L138 you will notice that the file format is specified at the connector level to be either ORC or PARQUET, depending on the concrete test class in which the test is executed.
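For context, a rough sketch of that pattern (the class name, constructor signature, and FileFormat type here are approximations, not the exact Trino code):

// Sketch: the concrete test class fixes the connector-level file format once,
// so individual tests should not hard-code WITH (format = ...).
public class TestIcebergParquetConnectorTest
        extends BaseIcebergConnectorTest
{
    public TestIcebergParquetConnectorTest()
    {
        // the base class builds its query runner with this format for every test
        super(FileFormat.PARQUET);
    }
}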
Improve the readability of the changes in BaseIcebergConnectorTest
@Test
public void testGetTableProperties()
{
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT) WITH (format='PARQUET')");
Add partitioning to show that $properties is not the contents of the WITH (...) clause.
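For illustration, a sketch of such a test (the expected single write.format.default row is an assumption about a freshly created table, based on the assertion suggested earlier in this review):

@Test
public void testGetTableProperties()
{
    // partitioning lives in the table's partition spec, not in its properties map,
    // so it should not appear in $properties even though it is in the WITH clause
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT, y DATE) WITH (partitioning = ARRAY['y'])");
    assertThat(query("SELECT key FROM \"test_iceberg_get_table_props$properties\""))
            .matches("VALUES VARCHAR 'write.format.default'");
    assertUpdate("DROP TABLE test_iceberg_get_table_props");
}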
Please change commit and PR title to
cc @kokosing for access checks, if any.
@@ -238,6 +238,8 @@ public IcebergTableHandle getTableHandle(ConnectorSession session, SchemaTableNa
        return Optional.of(new ManifestsTable(systemTableName, table, getSnapshotId(table, name.getSnapshotId())));
    case FILES:
        return Optional.of(new FilesTable(systemTableName, typeManager, table, getSnapshotId(table, name.getSnapshotId())));
    case PROPERTIES:
Why do we need this table at all? Why not model table properties as table properties (io.trino.spi.connector.ConnectorTableMetadata#properties)?
"TableProperties" seems to be an overloaded term in Trino. The ConnectorTableMetadata#properties
seems only to have three options for properties, which are required for tables creation (format, partitioning, and location).
Iceberg has a large list of configuration that can be set on a table that have different behaviour, either to readers / writers / hive / etc (Iceberg Configuration). This configuration is on the Iceberg Table properties, which exist in the Iceberg Metadata file, which is ledger that is pointed to via the catalog (in our case hive).
properties | A string to string map of table properties. This is used to control settings that affect reading and writing and is not intended to be used for arbitrary metadata. For example, commit.retry.num-retries is used to control the number of commit retries.
These properties can be updated via the updateProperties operation in the Iceberg API (list of API operations).
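For reference, a minimal sketch of such an update through the Iceberg Java API, assuming a resolved org.apache.iceberg.Table handle named table:

// set arbitrary string-to-string entries in the table's properties map
table.updateProperties()
        .set("custom.table-property", "my_custom_value")
        .set("commit.retry.num-retries", "4")
        .commit();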
Given that the ConnectorTableMetadata#properties functionality seems to limit the properties intentionally, it did not seem appropriate to change that behaviour. Instead, I took the approach of creating a system table to dig the properties out of the Iceberg metadata file.
In addition, when we met with the Iceberg folks about how we intend to work with Iceberg, they agreed with us putting custom entries in the table properties that are pivotal to data interpretation internally at Shopify. That means the data we put in there is not "arbitrary", but required. Given that fact, we need the table to be unrestricted in the configs it returns.
This PR avoids access checks, which is fine. Notice to get the access to
@Buktoria is your request somehow related to https://github.com/starburstdata/dbt-trino? If yes, can you share some details on how you plan to use it with this new feature?
Force-pushed from 50df67b to c27c243.
This is the first time I'm hearing about dbt-trino; I need to look into it further. Our current use case, though, is that our various modeling tools need some information about the dataset to help them interpret the table. We currently have Iceberg tables that represent different kinds of data, as well as information about the sensitivity of the table, whether privacy enforcement is currently running on it, and data retention.
Our Spark-based modeling tool utilizes the Iceberg API to get this information stored in the table properties. Our other modeling tool, which right now uses Python + dbt to format SQL queries to run on Trino, also needs access to the metadata stored in the properties. Given that the Python Iceberg API is not fully fleshed out, and in our team's opinion not reliable at the moment, an alternative was to make the same information accessible via Trino through a query. This also had the additional benefit of not adding an extra dependency. Thus we landed here, providing direct access to the Iceberg table properties that are found in the metadata file.
@Buktoria can you please check the build results?
Please squash the commits.
        storageFormat.toString(),
        specVersion));

assertThat(onTrino().executeQuery(format("SELECT" +
format is not needed.
.containsOnly(
        row("custom.table-property", "my_custom_value"),
        row("write.format.default", storageFormat.name()),
        row("owner", "hive"));
onSpark().executeQuery("DROP TABLE " + sparkTableName);
        specVersion));

assertThat(onTrino().executeQuery(format("SELECT" +
        " key" +
There is no need for the extra newlines in the SQL statement.
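For example, a sketch of a more compact form, reusing the names from this test:

// same assertion, without splitting the SQL across concatenated fragments
assertThat(onTrino().executeQuery("SELECT key, value FROM " + trinoPropertiesTableName))
        .containsOnly(
                row("custom.table-property", "my_custom_value"),
                row("write.format.default", storageFormat.name()),
                row("owner", "hive"));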
String sparkTableName = sparkTableName(baseTableName);
String trinoPropertiesTableName = trinoTableName(propertiesTableName);

onSpark().executeQuery(format(
onSpark().executeQuery("DROP TABLE IF EXISTS " + sparkTableName);
A system table named 'properties' has been added, which provides a queryable table representing the properties from the Iceberg metadata file of the table (the ledger of the table). The contents of the properties can change over time.
Force-pushed from 6917aa6 to 747bcf1.
Iceberg supports a collection of system tables for accessing metadata. Unlike when using Iceberg with Spark, where you have full access to the Iceberg Java API, tooling that depends on Trino has no way of getting the table properties of a given table.
My team has found it useful to have this "properties" table so that we can inspect a table's properties via Trino. More specifically, we have a dbt-based modeling framework written in Python that utilizes Trino as its engine. It relies on specific metadata that we are currently storing in the table properties of our Iceberg tables. Iceberg's Python API is lacking in features and support, so we didn't want to depend on it yet. These few small changes to add this "properties" table to Trino were an ideal solution to our predicament.
It's also important to note that the original Iceberg spec does not include such a table in Spark SQL, though I do not see a huge necessity for it there, given the Java API, which can be imported into a Spark-based application.
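For illustration, a minimal sketch of the intended usage in the style of the connector tests (the table name "events" and the property looked up are hypothetical):

// each row of $properties is one key/value entry from the Iceberg
// table's metadata properties map
MaterializedResult result = computeActual(
        "SELECT value FROM \"events$properties\" WHERE key = 'write.format.default'");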