
Add Iceberg $properties system table #10480

Merged · 1 commit · Jan 17, 2022

Conversation

@Buktoria (Member) commented Jan 5, 2022

Iceberg supports a collection of system tables for accessing metadata. Unlike with Spark, where you have full access to the Iceberg Java API, tooling that depends on Trino has no way of getting the table properties of a given table.

My team has found it useful to have this "Properties" table so that we can inspect table properties via Trino. More specifically, we have a DBT-based modeling framework written in Python that uses Trino as its engine. It relies on specific metadata that we currently store in the table properties of our Iceberg tables. Iceberg's Python API is lacking in features and support, so we didn't want to depend on it yet. These few small changes adding a "Properties" table to Trino were an ideal solution to our predicament.

It's also important to note that the original Iceberg spec does not include such a table in Spark SQL, though I do not see a huge necessity for it there, given that the Java API can be imported into a Spark-based application.
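For illustration, here is a minimal sketch of how Trino-dependent tooling could read these properties once this table exists; the JDBC URL, catalog/schema, and table name are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadTableProperties
{
    public static void main(String[] args)
            throws Exception
    {
        // Hypothetical Trino coordinator and Iceberg catalog/schema
        String url = "jdbc:trino://localhost:8080/iceberg/analytics";
        try (Connection connection = DriverManager.getConnection(url, "user", null);
                Statement statement = connection.createStatement();
                // Each row of $properties is a (key VARCHAR, value VARCHAR) pair
                ResultSet rs = statement.executeQuery("SELECT key, value FROM \"events$properties\"")) {
            while (rs.next()) {
                System.out.println(rs.getString("key") + " = " + rs.getString("value"));
            }
        }
    }
}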

cla-bot added the cla-signed label on Jan 5, 2022
@Buktoria changed the title from "added Iceberg Properties tables to access table properties" to "added Iceberg "Properties" table to access table properties" on Jan 5, 2022
@Buktoria force-pushed the iceberg-add-properties-table branch 3 times, most recently from 8edea11 to 50df67b on January 6, 2022 14:40
@Buktoria marked this pull request as ready for review on January 6, 2022 16:14
@findinpath (Contributor):

Please follow the guideline https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#commits-and-pull-requests regarding the commit message.

{
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT) WITH (format='PARQUET')");
    MaterializedResult result = computeActual("SELECT * FROM \"test_iceberg_get_table_props$properties\"");
    assertEquals(result.getRowCount(), 1);
@findinpath commented Jan 11, 2022:

A more concise way of doing these checks would be:

        assertThat(query("SELECT * FROM \"test_iceberg_get_table_props$properties\""))
                .matches(format("VALUES (VARCHAR 'write.format.default', VARCHAR '%s')", format.name()));

@Test
public void testGetTableProperties()
{
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT) WITH (format='PARQUET')");
Contributor:

Please drop WITH (format='PARQUET').
If you have a look at https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java#L138 you'll notice that the file format is specified at the connector level as either ORC or PARQUET, depending on the concrete test class in which the test is executed.
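Putting both review suggestions together, the test could look roughly like this (a sketch; format is the connector-level file format field used elsewhere in BaseIcebergConnectorTest):

@Test
public void testGetTableProperties()
{
    // No WITH (format=...): the file format comes from the concrete test class
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT)");
    assertThat(query("SELECT * FROM \"test_iceberg_get_table_props$properties\""))
            .matches(format("VALUES (VARCHAR 'write.format.default', VARCHAR '%s')", format.name()));
    assertUpdate("DROP TABLE test_iceberg_get_table_props");
}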

@findinpath (Contributor) left a comment:

Improve the readability of the changes in BaseIcebergConnectorTest

@Test
public void testGetTableProperties()
{
    assertUpdate("CREATE TABLE test_iceberg_get_table_props (x BIGINT) WITH (format='PARQUET')");
Member:

Add partitioning to show that $properties is not just the contents of the WITH (...) clause.
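A sketch of that suggestion (table and column names illustrative): create the table with partitioning, then check that $properties still returns Iceberg's own property map rather than echoing the WITH clause:

assertUpdate("CREATE TABLE test_iceberg_props_partitioned (x BIGINT, part VARCHAR) " +
        "WITH (partitioning = ARRAY['part'])");
// The partitioning from the WITH clause lives in the partition spec, not in the
// table properties, so $properties should not contain a 'partitioning' entry
assertThat(query("SELECT * FROM \"test_iceberg_props_partitioned$properties\""))
        .matches(format("VALUES (VARCHAR 'write.format.default', VARCHAR '%s')", format.name()));
assertUpdate("DROP TABLE test_iceberg_props_partitioned");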

@findepi (Member) commented Jan 11, 2022:

Please change commit and PR title to Add Iceberg $properties system table.

@findepi (Member) commented Jan 11, 2022:

cc @kokosing for access checks if any.

@@ -238,6 +238,8 @@ public IcebergTableHandle getTableHandle(ConnectorSession session, SchemaTableNa
return Optional.of(new ManifestsTable(systemTableName, table, getSnapshotId(table, name.getSnapshotId())));
case FILES:
return Optional.of(new FilesTable(systemTableName, typeManager, table, getSnapshotId(table, name.getSnapshotId())));
case PROPERTIES:
Member:

Why do we need this table at all? Why not model table properties as table properties (io.trino.spi.connector.ConnectorTableMetadata#properties)?

@Buktoria (Member, Author):

"TableProperties" seems to be an overloaded term in Trino. The ConnectorTableMetadata#properties seems only to have three options for properties, which are required for tables creation (format, partitioning, and location).

Iceberg has a large list of configuration options that can be set on a table, affecting different behaviours for readers / writers / Hive / etc. (Iceberg Configuration). This configuration lives in the Iceberg table properties, which exist in the Iceberg metadata file, the ledger that the catalog (in our case Hive) points to.

properties | A string to string map of table properties. This is used to control settings that affect reading and writing and is not intended to be used for arbitrary metadata. For example, commit.retry.num-retries is used to control the number of commit retries.

These properties can be updated via the updateProperties operation in the Iceberg API (list of API operations).

Given that the ConnectorTableMetadata#properties functionality seems to limit the properties intentionally, it did not seem appropriate to change that behaviour. Instead, I took the approach of creating a "system table" to dig the properties out of the Iceberg metadata file.

In addition, when we met with the Iceberg folks about how we intend to work with Iceberg, they agreed with us putting custom entries in the table properties that are pivotal to data interpretation internally at Shopify. That means the data we put in there is not "arbitrary", but required. Given that fact, we need the table to be unrestricted in the configs it returns.
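For context, a minimal sketch of the system-table approach (not necessarily the exact code in this PR): implement Trino's SystemTable SPI and fill the rows from the Iceberg table's string-to-string property map:

import com.google.common.collect.ImmutableList;
import io.trino.spi.connector.ColumnMetadata;
import io.trino.spi.connector.ConnectorSession;
import io.trino.spi.connector.ConnectorTableMetadata;
import io.trino.spi.connector.ConnectorTransactionHandle;
import io.trino.spi.connector.InMemoryRecordSet;
import io.trino.spi.connector.RecordCursor;
import io.trino.spi.connector.SchemaTableName;
import io.trino.spi.connector.SystemTable;
import io.trino.spi.predicate.TupleDomain;
import org.apache.iceberg.Table;

import static io.trino.spi.type.VarcharType.VARCHAR;
import static java.util.Objects.requireNonNull;

public class PropertiesTable
        implements SystemTable
{
    private final ConnectorTableMetadata tableMetadata;
    private final Table icebergTable;

    public PropertiesTable(SchemaTableName tableName, Table icebergTable)
    {
        this.icebergTable = requireNonNull(icebergTable, "icebergTable is null");
        this.tableMetadata = new ConnectorTableMetadata(
                requireNonNull(tableName, "tableName is null"),
                ImmutableList.of(
                        new ColumnMetadata("key", VARCHAR),
                        new ColumnMetadata("value", VARCHAR)));
    }

    @Override
    public Distribution getDistribution()
    {
        return Distribution.SINGLE_COORDINATOR;
    }

    @Override
    public ConnectorTableMetadata getTableMetadata()
    {
        return tableMetadata;
    }

    @Override
    public RecordCursor cursor(ConnectorTransactionHandle transactionHandle, ConnectorSession session, TupleDomain<Integer> constraint)
    {
        // Pull the key/value pairs straight from the Iceberg metadata file's property map
        InMemoryRecordSet.Builder table = InMemoryRecordSet.builder(tableMetadata);
        icebergTable.properties().forEach((key, value) -> table.addRow(key, value));
        return table.build().cursor();
    }
}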

@kokosing (Member):

for access checks if any.

This PR avoids access checks, which is fine. Note that to get access to t$properties you typically need an explicit grant to that table. Access to just t does not give you access to the system tables for that table.

@findinpath (Contributor):

My team has found it useful to have this "Properties" table so that we can inspect table properties via Trino. More specifically, we have a DBT-based modeling framework written in Python that uses Trino as its engine. It relies on specific metadata that we currently store in the table properties of our Iceberg tables.

@Buktoria is your request somehow related to https://github.com/starburstdata/dbt-trino ? If yes, can you share some details on how you plan to use it with this new feature?

@Buktoria changed the title from "added Iceberg "Properties" table to access table properties" to "Add Iceberg $properties system table" on Jan 14, 2022
@Buktoria force-pushed the iceberg-add-properties-table branch from 50df67b to c27c243 on January 14, 2022 23:12
@Buktoria (Member, Author) commented Jan 14, 2022:

This is the first time I'm hearing about dbt-trino. I need to look into it further.

Our current use case, though, is that our various modeling tools need some information about the dataset to help them interpret the table. We currently have Iceberg tables that represent different kinds of data,
e.g. CDC, schematized_kafka_events, non_schematized_kafka_events, etc.

There is also information about the sensitivity of the table, whether privacy enforcement is currently running on it, and data retention,
e.g.:

  • privacy.sensitive=true/false
  • privacy.enforced=true/false
  • privacy.retention=P30D
  • privacy.retention.field=<field used to purge data by>
  • etc.

Our Spark-based modeling tool uses the Iceberg API to get this information stored in the table properties: catalog.loadTable(...).properties().
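For comparison, a sketch of that Spark-side access path via the Iceberg Java API (catalog wiring, namespace, and property keys from the list above; names hypothetical):

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public final class TablePropertiesReader
{
    private TablePropertiesReader() {}

    // Reads the privacy-related properties our tools rely on
    public static void printPrivacyProperties(Catalog catalog)
    {
        Table table = catalog.loadTable(TableIdentifier.of("analytics", "events"));
        Map<String, String> properties = table.properties();
        System.out.println("sensitive: " + properties.getOrDefault("privacy.sensitive", "false"));
        System.out.println("retention: " + properties.get("privacy.retention")); // e.g. "P30D"
    }
}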

Our other modeling tool, which right now uses Python + DBT to format SQL queries to run on Trino, also needs access to the metadata stored in the properties. Given that the Python Iceberg API is not fully fleshed out, and in our team's opinion not reliable at the moment, an alternative was to make the same information accessible via Trino through a query. This also had the additional benefit of not adding another dependency.

Thus we landed here, providing direct access to the Iceberg table properties found in the metadata file.

@findepi (Member) commented Jan 15, 2022:

@Buktoria can you please check the build results? It seems some checks failed.

@findinpath (Contributor):

Please squash the commits.

storageFormat.toString(),
specVersion));

assertThat(onTrino().executeQuery(format("SELECT" +
Contributor:

format is not needed

.containsOnly(
row("custom.table-property", "my_custom_value"),
row("write.format.default", storageFormat.name()),
row("owner", "hive"));
Contributor:

onSpark().executeQuery("DROP TABLE " + sparkTableName);

specVersion));

assertThat(onTrino().executeQuery(format("SELECT" +
" key" +
Contributor:

There is no need for extra new lines in the SQL statement.

String sparkTableName = sparkTableName(baseTableName);
String trinoPropertiesTableName = trinoTableName(propertiesTableName);

onSpark().executeQuery(format(
Contributor:

onSpark().executeQuery("DROP TABLE IF EXISTS " + sparkTableName);

A system table by the name of 'properties' has been added, which
provides a queryable table that represents the properties from the
Iceberg metadata file of the table (this is the ledger of the table).
The contents of the properties can change over time.
@Buktoria force-pushed the iceberg-add-properties-table branch from 6917aa6 to 747bcf1 on January 15, 2022 21:45