Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document special tables exposed by Iceberg #10514

Merged
merged 1 commit into from
Feb 3, 2022

Conversation

findinpath
Copy link
Contributor

No description provided.

@cla-bot cla-bot bot added the cla-signed label Jan 10, 2022
@losipiuk losipiuk requested a review from mosabua January 10, 2022 10:19
- The upper bounds per column in the columnar file
* - ``key_metadata``
- ``varbinary``
- TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much.

@findinpath findinpath force-pushed the docs/iceberg-special-tables branch from 24bdb7d to 3d98369 Compare January 10, 2022 10:44
@ckdarby
Copy link

ckdarby commented Jan 10, 2022

Looks like #10480 will introduce more to be documented with $properties

Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we combine this with #10515

Also.. ideally each header has an anchor so we can link to it with ref anchors in sphinx

docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
Special tables
--------------

The Iceberg connector makes available several hidden tables that can provide insights regarding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Iceberg connector maintains several hidden tables that provide metadata for a specific table. You can query each metadata table by appending the metadata table name to the table name::

SELECT * FROM "test_table$data"

docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
``$manifests`` table
^^^^^^^^^^^^^^^^^^^^

The ``$manifests`` table provides a detailed overview of the manifests corresponding to the snapshots performed in the log of the Iceberg table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrap

@findinpath findinpath force-pushed the docs/iceberg-special-tables branch 5 times, most recently from 9bc3be7 to 061e801 Compare January 11, 2022 16:08
@findinpath findinpath force-pushed the docs/iceberg-special-tables branch 2 times, most recently from e720e58 to 1ee8c11 Compare January 18, 2022 10:23
Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good and is an important addition.

I've gone over most of the types and they all look good to me. I'll do another pass, but I left some initial nits / questions. Please feel free to resolve any comments that aren't relevant etc, as I'm new to this review and less familiar with the Trino codebase (coming mostly from the core Iceberg side).

I'd also mention that in the Iceberg docs, we provide at least one query that shows how to make use of the metadata tables. Such as this query that joins the history table on snapshots table.

Probably the most important metadata table (in my opinion) is the $files table, as many users want to inspect their underlying storage to see / count files, but that can be incorrect given that we retain older data. In a follow up, that might be something to emphasize.

As a follow up to this, I think it would be good to include some example queries like that to help users use these metadata tables. That's also something we need to work on in the iceberg docs themselves, so I'd be happy to collaborate with somebody on that or help them find the right people to do so. 😄

Please feel free to reach out to me anytime on the Iceberg slack or the Trino slack if I can be of help with anything or help finding additional points of contact.

Comment on lines 353 to 354
Special tables
--------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Within iceberg core, we typically refer to these as just Metadata Tables. Is there a conflicting concept within Trino that conflicts naming-wise which makes it better to avoid that terminology at the top-level?

If not, I would maybe make this heading Metadata Tables to be consistent. Or maybe Special Metadata Tables if you're looking to avoid conflicts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with you @kbendick .
I got inspired from hive connector where these tables are called like this.
I will change the naming here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosabua do you know why the metadata tables are called special tables in the hive connector?

``$properties`` table
^^^^^^^^^^^^^^^^^^^^^

The ``$properties`` table provides general metadata information about the table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Possibly the usage of provides general metadata information about the table is redundant without further clarifying that the metadata is user supplied (or generated) tags for the tables in addition to configured table properties (some of which are specific to iceberg).

Maybe The $properties table provides access to general information about iceberg table configuration and any additional metadata key/value pairs that users or engines mighthave tagged the table with. ?

Alternatively, it might be ok to simply reference that this is similar to hive's TBLPROPERTIES?

For reference I found the following in Hive's documentation:

The TBLPROPERTIES clause allows you to tag the table definition with your own
metadata key/value pairs. Some predefined table properties also exist,
such as last_modified_user and last_modified_time which are automatically
added and managed by Hive.

Iceberg doesn't supply last_modified_time as a tblproperty itself, but we also don't specifically try to stop HMS from adding these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed there is another PR specifically related to $properties, so I'll likely move this comment there if appropriate =)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed there is another PR specifically related to $properties

#10480 is already merged.
I'm updating correspondingly the docs here.

docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
- ``integer``
- The number of data files deleted during the snapshot
* - ``partitions``
- ``array(row(contains_null boolean, lower_bound varchar, upper_bound varchar))``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is contains_nan not included from within Trino?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 627 to 629
* - ``column_sizes``
- ``map(integer, bigint)``
- The size for each column in the columnar file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Would it be worth mentioning for these that the integer keys are the field IDs used by Iceberg?

I don't have a good way at the moment to express that (especially without looking at the rest of the existing docs). This could be a follow up item, if anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I have adapted this to:

Mapping between the Iceberg column ID and its corresponding size within the columnar file

- The upper bounds per column in the columnar file
* - ``key_metadata``
- ``varbinary``
- Metadata about how this file is encrypted, if applicable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is an odd one, as it's not currently used in open source (though it's coming and a high priority on the roadmap).

The javadoc for this interface (which is a light wrapper around a ByteBuffer) refer to this as metadata about an encrypted data file's encryption key. More specifically, it likely refers to the location of the encryption key but that's probably being iterated on to be more efficient so I wouldn't state that for now.

Maybe Metadata about the encryption key used to encrypt this file, if applicable?

I should mention that the avro doc comment for this just simply says Encryption key metadata blob.

- Description
* - ``content``
- ``integer``
- Type of content stored in the file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might consider referring to this as an enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you suggest me an appropriate way to bring up the enum aspect in the description?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just talk about it as list of valid options or available values or so. And define what each means

@findinpath findinpath force-pushed the docs/iceberg-special-tables branch 2 times, most recently from f46a253 to 6fd7499 Compare January 20, 2022 08:42
Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it! Great addiiton.

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits.
Let me know when ready to merge

@findinpath findinpath force-pushed the docs/iceberg-special-tables branch 3 times, most recently from fec4e49 to 084a712 Compare January 26, 2022 12:37
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
* - ``added_snapshot_id``
- ``bigint``
- The identifier of the snapshot during which this manifest entry has been added
* - ``added_data_files_count``
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up PR for exposing further fields in the $manifests table: #10809

@findinpath findinpath force-pushed the docs/iceberg-special-tables branch 2 times, most recently from 85910ab to 16fbf84 Compare February 3, 2022 15:47
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
^^^^^^^^^^^^^^^^^^^^

The ``$snapshots`` table provides a detailed view of snapshots of the
Iceberg table. A snapshot consist of one or more file manifests, and the complete table contents
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consists of

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the compete table content

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compete?

docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
@findinpath findinpath force-pushed the docs/iceberg-special-tables branch from 16fbf84 to 80ab441 Compare February 3, 2022 17:25
@findinpath findinpath requested a review from mosabua February 3, 2022 17:25
Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the updates. Great addition to the docs.

@martint martint merged commit 028a3f3 into trinodb:master Feb 3, 2022
@github-actions github-actions bot added this to the 370 milestone Feb 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants