Document special tables exposed by Iceberg #10514
    - The upper bounds per column in the columnar file
  * - ``key_metadata``
    - ``varbinary``
    - TODO
You may find inspiration in https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/ContentFile.java
Thank you very much.
Force-pushed: 24bdb7d to 3d98369
Looks like #10480 will introduce more to be documented with $properties
Should we combine this with #10515?
Also, ideally each header has an anchor so we can link to it with ref anchors in Sphinx.
Special tables
--------------

The Iceberg connector makes available several hidden tables that can provide insights regarding
The Iceberg connector maintains several hidden tables that provide metadata for a specific table. You can query each metadata table by appending the metadata table name to the table name::

    SELECT * FROM "test_table$data"
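A minimal sketch of the naming rule in the suggestion above, written in Python purely for illustration. The helper names `metadata_table` and `select_all` are hypothetical and not part of Trino; the only thing taken from the thread is that the metadata table name is appended after a `$`, and that the resulting identifier has to be double-quoted.

```python
def metadata_table(table_name: str, metadata_name: str) -> str:
    """Return a double-quoted identifier such as "test_table$files".

    Hypothetical helper: the '$' is not valid in an unquoted SQL
    identifier, so the combined name is wrapped in double quotes.
    """
    # Escape embedded double quotes by doubling them (standard SQL identifier quoting).
    escaped = f"{table_name}${metadata_name}".replace('"', '""')
    return f'"{escaped}"'


def select_all(table_name: str, metadata_name: str) -> str:
    """Build a SELECT over one metadata table of the given table."""
    return f"SELECT * FROM {metadata_table(table_name, metadata_name)}"


if __name__ == "__main__":
    print(select_all("test_table", "snapshots"))
```

Building the identifier in one place keeps the quoting consistent when the same table is inspected through several metadata tables.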
``$manifests`` table
^^^^^^^^^^^^^^^^^^^^

The ``$manifests`` table provides a detailed overview of the manifests corresponding to the snapshots performed in the log of the Iceberg table.
wrap
Force-pushed: 9bc3be7 to 061e801
Force-pushed: e720e58 to 1ee8c11
This looks really good and is an important addition.
I've gone over most of the types and they all look good to me. I'll do another pass, but I left some initial nits / questions. Please feel free to resolve any comments that aren't relevant etc, as I'm new to this review and less familiar with the Trino codebase (coming mostly from the core Iceberg side).
I'd also mention that in the Iceberg docs, we provide at least one query that shows how to make use of the metadata tables, such as a query that joins the history table on the snapshots table.
Probably the most important metadata table (in my opinion) is the $files table, as many users want to inspect their underlying storage to see / count files, but that can be incorrect given that we retain older data. In a follow up, that might be something to emphasize.
As a follow up to this, I think it would be good to include some example queries like that to help users use these metadata tables. That's also something we need to work on in the iceberg docs themselves, so I'd be happy to collaborate with somebody on that or help them find the right people to do so. 😄
Please feel free to reach out to me anytime on the Iceberg slack or the Trino slack if I can be of help with anything or help finding additional points of contact.
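As a rough illustration of the kind of example query mentioned above, the following Python sketch builds a join of the history table on the snapshots table. The function name `history_with_snapshots_sql` is hypothetical, and the column names (`made_current_at`, `snapshot_id`, `is_current_ancestor`, `operation`) are assumptions about what the `$history` and `$snapshots` tables expose; they should be checked against the connector before reuse.

```python
def history_with_snapshots_sql(table_name: str) -> str:
    """Build a query joining "<table>$history" on "<table>$snapshots".

    Column names are assumed, not taken from this thread.
    """
    history = f'"{table_name}$history"'
    snapshots = f'"{table_name}$snapshots"'
    return (
        "SELECT h.made_current_at, s.operation, h.snapshot_id, h.is_current_ancestor\n"
        f"FROM {history} h\n"
        # Each history entry points at exactly one snapshot.
        f"JOIN {snapshots} s ON h.snapshot_id = s.snapshot_id\n"
        "ORDER BY h.made_current_at"
    )


if __name__ == "__main__":
    print(history_with_snapshots_sql("test_table"))
```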
Special tables
--------------
Nit: Within Iceberg core, we typically refer to these as just Metadata Tables. Is there a concept within Trino that conflicts naming-wise, which makes it better to avoid that terminology at the top level?
If not, I would maybe make this heading Metadata Tables to be consistent. Or maybe Special Metadata Tables if you're looking to avoid conflicts.
@mosabua do you know why the metadata tables are called special tables in the hive connector?
``$properties`` table
^^^^^^^^^^^^^^^^^^^^^

The ``$properties`` table provides general metadata information about the table.
Nit: Possibly the usage of "provides general metadata information about the table" is redundant without further clarifying that the metadata is user-supplied (or generated) tags for the tables in addition to configured table properties (some of which are specific to Iceberg).
Maybe "The $properties table provides access to general information about Iceberg table configuration and any additional metadata key/value pairs that users or engines might have tagged the table with."?
Alternatively, it might be OK to simply mention that this is similar to Hive's TBLPROPERTIES?
For reference I found the following in Hive's documentation:
The TBLPROPERTIES clause allows you to tag the table definition with your own
metadata key/value pairs. Some predefined table properties also exist,
such as last_modified_user and last_modified_time which are automatically
added and managed by Hive.
Iceberg doesn't supply last_modified_time as a tblproperty itself, but we also don't specifically try to stop HMS from adding these.
I just noticed there is another PR specifically related to $properties, so I'll likely move this comment there if appropriate =)
I just noticed there is another PR specifically related to $properties
#10480 is already merged.
I'm updating the docs here accordingly.
    - ``integer``
    - The number of data files deleted during the snapshot
  * - ``partitions``
    - ``array(row(contains_null boolean, lower_bound varchar, upper_bound varchar))``
Question: is contains_nan not included from within Trino?
  * - ``column_sizes``
    - ``map(integer, bigint)``
    - The size for each column in the columnar file
Nit: Would it be worth mentioning for these that the integer keys are the field IDs used by Iceberg?
I don't have a good way at the moment to express that (especially without looking at the rest of the existing docs). This could be a follow up item, if anything.
Good point.
I have adapted this to:
Mapping between the Iceberg column ID and its corresponding size within the columnar file
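To make the point about integer keys concrete, here is a small Python sketch of how a reader might turn a `column_sizes` value into something keyed by column name. The function name `sizes_by_column_name`, the sample field IDs, column names, and byte counts are all made up for illustration; the only assumption carried over from the thread is that the map keys are Iceberg column (field) IDs.

```python
from typing import Dict


def sizes_by_column_name(
    column_sizes: Dict[int, int], field_names: Dict[int, str]
) -> Dict[str, int]:
    """Re-key a {field_id: size} map by column name.

    IDs missing from field_names are kept under a placeholder name
    rather than dropped, so no size silently disappears.
    """
    return {
        field_names.get(field_id, f"<unknown field {field_id}>"): size
        for field_id, size in column_sizes.items()
    }


if __name__ == "__main__":
    # Hypothetical values, as they might come back from a $files row.
    column_sizes = {1: 4096, 2: 1024}
    # Hypothetical ID-to-name mapping taken from the table schema.
    field_names = {1: "order_id", 2: "customer"}
    print(sizes_by_column_name(column_sizes, field_names))
```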
    - The upper bounds per column in the columnar file
  * - ``key_metadata``
    - ``varbinary``
    - Metadata about how this file is encrypted, if applicable
So this is an odd one, as it's not currently used in open source (though it's coming and a high priority on the roadmap).
The javadoc for this interface (which is a light wrapper around a ByteBuffer) refers to this as metadata about an encrypted data file's encryption key.
More specifically, it likely refers to the location of the encryption key, but that's probably being iterated on to be more efficient, so I wouldn't state that for now.
Maybe "Metadata about the encryption key used to encrypt this file, if applicable"?
I should mention that the Avro doc comment for this simply says "Encryption key metadata blob".
    - Description
  * - ``content``
    - ``integer``
    - Type of content stored in the file.
Might consider referring to this as an enum.
Can you suggest an appropriate way to bring up the enum aspect in the description?
I would just talk about it as a list of valid options or available values, and define what each means.
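One way to picture the "list of valid values" idea is the Python sketch below. The class name `FileContent` and the helper `describe_content` are hypothetical; the value-to-meaning mapping (0 for data, 1 for position deletes, 2 for equality deletes) follows the Iceberg table spec as I understand it and should be confirmed against the spec before it goes into the docs.

```python
from enum import IntEnum


class FileContent(IntEnum):
    """Assumed meanings of the integer ``content`` column."""

    DATA = 0              # regular data file
    POSITION_DELETES = 1  # delete file identifying rows by file and position
    EQUALITY_DELETES = 2  # delete file identifying rows by column values


def describe_content(value: int) -> str:
    """Return the symbolic name for a content value, or a marker if unknown."""
    try:
        return FileContent(value).name
    except ValueError:
        return f"UNKNOWN({value})"


if __name__ == "__main__":
    for raw in (0, 1, 2, 7):
        print(raw, describe_content(raw))
```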
Force-pushed: f46a253 to 6fd7499
Ship it! Great addition.
Some nits.
Let me know when ready to merge
Force-pushed: fec4e49 to 084a712
  * - ``added_snapshot_id``
    - ``bigint``
    - The identifier of the snapshot during which this manifest entry has been added
  * - ``added_data_files_count``
follow-up PR for exposing further fields in the $manifests table: #10809
Force-pushed: 85910ab to 16fbf84
^^^^^^^^^^^^^^^^^^^^

The ``$snapshots`` table provides a detailed view of snapshots of the
Iceberg table. A snapshot consist of one or more file manifests, and the complete table contents
consists of
the compete table content
compete?
Force-pushed: 16fbf84 to 80ab441
Thanks for all the updates. Great addition to the docs.