-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Puffin, indices and stats file format specification #4944
Conversation
This was already reviewed at apache/iceberg-docs#69. |
format/puffin-spec.md
Outdated
|
||
- `Magic`: four bytes, same as at the beginning of the file. | ||
- `FooterPayload`: optionally compressed, UTF-8 encoded JSON payload describing the | ||
blobs in the file, with the structure described below, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: period instead of comma, or we could change to comma instead of period for magic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, i think it would be best if it was a comma. however, i removed all the EOL punctucation, because it looked weird on the item with sub-items.
format/puffin-spec.md
Outdated
- `Magic`: four bytes, same as at the beginning of the file. | ||
- `FooterPayload`: optionally compressed, UTF-8 encoded JSON payload describing the | ||
blobs in the file, with the structure described below, | ||
- `FooterPayloadSize`: a length in bytes of the `FooterPayload` (compressed), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does compressed
in brackets signify(given it can be uncompressed)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed this isn't clear, changing to (after compression, if compressed)
format/puffin-spec.md
Outdated
| Field Name | Field Type | Required | Description | | ||
|-------------------|-------------------| -------- | ----------- | | ||
| type | JSON string | yes | See [Blob types](#blob-types) | ||
| fields | list of JSON long | yes | List of field IDs the blob was computed for; the order of items is used to compute sketches stored in the blob. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically talks about sketches.
Since it is generic, should we remove this from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the wording comes from apache/iceberg-docs#69 (comment)
|
||
| Codec name | Description | | ||
|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| lz4 | Single [LZ4 compression frame](https://github.com/lz4/lz4/blob/77d1b93f72628af7bbde0243b4bba9205c3138d9/doc/lz4_Frame_format.md), with content size present | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Does it make sense to add a link to the doc from a released branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the precise version (commit) is the only way to guarantee the link remains valid.
for example, the .md file could be renamed on the release branch.
2cac9ec
to
ad0cd4e
Compare
@karuppayya thanks for your review! updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I'm +1 with a small comment about being more explicit about the context where Puffin can be used.
Add a specification for Puffin format, a container file format to store indices and stats for Iceberg tables.
ad0cd4e
to
bf4444c
Compare
thanks @rdblue for your detailed review. changes applied. |
where | ||
|
||
- `Magic`: four bytes, same as at the beginning of the file | ||
- `FooterPayload`: optionally compressed, UTF-8 encoded JSON payload describing the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mentioning of lz4 as the compression option should ideally be placed near the first mention of compression imo. Or perhaps a mention of the compression section would suffice.
I had to go looking for an indication of the compression type after reading this line which was somewhat confusing for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would remove mention of compression from here rather than elaborate this enumeration item with compression specification. However, it was added for the clarity here.
I think this is ready to commit. Next step is to bring it up on the dev list and vote on it. |
thanks for the merge!
@rdblue who can do this? |
per offline conversation, sent an email to the iceberg dev list (https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q) |
|
||
The blobs can be of a type listed below | ||
|
||
#### `apache-datasketches-theta-v1` blob type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to clarify, alpha-sketch is just one example, and the idea here is it support various statistics/secondary index that is to be defined, right? Example, other sketches, or even bloom filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the idea is to support bloom filters, dictionaries, and other index types in addition to sketches like this.
We'd also like to get histograms in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, led me to wonder on #4945, whether these stats/indices would apply to an individual data file, as well as across whole snapshot (I guess both will be very useful).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter for this file format. I think the next step is to start thinking about adding these at the partition level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea makes sense, its separate.
Hey @findepi , do we have any docs regarding how the puffin files should be tracked by snapshots/manifests? We had implemented data file level indices, e.g. bloom filter, bitmap, and wonder whether we could adapt to this new format. |
@lirui-apache sure! the Puffin files are currently being deployed to hold table statistics. Note that all the above is centered about certain use-case of the file format, namely the table-level statistics. |
@findepi Thanks for the pointers! Our indices are used for data skipping so they have to be file-level. Currently we just store serialized indices in the files. I guess switching to Puffin format can help us better track metadata like snapshot and sequence numbers. |
@lirui-apache sounds very interesting. |
Joined. Thanks 😄 |
Add a specification for Puffin format, a container file format to store
indices and stats for Iceberg tables.