The format description is based on the doc (https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY) and has been preliminarily discussed with @rdblue @losipiuk @alexjo2144 and disseminated as a proposal on the Iceberg dev list and Slack. New comments and feedback are welcome. The description is added next to the spec file, so within landing-page. Please advise whether this is the right place.

| Field Name | Field Type | Required | Description |
| ---------- | ---------------------- | -------- | ----------- |
| blobs | list of Blob objects | yes | |

How about list of `BlobMetadata` objects? That way this doesn't use `Blob` for the binary payload and for the metadata.

| Blob type | Description |
| ------------------------------ | ----------- |
| ndv-long-little-endian | 8-bytes integer stored little-endian and representing number of distinct values |

Is this needed or can we put some of the small payloads in a map stored in table metadata?
Yes, we can do that. That would obviously make the spec more complicated. Do you think it is worth it?
I meant to make this document simpler by not including this. We can have a map of summary stats in the table spec changes.
I wanted the writer to sort blobs by type, so that NDVs end up within a single read.
Of course, this still requires two reads to get this information (one for the footer and one for the NDV).
Is that a problem? I don't know. We do lots of reads when planning a scan over an Iceberg table.
If you feel this is a problem, or a suboptimality that we shouldn't have, I can remove it from here and add it to the table spec instead.
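
To illustrate the point about sorting: if the writer groups blobs by type, a reader can fetch all blobs of one type with a single ranged read once the footer is known. This is only a sketch, assuming footer entries shaped like the `type`/`offset`/`length` fields proposed in this PR; the helper name `range_for_type` and the sample offsets are hypothetical.

```python
def range_for_type(blobs, blob_type):
    """Return (start, length) of the byte range covering all blobs of blob_type, or None."""
    matching = [b for b in blobs if b["type"] == blob_type]
    if not matching:
        return None
    start = min(b["offset"] for b in matching)
    end = max(b["offset"] + b["length"] for b in matching)
    return start, end - start


# Hypothetical footer entries, written sorted by type so the NDV blobs are adjacent.
blobs = [
    {"type": "apache-datasketches-theta-v1", "offset": 4, "length": 100},
    {"type": "ndv-long-little-endian", "offset": 104, "length": 8},
    {"type": "ndv-long-little-endian", "offset": 112, "length": 8},
]

ndv_range = range_for_type(blobs, "ndv-long-little-endian")
```

If the blobs of a type were not adjacent, the same range would also cover unrelated bytes in between, which is the cost the sorting is meant to avoid.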
I think this approach adds a lot of complexity. If a blob can be distilled into a few scalar values, like the theta sketch and a single NDV, then I would simply put the values into blob metadata. No need for additional blobs and encodings.
> I think this approach adds a lot of complexity.

I think it's actually simpler, because it doesn't require any new type of entity like "blob metadata without the blob itself".
Note that NDV information can exist without a Theta sketch, so we can't assume that NDV is merely information attached to the sketch.
For example, the current Trino version already allows computing NDV, whereas some engine & SPI changes are required to support Theta.
Awesome work, @findepi! I have a few comments to clarify and make it more strict but this is a great start.
I also had a bigger idea to make this a header-based format. I think that would simplify it a bit. I'd like to hear what you think.
Force-pushed from 9a67e8b to 3107292.
(AC, pushed changes)
Is it in scope for this doc to specify the lifecycle of these files? For example, is a particular stats file owned by a snapshot, or is this just a generic description of a file, possibly carrying more information, whose lifecycle may need special care? I mostly want to better understand the relationship between one of these files and a given table. Should a table know where all these files are, or is it up to a reader to know where potential files may be? Please let me know if this is out of scope for this PR.
| type | JSON string | yes | See [Blob types](#blob-types) |
| columns | list of JSON long | yes | list of column IDs the blob was computed for |
| offset | JSON long | yes | The offset in the file where the blob contents start. Reader should assume the value can be more than 2^32. |
| length | JSON long | yes | The length of the blob stored in the file |
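
As a rough illustration of how a reader might use the `offset` and `length` fields above, here is a minimal sketch. The function name `read_blob` and the in-memory sample file are assumptions made for the example, not part of the proposal.

```python
import io


def read_blob(f, meta):
    """Read one blob's bytes using the offset and length from its metadata entry."""
    f.seek(meta["offset"])
    data = f.read(meta["length"])
    if len(data) != meta["length"]:
        raise EOFError("file is shorter than offset + length")
    return data


# In-memory stand-in for a stats file; the layout here is made up for the example.
f = io.BytesIO(b"HEAD" + b"payload!" + b"footer")
meta = {"type": "example", "columns": [1], "offset": 4, "length": 8}

blob = read_blob(f, meta)
```

Python integers are unbounded, so offsets above 2^32 need no special care here; in languages with fixed-width integers the reader would want a 64-bit type for both fields.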
Uncompressed length would be nice.
I'd rather let the compression handle that. Both ZSTD and LZ4 can write the uncompressed size in the frame header.
IIRC some compression libraries (in Go) don't make the information easy to obtain from the frame header. I don't recall which compression formats.
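
For context on what "uncompressed size in the frame header" involves, here is a sketch that extracts the optional `Frame_Content_Size` field from a Zstandard frame header by hand, following the frame layout in RFC 8878. The function name is hypothetical, and the field is optional, so a reader relying on it would still have to handle its absence.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528  # stored little-endian at the start of every Zstandard frame


def zstd_frame_content_size(buf):
    """Return Frame_Content_Size from a Zstandard frame header, or None if absent."""
    if len(buf) < 5 or struct.unpack_from("<I", buf, 0)[0] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")
    descriptor = buf[4]
    fcs_flag = descriptor >> 6              # bits 7-6: Frame_Content_Size_Flag
    single_segment = (descriptor >> 5) & 1  # bit 5: Single_Segment_Flag
    dict_id_flag = descriptor & 3           # bits 1-0: Dictionary_ID_Flag

    pos = 5
    if not single_segment:
        pos += 1                            # Window_Descriptor is present
    pos += (0, 1, 2, 4)[dict_id_flag]       # optional Dictionary_ID

    fcs_size = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]
    if fcs_size == 0:
        return None
    if len(buf) < pos + fcs_size:
        raise ValueError("truncated frame header")
    value = int.from_bytes(buf[pos:pos + fcs_size], "little")
    if fcs_size == 2:
        value += 256                        # 2-byte sizes are stored with an offset of 256
    return value
```

Whether a given library exposes this value directly is a separate question, which is the concern raised about Go above.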
I don't think so. This is focused on how we manage blobs. The design doc linked at the top covers the lifecycle for stats, and other docs cover it for indexes.
The design doc states that these files are tracked in each snapshot with a "statistics" key and an object that tracks what payloads are in the file.
Force-pushed from 3107292 to dad2764.
AC. Should I also create a PR for table spec changes?
Force-pushed from 720e730 to b8268f1.
A proposed implementation of reader/writer: apache/iceberg#4537

#### `ndv-long-little-endian` blob type

8-bytes unsigned integer stored little-endian and representing number of distinct values.
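
For concreteness, the payload described above could be produced and parsed as follows. This is a minimal sketch using Python's `struct` module; the function names `encode_ndv` and `decode_ndv` are hypothetical.

```python
import struct


def encode_ndv(ndv):
    """Encode a distinct-value count as an 8-byte unsigned little-endian integer."""
    return struct.pack("<Q", ndv)


def decode_ndv(payload):
    """Decode an 8-byte unsigned little-endian payload back into a count."""
    if len(payload) != 8:
        raise ValueError("ndv payload must be exactly 8 bytes")
    return struct.unpack("<Q", payload)[0]


payload = encode_ndv(1)
```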
I don't see why this is needed, since it isn't a sketch. I think it is valuable to have either in the metadata that tracks this file, or in the footer. But a blob just to hold it seems overly complex to me.
What about introducing a `summary` object to `BlobMetadata` that can contain high-level information like this? Then we can specify that for the `apache-datasketches-theta-v1` blob, the summary metadata will have a `"ndv-count"` field with the number of distinct values as a JSON long.
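
A sketch of what this suggestion might look like in the footer JSON, assuming the `summary` object sits alongside the other `BlobMetadata` fields. The exact placement, the sample values, and the helper `ndv_from_footer` are assumptions for illustration only.

```python
import json

# Hypothetical BlobMetadata entry carrying the suggested summary object.
blob_metadata = {
    "type": "apache-datasketches-theta-v1",
    "columns": [1],
    "offset": 4,
    "length": 1024,
    "summary": {"ndv-count": 12345},
}
footer_json = json.dumps({"blobs": [blob_metadata]})


def ndv_from_footer(footer_json, column_id):
    """Return the ndv-count summarized for exactly this column, or None if not present."""
    for blob in json.loads(footer_json)["blobs"]:
        summary = blob.get("summary", {})
        if blob["columns"] == [column_id] and "ndv-count" in summary:
            return summary["ndv-count"]
    return None


ndv = ndv_from_footer(footer_json, 1)
```

With this shape, a reader gets the NDV from the footer alone and only fetches the sketch blob when it needs to merge sketches.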
> I don't see why this is needed, since it isn't a sketch.

This file format is not about "sketches". It's a container format for statistics and indexes, and some of them may be "sketches". This is why the spec uses a more vague term "blob".
NDV number is defined as a blob type, to avoid special casing this particular information.
Otherwise we should add this information in the Table spec directly, not in this file format at all.
This is a summary of a blob. We don't need a specific blob type for it and I think that having one makes this overly complex. What is the argument for making this a blob vs keeping it in the sketch metadata as a summary?
removed
Force-pushed from 95d8fc8 to 95bfd64.
Thank you @rdblue for your awesome review. AC; PTAL
Force-pushed from 95bfd64 to deff3db.
Force-pushed from deff3db to 2f27732.
AC
This may be ready to go. PTAL.
Force-pushed from 2f27732 to 3f6a2ce.
Force-pushed from 3f6a2ce to 32129fc.
Per offline conversation, the file format got a name. Meet Puffin.
Force-pushed from 32129fc to 0c603c4.
@rdblue anything else I could improve here?
Per @pvary's comment #80 (comment): please confirm and I will move it over.
Add a specification for Puffin format, a container file format to store indices and stats for Iceberg tables.
Force-pushed from 0c603c4 to ed57e44.
Per apache/iceberg#4944, and offline confirmation, the PR is moved to Iceberg repo apache/iceberg#4944
Add a specification for Puffin format, a container file format to store
indices and stats for Iceberg tables.
This follows earlier discussion in https://docs.google.com/document/d/1we0BuQbbdqiJS2eUFC_-6TPSuO57GXivzKmcTzApivY