diff --git a/landing-page/content/common/statistics-format-spec.md b/landing-page/content/common/statistics-format-spec.md
new file mode 100644
index 000000000..29540bbdf
--- /dev/null
+++ b/landing-page/content/common/statistics-format-spec.md
@@ -0,0 +1,113 @@
+---
+url: statistics-format
+toc: false
+---
+
+
+# Index and statistics file format
+
+This is a specification for the Plain Format for Iceberg Statistics, a file
+format designed to store information such as statistics about data managed in an
+Iceberg table that cannot be stored directly within the Iceberg manifest. A
+statistics file contains arbitrary pieces of information (here called "blobs"),
+along with the metadata necessary to interpret them. The blobs supported by
+Iceberg are documented at [Blob types](#blob-types).
+
+## Format specification
+
+A file conforming to the format specification should have the structure
+described below.
+
+### File structure
+
+The file has the following structure:
+
+```
+Magic Blob₁ Blob₂ ... Blobₙ Footer
+```
+
+where
+
+- `Magic` is four bytes 0x80, 0x70, 0x73, 0x83 (short for: Plain Format for
+  Indices and Statistics),
+- `Blobᵢ` is the i-th blob contained in the file, to be interpreted by the
+  application according to the footer,
+- `Footer` is defined below.
+
+### Footer structure
+
+The footer has the following structure:
+
+```
+FooterPayload FooterPayloadSize Reserved FileFormatVersion Magic
+```
+
+where
+
+- `FooterPayload` is an LZ4-compressed, UTF-8 encoded JSON payload describing
+  the blobs in the file, with the structure described below,
+- `FooterPayloadSize` is the length in bytes of the compressed `FooterPayload`,
+  stored as a 4-byte little-endian integer,
+- `Reserved` is 4 bytes reserved for future use; writers should currently set it to
+  0x00, 0x00, 0x00, 0x00,
+- `FileFormatVersion` is a number, stored as a 4-byte little-endian integer,
+- `Magic` is four bytes, the same as at the beginning of the file.
+
+### Footer Payload
+
+The footer payload is an LZ4-compressed, UTF-8 encoded JSON payload representing
+a single `FileMetadata` object.
+
+#### FileMetadata
+
+`FileMetadata` has the following fields:
+
+
+| Field Name | Field Type                              | Required | Description |
+| ---------- | --------------------------------------- | -------- | ----------- |
+| blobs      | list of BlobMetadata objects            | yes      |             |
+| properties | JSON object with string property values | no       | Storage for arbitrary meta-information, like writer identification/version |
+
+#### BlobMetadata
+
+`BlobMetadata` has the following fields:
+
+| Field Name        | Field Type        | Required | Description |
+| ----------------- | ----------------- | -------- | ----------- |
+| type              | JSON string       | yes      | See [Blob types](#blob-types) |
+| columns           | list of JSON long | yes      | List of column IDs the blob was computed for |
+| offset            | JSON long         | yes      | The offset in the file where the blob contents start. Readers should assume the value can exceed 2^32. |
+| length            | JSON long         | yes      | The length of the blob stored in the file |
+| compression_codec | JSON string       | no       | See [Compression codecs](#compression-codecs). If omitted, the data is assumed to be uncompressed. |
+
+### Blob types
+
+The file format makes no assumptions about the information stored in it.
+Applications using the format need to agree on the type of information stored,
+its format, and the type designators. The list below is not part of
+the file format specification itself, but is provided for readers' convenience.
+
+| Blob type                    | Description |
+| ---------------------------- | ----------- |
+| ndv-long-little-endian       | 8-byte integer, stored little-endian, representing the number of distinct values |
+| apache-datasketches-theta-v1 | A serialized form of a "compact" Theta sketch produced by the [Apache DataSketches](https://datasketches.apache.org/) library. |
+
+### Compression codecs
+
+The compression codec used should be one of: ZSTD, LZ4. The data can also be
+uncompressed. For maximal interoperability, other codecs are not supported.
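
The footer layout in the patch above implies that a reader works backwards from the end of the file: read the fixed 16-byte tail, use `FooterPayloadSize` to find the payload, then use each blob's `offset` and `length` to slice it out. The following Python sketch illustrates that reading order only. The function names `parse_footer` and `read_blob` are hypothetical, the 4-byte integers are assumed to be signed, and the LZ4 step is left to a caller-supplied `decompress` function because the text does not say which LZ4 framing is used and Python's standard library has no LZ4 support.

```python
import json
import struct
from typing import Callable

# FooterPayloadSize + Reserved + FileFormatVersion + Magic, 4 bytes each.
FOOTER_TAIL_SIZE = 16
MAGIC_SIZE = 4


def parse_footer(data: bytes, decompress: Callable[[bytes], bytes]) -> dict:
    """Decode the footer of a statistics file held fully in memory.

    `decompress` is a hypothetical hook that must undo the LZ4 compression
    of FooterPayload; it is an assumption of this sketch, not part of the spec.
    """
    if len(data) < MAGIC_SIZE + FOOTER_TAIL_SIZE:
        raise ValueError("file too short to contain magic and footer")

    leading_magic = data[:MAGIC_SIZE]
    if data[-MAGIC_SIZE:] != leading_magic:
        # The spec says the trailing magic equals the leading one.
        raise ValueError("trailing magic does not match leading magic")

    # Little-endian: payload size, 4 reserved bytes, format version.
    # Signedness of the integers is an assumption (signed here).
    payload_size, _reserved, version = struct.unpack(
        "<i4si", data[-FOOTER_TAIL_SIZE:-MAGIC_SIZE]
    )

    payload_end = len(data) - FOOTER_TAIL_SIZE
    payload_start = payload_end - payload_size
    if payload_size < 0 or payload_start < MAGIC_SIZE:
        raise ValueError("invalid FooterPayloadSize")

    payload = decompress(data[payload_start:payload_end])
    metadata = json.loads(payload.decode("utf-8"))
    return {"version": version, "metadata": metadata}


def read_blob(data: bytes, blob_metadata: dict) -> bytes:
    """Return the stored bytes of one blob, still compressed if a
    compression_codec is set in its BlobMetadata."""
    start = blob_metadata["offset"]
    end = start + blob_metadata["length"]
    return data[start:end]
```

A real reader would seek within the file instead of loading it whole, since blob offsets may exceed 2^32, and would apply the codec named in `compression_codec` to the bytes that `read_blob` returns.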