From 01097d8e2a6cf683c518d8b0da083aa5bd4a81a1 Mon Sep 17 00:00:00 2001
From: Piotr Findeisen
Date: Thu, 2 Jun 2022 14:11:30 +0200
Subject: [PATCH] Add statistics information in table snapshot

---
 format/spec.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/format/spec.md b/format/spec.md
index 1154cb74484e..1aec5091a177 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -665,9 +665,34 @@ Table metadata consists of the following fields:
 | _optional_ | _required_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. |
 | _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
 | | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |
+| _optional_ | _optional_ | **`snapshot-statistics`** | A list (optional) of [table statistics](#table-statistics). |
 
 For serialization details, see Appendix C.
 
+#### Table statistics
+
+Table statistics files are valid [Puffin files](../puffin-spec). Statistics are informational. A reader can choose to
+ignore statistics information. Statistics support is not required to read the table correctly. A table can contain
+many statistics files associated with different table snapshots.
+
+Statistics files metadata within `snapshot-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|------------|------------|---------------------------------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot the statistics were computed from. |
+| _required_ | _required_ | **`statistics-path`** | `string` | Path of the statistics file. See [Puffin file format](../puffin-spec). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the statistics file. |
+| _required_ | _required_ | **`file-footer-size-in-bytes`** | `long` | Total size of the statistics file's footer (not the footer payload size). See [Puffin file format](../puffin-spec) for footer definition. |
+| _required_ | _required_ | **`blob-metadata`** | `list` (see below) | A list of the blob metadata for statistics contained in the file with structure described below. |
+
+Blob metadata is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|------------|------------|------------------|-----------------------|----------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`type`** | `string` | Type of the blob. Matches Blob type in the Puffin file. |
+| _required_ | _required_ | **`fields`** | `list` | Ordered list of fields, given by field ID, on which the statistic was calculated. |
+| _required_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
+
 #### Commit Conflict Resolution and Retry