From cc660217723004e7ccd42c11ed261ed0dae9a4f2 Mon Sep 17 00:00:00 2001
From: Piotr Findeisen
Date: Thu, 2 Jun 2022 14:11:30 +0200
Subject: [PATCH] Add statistics information in table snapshot

---
 format/spec.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/format/spec.md b/format/spec.md
index 41b22ff5af64..ba83ffceaa15 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -496,6 +496,7 @@ A snapshot consists of the following fields:
 | _optional_ | | **`manifests`** | A list of manifest file locations. Must be omitted if `manifest-list` is present |
 | _optional_ | _required_ | **`summary`** | A string map that summarizes the snapshot changes, including `operation` (see below) |
 | _optional_ | _optional_ | **`schema-id`** | ID of the table's current schema when the snapshot was created |
+| | _optional_ | **`statistics`** | A list of [statistics files' metadata](#statistics-file). The field should be retained by writers, unless writer updates the statistics, or knows they became obsolete. |
 
 The snapshot summary's `operation` field is used by some operations, like snapshot expiration, to skip processing certain snapshots. Possible `operation` values are:
 
@@ -631,6 +632,29 @@ When expiring snapshots, retention policies in table and snapshot references are
     2. The snapshot is not one of the first `min-snapshots-to-keep` in the branch (including the branch's referenced snapshot)
 5. Expire any snapshot not in the set of snapshots to retain.
 
+#### Statistics file
+
+Statistics files are valid [Puffin files](../puffin-spec). Statistics are informational. A reader can choose to
+ignore statistics information. Statistics support is not required to read the table correctly.
+
+Statistics files' metadata within `statistics` table snapshot field is a struct with the following fields:
+
+| Field name | Type | Description |
+|---------------------------------|----------------------------------------|--------------------------------------------------------------------------------------------------------|
+| **`statistics-path`** | `string` | Path of the statistics file. See [Puffin file format](../puffin-spec). |
+| **`file-size-in-bytes`** | `long` | Size of the statistics file. |
+| **`file-footer-size-in-bytes`** | `long` | Size of the statistics file's footer. See [Puffin file format](../puffin-spec) for footer definition. |
+| **`source-sequence-number`** | `long` | Table sequence number at which the stats were calculated |
+| **`statistics-metadata`** | `list` (see below) | A list of the statistics metadata for statistics contained in the file with structure described below. |
+
+Statistic metadata is a struct with the following fields:
+
+| Field name | Type | Description |
+|-----------------------|-----------------------|--------------------------------------------------------------------------------------------------|
+| **`statistics-type`** | `string` | Type of the statistics. Matches blob type in the Puffin file. |
+| **`fields`** | `list` | Ordered list of fields, given by ID, on which the statistic was calculated. |
+| **`properties`** | `map` | Additional properties associated with the statistic. Matches Blob properties in the Puffin file. |
+
 ### Table Metadata
 
 Table metadata is stored as JSON. Each table metadata change creates a new table metadata file that is committed by an atomic operation. This operation is used to ensure that a new version of table metadata replaces the version on which it was based. This produces a linear history of table versions and ensures that concurrent writes are not lost.