From 6594d54f59ccdea6d721f04834c1e3d677152944 Mon Sep 17 00:00:00 2001
From: Piotr Findeisen
Date: Thu, 2 Jun 2022 14:11:30 +0200
Subject: [PATCH] Add statistics information in table snapshot

---
 format/spec.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/format/spec.md b/format/spec.md
index 9b64d49a993e..ca7cf8c58486 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -495,6 +495,7 @@ A snapshot consists of the following fields:
 | _optional_ | | **`manifests`** | A list of manifest file locations. Must be omitted if `manifest-list` is present |
 | _optional_ | _required_ | **`summary`** | A string map that summarizes the snapshot changes, including `operation` (see below) |
 | _optional_ | _optional_ | **`schema-id`** | ID of the table's current schema when the snapshot was created |
+| | _optional_ | **`statistics`** | A list of statistics files' metadata (see below) |
 
 The snapshot summary's `operation` field is used by some operations, like snapshot expiration, to skip processing certain snapshots. Possible `operation` values are:
 
@@ -513,6 +514,17 @@ Manifests for a snapshot are tracked by a manifest list.
 
 Valid snapshots are stored as a list in table metadata. For serialization, see Appendix C.
 
+Statistics files' metadata within `statistics` field is a struct with the following fields:
+
+| Field name | Type | Description |
+|---------------------------------|------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **`location`** | `string` | Location of the statistics file. See [Puffin file format](../puffin). |
+| **`file-size-in-bytes`** | `long` | Size of the statistics file. |
+| **`file-footer-size-in-bytes`** | `long` | Size of the statistics file's footer. See [Puffin file format](../puffin) for footer definition. |
+| **`source-sequence-number`** | `long` | Table sequence number at which the stats were calculated |
+| **`statistics-fields-sets`** | `map>>` | A map indicating which statistics are contained in the statistics file and on which columns they were calculated. The map keys are statistics sketch names and map values represent sets of columns, given by column ID. |
+
+Snapshot's statistics field should be retained by writers, unless writer updates the statistics, or knows they became obsolete.
 
 #### Manifest Lists