Add statistics information in table snapshot
findepi committed Jun 24, 2022
1 parent 77ac584 commit 8125ad7
Showing 1 changed file with 25 additions and 0 deletions.
25 changes: 25 additions & 0 deletions format/spec.md
@@ -496,6 +496,7 @@ A snapshot consists of the following fields:
| _optional_ | | **`manifests`** | A list of manifest file locations. Must be omitted if `manifest-list` is present |
| _optional_ | _required_ | **`summary`** | A string map that summarizes the snapshot changes, including `operation` (see below) |
| _optional_ | _optional_ | **`schema-id`** | ID of the table's current schema when the snapshot was created |
| _optional_ | _optional_ | **`statistics`** | A [statistics file's metadata](#statistics-file). Writers should retain this field unless they update the statistics or know that the statistics have become obsolete. |

The snapshot summary's `operation` field is used by some operations, like snapshot expiration, to skip processing certain snapshots. Possible `operation` values are:

@@ -631,6 +632,30 @@ When expiring snapshots, retention policies in table and snapshot references are
2. The snapshot is not one of the first `min-snapshots-to-keep` in the branch (including the branch's referenced snapshot)
5. Expire any snapshot not in the set of snapshots to retain.

#### Statistics file

Statistics files are valid [Puffin files](../puffin-spec). Statistics are informational. A reader can choose to
ignore statistics information. Statistics support is not required to read the table correctly.

The statistics file's metadata, stored in the snapshot's `statistics` field, is a struct with the following fields:

| v1 | v2 | Field name | Type | Description |
|------------|------------|---------------------------------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| _required_ | _required_ | **`statistics-path`** | `string` | Path of the statistics file. See [Puffin file format](../puffin-spec). |
| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the statistics file. |
| _required_ | _required_ | **`file-footer-size-in-bytes`** | `long` | Total size of the statistics file's footer (not the footer payload size). See [Puffin file format](../puffin-spec) for footer definition. |
| _required_ | _required_ | **`blob-metadata`** | `list<blob metadata>` (see below) | A list of the blob metadata for statistics contained in the file with structure described below. |

Blob metadata is a struct with the following fields:

| v1 | v2 | Field name | Type | Description |
|------------|------------|--------------------------------------|------------------------|--------------------------------------------------------------------------------------------------|
| _required_ | _required_ | **`type`** | `string` | Type of the statistic. Matches Blob type in the Puffin file. |
| _required_ | _required_ | **`fields`** | `list<integer>` | Ordered list of fields, given by field ID, on which the statistic was calculated. |
| _required_ | _required_ | **`source-snapshot-id`** | `long` | ID of the Iceberg table's snapshot the blob was computed from. |
| _required_ | _required_ | **`source-snapshot-sequence-numer`** | `long` | Sequence number of the Iceberg table's snapshot the blob was computed from. |
| _required_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Matches Blob properties in the Puffin file. |
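
As an illustration of the two structs above, the following minimal sketch builds a `statistics` value and checks it for the fields marked _required_ in the v2 column. It is not part of the spec: the helper `missing_statistics_keys`, the path, the sizes, the IDs, and the blob type are hypothetical, and the field names are copied exactly as they are spelled in the tables.

```python
# Required keys per the v2 column of the tables above, spelled exactly as listed there.
FILE_KEYS = (
    "statistics-path",
    "file-size-in-bytes",
    "file-footer-size-in-bytes",
    "blob-metadata",
)
BLOB_KEYS = (
    "type",
    "fields",
    "source-snapshot-id",
    "source-snapshot-sequence-numer",
)


def missing_statistics_keys(statistics):
    """Return the required keys absent from a `statistics` struct (hypothetical helper)."""
    missing = [key for key in FILE_KEYS if key not in statistics]
    for index, blob in enumerate(statistics.get("blob-metadata", [])):
        missing += [
            f"blob-metadata[{index}].{key}" for key in BLOB_KEYS if key not in blob
        ]
    return missing


# Illustrative values only: the path, sizes, IDs, and blob type are made up.
example_statistics = {
    "statistics-path": "s3://bucket/table/metadata/stats-0001.puffin",
    "file-size-in-bytes": 4096,
    "file-footer-size-in-bytes": 512,
    "blob-metadata": [
        {
            "type": "example-sketch-type",
            "fields": [1],
            "source-snapshot-id": 1234567890,
            "source-snapshot-sequence-numer": 7,
            "properties": {"example-key": "example-value"},
        }
    ],
}

print(missing_statistics_keys(example_statistics))  # prints []
```

Because statistics are informational, a reader that finds missing keys could simply ignore the `statistics` field rather than fail the read.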

### Table Metadata

Table metadata is stored as JSON. Each table metadata change creates a new table metadata file that is committed by an atomic operation. This operation is used to ensure that a new version of table metadata replaces the version on which it was based. This produces a linear history of table versions and ensures that concurrent writes are not lost.