Commit

Add statistics information in table snapshot
findepi committed Jun 7, 2022
1 parent 5d6c6cc commit 26da01e
Showing 1 changed file with 24 additions and 0 deletions.
24 changes: 24 additions & 0 deletions format/spec.md
@@ -496,6 +496,7 @@ A snapshot consists of the following fields:
| _optional_ | | **`manifests`** | A list of manifest file locations. Must be omitted if `manifest-list` is present |
| _optional_ | _required_ | **`summary`** | A string map that summarizes the snapshot changes, including `operation` (see below) |
| _optional_ | _optional_ | **`schema-id`** | ID of the table's current schema when the snapshot was created |
| | _optional_ | **`statistics`** | A list of [statistics file metadata](#statistics-file) structs. Writers should retain this field unless they update the statistics or know that the statistics have become obsolete. |
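
The retention rule for `statistics` can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, and how a writer decides that statistics are obsolete is not defined here.

```python
def statistics_for_new_snapshot(parent_statistics, updated_statistics=None,
                                known_obsolete=False):
    """Choose the `statistics` list for a snapshot a writer is about to commit.

    Hypothetical helper: all names here are illustrative, not part of the spec.
    """
    if updated_statistics is not None:
        # The writer updated the statistics, so the new entries replace the old ones.
        return list(updated_statistics)
    if known_obsolete:
        # The writer knows the old statistics no longer apply, so it drops them.
        return []
    # Default: retain whatever the parent snapshot carried.
    return list(parent_statistics or [])
```

A writer that neither updates statistics nor knows them to be obsolete takes the default branch and copies the parent's list unchanged.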

The snapshot summary's `operation` field is used by some operations, like snapshot expiration, to skip processing certain snapshots. Possible `operation` values are:

@@ -631,6 +632,29 @@ When expiring snapshots, retention policies in table and snapshot references are
2. The snapshot is not one of the first `min-snapshots-to-keep` in the branch (including the branch's referenced snapshot)
5. Expire any snapshot not in the set of snapshots to retain.

#### Statistics file

Statistics files are valid [Puffin files](../puffin-spec). Statistics are informational. A reader can choose to
ignore statistics information. Statistics support is not required to read the table correctly.

Each entry in the snapshot's `statistics` field is a statistics file metadata struct with the following fields:

| v2 | Field name | Type | Description |
|------------|---------------------------------|----------------------------------------|--------------------------------------------------------------------------------------------------------|
| _required_ | **`statistics-path`** | `string` | Path of the statistics file. See [Puffin file format](../puffin-spec). |
| _required_ | **`file-size-in-bytes`** | `long` | Size of the statistics file. |
| _required_ | **`file-footer-size-in-bytes`** | `long` | Size of the statistics file's footer. See [Puffin file format](../puffin-spec) for footer definition. |
| _required_ | **`source-sequence-number`** | `long` | Table sequence number at which the statistics were calculated. |
| _required_ | **`statistics-metadata`** | `list<statistic metadata>` (see below) | A list of metadata for the statistics contained in the file, with the structure described below. |

Statistic metadata is a struct with the following fields:

| v2 | Field name | Type | Description |
|------------|----------------------|-----------------------|--------------------------------------------------------------------------------------------------|
| _required_ | **`statistic-type`** | `string` | Type of the statistic. Matches Blob type in the Puffin file. |
| _required_ | **`fields`** | `list<integer>` | Ordered list of fields, given by field ID, on which the statistic was calculated. |
| _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Matches Blob properties in the Puffin file. |
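
As an illustration of the two structs above, the following Python sketch builds one `statistics` entry and checks its required fields. It assumes the struct is represented as a JSON-like object keyed by the field names in the tables. The path, sizes, field IDs, the `example-blob-type` value, and the `validate_statistics_file` helper are all hypothetical and are not defined by the spec.

```python
# Required fields and their expected Python types, taken from the two tables above.
# `properties` is optional, so it is checked only when present.
STATISTICS_FILE_REQUIRED = {
    "statistics-path": str,
    "file-size-in-bytes": int,
    "file-footer-size-in-bytes": int,
    "source-sequence-number": int,
    "statistics-metadata": list,
}
STATISTIC_REQUIRED = {"statistic-type": str, "fields": list}


def _check_required(struct, required, where):
    """Collect a message for every required field that is missing or mistyped."""
    problems = []
    for name, expected_type in required.items():
        if name not in struct:
            problems.append(f"{where}: missing required field '{name}'")
        elif not isinstance(struct[name], expected_type):
            problems.append(f"{where}: field '{name}' has the wrong type")
    return problems


def validate_statistics_file(entry):
    """Return a list of problems found in one `statistics` entry (empty if none)."""
    problems = _check_required(entry, STATISTICS_FILE_REQUIRED, "statistics file")
    stats = entry.get("statistics-metadata")
    if isinstance(stats, list):
        for index, stat in enumerate(stats):
            where = f"statistic[{index}]"
            problems += _check_required(stat, STATISTIC_REQUIRED, where)
            props = stat.get("properties")
            if props is not None and not isinstance(props, dict):
                problems.append(f"{where}: 'properties' must be a map")
    return problems


# Hypothetical entry; every value below is made up for illustration.
EXAMPLE_ENTRY = {
    "statistics-path": "s3://example-bucket/table/metadata/stats-0001.puffin",
    "file-size-in-bytes": 4096,
    "file-footer-size-in-bytes": 512,
    "source-sequence-number": 7,
    "statistics-metadata": [
        {"statistic-type": "example-blob-type", "fields": [1, 2]},
    ],
}

print(validate_statistics_file(EXAMPLE_ENTRY))  # prints []
```

Because statistics are informational, a reader could skip such a check entirely and still read the table correctly.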

### Table Metadata

Table metadata is stored as JSON. Each table metadata change creates a new table metadata file that is committed by an atomic operation. This operation is used to ensure that a new version of table metadata replaces the version on which it was based. This produces a linear history of table versions and ensures that concurrent writes are not lost.