This repository was archived by the owner on Feb 6, 2024. It is now read-only.

Fix spec links (#68)
samredai authored Mar 30, 2022
1 parent e6d2e34 commit 4f8dd64
Showing 1 changed file with 14 additions and 14 deletions.
28 changes: 14 additions & 14 deletions landing-page/content/common/spec.md
@@ -35,15 +35,15 @@ The format version number is incremented when new features are added that will b

Version 1 of the Iceberg spec defines how to manage large analytic tables using immutable file formats: Parquet, Avro, and ORC.

All version 1 data and metadata files are valid after upgrading a table to version 2. [Appendix E](spec/#version-2) documents how to default version 2 fields when reading version 1 metadata.
All version 1 data and metadata files are valid after upgrading a table to version 2. [Appendix E](#version-2) documents how to default version 2 fields when reading version 1 metadata.

### Version 2: Row-level Deletes

Version 2 of the Iceberg spec adds row-level updates and deletes for analytic tables with immutable files.

The primary change in version 2 adds delete files to encode rows that are deleted in existing data files. This version can be used to delete or replace individual rows in immutable data files without rewriting the files.

In addition to row-level deletes, version 2 makes some requirements stricter for writers. The full set of changes are listed in [Appendix E](spec/#version-2).
In addition to row-level deletes, version 2 makes some requirements stricter for writers. The full set of changes are listed in [Appendix E](#version-2).


## Goals
@@ -93,10 +93,10 @@ Row-level deletes are stored in delete files.

There are two ways to encode a row-level delete:

* [_Position deletes_](spec/#position-delete-files) mark a row deleted by data file path and the row position in the data file
* [_Equality deletes_](spec/#equality-delete-files) mark a row deleted by one or more column values, like `id = 5`
* [_Position deletes_](#position-delete-files) mark a row deleted by data file path and the row position in the data file
* [_Equality deletes_](#equality-delete-files) mark a row deleted by one or more column values, like `id = 5`

Like data files, delete files are tracked by partition. In general, a delete file must be applied to older data files with the same partition; see [Scan Planning](spec/#scan-planning) for details. Column metrics can be used to determine whether a delete file's rows overlap the contents of a data file or a scan range.
Like data files, delete files are tracked by partition. In general, a delete file must be applied to older data files with the same partition; see [Scan Planning](#scan-planning) for details. Column metrics can be used to determine whether a delete file's rows overlap the contents of a data file or a scan range.
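
As a rough illustration of the position-delete idea described above, the following Python sketch filters one data file's rows against a list of `(file_path, position)` pairs. The function name `apply_position_deletes` and the in-memory row lists are hypothetical; real readers work on Parquet, Avro, or ORC files and also apply the partition and ordering rules from Scan Planning, which this sketch ignores.

```python
from typing import Iterable, List, Tuple


def apply_position_deletes(
    file_path: str,
    rows: List[object],
    deletes: Iterable[Tuple[str, int]],
) -> List[object]:
    """Return the rows of one data file that survive the position deletes.

    `deletes` holds (file_path, position) pairs; only pairs whose path
    matches this data file are applied. Positions are 0-based here,
    which is an assumption of this sketch.
    """
    deleted_positions = {pos for path, pos in deletes if path == file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


live = apply_position_deletes(
    "data/f1.parquet",
    ["a", "b", "c"],
    [("data/f1.parquet", 1), ("data/f2.parquet", 0)],
)
```

Only the delete that names `data/f1.parquet` affects the result; the entry for the other file is skipped.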


### File System Operations
@@ -237,7 +237,7 @@ Field mapping fields are constrained by the following rules:
* Map types should contain mappings in `fields` for `key` and `value`.
* Struct types should contain mappings in `fields` for their child fields.

For details on serialization, see [Appendix C](spec/#name-mapping-serialization).
For details on serialization, see [Appendix C](#name-mapping-serialization).

#### Identifier Field IDs

@@ -356,20 +356,20 @@ Users can sort their data within partitions by columns to gain performance. The
A sort order is defined by a sort order id and a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field consists of:

* A **source column id** from the table's schema
* A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](spec/#partition-transforms).
* A **transform** that is used to produce values to be sorted on from the source column. This is the same transform as described in [partition transforms](#partition-transforms).
* A **sort direction**, that can only be either `asc` or `desc`
* A **null order** that describes the order of null values when sorted. Can only be either `nulls-first` or `nulls-last`

Order id `0` is reserved for the unsorted order.

Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons.
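
One way to reproduce the ordering listed above is to sort on a key derived from the IEEE 754 bit pattern, in the style of the `totalOrder` predicate. The Python sketch below is an illustration under that assumption, not the spec's or any implementation's actual comparator; `total_order_key` is a hypothetical helper name.

```python
import math
import struct


def total_order_key(value: float) -> int:
    """Map a double to an integer whose natural order follows
    -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits >> 63:
        # Sign bit set: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so non-negative values sort after negatives.
    return bits | (1 << 63)


negative_nan = math.copysign(float("nan"), -1.0)
values = [float("nan"), 1.5, -0.0, float("inf"), 0.0, float("-inf"), -1.5, negative_nan]
ordered = sorted(values, key=total_order_key)
```

Sorting on the integer key avoids comparing NaN values directly, which would otherwise always evaluate to false.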

A data or delete file is associated with a sort order by the sort order's id within [a manifest](spec/#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.


### Manifests

A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](spec/#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](spec/#manifest-lists) for each table snapshot.
A manifest is an immutable Avro file that lists data files or delete files, along with each file’s partition data tuple, metrics, and tracking information. One or more manifest files are used to store a [snapshot](#snapshots), which tracks all of the files in a table at some point in time. Manifests are tracked by a [manifest list](#manifest-lists) for each table snapshot.

A manifest is a valid Iceberg data file: files must use valid Iceberg formats, schemas, and column projection.

@@ -426,7 +426,7 @@ Notes:

1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](spec/#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.
3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.

The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.

@@ -577,9 +577,9 @@ Notes:
#### Snapshot Reference

Iceberg tables keep track of branches and tags using snapshot references.
Tags are labels for individual snapshots. Branches are mutable named references that can be updated by committing a new snapshot as the branch's referenced snapshot using the [Commit Conflict Resolution and Retry](spec/#commit-conflict-resolution-and-retry) procedures.
Tags are labels for individual snapshots. Branches are mutable named references that can be updated by committing a new snapshot as the branch's referenced snapshot using the [Commit Conflict Resolution and Retry](#commit-conflict-resolution-and-retry) procedures.

The snapshot reference object records all the information of a reference including snapshot ID, reference type and [Snapshot Retention Policy](spec/#snapshot-retention-policy).
The snapshot reference object records all the information of a reference including snapshot ID, reference type and [Snapshot Retention Policy](#snapshot-retention-policy).

| v1 | v2 | Field name | Type | Description |
| ---------- | ---------- | ---------------------------- | --------- | ----------- |
@@ -728,9 +728,9 @@ The rows in the delete file must be sorted by `file_path` then `position` to opt

Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.

Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](spec/#manifests). Float and double columns cannot be used as delete columns in equality delete files.
Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files.

A data row is deleted if its values are equal to all delete columns for any row in an equality delete file that applies to the row's data file (see [`Scan Planning`](spec/#scan-planning)).
A data row is deleted if its values are equal to all delete columns for any row in an equality delete file that applies to the row's data file (see [`Scan Planning`](#scan-planning)).

Each row of the delete file produces one equality predicate that matches any row where the delete columns are equal. Multiple columns can be thought of as an `AND` of equality predicates. A `null` value in a delete column matches a row if the row's value is `null`, equivalent to `col IS NULL`.
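
The matching rule above can be sketched in a few lines of Python. Rows are modelled as plain dictionaries keyed by column name, and `is_deleted` is a hypothetical helper; the sketch assumes the delete file has already been found to apply to the row's data file, and it leaves out field-id resolution.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def is_deleted(row: Row, delete_rows: Iterable[Row], delete_columns: List[str]) -> bool:
    """Return True if any delete row matches `row` on every delete column.

    Each delete row acts as an AND of equality predicates. A None in a
    delete column matches only a None in the data row, mirroring
    `col IS NULL` (in Python, None == None is True).
    """
    return any(
        all(row.get(col) == delete_row.get(col) for col in delete_columns)
        for delete_row in delete_rows
    )


delete_rows = [{"id": 5, "region": "eu"}, {"id": 7, "region": None}]
delete_columns = ["id", "region"]

hit = is_deleted({"id": 5, "region": "eu", "qty": 3}, delete_rows, delete_columns)
miss = is_deleted({"id": 5, "region": "us", "qty": 3}, delete_rows, delete_columns)
null_hit = is_deleted({"id": 7, "region": None, "qty": 1}, delete_rows, delete_columns)
```

Columns outside `delete_columns`, such as `qty` here, play no part in the match even when the delete file carries them.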
