Spec: Document NameMapping #3556

Merged
merged 8 commits into from
Nov 17, 2021
39 changes: 39 additions & 0 deletions site/docs/spec.md
@@ -212,6 +212,24 @@ Columns in Iceberg data files are selected by field id. The table schema's colum

For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
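The projection in this example can be sketched in a few lines of Python. This is only an illustration of selecting columns by field id; the names `FILE_SCHEMA`, `PROJECTION`, and `project_row` are hypothetical and not part of any Iceberg API.

```python
from typing import Any, Dict, List, Tuple

# File schema from the example: field id -> column name inside the data file.
FILE_SCHEMA: Dict[int, str] = {1: "a", 2: "b", 3: "c"}

# Projection schema from the example: (field id, name in the current table schema).
PROJECTION: List[Tuple[int, str]] = [(3, "measurement"), (2, "name"), (4, "a")]


def project_row(file_row: Dict[str, Any]) -> Dict[str, Any]:
    """Select columns by field id; ids absent from the file yield None."""
    projected: Dict[str, Any] = {}
    for field_id, projected_name in PROJECTION:
        file_column = FILE_SCHEMA.get(field_id)
        if file_column is None:
            # Field id 4 was never written to this file, so it reads as null.
            projected[projected_name] = None
        else:
            projected[projected_name] = file_row[file_column]
    return projected


row = project_row({"a": 7, "b": "sensor", "c": 1.5})
print(row)  # {'measurement': 1.5, 'name': 'sensor', 'a': None}
```

Note that the projected column `a` is not the file column `a`: matching is by id, so the file's `a` (id 1) is dropped and the projected `a` (id 4) is filled with nulls.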

Tables may also define a property `schema.name-mapping.default` with a JSON name mapping containing a list of field mapping objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain:

* `names`: A required list of 0 or more names for a field.
* `field-id`: An optional Iceberg field ID used when a field's name is present in `names`.
* `fields`: An optional list of field mappings for child fields of structs, maps, and lists.

Field mapping fields are constrained by the following rules:

* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
* Each child field should be defined with its own field mapping under `fields`.
* Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`.
* Fields which exist only in the Iceberg schema and not in imported data files may use an empty `names` list.
* Fields that exist in imported files but not in the Iceberg schema may omit `field-id`.
* List types should contain a mapping in `fields` for `element`.
* Map types should contain mappings in `fields` for `key` and `value`.
* Struct types should contain mappings in `fields` for their child fields.

For details on serialization, see [Appendix C](#name-mapping-serialization).
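As a rough illustration of the rules above, the following Python sketch resolves a fallback field ID from a name mapping. It is a minimal sketch, not the reference implementation: the helper names `find_mapping` and `lookup_field_id` are hypothetical, and the sample mapping is invented to exercise each rule.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping chosen to exercise the rules above.
NAME_MAPPING: List[Dict[str, Any]] = [
    {"field-id": 1, "names": ["id", "record_id"]},  # several names, one id
    {"field-id": 2, "names": ["a.b"]},              # literal name containing a dot
    {"names": ["legacy_col"]},                      # exists in files only: no field-id
    {"field-id": 3, "names": []},                   # exists in the Iceberg schema only
    {"field-id": 4, "names": ["location"], "fields": [
        {"field-id": 5, "names": ["latitude", "lat"]},
    ]},
]


def find_mapping(mappings: List[Dict[str, Any]], name: str) -> Optional[Dict[str, Any]]:
    """Return the first mapping whose `names` contains the literal name."""
    for mapping in mappings:
        if name in mapping["names"]:
            return mapping
    return None


def lookup_field_id(mappings: List[Dict[str, Any]], path: List[str]) -> Optional[int]:
    """Resolve a field id for a path of literal names, from the root down."""
    current = mappings
    mapping: Optional[Dict[str, Any]] = None
    for name in path:
        mapping = find_mapping(current, name)
        if mapping is None:
            return None
        # Child fields are only reachable through the parent's `fields` list.
        current = mapping.get("fields", [])
    return mapping.get("field-id") if mapping is not None else None


print(lookup_field_id(NAME_MAPPING, ["record_id"]))        # 1
print(lookup_field_id(NAME_MAPPING, ["a.b"]))              # 2
print(lookup_field_id(NAME_MAPPING, ["a", "b"]))           # None
print(lookup_field_id(NAME_MAPPING, ["location", "lat"]))  # 5
```

Passing the path as a list of names, rather than a dotted string, keeps a literal `a.b` distinct from child `b` of `a`.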

#### Identifier Field IDs

@@ -990,6 +1008,27 @@ Table metadata is serialized as a JSON object according to the following table.
|**`default-sort-order-id`**|`JSON int`|`0`|


### Name Mapping Serialization

A name mapping is serialized as a list of field mapping JSON objects, each of which is serialized as follows:

|Field mapping field|JSON representation|Example|
|--- |--- |--- |
|**`names`**|`JSON list of strings`|`["latitude", "lat"]`|
|**`field-id`**|`JSON int`|`1`|
|**`fields`**|`JSON field mappings (list of objects)`|`[{ `<br />&nbsp;&nbsp;`"field-id": 4,`<br />&nbsp;&nbsp;`"names": ["latitude", "lat"]`<br />`}, {`<br />&nbsp;&nbsp;`"field-id": 5,`<br />&nbsp;&nbsp;`"names": ["longitude", "long"]`<br />`}]`|

Example:
```json
[ { "field-id": 1, "names": ["id", "record_id"] },
{ "field-id": 2, "names": ["data"] },
{ "field-id": 3, "names": ["location"], "fields": [
{ "field-id": 4, "names": ["latitude", "lat"] },
{ "field-id": 5, "names": ["longitude", "long"] }
] } ]
```
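To show how a reader might consume this serialization, here is a small Python sketch that parses the example above with the standard `json` module and checks each object against the table. The function `collect_field_ids` is a hypothetical helper, and the strictness of the checks is an assumption rather than something the spec prescribes.

```python
import json
from typing import Any, List

NAME_MAPPING_JSON = """
[ { "field-id": 1, "names": ["id", "record_id"] },
  { "field-id": 2, "names": ["data"] },
  { "field-id": 3, "names": ["location"], "fields": [
    { "field-id": 4, "names": ["latitude", "lat"] },
    { "field-id": 5, "names": ["longitude", "long"] }
  ] } ]
"""


def collect_field_ids(mappings: Any) -> List[int]:
    """Validate a parsed name mapping and return its field ids depth-first."""
    if not isinstance(mappings, list):
        raise ValueError("a name mapping must be a JSON list")
    ids: List[int] = []
    for mapping in mappings:
        if not isinstance(mapping, dict):
            raise ValueError("each field mapping must be a JSON object")
        names = mapping.get("names")
        # `names` is required, but may be an empty list.
        if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
            raise ValueError("`names` must be a list of strings")
        if "field-id" in mapping:
            field_id = mapping["field-id"]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(field_id, int) or isinstance(field_id, bool):
                raise ValueError("`field-id` must be a JSON int")
            ids.append(field_id)
        if "fields" in mapping:
            ids.extend(collect_field_ids(mapping["fields"]))
    return ids


parsed = json.loads(NAME_MAPPING_JSON)
print(collect_field_ids(parsed))  # [1, 2, 3, 4, 5]
```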


## Appendix D: Single-value serialization

This serialization scheme is for storing single values as individual binary values in the lower and upper bounds maps of manifest files.