From 0844e243bdc2091f6f4ec46cf3b09d536f18454b Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Mon, 15 Nov 2021 13:44:11 -0600 Subject: [PATCH 1/8] API: Add information on NameMapping to Spec --- site/docs/spec.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/site/docs/spec.md b/site/docs/spec.md index 5627df919572..9f3fcfd5db2e 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -212,6 +212,9 @@ Columns in Iceberg data files are selected by field id. The table schema's colum For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order. +Tables may also define a property `schema.name-mapping.default` with a JSON map of `columnName` -> `fieldId` which will be used if a data file was written without field ids. This `NameMapping` will **only** be used on files without field ids. Files imported or added to an Iceberg table from a system that does not generate field ids will fall back to using the table's name mapping to map columns to field ids. + + #### Identifier Field IDs From 1c0a45a04aa3448f3a8795a4efc96b42dd690038 Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Mon, 15 Nov 2021 16:39:15 -0600 Subject: [PATCH 2/8] Correct mistakes in doc, apply reviewer comments --- site/docs/spec.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index 9f3fcfd5db2e..2feceedebdd8 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -212,9 +212,33 @@ Columns in Iceberg data files are selected by field id. The table schema's colum For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. 
This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order. -Tables may also define a property `schema.name-mapping.default` with a JSON map of `columnName` -> `fieldId` which will be used if a data file was written without field ids. This `NameMapping` will **only** be used on files without field ids. Files imported or added to an Iceberg table from a system that does not generate field ids will fall back to using the table's name mapping to map columns to field ids. +Tables may also define a property `schema.name-mapping.default` with a JSON `name mapping` containing a list of `field mapping` objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain +##### field mapping +* `names`: A required list of 0 or more names for a field. +* `field-id`: An optional Iceberg field ID used when a field's name is present in `names` +* `fields`: An optional list of field mappings for child field of structs, maps, and lists. + +##### names + +A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. Each child field should be defined with their own `field-mapping` under `fields` + +Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`. + +Fields which exist only in the Iceberg schema and not in imported data files may be included as a `field-mapping` with an empty list of `names`. + +##### field-id + +Fields that exist in imported files but not in the Iceberg schema may omit `field-id`. + +##### fields + +List types should contain a mapping in `fields` for `element` + +Map types should contain mappings in `fields` for `key` and `value`. 
+ +Struct types should contain mappings for their child fields. #### Identifier Field IDs From 588d1e7c3b29e80c1178e7285f09aa99affafd5b Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Mon, 15 Nov 2021 16:43:41 -0600 Subject: [PATCH 3/8] Plural fix --- site/docs/spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index 2feceedebdd8..aa6755e1f4a2 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -226,7 +226,7 @@ A name may contain `.` but this refers to a literal name, not a nested field. Fo Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`. -Fields which exist only in the Iceberg schema and not in imported data files may be included as a `field-mapping` with an empty list of `names`. +Fields which exist only in the Iceberg schema and not in imported data files may be included as `field-mapping`s with an empty `names` list. ##### field-id From 5c83c66e7b06cd549bd5e3f204b8158eeacff0cc Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Mon, 15 Nov 2021 21:54:13 -0600 Subject: [PATCH 4/8] Remove Subheaders, Add Appendix --- site/docs/spec.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index aa6755e1f4a2..2f061499a9fe 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -214,32 +214,26 @@ For example, a file may be written with schema `1: a int, 2: b string, 3: c doub Tables may also define a property `schema.name-mapping.default` with a JSON `name mapping` containing a list of `field mapping` objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain -##### field mapping - * `names`: A required list of 0 or more names for a field. 
* `field-id`: An optional Iceberg field ID used when a field's name is present in `names` * `fields`: An optional list of field mappings for child field of structs, maps, and lists. -##### names - A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. Each child field should be defined with their own `field-mapping` under `fields` Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`. Fields which exist only in the Iceberg schema and not in imported data files may be included as `field-mapping`s with an empty `names` list. -##### field-id - Fields that exist in imported files but not in the Iceberg schema may omit `field-id`. -##### fields - List types should contain a mapping in `fields` for `element` Map types should contain mappings in `fields` for `key` and `value`. Struct types should contain mappings for their child fields. +For details on serialization see [Appendix F](#appendix-f-name-mapping-serialization) + #### Identifier Field IDs A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C). @@ -1110,3 +1104,13 @@ Writing v2 metadata: * `sort_columns` was removed Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements. 
+ +## Appendix F: Name Mapping Serialization + +Name mapping is serialized as a list of field mapping JSON Objects which are serialized as follows + +|Field mapping field|JSON representation|Example| +|--- |--- |--- | +|**`names`**|`JSON list of strings`|`["latitude", "lat"]`| +|**`field_id`**|`JSON int`|`1`| +|**`fields`**|`JSON field mappings (list of objects)`|`[{ `
  `"field-id": 4,`
  `"names": ["latitude", "lat"]`
`}, {`
  `"field-id": 5,`
  `"names": ["longitude", "long"]`
`}]`| \ No newline at end of file From a978a0504aca457a7ece20fa05ce12f2697c7b08 Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Tue, 16 Nov 2021 11:00:14 -0600 Subject: [PATCH 5/8] Add example --- site/docs/spec.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index 2f061499a9fe..d6a571c8fe4b 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -1113,4 +1113,14 @@ Name mapping is serialized as a list of field mapping JSON Objects which are ser |--- |--- |--- | |**`names`**|`JSON list of strings`|`["latitude", "lat"]`| |**`field_id`**|`JSON int`|`1`| -|**`fields`**|`JSON field mappings (list of objects)`|`[{ `
  `"field-id": 4,`
  `"names": ["latitude", "lat"]`
`}, {`
  `"field-id": 5,`
  `"names": ["longitude", "long"]`
`}]`| \ No newline at end of file +|**`fields`**|`JSON field mappings (list of objects)`|`[{ `
  `"field-id": 4,`
  `"names": ["latitude", "lat"]`
`}, {`
  `"field-id": 5,`
  `"names": ["longitude", "long"]`
`}]`| + +Example +```json +[ { "field-id": 1, "names": ["id", "record_id"] }, + { "field-id": 2, "names": ["data"] }, + { "field-id": 3, "names": ["location"], "fields": [ + { "field-id": 4, "names": ["latitude", "lat"] }, + { "field-id": 5, "names": ["longitude", "long"] } + ] } ] +``` \ No newline at end of file From 46764e1fec6066890451bc0aa6435cc6026b555d Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Tue, 16 Nov 2021 15:44:44 -0600 Subject: [PATCH 6/8] Move Appendix into C, change paragraphs to list --- site/docs/spec.md | 63 ++++++++++++++++++++++------------------------- 1 file changed, 30 insertions(+), 33 deletions(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index d6a571c8fe4b..84a0f9b1d4c1 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -218,21 +218,18 @@ Tables may also define a property `schema.name-mapping.default` with a JSON `nam * `field-id`: An optional Iceberg field ID used when a field's name is present in `names` * `fields`: An optional list of field mappings for child field of structs, maps, and lists. -A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. Each child field should be defined with their own `field-mapping` under `fields` +Field mapping fields are constrained by the following rules -Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`. +* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. +* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. 
+* For example, all Avro field aliases should be listed in `names`. +* Fields which exist only in the Iceberg schema and not in imported data files may be included as `field-mapping`s with an empty `names` list. +* Fields that exist in imported files but not in the Iceberg schema may omit `field-id`. +* List types should contain a mapping in `fields` for `element`. +* Map types should contain mappings in `fields` for `key` and `value`. +* Struct types should contain mappings inf `fields` for their child fields. -Fields which exist only in the Iceberg schema and not in imported data files may be included as `field-mapping`s with an empty `names` list. - -Fields that exist in imported files but not in the Iceberg schema may omit `field-id`. - -List types should contain a mapping in `fields` for `element` - -Map types should contain mappings in `fields` for `key` and `value`. - -Struct types should contain mappings for their child fields. - -For details on serialization see [Appendix F](#appendix-f-name-mapping-serialization) +For details on serialization see [Appendix C](#name-mapping-serialization) #### Identifier Field IDs @@ -1010,6 +1007,26 @@ Table metadata is serialized as a JSON object according to the following table. |**`sort-orders`**|`JSON sort orders (list of sort field object)`|`See above`| |**`default-sort-order-id`**|`JSON int`|`0`| +### Name Mapping Serialization + +Name mapping is serialized as a list of field mapping JSON Objects which are serialized as follows + +|Field mapping field|JSON representation|Example| +|--- |--- |--- | +|**`names`**|`JSON list of strings`|`["latitude", "lat"]`| +|**`field-id`**|`JSON int`|`1`| +|**`fields`**|`JSON field mappings (list of objects)`|`[{ `
  `"field-id": 4,`
  `"names": ["latitude", "lat"]`
`}, {`
  `"field-id": 5,`
  `"names": ["longitude", "long"]`
`}]`| + +Example +```json +[ { "field-id": 1, "names": ["id", "record_id"] }, + { "field-id": 2, "names": ["data"] }, + { "field-id": 3, "names": ["location"], "fields": [ + { "field-id": 4, "names": ["latitude", "lat"] }, + { "field-id": 5, "names": ["longitude", "long"] } + ] } ] +``` + ## Appendix D: Single-value serialization @@ -1104,23 +1121,3 @@ Writing v2 metadata: * `sort_columns` was removed Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements. - -## Appendix F: Name Mapping Serialization - -Name mapping is serialized as a list of field mapping JSON Objects which are serialized as follows - -|Field mapping field|JSON representation|Example| -|--- |--- |--- | -|**`names`**|`JSON list of strings`|`["latitude", "lat"]`| -|**`field_id`**|`JSON int`|`1`| -|**`fields`**|`JSON field mappings (list of objects)`|`[{ `
  `"field-id": 4,`
  `"names": ["latitude", "lat"]`
`}, {`
  `"field-id": 5,`
  `"names": ["longitude", "long"]`
`}]`| - -Example -```json -[ { "field-id": 1, "names": ["id", "record_id"] }, - { "field-id": 2, "names": ["data"] }, - { "field-id": 3, "names": ["location"], "fields": [ - { "field-id": 4, "names": ["latitude", "lat"] }, - { "field-id": 5, "names": ["longitude", "long"] } - ] } ] -``` \ No newline at end of file From 09e3add808781442e4ec7d9b56d284cc2a6b502a Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Tue, 16 Nov 2021 16:50:27 -0600 Subject: [PATCH 7/8] Reivewer Suggestions and Corrections --- site/docs/spec.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index 84a0f9b1d4c1..372b2a734ce5 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -218,18 +218,18 @@ Tables may also define a property `schema.name-mapping.default` with a JSON `nam * `field-id`: An optional Iceberg field ID used when a field's name is present in `names` * `fields`: An optional list of field mappings for child field of structs, maps, and lists. -Field mapping fields are constrained by the following rules +Field mapping fields are constrained by the following rules: * A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. -* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. -* For example, all Avro field aliases should be listed in `names`. -* Fields which exist only in the Iceberg schema and not in imported data files may be included as `field-mapping`s with an empty `names` list. +* Each child field should be defined with their own `field-mapping` under `fields`. +* Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. 
For example, all Avro field aliases should be listed in `names`. +* Fields which exist only in the Iceberg schema and not in imported data files may use an empty `names` list. * Fields that exist in imported files but not in the Iceberg schema may omit `field-id`. * List types should contain a mapping in `fields` for `element`. * Map types should contain mappings in `fields` for `key` and `value`. -* Struct types should contain mappings inf `fields` for their child fields. +* Struct types should contain mappings in `fields` for their child fields. -For details on serialization see [Appendix C](#name-mapping-serialization) +For details on serialization, see [Appendix C](#name-mapping-serialization). #### Identifier Field IDs @@ -1007,6 +1007,7 @@ Table metadata is serialized as a JSON object according to the following table. |**`sort-orders`**|`JSON sort orders (list of sort field object)`|`See above`| |**`default-sort-order-id`**|`JSON int`|`0`| + ### Name Mapping Serialization Name mapping is serialized as a list of field mapping JSON Objects which are serialized as follows From 45b53828d31c973700ddf0a2a5326ebf700aac18 Mon Sep 17 00:00:00 2001 From: Russell_Spitzer Date: Wed, 17 Nov 2021 13:09:25 -0600 Subject: [PATCH 8/8] Remove Fixed Width from Name Mapping and Field Mapping --- site/docs/spec.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/site/docs/spec.md b/site/docs/spec.md index 372b2a734ce5..61891063ca30 100644 --- a/site/docs/spec.md +++ b/site/docs/spec.md @@ -212,7 +212,7 @@ Columns in Iceberg data files are selected by field id. The table schema's colum For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order. 
-Tables may also define a property `schema.name-mapping.default` with a JSON `name mapping` containing a list of `field mapping` objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain +Tables may also define a property `schema.name-mapping.default` with a JSON name mapping containing a list of field mapping objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain * `names`: A required list of 0 or more names for a field. * `field-id`: An optional Iceberg field ID used when a field's name is present in `names` @@ -221,7 +221,7 @@ Tables may also define a property `schema.name-mapping.default` with a JSON `nam Field mapping fields are constrained by the following rules: * A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. -* Each child field should be defined with their own `field-mapping` under `fields`. +* Each child field should be defined with their own field mapping under `fields`. * Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`. * Fields which exist only in the Iceberg schema and not in imported data files may use an empty `names` list. * Fields that exist in imported files but not in the Iceberg schema may omit `field-id`.
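
The fallback lookup described by these rules can be sketched as follows. This is a minimal illustration, not the Iceberg library's API: the helper `find_field_id` and the constant `NAME_MAPPING_JSON` are hypothetical names, and the mapping itself is the example from the Name Mapping Serialization section. The sketch assumes a column is addressed by a path with one entry per nesting level, so a name containing `.` is matched literally.

```python
import json
from typing import Optional

# Example name mapping, taken from the serialization example in this patch.
NAME_MAPPING_JSON = """
[ { "field-id": 1, "names": ["id", "record_id"] },
  { "field-id": 2, "names": ["data"] },
  { "field-id": 3, "names": ["location"], "fields": [
    { "field-id": 4, "names": ["latitude", "lat"] },
    { "field-id": 5, "names": ["longitude", "long"] }
  ] } ]
"""


def find_field_id(mappings: list, path: list) -> Optional[int]:
    """Hypothetical helper: resolve a column path to a fallback field id.

    `path` holds one entry per nesting level, so ["a.b"] is a single
    top-level column literally named "a.b", not child `b` of field `a`.
    Returns None when nothing matches or the match omits `field-id`.
    """
    if not path:
        return None
    head, rest = path[0], path[1:]
    for mapping in mappings:
        # `names` may be empty for fields that exist only in the schema.
        if head in mapping.get("names", []):
            if not rest:
                # `field-id` is optional, so this may be None.
                return mapping.get("field-id")
            # Descend into child mappings for structs, maps, and lists.
            return find_field_id(mapping.get("fields", []), rest)
    return None


name_mapping = json.loads(NAME_MAPPING_JSON)

print(find_field_id(name_mapping, ["record_id"]))        # 1 (alias of `id`)
print(find_field_id(name_mapping, ["location", "lat"]))  # 4 (nested child)
print(find_field_id(name_mapping, ["location.lat"]))     # None (literal name)
```

A real reader would apply such a lookup only to files that carry no field ids. Files that already contain ids are projected by id as usual.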