Spec: Document NameMapping #3556

RussellSpitzer · 2021-11-15T19:53:05Z

While we have a significant amount of code relying on the NameMapping of a table, we don't actually include any information on this table property in the Spec. This PR aims to codify what we already do in most implementations of Iceberg.

RussellSpitzer · 2021-11-15T19:55:23Z

#3542 - @rymurr + @rdblue + @ajantha-bhat + @aokolnychyi + @openinx Please take a look :)

rdblue · 2021-11-15T21:56:10Z

site/docs/spec.md

@@ -212,6 +212,9 @@ Columns in Iceberg data files are selected by field id. The table schema's colum

 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.

+Tables may also define a property `schema.name-mapping.default` with a JSON map of `columnName` -> `fieldId` which will be used if a data file was written without field ids. This `NameMapping` will **only** be used on files without field ids. Files imported or added to an Iceberg table from a system that does not generate field ids will fall back to using the table's name mapping to map columns to field ids.
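The projection example in the context line of this hunk can be sketched in a few lines of Python. This is only an illustration of selecting columns by field id under simplified assumptions; `file_schema`, `projection`, and `project_row` are hypothetical names and not part of any Iceberg API.

```python
from typing import Any, Dict, List, Tuple

# File schema from the example: field id -> column name as written in the file.
file_schema: Dict[int, str] = {1: "a", 2: "b", 3: "c"}

# Projection schema from the example: (field id, name seen by the reader), in order.
projection: List[Tuple[int, str]] = [(3, "measurement"), (2, "name"), (4, "a")]


def project_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Select file columns by field id; ids absent from the file yield None."""
    out: Dict[str, Any] = {}
    for field_id, read_name in projection:
        file_name = file_schema.get(field_id)
        # Field id 4 is not in the file, so its column is all nulls.
        out[read_name] = row[file_name] if file_name is not None else None
    return out


row = {"a": 7, "b": "x", "c": 1.5}
projected = project_row(row)
print(projected)  # {'measurement': 1.5, 'name': 'x', 'a': None}
```

Note that the projected column `a` (id 4) is unrelated to the file column `a` (id 1): matching is by id, never by name.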


I'd change "This NameMapping will only" to "This NameMapping may only" because we're not describing behavior, we are setting requirements for behavior.

This is a great start, but should also specify the name mapping itself more formally.

A name mapping is a list of field mapping objects. Each field mapping has the following properties:

* `names`: A required list of 0 or more names for a field. Note that names may contain `.`
* `field-id`: An optional Iceberg field ID to be used for a field with one of the given names
* `fields`: An optional list of field mappings for child fields of structs, maps, and lists

A field mapping may map multiple names to a single field ID to support cases where a name has been updated. For example, Avro field aliases should also be listed in `names`. Similarly, fields that exist only in the Iceberg schema may be in the field mapping with an empty list of names, and fields that exist in imported files but not in the Iceberg schema may omit `field-id`.

Mappings for list types should contain a child mapping for the "element" field and mappings for map types should contain child mappings for "key" and "value" fields.

Fields that are not mapped to IDs must be ignored.
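To make the proposed structure concrete, here is a rough Python sketch of a name mapping and a lookup over it. The JSON below is a made-up sample following the three properties described above (it is not the example from `NameMappingParser.java`), and `find_mapping` and `resolve_id` are hypothetical helpers, not Iceberg functions.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical name mapping, written for illustration only.
NAME_MAPPING_JSON = """
[
  {"field-id": 1, "names": ["id", "record_id"]},
  {"field-id": 2, "names": ["data"]},
  {"field-id": 3, "names": ["location"], "fields": [
    {"field-id": 4, "names": ["latitude", "lat"]},
    {"field-id": 5, "names": ["longitude", "long"]}
  ]},
  {"field-id": 6, "names": ["tags"], "fields": [
    {"field-id": 7, "names": ["element"]}
  ]},
  {"names": ["legacy_col"]}
]
"""

FieldMapping = Dict[str, Any]


def find_mapping(mappings: List[FieldMapping], name: str) -> Optional[FieldMapping]:
    """Return the field mapping at this level that lists `name`, if any."""
    for mapping in mappings:
        if name in mapping["names"]:
            return mapping
    return None


def resolve_id(mappings: List[FieldMapping], path: List[str]) -> Optional[int]:
    """Walk a path of literal field names; None means 'not mapped, ignore'."""
    found: Optional[FieldMapping] = None
    level = mappings
    for name in path:
        found = find_mapping(level, name)
        if found is None:
            return None
        level = found.get("fields", [])
    # `field-id` is optional, so a matched mapping may still carry no id.
    return found.get("field-id") if found is not None else None


name_mapping = json.loads(NAME_MAPPING_JSON)
print(resolve_id(name_mapping, ["record_id"]))        # 1
print(resolve_id(name_mapping, ["location", "lat"]))  # 4
print(resolve_id(name_mapping, ["legacy_col"]))       # None
```

A reader following the last rule would skip any column for which `resolve_id` returns `None`, whether the name is absent from the mapping or present without a `field-id`.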

Sounds good to me, I'll make the mods

Oh, and if a name does contain `.`, then the field name itself must contain `.`. A mapping for `names: ["a.b"]` will map a field called "a.b" and does NOT apply to field "b" nested within a field "a".
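The literal-name rule is easy to get wrong, so a small sketch may help. It assumes a hypothetical mapping that contains both a top-level field named `a.b` and a struct `a` with child `b`; `lookup` is an illustrative helper that takes a path as a list of literal names and never splits on `.`.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping: a top-level field literally named "a.b",
# plus a struct "a" whose child is named "b".
mapping: List[Dict[str, Any]] = [
    {"field-id": 1, "names": ["a.b"]},
    {"field-id": 2, "names": ["a"], "fields": [{"field-id": 3, "names": ["b"]}]},
]


def lookup(mappings: List[Dict[str, Any]], path: List[str]) -> Optional[int]:
    """Resolve a path of literal names; names are compared whole, never split."""
    match: Optional[Dict[str, Any]] = None
    level = mappings
    for name in path:
        match = next((m for m in level if name in m["names"]), None)
        if match is None:
            return None
        level = match.get("fields", [])
    return match.get("field-id") if match is not None else None


print(lookup(mapping, ["a.b"]))     # 1: the field literally named "a.b"
print(lookup(mapping, ["a", "b"]))  # 3: child "b" of struct "a"
```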

What do you think about including an example JSON? I was thinking of just adding in the example in NameMappingParser.java

I think we should have a section on serializing these in the appendix and can give examples there, like the others. Does that work?

I think it's easier to have whole examples, but I added one in the style of the other appendixes

I agree that a whole example would be rather helpful

RussellSpitzer · 2021-11-15T22:40:51Z

Updated

RussellSpitzer · 2021-11-16T03:55:38Z

Updated again

rymurr · 2021-11-16T14:11:58Z

LGTM w/ the addition that a full example would really help the reader

RussellSpitzer · 2021-11-16T17:02:00Z

@rymurr example added!

rdblue · 2021-11-16T17:52:58Z

site/docs/spec.md

+
+Struct types should contain mappings for their child fields.
+
+For details on serialization see [Appendix F](#appendix-f-name-mapping-serialization)


Can we avoid adding another appendix and add this in Appendix C: JSON serialization? https://iceberg.apache.org/#spec/#appendix-c-json-serialization

rdblue · 2021-11-16T17:56:21Z

Looking great! Just a couple things I'd change.

rdblue · 2021-11-16T17:58:57Z

site/docs/spec.md

+
+Map types should contain mappings in `fields` for `key` and `value`.
+
+Struct types should contain mappings for their child fields.


Minor: I think these related items could be a single paragraph.

The last five "paragraphs" might be easier to read as a list (of rules for implementors to follow), similar to how the Commit Conflict Resolution and Retry section is formatted.

I would prefer that @electrum since I think these are discrete rules that aren't really related.

electrum

Nice, the new formatting is much easier to read.

electrum · 2021-11-16T22:28:20Z

site/docs/spec.md

+Field mapping fields are constrained by the following rules
+
+* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. 
+* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. 


It seems like the second sentence should be part of the next item.

* Each child field should be defined with their own `field-mapping` under `fields`.
* Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`.

Why is it `field-mapping` rather than "field mapping"? Shouldn't it be "Each child field should be defined with its own field mapping under `fields`"?

Also, I think it makes sense to combine the "Each child field..." part with the first bullet because it explains how nesting works, not just names. Maybe it makes sense to everyone else this way though?

electrum · 2021-11-16T22:33:09Z

site/docs/spec.md

+* Fields that exist in imported files but not in the Iceberg schema may omit `field-id`.
+* List types should contain a mapping in `fields` for `element`. 
+* Map types should contain mappings in `fields` for `key` and `value`. 
+* Struct types should contain mappings inf `fields` for their child fields.


I still see it?

That's me, pushing to the wrong repository
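The list, map, and struct rules in the hunk above can be illustrated by deriving a name mapping from a schema. The sketch below assumes a simplified schema representation loosely modeled on the spec's JSON type serialization (`element-id`, `key-id`, `value-id`, nested `fields`); `child_fields` and `to_field_mapping` are hypothetical helpers and not the Iceberg implementation.

```python
from typing import Any, Dict, List, Union

SchemaType = Union[str, Dict[str, Any]]


def child_fields(ftype: SchemaType) -> List[Dict[str, Any]]:
    """Return the child fields a mapping should list under `fields`."""
    if not isinstance(ftype, dict):
        return []  # primitive types have no children
    kind = ftype["type"]
    if kind == "list":
        return [{"id": ftype["element-id"], "name": "element", "type": ftype["element"]}]
    if kind == "map":
        return [
            {"id": ftype["key-id"], "name": "key", "type": ftype["key"]},
            {"id": ftype["value-id"], "name": "value", "type": ftype["value"]},
        ]
    if kind == "struct":
        return ftype["fields"]
    return []


def to_field_mapping(field: Dict[str, Any]) -> Dict[str, Any]:
    """Build one field mapping, recursing into nested types."""
    mapping: Dict[str, Any] = {"field-id": field["id"], "names": [field["name"]]}
    children = [to_field_mapping(c) for c in child_fields(field["type"])]
    if children:
        mapping["fields"] = children
    return mapping


# Hypothetical schema used only for this example.
schema_fields = [
    {"id": 1, "name": "id", "type": "long"},
    {"id": 2, "name": "tags",
     "type": {"type": "list", "element-id": 3, "element": "string"}},
    {"id": 4, "name": "props",
     "type": {"type": "map", "key-id": 5, "key": "string",
              "value-id": 6, "value": "string"}},
    {"id": 7, "name": "point",
     "type": {"type": "struct", "fields": [
         {"id": 8, "name": "x", "type": "double"},
         {"id": 9, "name": "y", "type": "double"},
     ]}},
]

name_mapping = [to_field_mapping(f) for f in schema_fields]
```

With this input, the `tags` mapping gets a single `element` child, `props` gets `key` and `value`, and `point` gets one child mapping per struct field.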

rdblue · 2021-11-16T23:12:13Z

site/docs/spec.md

+
+* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`. 
+* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. 
+* For example, all Avro field aliases should be listed in `names`.


I think this should be part of the previous bullet because it is clarifying the behavior of names and giving an example where you'd have multiple names.

rdblue · 2021-11-16T23:14:38Z

site/docs/spec.md

@@ -212,6 +212,24 @@ Columns in Iceberg data files are selected by field id. The table schema's colum

 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.

+Tables may also define a property `schema.name-mapping.default` with a JSON `name mapping` containing a list of `field mapping` objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain


`name mapping` and `field mapping` are fixed width, but aren't properties? Why use a fixed-width font?

I thought about this as names of new terms we are defining. We can remove the font

Yeah, I think we should.

rdblue · 2021-11-17T18:19:41Z

+1 once the fixed-width font for name mapping and field mapping is fixed. Thanks @RussellSpitzer!

RussellSpitzer · 2021-11-17T19:35:05Z

@electrum Are you +1? Let me know if you have any other mods

electrum · 2021-11-17T20:20:48Z

@RussellSpitzer Looks good to me.

RussellSpitzer · 2021-11-17T20:22:13Z

OK, thanks for the review, everyone! I'll merge this now.

RussellSpitzer · 2021-11-17T20:23:58Z

Closes #3542

API: Add information on NameMapping to Spec

0844e24

github-actions bot added the docs label Nov 15, 2021

rdblue changed the title ~~API: Add information on NameMapping to Spec~~ Spec: Document NameMapping Nov 15, 2021

rdblue reviewed Nov 15, 2021

rdblue reviewed Nov 15, 2021

Correct mistakes in doc, apply reviewer comments

1c0a45a

Plural fix

588d1e7

rdblue reviewed Nov 15, 2021

rdblue reviewed Nov 15, 2021

Remove Subheaders, Add Appendix

5c83c66

RussellSpitzer force-pushed the AddNameMappingToSpec branch from 0e46237 to 5c83c66 Compare November 16, 2021 03:58

Add example

a978a05

rymurr approved these changes Nov 16, 2021

rdblue reviewed Nov 16, 2021

alexjo2144 mentioned this pull request Nov 16, 2021

Handle Iceberg files with missing Field IDs trinodb/trino#9959

Merged

Move Appendix into C, change paragraphs to list

46764e1

electrum reviewed Nov 16, 2021

Reivewer Suggestions and Corrections

09e3add

rdblue reviewed Nov 16, 2021

Remove Fixed Width from Name Mapping and Field Mapping

45b5382

RussellSpitzer merged commit cf972cf into apache:master Nov 17, 2021

RussellSpitzer deleted the AddNameMappingToSpec branch November 17, 2021 20:24

Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Nov 23, 2021

Spec: Document NameMapping (apache#3556)

b0e55f8
