Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Document NameMapping #3556

Merged
merged 8 commits into from
Nov 17, 2021

Conversation

RussellSpitzer
Copy link
Member

While we have a significant amount of code relying on the NameMapping of a table, we don't actually include any information on this table property in the Spec. This PR aims to codify what we already do in most implementations of Iceberg.

@github-actions github-actions bot added the docs label Nov 15, 2021
@RussellSpitzer
Copy link
Member Author

#3542 - @rymurr + @rdblue + @ajantha-bhat + @aokolnychyi + @openinx Please take a look :)

@rdblue rdblue changed the title API: Add information on NameMapping to Spec Spec: Document NameMapping Nov 15, 2021
@@ -212,6 +212,9 @@ Columns in Iceberg data files are selected by field id. The table schema's colum

For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.

Tables may also define a property `schema.name-mapping.default` with a JSON map of `columnName` -> `fieldId` which will be used if a data file was written without field ids. This `NameMapping` will **only** be used on files without field ids. Files imported or added to an Iceberg table from a system that does not generate field ids will fall back to using the table's name mapping to map columns to field ids.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change "This NameMapping will only" to "This NameMapping may only" because we're not describing behavior, we are setting requirements for behavior.

This is a great start, but should also specify the name mapping itself more formally.

A name mapping is a list of field mapping objects. Each field mapping has the following properties:

  • names: A required list of 0 or more names for a field. Note that names may contain .
  • field-id: An optional Iceberg field ID to be used for a field with one of the given names
  • fields: An optional list of field mappings for child fields of structs, maps, and lists

A field mapping may map multiple names to a single field ID to support cases where a name has been updated. For example, Avro field aliases should also be listed in names. Similarly, fields that exist only in the Iceberg schema may be in the field mapping with an empty list of names, and fields that exist in imported files but not in the Iceberg schema may omit field-id.

Mappings for list types should contain a child mapping for the "element" field and mappings for map types should contain child mappings for "key" and "value" fields.

Fields that are not mapped to IDs must be ignored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, i'll make the mods

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and if name does contain ., then the field name itself must contain .. A mapping for names: ["a.b"] will map a field called "a.b" and does NOT apply to field "b" nested within a field "a".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about including an example JSON? I was thinking of just adding in the example in NameMappingParser.java

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a section on serializing these in the appendix and can give examples there, like the others. Does that work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's easier to have whole examples, but I added one in the style of the other appendixes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a whole example would be rather helpful

@RussellSpitzer
Copy link
Member Author

Updated

site/docs/spec.md Outdated Show resolved Hide resolved
site/docs/spec.md Outdated Show resolved Hide resolved
@RussellSpitzer
Copy link
Member Author

Updated again

@rymurr
Copy link
Contributor

rymurr commented Nov 16, 2021

LGTM w/ the addition that a full example would really help the reader

@RussellSpitzer
Copy link
Member Author

@rymurr example added!


Struct types should contain mappings for their child fields.

For details on serialization see [Appendix F](#appendix-f-name-mapping-serialization)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid adding another appendix and add this in Appendix C: JSON serialization? https://iceberg.apache.org/#spec/#appendix-c-json-serialization

@rdblue
Copy link
Contributor

rdblue commented Nov 16, 2021

Looking great! Just a couple things I'd change.


Map types should contain mappings in `fields` for `key` and `value`.

Struct types should contain mappings for their child fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I think these related items could be a single paragraph.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last five "paragraphs" might be easier to read as a list (of rules for implementors to follow), similar to how the Commit Conflict Resolution and Retry section is formatted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that @electrum since I think these are discrete rules that aren't really related.

Copy link
Contributor

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, the new formatting is much easier to read.

Field mapping fields are constrained by the following rules

* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the second sentence should be part of the next item.

  • Each child field should be defined with their own field-mapping under fields.
  • Multiple values for names may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it field-mapping rather than "field mapping"? Shouldn't it be "Each child field should be defined with its own field mapping under fields".

Also, I think it makes sense to combine the "Each child field..." part with the first bullet because it explains how nesting works, not just names. Maybe it makes sense to everyone else this way though?

site/docs/spec.md Outdated Show resolved Hide resolved
site/docs/spec.md Show resolved Hide resolved
site/docs/spec.md Outdated Show resolved Hide resolved
* Fields that exist in imported files but not in the Iceberg schema may omit `field-id`.
* List types should contain a mapping in `fields` for `element`.
* Map types should contain mappings in `fields` for `key` and `value`.
* Struct types should contain mappings inf `fields` for their child fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo "inf"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's me, pushing to the wrong repository

site/docs/spec.md Outdated Show resolved Hide resolved

* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files.
* For example, all Avro field aliases should be listed in `names`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be part of the previous bullet because it is clarifying the behavior of names and giving an example where you'd have multiple names.

@@ -212,6 +212,24 @@ Columns in Iceberg data files are selected by field id. The table schema's colum

For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.

Tables may also define a property `schema.name-mapping.default` with a JSON `name mapping` containing a list of `field mapping` objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name mapping and field mapping are fixed width, but aren't properties? Why used fixed width font?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this as names of new terms we are defining. We can remove the font

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we should.

@rdblue
Copy link
Contributor

rdblue commented Nov 17, 2021

+1 once the fixed-width font for name mapping and field mapping are fixed. Thanks @RussellSpitzer!

@RussellSpitzer
Copy link
Member Author

@electrum Are you +1? Let me know if you have any other mods

@electrum
Copy link
Contributor

@RussellSpitzer Looks good to me.

@RussellSpitzer
Copy link
Member Author

Ok thanks for review everyone! I'll merge this now

@RussellSpitzer RussellSpitzer merged commit cf972cf into apache:master Nov 17, 2021
@RussellSpitzer
Copy link
Member Author

Closes #3542

@RussellSpitzer RussellSpitzer deleted the AddNameMappingToSpec branch November 17, 2021 20:24
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Nov 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants