Spec: Document NameMapping #3556
#3542 - @rymurr + @rdblue + @ajantha-bhat + @aokolnychyi + @openinx Please take a look :)
site/docs/spec.md
Outdated
@@ -212,6 +212,9 @@ Columns in Iceberg data files are selected by field id. The table schema's colum

For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.

Tables may also define a property `schema.name-mapping.default` with a JSON map of `columnName` -> `fieldId` which will be used if a data file was written without field ids. This `NameMapping` will **only** be used on files without field ids. Files imported or added to an Iceberg table from a system that does not generate field ids will fall back to using the table's name mapping to map columns to field ids.
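The projection example quoted in this hunk can be sketched in Python. This is an illustration only, assuming rows are plain dicts keyed by the file's column names; `file_schema`, `projection`, and `project_row` are hypothetical names, not Iceberg APIs.

```python
# File schema from the example: field id -> column name as written in the file.
file_schema = {1: "a", 2: "b", 3: "c"}

# Projection schema from the example, in read order: (field id, current name).
projection = [(3, "measurement"), (2, "name"), (4, "a")]


def project_row(row):
    """Select file columns by field id; ids missing from the file read as None."""
    out = {}
    for field_id, name in projection:
        file_column = file_schema.get(field_id)
        out[name] = row[file_column] if file_column is not None else None
    return out


# The file's column "a" (id 1) is not selected; the projected "a" (id 4) is all nulls.
result = project_row({"a": 1, "b": "x", "c": 2.5})
```

Selection is driven entirely by field id, which is why the projected column named `a` does not pick up the file column named `a`.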
I'd change "This NameMapping will only" to "This NameMapping may only" because we're not describing behavior, we are setting requirements for behavior.
This is a great start, but should also specify the name mapping itself more formally.
A name mapping is a list of field mapping objects. Each field mapping has the following properties:

- `names`: A required list of 0 or more names for a field. Note that names may contain `.`.
- `field-id`: An optional Iceberg field ID to be used for a field with one of the given names
- `fields`: An optional list of field mappings for child fields of structs, maps, and lists

A field mapping may map multiple names to a single field ID to support cases where a name has been updated. For example, Avro field aliases should also be listed in names. Similarly, fields that exist only in the Iceberg schema may be in the field mapping with an empty list of names, and fields that exist in imported files but not in the Iceberg schema may omit `field-id`.

Mappings for list types should contain a child mapping for the "element" field and mappings for map types should contain child mappings for "key" and "value" fields.

Fields that are not mapped to IDs must be ignored.
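As a rough illustration of the properties described in this comment, a name mapping could look like the following when built in Python and stored as a JSON string. The column names and ids are made up, and the exact JSON layout is an assumption based on the property names above rather than on the final spec text.

```python
import json

# Hypothetical name mapping; every name and id here is illustrative.
name_mapping = [
    # Two names mapped to one field ID, e.g. a renamed column or an Avro alias.
    {"field-id": 1, "names": ["id", "record_id"]},
    # Field that exists only in the Iceberg schema: empty list of names.
    {"field-id": 2, "names": []},
    # Field that exists only in imported files: no field-id.
    {"names": ["legacy_flag"]},
    # List type: child mapping for "element".
    {
        "field-id": 3,
        "names": ["tags"],
        "fields": [{"field-id": 4, "names": ["element"]}],
    },
    # Map type: child mappings for "key" and "value".
    {
        "field-id": 5,
        "names": ["attrs"],
        "fields": [
            {"field-id": 6, "names": ["key"]},
            {"field-id": 7, "names": ["value"]},
        ],
    },
]

# The table property carries the mapping as a JSON string.
table_properties = {"schema.name-mapping.default": json.dumps(name_mapping)}
```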
Sounds good to me, I'll make the mods
Oh, and if name does contain `.`, then the field name itself must contain `.`. A mapping for `names: ["a.b"]` will map a field called `"a.b"` and does NOT apply to field `"b"` nested within a field `"a"`.
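A small Python sketch of the dotted-name rule, assuming field mappings are plain dicts with the `names`, `field-id`, and `fields` keys described earlier in this thread; `find_mapping` is a hypothetical helper, not part of any Iceberg library.

```python
def find_mapping(mappings, name):
    """Return the first field mapping whose `names` contains `name` exactly, else None."""
    for mapping in mappings:
        if name in mapping.get("names", []):
            return mapping
    return None


mappings = [
    # A top-level field whose literal name is "a.b".
    {"field-id": 1, "names": ["a.b"]},
    # A struct "a" with a child field "b".
    {"field-id": 2, "names": ["a"], "fields": [{"field-id": 3, "names": ["b"]}]},
]

# "a.b" is matched as a whole name; it is never split on ".".
literal = find_mapping(mappings, "a.b")

# The nested field is reached by descending through `fields`, one level at a time.
nested = find_mapping(find_mapping(mappings, "a")["fields"], "b")
```

Here `literal` resolves to field ID 1 and `nested` to field ID 3, so the two never collide.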
What do you think about including an example JSON? I was thinking of just adding in the example in NameMappingParser.java
I think we should have a section on serializing these in the appendix and can give examples there, like the others. Does that work?
I think it's easier to have whole examples, but I added one in the style of the other appendixes
I agree that a whole example would be rather helpful
Updated
Updated again
0e46237 to 5c83c66
LGTM w/ the addition that a full example would really help the reader
@rymurr example added!
site/docs/spec.md
Outdated

Struct types should contain mappings for their child fields.

For details on serialization see [Appendix F](#appendix-f-name-mapping-serialization)
Can we avoid adding another appendix and add this in Appendix C: JSON serialization? https://iceberg.apache.org/#spec/#appendix-c-json-serialization
Looking great! Just a couple things I'd change.
site/docs/spec.md
Outdated

Map types should contain mappings in `fields` for `key` and `value`.

Struct types should contain mappings for their child fields.
Minor: I think these related items could be a single paragraph.
The last five "paragraphs" might be easier to read as a list (of rules for implementors to follow), similar to how the Commit Conflict Resolution and Retry section is formatted.
I would prefer that @electrum since I think these are discrete rules that aren't really related.
Nice, the new formatting is much easier to read.
site/docs/spec.md
Outdated
Field mapping fields are constrained by the following rules

* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files.
It seems like the second sentence should be part of the next item.
- Each child field should be defined with their own `field-mapping` under `fields`.
- Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files. For example, all Avro field aliases should be listed in `names`.
Why is it `field-mapping` rather than "field mapping"? Shouldn't it be "Each child field should be defined with its own field mapping under `fields`"?
Also, I think it makes sense to combine the "Each child field..." part with the first bullet because it explains how nesting works, not just names. Maybe it makes sense to everyone else this way though?
site/docs/spec.md
Outdated
* Fields that exist in imported files but not in the Iceberg schema may omit `field-id`.
* List types should contain a mapping in `fields` for `element`.
* Map types should contain mappings in `fields` for `key` and `value`.
* Struct types should contain mappings inf `fields` for their child fields.
Typo "inf"
I still see it?
That's me, pushing to the wrong repository
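The list, map, and struct rules quoted in this thread, together with the earlier rule that fields not mapped to IDs must be ignored, can be sketched as a recursive lookup. This is a simplified illustration, not how any Iceberg implementation does it: `assign_ids` is a hypothetical helper, and the file schema is modeled as a nested dict of column names.

```python
def assign_ids(columns, mappings):
    """Resolve field ids for the columns of a file written without ids.

    `columns` maps each column name to None (primitive) or to a nested dict of
    children: struct fields, "element" for lists, "key"/"value" for maps.
    Columns with no mapping, or whose mapping omits `field-id`, are ignored.
    """
    by_name = {name: m for m in mappings for name in m.get("names", [])}
    resolved = {}
    for name, children in columns.items():
        mapping = by_name.get(name)
        if mapping is None or "field-id" not in mapping:
            continue  # not mapped to an id: ignore the column
        child_ids = assign_ids(children, mapping.get("fields", [])) if children else {}
        resolved[name] = (mapping["field-id"], child_ids)
    return resolved


# Illustrative mapping and file schema; all names and ids are made up.
mappings = [
    {"field-id": 1, "names": ["id"]},
    {"names": ["extra"]},  # known name, but no field-id
    {"field-id": 2, "names": ["points"],
     "fields": [{"field-id": 3, "names": ["element"]}]},
    {"field-id": 4, "names": ["props"],
     "fields": [{"field-id": 5, "names": ["key"]}, {"field-id": 6, "names": ["value"]}]},
]
file_columns = {
    "id": None,
    "extra": None,
    "unknown": None,
    "points": {"element": None},
    "props": {"key": None, "value": None},
}

resolved = assign_ids(file_columns, mappings)
```

In this sketch `extra` and `unknown` are both dropped: the first has a mapping without `field-id`, the second has no mapping at all.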
site/docs/spec.md
Outdated

* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
* Each child field should be defined with their own `field-mapping` under `fields`. Multiple values for `names` may be mapped to a single field ID to support cases where a field may have different names in different data files.
* For example, all Avro field aliases should be listed in `names`.
I think this should be part of the previous bullet because it is clarifying the behavior of `names` and giving an example where you'd have multiple names.
site/docs/spec.md
Outdated
@@ -212,6 +212,24 @@ Columns in Iceberg data files are selected by field id. The table schema's colum

For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.

Tables may also define a property `schema.name-mapping.default` with a JSON `name mapping` containing a list of `field mapping` objects. These mappings provide fallback field ids to be used when a data file does not contain field id information. Each object should contain
`name mapping` and `field mapping` are fixed width, but aren't properties? Why use a fixed-width font?
I thought about this as names of new terms we are defining. We can remove the font
Yeah, I think we should.
+1 once the fixed-width font for name mapping and field mapping is fixed. Thanks @RussellSpitzer!
@electrum Are you +1? Let me know if you have any other mods
@RussellSpitzer Looks good to me.
OK, thanks for the review, everyone! I'll merge this now
Closes #3542
While we have a significant amount of code relying on the `NameMapping` of a table, we don't actually include any information on this table property in the Spec. This PR aims to codify what we already do in most implementations of Iceberg.