Add column field ID control in parquet writer #10504
why is the list check required?
presumably because "stub" elements can't have a field id
Similar to how we set schema names, my idea was to not set field ID for "intermediate" schemas for lists. Removing it now since it's not necessarily required.
Based on offline discussion with @devavret, we agreed to keep the list check since "stub" elements are not supposed to have field ID.
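To make the list check concrete, here is a minimal sketch in plain Python, not the PR's actual code. It assumes the standard three-level Parquet list layout (outer group, a repeated `list` group, then the element) and treats the repeated `list` group as the "stub". The names `ColumnMeta`, `SchemaNode`, and `build_schema` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnMeta:
    """Hypothetical user-supplied metadata for one input column."""
    name: str
    field_id: Optional[int] = None
    is_list: bool = False
    children: List["ColumnMeta"] = field(default_factory=list)


@dataclass
class SchemaNode:
    """Hypothetical stand-in for one Parquet schema element."""
    name: str
    field_id: Optional[int] = None
    is_stub: bool = False
    children: List["SchemaNode"] = field(default_factory=list)


def build_schema(col: ColumnMeta) -> SchemaNode:
    """Copy user field IDs onto real columns, never onto list stubs."""
    node = SchemaNode(col.name, col.field_id)
    if col.is_list:
        # Assumed layout: <column> (LIST) -> repeated group "list" -> element.
        element = build_schema(col.children[0])
        # The intermediate "list" group has no user-facing column,
        # so it is marked as a stub and gets no field ID.
        stub = SchemaNode("list", field_id=None, is_stub=True, children=[element])
        node.children = [stub]
    else:
        node.children = [build_schema(child) for child in col.children]
    return node


# A STRUCT column with a LIST child, each level carrying its own field ID.
struct_col = ColumnMeta(
    "s",
    field_id=1,
    children=[
        ColumnMeta(
            "arr",
            field_id=2,
            is_list=True,
            children=[ColumnMeta("element", field_id=3)],
        )
    ],
)
root = build_schema(struct_col)
```

In this sketch the field ID set on the list column lands on the outer LIST group and the element keeps its own ID, while the stub in between is left unset.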
Let's confirm it as well. @jlowe, would Spark have a `field_id` corresponding to a stub element? Or is it mostly used for the parent-most level?
Based on the PR description in https://issues.apache.org/jira/browse/SPARK-38094:
I assume it's mainly for the outermost level.
It is expecting to find a specified field ID on a child column of a StructType (STRUCT) column. Theoretically there could be a STRUCT of LIST (array) column and the user could specify it on the array column. See https://github.com/apache/spark/pull/35385/files#diff-487304e31da0dcde467c1f8561f42edcb3a811a755d8bc0424e4f3ad084099c3R156-R175
However, I don't know whether the underlying Parquet message types allow a field ID on the array column vs. the child column of the array.