-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add column field ID control in parquet writer #10504
Merged
rapids-bot
merged 20 commits into
rapidsai:branch-22.06
from
PointKernel:parquet-field-id-writing
Apr 15, 2022
Merged
Changes from 8 commits
Commits
Show all changes
20 commits
Select commit
Hold shift + click to select a range
29e5334
Add parquet field id setter in column_in_metadata
PointKernel 5f00d1d
Cleanups: use thrust::optional
PointKernel 97a8ed0
Write field id in metadata
PointKernel 0da0c5e
Minor cleanups: Chronotypes include DurationTypes already
PointKernel 120e38c
Update unit tests
PointKernel 572d2d7
Cleanups: remove unused header, do not read field_id, more comprehens…
PointKernel cea5476
Restrict lambda captures
PointKernel 4477cc6
Fix a bug: add field id when it's set
PointKernel 4b6d61c
Clean up unit tests: use empty columns
PointKernel e5eee1f
Add field ID reading
PointKernel ba4ed0c
Revert back to is_set + value APIs
PointKernel 5a85337
Merge remote-tracking branch 'upstream/branch-22.06' into parquet-fie…
PointKernel c1ed6c8
Java JNI to support Parquet field id
2f7560f
Fix code style
31fe992
Add test case
648445f
Remove cpp test
PointKernel f657407
Merge branch 'parquet-field-id-writing' of github.com:PointKernel/cud…
PointKernel de3d17a
Add a boolean sentinel for field ID
2eeccef
Merge branch 'parquet-field-id-writing' of github.com:PointKernel/cud…
64427a4
Fix and add a test case
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the list check required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
presumably because "stub" elements can't have a field id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to how we set schema names, my idea was to not set field ID for "intermediate" schemas for lists. Removing it now since it's not necessarily required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on offline discussion with @devavret, we agreed to keep the list check since "stub" elements are not supposed to have field ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's confirm it as well. @jlowe , would spark have a
field_id
corresponding to a stub element? Or is it mostly used for the parent-most level?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the PR description in https://issues.apache.org/jira/browse/SPARK-38094:
I assume it's mainly for the outter-most level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is expecting to find a specified field ID on a child column of a StructType (STRUCT) column. Theoretically there could be a STRUCT of LIST (array) column and the user could specify it on the array column. See https://github.com/apache/spark/pull/35385/files#diff-487304e31da0dcde467c1f8561f42edcb3a811a755d8bc0424e4f3ad084099c3R156-R175
However I don't know whether the underlying Parquet message types allow a field ID on the array column vs. the child column of the array.