-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata is kept in projections for non-derived columns #1378
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @hntd187 ❤️
-
I think we also need to handle
Schema
level metadata and I left a comment to that effect below. -
For some reason your PR appears to change the submodule
testing
pin.
I also did some checking around, and it seems like "project a Schema
" and "project a DFSchema
" are quite common operations (e.g. here) so it might make sense to pull them out into a separate function -- I can do this as a follow on PR too, but wanted to mention it.
e.nullable(&input_schema)?, | ||
)) | ||
match input_schema.field_with_name(&name) { | ||
Ok(f) => f.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I think this will get the field level metadata
However, I think there is also Schema
level metadata that still will not get copied over.
In order to copy that metadata, you can call Schema::new_with_metadata
below rather than Schema::new()
https://docs.rs/arrow/6.3.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata
Sounds good lemme make those updates. |
…ta_projection # Conflicts: # datafusion/src/physical_plan/projection.rs
@alamb I think whatever I did messing up the pin on the test dependencies caused the tests to fail. I don't know how I did it in the first place to be honest, would you know how to back it out? |
you can go into the testing folder, checkout the current commit tracked in master (https://github.com/apache/arrow-testing/tree/a8f7be380531758eb7962542a5eb020d8795aa20), then commit the submodule change so the CI can pick it up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good -- thanks @hntd187 !
Thanks again @hntd187 -- I reviewed some other areas of the code, and I found several places where a projection is done on a RecordBatch and Schemas other than the ProjectionExec. I filed apache/arrow-rs#1014 to move this logic into arrow-rs and I filed a follow on ticket in DataFusion to refactor projection uses to a consolidated implementation: #1425 |
Which issue does this PR close?
Closes #1361 .
Rationale for this change
What changes are included in this PR?
I believe this will carry over the metadata in columns by checking for the column in the schema first and if it doesn't exist using the previous logic of re-constructing the field.
Are there any user-facing changes?
Column metadata will persist in projections