Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata is kept in projections for non-derived columns #1378

Merged
merged 6 commits into from
Dec 9, 2021

Conversation

hntd187
Copy link
Contributor

@hntd187 hntd187 commented Nov 28, 2021

Which issue does this PR close?

Closes #1361 .

Rationale for this change

What changes are included in this PR?

I believe this will carry over the metadata in columns by checking for the column in the schema first and if it doesn't exist using the previous logic of re-constructing the field.

Are there any user-facing changes?

Column metadata will persist in projections

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Nov 28, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @hntd187 ❤️

  1. I think we also need to handle Schema level metadata and I left a comment to that effect below.

  2. For some reason your PR appears to change the submodule testing pin.

I also did some checking around, and it seems like "project a Schema" and "project a DFSchema" are quite common operations (e.g. here) so it might make sense to pull them out into a separate function -- I can do this as a follow on PR too, but wanted to mention it.

e.nullable(&input_schema)?,
))
match input_schema.field_with_name(&name) {
Ok(f) => f.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think this will get the field level metadata

However, I think there is also Schema level metadata that still will not get copied over.

In order to copy that metadata, you can call Schema::new_with_metadata below rather than Schema::new()

https://docs.rs/arrow/6.3.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata

@hntd187
Copy link
Contributor Author

hntd187 commented Nov 29, 2021

Sounds good lemme make those updates.

@houqp houqp added the bug Something isn't working label Nov 29, 2021
@hntd187
Copy link
Contributor Author

hntd187 commented Dec 8, 2021

@alamb I think whatever I did messing up the pin on the test dependencies caused the tests to fail. I don't know how I did it in the first place to be honest, would you know how to back it out?

@houqp
Copy link
Member

houqp commented Dec 9, 2021

you can go into the testing folder, checkout the current commit tracked in master (https://github.com/apache/arrow-testing/tree/a8f7be380531758eb7962542a5eb020d8795aa20), then commit the submodule change so the CI can pick it up.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- thanks @hntd187 !

@alamb
Copy link
Contributor

alamb commented Dec 9, 2021

Thanks again @hntd187 -- I reviewed some other areas of the code, and I found several places where a projection is done on a RecordBatch and Schemas other than the ProjectionExec. I filed apache/arrow-rs#1014 to move this logic into arrow-rs and I filed a follow on ticket in DataFusion to refactor projection uses to a consolidated implementation: #1425

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ProjectionExec Loses Field Metadata
3 participants