-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add metadata to DFSchema, close #1806. #1914
Conversation
} | ||
|
||
#[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")] | ||
/// Create a new `DFSchema` | ||
pub fn new(fields: Vec<DFField>) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can directly delete the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a breaking change if we delete it since it's a public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @jiacai2050 and your review @xudong963
I had a suggestion to keep the APIs backwards compatible. What do you think?
} | ||
|
||
/// Create a new `DFSchema` | ||
pub fn new_with_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you think about a more "builder-like" API here that perhaps could remain backwards compatible.
Something like:
/// Sets the metadata on this DFSchema to `metadata`
pub fn with_metadata(mut self, metadata: HashMap<String, String>) -> Self {
...
}
And then instead of code like
Schema::new_with_metadata(
df_schema.fields.iter().map(|f| f.field.clone()).collect(),
df_schema.metadata.clone(),
)
It would look like
Schema::new(
df_schema.fields.iter().map(|f| f.field.clone()).collect(),
)
.with_metadata(df_schema.metadata.clone()),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Builder-style API is good itself, but is not suitable here I think.
As you can see, schema conversion between Array and datafusion is used in many places, the compiler will output an warning when new
is marked as deprecated, but when adopt builder-style API, no warning will be made, and it would be very easy to lost metadata when doing nested conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, metadata
in DFSchema
is a new feature in the next release, if users want to use it, they'll check the API doc. So we don't need to depend on the compiler's warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see in https://github.com/apache/arrow-datafusion/pull/1914/files#diff-c1ef69547042f0c07aa616c9d5d58cbe2a3c5720f7237c948c99012d6cc0024a
Those optimizers will rewrite plan, if any optimizer forget to attach metadata to newly-created plan, then metadata is lost, it's very easy to miss that without compiler's help.
The compiler warning as a hint to developers not to forget the schema is a good point. 👍 |
@jiacai2050 this PR seems to have build failures -- can you update it and I'll review it again? |
@alamb It seems CI need your approve to run. |
🤔 looks like somehow this PR now picked up a merge conflict I have started the CI run |
@alamb I have rebase master, please rerun the CI. |
@jiacai2050 sadly another clippy issue has cropped up I think it is related to serializing the schema in ballista (which actually doesn't preserve the metadata) Is suggest a fix like the following, and file a ticket for the fact that ballista serialization doesn't preserve metadata. I can file if it you would like diff --git a/datafusion-proto/src/from_proto.rs b/datafusion-proto/src/from_proto.rs
index d589ef388..2bf65d4db 100644
--- a/datafusion-proto/src/from_proto.rs
+++ b/datafusion-proto/src/from_proto.rs
@@ -155,6 +155,7 @@ impl TryFrom<&protobuf::DfSchema> for DFSchema {
.iter()
.map(|c| c.try_into())
.collect::<Result<Vec<DFField>, _>>()?;
+ #[allow(deprecated)]
Ok(DFSchema::new(fields)?)
}
} |
@alamb I have fixed the issue, you can link with this PR after you file a new issue. |
Thanks again @jiacai2050 for sticking with this and @xudong963 for the review |
@alamb @xudong963 Thanks for your reviews. |
Which issue does this PR close?
Closes #1806.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?
No