Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metadata to DFSchema, close #1806. #1914

Merged
merged 6 commits into from
Mar 10, 2022
Merged

Conversation

jiacai2050
Copy link
Contributor

Which issue does this PR close?

Closes #1806.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

No

@github-actions github-actions bot added datafusion Changes in the datafusion crate sql SQL Planner labels Mar 3, 2022
datafusion-common/src/dfschema.rs Show resolved Hide resolved
}

#[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
/// Create a new `DFSchema`
pub fn new(fields: Vec<DFField>) -> Result<Self> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can directly delete the API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a breaking change if we delete it since it's a public API.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @jiacai2050 and your review @xudong963

I had a suggestion to keep the APIs backwards compatible. What do you think?

}

/// Create a new `DFSchema`
pub fn new_with_metadata(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about a more "builder-like" API here that perhaps could remain backwards compatible.

Something like:

/// Sets the metadata on this DFSchema to `metadata`
pub fn with_metadata(mut self, metadata: HashMap<String, String>) -> Self {
...
}

And then instead of code like

        Schema::new_with_metadata(
            df_schema.fields.iter().map(|f| f.field.clone()).collect(),
            df_schema.metadata.clone(),
        )

It would look like

        Schema::new(
            df_schema.fields.iter().map(|f| f.field.clone()).collect(),
        )
        .with_metadata(df_schema.metadata.clone()),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Builder-style API is good itself, but is not suitable here I think.

As you can see, schema conversion between Array and datafusion is used in many places, the compiler will output an warning when new is marked as deprecated, but when adopt builder-style API, no warning will be made, and it would be very easy to lost metadata when doing nested conversion.

Copy link
Member

@xudong963 xudong963 Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, metadata in DFSchema is a new feature in the next release, if users want to use it, they'll check the API doc. So we don't need to depend on the compiler's warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see in https://github.com/apache/arrow-datafusion/pull/1914/files#diff-c1ef69547042f0c07aa616c9d5d58cbe2a3c5720f7237c948c99012d6cc0024a

Those optimizers will rewrite plan, if any optimizer forget to attach metadata to newly-created plan, then metadata is lost, it's very easy to miss that without compiler's help.

@alamb
Copy link
Contributor

alamb commented Mar 7, 2022

the compiler will output an warning when new is marked as deprecated, but when adopt builder-style API,

The compiler warning as a hint to developers not to forget the schema is a good point. 👍

@alamb
Copy link
Contributor

alamb commented Mar 7, 2022

@jiacai2050 this PR seems to have build failures -- can you update it and I'll review it again?

@jiacai2050
Copy link
Contributor Author

@alamb It seems CI need your approve to run.

@alamb
Copy link
Contributor

alamb commented Mar 8, 2022

🤔 looks like somehow this PR now picked up a merge conflict

I have started the CI run

@jiacai2050
Copy link
Contributor Author

@alamb I have rebase master, please rerun the CI.

@alamb
Copy link
Contributor

alamb commented Mar 10, 2022

@jiacai2050 sadly another clippy issue has cropped up

I think it is related to serializing the schema in ballista (which actually doesn't preserve the metadata)

Is suggest a fix like the following, and file a ticket for the fact that ballista serialization doesn't preserve metadata.

I can file if it you would like

diff --git a/datafusion-proto/src/from_proto.rs b/datafusion-proto/src/from_proto.rs
index d589ef388..2bf65d4db 100644
--- a/datafusion-proto/src/from_proto.rs
+++ b/datafusion-proto/src/from_proto.rs
@@ -155,6 +155,7 @@ impl TryFrom<&protobuf::DfSchema> for DFSchema {
             .iter()
             .map(|c| c.try_into())
             .collect::<Result<Vec<DFField>, _>>()?;
+        #[allow(deprecated)]
         Ok(DFSchema::new(fields)?)
     }
 }

@jiacai2050
Copy link
Contributor Author

@alamb I have fixed the issue, you can link with this PR after you file a new issue.

@alamb
Copy link
Contributor

alamb commented Mar 10, 2022

Add #1977 and will add link in #1978

@alamb alamb merged commit 43b7d49 into apache:master Mar 10, 2022
@alamb
Copy link
Contributor

alamb commented Mar 10, 2022

Thanks again @jiacai2050 for sticking with this and @xudong963 for the review

@jiacai2050
Copy link
Contributor Author

@alamb @xudong963 Thanks for your reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metadata of Arrow Schema is lost when converted to DFSchema
3 participants