-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support consuming Substrait with compound signature function names #10653
Support consuming Substrait with compound signature function names #10653
Conversation
@alamb curious what you think of fixing just the consumer side first, without touching the producer - if that'd be okay, then I can add some unit tests to this PR? |
Substrait 0.32.0+ requires functions to be specified using compound names, which include the function name as well as the arguments it takes. We don't necessarily need that information while consuming the plans, but we need to support those compound names.
f7f69c5
to
f39c63b
Compare
@alamb Sounds good! here's an added test, though curious if you have any thoughts on how to make it simpler - given DF doesn't yet produce those compound names, I cannot use a roundtrip test, and writing the substrait manually feels both annoying and error-prone.. Using base64 encoded binary proto would be a bit easier but then there's some recent bad experiences with people uploading malicious stuff as "test binaries" so I didn't want to do that, but happy to if you'd find it better 😅 |
I don't have any better ideas unfortunately, until DataFusion has roundtrip support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -37,11 +37,13 @@ chrono = { workspace = true } | |||
datafusion = { workspace = true, default-features = true } | |||
itertools = { workspace = true } | |||
object_store = { workspace = true } | |||
pbjson-types = "0.6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be a actual dependency or can it be a dev dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answered in more depth here, but yes I think it does need to be actual dep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said we can remove the prost-types dep, we don't need both: e6ff946
@@ -42,7 +42,7 @@ use datafusion::logical_expr::expr::{ | |||
}; | |||
use datafusion::logical_expr::{expr, Between, JoinConstraint, LogicalPlan, Operator}; | |||
use datafusion::prelude::Expr; | |||
use prost_types::Any as ProtoAny; | |||
use pbjson_types::Any as ProtoAny; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WHy do we need to change to pbjson_types
here? It seems like the json support is only for testing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this confuses me a bit as well - but the way I understand it, adding the "serde" feature to "substrait-rs" seems to change the types it expects from the original prost
types into the pbjson
types (https://github.com/substrait-io/substrait-rs/blob/d11b22c743df9b02b7e5f4c079091ed133982431/build.rs#L262), probably to make them all serde-serializable (https://docs.rs/pbjson-types/latest/pbjson_types/#).
Given we use the Any
proto type in non-test code, I think this does need to change, even though we don't necessarily need the serializability here.. This is also why we need the pbjson-types dep as normal, not dev, dep.
I'd also prefer to not change this, but I couldn't find another way if we want to have the JSON proto plans :/
// under the License. | ||
|
||
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general the idea of having checked in substrait plans (as json protobuf) is actually likely a pretty good way to build up a library of test cases "from the wild" (not round tripped)
This is quite cool @Blizzara
Co-authored-by: Andrew Lamb <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Blizzara
As a big scenario for our substrait support is to inter-cooperation with other systems, it might be a good idea to have some "integration" test, which uses another project like isthmus to produce plan and then consume them in datafusion side. But this is not trivial to achieve, just an idea for now. |
Yeah, I have that kind of integration testing internally, kind of, as part of our use of datafusion-substrait (see #10412 (comment)). I'll see if I can at some point contribute something like it here as well :) |
I've skimmed this patch and it looks good to me 👍 I'll take a detailed look into the test part tomorrow. |
FYI...The substrait project also has a repo that aims to do integration testing between different substrait producers/consumers. https://github.com/substrait-io/consumer-testing |
|
Thanks again everyone! |
…pache#10653) * Support consuming Substrait with compound signature function names Substrait 0.32.0+ requires functions to be specified using compound names, which include the function name as well as the arguments it takes. We don't necessarily need that information while consuming the plans, but we need to support those compound names. * Add a test for using "not:bool" * clippy fixes * Add license to new file * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * remove prost-types dep as it's replaced by pbjson-types --------- Co-authored-by: Andrew Lamb <[email protected]>
Substrait 0.32.0+ requires functions to be specified using compound names, which include the function name as well as the arguments it takes. We don't necessarily need that information while consuming the plans, but we need to support those compound names.
Which issue does this PR close?
Closes #10412, but doesn't fix the part where DataFusion generates those outdated simple names. But that's a bit bigger fix - this small thing would be enough to unblock me if it'd be okay to merge this before fixing the producer side.
Rationale for this change
See issue.
What changes are included in this PR?
When consuming Substrait, checks if given function names are compound names, if so strips the arguments part.
Are these changes tested?
Tested manually, but will still need to add unit tests.
Are there any user-facing changes?