-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16816: [C++] Upgrade Substrait to v0.6.0 #13468
ARROW-16816: [C++] Upgrade Substrait to v0.6.0 #13468
Conversation
|
Could you update |
@vibhatha do you mind taking a look at this PR? |
Of course 👍 |
switch (item.file_format_case()) { | ||
case substrait::ReadRel_LocalFiles_FileOrFiles::kParquet: | ||
format = std::make_shared<dataset::ParquetFileFormat>(); | ||
break; | ||
case substrait::ReadRel_LocalFiles_FileOrFiles::kArrow: | ||
format = std::make_shared<dataset::IpcFileFormat>(); | ||
break; | ||
default: | ||
return Status::NotImplemented( | ||
"unknown substrait::ReadRel::LocalFiles::FileOrFiles::file_format"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about feather
format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are ok. The feather format (v2) and the arrow IPC format are the same thing. Sometimes people use the extension .arrow
and sometimes they use the extension .feather
. However, in both cases they should be specifying kArrow
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That make sense.
@@ -185,7 +185,8 @@ TEST(Substrait, SupportedExtensionTypes) { | |||
ASSERT_OK_AND_ASSIGN( | |||
auto buf, | |||
internal::SubstraitFromJSON( | |||
"Type", "{\"user_defined_type_reference\": " + std::to_string(anchor) + "}")); | |||
"Type", "{\"user_defined\": { \"type_reference\": " + std::to_string(anchor) + | |||
", \"nullability\": \"NULLABILITY_NULLABLE\" } }")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullability
did we missed including this last time? Or are we planning to expose this too? My knowledge is not quite adequate here, appreciate some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, Substrait itself didn't consider user-defined types to conceptually have nullability, for no particular reason I can think of. See substrait-io/substrait#217
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Let’s go with what you have now.
return std::make_pair(std::move(type_record.type), true); | ||
return std::make_pair(std::move(type_record.type), IsNullable(user_defined)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jvanstraten
Before it was set to true always. Not sure why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @bkietz just set it to true in the initial PR because nullable types are the default in Arrow and he had nowhere to get the flag from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@jvanstraten thank you for working on this one. I added a few comments. @westonpace Looks good to me. |
@jvanstraten some |
I noticed, but I have seen exactly 0 lines of R code in my life before, nor do I have a toolchain set up for it. If the R bindings are using Substrait already and this isn't some weird CI thing, I'm afraid I'll need considerable help from someone in that subcommunity. Do you know anyone I can ping for that? How does the Arrow community generally deal with upgrading dependencies past breaking changes that require interdisciplinary work? I don't think these will be the last breaking changes we'll see from Substrait in the foreseeable future, and considering the fact that Arrow is one of the leading projects that's requesting changes to Substrait it seems prudent to upgrade regularly. |
@jvanstraten I will take a look at the R component and see what’s exactly going on. Later on I will tag for support. |
I spent the time to get R up and running on my machine and (depending on what CI will say) have fixed it already. The problem was just that the plan in the one R test case had to be upgraded to use the new file format specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks for creating this. We had been falling quite far behind.
return Status::NotImplemented( | ||
"only value arguments are currently supported for functions"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: If possible, it might be nice to include the string value of what was specified? E.g. "Only value
is supported but got "enum". But I don't remember if it is easy to go from arg_type_case
to a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's a thing, at least not automatically. I could list the options, but if Substrait adds something later it's not going to upgrade automatically. The same applies to plenty of other of switch statements in the code currently. Also, if a plan with a function is passed to Arrow that uses argument features that Arrow doesn't even understand, the bigger issue will be that Arrow would have no idea what the function means to begin with. So I guess I could add the function name to the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no easy enum->string then let's not worry about it. I think this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:ate to the party, but in case it helps, I see there is EnumDescriptor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good to know that it does exist. It still wouldn't be very helpful to users if a new option is added since the last time Arrow bumped Substrait, but I guess that's shifting the goalposts a little.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Arrow code could include in the message the value converted to a string if it is known given the Substrait version the code was built with, and otherwise include the number of the value and explain it is unknown and coming from a different version of Substrait. All this enum-handling is for a separate issue, of course.
@sanjibansg This PR will likely have impacts on function mapping work. @rtpsw You may be interested in these changes as well. I'm not sure how you've been creating plans so far but this change from the old style of specifying arguments to the new style is likely to have impacts. |
Great! Thanks for letting me know, @westonpace. |
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of. Stuff that broke: - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217 Authored-by: Jeroen van Straten <[email protected]> Signed-off-by: Weston Pace <[email protected]>
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of.
Stuff that broke:
relations.proto
andexpressions.proto
were merged intoalgebra.proto
in refactor: combine relations and expressions into algebra.proto substrait-io/substrait#136user_defined_type_reference
inType
, replacing it with a message that also supports nullability: feat: support nullable and non-default variation user-defined types substrait-io/substrait#217