Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16816: [C++] Upgrade Substrait to v0.6.0 #13468

Merged

Conversation

jvanstraten
Copy link
Contributor

Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of.

Stuff that broke:

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has no components in JIRA, make sure you assign one.

@kou
Copy link
Member

kou commented Jun 29, 2022

Could you update v0.3.0 in pull request and JIRA issue to v0.6.0?

@jvanstraten jvanstraten changed the title ARROW-16816: Upgrade Substrait to v0.3.0 ARROW-16816: Upgrade Substrait to v0.6.0 Jun 30, 2022
@westonpace
Copy link
Member

@vibhatha do you mind taking a look at this PR?

@westonpace westonpace self-requested a review June 30, 2022 19:30
@kou kou changed the title ARROW-16816: Upgrade Substrait to v0.6.0 ARROW-16816: [C++] Upgrade Substrait to v0.6.0 Jul 1, 2022
@vibhatha
Copy link
Collaborator

vibhatha commented Jul 2, 2022

@vibhatha do you mind taking a look at this PR?

Of course 👍

Comment on lines +109 to +118
switch (item.file_format_case()) {
case substrait::ReadRel_LocalFiles_FileOrFiles::kParquet:
format = std::make_shared<dataset::ParquetFileFormat>();
break;
case substrait::ReadRel_LocalFiles_FileOrFiles::kArrow:
format = std::make_shared<dataset::IpcFileFormat>();
break;
default:
return Status::NotImplemented(
"unknown substrait::ReadRel::LocalFiles::FileOrFiles::file_format");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about feather format?

Copy link
Member

@westonpace westonpace Jul 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are ok. The feather format (v2) and the arrow IPC format are the same thing. Sometimes people use the extension .arrow and sometimes they use the extension .feather. However, in both cases they should be specifying kArrow here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That make sense.

@@ -185,7 +185,8 @@ TEST(Substrait, SupportedExtensionTypes) {
ASSERT_OK_AND_ASSIGN(
auto buf,
internal::SubstraitFromJSON(
"Type", "{\"user_defined_type_reference\": " + std::to_string(anchor) + "}"));
"Type", "{\"user_defined\": { \"type_reference\": " + std::to_string(anchor) +
", \"nullability\": \"NULLABILITY_NULLABLE\" } }"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullability did we missed including this last time? Or are we planning to expose this too? My knowledge is not quite adequate here, appreciate some comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Substrait itself didn't consider user-defined types to conceptually have nullability, for no particular reason I can think of. See substrait-io/substrait#217

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Let’s go with what you have now.

Comment on lines -202 to +203
return std::make_pair(std::move(type_record.type), true);
return std::make_pair(std::move(type_record.type), IsNullable(user_defined));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvanstraten
Before it was set to true always. Not sure why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @bkietz just set it to true in the initial PR because nullable types are the default in Arrow and he had nowhere to get the flag from.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@vibhatha
Copy link
Collaborator

vibhatha commented Jul 2, 2022

@jvanstraten thank you for working on this one. I added a few comments.

@westonpace Looks good to me.

@vibhatha
Copy link
Collaborator

vibhatha commented Jul 2, 2022

@jvanstraten some R test cases are failing. May be it need some work there.

@jvanstraten
Copy link
Contributor Author

I noticed, but I have seen exactly 0 lines of R code in my life before, nor do I have a toolchain set up for it. If the R bindings are using Substrait already and this isn't some weird CI thing, I'm afraid I'll need considerable help from someone in that subcommunity. Do you know anyone I can ping for that?

How does the Arrow community generally deal with upgrading dependencies past breaking changes that require interdisciplinary work? I don't think these will be the last breaking changes we'll see from Substrait in the foreseeable future, and considering the fact that Arrow is one of the leading projects that's requesting changes to Substrait it seems prudent to upgrade regularly.

@vibhatha
Copy link
Collaborator

vibhatha commented Jul 2, 2022

@jvanstraten I will take a look at the R component and see what’s exactly going on. Later on I will tag for support.

@jvanstraten
Copy link
Contributor Author

I spent the time to get R up and running on my machine and (depending on what CI will say) have fixed it already. The problem was just that the plan in the one R test case had to be upgraded to use the new file format specification.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks for creating this. We had been falling quite far behind.

Comment on lines +172 to +173
return Status::NotImplemented(
"only value arguments are currently supported for functions");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: If possible, it might be nice to include the string value of what was specified? E.g. "Only value is supported but got "enum". But I don't remember if it is easy to go from arg_type_case to a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a thing, at least not automatically. I could list the options, but if Substrait adds something later it's not going to upgrade automatically. The same applies to plenty of other of switch statements in the code currently. Also, if a plan with a function is passed to Arrow that uses argument features that Arrow doesn't even understand, the bigger issue will be that Arrow would have no idea what the function means to begin with. So I guess I could add the function name to the message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no easy enum->string then let's not worry about it. I think this is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:ate to the party, but in case it helps, I see there is EnumDescriptor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good to know that it does exist. It still wouldn't be very helpful to users if a new option is added since the last time Arrow bumped Substrait, but I guess that's shifting the goalposts a little.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Arrow code could include in the message the value converted to a string if it is known given the Substrait version the code was built with, and otherwise include the number of the value and explain it is unknown and coming from a different version of Substrait. All this enum-handling is for a separate issue, of course.

@westonpace
Copy link
Member

@sanjibansg This PR will likely have impacts on function mapping work.

@rtpsw You may be interested in these changes as well. I'm not sure how you've been creating plans so far but this change from the old style of specifying arguments to the new style is likely to have impacts.

@westonpace westonpace merged commit ad78a31 into apache:master Jul 5, 2022
@jvanstraten jvanstraten deleted the ARROW-16816-Upgrade-Substrait-to-v0-3-0 branch July 5, 2022 16:44
@rtpsw
Copy link
Contributor

rtpsw commented Jul 5, 2022

Great! Thanks for letting me know, @westonpace.

drin pushed a commit to drin/arrow that referenced this pull request Jul 5, 2022
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of.

Stuff that broke:
 - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136
 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169
 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161
 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217

Authored-by: Jeroen van Straten <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants