Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16816: [C++] Upgrade Substrait to v0.6.0 #13468

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 3 additions & 10 deletions cpp/cmake_modules/ThirdpartyToolchain.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -1685,16 +1685,9 @@ endif()
macro(build_substrait)
message(STATUS "Building Substrait from source")

set(SUBSTRAIT_PROTOS
capabilities
expression
extensions/extensions
function
parameterized_types
plan
relations
type
type_expressions)
# Note: not all protos in Substrait actually matter to plan
# consumption. No need to build the ones we don't need.
set(SUBSTRAIT_PROTOS algebra extensions/extensions plan type)

externalproject_add(substrait_ep
CONFIGURE_COMMAND ""
Expand Down
21 changes: 16 additions & 5 deletions cpp/src/arrow/engine/substrait/expression_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -160,9 +160,18 @@ Result<compute::Expression> FromProto(const substrait::Expression& expr,
ARROW_ASSIGN_OR_RAISE(auto decoded_function,
ext_set.DecodeFunction(scalar_fn.function_reference()));

std::vector<compute::Expression> arguments(scalar_fn.args_size());
for (int i = 0; i < scalar_fn.args_size(); ++i) {
ARROW_ASSIGN_OR_RAISE(arguments[i], FromProto(scalar_fn.args(i), ext_set));
std::vector<compute::Expression> arguments(scalar_fn.arguments_size());
for (int i = 0; i < scalar_fn.arguments_size(); ++i) {
const auto& argument = scalar_fn.arguments(i);
switch (argument.arg_type_case()) {
case substrait::FunctionArgument::kValue: {
ARROW_ASSIGN_OR_RAISE(arguments[i], FromProto(argument.value(), ext_set));
break;
}
default:
return Status::NotImplemented(
"only value arguments are currently supported for functions");
Comment on lines +172 to +173
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: If possible, it might be nice to include the string value of what was specified? E.g. "Only value is supported but got "enum". But I don't remember if it is easy to go from arg_type_case to a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a thing, at least not automatically. I could list the options, but if Substrait adds something later it's not going to upgrade automatically. The same applies to plenty of other of switch statements in the code currently. Also, if a plan with a function is passed to Arrow that uses argument features that Arrow doesn't even understand, the bigger issue will be that Arrow would have no idea what the function means to begin with. So I guess I could add the function name to the message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no easy enum->string then let's not worry about it. I think this is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:ate to the party, but in case it helps, I see there is EnumDescriptor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good to know that it does exist. It still wouldn't be very helpful to users if a new option is added since the last time Arrow bumped Substrait, but I guess that's shifting the goalposts a little.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Arrow code could include in the message the value converted to a string if it is known given the Substrait version the code was built with, and otherwise include the number of the value and explain it is unknown and coming from a different version of Substrait. All this enum-handling is for a separate issue, of course.

}
}

auto func_name = decoded_function.name.to_string();
Expand Down Expand Up @@ -900,9 +909,11 @@ Result<std::unique_ptr<substrait::Expression>> ToProto(const compute::Expression

auto scalar_fn = internal::make_unique<substrait::Expression::ScalarFunction>();
scalar_fn->set_function_reference(anchor);
scalar_fn->mutable_args()->Reserve(static_cast<int>(arguments.size()));
scalar_fn->mutable_arguments()->Reserve(static_cast<int>(arguments.size()));
for (auto& arg : arguments) {
scalar_fn->mutable_args()->AddAllocated(arg.release());
auto argument = internal::make_unique<substrait::FunctionArgument>();
argument->set_allocated_value(arg.release());
scalar_fn->mutable_arguments()->AddAllocated(argument.release());
}

out->set_allocated_scalar_function(scalar_fn.release());
Expand Down
2 changes: 1 addition & 1 deletion cpp/src/arrow/engine/substrait/expression_internal.h
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@
#include "arrow/engine/substrait/visibility.h"
#include "arrow/type_fwd.h"

#include "substrait/expression.pb.h" // IWYU pragma: export
#include "substrait/algebra.pb.h" // IWYU pragma: export

namespace arrow {
namespace engine {
Expand Down
21 changes: 10 additions & 11 deletions cpp/src/arrow/engine/substrait/relation_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -106,17 +106,16 @@ Result<compute::Declaration> FromProto(const substrait::Rel& rel,
path = item.uri_path_glob();
}

if (item.format() ==
substrait::ReadRel::LocalFiles::FileOrFiles::FILE_FORMAT_PARQUET) {
format = std::make_shared<dataset::ParquetFileFormat>();
} else if (util::string_view{path}.ends_with(".arrow")) {
format = std::make_shared<dataset::IpcFileFormat>();
} else if (util::string_view{path}.ends_with(".feather")) {
format = std::make_shared<dataset::IpcFileFormat>();
} else {
return Status::NotImplemented(
"substrait::ReadRel::LocalFiles::FileOrFiles::format "
"other than FILE_FORMAT_PARQUET");
switch (item.file_format_case()) {
case substrait::ReadRel_LocalFiles_FileOrFiles::kParquet:
format = std::make_shared<dataset::ParquetFileFormat>();
break;
case substrait::ReadRel_LocalFiles_FileOrFiles::kArrow:
format = std::make_shared<dataset::IpcFileFormat>();
break;
default:
return Status::NotImplemented(
"unknown substrait::ReadRel::LocalFiles::FileOrFiles::file_format");
Comment on lines +109 to +118
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about feather format?

Copy link
Member

@westonpace westonpace Jul 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are ok. The feather format (v2) and the arrow IPC format are the same thing. Sometimes people use the extension .arrow and sometimes they use the extension .feather. However, in both cases they should be specifying kArrow here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That make sense.

}

if (!util::string_view{path}.starts_with("file:///")) {
Expand Down
2 changes: 1 addition & 1 deletion cpp/src/arrow/engine/substrait/relation_internal.h
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@
#include "arrow/engine/substrait/visibility.h"
#include "arrow/type_fwd.h"

#include "substrait/relations.pb.h" // IWYU pragma: export
#include "substrait/algebra.pb.h" // IWYU pragma: export

namespace arrow {
namespace engine {
Expand Down
118 changes: 66 additions & 52 deletions cpp/src/arrow/engine/substrait/serde_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -185,7 +185,8 @@ TEST(Substrait, SupportedExtensionTypes) {
ASSERT_OK_AND_ASSIGN(
auto buf,
internal::SubstraitFromJSON(
"Type", "{\"user_defined_type_reference\": " + std::to_string(anchor) + "}"));
"Type", "{\"user_defined\": { \"type_reference\": " + std::to_string(anchor) +
", \"nullability\": \"NULLABILITY_NULLABLE\" } }"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullability did we missed including this last time? Or are we planning to expose this too? My knowledge is not quite adequate here, appreciate some comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Substrait itself didn't consider user-defined types to conceptually have nullability, for no particular reason I can think of. See substrait-io/substrait#217

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Let’s go with what you have now.


ASSERT_OK_AND_ASSIGN(auto type, DeserializeType(*buf, ext_set));
EXPECT_EQ(*type, *expected_type);
Expand Down Expand Up @@ -260,8 +261,9 @@ TEST(Substrait, NamedStruct) {
}

TEST(Substrait, NoEquivalentArrowType) {
ASSERT_OK_AND_ASSIGN(auto buf, internal::SubstraitFromJSON(
"Type", R"({"user_defined_type_reference": 99})"));
ASSERT_OK_AND_ASSIGN(
auto buf,
internal::SubstraitFromJSON("Type", R"({"user_defined": {"type_reference": 99}})"));
ExtensionSet empty;
ASSERT_THAT(
DeserializeType(*buf, empty),
Expand Down Expand Up @@ -631,11 +633,11 @@ TEST(Substrait, ReadRel) {
"items": [
{
"uri_file": "file:///tmp/dat1.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
},
{
"uri_file": "file:///tmp/dat2.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand Down Expand Up @@ -764,7 +766,7 @@ Result<std::string> GetSubstraitJSON() {
"items": [
{
"uri_file": "file://FILENAME_PLACEHOLDER",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand Down Expand Up @@ -824,7 +826,7 @@ TEST(Substrait, JoinPlanBasic) {
"items": [
{
"uri_file": "file:///tmp/dat1.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand All @@ -848,7 +850,7 @@ TEST(Substrait, JoinPlanBasic) {
"items": [
{
"uri_file": "file:///tmp/dat2.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand All @@ -857,24 +859,28 @@ TEST(Substrait, JoinPlanBasic) {
"expression": {
"scalarFunction": {
"functionReference": 0,
"args": [{
"selection": {
"directReference": {
"structField": {
"field": 0
"arguments": [{
"value": {
"selection": {
"directReference": {
"structField": {
"field": 0
}
},
"rootReference": {
}
},
"rootReference": {
}
}
}, {
"selection": {
"directReference": {
"structField": {
"field": 5
"value": {
"selection": {
"directReference": {
"structField": {
"field": 5
}
},
"rootReference": {
}
},
"rootReference": {
}
}
}]
Expand Down Expand Up @@ -956,7 +962,7 @@ TEST(Substrait, JoinPlanInvalidKeyCmp) {
"items": [
{
"uri_file": "file:///tmp/dat1.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand All @@ -980,7 +986,7 @@ TEST(Substrait, JoinPlanInvalidKeyCmp) {
"items": [
{
"uri_file": "file:///tmp/dat2.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand All @@ -989,24 +995,28 @@ TEST(Substrait, JoinPlanInvalidKeyCmp) {
"expression": {
"scalarFunction": {
"functionReference": 0,
"args": [{
"selection": {
"directReference": {
"structField": {
"field": 0
"arguments": [{
"value": {
"selection": {
"directReference": {
"structField": {
"field": 0
}
},
"rootReference": {
}
},
"rootReference": {
}
}
}, {
"selection": {
"directReference": {
"structField": {
"field": 5
"value": {
"selection": {
"directReference": {
"structField": {
"field": 5
}
},
"rootReference": {
}
},
"rootReference": {
}
}
}]
Expand Down Expand Up @@ -1061,7 +1071,7 @@ TEST(Substrait, JoinPlanInvalidExpression) {
"items": [
{
"uri_file": "file:///tmp/dat1.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand All @@ -1085,7 +1095,7 @@ TEST(Substrait, JoinPlanInvalidExpression) {
"items": [
{
"uri_file": "file:///tmp/dat2.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand Down Expand Up @@ -1128,7 +1138,7 @@ TEST(Substrait, JoinPlanInvalidKeys) {
"items": [
{
"uri_file": "file:///tmp/dat1.parquet",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand All @@ -1137,24 +1147,28 @@ TEST(Substrait, JoinPlanInvalidKeys) {
"expression": {
"scalarFunction": {
"functionReference": 0,
"args": [{
"selection": {
"directReference": {
"structField": {
"field": 0
"arguments": [{
"value": {
"selection": {
"directReference": {
"structField": {
"field": 0
}
},
"rootReference": {
}
},
"rootReference": {
}
}
}, {
"selection": {
"directReference": {
"structField": {
"field": 5
"value": {
"selection": {
"directReference": {
"structField": {
"field": 5
}
},
"rootReference": {
}
},
"rootReference": {
}
}
}]
Expand Down
13 changes: 9 additions & 4 deletions cpp/src/arrow/engine/substrait/type_internal.cc
Original file line number Diff line number Diff line change
Expand Up @@ -196,10 +196,11 @@ Result<std::pair<std::shared_ptr<DataType>, bool>> FromProto(
field("value", std::move(value_nullable.first), value_nullable.second));
}

case ::substrait::Type::kUserDefinedTypeReference: {
uint32_t anchor = type.user_defined_type_reference();
case ::substrait::Type::kUserDefined: {
const auto& user_defined = type.user_defined();
uint32_t anchor = user_defined.type_reference();
ARROW_ASSIGN_OR_RAISE(auto type_record, ext_set.DecodeType(anchor));
return std::make_pair(std::move(type_record.type), true);
return std::make_pair(std::move(type_record.type), IsNullable(user_defined));
Comment on lines -202 to +203
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvanstraten
Before it was set to true always. Not sure why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @bkietz just set it to true in the initial PR because nullable types are the default in Arrow and he had nowhere to get the flag from.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

default:
Expand Down Expand Up @@ -389,7 +390,11 @@ struct DataTypeToProtoImpl {
template <typename T>
Status EncodeUserDefined(const T& t) {
ARROW_ASSIGN_OR_RAISE(auto anchor, ext_set_->EncodeType(t));
type_->set_user_defined_type_reference(anchor);
auto user_defined = internal::make_unique<::substrait::Type_UserDefined>();
user_defined->set_type_reference(anchor);
user_defined->set_nullability(nullable_ ? ::substrait::Type::NULLABILITY_NULLABLE
: ::substrait::Type::NULLABILITY_REQUIRED);
type_->set_allocated_user_defined(user_defined.release());
return Status::OK();
}

Expand Down
4 changes: 2 additions & 2 deletions cpp/thirdparty/versions.txt
Original file line number Diff line number Diff line change
Expand Up @@ -85,8 +85,8 @@ ARROW_SNAPPY_BUILD_SHA256_CHECKSUM=75c1fbb3d618dd3a0483bff0e26d0a92b495bbe5059c8
# There is a bug in GCC < 4.9 with Snappy 1.1.9, so revert to 1.1.8 for those (ARROW-14661)
ARROW_SNAPPY_OLD_BUILD_VERSION=1.1.8
ARROW_SNAPPY_OLD_BUILD_SHA256_CHECKSUM=16b677f07832a612b0836178db7f374e414f94657c138e6993cbfc5dcc58651f
ARROW_SUBSTRAIT_BUILD_VERSION=e1b4c04a
ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=65f83e5f5d979ede5fc8ac9f8bbaf793e0c72d9c415f1a162ba522f6d0bb5bbe
ARROW_SUBSTRAIT_BUILD_VERSION=v0.6.0
ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=7b8583b9684477e9027f417bbfb4febb8acfeb01923dcaa7cf0fd3f921d69c88
ARROW_THRIFT_BUILD_VERSION=0.16.0
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=f460b5c1ca30d8918ff95ea3eb6291b3951cf518553566088f3f2be8981f6209
ARROW_UCX_BUILD_VERSION=1.12.1
Expand Down
3 changes: 2 additions & 1 deletion python/pyarrow/tests/test_substrait.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,8 @@ def test_run_serialized_query(tmpdir):
"local_files": {
"items": [
{
"uri_file": "file://FILENAME_PLACEHOLDER"
"uri_file": "file://FILENAME_PLACEHOLDER",
"arrow": {}
}
]
}
Expand Down
2 changes: 1 addition & 1 deletion r/tests/testthat/test-query-engine.R
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ test_that("do_exec_plan_substrait can evaluate a simple plan", {
"items": [
{
"uri_file": "file://%s",
"format": "FILE_FORMAT_PARQUET"
"parquet": {}
}
]
}
Expand Down