ARROW-16816: [C++] Upgrade Substrait to v0.6.0 #13468
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -106,17 +106,16 @@ Result<compute::Declaration> FromProto(const substrait::Rel& rel,
           path = item.uri_path_glob();
         }

-        if (item.format() ==
-            substrait::ReadRel::LocalFiles::FileOrFiles::FILE_FORMAT_PARQUET) {
-          format = std::make_shared<dataset::ParquetFileFormat>();
-        } else if (util::string_view{path}.ends_with(".arrow")) {
-          format = std::make_shared<dataset::IpcFileFormat>();
-        } else if (util::string_view{path}.ends_with(".feather")) {
-          format = std::make_shared<dataset::IpcFileFormat>();
-        } else {
-          return Status::NotImplemented(
-              "substrait::ReadRel::LocalFiles::FileOrFiles::format "
-              "other than FILE_FORMAT_PARQUET");
+        switch (item.file_format_case()) {
+          case substrait::ReadRel_LocalFiles_FileOrFiles::kParquet:
+            format = std::make_shared<dataset::ParquetFileFormat>();
+            break;
+          case substrait::ReadRel_LocalFiles_FileOrFiles::kArrow:
+            format = std::make_shared<dataset::IpcFileFormat>();
+            break;
+          default:
+            return Status::NotImplemented(
+                "unknown substrait::ReadRel::LocalFiles::FileOrFiles::file_format");
Comment on lines +109 to +118
What about
I think we are ok. The Feather format (v2) and the Arrow IPC format are the same thing. Sometimes people use the extension
That makes sense.
         }

         if (!util::string_view{path}.starts_with("file:///")) {
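The switch in this hunk dispatches on which member of the `file_format` oneof is set, instead of comparing an enum field or inspecting the file extension. A minimal, self-contained sketch of that pattern follows. `FileFormatCase`, `kOther`, and `FormatNameFor` are hypothetical stand-ins invented for illustration, not the generated Substrait or Arrow API, and the returned strings only label which branch was taken.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the generated oneof-case enum. kOther represents
// any oneof member this sketch does not handle.
enum class FileFormatCase { kNotSet, kParquet, kArrow, kOther };

// Returns a label for the dataset format to use, or std::nullopt when the
// case is unhandled (the PR returns Status::NotImplemented at that point).
std::optional<std::string> FormatNameFor(FileFormatCase format_case) {
  switch (format_case) {
    case FileFormatCase::kParquet:
      return "parquet";
    case FileFormatCase::kArrow:
      // Feather v2 and Arrow IPC are the same format, so one branch suffices.
      return "ipc";
    default:
      return std::nullopt;
  }
}
```

Because the format is stated explicitly in the plan, a file with an unusual extension would still resolve to the intended reader in this sketch.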
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -185,7 +185,8 @@ TEST(Substrait, SupportedExtensionTypes) { | |
ASSERT_OK_AND_ASSIGN( | ||
auto buf, | ||
internal::SubstraitFromJSON( | ||
"Type", "{\"user_defined_type_reference\": " + std::to_string(anchor) + "}")); | ||
"Type", "{\"user_defined\": { \"type_reference\": " + std::to_string(anchor) + | ||
", \"nullability\": \"NULLABILITY_NULLABLE\" } }")); | ||
No, Substrait itself didn't consider user-defined types to conceptually have nullability, for no particular reason I can think of. See substrait-io/substrait#217
I see. Let’s go with what you have now.
||
|
||
ASSERT_OK_AND_ASSIGN(auto type, DeserializeType(*buf, ext_set)); | ||
EXPECT_EQ(*type, *expected_type); | ||
|
@@ -260,8 +261,9 @@ TEST(Substrait, NamedStruct) { | |
} | ||
|
||
TEST(Substrait, NoEquivalentArrowType) { | ||
ASSERT_OK_AND_ASSIGN(auto buf, internal::SubstraitFromJSON( | ||
"Type", R"({"user_defined_type_reference": 99})")); | ||
ASSERT_OK_AND_ASSIGN( | ||
auto buf, | ||
internal::SubstraitFromJSON("Type", R"({"user_defined": {"type_reference": 99}})")); | ||
ExtensionSet empty; | ||
ASSERT_THAT( | ||
DeserializeType(*buf, empty), | ||
|
@@ -631,11 +633,11 @@ TEST(Substrait, ReadRel) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat1.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
}, | ||
{ | ||
"uri_file": "file:///tmp/dat2.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -764,7 +766,7 @@ Result<std::string> GetSubstraitJSON() { | |
"items": [ | ||
{ | ||
"uri_file": "file://FILENAME_PLACEHOLDER", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -824,7 +826,7 @@ TEST(Substrait, JoinPlanBasic) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat1.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -848,7 +850,7 @@ TEST(Substrait, JoinPlanBasic) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat2.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -857,24 +859,28 @@ TEST(Substrait, JoinPlanBasic) { | |
"expression": { | ||
"scalarFunction": { | ||
"functionReference": 0, | ||
"args": [{ | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 0 | ||
"arguments": [{ | ||
"value": { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 0 | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
} | ||
}, { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 5 | ||
"value": { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 5 | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
} | ||
}] | ||
|
@@ -956,7 +962,7 @@ TEST(Substrait, JoinPlanInvalidKeyCmp) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat1.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -980,7 +986,7 @@ TEST(Substrait, JoinPlanInvalidKeyCmp) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat2.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -989,24 +995,28 @@ TEST(Substrait, JoinPlanInvalidKeyCmp) { | |
"expression": { | ||
"scalarFunction": { | ||
"functionReference": 0, | ||
"args": [{ | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 0 | ||
"arguments": [{ | ||
"value": { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 0 | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
} | ||
}, { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 5 | ||
"value": { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 5 | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
} | ||
}] | ||
|
@@ -1061,7 +1071,7 @@ TEST(Substrait, JoinPlanInvalidExpression) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat1.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -1085,7 +1095,7 @@ TEST(Substrait, JoinPlanInvalidExpression) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat2.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -1128,7 +1138,7 @@ TEST(Substrait, JoinPlanInvalidKeys) { | |
"items": [ | ||
{ | ||
"uri_file": "file:///tmp/dat1.parquet", | ||
"format": "FILE_FORMAT_PARQUET" | ||
"parquet": {} | ||
} | ||
] | ||
} | ||
|
@@ -1137,24 +1147,28 @@ TEST(Substrait, JoinPlanInvalidKeys) { | |
"expression": { | ||
"scalarFunction": { | ||
"functionReference": 0, | ||
"args": [{ | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 0 | ||
"arguments": [{ | ||
"value": { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 0 | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
} | ||
}, { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 5 | ||
"value": { | ||
"selection": { | ||
"directReference": { | ||
"structField": { | ||
"field": 5 | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
}, | ||
"rootReference": { | ||
} | ||
} | ||
}] | ||
|
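The JSON hunks in this file replace `args` with `arguments` and wrap each expression in a `value` object. As a rough sketch of the new shape, the helpers below build such an argument list for direct field references. `FieldRefArgument` and `ArgumentsJson` are hypothetical names introduced here, and the string concatenation is only for illustration; real code would more likely go through a JSON or protobuf library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: one function argument in the newer layout, i.e. a
// direct field reference wrapped in a "value" object.
std::string FieldRefArgument(int field_index) {
  return R"({"value": {"selection": {"directReference": {"structField": {"field": )" +
         std::to_string(field_index) + R"(}}, "rootReference": {}}}})";
}

// Hypothetical helper: joins the wrapped arguments under the "arguments" key,
// which takes the place of the older "args" key.
std::string ArgumentsJson(const std::vector<int>& field_indices) {
  std::string out = R"("arguments": [)";
  for (std::size_t i = 0; i < field_indices.size(); ++i) {
    if (i > 0) out += ", ";
    out += FieldRefArgument(field_indices[i]);
  }
  out += "]";
  return out;
}
```

Calling `ArgumentsJson({0, 5})` would produce the two-argument list used by the join tests, modulo whitespace.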
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -196,10 +196,11 @@ Result<std::pair<std::shared_ptr<DataType>, bool>> FromProto(
           field("value", std::move(value_nullable.first), value_nullable.second));
     }

-    case ::substrait::Type::kUserDefinedTypeReference: {
-      uint32_t anchor = type.user_defined_type_reference();
+    case ::substrait::Type::kUserDefined: {
+      const auto& user_defined = type.user_defined();
+      uint32_t anchor = user_defined.type_reference();
       ARROW_ASSIGN_OR_RAISE(auto type_record, ext_set.DecodeType(anchor));
-      return std::make_pair(std::move(type_record.type), true);
+      return std::make_pair(std::move(type_record.type), IsNullable(user_defined));
Comment on lines -202 to +203
@jvanstraten
I think @bkietz just set it to true in the initial PR because nullable types are the default in Arrow and he had nowhere to get the flag from.
👍
     }

     default:
@@ -389,7 +390,11 @@ struct DataTypeToProtoImpl {
   template <typename T>
   Status EncodeUserDefined(const T& t) {
     ARROW_ASSIGN_OR_RAISE(auto anchor, ext_set_->EncodeType(t));
-    type_->set_user_defined_type_reference(anchor);
+    auto user_defined = internal::make_unique<::substrait::Type_UserDefined>();
+    user_defined->set_type_reference(anchor);
+    user_defined->set_nullability(nullable_ ? ::substrait::Type::NULLABILITY_NULLABLE
+                                            : ::substrait::Type::NULLABILITY_REQUIRED);
+    type_->set_allocated_user_defined(user_defined.release());
     return Status::OK();
   }

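With this change a user-defined type carries its own nullability, so the flag can round-trip instead of being hard-coded to `true` on decode. The sketch below illustrates that round trip with plain structs. `Nullability`, `UserDefined`, `ToNullability`, and this `EncodeUserDefined` are hypothetical stand-ins rather than the generated Substrait classes, and treating an unspecified nullability as nullable is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the Substrait nullability enum and the
// user-defined type message.
enum class Nullability { kUnspecified, kNullable, kRequired };

struct UserDefined {
  std::uint32_t type_reference = 0;
  Nullability nullability = Nullability::kUnspecified;
};

// Encode side: map the Arrow-style boolean onto the enum.
Nullability ToNullability(bool nullable) {
  return nullable ? Nullability::kNullable : Nullability::kRequired;
}

// Decode side: anything other than "required" counts as nullable
// (an assumption of this sketch, including the unspecified case).
bool IsNullable(const UserDefined& type) {
  return type.nullability != Nullability::kRequired;
}

// Builds the message from an anchor and a nullability flag.
UserDefined EncodeUserDefined(std::uint32_t anchor, bool nullable) {
  UserDefined out;
  out.type_reference = anchor;
  out.nullability = ToNullability(nullable);
  return out;
}
```

Under these assumptions a message without an explicit nullability, like the one in the `NoEquivalentArrowType` test, would still decode as nullable.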
Minor nit: If possible, it might be nice to include the string value of what was specified? E.g. "Only `value` is supported but got 'enum'". But I don't remember if it is easy to go from `arg_type_case` to a string.
I don't think that's a thing, at least not automatically. I could list the options, but if Substrait adds something later it's not going to upgrade automatically. The same applies to plenty of other switch statements in the code currently. Also, if a plan with a function is passed to Arrow that uses argument features that Arrow doesn't even understand, the bigger issue will be that Arrow would have no idea what the function means to begin with. So I guess I could add the function name to the message.
If there is no easy enum->string then let's not worry about it. I think this is fine.
Late to the party, but in case it helps, I see there is EnumDescriptor.
Ah, good to know that it does exist. It still wouldn't be very helpful to users if a new option is added since the last time Arrow bumped Substrait, but I guess that's shifting the goalposts a little.
The Arrow code could include in the message the value converted to a string if it is known given the Substrait version the code was built with, and otherwise include the number of the value and explain it is unknown and coming from a different version of Substrait. All this enum-handling is for a separate issue, of course.
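The suggestion in the last comment could look roughly like the following sketch: use a name when the value is known to the build, and otherwise report the raw number with a hint about version skew. `DescribeArgKind` and the numeric values in the table are made up for illustration. In real code the names would have to come from protobuf reflection or a hand-maintained list, and nothing here reflects the actual Substrait field numbers.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: names an argument kind if this build knows it, and
// falls back to the raw number otherwise. The numbers are illustrative only.
std::string DescribeArgKind(int arg_case) {
  static const std::map<int, std::string> kKnownKinds = {
      {1, "enum"}, {2, "type"}, {3, "value"}};
  const auto it = kKnownKinds.find(arg_case);
  if (it != kKnownKinds.end()) {
    return it->second;
  }
  return "unknown argument kind " + std::to_string(arg_case) +
         " (possibly from a newer Substrait version)";
}
```

An error message could then read "Only value is supported but got " followed by `DescribeArgKind(arg_case)`, which stays informative even for values added after the last Substrait bump.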