-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Substrait-to-Velox conversion #1048
Conversation
Hi @rui-mo! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
This is great, thanks for contributing this back to Velox! Before I start looking at the code, I wonder if you already had thoughts on how we would add the dependent Substrait headers to ensure these files build on circleCI and that are easily usable by Velox developers. I happy to help brainstorming if you haven't! |
Thanks for you advice. The Substrait hearders can be automatically generated with some commands in CMakeLists. We will soon try to port this part from our implementations to Velox, thanks! Update:
But the latter way may be better to depend on a Substrait release. As Substrait committer mentioned in the PR, they are considering making a release. Which one is better on your opinion? What suggestion do you have for the integration for Substrait into Velox? |
One of the things we need to do in Substrait is to start releasing releases so people don't have to depend on git hash versions. @cpcloud, any chance you could take a look at substrait-io/substrait#141 to help with this? |
a2ee0f8
to
461b040
Compare
Substrait headers can be automatically generated, and the code compilation works now. A unit test was added, in which input would be a dumped Substrait plan, and output would be Velox computing result. Working on solving the multiple reference issues for the compilation of unit test. For now, Substrait proto files were put into Velox. If you have other opinion, please let me know, thanks. |
@pedroerp @jacques-n Quick question: Will this PR depend on Substrait release or not? Your thoughts here. I will suggest to live with what Substrait already has. And rebase on release in a follow-up PR. |
My recommendation would be that if the team wants to merge this before a release, we at least ensure it is from a commit in Substrait main branch. I don't think that it should super long to get to a first release. Thoughts @cpcloud? |
@jacques-n which commmit should be based on? |
Something in Substrait main branch. Preferably latest available commit. |
OK, will rebase this PR to the latest available Substrait commit before getting the release. |
3f7338f
to
f37da8f
Compare
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
f37da8f
to
c379e39
Compare
32ed8d8
to
fb96bf9
Compare
Updated Substrait to the latest available commit and added a unit test to read Json file and test Velox's computing. All Github checks were passed. This PR is ready for review now, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this! A few comments inline:
CMakeLists.txt
Outdated
@@ -41,6 +41,7 @@ option(VELOX_ENABLE_PARSE "Build parser used for unit tests." ON) | |||
option(VELOX_ENABLE_EXAMPLES | |||
"Build examples. This will enable VELOX_ENABLE_EXPRESSION automatically." | |||
ON) | |||
option(VELOX_ENABLE_SUBSTRAIT "Buid Substrait-to-Velox converter." ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we leave this off by default? Looks to me that using Substrait (at least for now) will the exception rather than the norm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turned OFF in the latest commit.
velox/CMakeLists.txt
Outdated
|
||
# substrait converter | ||
if(${VELOX_ENABLE_SUBSTRAIT}) | ||
add_subdirectory(substrait_converter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this directory to just "substrait", to follow the conversion pattern we use for "arrow" and ""dubckdb"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to "substrait" in the latest commit.
class SubstraitVeloxExprConverter { | ||
public: | ||
SubstraitVeloxExprConverter( | ||
const std::shared_ptr<SubstraitParser>& sub_parser, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use lower camel base for argument names - check out our coding style here:
https://github.com/facebookincubator/velox/blob/main/CODING_STYLE.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed them in the latest commit.
std::shared_ptr<const core::FieldAccessTypedExpr> toVeloxExpr( | ||
const substrait::Expression::FieldReference& sfield, | ||
const int32_t& input_plan_node_id); | ||
std::shared_ptr<const core::ITypedExpr> toVeloxExpr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: blank line in between methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// This class is used to convert Substrait representations to Velox expressions. | ||
class SubstraitVeloxExprConverter { | ||
public: | ||
SubstraitVeloxExprConverter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a comment here explaining what functions_map is and how it's used in this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments were added in the latest commit.
SubstraitVeloxExprConverter::toVeloxExpr( | ||
const substrait::Expression::ScalarFunction& sfunc, | ||
const int32_t& input_plan_node_id) { | ||
std::vector<std::shared_ptr<const core::ITypedExpr>> params; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: reserve the vector first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many should be reserved for the vector? Because the number of items seems to be unknown.
const substrait::Expression::ScalarFunction& sfunc, | ||
const int32_t& input_plan_node_id) { | ||
std::vector<std::shared_ptr<const core::ITypedExpr>> params; | ||
for (auto& sarg : sfunc.args()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto&
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the cases like this.
auto expr = toVeloxExpr(sarg, input_plan_node_id); | ||
params.push_back(expr); | ||
} | ||
auto function_id = sfunc.function_reference(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
camel case throughout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
SubstraitVeloxExprConverter::toVeloxExpr( | ||
const substrait::Expression::Literal& slit) { | ||
switch (slit.literal_type_case()) { | ||
case substrait::Expression_Literal::LiteralTypeCase::kFp64: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make it shorter:
case substrait::Expression_Literal::LiteralTypeCase::kFp64:
return std::make_shared<core::ConstantTypedExpr>(slit.fp64());
case ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
return std::make_shared<core::ConstantTypedExpr>(val); | ||
break; | ||
} | ||
case substrait::Expression_Literal::LiteralTypeCase::kBoolean: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default:
VELOX_NYI("Substrait conversion not supported for type '{}'", toString(slit));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
9da443b
to
5a8d146
Compare
|
||
/// The Substrait parser used to convert Substrait representations into | ||
/// recognizable representations. | ||
std::shared_ptr<SubstraitParser> subParser_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: blank line before the comment block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the cases like this.
std::shared_ptr<SubstraitParser> subParser_; | ||
/// The map storing the relations between the function id and the function | ||
/// name. | ||
std::unordered_map<uint64_t, std::string> functionMap_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
void SubstraitVeloxExprConverter::getFlatConditions( | ||
const ::substrait::Expression& sFilter, | ||
std::vector<::substrait::Expression_ScalarFunction>* scalarFunctions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pointers are usually to specify this parameter is optional, but it doesn't seem to be the case. Could we take it as a non-const ref instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the cases like this.
break; | ||
} | ||
default: | ||
VELOX_NYI("GetFlatConditions not supported for type '{}'", typeCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a toString()
sort of method for typeCase? I'm assuming this is a enum, so we would end up printing the integer value, which is not very helpful to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the typeCase is just integer value. But after searching out, did not find an elegant way to map this integer value to the key word in enum. I found two options not quite simple:
- Construct a map to store the mapping relations
- Use a tool to map the integer back to key word: https://github.com/Neargye/magic_enum
SubstraitVeloxExprConverter::toVeloxExpr( | ||
const ::substrait::Expression::ScalarFunction& sFunc, | ||
const int32_t& inputPlanNodeId) { | ||
std::vector<std::shared_ptr<const core::ITypedExpr>> params; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reserve it first?
params.reserve(sFunc.args().size());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the cases like this.
auto typeCase = sExpr.rex_type_case(); | ||
switch (typeCase) { | ||
case ::substrait::Expression::RexTypeCase::kLiteral: { | ||
veloxExpr = toVeloxExpr(sExpr.literal()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: return here directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the cases like this.
/// name. | ||
std::unordered_map<uint64_t, std::string> functionMap_; | ||
/// This class contains the needed infos for Filter Pushdown. | ||
class FilterInfo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to declare this class here? Seems like we could get away with just defining it as a free class in the .cpp file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the declaration from header.
// TODO: Support different types here. | ||
class SubstraitVeloxExprConverter::FilterInfo { | ||
public: | ||
FilterInfo() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need to explicitly specify it in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the cases like this.
/// the relation key words, including AND, OR, and etc. Currently, only | ||
/// AND is supported. This function is used to extract all the Substrait | ||
/// conditions in the binary tree structure into a vector. | ||
void getFlatConditions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe flattenConditions()
could be more descriptive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
const int32_t& inputPlanNodeId); | ||
|
||
/// Used to convert Substrait Filter into Velox SubfieldFilters. | ||
connector::hive::SubfieldFilters toVeloxFilter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused not. If this class only converts expressions, shouldn't filter conversion be handled by the plan conversion api in the other file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the filter conversion will be used in TableScan, so it is actually related to the plan conversion. Moved this part into the plan converter.
Thanks for addressing the comments! Just made a few more smaller comments here and there, but my main question is about where we place the filter conversion code. |
|
||
/// The map storing the relations between the function id and the function | ||
/// name. Will be constructed based on the Substrait representation. | ||
std::unordered_map<uint64_t, std::string> functionMap_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern of a function map doesn't seem to correctly represent the variability possible in a Substrait plan. How is this map constructed? It should be done by reading specific other parts of the plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we need to follow your suggestion to construct the function name into below format:
<function name>:<short_arg_type0><short_arg_type1>..._<short_arg_typeN>
And on your opinion, when should we validate those functions based on yaml specifications? In Spark or in Velox? I think we need to validate them before starting the computing.
Pushed one commit to fix the issues mentioned above. For the next, we will change the used function key words to be aligned with the specifications in Substrait yaml files. |
} | ||
}; | ||
|
||
// This test will firstly generate mock TPC-H lineitem ORC file. Then, Velox's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this test suite has quite a lot of dependencies. Would it be possible to have a leaner test suite with multiple smaller cases exercising only the plan and expr conversion code?
My point being that we don't necessarily need to execute the full query here, but rather just test that given an input substrait plan/expr, it generates the expect Velox plan/expr. Verifying that Velox correctly executes a given Velox plan/expr is already covered by other test suites.
We could also build e2e tests like this one, but my suggestion would be to add this along other tpch test execution code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your insightful comments.
This is just our first PR. Later, my colleagues and I will submit a series of PRs to improve the UT and add more function support. This UT ensures that the subsequent PRs would not break the functionalities currently supported. We plan to add the UTs you mentioned in our following PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pheeew, getting there! A few more comments:
velox/substrait/SubstraitUtils.h
Outdated
|
||
namespace facebook::velox::substrait { | ||
|
||
/// This class contains some common funcitons used to parse Substrait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
velox/substrait/SubstraitUtils.h
Outdated
struct SubstraitType { | ||
std::string type; | ||
bool nullable; | ||
SubstraitType(const std::string& subType, const bool& subNullable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use constructor intializers? Also, you might not need to explicitly specify this constructor - it's available by default (SubstraitType{a, b} type
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
velox/substrait/SubstraitUtils.h
Outdated
/// from a pre-constructed function map. | ||
std::string findSubstraitFunction( | ||
const std::unordered_map<uint64_t, std::string>& functionMap, | ||
const uint64_t& id) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just pass primitive types by value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the cases like this.
velox/substrait/SubstraitUtils.h
Outdated
/// from a pre-constructed function map. | ||
std::string findVeloxFunction( | ||
const std::unordered_map<uint64_t, std::string>& functionMap, | ||
const uint64_t& id) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
velox/substrait/SubstraitUtils.h
Outdated
/// words. Key: the Substrait function key word, Value: the Velox function key | ||
/// word. | ||
const std::unordered_map<std::string, std::string> substraitVeloxFunctionMap = | ||
{{"MULTIPLY", "multiply"}, {"SUM", "sum"}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will we need to add a mapping to every function available in Velox here at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For those functions with different names in Substrait and Velox, a mapping relation is needed. Since we change to use the unified function names with Substrait, the two mapping relations here are not needed any more. Because in Substrait, they are also specified as "multiply" and "sum".
std::shared_ptr<const core::PlanNode> toVeloxPlan( | ||
const ::substrait::FilterRel& sFilter); | ||
|
||
/// Used to convert Substrait ReadRel into Velox PlanNode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we explain what the parameters below mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added explanation for them.
} | ||
|
||
/// Will add the paths of the files to be scanned into a vector. | ||
void getPaths(std::vector<std::string>& paths) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps a more idiomatic way to achieve this is to have a function exposing the path_ vector as a const ref, and if you need to reuse this pattern inside the .cpp, this could be a free function inside an anonymous namespace on that file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your advice.
Actually these methods are called outside the .cpp, and in this PR, they are called in the UT. Changed them to expose the vector as a const ref.
auto groupingExprs = grouping.grouping_expressions(); | ||
for (const auto& groupingExpr : groupingExprs) { | ||
auto fieldExpr = | ||
exprConverter_->toVeloxExpr(groupingExpr, inputPlanNodeId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make the function return the specialized type directly (core::FieldAccessTypedExpr) so we don't need the cast below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in this way.
std::vector<std::string> projectOutNames; | ||
std::vector<std::shared_ptr<const core::CallTypedExpr>> aggExprs; | ||
aggExprs.reserve(sAgg.measures().size()); | ||
// Construct Velox Aggregate expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe use a blank line to separate code blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
std::shared_ptr<const core::PlanNode> SubstraitVeloxPlanConverter::toVeloxPlan( | ||
const ::substrait::RelRoot& sRoot) { | ||
const auto& sNames = sRoot.names(); | ||
int nameIdx = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int nameIdx = sNames.size();
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. It is not used currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this diff. There will likely be more improvements needed, but let's get this in and improve as we go.
Thanks for the contribution!
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Hi @pedroerp, thanks for your approval. |
No worries, I'll have to add the internal building details, so I can take care of the linter warning. Thanks again for looking into this! |
Summary: This pr is targeted to add the base layout for Substrait-to-Velox conversion. With this change, the base layout of Substrait-to-Velox conversion would be decided. Some of Substrait plan representations can be converted into Velox plan for computing. More conversions support will be added in following pull requests. Based on [Substrait Commit](substrait-io/substrait@9d9805b). The used function names are based on: [Substrait PR](substrait-io/substrait#147). Pull Request resolved: facebookincubator#1048 Reviewed By: kagamiori Differential Revision: D34773052 Pulled By: pedroerp fbshipit-source-id: c4c56a43486519b567aa1d715a0d64ce02a447d1
Summary: This pr is targeted to add the base layout for Substrait-to-Velox conversion. With this change, the base layout of Substrait-to-Velox conversion would be decided. Some of Substrait plan representations can be converted into Velox plan for computing. More conversions support will be added in following pull requests. Based on [Substrait Commit](substrait-io/substrait@9d9805b). The used function names are based on: [Substrait PR](substrait-io/substrait#147). Pull Request resolved: facebookincubator#1048 Reviewed By: kagamiori Differential Revision: D34773052 Pulled By: pedroerp fbshipit-source-id: c4c56a43486519b567aa1d715a0d64ce02a447d1
This pr is targeted to add the base layout for Substrait-to-Velox conversion. With this change, the base layout of Substrait-to-Velox conversion would be decided. Some of Substrait plan representations can be converted into Velox plan for computing. More conversions support will be added in following pull requests.
Based on Substrait Commit.
The used function names are based on: Substrait PR.