-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Substrait-to-Velox] Capture the file format specified in Substrait plan #1683
Conversation
Hi @JkSelf! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf Would you explain what is the problem you are fixing and what is the solution? Storing file format, starts, lengths, etc. in the plan converter is strange and won't work when there are multiple file formats. Also, please, add a test.
Thanks for your review! This PR is follow-up of #1048. We need the file format, starts, lengths, etc. when creating the |
Perhaps, one option is to define a struct to hold split information and store a mapping from plan node ID to a list of splits. |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
Agree with your suggestions. I have updated and please help to review again. |
@rui-mo Please help to review. Thanks for your help! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf Looks good. A few comments.
@@ -25,6 +25,35 @@ namespace facebook::velox::substrait { | |||
/// This class is used to convert the Substrait plan into Velox plan. | |||
class SubstraitVeloxPlanConverter { | |||
public: | |||
struct SplitStats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: Since there are not statistics, but rather information about the split, perhaps, rename to SplitInfo.
std::vector<std::string> paths, | ||
std::vector<u_int64_t> starts, | ||
std::vector<u_int64_t> lengths, | ||
int fileFormat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why int and not int32_t?
@@ -50,7 +79,8 @@ class SubstraitVeloxPlanConverter { | |||
u_int32_t& index, | |||
std::vector<std::string>& paths, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider changing this API to take a reference to SplitInfo struct to return a pair of PlanNodePtr and a SplitInfo
return lengths_; | ||
/// Return the splitStats map used by this plan converter. | ||
const std::unordered_map<core::PlanNodeId, std::shared_ptr<SplitStats>>& | ||
getSplitStatsMap() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the convention is not to add 'get' prefix to getters, e.g. splitInfos()
@@ -46,21 +46,35 @@ class Substrait2VeloxPlanConversionTest | |||
const std::shared_ptr<const core::PlanNode>& planNode, | |||
const std::vector<std::string>& paths, | |||
const std::vector<u_int64_t>& starts, | |||
const std::vector<u_int64_t>& lengths) | |||
const std::vector<u_int64_t>& lengths, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use SplitInfo struct for readability
@@ -136,9 +151,14 @@ class Substrait2VeloxPlanConversionTest | |||
facebook::velox::substrait::SubstraitVeloxPlanConverter>(); | |||
// Convert to Velox PlanNode. | |||
auto planNode = planConverter->toVeloxPlan(substraitPlan, pool_.get()); | |||
auto splitStatsMap = planConverter->getSplitStatsMap(); | |||
auto leafPlanNodeIds = planNode->leafPlanNodeIds(); | |||
// Here only one leaf node is expected here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, assert that there is exactly one node
@@ -149,11 +169,11 @@ class Substrait2VeloxPlanConversionTest | |||
absolutePaths.emplace_back(fmt::format("{}{}", tempPath, path)); | |||
} | |||
|
|||
std::vector<u_int64_t> starts = planConverter->getStarts(); | |||
std::vector<u_int64_t> lengths = planConverter->getLengths(); | |||
std::vector<u_int64_t> starts = splitStats->starts_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is hard to read. Please, consider refactoring as in #1686
@mbasmanova |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf Looks good. Perhaps, squash commits and rebase on top of 1720. A few small comments and it will be ready to land.
@@ -25,6 +25,25 @@ namespace facebook::velox::substrait { | |||
/// This class is used to convert the Substrait plan into Velox plan. | |||
class SubstraitVeloxPlanConverter { | |||
public: | |||
struct SplitInfo { | |||
SplitInfo() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defining default constructor explicitly not necessary.
SplitInfo() {} | ||
|
||
/// The Partition index. | ||
u_int32_t partitionIndex_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: members of the struct should not have underscore at the end, i.e. partitionIndex
lengths.reserve(fileList.size()); | ||
splitInfo->paths_.reserve(fileList.size()); | ||
splitInfo->starts_.reserve(fileList.size()); | ||
splitInfo->lengths_.reserve(fileList.size()); | ||
for (const auto& file : fileList) { | ||
// Expect all Partitions share the same index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a check to make sure this is indeed the case.
/// The unique identification for each PlanNode. | ||
int planNodeId_ = 0; | ||
|
||
/// The map storing the relations between the function id and the function | ||
/// name. Will be constructed based on the Substrait representation. | ||
std::unordered_map<uint64_t, std::string> functionMap_; | ||
|
||
/// The map storing the split stats for each PlanNode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- stats -> information
- for each -> per
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf Looks good % one question.
std::vector<u_int64_t> lengths; | ||
|
||
/// The file format of the files to be scanned. | ||
int32_t fileFormat; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of the integer here? Should we use dwio::common::FileFormat enum instead?
In the Substrait proto, I'm seeing file format defined as a struct:
https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L118
oneof file_format {
ParquetReadOptions parquet = 9;
ArrowReadOptions arrow = 10;
OrcReadOptions orc = 11;
google.protobuf.Any extension = 12;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oneof file_format {
ParquetReadOptions parquet = 9;
ArrowReadOptions arrow = 10;
OrcReadOptions orc = 11;
google.protobuf.Any extension = 12;
}
The above code is newly committed in PR#169. And the FileFormat is described as follow in velox.
enum FileFormat{
FILE_FORMAT_UNSPECIFIED = 0;
FILE_FORMAT_PARQUET = 1;
}
I think we can change this after applying PR#169.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Let's change int32_t to dwio::common::FileFormat in this PR and update Substrait proto in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf Thank you for the contribution.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
…r#1683) Summary: Pull Request resolved: facebookincubator#1683 Reviewed By: amitkdutta Differential Revision: D37062244 Pulled By: mbasmanova fbshipit-source-id: 0177a0330040de4b357e721c7ea43a17b88605ba
…acebookincubator#1683) What changes were proposed in this pull request? This pr fix that if the current block is empty, it should return false for hasNext. (Fixes: facebookincubator#1682) How was this patch tested? This patch was tested manually.
No description provided.