[Substrait-to-Velox] Capture the file format specified in Substrait plan #1683

Closed
wants to merge 8 commits into from

JkSelf
Collaborator

@JkSelf JkSelf commented May 24, 2022

No description provided.

@facebook-github-bot
Contributor

Hi @JkSelf!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Contributor

@mbasmanova mbasmanova left a comment

@JkSelf Would you explain what is the problem you are fixing and what is the solution? Storing file format, starts, lengths, etc. in the plan converter is strange and won't work when there are multiple file formats. Also, please, add a test.

@JkSelf
Collaborator Author

JkSelf commented May 24, 2022

@mbasmanova

Thanks for your review!

This PR is a follow-up to #1048. We need the file format, starts, lengths, etc. when creating the HiveConnectorSplit for source operator. This PR specifies the file format in HiveConnectorSplit so that the corresponding reader is called to read the file. I will add tests later.

@mbasmanova
Contributor

We need the file format, starts, lengths, etc. when creating the HiveConnectorSplit for source operator.

Perhaps, one option is to define a struct to hold split information and store a mapping from plan node ID to a list of splits.
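
A minimal sketch of this suggestion, not the code from this PR. `SplitRegistry`, its method names, and the `PlanNodeId` alias are hypothetical stand-ins for illustration; the `SplitInfo` fields follow the ones named in the discussion (paths, starts, lengths, file format).

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the plan node ID type used by the converter.
using PlanNodeId = std::string;

// Everything needed to build the splits for one scan node.
struct SplitInfo {
  std::vector<std::string> paths;
  std::vector<uint64_t> starts;
  std::vector<uint64_t> lengths;
  int32_t fileFormat{0};
};

// Hypothetical holder: split information is kept per plan node ID, so scans
// over different file formats do not overwrite each other.
class SplitRegistry {
 public:
  void add(const PlanNodeId& id, SplitInfo info) {
    splitInfos_[id].push_back(std::move(info));
  }

  // Returns the splits recorded for 'id', or an empty list if there are none.
  const std::vector<SplitInfo>& splitsFor(const PlanNodeId& id) const {
    static const std::vector<SplitInfo> kEmpty;
    auto it = splitInfos_.find(id);
    return it == splitInfos_.end() ? kEmpty : it->second;
  }

 private:
  std::unordered_map<PlanNodeId, std::vector<SplitInfo>> splitInfos_;
};
```

Keying by plan node ID is what removes the "single file format per converter" limitation raised in the first review comment: each scan node carries its own format.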

@facebook-github-bot facebook-github-bot added the CLA Signed label May 25, 2022
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@JkSelf
Collaborator Author

JkSelf commented May 26, 2022

Perhaps, one option is to define a struct to hold split information and store a mapping from plan node ID to a list of splits.

I agree with your suggestion. I have updated the PR; please review it again.

@JkSelf
Collaborator Author

JkSelf commented May 26, 2022

@rui-mo Please help to review. Thanks for your help!

Contributor

@mbasmanova mbasmanova left a comment

@JkSelf Looks good. A few comments.

@@ -25,6 +25,35 @@ namespace facebook::velox::substrait {
/// This class is used to convert the Substrait plan into Velox plan.
class SubstraitVeloxPlanConverter {
public:
struct SplitStats {
Contributor

naming: Since these are not statistics, but rather information about the split, perhaps rename to SplitInfo.

std::vector<std::string> paths,
std::vector<u_int64_t> starts,
std::vector<u_int64_t> lengths,
int fileFormat)
Contributor

why int and not int32_t?

@@ -50,7 +79,8 @@ class SubstraitVeloxPlanConverter {
u_int32_t& index,
std::vector<std::string>& paths,
Contributor

Consider changing this API to take a reference to a SplitInfo struct, or to return a pair of PlanNodePtr and SplitInfo.
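
The two API shapes mentioned in this comment could look roughly like the sketch below. `PlanNode`, `FileEntry`, and `toScanNode` are hypothetical stand-ins, not the converter's actual types or signatures.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Velox plan node types.
struct PlanNode {
  std::string id;
};
using PlanNodePtr = std::shared_ptr<const PlanNode>;

// Hypothetical description of one file to scan.
struct FileEntry {
  std::string path;
  uint64_t start;
  uint64_t length;
};

struct SplitInfo {
  std::vector<std::string> paths;
  std::vector<uint64_t> starts;
  std::vector<uint64_t> lengths;
};

// Option 1: the caller passes a SplitInfo to be filled in.
PlanNodePtr toScanNode(
    const std::string& id,
    const std::vector<FileEntry>& files,
    SplitInfo& splitInfo) {
  for (const auto& file : files) {
    splitInfo.paths.push_back(file.path);
    splitInfo.starts.push_back(file.start);
    splitInfo.lengths.push_back(file.length);
  }
  return std::make_shared<PlanNode>(PlanNode{id});
}

// Option 2: the node and its split information are returned together.
std::pair<PlanNodePtr, SplitInfo> toScanNode(
    const std::string& id,
    const std::vector<FileEntry>& files) {
  SplitInfo splitInfo;
  auto node = toScanNode(id, files, splitInfo);
  return {node, std::move(splitInfo)};
}
```

Either shape replaces the three parallel vector parameters with one named struct, so adding a field such as the file format does not change every signature again.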

return lengths_;
/// Return the splitStats map used by this plan converter.
const std::unordered_map<core::PlanNodeId, std::shared_ptr<SplitStats>>&
getSplitStatsMap() const {
Contributor

the convention is not to add 'get' prefix to getters, e.g. splitInfos()

@@ -46,21 +46,35 @@ class Substrait2VeloxPlanConversionTest
const std::shared_ptr<const core::PlanNode>& planNode,
const std::vector<std::string>& paths,
const std::vector<u_int64_t>& starts,
const std::vector<u_int64_t>& lengths)
const std::vector<u_int64_t>& lengths,
Contributor

use SplitInfo struct for readability

@@ -136,9 +151,14 @@ class Substrait2VeloxPlanConversionTest
facebook::velox::substrait::SubstraitVeloxPlanConverter>();
// Convert to Velox PlanNode.
auto planNode = planConverter->toVeloxPlan(substraitPlan, pool_.get());
auto splitStatsMap = planConverter->getSplitStatsMap();
auto leafPlanNodeIds = planNode->leafPlanNodeIds();
// Here only one leaf node is expected here.
Contributor

perhaps, assert that there is exactly one node
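
One way to make that expectation explicit, as a sketch: `onlyLeafId` is a hypothetical helper, and `std::string` stands in for the plan node ID type. In a GoogleTest-based test, an `ASSERT_EQ(1, leafPlanNodeIds.size())` before reading the first element would express the same check.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>

// Returns the single leaf plan node ID, or throws if the plan does not have
// exactly one leaf. std::string stands in for the plan node ID type.
std::string onlyLeafId(const std::unordered_set<std::string>& leafIds) {
  if (leafIds.size() != 1) {
    throw std::logic_error(
        "Expected exactly one leaf plan node, got " +
        std::to_string(leafIds.size()));
  }
  return *leafIds.begin();
}
```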

@@ -149,11 +169,11 @@ class Substrait2VeloxPlanConversionTest
absolutePaths.emplace_back(fmt::format("{}{}", tempPath, path));
}

std::vector<u_int64_t> starts = planConverter->getStarts();
std::vector<u_int64_t> lengths = planConverter->getLengths();
std::vector<u_int64_t> starts = splitStats->starts_;
Contributor

This test is hard to read. Please, consider refactoring as in #1686

@JkSelf
Collaborator Author

JkSelf commented May 26, 2022

@mbasmanova
All comments have been addressed except the test. Refactoring it as in #1686 may duplicate some of that work. Can I update the test after #1686 is merged?

Contributor

@mbasmanova mbasmanova left a comment

@JkSelf Looks good. Perhaps, squash commits and rebase on top of 1720. A few small comments and it will be ready to land.

@@ -25,6 +25,25 @@ namespace facebook::velox::substrait {
/// This class is used to convert the Substrait plan into Velox plan.
class SubstraitVeloxPlanConverter {
public:
struct SplitInfo {
SplitInfo() {}
Contributor

Defining the default constructor explicitly is not necessary.

SplitInfo() {}

/// The Partition index.
u_int32_t partitionIndex_;
Contributor

naming: members of the struct should not have underscore at the end, i.e. partitionIndex
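
With this comment and the earlier one about the default constructor applied, the struct might look like the sketch below. The field set is inferred from the diff fragments quoted in this review and may not match the merged code; `uint32_t` and `uint64_t` from `<cstdint>` are used here in place of the diff's `u_int32_t` and `u_int64_t`.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// No user-declared constructor: the implicit default constructor is enough,
// and the default member initializer keeps partitionIndex well-defined.
struct SplitInfo {
  /// The partition index.
  uint32_t partitionIndex{0};

  /// The file paths to be scanned.
  std::vector<std::string> paths;

  /// The file starts in the scan.
  std::vector<uint64_t> starts;

  /// The lengths to be scanned.
  std::vector<uint64_t> lengths;
};
```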

lengths.reserve(fileList.size());
splitInfo->paths_.reserve(fileList.size());
splitInfo->starts_.reserve(fileList.size());
splitInfo->lengths_.reserve(fileList.size());
for (const auto& file : fileList) {
// Expect all Partitions share the same index.
Contributor

Let's add a check to make sure this is indeed the case.
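
A sketch of such a check, assuming each file entry exposes its partition index. `PartitionedFile` and `sharedPartitionIndex` are hypothetical names, and standard exceptions are used here only to keep the example self-contained; inside Velox a check macro such as `VELOX_CHECK_EQ` would likely be the natural fit.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one file entry of the Substrait file list.
struct PartitionedFile {
  std::string path;
  uint32_t partitionIndex;
};

// Returns the partition index shared by all files. Throws if the list is
// empty or if any file carries a different index than the first one.
uint32_t sharedPartitionIndex(const std::vector<PartitionedFile>& files) {
  if (files.empty()) {
    throw std::invalid_argument("Expected at least one file");
  }
  const uint32_t expected = files.front().partitionIndex;
  for (const auto& file : files) {
    if (file.partitionIndex != expected) {
      throw std::invalid_argument(
          "Partition index mismatch for " + file.path + ": expected " +
          std::to_string(expected) + ", got " +
          std::to_string(file.partitionIndex));
    }
  }
  return expected;
}
```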

/// The unique identification for each PlanNode.
int planNodeId_ = 0;

/// The map storing the relations between the function id and the function
/// name. Will be constructed based on the Substrait representation.
std::unordered_map<uint64_t, std::string> functionMap_;

/// The map storing the split stats for each PlanNode.
Contributor

  • stats -> information
  • for each -> per

Contributor

@mbasmanova mbasmanova left a comment

@JkSelf Looks good, modulo one question.

std::vector<u_int64_t> lengths;

/// The file format of the files to be scanned.
int32_t fileFormat;
Contributor

What is the meaning of the integer here? Should we use dwio::common::FileFormat enum instead?

In the Substrait proto, I'm seeing file format defined as a struct:

https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L118

      oneof file_format {
        ParquetReadOptions parquet = 9;
        ArrowReadOptions arrow = 10;
        OrcReadOptions orc = 11;
        google.protobuf.Any extension = 12;
      }

Collaborator Author

  oneof file_format {
    ParquetReadOptions parquet = 9;
    ArrowReadOptions arrow = 10;
    OrcReadOptions orc = 11;
    google.protobuf.Any extension = 12;
  }

The code above was newly committed in PR#169. The FileFormat is currently described as follows in Velox.

  enum FileFormat{
    FILE_FORMAT_UNSPECIFIED = 0;
    FILE_FORMAT_PARQUET  = 1;
  }

I think we can change this after applying PR#169.

Contributor

Got it. Let's change int32_t to dwio::common::FileFormat in this PR and update Substrait proto in a follow-up PR.
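
A sketch of the conversion this implies, under stated assumptions: the local `FileFormat` enum is a reduced, hypothetical stand-in for `dwio::common::FileFormat` (the real enum has more members), `toFileFormat` is a hypothetical helper, and the integer values 0 and 1 are taken from the proto enum quoted above. Mapping the unspecified value to `UNKNOWN` rather than rejecting it is also an assumption.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Reduced, hypothetical stand-in for dwio::common::FileFormat.
enum class FileFormat { UNKNOWN, PARQUET };

// Converts the integer carried in the Substrait plan into a typed format, so
// the rest of the converter never has to interpret a raw integer.
FileFormat toFileFormat(int32_t substraitFormat) {
  switch (substraitFormat) {
    case 0: // FILE_FORMAT_UNSPECIFIED
      return FileFormat::UNKNOWN;
    case 1: // FILE_FORMAT_PARQUET
      return FileFormat::PARQUET;
    default:
      throw std::invalid_argument(
          "Unsupported Substrait file format: " +
          std::to_string(substraitFormat));
  }
}
```

Doing the translation once, at the point where the plan is read, means a later switch to the `oneof file_format` representation only has to touch this function.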

Contributor

@mbasmanova mbasmanova left a comment

@JkSelf Thank you for the contribution.

@mbasmanova mbasmanova changed the title from "Store file format of the files to be scanned when convert the substrait plan to velox plan" to "[Substrait-to-Velox] Capture the file format specified in Substrait plan" Jun 10, 2022
@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

shiyu-bytedance pushed a commit to shiyu-bytedance/velox-1 that referenced this pull request Aug 18, 2022
…r#1683)

Summary: Pull Request resolved: facebookincubator#1683

Reviewed By: amitkdutta

Differential Revision: D37062244

Pulled By: mbasmanova

fbshipit-source-id: 0177a0330040de4b357e721c7ea43a17b88605ba
marin-ma pushed a commit to marin-ma/velox-oap that referenced this pull request Dec 15, 2023
…acebookincubator#1683)

What changes were proposed in this pull request?
This PR fixes hasNext so that it returns false when the current block is empty.

(Fixes: facebookincubator#1682)

How was this patch tested?
This patch was tested manually.