Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset #43473

Closed
wants to merge 3 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Oct 21, 2023

What changes were proposed in this pull request?

Currently, SparkConnectPlanner.transformRelation always return the LogicalPlan of Dataset.
The SparkConnectPlanExecution.handlePlan constructs a new Dataset for it.

Sometimes, SparkConnectPlanExecution.handlePlan could reuse the Dataset created by SparkConnectPlanner.transformRelation.

This PR creates a common method transformRelationAsDataset, so as the Dataset could be reused in many places.

Why are the changes needed?

Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset.

Does this PR introduce any user-facing change?

'No'.
Just update inner implementation.

How was this patch tested?

Exists test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

@beliefer beliefer force-pushed the SPARK-43829_new branch 2 times, most recently from 9c51ed4 to 13b8449 Compare October 23, 2023 09:54
@beliefer
Copy link
Contributor Author

Copy link

github-actions bot commented Feb 3, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 3, 2024
@beliefer beliefer removed the Stale label Feb 3, 2024
@beliefer beliefer force-pushed the SPARK-43829_new branch 3 times, most recently from b11a9d8 to d919877 Compare February 5, 2024 08:12
@@ -115,6 +115,49 @@ class SparkConnectPlanner(
private lazy val pythonExec =
sys.env.getOrElse("PYSPARK_PYTHON", sys.env.getOrElse("PYSPARK_DRIVER_PYTHON", "python3"))

// Some relation transform need to create Dataset, then get the logical plan from the Dataset.
// This method used to reuse the Dataset instead to discard it.
def transformRelationAsDataset(rel: proto.Relation): Dataset[Row] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I want to move away from constructing Datasets wholesale. In many cases there is no real need, and it is also expensive to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something you want to work on?

Copy link
Contributor Author

@beliefer beliefer Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I want to move away from constructing Datasets wholesale. In many cases there is no real need, and it is also expensive to do.

Yes. the current implementation create man duplicate datasets and then discard them. We should reuse these datasets and reduce the overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something you want to work on?

Yes.

Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 18, 2024
@github-actions github-actions bot closed this May 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants