[SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset #43473

beliefer · 2023-10-21T09:23:04Z

What changes were proposed in this pull request?

Currently, SparkConnectPlanner.transformRelation always returns the LogicalPlan of a Dataset.
SparkConnectPlanExecution.handlePlan then constructs a new Dataset from that plan.

In some cases, SparkConnectPlanExecution.handlePlan could instead reuse the Dataset already created by SparkConnectPlanner.transformRelation.

This PR adds a common method, transformRelationAsDataset, so that the Dataset can be reused in multiple places.
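To make the intent concrete, here is a minimal, self-contained sketch of the before/after shape. It is not the actual Spark code: `Relation`, `LogicalPlan`, `Dataset`, `Planner`, and `Execution` are simplified hypothetical stand-ins, and the `constructed` counter exists only to show how many Dataset objects each path builds. Only the method names transformRelation, transformRelationAsDataset, and handlePlan come from this PR.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
final case class Relation(name: String)
final case class LogicalPlan(description: String)

object Dataset {
  // Counts constructions so the two code paths can be compared.
  var constructed: Int = 0
}

final class Dataset(val logicalPlan: LogicalPlan) {
  Dataset.constructed += 1
}

object Planner {
  // Stands in for a relation transform that has to build a Dataset.
  def transformRelationAsDataset(rel: Relation): Dataset =
    new Dataset(LogicalPlan(s"plan(${rel.name})"))

  // Plan-only entry point: the Dataset is built, then discarded.
  def transformRelation(rel: Relation): LogicalPlan =
    transformRelationAsDataset(rel).logicalPlan
}

object Execution {
  // Before: take only the plan, then wrap it in a second Dataset.
  def handlePlanBefore(rel: Relation): Dataset =
    new Dataset(Planner.transformRelation(rel))

  // After: reuse the Dataset the planner already built.
  def handlePlanAfter(rel: Relation): Dataset =
    Planner.transformRelationAsDataset(rel)
}
```

In this toy model, `handlePlanBefore` constructs two Dataset objects per request and `handlePlanAfter` constructs one, while both end up with the same logical plan.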

Why are the changes needed?

Improve SparkConnectPlanner by reusing the Dataset it already creates and avoiding the construction of a new one.

Does this PR introduce any user-facing change?

'No'.
This only changes the internal implementation.

How was this patch tested?

Existing test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

beliefer · 2023-10-25T09:37:39Z

ping @hvanhovell cc @HyukjinKwon @zhengruifeng @ueshin

github-actions · 2024-02-03T00:18:23Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

hvanhovell · 2024-02-06T18:57:14Z

...connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala

@@ -115,6 +115,49 @@ class SparkConnectPlanner(
  private lazy val pythonExec =
    sys.env.getOrElse("PYSPARK_PYTHON", sys.env.getOrElse("PYSPARK_DRIVER_PYTHON", "python3"))

+  // Some relation transform need to create Dataset, then get the logical plan from the Dataset.
+  // This method used to reuse the Dataset instead to discard it.
+  def transformRelationAsDataset(rel: proto.Relation): Dataset[Row] = {


TBH I want to move away from constructing Datasets wholesale. In many cases there is no real need, and it is also expensive to do.

Is this something you want to work on?

beliefer · Feb 7, 2024

> TBH I want to move away from constructing Datasets wholesale. In many cases there is no real need, and it is also expensive to do.

Yes. The current implementation creates many duplicate Datasets and then discards them. We should reuse these Datasets to reduce the overhead.

> Is this something you want to work on?

Yes.

github-actions · 2024-05-18T00:19:24Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added SQL CONNECT labels Oct 21, 2023

beliefer force-pushed the SPARK-43829_new branch 2 times, most recently from 9c51ed4 to 13b8449 Compare October 23, 2023 09:54

beliefer requested review from ueshin and zhengruifeng October 25, 2023 04:31

github-actions bot added the Stale label Feb 3, 2024

beliefer removed the Stale label Feb 3, 2024

[SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

54d174e

beliefer force-pushed the SPARK-43829_new branch 3 times, most recently from b11a9d8 to d919877 Compare February 5, 2024 08:12

Update comment

b6e0183

beliefer force-pushed the SPARK-43829_new branch from d919877 to b6e0183 Compare February 5, 2024 11:27

Update code

9d4c19d

beliefer force-pushed the SPARK-43829_new branch from dd0504c to 9d4c19d Compare February 6, 2024 11:50

hvanhovell reviewed Feb 6, 2024

beliefer mentioned this pull request Apr 12, 2024

[SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests #46012

Closed

github-actions bot added the Stale label May 18, 2024

github-actions bot closed this May 19, 2024
