[SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset #43473
ping @hvanhovell cc @HyukjinKwon @zhengruifeng @ueshin
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@@ -115,6 +115,49 @@ class SparkConnectPlanner(
  private lazy val pythonExec =
    sys.env.getOrElse("PYSPARK_PYTHON", sys.env.getOrElse("PYSPARK_DRIVER_PYTHON", "python3"))

  // Some relation transform need to create Dataset, then get the logical plan from the Dataset.
  // This method used to reuse the Dataset instead to discard it.
  def transformRelationAsDataset(rel: proto.Relation): Dataset[Row] = {
TBH I want to move away from constructing Datasets wholesale. In many cases there is no real need, and it is also expensive to do.
Is this something you want to work on?
> TBH I want to move away from constructing Datasets wholesale. In many cases there is no real need, and it is also expensive to do.

Yes. The current implementation creates many duplicate Datasets and then discards them. We should reuse these Datasets to reduce the overhead.
> Is this something you want to work on?

Yes.
What changes were proposed in this pull request?
Currently, `SparkConnectPlanner.transformRelation` always returns the `LogicalPlan` of a `Dataset`. `SparkConnectPlanExecution.handlePlan` then constructs a new `Dataset` for that plan. In some cases, `SparkConnectPlanExecution.handlePlan` could instead reuse the `Dataset` already created by `SparkConnectPlanner.transformRelation`.
This PR adds a common method, `transformRelationAsDataset`, so that the `Dataset` can be reused in many places.

Why are the changes needed?
To improve `SparkConnectPlanner` by reusing the existing `Dataset` instead of constructing a new one.

Does this PR introduce any user-facing change?
No. This only updates the internal implementation.

How was this patch tested?
Existing test cases.

Was this patch authored or co-authored using generative AI tooling?
No.
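
The reuse idea in the description can be illustrated with a small, Spark-free sketch. Everything here is a hypothetical stand-in rather than the PR's actual code: `ToyRelation`, `ToyPlan`, `ToyDataset`, `DatasetCounter`, `ToyPlanner`, and `ToyExecution` are invented for illustration, and only the method names `transformRelation`, `transformRelationAsDataset`, and `handlePlan` echo the ones mentioned above. The sketch assumes that the cost being avoided is the extra object construction, and it counts constructions to make that visible.

```scala
// Toy stand-ins only: none of these are Spark classes.
final case class ToyRelation(name: String)
final case class ToyPlan(description: String)

// Counts how many ToyDataset instances have been constructed.
object DatasetCounter {
  var constructed: Int = 0
}

// Hypothetical dataset wrapper; constructing one is the "expensive" step.
final class ToyDataset(val logicalPlan: ToyPlan) {
  DatasetCounter.constructed += 1
}

object ToyPlanner {
  // Builds the dataset once and returns it so callers can reuse it.
  def transformRelationAsDataset(rel: ToyRelation): ToyDataset =
    new ToyDataset(ToyPlan(s"plan(${rel.name})"))

  // Callers that only need the plan still get it from the same path.
  def transformRelation(rel: ToyRelation): ToyPlan =
    transformRelationAsDataset(rel).logicalPlan
}

object ToyExecution {
  // "Before" shape: take the plan, then wrap it in a second dataset.
  def handlePlanRebuilding(rel: ToyRelation): ToyDataset =
    new ToyDataset(ToyPlanner.transformRelation(rel))

  // "After" shape: reuse the dataset the planner already built.
  def handlePlanReusing(rel: ToyRelation): ToyDataset =
    ToyPlanner.transformRelationAsDataset(rel)
}
```

In this toy model the rebuilding path constructs two `ToyDataset` instances per relation and the reusing path constructs one, while both end up with the same plan. Whether the real saving in Spark Connect is comparable depends on details the PR discussion does not spell out.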