
Improved pipeline translation in SparkStructuredStreamingRunner #22446

Merged

Conversation

mosche
Member

@mosche mosche commented Jul 26, 2022

Improved pipeline translation in SparkStructuredStreamingRunner (closes #22445, #22382):

  • Make use of Spark Encoders to leverage structural information in translation (and potentially benefit from the Catalyst optimizer). Note, however, that the possible benefit is limited, as every ParDo is a black box and a hard boundary for anything that could be optimized.
  • Improved translation of GroupByKey. When applicable, also group by window to scale out better and/or use Spark's native collect_list to collect the values of each group (see the sketch after this list).
  • Make use of specialised Spark Aggregators for combines (per key / globally); Sessions in particular can be improved significantly.
  • Dedicated translation for Combine.Globally to avoid an additional shuffle of data.
  • Remove the additional serialization roundtrip when reading from a Beam BoundedSource.
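A minimal illustrative sketch of the grouping idea (this is not the runner's actual code; the flat key, window and value columns are assumptions made for the example):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class GroupByKeyAndWindowSketch {

  // Groups rows by key *and* window and collects each group's values using Spark's
  // native collect_list instead of materializing them in user code.
  static Dataset<Row> groupByKeyAndWindow(Dataset<Row> input) {
    return input
        // Grouping by window as well as key spreads groups over more partitions
        // than grouping by key alone, which helps the job scale out.
        .groupBy(col("key"), col("window"))
        // collect_list gathers the values of each group natively in Spark.
        .agg(collect_list(col("value")).as("values"));
  }
}
```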

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@mosche
Member Author

mosche commented Jul 26, 2022

R: @aromanenko-dev
R: @echauchot

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@mosche
Member Author

mosche commented Aug 2, 2022

ping @echauchot @aromanenko-dev 😀

@aromanenko-dev
Contributor

Run Spark ValidatesRunner

@aromanenko-dev
Contributor

Run Spark StructuredStreaming ValidatesRunner

Contributor

@aromanenko-dev aromanenko-dev left a comment

Thanks! Great work, it looks very promising in terms of potential performance improvements.

I gave this a brief look; it's quite a lot of code, to be honest, which makes it difficult to review in one shot. I've just left minor comments. I'd leave the other part of the review to @echauchot, as the main author of this code.

Also, could you add any performance (or other) test results that you have?

@mosche
Member Author

mosche commented Aug 11, 2022

results
@aromanenko-dev

@mosche
Member Author

mosche commented Sep 5, 2022

@echauchot Kind ping :)

@mosche
Member Author

mosche commented Sep 14, 2022

@echauchot @aromanenko-dev Btw, I was thinking a bit about a better name for this runner. I'd suggest renaming it to SparkSqlRunner, taking into account that it's built on top of the Spark SQL module:

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

https://www.databricks.com/glossary/what-is-spark-sql

@aromanenko-dev
Contributor

@mosche I totally agree that the current name is not very practical: it's quite long and, even worse, very confusing, since it contains the word Streaming but this runner doesn't support streaming mode at all (we know the reasons, but it is what it is).

So it would be better to rename it, though I'm not sure about SparkSqlRunner as a new name. IMHO, it may also be confusing and give the false expectation that it supports only Spark (or Beam?) SQL pipelines.

I'd suggest the name SparkDatasetRunner since it's based on the Spark Dataset API. It is quite short and gives a basic idea of what to expect from this runner. The old runner could be called SparkRDDRunner, but let's keep it as it is - just SparkRunner.

On the other hand, this renaming will require many incompatible changes, starting with new package and artifact names. However, I'm pretty sure that most users who run Beam pipelines on Spark still use the old classical Spark(RDD)Runner. We could check on user@ and Twitter, if needed.

@mosche
Member Author

mosche commented Sep 14, 2022

I agree, that leaves room for potential new confusion. Giving this a second thought, I suppose you're right and SparkDatasetRunner is the better name with less ambiguity ... nevertheless it's a rather technical name, which I'd usually rather avoid.

Regarding the rename or any other incompatible changes, I'm personally fairly relaxed at this stage:

  • it's clearly marked as experimental, and such changes are to be expected
  • the runner isn't optimised in any way yet, so there's little to no reason to use an experimental runner over a proven existing one (besides that, there are potential scalability issues that make me doubt a bit that it would work well on decently sized datasets)
  • users typically don't (and probably shouldn't) interact with runner packages / classes (the metrics sink might be the only exception)
  • and last, in case it is used, changing the runner name is trivial enough ... there could be a dummy runner with the old name that calls out to the new one and asks users to change their configuration

@mosche
Member Author

mosche commented Sep 15, 2022

@echauchot Will you be able to review this? Otherwise I'd suggest to merge this to not further block follow ups. Looking forward to your feedback.

@aromanenko-dev
Contributor

Run Java PreCommit

@aromanenko-dev
Contributor

Run Spark ValidatesRunner

@aromanenko-dev
Contributor

Run Spark StructuredStreaming ValidatesRunner

Contributor

@aromanenko-dev aromanenko-dev left a comment

I took a glance at this change and it LGTM.
Taking into account that this PR really improves the performance of some transforms when running on Spark (according to the Nexmark results), I believe we should merge it once all tests are green.

@echauchot
Contributor

I took a glance at this change and it LGTM. Taking into account that this PR really improves the performance of some transforms when running on Spark (according to the Nexmark results), I believe we should merge it once all tests are green.

I would like to review this before merging, but it is very long and I'm stuck on other things. I'll try my best to take a look ASAP.

@echauchot
Contributor

I agree, that leaves room for potential new confusion. Giving this a second thought, I suppose you're right and SparkDatasetRunner is the better name with less ambiguity ... nevertheless it's a rather technical name, which I'd usually rather avoid.

Regarding the rename or any other incompatible changes, I'm personally fairly relaxed at this stage:

  • it's clearly marked as experimental, and such changes are to be expected
  • the runner isn't optimised in any way yet, so there's little to no reason to use an experimental runner over a proven existing one (besides that, there are potential scalability issues that make me doubt a bit that it would work well on decently sized datasets)
  • users typically don't (and probably shouldn't) interact with runner packages / classes (the metrics sink might be the only exception)
  • and last, in case it is used, changing the runner name is trivial enough ... there could be a dummy runner with the old name that calls out to the new one and asks users to change their configuration

I agree, the name needs to change. I also agree with @aromanenko-dev that SparkSqlRunner is confusing. And I agree with the proposal of SparkDatasetRunner.

@echauchot
Contributor

@mosche reviewing ...
cc: @aromanenko-dev

@echauchot
Contributor

@mosche: did you rebase this PR on top of the previously merged code about the Encoders? I have the impression it contains the same changes?

@mosche
Member Author

mosche commented Sep 15, 2022

There was no such PR yet, @echauchot ... maybe you already had a look at that code on the branch.

@mosche
Member Author

mosche commented Sep 15, 2022

Oh, I remember ... you mean this one, #22157?
Yes, that's rebased ... but obviously this PR contains lots of changes to the encoders, using encoders that are aware of the structure rather than just binary encoders.
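To illustrate the difference with plain Spark APIs (a minimal sketch only, not the runner's encoder implementation; the Event bean and its schema are made up for the example):

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EncoderSketch {

  // Made-up element type; the runner uses its own encoders instead.
  public static class Event implements Serializable {
    private String key;
    private long timestamp;

    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }
    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
  }

  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder().master("local[*]").appName("encoder-sketch").getOrCreate();

    Event event = new Event();
    event.setKey("a");
    event.setTimestamp(42L);

    // Binary encoder: the whole element is a single opaque binary column,
    // so Catalyst sees no structure it could work with.
    Dataset<Event> binary = spark.createDataset(Arrays.asList(event), Encoders.kryo(Event.class));
    binary.printSchema(); // value: binary

    // Structure-aware encoder: fields become typed columns Catalyst can
    // project, filter and group on directly.
    Dataset<Event> structured = spark.createDataset(Arrays.asList(event), Encoders.bean(Event.class));
    structured.printSchema(); // key: string, timestamp: bigint

    spark.stop();
  }
}
```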

@echauchot
Contributor

Oh, I remember ... you mean this one, #22157? Yes, that's rebased ... but obviously this PR contains lots of changes to the encoders, using encoders that are aware of the structure rather than just binary encoders.

Yes, I meant #22157.
OK, so my intuition was incorrect: the encoder changes are not the same as in #22157.

@echauchot
Contributor

results @aromanenko-dev

I think you should also run the TPCDS suite on this PR (ask @aromanenko-dev), because when we compared the two Spark runners in the past we saw big differences between the Nexmark and TPCDS suites (Nexmark was slightly in favor of the Dataset runner for some queries, but TPCDS was way in favor of the RDD runner for almost all queries).

@aromanenko-dev
Contributor

We can run it on Jenkins against this PR, if needed.

Contributor

@echauchot echauchot left a comment

@mosche Very minor iterative review: I just took a look at the Sessions Aggregator. Only minor nits on comments for readability / clarification.

@mosche
Member Author

mosche commented Sep 19, 2022

Run Spark StructuredStreaming ValidatesRunner

@mosche
Member Author

mosche commented Sep 19, 2022

Run Spark ValidatesRunner

@mosche
Member Author

mosche commented Sep 19, 2022

Run Spark ValidatesRunner

@echauchot
Contributor

We can run it on Jenkins against this PR, if needed.

@mosche did you manage to run the TPCDS suite on this PR?

@echauchot
Contributor

I see that Nexmark queries 5 and 7 have improved quite a lot. They are mainly based on combiners and windows. Nice!

Contributor

@echauchot echauchot left a comment

Partial review: for now I have looked at

  • the general architecture
  • the aggregators
  • the combiner translations (globally and per key)
  • the source
  • started GBK

This looks very good to me.
I need to finish taking a look at the GBK and the Encoders, and we could merge if it is all good. The changes are minor (except the one on triggers in batch).
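As background for the combine-related items above: the PR implements Beam combines with Spark Aggregators, and one way to picture that is an Aggregator wrapping a Beam CombineFn. The sketch below is illustrative only and is not the code under review; the class, its constructor and the externally supplied encoders are assumptions, and windowing is ignored.

```java
import java.util.Arrays;

import org.apache.beam.sdk.transforms.Combine;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.expressions.Aggregator;

// Illustrative bridge from a Beam CombineFn to a Spark Aggregator.
public class CombineFnAggregator<InT, AccT, OutT> extends Aggregator<InT, AccT, OutT> {

  private final Combine.CombineFn<InT, AccT, OutT> fn;
  private final Encoder<AccT> accumulatorEncoder;
  private final Encoder<OutT> outputEncoder;

  public CombineFnAggregator(
      Combine.CombineFn<InT, AccT, OutT> fn,
      Encoder<AccT> accumulatorEncoder,
      Encoder<OutT> outputEncoder) {
    this.fn = fn;
    this.accumulatorEncoder = accumulatorEncoder;
    this.outputEncoder = outputEncoder;
  }

  @Override
  public AccT zero() {
    // Spark's zero value is Beam's empty accumulator.
    return fn.createAccumulator();
  }

  @Override
  public AccT reduce(AccT accumulator, InT input) {
    // Add a single input element to the accumulator.
    return fn.addInput(accumulator, input);
  }

  @Override
  public AccT merge(AccT left, AccT right) {
    // Merge partial accumulators coming from different partitions.
    return fn.mergeAccumulators(Arrays.asList(left, right));
  }

  @Override
  public OutT finish(AccT accumulator) {
    return fn.extractOutput(accumulator);
  }

  @Override
  public Encoder<AccT> bufferEncoder() {
    return accumulatorEncoder;
  }

  @Override
  public Encoder<OutT> outputEncoder() {
    return outputEncoder;
  }
}
```

Such an aggregator could then be applied through its toColumn() form, e.g. ds.groupByKey(keyFn, keyEncoder).agg(aggregator.toColumn()) for a per-key combine or ds.select(aggregator.toColumn()) for a global one.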


/**
* Translator for {@link GroupByKey} using {@link Dataset#groupByKey} with the built-in aggregation
* function {@code collect_list} when applicable.
Contributor

Good idea: avoiding materialization like with ReduceFnRunner and using a Spark native function instead is better because it allows Spark to spill to disk instead of throwing an OOM.

Member Author

Unfortunately that's not the case; that's why both cases above are important ... either way there's a risk of OOMs, collect_list is just more efficient.
The alternative is the iterableOnce, though that will throw if users attempt to iterate multiple times.

Contributor

Ouch, I was counting on collect_list. But at least you managed to avoid OOM in some cases compared to the previous impl.

Contributor

@echauchot echauchot left a comment

Finished the GBK review. Thanks for the great work! Only the encoders are left to review.

result =
    input
        .groupBy(col("value.key").as("key"))
        .agg(collect_list(col("value.value")).as("values"), timestampAggregator(tsCombiner))
Contributor

Clever!


Contributor

@echauchot echauchot left a comment

Finished my review; I took a look at the Encoders part. Now I need to take a look at your latest comments / commits and the TPCDS run of at least Q3 (because I don't trust the Nexmark results 100% to be representative of user pipelines), and then we will merge pretty soon.

Thanks a lot for the great work!

@echauchot
Contributor

Alternatively you could run the load tests for combiners and GBK available in sdk/testing; they are per transform.

@mosche
Member Author

mosche commented Sep 22, 2022

@echauchot As said, I don't see the value of spending more time on load tests / benchmarks at this point. Correctness is tested by the VR tests. One thing at a time.

@echauchot
Contributor

echauchot commented Sep 22, 2022

We can run it on Jenkins against this PR, if needed.

@mosche did you manage to run the TPCDS suite on this PR?

Maybe avoiding Jenkins, as it is overloaded, is a good idea; instead we could run either TPCDS Q3 or the combine and GBK load tests and compare the RDD runner and the Dataset runner.

@echauchot
Contributor

@echauchot As said, I don't see the value of spending more time on load tests / benchmarks at this point. Correctness is tested by the VR tests. One thing at a time.

I'll run them

@mosche
Member Author

mosche commented Sep 22, 2022

@echauchot Please focus on what's important ... that's not the scope of this PR anymore!

@mosche
Member Author

mosche commented Sep 22, 2022

Running additional benchmarks makes sense if you plan to take action on them; if not ... what's the point?

@echauchot
Contributor

@echauchot Please focus on what's important ... that's not the scope of this PR anymore!

I disagree: the point of this PR is to improve the performance of the runner. The problem is that it contains only Nexmark performance results. As I wrote, I don't trust the Nexmark test suite, as it showed optimistic results that proved wrong when we ran TPCDS on this runner. So I'd like to verify the performance of the changes. As that is the whole point of the PR, I'm totally focusing on the scope!

@echauchot echauchot merged commit 762edd7 into apache:master Sep 22, 2022
@mosche mosche deleted the 22445-ImprovedSparkStructuredStreamingRunner branch November 9, 2022 12:53