Improved pipeline translation in SparkStructuredStreamingRunner #22446
Conversation
…StreamingRunner (also closes apache#22382)
R: @aromanenko-dev |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
ping @echauchot @aromanenko-dev 😀 |
Run Spark ValidatesRunner |
Run Spark StructuredStreaming ValidatesRunner |
Thanks! Great work, it looks very promising in terms of potential performance improvements.
I gave a brief look at this; quite a lot of code, tbh, which makes it difficult to review in one shot. I just left minor comments. I'd leave it to @echauchot, as the main author of this code, to do the other part of the review.
Also, could you add any perf (or other) test results that you ran?
@echauchot Kind ping :) |
@echauchot @aromanenko-dev Btw, I was thinking a bit about a better name for this runner. I'd suggest renaming it to SparkSQLRunner. |
@mosche I totally agree that the current name is not very practical: it's quite long and, even worse, very confusing since it contains Streaming. So it would be better to rename it, though I'm not sure about SparkSQLRunner; I'd suggest the name SparkDatasetRunner. On the other hand, this renaming will require many incompatible changes, starting from new package and artifact names. However, I'm pretty sure that most users that run Beam pipelines on Spark still use the old classical Spark (RDD) runner. We can check it out on user@ and Twitter, if needed. |
I agree, that leaves room for potential new confusion. Giving it a second thought, I suppose you're right and SparkDatasetRunner is the better choice. Regarding the rename or any other incompatible changes, I'm personally fairly relaxed at this stage. |
@echauchot Will you be able to review this? Otherwise I'd suggest merging it so as not to further block follow-ups. Looking forward to your feedback. |
Run Java PreCommit |
Run Spark ValidatesRunner |
Run Spark StructuredStreaming ValidatesRunner |
I took a glance at this change and it LGTM.
Taking into account that this PR really improves the performance of some transforms when running on Spark (according to Nexmark results), I believe we have to merge it once all tests are green.
I would like to review this before merging but it is very long and I'm stuck on other things. I'll try my best to take a look ASAP. |
I agree, the name needs to change. I also agree with @aromanenko-dev that SparkSQLRunner is confusing. I agree with the proposal of SparkDatasetRunner. |
@mosche reviewing ... |
@mosche: did you rebase this PR on top of the previously merged code about the Encoders? I have the impression it contains the same changes? |
There was no such PR yet @echauchot ... maybe you already had a look at that code on the branch |
oh, I remember ... you mean this one #22157? |
Yes I meant #22157. |
I think you should also run the TPCDS suite on this PR (ask @aromanenko-dev) because when we compared the 2 Spark runners in the past we've seen big differences between the Nexmark and TPCDS suites (Nexmark was slightly in favor of the Dataset runner for some queries but TPCDS was way in favor of the RDD runner for almost all queries). |
We can run it on Jenkins against this PR, if needed. |
@mosche very minor iterative review: I just took a look at Sessions Aggregator. Only minor nits on comments for readability / clarification
Run Spark StructuredStreaming ValidatesRunner |
Run Spark ValidatesRunner |
Run Spark ValidatesRunner |
@mosche did you manage to run the TPCDS suite on this PR? |
I see that Nexmark queries 5 and 7 have improved quite a lot. They are mainly based on combiners and windows. Nice! |
Partial review: for now I have looked at
- the general architecture
- the aggregators
- the combiner translations (globally and per key)
- the source
- started GBK
This looks very good to me.
I need to finish taking a look at the GBK and the Encoders and we could merge if it is all good. The changes are minor (except the one on triggers in batch)
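For context on what the combiner translations above have to preserve: Beam combiners follow a create accumulator / add input / merge accumulators / extract output lifecycle, which a Spark Aggregator must mirror. A minimal pure-Python sketch of combine-per-key semantics (MeanCombiner and combine_per_key are hypothetical stand-ins for illustration, not code from this PR):

```python
# Minimal sketch of the accumulator lifecycle a combiner translation must
# preserve. MeanCombiner is a hypothetical stand-in, not code from this PR.
class MeanCombiner:
    def create_accumulator(self):
        return (0.0, 0)  # (sum, count)

    def add_input(self, acc, value):
        return (acc[0] + value, acc[1] + 1)

    def merge_accumulators(self, accs):
        # A distributed runner merges partial accumulators from workers.
        return (sum(a[0] for a in accs), sum(a[1] for a in accs))

    def extract_output(self, acc):
        return acc[0] / acc[1] if acc[1] else float("nan")


def combine_per_key(pairs, combiner):
    # Group values by key and run the full lifecycle per key.
    accs = {}
    for key, value in pairs:
        acc = accs.get(key, combiner.create_accumulator())
        accs[key] = combiner.add_input(acc, value)
    return {k: combiner.extract_output(a) for k, a in accs.items()}


print(combine_per_key([("a", 1), ("a", 3), ("b", 4)], MeanCombiner()))
# {'a': 2.0, 'b': 4.0}
```

The method names match Beam's CombineFn contract; the payoff of translating this to a native Spark Aggregator is that Spark can apply partial aggregation before the shuffle.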
/**
 * Translator for {@link GroupByKey} using {@link Dataset#groupByKey} with the built-in aggregation
 * function {@code collect_list} when applicable.
 */
Good idea: avoiding materialization like with ReduceFnRunner and using a Spark native aggregation instead is better because it allows Spark to spill to disk instead of throwing an OOM.
unfortunately that's not the case, which is why the note above is important... either way there's a risk of OOMs, collect_list is just more efficient...
the alternative is the iterableOnce variant, though that will throw if users attempt to iterate multiple times
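To illustrate the trade-off being discussed, here is a pure-Python sketch of an "iterate once" iterable: values are streamed without being materialized up front, but a second traversal throws instead of silently replaying (IterableOnce is a hypothetical helper for illustration, not the runner's actual class):

```python
# Sketch of an iterate-once iterable: avoids materializing the whole group
# in memory, at the cost of failing on repeated iteration.
# Hypothetical helper, not code from this PR.
class IterableOnce:
    def __init__(self, iterator):
        self._iterator = iter(iterator)
        self._consumed = False

    def __iter__(self):
        if self._consumed:
            raise RuntimeError("this Iterable may only be iterated once")
        self._consumed = True
        return self._iterator


values = IterableOnce(x * x for x in range(4))
print(list(values))  # first pass works; a second list(values) would raise
```

This mirrors the documented behavior the comment warns about: users whose DoFns iterate a grouped value twice would break, which is why collect_list (materialized but more efficient) was kept as the default where applicable.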
Ouch, I was hoping for collect_list. But at least you managed to avoid OOMs in some cases compared to the previous impl.
Finished the GBK review. Thanks for the great work! Only the encoders are left to review.
result =
    input
        .groupBy(col("value.key").as("key"))
        .agg(collect_list(col("value.value")).as("values"), timestampAggregator(tsCombiner))
clever !
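In plain Python terms, the Spark expression above groups by key, collects each group's values into a list, and reduces the group's timestamps with a combiner. The sketch below uses max only as an illustrative default; the actual tsCombiner depends on the pipeline's windowing strategy:

```python
from collections import defaultdict

# Plain-Python paraphrase of
#   groupBy(key).agg(collect_list(value), <timestamp combiner>)
# ts_combiner=max is an illustrative default, not what the runner always uses.
def group_and_collect(rows, ts_combiner=max):
    values = defaultdict(list)
    timestamps = defaultdict(list)
    for key, value, ts in rows:
        values[key].append(value)
        timestamps[key].append(ts)
    return {k: (vals, ts_combiner(timestamps[k])) for k, vals in values.items()}


rows = [("a", "x", 1), ("a", "y", 5), ("b", "z", 3)]
print(group_and_collect(rows))
# {'a': (['x', 'y'], 5), 'b': (['z'], 3)}
```

The point of expressing this with built-in Spark SQL functions rather than a custom aggregator is that Catalyst can plan it natively, including partial aggregation and spilling.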
Finished my review; I took a look at the Encoders part. Now I need to take a look at your latest comments / commits and the TPCDS run of at least Q3 (because I don't trust Nexmark results 100% to be representative of user pipelines), and then we will merge pretty soon.
Thanks a lot for the great work!
Alternatively you could run the load tests for combiners and GBK available in sdk/testing; they are per-transform. |
@echauchot As said, I don't see the value of spending more time on load tests / benchmarks at this point. Correctness is tested by the VR tests. One thing at a time. |
Maybe avoiding Jenkins, as it is overloaded, is a good idea: run either TPCDS Q3 or the combine and GBK load tests, and compare the RDD runner and the Dataset runner. |
I'll run them |
@echauchot Please focus on what's important ... that's not the scope of this PR anymore! |
Running additional benchmarks makes sense if you plan to take action; if not ... what's the point? |
I disagree: the point of this PR is to improve the performance of the runner. The problem is that it contains only Nexmark performance results. As I wrote, I don't trust the Nexmark test suite as it showed optimistic results that proved wrong when we ran TPCDS on this runner. So I'd like to verify the performance of the changes. As that is the whole point of the PR, I'm totally focused on the scope! |
Improved pipeline translation in SparkStructuredStreamingRunner (closes #22445, #22382):
- Use Encoders to leverage structural information in translation (and potentially benefit from the Catalyst optimizer). Though note, the possible benefit is limited as every ParDo is a black box and a hard boundary for anything that could be optimized.
- Improved GroupByKey. When applicable, group also by window to better scale out and/or use Spark native collect_list to collect the values of a group.
- Aggregators for combine (per key / globally); particularly Sessions can be improved significantly.
- Combine.Globally to avoid an additional shuffle of data.
- Improved BoundedSource.