[flink-runner] Improve Datastream for batch performances #32440
@kennknowles would you mind taking a look at this one?
R: @kennknowles
To test this thoroughly, let us add some of the postcommits by touching trigger files. In #32648 you can see how I edited the JSON files (including some new ones) and I think these are all the Flink-specific postcommit jobs.
@kennknowles done. I also rebased on master, but some of the tests seem to be quite flaky now. There are tests failing on things I did not touch (direct runner), and the Flink tests that are failing here are not failing on my machine... Any idea how I could make them work?
I opened jto#236 with some more trigger files. The "PVR" in the trigger file names stands for "Portable Validates Runner", which isn't as directly impacted. I think the non-portable ValidatesRunner tests should test that the runner still complies with the model and passes the basic tests.
Thanks! I just merged it.
Hey there! I rebased master into my branch, but a few tests are failing:
In beam_PreCommit_Java (Run Java PreCommit)
In beam_PostCommit_Java_ValidatesRunner_Flink (Run Flink ValidatesRunner)
In PostCommit Go VR Flink / beam_PostCommit_Go_VR_Flink (Run Go Flink ValidatesRunner)
Logs are truncated. I don't know if there's an actual failure or what it might be...
Hey @kennknowles! The Python PostCommits are failing but the error is:
which I think is unrelated to those changes and I could not find anything in the logs suggesting otherwise.
Hey @kennknowles @jto does this PR have any next steps?
I agree with your change to sickbay that Lifecycle test. It sounds like it is incorrectly specified, not taking into account allowable runner variation. Sorry for the delay which has led to conflicts - if you resolve them I think this is good to merge.
Context
Flink will drop support for the DataSet API in 2.0, which should be released by EOY, so it is quite important for Beam to support DataStream well.
The PR
This PR improves the performance of batch jobs executed with --useDatastreamForBatch by porting the following performance optimizations already present in FlinkBatchTransformTranslators but lacking in FlinkStreamingTransformTranslators.
It also implements the following optimizations:
- [...] maxParallelism to parallelism, as the total number of key groups is equal to maxParallelism. Again, this reduces skew.
- [...] ToKeyedWorkItem part of DoFnOperator, which reduces the size of the job graph and avoids unnecessary inter-task communication.
- [...] GBK -> map -> CombinePerKey). Add a flag to control this feature (defaults to active).
Benchmarks
The patched version was tested against a few of Spotify's production batch workflows. All settings were left unchanged except for the following:
- --useDatastreamForBatch=true
- jobmanager.scheduler: default (otherwise DataStream defaults to the adaptive scheduler).
Note
Job 3 fails with a stack overflow exception because of a bug in versions of Kryo < 3.0. I believe this is because the job uses taskmanager.runtime.large-record-handler: true, and it should be fixed in Flink 2.0 since Kryo is upgraded to a more recent version.
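As background for the maxParallelism point in the PR description, here is a minimal sketch of why a mismatch between maxParallelism and parallelism can cause skew. It assumes Flink's usual key-group-to-subtask mapping (key group index times parallelism, integer-divided by maxParallelism) and ignores the hashing step that maps a key to a key group. The function names and the values 128 and 100 are illustrative only and are not taken from the PR.

```python
from collections import Counter


def operator_index_for_key_group(max_parallelism, parallelism, key_group):
    # Assumed mapping from a key group to the subtask that owns it.
    return key_group * parallelism // max_parallelism


def key_groups_per_subtask(max_parallelism, parallelism):
    # Count how many key groups each subtask ends up owning.
    counts = Counter(
        operator_index_for_key_group(max_parallelism, parallelism, kg)
        for kg in range(max_parallelism)
    )
    return [counts.get(i, 0) for i in range(parallelism)]


# Hypothetical setup: 128 key groups spread over 100 subtasks.
uneven = key_groups_per_subtask(128, 100)
# Same parallelism, with maxParallelism set equal to it.
even = key_groups_per_subtask(100, 100)

print(min(uneven), max(uneven))  # 1 2: some subtasks own twice as many key groups
print(min(even), max(even))      # 1 1: every subtask owns exactly one key group
```

Under these assumptions, with evenly distributed keys, the subtasks that own two key groups receive roughly twice the data of the others, whereas equal values give every subtask the same share.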