
[Improvement]: Improved pipeline translation in experimental SparkStructuredStreamingRunner #22445

Closed
mosche opened this issue Jul 26, 2022 · 0 comments · Fixed by #22446

What would you like to happen?

Currently, (batch) pipeline translation in the SparkStructuredStreamingRunner is rather simple and not optimized in any way. The following optimizations should significantly improve the performance of the experimental runner.

  • Make use of Spark Encoders to leverage structural information during translation (and potentially benefit from the Catalyst optimizer). Note, however, that the possible benefit is limited: every ParDo is a black box and therefore a hard boundary for anything that could be optimized.
  • Improve the translation of GroupByKey. Where applicable, also group by window to scale out better, and/or use Spark's native collect_list to collect the values of a group.
  • Make use of specialised Spark Aggregators for Combine (per key / globally); Sessions in particular can be improved significantly.
  • Add a dedicated translation for Combine.Globally to avoid an additional shuffle of the data.
  • Remove the additional serialization roundtrip when reading from a Beam BoundedSource.
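The GroupByKey point above can be illustrated outside Spark: grouping by a composite (key, window) pair splits a hot key into one group per window, which lets the work spread across more tasks. The snippet below is a plain-Java sketch of that composite-key grouping; the `WindowedValue` record, `fixedWindow` helper, and class name are simplified stand-ins for illustration, not Beam or Spark APIs.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WindowedGrouping {
    // Simplified stand-in for a windowed element: (key, value, windowStart).
    record WindowedValue(String key, int value, long windowStart) {}

    // Assign a timestamp to a fixed window of the given size.
    static long fixedWindow(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Group by the composite (key, window) pair and collect each group's
    // values into a list -- the plain-Java analogue of Spark's
    // groupBy(key, window).agg(collect_list(value)).
    static Map<List<Object>, List<Integer>> groupByKeyAndWindow(List<WindowedValue> input) {
        return input.stream().collect(Collectors.groupingBy(
                wv -> List.<Object>of(wv.key(), wv.windowStart()),
                Collectors.mapping(WindowedValue::value, Collectors.toList())));
    }

    public static void main(String[] args) {
        long size = 60_000; // one-minute fixed windows
        List<WindowedValue> input = List.of(
                new WindowedValue("a", 1, fixedWindow(5_000, size)),
                new WindowedValue("a", 2, fixedWindow(7_000, size)),
                new WindowedValue("a", 3, fixedWindow(65_000, size)));
        // Key "a" splits into two groups, one per window.
        System.out.println(groupByKeyAndWindow(input).size()); // 2
    }
}
```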
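The Combine-related points rely on the same accumulator contract that both Beam's CombineFn and Spark's Aggregator follow: an empty accumulator, a reduce step per element, a merge of partial accumulators, and a final extraction. Reducing per partition and shipping only the small accumulators is what avoids shuffling raw values. A minimal plain-Java sketch of that contract for a global sum (class and method names are illustrative, not actual runner code):

```java
import java.util.List;

public class SumAggregator {
    // zero: the empty accumulator.
    static long zero() { return 0L; }
    // reduce: fold one input element into an accumulator.
    static long reduce(long acc, int element) { return acc + element; }
    // merge: combine two partial accumulators (e.g. one per partition).
    static long merge(long a, long b) { return a + b; }
    // finish: extract the final result from the accumulator.
    static long finish(long acc) { return acc; }

    // Simulate two partitions being reduced independently and then merged --
    // only the two long accumulators cross partition boundaries, not the
    // raw elements.
    static long combineGlobally(List<Integer> partition1, List<Integer> partition2) {
        long acc1 = zero();
        for (int e : partition1) acc1 = reduce(acc1, e);
        long acc2 = zero();
        for (int e : partition2) acc2 = reduce(acc2, e);
        return finish(merge(acc1, acc2));
    }

    public static void main(String[] args) {
        System.out.println(combineGlobally(List.of(1, 2, 3), List.of(4, 5))); // 15
    }
}
```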

Issue Priority

Priority: 2

Issue Component

Component: runner-spark
