
[Improvement]: Improved pipeline translation in experimental SparkStructuredStreamingRunner #22445

Closed
mosche opened this issue Jul 26, 2022 · 0 comments · Fixed by #22446

What would you like to happen?

Currently, (batch) pipeline translation in the SparkStructuredStreamingRunner is rather simple and not optimized in any way. The following optimizations should significantly improve the performance of the experimental runner.

  • Make use of Spark Encoders to leverage structural information during translation (and potentially benefit from the Catalyst optimizer). Note, however, that the possible benefit is limited: every ParDo is a black box and therefore a hard boundary for anything that could be optimized.
  • Improve the translation of GroupByKey. Where applicable, also group by window to scale out better, and/or use Spark's native collect_list to collect the values of a group.
  • Make use of specialised Spark Aggregators for Combine (per key / globally); Sessions in particular can be improved significantly.
  • Add a dedicated translation for Combine.Globally to avoid an additional shuffle of the data.
  • Remove the additional serialization roundtrip when reading from a Beam BoundedSource.
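The GroupByKey point above can be illustrated outside Spark: grouping by a composite (key, window) pair splits a hot key into one group per window, which lets the work spread across more tasks. The snippet below is a plain-Java sketch of that composite-key grouping; the `WindowedValue` record, `fixedWindow` helper, and class name are simplified stand-ins for illustration, not Beam or Spark APIs.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WindowedGrouping {
    // Simplified stand-in for a windowed element: (key, value, windowStart).
    record WindowedValue(String key, int value, long windowStart) {}

    // Assign a timestamp to a fixed window of the given size.
    static long fixedWindow(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Group by the composite (key, window) pair and collect each group's
    // values into a list -- the plain-Java analogue of Spark's
    // groupBy(key, window).agg(collect_list(value)).
    static Map<List<Object>, List<Integer>> groupByKeyAndWindow(List<WindowedValue> input) {
        return input.stream().collect(Collectors.groupingBy(
                wv -> List.<Object>of(wv.key(), wv.windowStart()),
                Collectors.mapping(WindowedValue::value, Collectors.toList())));
    }

    public static void main(String[] args) {
        long size = 60_000; // one-minute fixed windows
        List<WindowedValue> input = List.of(
                new WindowedValue("a", 1, fixedWindow(5_000, size)),
                new WindowedValue("a", 2, fixedWindow(7_000, size)),
                new WindowedValue("a", 3, fixedWindow(65_000, size)));
        // Key "a" splits into two groups, one per window.
        System.out.println(groupByKeyAndWindow(input).size()); // 2
    }
}
```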
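The Combine-related points rely on the same accumulator contract that both Beam's CombineFn and Spark's Aggregator follow: an empty accumulator, a reduce step per element, a merge of partial accumulators, and a final extraction. Reducing per partition and shipping only the small accumulators is what avoids shuffling raw values. A minimal plain-Java sketch of that contract for a global sum (class and method names are illustrative, not actual runner code):

```java
import java.util.List;

public class SumAggregator {
    // zero: the empty accumulator.
    static long zero() { return 0L; }
    // reduce: fold one input element into an accumulator.
    static long reduce(long acc, int element) { return acc + element; }
    // merge: combine two partial accumulators (e.g. one per partition).
    static long merge(long a, long b) { return a + b; }
    // finish: extract the final result from the accumulator.
    static long finish(long acc) { return acc; }

    // Simulate two partitions being reduced independently and then merged --
    // only the two long accumulators cross partition boundaries, not the
    // raw elements.
    static long combineGlobally(List<Integer> partition1, List<Integer> partition2) {
        long acc1 = zero();
        for (int e : partition1) acc1 = reduce(acc1, e);
        long acc2 = zero();
        for (int e : partition2) acc2 = reduce(acc2, e);
        return finish(merge(acc1, acc2));
    }

    public static void main(String[] args) {
        System.out.println(combineGlobally(List.of(1, 2, 3), List.of(4, 5))); // 15
    }
}
```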

Issue Priority

Priority: 2

Issue Component

Component: runner-spark
