Currently, (batch) pipeline translation in the SparkStructuredStreamingRunner is rather simple and not optimized in any way. The following optimizations should significantly improve the performance of the experimental runner.
Make use of Spark Encoders to leverage structural information during translation (and potentially benefit from the Catalyst optimizer). Note, however, that the possible benefit is limited: every ParDo is a black box and a hard boundary for anything that could be optimized.
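As a rough plain-Python illustration (not Beam or Spark code; the record and field names are made up), the point of structural encoders is what the engine can see. An opaquely serialized element is just bytes, while a structurally encoded one exposes addressable fields that an optimizer could prune or project:

```python
# Hypothetical sketch: opaque serialization vs. structural encoding.
import pickle

record = {"user": "alice", "clicks": 3, "payload": "x" * 1000}

# Opaque encoding: the engine sees only bytes and must deserialize the whole
# element even if a query touches just one field.
blob = pickle.dumps(record)

# Structural encoding: fields become addressable columns, so an optimizer
# could read one column without touching the large payload.
row = (record["user"], record["clicks"], record["payload"])
projected = row[1]  # "project" a single column

assert pickle.loads(blob)["clicks"] == projected
```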
Improve the translation of GroupByKey. When applicable, also group by window to scale out better, and/or use Spark's native collect_list to collect the values of a group.
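The group-by-window idea can be sketched in plain Python (not runner code; the sample data is made up): widening the grouping key from key to (key, window) splits a hot key into more, smaller groups, which is what gives the extra parallelism.

```python
# Hypothetical sketch: grouping by key vs. by (key, window).
from collections import defaultdict

# (key, window, value) triples, as they might arrive as windowed values
elements = [
    ("user1", 0, "a"), ("user1", 0, "b"),
    ("user1", 1, "c"),
    ("user2", 0, "d"),
]

# Grouping by key only: one group per key, all windows mixed together
by_key = defaultdict(list)
for key, window, value in elements:
    by_key[key].append(value)

# Grouping by (key, window): finer-grained groups, as collect_list over a
# groupBy(key, window) would produce; a skewed key spreads across groups
by_key_window = defaultdict(list)
for key, window, value in elements:
    by_key_window[(key, window)].append(value)
```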
Make use of specialized Spark Aggregators for Combine (per key / globally); in particular, Sessions can be improved significantly.
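Spark's Aggregator contract (zero / reduce / merge / finish) can be illustrated independently of Spark; the class below is a plain-Python stand-in computing a mean, not actual runner code. The key property is that partitions reduce into small partial accumulators that are merged once, instead of shuffling every raw value:

```python
# Hypothetical plain-Python sketch of the Aggregator contract.
class MeanAggregator:
    def zero(self):
        return (0, 0)  # (sum, count) accumulator

    def reduce(self, acc, value):
        return (acc[0] + value, acc[1] + 1)

    def merge(self, a, b):
        # combine partial accumulators from different partitions
        return (a[0] + b[0], a[1] + b[1])

    def finish(self, acc):
        return acc[0] / acc[1]

agg = MeanAggregator()
# Two "partitions" reduced independently, then merged once
p1 = agg.reduce(agg.reduce(agg.zero(), 2), 4)
p2 = agg.reduce(agg.zero(), 6)
result = agg.finish(agg.merge(p1, p2))  # (12, 3) -> 4.0
```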
Add a dedicated translation for Combine.Globally to avoid an additional shuffle of the data.
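Why a dedicated translation saves a shuffle can be shown with a minimal stdlib sketch (the partition layout is made up): a generic translation effectively ships every element to one place before combining, while a dedicated one pre-combines each partition locally and only moves the small partial results.

```python
# Hypothetical sketch: global combine with vs. without pre-combined partials.
partitions = [[1, 2, 3], [4, 5], [6]]

# Generic translation: move all raw values together, then combine (full shuffle)
shuffled = [v for part in partitions for v in part]
naive = sum(shuffled)

# Dedicated translation: each partition combines locally; only the partials
# (one small value per partition) are sent on to be merged
partials = [sum(part) for part in partitions]
optimized = sum(partials)

assert naive == optimized == 21
```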
Remove the additional serialization roundtrip when reading from a Beam BoundedSource.
Issue Priority
Priority: 2
Issue Component
Component: runner-spark