
Optimize Datastream for batch #31950

Closed

wants to merge 38 commits
Conversation

@jto (Contributor) commented on Jul 23, 2024

This PR contains several optimizations for the Datastream API when used in batch mode. In its current state, using Datastream for batch is much slower than Dataset; this is an attempt to bring it to the same performance level.

The following optimizations are implemented:

Apply the same optimization as #28045 to Datastream.

It has the same benefits with Datastream as with Dataset (up to a 20% speedup).
I discovered the patch was also necessary for Datastream while migrating existing workflows from Dataset to Datastream by passing --useDataStreamForBatch.

Use a lazy enumerator for bounded IO reads

The existing enumerator eagerly distributes splits to workers. When splits are not all equal in size, this eager distribution causes a lot of skew. The new implementation mimics the behaviour of Flink's StaticFileSplitEnumerator, where splits are lazily distributed to workers as they are consumed, which results in better load balancing.
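
A sketch of the lazy-assignment idea, modeled on Flink's SplitEnumerator interface (the class name and the one-split-per-request policy are illustrative, not necessarily this PR's exact code):

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import javax.annotation.Nullable;
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.api.connector.source.SplitEnumerator;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;

class LazySplitEnumerator<SplitT extends SourceSplit>
    implements SplitEnumerator<SplitT, Void> {

  private final SplitEnumeratorContext<SplitT> context;
  private final Queue<SplitT> pendingSplits;

  LazySplitEnumerator(SplitEnumeratorContext<SplitT> context, List<SplitT> splits) {
    this.context = context;
    this.pendingSplits = new ArrayDeque<>(splits);
  }

  @Override
  public void handleSplitRequest(int subtaskId, @Nullable String requesterHostname) {
    // One split per request: fast workers come back sooner and get more
    // splits, so uneven split sizes no longer translate into skew.
    SplitT split = pendingSplits.poll();
    if (split != null) {
      context.assignSplit(split, subtaskId);
    } else {
      context.signalNoMoreSplits(subtaskId);
    }
  }

  @Override
  public void addSplitsBack(List<SplitT> splits, int subtaskId) {
    // Splits of a failed reader go back into the pool.
    pendingSplits.addAll(splits);
  }

  @Override
  public void start() {}

  @Override
  public void addReader(int subtaskId) {}

  @Override
  public Void snapshotState(long checkpointId) {
    return null;
  }

  @Override
  public void close() {}
}
```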

Set the serializer on Bounded reads.

For some reason, the serializer was not set on Bounded reads.
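
A sketch of the shape of the fix, assuming the Flink runner's CoderTypeInformation bridge and a FLIP-27 style source (the helper method and parameter names are illustrative, not the PR's exact diff):

```java
import org.apache.beam.runners.flink.translation.types.CoderTypeInformation;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.Source;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

class BoundedReadTranslation {
  static <T> DataStream<WindowedValue<T>> translateBoundedRead(
      StreamExecutionEnvironment env,
      Source<WindowedValue<T>, ?, ?> flinkSource,
      Coder<WindowedValue<T>> windowedValueCoder,
      PipelineOptions options,
      String name) {
    return env
        .fromSource(flinkSource, WatermarkStrategy.noWatermarks(), name)
        // Without this, the stream's TypeInformation does not carry the
        // Beam coder and Flink falls back to slower generic serialization.
        .returns(new CoderTypeInformation<>(windowedValueCoder, options));
  }
}
```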

TODO

Fix BQ IO issue

BQ writes do not behave the same with Datastream, and garbage collection is much slower. With Dataset, the IO creates one temp file per worker; with Datastream it creates many more files (roughly 20x).

Fix double encoding of window in GBK and CombinePerKey

Before the shuffle, KVs are converted to KeyedWorkItem; however, the actual stream type is:
DataStream<WindowedValue<KeyedWorkItem<K, byte[]>>>
Both KeyedWorkItem and WindowedValue serialize the window. Since the conversion happens before keyBy (the shuffle), the duplication directly results in network overhead.
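
For illustration, this is roughly how the coders nest (reconstructed from the runner's public coder APIs, not this PR's diff): the outer FullWindowedValueCoder writes the windows once, and KeyedWorkItemCoder takes the window coder again to encode the windowed elements it carries, so each shuffled record encodes its window twice.

```java
import org.apache.beam.runners.core.KeyedWorkItem;
import org.apache.beam.runners.core.KeyedWorkItemCoder;
import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.util.WindowedValue;

class DoubleWindowEncoding {
  static <K> Coder<WindowedValue<KeyedWorkItem<K, byte[]>>> streamCoder(
      Coder<K> keyCoder, Coder<? extends BoundedWindow> windowCoder) {
    KeyedWorkItemCoder<K, byte[]> workItemCoder =
        KeyedWorkItemCoder.of(keyCoder, ByteArrayCoder.of(), windowCoder); // windows, per element
    return WindowedValue.getFullCoder(workItemCoder, windowCoder); // windows, again
  }
}
```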

I tried the simplest fix: move the conversion after keyBy. But WindowDoFnOperator needs the stream it transforms to be keyed, so turning ToBinaryKeyedWorkItem -> keyBy -> transform(doFnOperator) into keyBy -> ToBinaryKeyedWorkItem -> transform(doFnOperator) is not possible.

I also tried a similar fix using reinterpretAsKeyedStream to avoid this problem. The chain becomes ToBinaryKV -> keyBy -> ToKeyedWorkItem -> reinterpretAsKeyedStream -> transform(doFnOperator), but reinterpretAsKeyedStream breaks operator chaining between ToKeyedWorkItem and the following operator, which degrades performance even more.
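
For reference, a sketch of that attempted chain using Flink's DataStreamUtils.reinterpretAsKeyedStream (the types and helper parameters are illustrative stand-ins for the runner's actual operators):

```java
import java.nio.ByteBuffer;
import org.apache.beam.runners.core.KeyedWorkItem;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.beam.sdk.values.KV;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

class ReinterpretAttempt {
  static <K> KeyedStream<WindowedValue<KeyedWorkItem<K, byte[]>>, ByteBuffer> toKeyedWorkItems(
      DataStream<WindowedValue<KV<K, byte[]>>> binaryKvs,
      KeySelector<WindowedValue<KV<K, byte[]>>, ByteBuffer> kvKeySelector,
      MapFunction<WindowedValue<KV<K, byte[]>>, WindowedValue<KeyedWorkItem<K, byte[]>>>
          toKeyedWorkItem,
      KeySelector<WindowedValue<KeyedWorkItem<K, byte[]>>, ByteBuffer> workItemKeySelector) {
    DataStream<WindowedValue<KeyedWorkItem<K, byte[]>>> workItems =
        binaryKvs
            .keyBy(kvKeySelector) // the single real shuffle
            .map(toKeyedWorkItem); // conversion now happens after the shuffle
    // Re-declare the stream as keyed without a second shuffle. Functionally
    // fine, but it breaks operator chaining between the map above and the
    // downstream WindowDoFnOperator, costing more than the saved encoding.
    return DataStreamUtils.reinterpretAsKeyedStream(workItems, workItemKeySelector);
  }
}
```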

The best fix would be to not need KeyedWorkItem at all, but that would be a large change in Beam.

Missing pre-shuffle combine on the reduce operator

The Dataset translation turns a reduce into partial reduce -> shuffle -> reduce. The Datastream translator is missing this optimization, which makes reduce operations much slower.
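
A sketch of the two phases of this combiner lifting, expressed against Beam's CombineFn contract (illustrative helpers, not the Dataset translator's actual operators): each worker folds its bundle into per-key accumulators before the shuffle, so only accumulators cross the network.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.transforms.Combine.CombineFn;
import org.apache.beam.sdk.values.KV;

class CombinerLifting {
  // Pre-shuffle phase, run on each worker: one accumulator per key per
  // bundle instead of one record per input element.
  static <K, InputT, AccumT, OutputT> Map<K, AccumT> partialCombine(
      Iterable<KV<K, InputT>> bundle, CombineFn<InputT, AccumT, OutputT> fn) {
    Map<K, AccumT> accumulators = new HashMap<>();
    for (KV<K, InputT> element : bundle) {
      AccumT acc = accumulators.computeIfAbsent(element.getKey(), k -> fn.createAccumulator());
      accumulators.put(element.getKey(), fn.addInput(acc, element.getValue()));
    }
    return accumulators;
  }

  // Post-shuffle phase: merge the per-worker accumulators for a key and
  // extract the final value.
  static <K, InputT, AccumT, OutputT> OutputT finalCombine(
      Iterable<AccumT> shuffledAccumulators, CombineFn<InputT, AccumT, OutputT> fn) {
    return fn.extractOutput(fn.mergeAccumulators(shuffledAccumulators));
  }
}
```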



@jto force-pushed the julient/patched-2.56 branch from 568e44c to 9641b2a on July 24, 2024
@jto changed the title from "Reduce the maximum size of input splits in Flink to better distribute work in Datastream API" to "Optimize Datastream API for batch" on Jul 26, 2024
@jto changed the title to "Optimize Datastream for batch" on Jul 26, 2024
@jto changed the base branch from master to release-2.56.0 on July 26, 2024
@github-actions bot added the core label on Aug 20, 2024
@jto closed this on Aug 30, 2024