`CoGroupByKey`, when run on the Flink runner with the default settings, throws `IllegalStateException: GBK result is not re-iterable.` when trying to iterate over any of the `CoGbkResult` iterables. This only happens on large collections, when Beam doesn't load the CoGBK result into memory.
Using `--reIterableGroupByKeyResult`, as suggested by the error message, isn't always an option, since it requires the collection to fit in memory, which isn't always the case.
My preliminary investigation suggests that the bug could have been introduced by #30851. The problem is that the result of GBK with the Flink runner is once-iterable, meaning that the `.iterator()` method can only be called once for the GBK value. However, it is currently called twice in `CoGbkResult`.
The first time it happens when checking whether the result fits in memory.
The second time it happens when creating an instance of RecordingFilteringIterator.
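To make the failure mode concrete, here is a minimal sketch in plain Java with no Beam dependency. `OnceIterable` is a hypothetical stand-in for a once-iterable GBK value, not a Beam class, and `secondIteratorCallFails` only mimics the "probe, then iterate again" pattern described above.

```java
import java.util.Iterator;
import java.util.List;

public class OnceIterableDemo {
    /** Hypothetical stand-in for a GBK value that can only be iterated once. */
    static final class OnceIterable<T> implements Iterable<T> {
        private final Iterator<T> source;
        private boolean consumed = false;

        OnceIterable(Iterator<T> source) {
            this.source = source;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                // Same message as in the report; the real check lives in the runner.
                throw new IllegalStateException("GBK result is not re-iterable.");
            }
            consumed = true;
            return source;
        }
    }

    /** Returns true if a second iterator() call on the same value fails. */
    static <T> boolean secondIteratorCallFails(Iterable<T> values) {
        values.iterator(); // first call: e.g. checking whether the result fits in memory
        try {
            values.iterator(); // second call: e.g. building a lazy iterator afterwards
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Iterable<String> once = new OnceIterable<>(List.of("a", "b", "c").iterator());
        System.out.println(secondIteratorCallFails(once));
    }
}
```

An ordinary re-iterable collection such as a `List` passes through the same method without an exception, which would be consistent with the failure showing up only when the values are not held in memory.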
A possible fix would be to amend `RecordingFilteringIterator` so that it accepts an `Iterator`, not an `Iterable`. With this approach, the elements already retrieved from the iterator (during the attempt to load everything into memory) must also be accounted for, possibly by using `iteratorChain` or something similar.
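The idea of the proposed fix can be sketched as follows, again in plain Java and without touching Beam internals. `ChainedIterator`, `iterateOnce`, and `inMemoryLimit` are hypothetical names introduced only for illustration. The sketch calls `iterator()` exactly once, buffers up to a limit, and then chains the buffered elements with whatever is left in the same iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ChainedIteratorSketch {
    /** Yields all elements of head first, then all remaining elements of tail. */
    static final class ChainedIterator<T> implements Iterator<T> {
        private final Iterator<T> head;
        private final Iterator<T> tail;

        ChainedIterator(Iterator<T> head, Iterator<T> tail) {
            this.head = head;
            this.tail = tail;
        }

        @Override
        public boolean hasNext() {
            return head.hasNext() || tail.hasNext();
        }

        @Override
        public T next() {
            // Falls through to tail.next(), which throws NoSuchElementException when exhausted.
            return head.hasNext() ? head.next() : tail.next();
        }
    }

    /**
     * Calls values.iterator() exactly once. Up to inMemoryLimit elements are
     * buffered (the "does it fit in memory" probe); the returned iterator
     * replays the buffer and then continues with the same underlying iterator.
     */
    static <T> Iterator<T> iterateOnce(Iterable<T> values, int inMemoryLimit) {
        Iterator<T> it = values.iterator();
        List<T> buffered = new ArrayList<>();
        while (buffered.size() < inMemoryLimit && it.hasNext()) {
            buffered.add(it.next());
        }
        return new ChainedIterator<>(buffered.iterator(), it);
    }

    /** Collects every remaining element of an iterator into a list. */
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<Integer> it = iterateOnce(List.of(1, 2, 3, 4, 5), 2);
        System.out.println(drain(it));
    }
}
```

A library utility such as Guava's `Iterators.concat` could replace the hand-written `ChainedIterator`. Whether this maps cleanly onto `RecordingFilteringIterator` depends on details of `CoGbkResult` that are not covered here.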
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Infrastructure
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
What happened?
`CoGroupByKey`, when run on the Flink runner with the default settings, throws `IllegalStateException: GBK result is not re-iterable.` when trying to iterate over any of the `CoGbkResult` iterables. This only happens on large collections, when Beam doesn't load the CoGBK result into memory.
Minimum reproduction (I used Kotlin for readability and the project size): https://github.com/a-kazakov/beam_flink_failure_repro/blob/main/src/main/kotlin/Main.kt. This example creates two collections of 20K KV pairs each, all elements with the same key, and then runs CoGBK on them. It runs fine on `DirectRunner`, but fails on `FlinkRunner`.
Using `--reIterableGroupByKeyResult`, as suggested by the error message, isn't always an option, since it requires the collection to fit in memory, which isn't always the case.
My preliminary investigation suggests that the bug could have been introduced by #30851. The problem is that the result of GBK with the Flink runner is once-iterable, meaning that the `.iterator()` method can only be called once for the GBK value. However, it is currently called twice in `CoGbkResult`.
The first time it happens when checking whether the result fits in memory.
The second time it happens when creating an instance of `RecordingFilteringIterator`.
A possible fix would be to amend `RecordingFilteringIterator` so that it accepts an `Iterator`, not an `Iterable`. With this approach, the elements already retrieved from the iterator (during the attempt to load everything into memory) must also be accounted for, possibly by using `iteratorChain` or something similar.