We have identified a race condition when setting up goka tables. The investigation began with the error `That topic/partition is already being consumed`. The error was confirmed on versions `1.1.2`, `1.1.5` and `1.1.6`.

`sarama` returns this error when a consumer tries to consume a topic/partition before the previous partition consumer has been removed (see sarama's `consumer.addChild` and `consumer.removeChild`).
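To make the invariant concrete, here is a simplified model of that bookkeeping (a sketch for illustration only, not sarama's actual code): the consumer keeps one child per topic/partition, registers it on `ConsumePartition`, and removes it only once the child has fully shut down; registering a second child for the same topic/partition fails while the old one is still present.

```go
package example

import (
	"errors"
	"sync"
)

// Illustration only: a consumer tracks one child per topic/partition.
// The real logic lives in sarama's consumer.addChild/removeChild.
type child struct {
	topic     string
	partition int32
}

type consumer struct {
	lock     sync.Mutex
	children map[string]map[int32]*child
}

var errAlreadyConsumed = errors.New("That topic/partition is already being consumed")

func (c *consumer) addChild(ch *child) error {
	c.lock.Lock()
	defer c.lock.Unlock()
	if c.children == nil {
		c.children = map[string]map[int32]*child{}
	}
	byPartition := c.children[ch.topic]
	if byPartition == nil {
		byPartition = map[int32]*child{}
		c.children[ch.topic] = byPartition
	}
	if byPartition[ch.partition] != nil {
		// The previous child for this topic/partition has not been removed yet.
		return errAlreadyConsumed
	}
	byPartition[ch.partition] = ch
	return nil
}

func (c *consumer) removeChild(ch *child) {
	c.lock.Lock()
	defer c.lock.Unlock()
	delete(c.children[ch.topic], ch.partition)
}
```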
goka calls `sarama.Consumer.ConsumePartition` twice for each table: once in `PartitionTable.SetupAndRecover` and once in `PartitionTable.CatchupForever`. After `SetupAndRecover` and before `CatchupForever` there should be no consumers of the partition. The partition is released by calling `sarama.PartitionConsumer.AsyncClose` in a defer of `PartitionTable.load`. `AsyncClose` itself is asynchronous, but in that defer `PartitionTable` actually synchronizes using `drainConsumer`. However, `drainConsumer` only waits on the messages and errors channels for at most 1 second, so we have no guarantee that the partition consumer is actually closed and removed after that second.
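The racy ordering can be sketched directly against sarama's public API (the topic name, partition and offsets below are made up for illustration; in goka the two `ConsumePartition` calls happen inside `SetupAndRecover` and `CatchupForever`, and the import path is `github.com/Shopify/sarama` or `github.com/IBM/sarama` depending on the version in use):

```go
package example

import "github.com/Shopify/sarama"

// Sketch of the racy ordering, expressed against sarama's public API.
func reopenPartition(consumer sarama.Consumer) error {
	pc, err := consumer.ConsumePartition("some-table", 0, sarama.OffsetOldest)
	if err != nil {
		return err
	}

	// ... read messages until the table is recovered ...

	pc.AsyncClose() // returns immediately; the child may still be registered inside sarama

	// If this call wins the race against the asynchronous close above, sarama
	// rejects it with "That topic/partition is already being consumed".
	_, err = consumer.ConsumePartition("some-table", 0, sarama.OffsetNewest)
	return err
}
```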
Reproduction:

We patched `sarama` to force the invalid order of execution.

Test results:
I think `TestView_Reconnect` fails only because the timeout in the test is too low.

Solution
My proposal to fix the issue is to simply drain the errors/messages channels until they are actually closed, without a timeout. We are testing this change in our application and haven't found any issues yet. After this change the tests pass with just one adjustment: increasing the timeout. I might not fully understand what else might be affected.
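A minimal sketch of the proposed drain loop (hypothetical helper name, not goka's actual `drainConsumer` implementation): it reads both channels until both are closed, with no timeout, so the partition consumer is guaranteed to be fully deregistered before the next `ConsumePartition` call.

```go
package example

import "github.com/Shopify/sarama"

// drainUntilClosed sketches the proposed behaviour: after AsyncClose, keep
// reading the messages and errors channels until both are closed, with no
// timeout, so the partition consumer has really been torn down before the
// next ConsumePartition call.
func drainUntilClosed(cons sarama.PartitionConsumer) []error {
	var errs []error
	messages, errors := cons.Messages(), cons.Errors()
	for messages != nil || errors != nil {
		select {
		case _, ok := <-messages:
			if !ok {
				messages = nil // closed; stop selecting on it
			}
		case err, ok := <-errors:
			if !ok {
				errors = nil // closed; stop selecting on it
				continue
			}
			errs = append(errs, err)
		}
	}
	return errs
}
```

Setting each channel variable to nil once it is closed makes the corresponding select case block forever, so the loop exits naturally as soon as both channels have been closed.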