[EventHub] Race condition when buffered mode is enabled drops message #34711
- Package Name: azure-eventhub
- Package Version: 5.11.6
- Operating System: Linux
- Python Version: 3.10
Describe the bug
An event passed to the async `EventHubProducerClient` with `buffered_mode=True` may sometimes never get flushed (i.e., actually sent to the Event Hub). This has actively impacted at least one use case I know of, causing random message loss: some "sent" messages never arrived at the other end.

`EventHubProducerClient`'s `max_wait_time` config directly affects how often a message is dropped. Given the race-condition nature described here, I think it's safe to say that factors such as network latency and coroutine/thread scheduling order can also affect how often this occurs.

In this issue, I will mainly talk about the async `EventHubProducerClient`, though the same should apply to the sync, thread-based `EventHubProducerClient`.

To Reproduce
Since the root cause appears to be a race condition, the problem only occurs under specific conditions and may be hard to reproduce.
Even so, I was able to trigger the critical path fairly often using a specific combination of thresholds. A real scenario is harder to reproduce, but the same idea applies.
You can see the code at https://gist.github.com/falcaopetri/18a65d316f0f2cc12ee65f3e6939976d. Here I will focus on explaining the log and the critical path.
The log at the end of the reproduction code is:
A couple of log lines...
Which can be simplified as:
The issue happens to `seq_id=4`, which gets "enqueued" but is never flushed.

The critical path is a classic consumer/producer race condition between `BufferedProducer.check_max_wait_time_worker` and `BufferedProducer.put_events`, and looks like this:

1. The `BufferedProducer.check_max_wait_time_worker` loop gets executed by the event loop, decides that it is time to flush, acquires the `BufferedProducer` object's lock, and calls `BufferedProducer._flush`:
   azure-sdk-for-python/sdk/eventhub/azure-eventhub/azure/eventhub/aio/_buffered_producer/_buffered_producer_async.py
   Lines 216 to 221 in 23121a5
2. `_flush` starts by adding the `_cur_batch` to the `queue`, and "cleaning" the `_cur_batch` by pointing it to a newly created object:
   azure-sdk-for-python/sdk/eventhub/azure-eventhub/azure/eventhub/aio/_buffered_producer/_buffered_producer_async.py
   Lines 150 to 152 in 23121a5
3. `_flush` starts calling `_producer.send` asynchronously, which can take a few moments to return:
   azure-sdk-for-python/sdk/eventhub/azure-eventhub/azure/eventhub/aio/_buffered_producer/_buffered_producer_async.py
   Lines 163 to 166 in 23121a5
4. A new event is requested to be added to the buffer batch (e.g. by user code):
   azure-sdk-for-python/sdk/eventhub/azure-eventhub/samples/async_samples/send_buffered_mode_async.py
   Line 48 in 23121a5
5. The call goes all the way down to `await BufferedProducer.put_events` (the event loop is free to run this path, since `producer.send` has not completed yet):
   azure-sdk-for-python/sdk/eventhub/azure-eventhub/azure/eventhub/aio/_buffered_producer/_buffered_producer_dispatcher_async.py
   Line 74 in 23121a5
6. `BufferedProducer.put_events` gets called and adds an event to the object's internal buffer without holding any lock (some operations are protected by a lock, as introduced in [Eventhub] Fix Blocking Behavior of Buffered Producer Flush #25406, but not this one):
   azure-sdk-for-python/sdk/eventhub/azure-eventhub/azure/eventhub/aio/_buffered_producer/_buffered_producer_async.py
   Line 110 in 23121a5
7. `BufferedProducer.put_events` has finished its job, and so has `BufferedProducer._flush`, because `producer.send` has finished. Only some clean-up remains before exiting: `_flush` assigns `_cur_batch` to a new object, losing the only reference to the batch to which the user's event was added:
   azure-sdk-for-python/sdk/eventhub/azure-eventhub/azure/eventhub/aio/_buffered_producer/_buffered_producer_async.py
   Line 205 in 23121a5
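
To make the interleaving above easier to follow without an Event Hub, here is a minimal, self-contained sketch of the same pattern using only `asyncio`. It is not the SDK code: `ToyBufferedProducer`, `_send`, `put_event`, and the `lock_put` flag are hypothetical names invented for this illustration, and the sleeps stand in for network latency. With `lock_put=False` the event added mid-flush is lost in the final reassignment of `_cur_batch`. With `lock_put=True` the put waits for the flush to finish, which is shown only for contrast as one possible mitigation, not as the fix the SDK should necessarily adopt.

```python
import asyncio


class ToyBufferedProducer:
    """Toy stand-in for the buffered producer (hypothetical, not the SDK class)."""

    def __init__(self, lock_put):
        self._lock = asyncio.Lock()
        self._lock_put = lock_put  # hypothetical switch: take the lock in put_event
        self._cur_batch = []
        self._queue = []
        self.sent = []

    async def _send(self, batch):
        await asyncio.sleep(0.05)  # simulated network latency
        self.sent.extend(batch)

    async def _flush(self):
        # Step 2: enqueue the current batch and point _cur_batch at a new object.
        if self._cur_batch:
            self._queue.append(self._cur_batch)
            self._cur_batch = []
        # Step 3: sending yields control back to the event loop.
        while self._queue:
            await self._send(self._queue.pop(0))
        # Step 7: clean-up replaces _cur_batch again, dropping anything
        # that was appended while the send was in flight.
        self._cur_batch = []

    async def flush(self):
        async with self._lock:
            await self._flush()

    async def put_event(self, seq_id):
        if self._lock_put:
            async with self._lock:
                self._cur_batch.append(seq_id)
        else:
            # Step 6: the append happens without holding the lock.
            self._cur_batch.append(seq_id)


async def run(lock_put):
    producer = ToyBufferedProducer(lock_put)
    await producer.put_event(1)
    # Plays the role of the max_wait_time worker deciding to flush (step 1).
    flush_task = asyncio.create_task(producer.flush())
    await asyncio.sleep(0.01)    # let the flush reach the simulated send
    await producer.put_event(2)  # steps 4 to 6: user code adds an event mid-flush
    await flush_task
    await producer.flush()       # final flush, as when closing the client
    return producer.sent


lost = asyncio.run(run(lock_put=False))
kept = asyncio.run(run(lock_put=True))
print(lost)  # [1]: event 2 was enqueued but never sent
print(kept)  # [1, 2]
```

In the unlocked run, event 2 plays the role of `seq_id=4` in the log: it is accepted by the producer and then silently discarded.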
Expected behavior
A message added to a buffered `EventHubProducerClient` should always get sent.

Additional context
I am also submitting a PR (#34712) with some more implementation-specific documentation.
This issue seems to have existed since the first release of `buffered_mode` in `5.10.0` via #24653.