[BUG] EventHub Consumer stops consuming messages until we restart #18070
Comments
We hit the same problem. We even set the RetryOptions so that it keeps retrying when an error happens, but our log shows the same exception thrown periodically, and for several days no message was received. When we restart the application, messages are consumed normally again.
Could anyone shed some light here, please?
In 5.3.0, we added watchdog functionality that checks whether the connection is alive. If it is not, we return the partition to the pool. That way, another processor can reclaim it and begin processing again.
Thanks for the information. We are already using version 5.3.1 as shown below, and we still face this issue.
We have another customer who has been able to reproduce the same issue with the latest version, 5.5.0. They could reproduce it by disconnecting the network for a couple of minutes and then connecting back.
@conniey We see the exact same issue with azure-messaging-eventhubs version 5.3.1. Can we expect a fix for this shortly, or is there a temporary workaround you can suggest?
It's not a very satisfactory workaround, but we track the time we last received messages on each partition and trigger a restart via a liveness probe if too much time has elapsed since the last message. On Event Hubs with low message volumes this does cause a lot of unnecessary restarts, but that is far better than losing messages altogether.
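For anyone needing the same stopgap, a minimal sketch of that workaround in Java follows; the class name, threshold handling, and how it is exposed to the Kubernetes liveness probe are assumptions, not the commenter's actual code.

```java
// Sketch: track the last time an event arrived on each partition and report
// "not alive" once any partition has been silent for longer than maxSilence.
import com.azure.messaging.eventhubs.models.EventContext;

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PartitionLivenessTracker {

    // Last time an event was received, keyed by partition id.
    private final Map<String, Instant> lastReceived = new ConcurrentHashMap<>();
    private final Duration maxSilence;

    public PartitionLivenessTracker(Duration maxSilence) {
        this.maxSilence = maxSilence;
    }

    // Call this from the processEvent handler of the EventProcessorClient.
    public void recordEvent(EventContext context) {
        lastReceived.put(context.getPartitionContext().getPartitionId(), Instant.now());
    }

    // Wire this into the Kubernetes liveness probe (for example via a Spring Boot
    // health indicator). Returns false when a tracked partition has been silent too long.
    public boolean isAlive() {
        Instant cutoff = Instant.now().minus(maxSilence);
        // An empty map (no partitions claimed yet) counts as alive.
        return lastReceived.values().stream().allMatch(ts -> ts.isAfter(cutoff));
    }
}
```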
So this isn't lost in the fray: we dug some more into this and the existing Track 1 library and noticed that our third-party dependency doesn't always propagate to its children that the underlying transport is closed. It's possible our consumers believe they are still alive even though the connection is not. I'm looking into a fix that propagates the connection error to all of its children (receiver links) and closes them.
@conniey From the perspective of a consumer client application that continuously listens to the Event Hub to process incoming messages, this is a critical issue. Any message the consumer client fails to pick up is effectively data loss, and processing resumes only after an application restart, which we can't afford in a production environment. Can we expect a fix for this issue soon? We have a production release in a week's time, and this issue is a blocker for us.
Can you explain this a little more? Event Hubs doesn't remove any events from the stream; events leave the stream only when the "Message Retention" policy for your Event Hub has elapsed. That's one of the reasons we have a durable store for checkpointing: if your application restarts, you know where in the stream you last processed a message from the hub.
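For readers following along, a minimal sketch of that checkpointing pattern with the Blob checkpoint store is below; the connection strings, names, and per-event checkpoint cadence are placeholders, not anything from this thread.

```java
// Sketch of an EventProcessorClient backed by a durable Blob checkpoint store, so a
// restarted processor resumes from the last recorded position in the stream.
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class ProcessorSample {
    public static void main(String[] args) {
        BlobContainerAsyncClient blobContainer = new BlobContainerClientBuilder()
            .connectionString("<storage-connection-string>")
            .containerName("<checkpoint-container>")
            .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
            .connectionString("<event-hub-connection-string>", "<event-hub-name>")
            .consumerGroup("$Default")
            .checkpointStore(new BlobCheckpointStore(blobContainer))
            .processEvent(eventContext -> {
                // Handle the event, then record progress in the durable store.
                System.out.println(eventContext.getEventData().getBodyAsString());
                eventContext.updateCheckpoint();
            })
            .processError(errorContext ->
                System.err.println("Error on partition "
                    + errorContext.getPartitionContext().getPartitionId()
                    + ": " + errorContext.getThrowable()))
            .buildEventProcessorClient();

        processor.start();
    }
}
```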
@conniey Hi, thanks for the response. I agree that messages are retained in the Event Hub until the retention period elapses. But for near-real-time processing of events, if the consumer client stops processing incoming messages until an application restart, that is effectively data loss for us. In our use case, we route all the messages arriving at the IoT hubs as events to an external Event Hub, and a Spring Boot consumer client application listens to that external Event Hub and processes the incoming events in near real time. The processing of incoming events is time-bound, because each event carries IoT device information for that point in time and is obsolete if processed later. So in our use case we can't afford for the processing of incoming messages to stop abruptly. Can we expect an SDK fix for this anytime soon?
@conniey If the connection is not alive but the consumers still hold it, will the server detect it and reassign the partition to other consumers?
Hey @conniey, could you please share updates on this issue? I believe we're facing a similar problem. We're using BlobContainerAsyncClient and EventProcessorClientBuilder against the IoT Hub built-in endpoint. We need to keep the EventProcessorClient open to receive messages from the built-in endpoint, and we notice that although load balancing runs regularly, the client does not pick up new events until I restart the service. It's hard to reproduce and debug, as it is intermittent and usually happens 10-14 days after we restart the service/client. Also, the log always shows "Load balancing completed successfully", even when it cannot pick up any new messages.
@conniey Hi, will the fix take care of automatically re-establishing the connection to the partition returned to the pool, or does the application need to implement any additional logic on top of this fix?
Yes. The partition load balancer periodically checks ownership; it will notice that the partition is unclaimed and re-claim it automatically.
We released 5.7.0 this morning. This should recover the consumer after it closes when using the EventProcessorClient. https://repo1.maven.org/maven2/com/azure/azure-messaging-eventhubs/5.7.0/ Cheers,
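For Maven builds, picking up that release is a version bump along these lines (a sketch; Gradle users would adjust accordingly):

```xml
<!-- azure-messaging-eventhubs 5.7.0, per the release linked above -->
<dependency>
  <groupId>com.azure</groupId>
  <artifactId>azure-messaging-eventhubs</artifactId>
  <version>5.7.0</version>
</dependency>
```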
@conniey Hi Conniey, I followed the instructions from https://docs.microsoft.com/en-us/java/api/overview/azure/messaging-eventhubs-readme?view=azure-java-stable. When I use azure-messaging-eventhubs 5.7.0, my Spring Boot app fails to start and I get these errors; everything works fine when I use azure-messaging-eventhubs 5.5.0. Could you please give me some suggestions? Thanks!
@jyyy-57 Hi, we had to exclude the 'reactor-core' dependency from the Event Hubs dependency and include 'reactor-core 3.4.3' to work around this issue. Changes as follows:
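The actual snippet was posted as an attachment; a sketch of the described change for a Maven build might look like this (the 5.7.0 version is taken from the thread, and your build may differ):

```xml
<dependency>
  <groupId>com.azure</groupId>
  <artifactId>azure-messaging-eventhubs</artifactId>
  <version>5.7.0</version>
  <exclusions>
    <!-- Exclude the transitive reactor-core pulled in by the Event Hubs client. -->
    <exclusion>
      <groupId>io.projectreactor</groupId>
      <artifactId>reactor-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Pin reactor-core explicitly to 3.4.3, as described above. -->
<dependency>
  <groupId>io.projectreactor</groupId>
  <artifactId>reactor-core</artifactId>
  <version>3.4.3</version>
</dependency>
```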
@conniey, we hope this will be taken care of on the SDK end.
Hi @gandhirajan, thank you so much!! Your method works! I can start the application now. reactor-core might be the key to the issue. However, I got another error.
/cc @saragluna, who may have more insight into when they will update to 5.7.0.
Describe the bug
The EventHub consumer stops consuming messages until we restart the consumer. We have deployed the consumer in Azure Kubernetes Service (AKS). Initially it consumes a few messages, then all of a sudden it stops consuming. If we restart the consumer, it works like a charm. Until we restart it, all the messages sit in the Event Hub; even after 2-3 days, none of the messages is consumed.
Exception or Stack Trace
This is the consumer log just captured before restarting:
consumerlogmessages.txt
Picked up this stack trace from the log:
To Reproduce
This has happened very often.
Code Snippet
Expected behavior
I can see a few connection exceptions in the logs, but I am wondering why it stopped consuming messages permanently. Immediately after restarting, messages started being consumed again.
Setup (please complete the following information):
Information Checklist
Kindly make sure that you have added all of the information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.