[BUG] ServiceBusTrigger - lock renewals stop suddenly/randomly #37414
Comments
Hi @pzaj2. Thank you for reaching out and we regret that you're experiencing difficulties. Beyond the exception, is there any reason that you suspect that lock renewal has stopped taking place? I suspect that you're seeing this error intermittently because of a transient network failure that requires the AMQP link to be recreated. Messages must be settled on the link from which they were received and having a link be recreated would manifest in a lock lost error, despite the lock being renewed and in a valid state. To confirm, we'll need to take a look at some additional SDK logs in a +/- 5-minute window around the time you're seeing the exception. Specifically, we're looking for:
I'd suggest capturing at the Verbose level and filtering down to that set. Information about capturing SDK logs can be found here.
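A minimal sketch of capturing those Verbose-level SDK logs in-process, assuming the `AzureEventSourceListener` helper from Azure.Core (console output shown for brevity; the linked guidance covers other sinks):

```csharp
using System.Diagnostics.Tracing;
using Azure.Core.Diagnostics;

// Capture Verbose-level Azure SDK events (including the Service Bus lock renewal
// events mentioned above) for as long as the listener is alive.
using AzureEventSourceListener listener =
    AzureEventSourceListener.CreateConsoleLogger(EventLevel.Verbose);

// ... run the WebJobs host / repro while the listener is active ...
```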
Hi @pzaj2. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
Hi @jsquire , I'll try my best to provide these additional logs as soon as possible. The reason I think it stops renewing the lock is that it stops producing log entries associated with lock renewal. If it is indeed an intermittent issue, how would I go about mitigating it? Given the consequences of this failure, I'd think it's still worth fixing. The workaround I implemented at the moment looks like this (first lines of code of my method):
where
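A minimal sketch of such a manual lock-renewal workaround, assuming the `ServiceBusMessageActions` binding exposed by the WebJobs Service Bus extension (not necessarily the exact code used in this issue), might look like this:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.ServiceBus;

public class TestTrigger
{
    public async Task RunAsync(
        // Entity names are placeholders.
        [ServiceBusTrigger("my-topic", "my-subscription")] ServiceBusReceivedMessage message,
        ServiceBusMessageActions messageActions,
        CancellationToken cancellationToken)
    {
        using var renewalCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);

        // Renew the lock every 4.5 minutes against a 5-minute lock duration,
        // leaving roughly a 30-second margin before the lock would expire.
        var renewalLoop = Task.Run(async () =>
        {
            try
            {
                while (true)
                {
                    await Task.Delay(TimeSpan.FromMinutes(4.5), renewalCts.Token);
                    await messageActions.RenewMessageLockAsync(message, renewalCts.Token);
                }
            }
            catch (OperationCanceledException)
            {
                // Processing finished; stop renewing.
            }
        });

        try
        {
            await DoLongRunningWorkAsync(message, cancellationToken);
        }
        finally
        {
            renewalCts.Cancel();
            await renewalLoop;
        }
    }

    // Placeholder for the real long-running processing.
    private static Task DoLongRunningWorkAsync(ServiceBusReceivedMessage message, CancellationToken token)
        => Task.Delay(TimeSpan.FromMinutes(40), token);
}
```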
Hi @pzaj2. To be clear - if this does root cause to a failure for locks being renewed, then it is absolutely a client bug that we should fix. However, if it is the result of intermittent network issues causing the connection or link to drop, there's nothing that your application or the client can do to directly prevent it, unfortunately. It is something that would need to be mitigated by ensuring that the application's processing is idempotent and can ignore duplicate data. Thus far, we haven't been able to repro and are not seeing stress test failures for this scenario. Logs are going to be our best bet, assuming that you're able to repro. The long-term solution is for Service Bus to support AMQP's durable terminus, which allows for link state to be persisted across connections. Once the service has support, we'll add it to the client, which would mitigate the "I lost my connection and now my lock is invalid" scenario. I do not have insight into the timing for the service feature, however, only that it is on the roadmap.
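The idempotency mitigation mentioned above can be as simple as keying off `MessageId`; an illustrative sketch (in-memory only for brevity - a real application would persist the processed IDs):

```csharp
using System.Collections.Concurrent;
using Azure.Messaging.ServiceBus;

// Tracks which messages have already been handled so a redelivery can be ignored.
public class DuplicateGuard
{
    private readonly ConcurrentDictionary<string, bool> _processed = new();

    // Returns false if this MessageId has been seen before.
    public bool TryBegin(ServiceBusReceivedMessage message) =>
        _processed.TryAdd(message.MessageId, true);
}

// Usage at the top of the triggered method:
//   if (!duplicateGuard.TryBegin(message))
//   {
//       return; // work already done once; skip re-processing the redelivered message
//   }
```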
Hi @jsquire , please see the logs below - I hope that's enough, since I followed the guidelines to the letter!
I've only seen two instances of
You'll notice that they only happened once, around the start-up time (almost 2 hrs before the exception), and never recurred. Let me know if I can be of any more help!
Thanks, @pzaj2. I don't see the pattern that I was looking for, so we'll need to dig deeper. I'm going to ask a colleague to step in and take the lead for the investigation.
@JoshLove-msft : I wasn't able to repro locally or in Azure. Would you take point on digging into this, please?
@jsquire sure thing. @JoshLove-msft should you need any further help from me, please let me know!
Sounds good - will take a look.
I haven't been able to reproduce this. The only thing that I can think of that may be causing this is that there is thread starvation in the application and the renew lock task is not being scheduled in time. A few things to try:
Right, I'll try reproducing it with concurrency switched off. I'll also try to collect the process dump while I'm on that (should I manage to reproduce it). A couple of questions in the meantime:
Anyhow, I'll let you know of any developments as I try to reproduce it and collect the process dump. P.S. The workaround I mentioned earlier works fine - we haven't observed any retries or LockLostException occurrences. Would thread starvation not affect this workaround? Just for context, the delay is configured to 4.5 minutes and the lock duration to 5 minutes, so perhaps that 30s gap makes it immune to potential thread starvation.
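For reference, a minimal sketch of what switching off concurrency could look like in the WebJobs host setup (assuming the standard `HostBuilder` wiring rather than the repro app's exact configuration):

```csharp
using Microsoft.Extensions.Hosting;

var host = new HostBuilder()
    .ConfigureWebJobs(builder =>
    {
        builder.AddServiceBus(options =>
        {
            // Process one message at a time to rule out contention from
            // concurrent executions while reproducing the issue.
            options.MaxConcurrentCalls = 1;
        });
    })
    .Build();

await host.RunAsync();
```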
@JoshLove-msft following your request, I've had it running since early morning today. I only came across an exception around 1.45 pm (UK time), but I can confirm that it's still happening. I've attached unfiltered logs from the moment the repro app started until about 2 pm (1 pm UK time), when I realised there had been an exception. I also have the process dump (or at least I hope that's what you wanted!), but it's too big to be uploaded here, so I uploaded it to Google Drive instead: https://drive.google.com/file/d/1KdDd5bbC1iP_3EfP8pIyalJtULyOMvW8/view?usp=drive_link. Unfortunately, I can't share it publicly due to my company's Google Drive limitations, but if you request access I'll grant it asap. If for any reason you'd prefer me to share it in a different way, please let me know. The dump is also from about 20 minutes after the initial exception, as I only noticed the exception then.
It would depend on a lot of factors, like your machine, the .NET version being used, and your application code in addition to the SDK code.
The lock renewal interval is based on the lock duration for the entity. You can make the lock duration longer via the Portal or management APIs so that the lock renewals will be less frequent.
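A minimal sketch of lengthening the lock duration through the management API, assuming the `Azure.Messaging.ServiceBus.Administration` client and placeholder entity names:

```csharp
using System;
using Azure.Messaging.ServiceBus.Administration;

var adminClient = new ServiceBusAdministrationClient("<service-bus-connection-string>");

SubscriptionProperties subscription =
    await adminClient.GetSubscriptionAsync("my-topic", "my-subscription");

// Raise the lock duration (Service Bus caps this at 5 minutes), which also
// spaces out the client's automatic lock renewals.
subscription.LockDuration = TimeSpan.FromMinutes(5);

await adminClient.UpdateSubscriptionAsync(subscription);
```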
Looking at these logs, there definitely seems to be something related to starvation. The below log entries should be emitted every 59 seconds:
Note that there is almost 4 minutes between the 2nd and 3rd. Could you share your updated app?
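A rough, generic way to check for that kind of starvation alongside the repro app is a scheduling-delay probe (a sketch only, not something prescribed by the SDK):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// A timer continuation that should resume after ~1 second; if it consistently
// resumes much later, continuations (including the SDK's lock renewal timer)
// are not being scheduled promptly.
while (true)
{
    var sw = Stopwatch.StartNew();
    await Task.Delay(TimeSpan.FromSeconds(1));
    var lag = sw.Elapsed - TimeSpan.FromSeconds(1);

    if (lag > TimeSpan.FromSeconds(2))
    {
        Console.WriteLine($"Delay continuation ran {lag.TotalSeconds:F1}s late - possible thread-pool starvation.");
    }
}
```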
Oh, I'll be honest - I hadn't really paid much attention to these logs, nice catch! Apologies for the delays; I've had to prioritise other things over the last 2 days. Here's the repro app including the updates. Please bear in mind that the exceptions occur both with and without the file logging provider - I've only added it to collect the logs.
And one more thing on that, since I just remembered - I should've been clearer earlier: what I'm looking for is the ability to increase the gap between the renewal and the expiry, i.e. if the renewal starts 5 seconds before the lock would've otherwise expired, I'd like to increase that to 30 seconds. While this isn't ideal, it'd most likely resolve the issue in the short/mid-term for people who struggle with this, until the long-term solution (AMQP's durable terminus) is in place.
This isn't configurable. I've been unable to repro the issue but we have recently made a fix to how settlement works so that you are less likely to hit these lock lost errors - #37704. This will be released in August.
Closing this out as I haven't been able to repro and we have made some improvements to make lock lost issues less likely to occur.
Library name and version
Microsoft.Azure.WebJobs.Extensions.ServiceBus 5.11.0
Describe the bug
I have a WebJobs project with continuous jobs and a Service Bus trigger (topic). The trigger/function runs for quite some time (up to 40 minutes), and at some point the calls to renew the lock on the message (`PeekLock` mode) stop being made. This then leads to failures when the function attempts to auto-complete the message. `MaxAutoLockRenewalDuration` is not exceeded, as per the project attached. This is especially hard to deal with because all the work carried out by my function completes, but the message goes back to the queue and is re-processed in a retries loop. You can see the output from AppInsights (in Rider) in the screenshot below.
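For reference, the auto lock renewal window mentioned here is controlled by `ServiceBusOptions.MaxAutoLockRenewalDuration`; a one-line sketch of raising it above the longest processing time, assuming the usual `ConfigureWebJobs`/`AddServiceBus` host wiring (as sketched earlier in this thread):

```csharp
// Inside the AddServiceBus(options => ...) callback of the WebJobs host setup.
options.MaxAutoLockRenewalDuration = TimeSpan.FromHours(1); // comfortably above the ~40-minute processing window
```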
Expected behavior
The message lock should be renewed. Auto-complete should successfully complete the message.
Actual behavior
The message lock is not renewed. Auto-complete fails to complete the message.
Reproduction Steps
You'll need to set up Service Bus in `appsettings.json` and create the required messaging entities (check `TestTrigger.cs`). It takes a while - I've seen this 5-6 times in the last 2 days when running this project locally.
ServiceBusLockLostRepro.zip
Environment
Hosting: Azure AppService (but the same can be reproduced locally, running against the below
The issue is IDE-agnostic. I ran it from within Rider as well as using CLI.