Orchestration getting stuck while getting the lock #2534
Comments
@vany0114: while we start this investigation, can you share with us when this issue first started (rough timestamp in UTC)? Does it correlate with any code changes (i.e., partial classes, or trying out the new partition manager)? |
@davidmrdavid It seems the issue started around 2023-08-07 13:34 UTC |
Thanks! Any known code/config changes around that time? @vany0114? |
@davidmrdavid Nope, we haven't touched anything since the last deployment on August 2nd, which, BTW, was the last time the issue showed up and we had to reset everything. |
@davidmrdavid BTW I'll re-deploy the func app today to try to mitigate this issue, but this time it will use Netherite as the storage provider, meaning that the old |
@vany0114: that shouldn't be an issue on our end as our historical telemetry is stored independently of our app's live status, so you're good to go ahead :). I'm not certain at this point in time that this is an issue with the storage provider itself, so I can't comment on whether changing to Netherite will help, but it probably won't hurt to try it out and may tell us more about the problem. Just an FYI. |
Agreed, I'm not sure either; I just wanna try something different this time, and as you said it may tell us more about the problem. |
@davidmrdavid BTW if you guys can also take a look at this microsoft/durabletask-netherite#294 would be great since that's related to the Netherite implementation. |
A few small updates: (2) I'm not immediately spotting a partition manager-type issue. In other words, the queues that are hosting your orchestrator and entity messages seem to be receiving new messages just fine. So I'm starting to think the issue here might have to do with the entity message processing, not with receiving messages. Still need to validate this. |
Actually, I'm seeing abnormally high message ages in "control queue 3". Let me drill into that. If that queue is hosting any of the entities you're trying to lock, maybe that's to blame. |
Yeah I rolled back that change because I thought that was the culprit, but it wasn't, sorry I forgot to mention that. |
I'm not able to draw a correlation from this observation to the symptoms you experienced, so let's ignore this for now. Also, I've been focusing on this set of data, from the other thread:
So that orchestrator is stuck from about ~11:33 to ~11:54, and I see in my telemetry the EntityId it's waiting on. Well, I see that entity is managed by control queue 01. According to our telemetry, control queue 01 was being listened to by "VM 19" from ~13:15 to ~13:31. Then, no VM actively listens to new messages from that queue until ~13:54, when "VM 20" takes over. To summarize, it appears that no new messages were being processed from that queue from ~13:31 to ~13:54. That's what I've been able to find so far. Assuming I understand the problem correctly, this means, to me, that simply restarting your app should suffice to mitigate this situation when using the Azure Storage backend. I'll continue looking into the logs to answer the open questions above; I'll keep you posted. |
@davidmrdavid I much appreciate the update 🙏 |
Unfortunately, it didn't work, I'm gonna reset it again. |
Hi @vany0114 - if you have a new timerange of when your app was stuck, as well as the timestamp of your restart, that would help us greatly |
@davidmrdavid It has been stuck since yesterday (2023-08-07 13:34 UTC); I restarted it when I left the previous comment. Now the func app is using Netherite and it's working fine, but I guess you'll be able to find a lot of instances stuck in the old storage account. |
@vany0114: I'll be looking at the logs but I need a bit more detail. When you say that it's been stuck since 2023-08-07 13:34 UTC, do you mean that there are some instanceIDs that never recovered? Or that you've been experiencing intermittent delayed orchestrations since that time? If you have an orchestrator that has been stuck since 2023-08-07 13:34 UTC and remains stuck until now, that would be a useful instanceID for us to have. Thanks! |
It's not intermittent; it was happening with basically every request processed by the orchestrator. Here are two more examples where those instances were stuck for hours.
As a result, our processed-orders rate was affected; you can see here how dramatically it dropped because the orchestrator was holding almost all of them. @davidmrdavid please let me know if the information provided helps |
Hi, @vany0114. My team member @davidmrdavid is OOF and I am taking over this case. Can you tell me what kind of id this is: "id":"3107849517381639"? And is this time, "businessDate":"2023-08-07T00:00:00", in UTC? Thanks! I queried with the function app name and timestamps since 2023-08-07T00:00:00 UTC, and I didn't find any message with an Age of several hours. So I think the orchestration instanceIds of the two examples above would be helpful for me to investigate. |
Hi @nytian, the ids I've shared are actually the orchestration ids:
Basically, the orchestration ids are a small serialized JSON. |
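For illustration, a minimal sketch of how such a serialized-JSON instance ID can be produced and handed to the Durable Functions client; the helper, orchestrator name, and fields below are assumptions for the example, not the actual app code:

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class OrderSagaStarter
{
    // Hypothetical helper: serializes the order's key fields and uses the
    // resulting JSON string as the orchestration instance ID.
    public static Task<string> StartAsync(
        IDurableOrchestrationClient client,
        string orderId, string storeNumber, string brandId)
    {
        string instanceId = JsonSerializer.Serialize(new { orderId, storeNumber, brandId });

        // "ProcessOrderOrchestrator" is an assumed orchestrator function name.
        return client.StartNewAsync("ProcessOrderOrchestrator", instanceId);
    }
}
```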
@nytian not sure if you already read the whole thread and all the documentation provided, but the problem to figure out here is why that event took so long. That's not an event our orchestrator is explicitly waiting for; it seems that internally, when an entity is locked, it waits for some sort of ack or something, and that is what causes the orchestrations to hang. |
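For reference, a minimal sketch of the entity-locking pattern being discussed, using the Durable Functions 2.x in-process API (`IDurableOrchestrationContext.LockAsync`); the entity and operation names are illustrative assumptions, not the actual app code:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ProcessOrderOrchestrator
{
    [FunctionName("ProcessOrderOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Assumed entity name/key for the example.
        var order = new EntityId("Order", context.InstanceId);

        // LockAsync sends a lock-request message to the entity; the orchestrator
        // then waits for the "lock acquired" response message. That internal
        // round trip is what appears to take minutes in this issue.
        using (await context.LockAsync(order))
        {
            // Critical section: only this orchestration can operate on the
            // locked entity until the scope is disposed (lock released).
            await context.CallEntityAsync(order, "MarkProcessing");
        }
    }
}
```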
@nytian @davidmrdavid any updates? |
More examples from today
FWIW it was working great after we started using Netherite; however, yesterday we made a deployment around 2023-08-18T12:45 UTC, and after that it seems the problem has come back. @davidmrdavid can we get some help ASAP please? This is happening in our production environment. |
@davidmrdavid got it, thanks! And what about the partition count? Would increasing it to the maximum (16) help with throughput? |
@vany0114: If you're starting with a brand-new task hub, then yes, it can help increase throughput when scaled out, but it will also increase storage costs (more queues to load balance and poll). But please note that increasing the partition count is not a safe thing to do for a pre-existing task hub (instanceIDs land on a queue based on the partition count, so if you change it, you risk having old messages end up "in the wrong queue"), so if you change that, please do so only for a brand new and empty task hub. |
@vany0114 sorry for the delay here, we've been under a lot of deadlines (still are) but I carved out some time to tackle this now. I think I understand the problem.

So I looked at the "stuck orchestrator" with instanceID "{"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}". It tried to take a lock on the entity "@order@{"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}'" (+ 4 others) at 2024-06-03 23:22:03.7759128 UTC and did not receive the lock for all entities until 2024-06-03 23:32:04.5303100 UTC, so 10 minutes, as you said.

From the logs, I can actually see that the lock over all entities was obtained at "2024-06-03 23:27:04.4272879", but the message to notify the orchestrator of this successful locking took ~5 minutes to be dequeued. So the real lock time was about ~5 minutes, but it took another ~5 to deliver the message. I think the most critical thing to debug is why it took ~5 min to deliver that message.

From what I see in my telemetry, it appears your queues, at times, get pretty backed up, to the point of creating this "5 minute wait" (sometimes it's much longer) that newly inserted messages are paying before getting processed. This is all I have so far. I have some rough ideas for how to optimize this, but I'll need to think it through. I'll post another update tomorrow with some ideas, as well as instructions on how you can view your queue backlog yourself. |
Hey @vany0114 - are you using the Durable Functions v3.0.0-rc extension by any chance? My logs indicate that this is the case. Please don't change anything in your app just yet / don't attempt to downgrade if you are (it is not safe to downgrade from v3 to v2 due to breaking changes!) - but if so, then you'd be on a preview package that isn't production ready and may be experiencing some known bugs (none that would explain the queue backlog by themselves, but they wouldn't help nonetheless). Please let me know if this is the case and we can figure out a strategy to bring you back to a production-ready package. |
Hey @davidmrdavid that's correct, we're using |
@davidmrdavid FWIW we're experiencing that high latency for around ~5% of requests; below you can see the p95, p50, and average of the processing time. The weird thing is that when it happens, the delay is almost always consistent at 5 minutes. I thought that maybe there was some sort of threshold that you guys have internally, but it may be just a coincidence, as sometimes it takes longer. |
Hmm, yeah I think this may be part of the problem. Like I mentioned, this is a preview package (as per the suffix) with breaking changes. Please do not downgrade to a

My first recommendation is: can you please upgrade to the

I'm seeing some errors in the scale controller (the component that scales your app to keep up with traffic) that may be explained by the use of this

I think we can bypass the scale controller by enabling "runtime scale monitoring", which essentially makes the app itself make scaling decisions and send them to the scale controller, instead of having the scale controller do that by itself. You can turn that on as described here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-azure-sql-trigger?tabs=isolated-process%2Cpython-v2%2Cportal&pivots=programming-language-csharp#enable-runtime-driven-scaling

Using your staging/dev environment, where I assume you also have the 3.x package, could you try this? Then please let me know; I'll monitor that the errors disappear, and from there I'll give you the green light to make the change in prod as well. I think that'll help with throughput (as scaling should improve), though it may not be our final fix. |
I just did it in our dev environment.
I'm afraid we cannot do so, since the func app is running under a consumption plan; my understanding is that "runtime scale monitoring" is only available in the premium tier. |
You may be right, now that I think about it. Let me see what I can do in the backend. Can you please send me your dev and prod app names again? It's been a bit. |
@davidmrdavid 👇 |
Hi @vany0114 - sorry to keep you waiting. OK, so the situation in the backend is a bit complicated: the scaling component is in the midst of a large upgrade that we can't delay, so we're not going to have an easy time providing you with a backend-side way to make this preview DF extension work with it. Ultimately, I think the issues in the scale controller are a big contributor to the delays you're seeing. We have 2 options: Please let me know if you prefer (1) or (2) and we'll proceed accordingly. After that, I'll review your app's performance again to see what surfaces. |
@davidmrdavid I could reset the app again; this is how the configuration currently looks:

```json
"extensions": {
  "durableTask": {
    "hubName": "ordersagav4",
    "useGracefulShutdown": true,
    "storageProvider": {
      "partitionCount": 16,
      "useLegacyPartitionManagement": false,
      "useTablePartitionManagement": true,
      "connectionStringName": "SAGA_DURABLETASK_STORAGE_CONNECTIONSTRING"
    },
    "tracing": {
      "DistributedTracingEnabled": true,
      "Version": "V2"
    }
  }
},
```

Also, can you please explain what you mean by "change the taskhub locally, and then deploy that payload"? Note: |
@davidmrdavid can you please help us review what happened yesterday ASAP? Yesterday the orchestrations experienced a huge delay, between ~5 and ~15 hours 😨... the only thing that has changed recently is that we upgraded from |
@vany0114 - I'm taking a look. In the meantime - let's get you mitigated by "resetting your app". All you need to do is two things:
I meant to change your |
@vany0114: did you reset your app? |
Not yet in production, just in our dev environment; I'm waiting for our next release, which will probably take place tomorrow. I'll keep you posted. The reason I haven't done so in production yet is that it seems the func app auto-healed on Sunday after experiencing that huge latency all Saturday long. Have you found anything so far? |
@davidmrdavid I did reset it. |
Thanks @vany0114. I've also instructed the team to reach out to you internally to see if we can work more directly to optimize your Azure Storage app. I think we wanted to do this with Netherite, but then that didn't happen once you switched back to Azure Storage. I think it may be worth pursuing a more direct engagement nonetheless, irrespective of backend, to make sure you have the right debugging tools as well as the right set of settings for performance. In any case - I'll investigate on both fronts, I'll send you an analysis tomorrow and we'll be reaching out directly via email as well. |
@vany0114: What's odd to me is that this "5 minute" ceiling seems very consistent. Just to confirm - are you by any chance sending scheduled entity signals (requests meant to fire at a particular time in an entity)? That could explain some of this consistency in the delay. In any case - I'm building an internal list of observations and will follow up with some suggestions. |
Hi @davidmrdavid , thanks for the update
Yes, we're doing so sometimes (under certain conditions) to perform a check in the next 5 minutes. Regarding the version downgrade, so far it has been working way better; the kinda-consistent 5-minute delay went away. |
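As context for the "scheduled entity signals" mentioned above, here is a minimal sketch of signalling an entity to run an operation ~5 minutes in the future with the Durable Functions 2.x in-process API; the entity name, operation name, and trigger shown are assumptions for the example, not the actual app code. Such signals sit in the control queue with a delay, which is why they can show up as "delayed" messages in the backend telemetry:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ScheduleOrderCheck
{
    [FunctionName("ScheduleOrderCheck")]
    public static Task Run(
        [ActivityTrigger] string orderId,
        [DurableClient] IDurableEntityClient client)
    {
        // Assumed entity name/operation for the example.
        var entityId = new EntityId("Order", orderId);

        // Overload that takes a scheduled (UTC) delivery time: the signal is
        // enqueued now but only becomes visible to the entity ~5 minutes later.
        return client.SignalEntityAsync(
            entityId,
            DateTime.UtcNow.AddMinutes(5),
            "CheckStatus",
            null);
    }
}
```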
Hi @vany0114. I'm happy to hear the perf is looking better - this makes sense to me, the scaling logic simply wasn't working before (it is not compatible with the DF v3 package) so things should be better with the GA package :-) . Regarding the scheduled entity signals: thanks for clarifying that you do use them. That explains the "delayed" messages I was seeing (they're delayed because they're scheduled to fire in ~5 minutes). @vany0114 - you mentioned you upgraded to the DF v3 package because of perf issues. Since you downgraded back to DF v2, have you encountered any of these perf issues again? I could really use a recent reproduction with DF v2. |
@davidmrdavid So far I haven't seen any perf issues, fortunately. We'll see how long this performance lasts; what I've seen is that, as time goes by, the response time starts degrading. |
@vany0114: and usually the perf issues you see are related to entity locking, right? In any case, as soon as you get another repro, please let me know. I think the perf issues you had with DF v3 are easily attributed to the subpar scaling you were experiencing, as well as the downstream effects of that. In the meantime, I'll look to re-review your report here (#2534 (comment)), assuming that the telemetry is still available to me. |
@davidmrdavid Most of the time, yes. The other scenario I've seen is that you kick off an orchestrator and it might take minutes to start. |
Just a heads up - this is still on my radar, I owe you a package with a small bug fix that I think could have added a few seconds on your last "entity lock took too long" report. Just have not had cycles yet to get to it. ETA: mid-next week. |
Hey @vany0114, I started working on fixing that performance bug you experienced on 6/3 with orchestrator ID

As I mentioned during our call, I had just fixed the suspicious exception you encountered in the v3 package before we met, and I was looking into backporting it to v2 just now, but it appears to me the exception is not possible in the v2 package. From inspecting your logs more closely, I noticed they call out that you were using DF v3 at the time, which explains the bug in the first place. Please let me know if this doesn't match your records. If this is the case, then there's no hotfix package I can provide (as there's no bug to hotfix for the GA package) and I'll be waiting for another report of slowdowns. Please let me know if this checks out on your end, thanks! |
Hi @davidmrdavid! That's correct, the hours-long delay issue we experienced was with the v3 package. Since we downgraded to v2 it has been working well 🤞 but I'm glad you fixed it in v3. |
@vany0114 - I assume things are still working well? Would it make sense to resolve this thread and open a new one in case a new issue occurs? |
Closing since things have been working fine with the v2 package. |
Description
The full description, with code snippets, screenshots, issue samples, etc., is here: #2530
Expected behavior
Acquire the lock in seconds at most, not minutes or hours.
Actual behavior
It seems the orchestration is getting stuck while acquiring the locks of the entities intervening in the orchestration.
Known workarounds
Reset the durable storage account and the func app storage account
App Details
If deployed to Azure
orders-saga
ordersagav2