.NET 7 application hangs during GC #80073
Comments
Tagging subscribers to this area: @dotnet/gc. Issue details are given in full in the Description section below.
@KenKozman-GE Can you share a dump or something useful so that we can investigate?
Yeah sorry... Here are the crash hang analyses: Let me see if I can push the dumps here or not (they are full memory dumps, so quite large).
Ah yep, 25MB limit. Let me try putting them somewhere that I can share, one second.
Okay here are a couple of full memory dumps (mentioned above) as well as the crash hang analyses.
One other workaround here would be to switch to an App Service Plan SKU that shows up as having multiple processors and thus uses the Server GC (which would hopefully not exhibit the same issue). Right now we are using P1V2 for some instances, which have a single vCPU.
One further update: we used to use S2 SKUs and switched to P1V2 in the middle of 2022. So this still seems like a regression from .NET 6 to .NET 7 for us, for the Workstation GC. But we would have been using the Server GC at the start of 2022 when we were still on S2 SKUs. We were on S2 SKUs during the .NET 5 timeframe as well. (S2 SKUs have 2 vCPUs, P1V2 SKUs have only 1 vCPU.)
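As an aside (not something from the original thread): a quick way to confirm which GC flavor a given instance actually ended up with is to log it at startup, since, as described above, a Server GC setting ends up running the Workstation GC when only one processor is visible. A minimal C# sketch:

```csharp
// Minimal sketch: log the GC flavor and processor count the runtime sees.
using System;
using System.Runtime;

class GcFlavorCheck
{
    static void Main()
    {
        Console.WriteLine($"Server GC:    {GCSettings.IsServerGC}");
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
        Console.WriteLine($"Processors:   {Environment.ProcessorCount}");
    }
}
```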
I took a quick look at the dump and hypothesize that it may be an infinite loop because we forgot about the possibility of pinning. There are no synchronization primitives involved in this loop, so this is pretty much a case of an infinite loop: there are no more pins in the pin queue, and we have a just-fit allocation context. We are probably stuck in that situation, and that is why the loop isn't breaking.
I don't know how to interpret this. from the debug info given above, it doesn't look like it has to do with pinning - it says we have a huge plug where every single object on that region survived. and we are asking for an additional min object size when fitting in this case. can you check to see if it's asking for an additional min_obj_size because of
It is because of
It is a gen1 GC
This is on thread
I looked at the 2 dumps and the really odd thing with both dumps is that the first region in gen0 has no free objects whatsoever, but all the other gen0 regions look normal, i.e. with free objects. this explains why we have such a huge plug in a gen1 GC, but this is not what should happen - we are supposed to have free objects as padding between alloc contexts so we don't form potentially huge plugs like this. @KenKozman-GE do you happen to have a dump where it's running normally, i.e. not getting this hang? that would be helpful.
I can make one, let me go do that, one second.
thank you!
Apologies, the download and re-upload process here is somewhat onerous (IT security goons making sure I am not downloading or uploading various secrets and/or malware, no doubt). Should be done in maybe an hour.
Random possible workaround (if it is indeed a GC regression, we shall see): we could use clrgc.dll, which keeps the .NET 6 GC behavior. I did not test that before filing the issue here, apologies for that.
Okay here is an example dump of "normal processing": dump
yep, this one looks normal. if I look at the first region in gen0, I see 513 free objects (a region is 4mb and each alloc context is about 8k). other gen0 regions look fine too. so something is causing us to not allocate these free objects in that 1st gen0 region in the bad case. via a brief code review I don't see how that can happen. I'll take another look tomorrow but I wanted to ask - would it be possible to use a private version of clrgc.dll to help with debugging this if it comes to that? you could of course try using the shipped version of clrgc.dll by setting COMPlus_GCName=clrgc.dll (which would revert back to the .net 6.0 behavior) but we'd like to figure out why the 7.0 GC is failing for you. we could share a private version of clrgc.dll that includes some instrumentation to help, and you could use it the same way. would that be feasible? thanks!
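For reference, the setting mentioned above is just an environment variable; on App Service it can be applied as an application setting, since those flow through to the worker process environment. A sketch of the configuration:

```ini
# Select the standalone GC that ships in the shared framework next to coreclr.dll.
# DOTNET_GCName is the newer alias for the same knob on .NET 6+.
COMPlus_GCName=clrgc.dll
```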
Oh yeah, we could do that I think. I only mention the clrgc.dll bit above to help me remember and in case others see this type of issue in the future. Anyway, just let me know and we can try to help however we can.
just an update - I'm doing some testing on my side for the private build and also trying to see if I can repro this. will let you know when it's ready for you to pick up.
Sounds good. We have still seen it happen once every day or two on at least one tenant. So hopefully we can get a useful dump pretty quickly.
if I make it so that it incurs an access violation when it detects this situation, would your system capture a dump? or would it be easier if I also make it hang? I usually do an AV in this situation but wanted to check if that's convenient for you.
I think an AV will be easiest. The hosting we use (Azure App Service) will capture a dump and restart it in that case. This seems nicer than just making it hang, given less potential downtime for the one instance.
cool! I have the bits ready at https://github.com/Maoni0/clrgc/tree/main/issues/80073. there's a readme that explains how to use the files (in the "How to test" section). I also included an explanation of the changes included in the private builds, plus src/symbols in case you want to look at/use them.
Thanks @Maoni0! I likely won't be able to install it and test it until next Tuesday at the earliest (vacation, etc.). But will try to get to it as soon as I am able.
that's totally fine, thanks!
Okay, sorry @Maoni0, I am just getting back to this. I've never tried to install/use a private clrgc.dll on App Service before. I say this as I see no coreclr.dll anywhere I can make changes, so I think it is just in the system area.

Also I've noticed that some optimization work we deployed Monday seems to have made it so we aren't getting those hangs anymore, likely due to just much less GC pressure. I can add back the inefficient code to a QA system and try to test there, but I'm also not sure how exactly to go about getting our bits to use the bits you made. Sounds like the clrgc.dll has to be in the same dir as coreclr.dll? I could try to publish a self-contained deployment for this, but I have never done that.
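For anyone in the same spot, a self-contained publish is the usual way to get an app-local copy of coreclr.dll (with the GC next to it) that you are allowed to swap files in; a rough sketch, assuming a Windows x64 App Service plan:

```sh
# Publish the app with its own copy of the runtime (instead of the machine-wide one).
dotnet publish -c Release -r win-x64 --self-contained true

# Then drop the private clrgc.dll next to coreclr.dll in the publish output
# and set COMPlus_GCName=clrgc.dll for that app.
```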
We are all patched with this as well (and I removed the earlier workaround config). I was wondering the same thing as @Simon-Gregory-LG as to what to expect now :)
Okay @Maoni0, we got a crash with the instrumented build. I uploaded the dump to my Google Drive and gave you access. (Side note: and I thought I had some large source files!)
oops, sorry I should have been clear - you are not expected to see an AV if the fix indeed addresses the problem. but if the fix didn't work, and it's due to the same bug, it would AV, which would put us closer to the failure so it's easier to see what happened. I just downloaded Ken's dump - this is again in verify_region, complaining that there are no regions in gen0 (the same symptom as the 1st issue I discovered that's only specific to Workstation GC), but not the same cause. I'll look more later this afternoon.
@KenKozman-GE your app is a good stress test for the GC 🙀 I think you are hitting an issue that we fixed in 8.0 - I believe the reason why it has 0 regions in gen0 is because we failed to get a new region for gen0. we are doing a sweeping gen1 GC in this case and there's plenty of space to get a new region for gen0, but due to a bug we mistakenly didn't find enough space to commit the bookkeeping data for that region, so we failed to get a new region - what we computed as available was not enough for a new region. let me make a new build with the fix for that. sorry about all the trouble!
"Inadvertent stress test" sounds like a decent band name! Just let me know when there is a fixed clrgc DLL and we can install it.
I agree :) I've ported 2 fixes from 8.0 into the latest clrgc.dll: #77480 and #80640 (it's 2 fixes because in the 1st fix we got rid of one of the 2 bookkeeping fields I mentioned above, and that's relevant to the 2nd fix).
thanks again for your patience!
@KenKozman-GE I also agree. I think there should be three albums: 'gen0', 'gen1' & 'gen2', then leave a gap to do a best-of + unreleased material called 'Unmanaged Heap'. On a side note, mine is still running fine with the first patch. Approaching 24 hours, but it's still within the window I've seen the issue occur in. I might give it another 24 hours before I switch to try the new dll.
@Maoni0 Would it be possible to have a patched libclrgc.so for linux-x64?
@rbouallou, absolutely. I just put libclrgc.so along with src and symbols at https://github.com/Maoni0/clrgc/tree/main/issues/80073/demotion_fix/v2/linux. please let me know how this works out for you.
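Presumably (the thread doesn't spell this out) the Linux setup mirrors the Windows one: put the patched libclrgc.so next to libcoreclr.so and select it by name, for example:

```sh
# Assumption: the standalone-GC selection knob works the same way on Linux.
export COMPlus_GCName=libclrgc.so   # DOTNET_GCName is the newer alias on .NET 6+
```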
Just an update from me that the first patch has been running without issues. I will now install the newer one. @Maoni0, thanks so much for the responsiveness and extremely helpful interactions on this issue so far, it's been excellent!

Patch test summary:
Health Check Metrics
Crash Monitoring Summary
@Simon-Gregory-LG that's great to hear! and thank you so much for verifying :)
@Maoni0 no problem! Lots going on today, so I forgot to post the confirmation, but here it is, all up and running and ready to enjoy the weekend:
Wait... they give you guys weekends? I've got to talk to my agent. I just looked at our guinea pig instance over here and it is still rumbling along with no crashes or infinite loops since the latest install. But I don't think our replication cycle was as reliable as @Simon-Gregory-LG's. Maybe it will hit something this weekend.
So it's still going strong with 7,0,323,56101 and no unresponsiveness or exceptions (see below), whereas our instances without the patch have gone down several times in this period. We restarted it once to deploy the second patch, but we're again at 3+ days of stability with the new patch, so this looks very promising - amazing job @Maoni0 :)!

I take it that this hotfix is going to take a while before it makes it into the main .NET 7.0 runtimes and gets deployed to all the App Services? In the interim, should we consider this patch suitable for production deployment, or should we consider rolling back to .NET 6.0 for the time being?

Health Check Metrics (Red - .NET 7.0 version deployed)
@Simon-Gregory-LG: do you guys do self-contained deployments? If so, I think the version of the framework is "baked in", so when/if this fix gets into .NET 7.0 we could just include that. If not, we have seen that there is an App Service extension that tends to be published (updated?) on/about when the latest version drop comes (e.g. 7.0.2 came out on Jan 10, 2023). Adding that extension has let us get the latest/greatest, as the App Service included runtime seems to lag by a month or two (for testing and validation and whatnot, I assume).
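For completeness, the project-file way to make a deployment self-contained (the CLI flags shown earlier do the same thing) might look like this; the RID is an assumption for a Windows x64 App Service plan:

```xml
<PropertyGroup>
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
  <SelfContained>true</SelfContained>
  <!-- Server GC is requested here, but as noted above it ends up as Workstation GC on a 1-vCPU instance. -->
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
```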
we will backport all fixes to 7.0 (including the fix for the infinite loop in allocate_in_condemned_generations). meanwhile, @Simon-Gregory-LG, the general recommendation is you shouldn't be running private builds. however, since this is a build you need to use a config to invoke, which means you can control exactly which processes use it, and I did include all the necessary debug info (src+symbols) should something happen, I think you could use this till the 7.0 servicing release comes out, as long as you are comfortable with it.
We've been running with the patched libclrgc.so for just over a week now and it's still running with no issues - the process would hang every ~2 days with the default runtime. @Maoni0 You mentioned fixes will be backported to 7.0. Do you have any idea when the release will come out?
@Maoni0 The next release for version 7.0 is planned for March 14th. Any chance these fixes will be part of it?
@manandre the reason why I haven't checked the fix in is because I've been running more stress tests and hit multiple problems with that - I hit some 8.0 problems (not in GC) when I tried to run stress on 8.0, so I ran this on 7.0 and I did hit another issue in the vicinity, so I'd like to make a fix for that as well. should have it this week.
Hi @Maoni0, just wanted to quickly check if this fix has made its way into .NET 7 yet, and if so, which version? I see this is referenced above in the PRs which were merged in April, but they look to be tagged with 7.0.7. So am I right in thinking that it's probably not in quite yet, or did the fixes make it into the 7.0.5 release that's publicly available from April? (https://versionsof.net/core/7.0/)
This fix should be available in 7.0.7, released last week: https://devblogs.microsoft.com/dotnet/june-2023-updates/
@mangod9 oh fantastic, hopefully that'll make its way onto my Azure App Service soon. I'll keep an eye out for that. Thanks for the quick reply! :)
Closing since the fix has been made.
Description
We host our application in Azure App Service in the host dependent mode.
Each instance of our application can end up getting stuck in a busy loop where it eats 100% of the CPU and no other HTTP calls are able to make it into the app. This only started happening when we upgraded to .NET 7.
It looks like this is due to one thread being stuck performing a GC, so all the other managed threads are paused.
Our application has health checks that run against it. One of those checks runs about once per second. This check ends up querying a PostgreSQL database and pulling down records (tens to thousands). This is the place we have seen the hang occur, although this same code can be called by other parts of the system; likely it is just the health check that triggers it, due to the much higher frequency of that call.
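Purely for illustration (the issue doesn't include the real code), a health check of the shape described, an ASP.NET Core IHealthCheck that queries PostgreSQL via Npgsql, might look roughly like this; the class name, query, and table are hypothetical:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Npgsql;

// Hypothetical stand-in for the frequently-run health check described above.
public sealed class RecordsHealthCheck : IHealthCheck
{
    private readonly string _connectionString;

    public RecordsHealthCheck(string connectionString) => _connectionString = connectionString;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        await using var conn = new NpgsqlConnection(_connectionString);
        await conn.OpenAsync(cancellationToken);

        // The real check pulls tens to thousands of rows; this stand-in just counts them.
        await using var cmd = new NpgsqlCommand("SELECT count(*) FROM records", conn);
        var count = (long)(await cmd.ExecuteScalarAsync(cancellationToken) ?? 0L);

        return HealthCheckResult.Healthy($"records visible: {count}");
    }
}
```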
Here are a few of the unmanaged portions of the GC allocation stacks from two separate dumps taken during the high-CPU periods (this is me doing dump debugging in Visual Studio 2022):
I believe it is running the Workstation GC based on the WKS:: namespace (and because the other threads are frozen waiting on the GC, and because the GC is being performed directly on the user thread). I believe this is happening because the App Service plan instance shows up as having 1 processor. We specify to use the Server GC in our configuration, but I believe it will use the Workstation GC if only one processor is detected.

In one case I generated three dumps 5 minutes apart. In all three cases the managed code had not moved and was in the exact same place. The second unmanaged stack above is the top of the stack in those cases (although the very topmost function changes in each, which is why I thought allocate_in_condemned_generations was the culprit). Here is the managed portion of the stack from those dumps:
Here are some outputs from the CrashHangAnalysis MHT which AppService also kindly created (this is from one of the three dumps mentioned):
As you can see in the thread times, almost all of the CPU time was eaten up by Thread 48. In each successive 5-minute dump you can see it has eaten up 5 more minutes of CPU time:
Here is the full call stack for Thread 48 reported by the CrashHangAnalysis:
There are also multiple CLR runtimes loaded in the process. I believe that App Service itself loads .NET Framework 4.8; we are using .NET 7.0.0:
I attempted to find this issue reported already but my searching turned up nothing :( Apologies if I missed it.
Reproduction Steps
At this point I don't have a way to reproduce it reliably. One of our systems seems to be falling into this particular hole once a day or so. The SQL query on that system is a bit larger than other systems, but not massively so.
Expected behavior
The GC completes its work and returns control to our application.
Actual behavior
It appears the GC gets stuck in WKS::gc_heap::allocate_in_condemned_generations. I see a retry label in there; perhaps it is stuck in an infinite loop?
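For intuition only (this is a toy, not the runtime's actual C++ code): a retry label turns into an infinite loop whenever some path through the loop neither exits nor changes any state the next iteration depends on, for example:

```csharp
// Toy C# illustration of a retry loop that can spin forever.
static int FitIntoFirstGap(int size, int[] gaps)
{
    var i = 0;
    while (true)                       // stands in for a "retry:" label
    {
        if (i < gaps.Length && gaps[i] >= size)
            return i;                  // fits: done
        if (i < gaps.Length)
        {
            i++;                       // progress: next iteration sees new state
            continue;
        }
        // nothing left to try and nothing changed -> every later iteration is identical
    }
}
```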
Regression?
I believe so. We never saw this on .NET 6 (several sub-versions) and .NET 5. We ran similar code with earlier versions of .NET Core and never saw any issues either.
Known Workarounds
Perhaps downgrading to .NET 6?
It is uncertain if this would happen with .NET 7.0.1; I have not gotten to test that yet.
It is unclear if this is a workstation GC issue only; my guess is yes.
Configuration
.NET 7.0.0 (in the dumps this shows up as CLR version v7.0.22).
Azure App Service Platform version: 99.0.7.620
Windows Server 2016 - 14393
Number of Processors - 1
Architecture: x64
Other information
It seems like some sort of infinite-loop issue in the gc_heap::allocate_in_condemned_generations function. But that is me using my "Jump to Conclusions" mat.