.NET 7 application hangs during GC #80073
Comments
Tagging subscribers to this area: @dotnet/gc. Issue details are given in full in the Description section below.
@KenKozman-GE Can you share a dump or something useful so that we can investigate?
Yeah sorry... Here are the crash hang analyses: Let me see if I can push the dumps here or not (they are full memory dumps, so quite large).
Ah yep, 25MB limit. Let me try putting them somewhere that I can share, one second.
Okay here are a couple of full memory dumps (mentioned above) as well as the crash hang analyses.
One other workaround here would be to switch to an App Service Plan SKU that shows up as having multiple processors and thus uses the Server GC (which would hopefully not exhibit the same issue). Right now we are using P1V2 for some instances, which have a single vCPU.
One further update: we used to use S2 SKUs and switched to P1V2 in the middle of 2022. So this still seems like a regression from .NET 6 to .NET 7 for us, for the Workstation GC. But we would have been using the Server GC at the start of 2022 when we were still on S2 SKUs. We were on S2 SKUs during the .NET 5 timeframe as well. (S2 SKUs have 2 vCPUs, P1V2 SKUs have only 1 vCPU.)
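As an aside (not something from the original thread): a quick way to confirm which GC flavor a given instance actually ended up with is to log it at startup, since, as described above, a Server GC setting ends up running the Workstation GC when only one processor is visible. A minimal C# sketch:

```csharp
// Minimal sketch: log the GC flavor and processor count the runtime sees.
using System;
using System.Runtime;

class GcFlavorCheck
{
    static void Main()
    {
        Console.WriteLine($"Server GC:    {GCSettings.IsServerGC}");
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
        Console.WriteLine($"Processors:   {Environment.ProcessorCount}");
    }
}
```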
I took a quick look at the dump and hypothesize that it may be an infinite loop because we forgot about the possibility of pinning. There are no synchronization primitives involved in this loop, so this is pretty much a case of an infinite loop: there are no more pins in the pin queue, and we have a just-fit allocation context. We are probably stuck in that situation, and that is why the loop isn't breaking.
I don't know how to interpret this. from the debug info given above, it doesn't look like it has to do with pinning - it says we have a huge plug where every single object on that region survived. and we are asking for an additional min object size when fitting in this case. can you check to see if it's asking for an additional min_obj_size because of
It is because of
It is a gen1 GC
This is on thread
I looked at the 2 dumps and the really odd thing with both dumps is that the first region in gen0 has no free objects whatsoever, but all the other gen0 regions look normal, i.e. with free objects. this explains why we have such a huge plug in a gen1 GC, but this is not what should happen - we are supposed to have free objects as padding between alloc contexts so we don't form potentially huge plugs like this. @KenKozman-GE do you happen to have a dump where it's running normally, i.e. not getting this hang? that would be helpful.
I can make one, let me go do that, one second.
thank you!
Apologies, the download and re-upload process here is somewhat onerous (IT security goons making sure I am not downloading or uploading various secrets and/or malware, no doubt). Should be done in maybe an hour.
Random possible workaround (if it is indeed a GC regression, we shall see): we could use clrgc.dll, which keeps the .NET 6 GC behavior. I did not test that before filing the issue here, apologies for that.
Okay here is an example dump of "normal processing": dump
yep, this one looks normal. if I look at the first region in gen0, I see 513 free objects (a region is 4mb and each alloc context is about 8k). other gen0 regions look fine too. so something is causing us to not allocate these free objects in that 1st gen0 region in the bad case. via a brief code review I don't see how that can happen. I'll take another look tomorrow but I wanted to ask - would it be possible to use a private version of clrgc.dll to help with debugging this if it comes to that? you could of course try using the shipped version of clrgc.dll by setting COMPlus_GCName=clrgc.dll (which would revert back to the .net 6.0 behavior) but we'd like to figure out why the 7.0 GC is failing for you. we could share a private version of clrgc.dll that includes some instrumentation to help, and you could use it the same way. would that be feasible? thanks!
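For reference, the setting mentioned above is just an environment variable; on App Service it can be applied as an application setting, since those flow through to the worker process environment. A sketch of the configuration:

```ini
# Select the standalone GC that ships in the shared framework next to coreclr.dll.
# DOTNET_GCName is the newer alias for the same knob on .NET 6+.
COMPlus_GCName=clrgc.dll
```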
Oh yeah, we could do that I think. I only mention the clrgc.dll bit above to help me remember and in case others see this type of issue in the future. Anyway, just let me know and we can try to help however we can.
just an update - I'm doing some testing on my side for the private build and also trying to see if I can repro this. will let you know when it's ready for you to pick up.
Sounds good. We have still seen it happen once every day or two on at least one tenant. So hopefully we can get a useful dump pretty quickly.
if I make it so that it incurs an access violation when it detects this situation, would your system capture a dump? or would it be easier if I also make it hang? I usually do an AV in this situation but wanted to check if that's convenient for you.
I think an AV will be easiest. The hosting we use (Azure App Service) will capture a dump and restart it in that case. This seems nicer than just making it hang, given less potential downtime for the one instance.
cool! I have the bits ready at https://github.com/Maoni0/clrgc/tree/main/issues/80073. there's a readme that explains how to use the files (in the "How to test" section). I also included an explanation of the changes included in the private builds, plus src/symbols in case you want to look at/use them.
Thanks @Maoni0! I likely won't be able to install it and test it until next Tuesday at the earliest (vacation, etc.). But will try to get to it as soon as I am able.
that's totally fine, thanks!
Okay, sorry @Maoni0, I am just getting back to this. I've never tried to install/use a private clrgc.dll on App Service before. I say this as I see no coreclr.dll anywhere I can make changes, so I think it is just in the system area.

Also I've noticed that some optimization work we deployed Monday seems to have made it so we aren't getting those hangs anymore, likely due to just much less GC pressure. I can add back the inefficient code to a QA system and try to test there, but I'm also not sure how exactly to go about getting our bits to use the bits you made. Sounds like the clrgc.dll has to be in the same dir as coreclr.dll? I could try to publish a self-contained deployment for this, but I have never done that.
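For anyone in the same spot, a self-contained publish is the usual way to get an app-local copy of coreclr.dll (with the GC next to it) that you are allowed to swap files in; a rough sketch, assuming a Windows x64 App Service plan:

```sh
# Publish the app with its own copy of the runtime (instead of the machine-wide one).
dotnet publish -c Release -r win-x64 --self-contained true

# Then drop the private clrgc.dll next to coreclr.dll in the publish output
# and set COMPlus_GCName=clrgc.dll for that app.
```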
We are all patched with this as well (and I removed the earlier workaround config). I was wondering the same thing as @Simon-Gregory-LG as to what to expect now :)
Okay @Maoni0, we got a crash with the instrumented build. I uploaded the dump to my Google Drive and gave you access. (Side note: and I thought I had some large source files!)
oops, sorry I should have been clear - you are not expected to see an AV if the fix indeed addresses the problem. but if the fix didn't work, and it's due to the same bug, it would AV, which would put us closer to the failure so it's easier to see what happened. I just downloaded Ken's dump - this is again in verify_region, complaining that there are no regions in gen0 (the same symptom as the 1st issue I discovered that's only specific to Workstation GC), but not the same cause. I'll look more later this afternoon.
@KenKozman-GE your app is a good stress test for the GC 🙀 I think you are hitting an issue that we fixed in 8.0 - I believe the reason why it has 0 regions in gen0 is because we failed to get a new region for gen0. we are doing a sweeping gen1 GC in this case and there's plenty of space to get a new region for gen0, but due to a bug we mistakenly didn't find enough space to commit the bookkeeping data for that region, so we failed to get a new region - what we computed as available was not enough for a new region. let me make a new build with the fix for that. sorry about all the trouble!
"Inadvertent stress test" sounds like a decent band name! Just let me know when there is a fixed clrgc DLL and we can install it.
I agree :) I've ported 2 fixes from 8.0 into the latest clrgc.dll: #77480 and #80640 (it's 2 fixes because in the 1st fix we got rid of one of the 2 bookkeeping fields I mentioned above, and that's relevant to the 2nd fix).
thanks again for your patience!
@KenKozman-GE I also agree. I think there should be three albums: 'gen0', 'gen1' & 'gen2', then leave a gap to do a best-of + unreleased material called 'Unmanaged Heap'. On a side note, mine is still running fine with the first patch. Approaching 24 hours, but it's still within the window I've seen the issue occur in. I might give it another 24 hours before I switch to try the new dll.
@Maoni0 Would it be possible to have a patched libclrgc.so for linux-x64?
@rbouallou, absolutely. I just put libclrgc.so along with src and symbols at https://github.com/Maoni0/clrgc/tree/main/issues/80073/demotion_fix/v2/linux. please let me know how this works out for you.
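Presumably (the thread doesn't spell this out) the Linux setup mirrors the Windows one: put the patched libclrgc.so next to libcoreclr.so and select it by name, for example:

```sh
# Assumption: the standalone-GC selection knob works the same way on Linux.
export COMPlus_GCName=libclrgc.so   # DOTNET_GCName is the newer alias on .NET 6+
```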
Just an update from me that the first patch has been running without issues. I will now install the newer one. @Maoni0, thanks so much for the responsiveness and extremely helpful interactions on this issue so far, it's been excellent!

Patch test summary:
Health Check Metrics
Crash Monitoring Summary
@Simon-Gregory-LG that's great to hear! and thank you so much for verifying :)
@Maoni0 no problem! Lots going on today, so I forgot to post the confirmation, but here it is, all up and running and ready to enjoy the weekend:
Wait... they give you guys weekends? I've got to talk to my agent. I just looked at our guinea pig instance over here and it is still rumbling along with no crashes or infinite loops since the latest install. But I don't think our replication cycle was as reliable as @Simon-Gregory-LG's. Maybe it will hit something this weekend.
So it's still going strong with 7,0,323,56101 and no unresponsiveness or exceptions (see below), whereas our instances without the patch have gone down several times in this period. We restarted it once to deploy the second patch, but we're again at 3+ days of stability with the new patch, so this looks very promising - amazing job @Maoni0 :)!

I take it that this hotfix is going to take a while before it makes it into the main .NET 7.0 runtimes and gets deployed to all the App Services? In the interim, should we consider this patch suitable for production deployment, or should we consider rolling back to .NET 6.0 for the time being?

Health Check Metrics (Red - .NET 7.0 version deployed)
@Simon-Gregory-LG: do you guys do self-contained deployments? If so, I think the version of the framework is "baked in", so when/if this fix gets into .NET 7.0 we could just include that. If not, we have seen that there is an App Service extension that tends to be published (updated?) on/about when the latest version drop comes (e.g. 7.0.2 came out on Jan 10, 2023). Adding that extension has let us get the latest/greatest, as the App Service included runtime seems to lag by a month or two (for testing and validation and whatnot, I assume).
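For completeness, the project-file way to make a deployment self-contained (the CLI flags shown earlier do the same thing) might look like this; the RID is an assumption for a Windows x64 App Service plan:

```xml
<PropertyGroup>
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
  <SelfContained>true</SelfContained>
  <!-- Server GC is requested here, but as noted above it ends up as Workstation GC on a 1-vCPU instance. -->
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
```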
we will backport all fixes to 7.0 (including the fix for the infinite loop in allocate_in_condemned_generations). meanwhile, @Simon-Gregory-LG, the general recommendation is you shouldn't be running private builds. however, since this is a build you need to use a config to invoke, which means you can control exactly which processes use it, and I did include all the necessary debug info (src+symbols) should something happen, I think you could use this till the 7.0 servicing release comes out, as long as you are comfortable with it.
We've been running with the patched libclrgc.so for just over a week now and it's still running with no issues - the process would hang every ~2 days with the default runtime. @Maoni0 You mentioned fixes will be backported to 7.0. Do you have any idea when the release will come out?
@Maoni0 The next release for version 7.0 is planned for March 14th. Any chance these fixes will be part of it?
@manandre the reason why I haven't checked the fix in is because I've been running more stress tests and hit multiple problems with that - I hit some 8.0 problems (not in GC) when I tried to run stress on 8.0, so I ran this on 7.0 and I did hit another issue in the vicinity, so I'd like to make a fix for that as well. should have it this week.
Hi @Maoni0, just wanted to quickly check if this fix has made its way into .NET 7 yet, and if so, which version? I see this is referenced above in the PRs which were merged in April, but they look to be tagged with 7.0.7. So am I right in thinking that it's probably not in quite yet, or did the fixes make it into the 7.0.5 release that's publicly available from April? (https://versionsof.net/core/7.0/)
This fix should be available in 7.0.7, released last week: https://devblogs.microsoft.com/dotnet/june-2023-updates/
@mangod9 oh fantastic, hopefully that'll make its way onto my Azure App Service soon. I'll keep an eye out for that. Thanks for the quick reply! :)
Closing since the fix has been made.
Description
We host our application in Azure App Service in the host dependent mode.
Each instance of our application can end up getting stuck in a busy loop where it eats 100% of the CPU and no other HTTP calls are able to make it into the app. This only started happening when we upgraded to .NET 7.
It looks like this is due to one thread being stuck performing a GC, so all the other managed threads are paused.
Our application has health checks that run against it. One of those checks runs about once per second. This check ends up querying a PostgreSQL database and pulling down records (tens to thousands). This is the place we have seen the hang occur, although this same code can be called by other parts of the system; likely it is just the health check that triggers it, due to the much higher frequency of that call.
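Purely for illustration (the issue doesn't include the real code), a health check of the shape described, an ASP.NET Core IHealthCheck that queries PostgreSQL via Npgsql, might look roughly like this; the class name, query, and table are hypothetical:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Npgsql;

// Hypothetical stand-in for the frequently-run health check described above.
public sealed class RecordsHealthCheck : IHealthCheck
{
    private readonly string _connectionString;

    public RecordsHealthCheck(string connectionString) => _connectionString = connectionString;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        await using var conn = new NpgsqlConnection(_connectionString);
        await conn.OpenAsync(cancellationToken);

        // The real check pulls tens to thousands of rows; this stand-in just counts them.
        await using var cmd = new NpgsqlCommand("SELECT count(*) FROM records", conn);
        var count = (long)(await cmd.ExecuteScalarAsync(cancellationToken) ?? 0L);

        return HealthCheckResult.Healthy($"records visible: {count}");
    }
}
```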
Here are a few of the unmanaged portions of the GC allocation stacks from two separate dumps taken during the high-CPU periods (this is me doing dump debugging in Visual Studio 2022):
I believe it is running the Workstation GC based on the WKS:: namespace (and because the other threads are frozen waiting on the GC, and because the GC is being performed directly on the user thread). I believe this is happening because the App Service plan instance shows up as having 1 processor. We specify to use the Server GC in our configuration, but I believe it will use the Workstation GC if only one processor is detected.

In one case I generated three dumps 5 minutes apart. In all three cases the managed code had not moved and was in the exact same place. The second unmanaged stack above is the top of the stack in those cases (although the very topmost function changes in each, which is why I thought allocate_in_condemned_generations was the culprit). Here is the managed portion of the stack from those dumps:
Here are some outputs from the CrashHangAnalysis MHT which AppService also kindly created (this is from one of the three dumps mentioned):
As you can see in the thread times, almost all of the CPU time was eaten up by Thread 48. In each successive 5-minute dump you can see it has eaten up 5 more minutes of CPU time:
Here is the full call stack for Thread 48 reported by the CrashHangAnalysis:
There are also multiple CLR runtimes loaded in the process. I believe that App Service itself loads .NET Framework 4.8; we are using .NET 7.0.0:
I attempted to find this issue reported already but my searching turned up nothing :( Apologies if I missed it.
Reproduction Steps
At this point I don't have a way to reproduce it reliably. One of our systems seems to be falling into this particular hole once a day or so. The SQL query on that system is a bit larger than other systems, but not massively so.
Expected behavior
The GC completes its work and returns control to our application.
Actual behavior
It appears the GC gets stuck in WKS::gc_heap::allocate_in_condemned_generations. I see a retry label in there; perhaps it is stuck in an infinite loop?
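For intuition only (this is a toy, not the runtime's actual C++ code): a retry label turns into an infinite loop whenever some path through the loop neither exits nor changes any state the next iteration depends on, for example:

```csharp
// Toy C# illustration of a retry loop that can spin forever.
static int FitIntoFirstGap(int size, int[] gaps)
{
    var i = 0;
    while (true)                       // stands in for a "retry:" label
    {
        if (i < gaps.Length && gaps[i] >= size)
            return i;                  // fits: done
        if (i < gaps.Length)
        {
            i++;                       // progress: next iteration sees new state
            continue;
        }
        // nothing left to try and nothing changed -> every later iteration is identical
    }
}
```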
Regression?
I believe so. We never saw this on .NET 6 (several sub-versions) and .NET 5. We ran similar code with earlier versions of .NET Core and never saw any issues either.
Known Workarounds
Perhaps downgrading to .NET 6?
It is uncertain if this would happen with .NET 7.0.1; I have not gotten to test that yet.
It is unclear if this is a workstation GC issue only; my guess is yes.
Configuration
.NET 7.0.0 (in the dumps this shows up as CLR version v7.0.22).
Azure App Service Platform version: 99.0.7.620
Windows Server 2016 - 14393
Number of Processors - 1
Architecture: x64
Other information
It seems like some sort of infinite-loop issue in the gc_heap::allocate_in_condemned_generations function. But that is me using my "Jump to Conclusions" mat.