Reduce time in .NET ThreadPool's WorkStealingQueue.TrySteal and ThreadPoolWorkQueue.Dequeue methods #10752
Comments
dotnet/coreclr#18403 would naturally fix this problem because it would cut the number of work-stealing queues to the number of processors, which is modest.
There were some attempts at reducing the aggressiveness of checking other threads' local queues, but they didn't pan out; see https://github.com/dotnet/coreclr/issues/14017 and the items linked from it. My understanding, as described in that issue, is that the checking is currently aggressive to compensate for the necessary synchronization, which, when added, decreases work-item throughput. I was also thinking about dotnet/coreclr#18403, but some issues there are that threads can move from core to core, and checking which core a thread is currently on looks like it would need a p/invoke, which would likely regress work-item throughput. Another idea is a linked list of the thread-local queues that currently contain items, where each thread links/unlinks its queue as it goes from empty to non-empty or vice versa (a rough sketch follows). It would probably regress short-burst throughput (which is not uncommon), but it would hopefully replace the O(n) scan with a lock and O(1) work.
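A minimal sketch of that linked-list idea, assuming hypothetical names (NonEmptyQueueRegistry and the WorkStealingQueue stub are illustrative, not actual ThreadPool internals):

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the "linked list of non-empty queues" idea above;
// all names here are illustrative, not actual ThreadPool internals.
internal sealed class WorkStealingQueue { /* the owner's deque of work items */ }

internal static class NonEmptyQueueRegistry
{
    private static readonly LinkedList<WorkStealingQueue> s_nonEmpty = new();
    private static readonly object s_lock = new();

    // Owner thread: call when the local queue goes empty -> non-empty.
    public static LinkedListNode<WorkStealingQueue> Register(WorkStealingQueue q)
    {
        lock (s_lock) return s_nonEmpty.AddLast(q);   // O(1) under the lock
    }

    // Owner thread: call when the local queue drains back to empty.
    public static void Unregister(LinkedListNode<WorkStealingQueue> node)
    {
        lock (s_lock) s_nonEmpty.Remove(node);        // O(1) under the lock
    }

    // Stealer: snapshot only the queues that claim to have work, instead
    // of scanning every thread's queue (the O(n) behavior in this issue).
    public static WorkStealingQueue[] SnapshotCandidates()
    {
        lock (s_lock)
        {
            var arr = new WorkStealingQueue[s_nonEmpty.Count];
            s_nonEmpty.CopyTo(arr, 0);
            return arr;
        }
    }
}
```

The trade-off noted above shows up in Register/Unregister: short bursts pay for the lock on every empty/non-empty transition, in exchange for stealers doing O(1) expected work.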
This is made worse by the fact that more than one thread (up to procCount threads) is requested for each work item, for the same reason.
This came up in some discussions, so we will determine the feasibility of doing this in .NET 5.
This problem is exacerbated on multi-core machines, and the impact grows super-linearly as we double the number of cores. This stack is taking 2% of CPU on 4-core machines in Microsoft Graph and >6% of CPU on 8-core machines. Moving from 4 cores to 8 cores doubles MinWorkerThreads, but the lock contention is about 3 times worse. We use MinWorkerThreads=100 (per core). Based on thread-count metrics, I think we might have some room to adjust this value down as a workaround to soften some of the impact, but we can only go down so much. This likely impacts many high-throughput services. Does it not make the cut for .NET 6, @StephenBonikowsky @stephentoub?
@joaocpaiva #44265 may have helped with this?
@benaadams Yes, but that fix was applied to ConcurrentQueueSegment; the ThreadPoolWorkQueue implementation seems to use an internal QueueSegment implementation, which still has the busy wait.
As expected, setting MinWorkerThreads from 100 down to 50 brought the cost down to about 3% (as opposed to 6%). We don't have more room to reduce it further without risking issues and performance degradation during bursty workloads.
This is no longer a partner ask.
A number of different users have noted a large amount of time spent in the .NET ThreadPool's WorkStealingQueue.TrySteal method (being called from the ThreadPoolWorkQueue.Dequeue method).
From what we can tell, the scenario that causes this is bursty workloads. For bursty workloads our guidance is to set MinWorkerThreads high enough that there are threads available to handle the burst. For high-scale machines (e.g. 16 procs), it is not uncommon to set this minimum in the 160-320 thread range.
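For reference, that minimum is raised through the ThreadPool.SetMinThreads API; a minimal sketch of the guidance above (the 10-threads-per-proc multiplier is just an illustrative value from the 160-320 range, not a recommendation from this issue):

```csharp
using System;
using System.Threading;

class MinThreadsSetup
{
    static void Main()
    {
        // On a high-scale machine (e.g. 16 procs), set the worker minimum
        // high enough to absorb bursts (160-320 threads in the range above).
        ThreadPool.GetMinThreads(out int worker, out int ioc);
        int desiredWorkers = Environment.ProcessorCount * 10; // e.g. 16 * 10 = 160
        if (!ThreadPool.SetMinThreads(desiredWorkers, ioc))
            Console.WriteLine("SetMinThreads rejected the requested value.");
    }
}
```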
When a burst arrives (let's say it needs 100 threads to do the work), those 100 threads do the work and then call Dequeue to get the next work item. But the burst is over, so they don't find any work left, and they go through a loop in the ThreadPoolWorkQueue.Dequeue method looking for work to steal from other threads (which will fail).
Thus you have 100 threads each spinning through 160-320 worker-thread queues looking for more work, requiring 16K to 32K checks. These threads 'fight' over the memory used to check that the queues are empty, so even though each check is short, it consumes a non-trivial amount of CPU time. If these bursts come frequently (e.g. every 10-100 msec), the CPU cost adds up.
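To make the cost concrete, here is a simplified sketch of that scan, using stand-in types and method names rather than the actual runtime source:

```csharp
// Simplified sketch of the dequeue/steal scan described above; the types
// and method names are illustrative stand-ins, not the runtime source.
internal sealed class WorkStealingQueue
{
    public object? LocalPop() => null;   // placeholder: pop from own queue
    public object? TrySteal() => null;   // placeholder: steal from this queue
}

internal static class DequeueSketch
{
    // Every idle thread runs roughly this loop when its own queue is empty.
    internal static object? Dequeue(WorkStealingQueue[] allQueues, WorkStealingQueue myQueue)
    {
        object? item = myQueue.LocalPop();  // fast path: the thread's own queue
        if (item != null) return item;

        // Slow path: scan EVERY other thread's queue. With ~100 idle threads
        // and 160-320 queues that is 16K-32K checks, all reading the same
        // head/tail memory, which is where the CPU time goes.
        foreach (WorkStealingQueue q in allQueues)
        {
            if (ReferenceEquals(q, myQueue)) continue;
            item = q.TrySteal();
            if (item != null) return item;
        }
        return null; // the burst is over, so the whole O(n) scan fails
    }
}
```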
Here is where we see the CPU time spent (this is on the Desktop framework, but the code is very similar for .NET Core). Here is the code in Dequeue.
And here we see the hot code in TrySteal (that 141K, broken down).
In .NET Core the code is a bit different because we have created a helper called 'CanSteal' that checks m_headIndex < m_tailIndex, and we call this helper in Dequeue before calling TrySteal. This helps cut the cost per iteration, but it does not change the fact that we are doing an O(n) operation and have to 'fight' over the memory backing the m_headIndex and m_tailIndex variables. Thus on .NET Core the problem will not show up in TrySteal and should be less severe, but it is probably still problematic.
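A sketch of that fast-path check as described (field names follow the description above; this is not the verbatim runtime source):

```csharp
// Sketch of the .NET Core fast-path check described above; the field names
// mirror the description, but this is not the verbatim runtime source.
internal sealed class WorkStealingQueue
{
    private volatile int m_headIndex; // stealers take from the head
    private volatile int m_tailIndex; // the owner pushes/pops at the tail

    // Cheap pre-check used by Dequeue before paying for TrySteal.
    // It avoids the interlocked work of a real steal attempt, but every
    // scanning thread still reads the same head/tail cache lines, so the
    // O(n) memory 'fight' remains.
    public bool CanSteal => m_headIndex < m_tailIndex;
}
```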
To really fix the problem we need to be less aggressive about checking for work to steal. Ideally we still do some checking, but we want to be much less aggressive when we know that other threads will shortly come along and do a more aggressive check themselves. This avoids the O(n) behavior, which is the fundamental problem.
The solution probably looks like only looking for work to steal in a subset of the queues unless we have been asked to be 'aggressive', and we become aggressive only after a certain amount of time.
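A hypothetical sketch of that policy, in the spirit of the description above (the names, the subset size of 4, and the 5 ms threshold are all assumptions, not values from the runtime):

```csharp
using System;

// Hypothetical sketch of the proposal above: check only a small subset of
// queues by default, and scan aggressively (all queues) only after we have
// been idle for a while. Names and thresholds are illustrative assumptions.
internal sealed class WorkStealingQueue
{
    public bool CanSteal => false;      // placeholder emptiness pre-check
    public object? TrySteal() => null;  // placeholder steal attempt
}

internal static class SubsetStealSketch
{
    internal static object? TrySteal(WorkStealingQueue[] allQueues, Random rng, long idleSinceMs)
    {
        // Become aggressive only after some idle time; the 5 ms threshold
        // is an assumption, not a value from this issue.
        bool aggressive = Environment.TickCount64 - idleSinceMs > 5;
        int attempts = aggressive ? allQueues.Length : Math.Min(4, allQueues.Length);

        for (int i = 0; i < attempts; i++)
        {
            // A random subset keeps each idle thread's cost O(1); other
            // threads will shortly run their own checks, covering the rest.
            WorkStealingQueue q = allQueues[rng.Next(allQueues.Length)];
            if (q.CanSteal)
            {
                object? item = q.TrySteal();
                if (item != null) return item;
            }
        }
        return null;
    }
}
```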
@kouvel @stephentoub