Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce time in .NET Threadpool's WorkStealingQueue.TrySteal method and ThreadPoolWorkQueue.Dequeue methods. #10752

Open
vancem opened this issue Jul 23, 2018 · 9 comments

Comments

@vancem
Copy link
Contributor

vancem commented Jul 23, 2018

A number of different users have noted a large amount of time in the .NET ThreadPool WorkStealingQueue.TrySteal method (Being called from ThredPoolWorkQueue.Dequeue method).

From what we can tell the scenario that causes is bursty workloads. For bursty workloads our guidance is to set a MinWorkerThreads high enough so that there are threads available to handle the burst. For high scale machines (e.g. 16 Proc), it is not uncommon then to set this minimum in the 160-320 thread range.

When a burst (lets say it needs 100 threads to do the work), then those 100 threads do the work, then calle Dequeue to get the next work. However the burst is over, and thus they all don't find any work left, and go through a loop lin the ThredPoolWorkQueue.Dequeue method to find work to steal from other threads. (which will fail).

Thus you have 100 threads spinning through 160-320 worker threads looking for more work, thus requring 16K to 32K checks. These threads 'fight' over the memory to check that the queues are empty, and thus even though the check is short, it consumes a non-trivial amount of CPU time. If these bursts come frequently (e.g. every 10-100 msec), then the CPU adds up.

Here is where we see the CPU time spent (This is on Desktop framework, but the code is very similar for .NET core). Here is the code in Dequeue.

                   if (null == callback)
                   {
100.0 |                WorkStealingQueue[] otherQueues = allThreadQueues.Current;
  1.9K|                int i = tl.random.Next(otherQueues.Length);
                       int c = otherQueues.Length;
                       while (c > 0)
                       {
 49.1K|                    WorkStealingQueue otherQueue = Volatile.Read(ref otherQueues[i % otherQueues.Length]);
141.2K|                    if (otherQueue != null &&
                               otherQueue != wsq &&
                               otherQueue.TrySteal(out callback, ref missedSteal))

And here we see the hot code in TrySteal (That 141K, broken down)

                   private bool TrySteal(out IThreadPoolWorkItem obj, ref bool missedSteal, int millisecondsTimeout)
                   {
 23.8K|                obj = null;
       
                       while (true)
                       {
 92.5K|                    if (m_headIndex >= m_tailIndex)
  8.1K|                        return false;

In .NET Core the code is a bit different because we have created a helper called 'CanSteal' that does m_headIndex >= m_tailIndex, and we call this helper in Dequeue before calling TrySteal. This helps cut the cost per iteration, but does not mitigate the fact that we are doing an O(n) operation, and we have to 'fight' over the memory representing m_headIndex and m_tailIndex variables) Thus on .NET Core the problem will not show up in TrySteal and should be less severe, but probably still problematic.

To really fix the problem we need to be less aggressive about checking for stealing. Ideally want to do some checking, but we want to be much less aggressive if we know that other threads will shortly come along and do a more aggressive check. This avoids O(N) behavior, which is the fundamental problem.

The solutions probably looks like only looking for work to steal for a subset unless we have been asked to be 'aggressive'. We are aggressive only after a certain amount of time.

@kouvel @stephentoub

@vancem
Copy link
Contributor Author

vancem commented Jul 25, 2018

dotnet/coreclr#18403 would naturally fix this problem because it would cut the number of work stealing queues to the number of processors which is modest.

@kouvel
Copy link
Member

kouvel commented Jul 25, 2018

There were some attempts at fixing the aggressiveness of checking other threads' local queues, but they didn't pan out. See https://github.com/dotnet/coreclr/issues/14017 and items linked to it. My understanding as described in that issue is that it is currently aggressive to compensate for necessary synchronization, which when added, decreases work item throughput.

I was thinking about dotnet/coreclr#18403 but to me some issues are that threads can move from core to core, and checking which core it's currently on looks like it would need a p/invoke which would likely regress work item throughput.

Another thing I was thinking about is having a linked list that contains thread-local queues that contain items, and have threads link and unlink an item to/from the list upon going from empty to non-empty or vice versa. It would probably regress short-burst work throughput (which is not uncommon) but it would hopefully replace the O(n) work with a lock and O(1) work.

@kouvel
Copy link
Member

kouvel commented Jul 25, 2018

it is currently aggressive to compensate for necessary synchronization

This is made worse by the fact that more than one thread is requested for each work item up to procCount threads, for the same reason.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Feb 28, 2020
@mangod9 mangod9 modified the milestones: Future, 5.0 May 5, 2020
@mangod9
Copy link
Member

mangod9 commented May 5, 2020

This came up in some discussions so will determine feasibility to do in .net 5

@danmoseley danmoseley changed the title Reduce time in .NET Threadpool's WorkStealingQueue.TrySteal method and ThredPoolWorkQueue.Dequeue methods. Reduce time in .NET Threadpool's WorkStealingQueue.TrySteal method and ThreadPoolWorkQueue.Dequeue methods. May 12, 2020
@kouvel kouvel modified the milestones: 5.0.0, 6.0.0 Jul 22, 2020
@joaocpaiva
Copy link

joaocpaiva commented Nov 25, 2020

This problem is exacerbated on multi-core machines and the impact is exponential as we double the number of cores.

This stack is taking 2% of CPU on 4 core machines in Microsoft Graph and >6% of CPU on 8 core machines.

Moving from 4 cores to 8 cores, doubles MinWorkerThread, but the lock contention is about 3 times worse. We use MinWorkerThread=100 (per core). Based on #threads metrics, I think we might have some room to adjust this value down as a workaround to soften some of the impact but we can only go down so much.

This likely impacts many high throughput services. Does it not make the cut for .NET 6 @StephenBonikowsky @stephentoub ?

@benaadams
Copy link
Member

@joaocpaiva #44265 may have helped with this?

@joaocpaiva
Copy link

@joaocpaiva #44265 may have helped with this?

@benaadams yes but that fix was applied to ConcurrentQueueSegment, however ThreadPoolWorkQueue implementation seems to use an internal QueueSegment implementation, which still has the busy wait.

@joaocpaiva
Copy link

As expected, setting MinWorkerThreads from 100 down to 50 helped bring down the cost close to 3% (as opposed to 6%). We don't have more room to reduce it further, without taking a risk of causing issues and performance degradation during bursty workloads.

@mangod9
Copy link
Member

mangod9 commented Jul 22, 2021

This is no longer a partner ask.

@mangod9 mangod9 removed this from the 6.0.0 milestone Jul 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants