Reduce time in .NET ThreadPool's WorkStealingQueue.TrySteal and ThreadPoolWorkQueue.Dequeue methods #10752
Comments
dotnet/coreclr#18403 would naturally fix this problem because it would cut the number of work-stealing queues to the number of processors, which is modest.
There were some attempts at reducing the aggressiveness of checking other threads' local queues, but they didn't pan out; see https://github.com/dotnet/coreclr/issues/14017 and the items linked from it. My understanding, as described in that issue, is that the checking is currently aggressive to compensate for the necessary synchronization, which, when added, decreases work-item throughput. I was also thinking about dotnet/coreclr#18403, but some issues there are that threads can move from core to core, and checking which core a thread is currently on looks like it would need a p/invoke, which would likely regress work-item throughput. Another idea is a linked list of the thread-local queues that currently contain items, where each thread links/unlinks its queue as it goes from empty to non-empty or vice versa (a rough sketch follows). It would probably regress short-burst throughput (which is not uncommon), but it would hopefully replace the O(n) scan with a lock and O(1) work.
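A minimal sketch of that linked-list idea, assuming hypothetical names (NonEmptyQueueRegistry and the WorkStealingQueue stub are illustrative, not actual ThreadPool internals):

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the "linked list of non-empty queues" idea above;
// all names here are illustrative, not actual ThreadPool internals.
internal sealed class WorkStealingQueue { /* the owner's deque of work items */ }

internal static class NonEmptyQueueRegistry
{
    private static readonly LinkedList<WorkStealingQueue> s_nonEmpty = new();
    private static readonly object s_lock = new();

    // Owner thread: call when the local queue goes empty -> non-empty.
    public static LinkedListNode<WorkStealingQueue> Register(WorkStealingQueue q)
    {
        lock (s_lock) return s_nonEmpty.AddLast(q);   // O(1) under the lock
    }

    // Owner thread: call when the local queue drains back to empty.
    public static void Unregister(LinkedListNode<WorkStealingQueue> node)
    {
        lock (s_lock) s_nonEmpty.Remove(node);        // O(1) under the lock
    }

    // Stealer: snapshot only the queues that claim to have work, instead
    // of scanning every thread's queue (the O(n) behavior in this issue).
    public static WorkStealingQueue[] SnapshotCandidates()
    {
        lock (s_lock)
        {
            var arr = new WorkStealingQueue[s_nonEmpty.Count];
            s_nonEmpty.CopyTo(arr, 0);
            return arr;
        }
    }
}
```

The trade-off noted above shows up in Register/Unregister: short bursts pay for the lock on every empty/non-empty transition, in exchange for stealers doing O(1) expected work.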
This is made worse by the fact that more than one thread (up to procCount threads) is requested for each work item, for the same reason.
This came up in some discussions, so we will determine the feasibility of doing this in .NET 5.
This problem is exacerbated on multi-core machines, and the impact grows super-linearly as we double the number of cores. This stack is taking 2% of CPU on 4-core machines in Microsoft Graph and >6% of CPU on 8-core machines. Moving from 4 cores to 8 cores doubles MinWorkerThreads, but the lock contention is about 3 times worse. We use MinWorkerThreads=100 (per core). Based on thread-count metrics, I think we might have some room to adjust this value down as a workaround to soften some of the impact, but we can only go down so much. This likely impacts many high-throughput services. Does it not make the cut for .NET 6, @StephenBonikowsky @stephentoub?
@joaocpaiva #44265 may have helped with this?
@benaadams Yes, but that fix was applied to ConcurrentQueueSegment; the ThreadPoolWorkQueue implementation seems to use an internal QueueSegment implementation, which still has the busy wait.
As expected, setting MinWorkerThreads from 100 down to 50 brought the cost down to about 3% (as opposed to 6%). We don't have more room to reduce it further without risking issues and performance degradation during bursty workloads.
This is no longer a partner ask.
A number of different users have noted a large amount of time spent in the .NET ThreadPool's WorkStealingQueue.TrySteal method (being called from the ThreadPoolWorkQueue.Dequeue method).
From what we can tell, the scenario that causes this is bursty workloads. For bursty workloads our guidance is to set MinWorkerThreads high enough that there are threads available to handle the burst. For high-scale machines (e.g. 16 procs), it is not uncommon to set this minimum in the 160-320 thread range.
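For reference, that minimum is raised through the ThreadPool.SetMinThreads API; a minimal sketch of the guidance above (the 10-threads-per-proc multiplier is just an illustrative value from the 160-320 range, not a recommendation from this issue):

```csharp
using System;
using System.Threading;

class MinThreadsSetup
{
    static void Main()
    {
        // On a high-scale machine (e.g. 16 procs), set the worker minimum
        // high enough to absorb bursts (160-320 threads in the range above).
        ThreadPool.GetMinThreads(out int worker, out int ioc);
        int desiredWorkers = Environment.ProcessorCount * 10; // e.g. 16 * 10 = 160
        if (!ThreadPool.SetMinThreads(desiredWorkers, ioc))
            Console.WriteLine("SetMinThreads rejected the requested value.");
    }
}
```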
When a burst arrives (let's say it needs 100 threads to do the work), those 100 threads do the work and then call Dequeue to get the next work item. But the burst is over, so they don't find any work left, and they go through a loop in the ThreadPoolWorkQueue.Dequeue method looking for work to steal from other threads (which will fail).
Thus you have 100 threads each spinning through 160-320 worker-thread queues looking for more work, requiring 16K to 32K checks. These threads 'fight' over the memory used to check that the queues are empty, so even though each check is short, it consumes a non-trivial amount of CPU time. If these bursts come frequently (e.g. every 10-100 msec), the CPU cost adds up.
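To make the cost concrete, here is a simplified sketch of that scan, using stand-in types and method names rather than the actual runtime source:

```csharp
// Simplified sketch of the dequeue/steal scan described above; the types
// and method names are illustrative stand-ins, not the runtime source.
internal sealed class WorkStealingQueue
{
    public object? LocalPop() => null;   // placeholder: pop from own queue
    public object? TrySteal() => null;   // placeholder: steal from this queue
}

internal static class DequeueSketch
{
    // Every idle thread runs roughly this loop when its own queue is empty.
    internal static object? Dequeue(WorkStealingQueue[] allQueues, WorkStealingQueue myQueue)
    {
        object? item = myQueue.LocalPop();  // fast path: the thread's own queue
        if (item != null) return item;

        // Slow path: scan EVERY other thread's queue. With ~100 idle threads
        // and 160-320 queues that is 16K-32K checks, all reading the same
        // head/tail memory, which is where the CPU time goes.
        foreach (WorkStealingQueue q in allQueues)
        {
            if (ReferenceEquals(q, myQueue)) continue;
            item = q.TrySteal();
            if (item != null) return item;
        }
        return null; // the burst is over, so the whole O(n) scan fails
    }
}
```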
Here is where we see the CPU time spent (this is on the Desktop framework, but the code is very similar for .NET Core). Here is the code in Dequeue.
And here we see the hot code in TrySteal (that 141K, broken down).
In .NET Core the code is a bit different because we have created a helper called 'CanSteal' that checks m_headIndex < m_tailIndex, and we call this helper in Dequeue before calling TrySteal. This helps cut the cost per iteration, but it does not change the fact that we are doing an O(n) operation and have to 'fight' over the memory backing the m_headIndex and m_tailIndex variables. Thus on .NET Core the problem will not show up in TrySteal and should be less severe, but it is probably still problematic.
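A sketch of that fast-path check as described (field names follow the description above; this is not the verbatim runtime source):

```csharp
// Sketch of the .NET Core fast-path check described above; the field names
// mirror the description, but this is not the verbatim runtime source.
internal sealed class WorkStealingQueue
{
    private volatile int m_headIndex; // stealers take from the head
    private volatile int m_tailIndex; // the owner pushes/pops at the tail

    // Cheap pre-check used by Dequeue before paying for TrySteal.
    // It avoids the interlocked work of a real steal attempt, but every
    // scanning thread still reads the same head/tail cache lines, so the
    // O(n) memory 'fight' remains.
    public bool CanSteal => m_headIndex < m_tailIndex;
}
```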
To really fix the problem we need to be less aggressive about checking for work to steal. Ideally we still do some checking, but we want to be much less aggressive when we know that other threads will shortly come along and do a more aggressive check themselves. This avoids the O(n) behavior, which is the fundamental problem.
The solution probably looks like only looking for work to steal in a subset of the queues unless we have been asked to be 'aggressive', and we become aggressive only after a certain amount of time.
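A hypothetical sketch of that policy, in the spirit of the description above (the names, the subset size of 4, and the 5 ms threshold are all assumptions, not values from the runtime):

```csharp
using System;

// Hypothetical sketch of the proposal above: check only a small subset of
// queues by default, and scan aggressively (all queues) only after we have
// been idle for a while. Names and thresholds are illustrative assumptions.
internal sealed class WorkStealingQueue
{
    public bool CanSteal => false;      // placeholder emptiness pre-check
    public object? TrySteal() => null;  // placeholder steal attempt
}

internal static class SubsetStealSketch
{
    internal static object? TrySteal(WorkStealingQueue[] allQueues, Random rng, long idleSinceMs)
    {
        // Become aggressive only after some idle time; the 5 ms threshold
        // is an assumption, not a value from this issue.
        bool aggressive = Environment.TickCount64 - idleSinceMs > 5;
        int attempts = aggressive ? allQueues.Length : Math.Min(4, allQueues.Length);

        for (int i = 0; i < attempts; i++)
        {
            // A random subset keeps each idle thread's cost O(1); other
            // threads will shortly run their own checks, covering the rest.
            WorkStealingQueue q = allQueues[rng.Next(allQueues.Length)];
            if (q.CanSteal)
            {
                object? item = q.TrySteal();
                if (item != null) return item;
            }
        }
        return null;
    }
}
```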
@kouvel @stephentoub