[Wip] Improve Threadpool QUWI throughput #5943

benaadams · 2016-06-23T04:27:30Z

11% improvement for regular queuing (1.6 sec saved over 10M QUWI)
19.4% improvement for high thread count queuing (MinWorkerThreads=500) (7.2 sec saved over 10M QUWI)

Test project: https://gist.github.com/benaadams/b022934e62a3ac1c4f261be3216b1111

10M threadpool queues and executes. Changed items are in red, ExecutionContext.Run is highlighted, and the list extends past ExecutionContext.Restore for relative comparison.

[Image: Threadpool QUWI before]

[Image: Threadpool QUWI after 2nd Update]
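
The measurement described above can be approximated as follows. This is only a sketch under assumptions, not the code from the linked gist: `QuwiBenchmark` and `Run` are hypothetical names, and the work item body is deliberately trivial so that queue and dispatch cost dominates.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class QuwiBenchmark
{
    // Queues `count` trivial work items and waits until all of them have run.
    // Returns how many executed and the wall-clock time for queue + execute.
    public static (int executed, TimeSpan elapsed) Run(int count)
    {
        int executed = 0;
        using (var done = new CountdownEvent(count))
        {
            // Create the delegate once so the loop does not allocate a new one per item.
            WaitCallback callback = _ =>
            {
                Interlocked.Increment(ref executed);
                done.Signal();
            };

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < count; i++)
            {
                ThreadPool.QueueUserWorkItem(callback);
            }
            done.Wait(); // blocks until every queued item has signalled
            sw.Stop();
            return (executed, sw.Elapsed);
        }
    }
}
```

Calling `QuwiBenchmark.Run(10_000_000)` before and after a change gives two elapsed times to compare; several runs are advisable since thread pool timings vary between runs.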

stephentoub · 2016-06-23T12:34:43Z

src/mscorlib/src/System/Threading/ThreadPool.cs

        internal class WorkStealingQueue
        {
            private const int INITIAL_SIZE = 32;
-            internal volatile IThreadPoolWorkItem[] m_array = new IThreadPoolWorkItem[INITIAL_SIZE];
+            internal volatile PaddedWorkItem[] m_array = new PaddedWorkItem[INITIAL_SIZE];


Why are you 64-byte padding the items in the queue? The queue is owned by a single thread, and all other threads need to take a lock to access it. The owning thread also needs to take a lock when there are fewer than two items in the queue. The only case where you'd have contention this would help with is if there was a thread stealing concurrently with the owning thread pushing/popping on a list with at least two elements. At that point, they're already some distance apart, though not necessarily a full cache line. Have you shown that this change makes a notable improvement? It does so at the expense of effectively increasing the size of every work item by 56 bytes on 64-bit, since every work item reference to be stored now consumes 64 bytes instead of 8 bytes (plus the size of the work item object itself).
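
The size cost described above can be sketched as follows. `PaddedSlot` is a hypothetical stand-in, since the PR's actual `PaddedWorkItem` definition is not shown in this thread; the sketch assumes a 64-byte cache line and 8-byte references (64-bit).

```csharp
using System.Runtime.InteropServices;

// Hypothetical padded slot: one reference stretched to a full cache line.
[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct PaddedSlot
{
    [FieldOffset(0)]
    public object Item; // 8-byte reference on 64-bit; the other 56 bytes are padding
}

public static class PaddingCost
{
    public const int CacheLineBytes = 64;     // assumed cache line size
    public const int ReferenceBytes64Bit = 8; // size of an object reference on 64-bit

    // Extra bytes each array slot consumes once it is padded to a cache line.
    public static int ExtraBytesPerSlot() => CacheLineBytes - ReferenceBytes64Bit;

    // Payload size of the backing array for a given slot count (array header ignored).
    public static long ArrayBytes(int slots, bool padded) =>
        (long)slots * (padded ? CacheLineBytes : ReferenceBytes64Bit);
}
```

With `INITIAL_SIZE = 32`, `PaddingCost.ArrayBytes(32, false)` is 256 bytes and `PaddingCost.ArrayBytes(32, true)` is 2048 bytes, which is the 56 extra bytes per slot that the comment refers to.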

benaadams · 2016-06-23T18:11:41Z

@stephentoub as you pointed out, I don't think padding the items is helpful.

Still working on it - hot spots are Dequeue and TrySteal

benaadams · 2016-06-23T18:17:58Z

Looking closer, the main effect may just come from looping over the queues (many threads) when most of the queues are empty.

omariom · 2016-06-23T19:18:42Z

@benaadams I think you should start by finding where false sharing occurs. This article shows how to use VS for that. Not sure if it is possible on Windows 10, but on Win 7 it was.

Update: another good tool is Intel VTune.
Not sure if it works with Windows 10.

benaadams · 2016-06-24T16:31:11Z

@omariom there is false sharing in stealing; however, the current implementation, even with the false sharing, is pretty hard to improve on. Still iterating, though looking at something quite different from the PR in its current state.

omariom · 2016-06-25T21:23:05Z

An interesting optimization would be to find all the places (on hot paths) where volatile reads/writes are unnecessary, replace them with plain reads, and use the Volatile class in the rest.
It may help on ARM, where volatile reads/writes are implemented as fairly expensive full barriers.
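
A minimal sketch of the pattern suggested above, not code from the PR: the fields are declared without the `volatile` modifier, and `Volatile.Read` / `Volatile.Write` are used only at the points where ordering matters. The `PublishedBox` type and its members are hypothetical.

```csharp
using System.Threading;

public sealed class PublishedBox
{
    private int _value; // plain field: ordinary reads and writes
    private int _ready; // plain field: ordered access only where needed

    public void Publish(int value)
    {
        _value = value;                // plain write
        Volatile.Write(ref _ready, 1); // release: the write to _value cannot move after this
    }

    public bool TryRead(out int value)
    {
        // acquire: the read of _value below cannot move before this
        if (Volatile.Read(ref _ready) == 1)
        {
            value = _value;            // plain read, ordered by the acquire above
            return true;
        }
        value = 0;
        return false;
    }
}
```

Whether a given access in the thread pool can safely become a plain read depends on the surrounding code, so each such change would need to be reviewed and measured individually.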

benaadams · 2016-06-26T06:44:33Z

> An interesting optimization would be to find all the places (on hot paths) where volatile reads/writes are unnecessary, replace them with plain reads, and use the Volatile class in the rest.

There are some areas where this might be possible. Will try it and measure the impact.

Although such a change does make me a little uncomfortable 😟

benaadams · 2016-06-27T08:06:56Z

A bit better: getting a 6% improvement in QueueUserWorkItem throughput for 10M work items (4 cores, 1 socket),
with a 9% improvement for set COMPlus_ThreadPool_ForceMinWorkerThreads=500

Still investigating.

benaadams · 2016-06-27T10:05:23Z

10% Improvement for regular (13.1s vs 14.7s)
14% improvement for set COMPlus_ThreadPool_ForceMinWorkerThreads=500 (31.2s vs 36.6s)
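
For reference, the percentages can be derived from the quoted timings as a reduction in elapsed time relative to the baseline. This is a small sketch of that arithmetic; `Throughput.ImprovementPercent` is a hypothetical helper, and it assumes the second time in each pair is the baseline.

```csharp
public static class Throughput
{
    // Percentage reduction in elapsed time relative to the baseline run.
    public static double ImprovementPercent(double beforeSeconds, double afterSeconds) =>
        (beforeSeconds - afterSeconds) / beforeSeconds * 100.0;
}
```

Under that assumption, `Throughput.ImprovementPercent(14.7, 13.1)` is roughly 10.9 and `Throughput.ImprovementPercent(36.6, 31.2)` is roughly 14.8, in line with the rounded-down 10% and 14% figures above.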

benaadams · 2016-06-28T04:16:37Z

10M threadpool queues and executes. Changed items are in red, ExecutionContext.Run is highlighted, and the list extends to System.Random.Sample for relative comparison.

[Image: Threadpool QUWI before]

[Image: Threadpool QUWI after]

benaadams · 2016-06-28T04:26:01Z

Test code https://gist.github.com/benaadams/b022934e62a3ac1c4f261be3216b1111

The work queue also allocates 2112 bytes per 255 queued items in discarded QueueSegments (85MB per 10M), which gives a process equilibrium of 300MB-400MB memory use versus a 50MB equilibrium without these allocations.

Caching a QueueSegment as it gets dropped off the tail and then reusing it for a new head avoids these allocations, but it's also not entirely straightforward with the concurrency flows, so I'm not pursuing that at this stage.

I believe the changes in this PR should not alter any of the concurrency behaviour.
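
The QueueSegment caching idea mentioned above could look roughly like the single-slot cache below. This is a sketch under assumptions, not the PR's implementation: `Segment` and `SegmentCache` are hypothetical types, and the sketch does not address the harder concurrency question of other threads still holding a reference to a segment when it is returned.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a queue segment holding a fixed block of slots.
public sealed class Segment
{
    public readonly object[] Slots = new object[256];

    // Clear stale references so a reused segment does not keep old items alive.
    public void Reset() => Array.Clear(Slots, 0, Slots.Length);
}

public static class SegmentCache
{
    private static Segment _spare; // at most one cached segment

    // Take the cached segment if there is one, otherwise allocate a new one.
    public static Segment Rent()
    {
        Segment cached = Interlocked.Exchange(ref _spare, null);
        return cached ?? new Segment();
    }

    // Offer a retired segment for reuse; it is dropped if the slot is already occupied.
    public static void Return(Segment segment)
    {
        segment.Reset();
        Interlocked.CompareExchange(ref _spare, segment, null);
    }
}
```

A segment dropped from the tail would go through `SegmentCache.Return`, and growing the head would call `SegmentCache.Rent` instead of `new`. A single slot keeps the cache itself free of locks, at the cost of only ever saving one allocation at a time.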

prajwal-aithal · 2016-06-28T04:48:58Z

@dotnet-bot test Linux ARM Emulator Cross Debug Build

benaadams · 2016-06-28T05:58:17Z

@dotnet-bot test Linux ARM Emulator Cross Release Build

benaadams · 2016-06-28T12:56:48Z

@dotnet-bot test this please

benaadams · 2016-06-28T13:01:00Z

Added QueueSegment reuse as a commit; tests need rerunning.

Skip EC.Restore when not changing from defaults
Early bail from GetLocalValue when EC Default
Fast-path SetLocalValue adding first value

danmoseley · 2016-10-13T21:10:40Z

@benaadams what remains here to call this PR good to go?

benaadams · 2016-10-14T03:30:04Z

@danmosemsft it needs to be freshened and rebased. I'll open as another PR with new results as there is a lot of noise now in this one.

dnfclas added the cla-already-signed label Jun 23, 2016

benaadams force-pushed the threapool-falsesharing branch 3 times, most recently from 5057952 to aef37e4 Compare June 23, 2016 05:19

stephentoub reviewed Jun 23, 2016

stephentoub assigned ericeil and kouvel Jun 23, 2016

benaadams changed the title ~~Prevent ThreadPool false sharing~~ [Wip} Prevent ThreadPool false sharing Jun 23, 2016

benaadams changed the title ~~[Wip} Prevent ThreadPool false sharing~~ [Wip] Prevent ThreadPool false sharing Jun 23, 2016

benaadams force-pushed the threapool-falsesharing branch from eec3a75 to 21c3d55 Compare June 27, 2016 07:52

benaadams changed the title ~~[Wip] Prevent ThreadPool false sharing~~ [Wip] Improve Threadpool throughput Jun 27, 2016

benaadams force-pushed the threapool-falsesharing branch from 8632a2d to 7621765 Compare June 27, 2016 16:08

benaadams changed the title ~~[Wip] Improve Threadpool throughput~~ Improve Threadpool QUWI throughput Jun 28, 2016

benaadams mentioned this pull request Jun 28, 2016

Remove empty Threadpool .cctors #3148

Closed

benaadams added 22 commits August 17, 2016 05:56

ThreadPool.SparseArray improvement for DequeueSteal

851ee72

Deterministic not random DequeueSteal search start

50f8154

Inline GetIndexes

440e303

Inline index comparisions

4bc20cb

Use ref rather than out for new call chain

3215ffd

Add queue on enqueue rather than dispatch

0956d7e

fix queues

4e7eadb

Differentiate threadpool local and global enqueue

eddd231

Base class for UserWorkItems

052e43d

Use Generic ExecutionContext.Run

eb2b1cf

Inline simple props

2ec54c8

Inline AsyncCausalityTracer logging to remove it

9ab01dc

Inline + simplify stack guard

2c4faff

ExecuteWorkItem -> abstract

e6987a3

Specialize TaskContinuations

d1e1211

Strong type ExecutionConext.Runs

ccf0c8d

Fast-path EC.Restore & AsyncLocal

fdd9fba

Skip EC.Restore when not changing from defaults
Early bail from GetLocalValue when EC Default
Fast-path SetLocalValue adding first value

Thread LocalQueues

cd60ec7

Task local

a66a59d

fix race condition

d401672

Deal with integer overflows

c6d7a20

Enqueue falsesharing

87d2fae

benaadams force-pushed the threapool-falsesharing branch from 5f13774 to 87d2fae Compare August 17, 2016 04:56

benaadams closed this Oct 14, 2016

benaadams deleted the threapool-falsesharing branch March 27, 2018 05:11

benaadams restored the threapool-falsesharing branch March 27, 2018 05:11

benaadams deleted the threapool-falsesharing branch January 11, 2019 21:37

benaadams mentioned this pull request Jan 31, 2020

High threadpool count burns cpu in native WorkerThreadStart dotnet/runtime#6265

Closed
