
Improve ArrayPool implementation and performance #8209

Merged · 3 commits · Dec 1, 2016

Conversation

@stephentoub (Member) commented Nov 18, 2016:

This commit improves the performance of ArrayPool, with several changes:

  • Whereas we currently use DefaultArrayPool for both ArrayPool.Shared and ArrayPool.Create, this commit introduces a new internal ArrayPool-derived type. What was DefaultArrayPool is now ConfigurableArrayPool, and a
    new implementation is provided in DefaultArrayPool. ArrayPool.Shared returns an instance of (the new) DefaultArrayPool, and the Create methods return an instance of ConfigurableArrayPool.
  • The DefaultArrayPool now defaults to 64 buffers per bucket (instead of 50), a number chosen because the new algorithm requires a power of 2. (In the future, we can experiment with changing the number of buffers per bucket, e.g. allowing more buffers in buckets of smaller, common sizes, something that we couldn't do easily when having to respect the knobs specified to the ctor.)
  • Rather than using a SpinLock, DefaultArrayPool.Bucket implements an (almost) lock-free, multi-producer/multi-consumer, thread-safe, array-based queue (a minimal sketch follows this list). When not contended, rents and returns incur a single interlocked operation. When the queue isn't empty, returners and renters remain mostly isolated from each other, with minimal false-sharing concerns. (I say "almost" lock-free because there's a small window between two writes during which other threads will end up spinning. I've added a #define that lets us switch the implementation between spinning in such cases and simply bailing on the operation, a tradeoff between guaranteed forward progress and a bit more allocation under contention; we can decide which direction to go... we could also implement a compromise, e.g. spin a few times and then bail. Note that I experimented with true lock-free algorithms, but none of them did nearly as well as this.)
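
To make that concrete, here is a minimal sketch of this style of sequence-number-based bounded queue (Vyukov's MPMC design), where a rent dequeues and a return enqueues. The type and member names are assumed, and it shows the "bail on full/empty" flavor rather than spinning; it is illustrative, not the PR's actual code:

    using System.Threading;

    internal sealed class ArrayQueue<T> where T : class
    {
        private struct Slot
        {
            public long SequenceNumber; // synchronizes producers and consumers
            public T Item;
        }

        private readonly Slot[] _slots; // length must be a power of 2
        private long _head;             // position of the next dequeue
        private long _tail;             // position of the next enqueue

        public ArrayQueue(int capacity)
        {
            _slots = new Slot[capacity];
            for (int i = 0; i < capacity; i++) _slots[i].SequenceNumber = i;
        }

        public bool TryEnqueue(T item)
        {
            while (true)
            {
                long tail = Volatile.Read(ref _tail);
                int index = (int)(tail & (_slots.Length - 1));
                long delta = Volatile.Read(ref _slots[index].SequenceNumber) - tail;
                if (delta == 0 &&
                    Interlocked.CompareExchange(ref _tail, tail + 1, tail) == tail)
                {
                    // Between these two writes, other threads observing this slot
                    // see a stale sequence number and bail (or spin, in the
                    // alternate flavor the PR's #define selects).
                    _slots[index].Item = item;
                    Volatile.Write(ref _slots[index].SequenceNumber, tail + 1);
                    return true;
                }
                if (delta < 0) return false; // queue looks full: caller drops the array
                // delta > 0: another thread won the race; re-read and retry
            }
        }

        public bool TryDequeue(out T item)
        {
            while (true)
            {
                long head = Volatile.Read(ref _head);
                int index = (int)(head & (_slots.Length - 1));
                long delta = Volatile.Read(ref _slots[index].SequenceNumber) - (head + 1);
                if (delta == 0 &&
                    Interlocked.CompareExchange(ref _head, head + 1, head) == head)
                {
                    item = _slots[index].Item;
                    _slots[index].Item = null;
                    // Free the slot for the producer one full lap ahead.
                    Volatile.Write(ref _slots[index].SequenceNumber, head + _slots.Length);
                    return true;
                }
                if (delta < 0) { item = null; return false; } // queue looks empty
            }
        }
    }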

In my local testing on 64-bit, uncontended access is ~40% faster than before. Multiple threads each renting and returning furiously ends up being anywhere from the same throughput to ~2x faster, depending on access patterns. I've not tested on 32-bit, but I expect this won't do as well there, due to the reads/writes on Int64s: we may choose to return an instance of ConfigurableArrayPool instead of DefaultArrayPool from Shared/Create() on 32-bit.

@rynowak, would it be possible to test this out with the ASP.NET benchmarks you previously used to evaluate ArrayPool's impact?

cc: @jkotas, @kouvel, @JeremyKuhne, @redknightlois, @benaadams, @rynowak
Fixes https://github.com/dotnet/coreclr/issues/8203

}

/// <summary>Represents a buffer in the bucket.</summary>
private struct Slot
Member:

Worth making this [StructLayout(LayoutKind.Explicit, Size = 64)] to avoid contention with nearby slots? Will currently be 4 neighbouring slots in a cacheline.

Member:

Experiments with LMAX Disruptor showed that in the case of Ring Buffers the false sharing issue is the biggest cause of contention in high multithreading environments.

[image: false-sharing illustration from the article linked below]

SRC: http://mechanical-sympathy.blogspot.com.ar/2011/07/false-sharing.html

@stephentoub (Member, Author) commented Nov 19, 2016:

Experiments with LMAX Disruptor showed that in the case of Ring Buffers the false sharing issue is the biggest cause of contention in high multithreading environments

I understand that and agree, but the typical use case for Disruptor is very different from the typical use case for a buffer pool. Disruptor is very much about producer/consumer, typically when producer and consumer are different threads (that are often affinitized). That's not the common case here. That's why I left the comment I did; there's opportunity to experiment further to see if it makes a difference, but I didn't see one in the testing I did on the hardware available to me for the patterns I tested, and adding padding here increases the size of each of the arrays non-trivially, so we'd want to know it made a meaningful difference for the target scenarios before doing it.

@stephentoub (Member, Author):

Btw, @redknightlois or @benaadams, if you want to experiment with this, I'd welcome it. I'd add padding by changing the Slot to this:

            /// <summary>Represents a buffer in the bucket.</summary>
            [StructLayout(LayoutKind.Sequential)]
            private struct Slot
            {
                private readonly Padding56 _pad1;
                /// <summary>The sequence number for this slot, used to synchronize between producers and consumers.</summary>
                public long SequenceNumber; // Assuming a 64-byte cache line, sits in the last 8 bytes of [0,64)
                /// <summary>The buffer.</summary>
                public T[] Array; // Assuming a 64-byte cache line, sits in the first 8 bytes of [64,128)
                private readonly Padding56 _pad2;
            }
        }
    }

    [StructLayout(LayoutKind.Explicit, Size = 56)]
    internal struct Padding56 { }

in case you want to use that as a starting point.

@rynowak (Member) commented Nov 19, 2016:

@stephentoub - I can probably do this if it's important, but it won't be straightforward because we haven't updated everything to 1.2 yet...

@stephentoub (Member, Author):

I can probably do this if it's important, but it won't be straightforward because we haven't updated everything to 1.2 yet

@rynowak, would it be easier if I gave you a 1.1-based build of System.Buffers.dll?

@jkotas (Member) commented Nov 19, 2016:

The typical ArrayPool pattern is: rent a buffer, write data to it, read data from it, return it; on multiple threads at the same time. I have written a test that does exactly that: https://gist.github.com/jkotas/319f360c72ba12c3dccc00f557567997. It allocates a 4k buffer, fills it in, computes its sum, and returns it; on all cores; in a loop.

The numbers from my dev box (Intel E5-2630, 2 sockets, 16 cores, 32 logical processors) are:

  • ArrayPool: 58s
  • No pooling: 37s
  • Thread-static cache: 9s

These numbers show that ArrayPool is 1.5x worse (both before and after this change) than just allocating a fresh array every time, and 6.5x slower than an ad-hoc thread-static buffer cache. Micro-benchmarks like this one are what we need to tune for, because they are a better proxy for real-world program behavior - in particular around process cache utilization; raw rent/return performance is just a small component of it.

I believe it should be achievable for ArrayPool to be within 20% of the thread-static cache number for this microbenchmark.
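
For reference, the "thread-static cache" being compared against can be as simple as the following sketch (an assumed reconstruction, not the benchmark's actual code): each thread holds at most one buffer, so rent and return are just a TLS read and write.

    using System;

    internal static class ThreadStaticBufferCache
    {
        [ThreadStatic]
        private static byte[] t_buffer; // at most one cached buffer per thread

        public static byte[] Rent(int minimumLength)
        {
            byte[] buffer = t_buffer;
            if (buffer != null && buffer.Length >= minimumLength)
            {
                t_buffer = null; // hand out the cached buffer
                return buffer;
            }
            return new byte[minimumLength]; // cache miss: allocate
        }

        public static void Return(byte[] buffer)
        {
            t_buffer = buffer; // keep only the most recently returned buffer
        }
    }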

The ArrayPool is a variant of the classic C malloc/free. There are a few things to learn from it:

  • Any decent malloc/free has per-thread caches for scalability. A scalable ArrayPool will need to have them too.
  • The only ArrayPool that really matters is the shared global one. Allocation of non-global pool instances will hurt performance in 99% of cases; pretty much ignore them for any kind of tuning. Windows learned this 10+ years ago: the common practice used to be for every component to have its own private malloc/free heap, until somebody noticed that it was really hurting the big picture, and then they spent years chasing down everybody to use the process-global heap.

@stephentoub (Member, Author):

The only ArrayPool that really matters is the shared global one

I agree. That's why in this change I split the implementation so that we can tune one separately from the other. You agree that's a good direction?

The scalable ArrayPool will need to have them too.

Once the shared is split from the configurable, that's relatively straightforward to add. What's harder is thinking through and tuning it. How big should the thread-local parts of the cache be? Should that vary based on the size of the buffer? Should it vary dynamically based on access patterns? Should there be any mechanism whereby buffers cached on one thread can be made available to another thread, or a mechanism whereby they can be reclaimed from TLS (weak refs or something more integrated into the GC)? I see #7747 as addressing that... or are you suggesting we scrap this PR?

The typical ArrayPool pattern is: rent a buffer, write data to it, read data from it, return it; on multiple threads at the same time

While I agree this is very common, we've also made it very common for the renting and returning to be done by different threads, due to async/await: rent a buffer, potentially write data to it, potentially hop threads as part of an awaited operation, potentially read data from it, return the buffer.

@stephentoub stephentoub force-pushed the arraypool_perf branch 2 times, most recently from 379bfda to fe01d08 Compare November 19, 2016 14:38
@jkotas (Member) commented Nov 19, 2016:

I split the implementation so that we can tune one separately from the other. You agree that's a good direction?

Yes, this split is good.

I see #7747 as addressing that... or are you suggesting we scrap this PR?

I do not know exactly what the underlying algorithms and data structures required to support #7747 tuning are going to look like. It is very likely that they are going to be different from what they look like today. I would expect the local states and the global state to be tightly coupled.

We can keep this PR if it is an improvement (ideally measured on something real like ASP.NET). Chances are there won't be much left from it once we are done with the overall tuning.

We've also made it very common for the renting and returning to be done by different threads

Yes, this is another case to tune for. It may be potentially interesting to deal well with the case where the different threads are running on the same core/socket. Schedulers tend to have heuristics to affinitize work that belongs to the same "request" to the same core/socket.

@stephentoub (Member, Author) commented Nov 20, 2016:

I've taken a stab at adding a TLS layer in front of the global cache. Along with each global bucket, each thread has a TLS bucket that can hold a few arrays, but using weak references to avoid prolonged leaks. I also made the sizing scheme a bit more dynamic, so fewer big arrays are cacheable. Rents first try TLS, then the global cache; returns first target the TLS cache, then go to the global cache if TLS is full (a rough sketch follows). (Maybe with some mscorlib magic we could make stale TLS arrays move to the global... I experimented with using finalization and resurrection to do so, but it tanked perf.)
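
A rough sketch of that tiered lookup; the helper names (TryRentFromTls, TryReturnToTls, the bucket types) are hypothetical stand-ins, not the PR's exact code:

    public override T[] Rent(int minimumLength)
    {
        int bucketIndex = SelectBucketIndex(minimumLength);
        T[] buffer = TryRentFromTls(bucketIndex);               // 1) per-thread TLS cache
        if (buffer == null)
        {
            buffer = _globalBuckets[bucketIndex].TryRent();     // 2) shared global cache
        }
        return buffer ?? new T[GetMaxSizeForBucket(bucketIndex)]; // 3) allocate
    }

    public override void Return(T[] array)
    {
        int bucketIndex = SelectBucketIndex(array.Length);
        if (!TryReturnToTls(bucketIndex, array))                // 1) per-thread TLS cache
        {
            _globalBuckets[bucketIndex].TryReturn(array);       // 2) global; dropped if full
        }
    }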

I took Jan's benchmark and created one with a few more cases and scenarios:
https://gist.github.com/stephentoub/2a7773a683850f66b0e071eec1991ea4

  • AllocArrayPool == an ArrayPool where Rent just uses new and Return is a nop
  • TlsArrayPool == an ArrayPool backed by a single ThreadStatic field storing one array
  • ConfigurableArrayPool == what you get back from ArrayPool.Create with this PR, which is the same as what you get back from ArrayPool.Shared prior to this PR
  • DefaultArrayPool == ArrayPool.Shared with this PR

(I realize it's not exactly the same in that there's now a virtual call in the "new" case, but putting that behind the ArrayPool abstraction made things easier to compare.) The only box I have access to right now is 4-core/8-logical-processors... here are the results (Jan, if you get a chance, I'd be interested in seeing how this does on your 16/32 machine):
C:\Users\stoub\Desktop\default.netcoreapp1.1>corerun aptest.exe
           AllocArrayPool(1000000, 1,   1,False) => Time: 00:00:03.6227070, GC0/1/2: 15524 /    0 /   0
             TlsArrayPool(1000000, 1,   1,False) => Time: 00:00:00.0756416, GC0/1/2:     0 /    0 /   0
  ConfigurableArrayPool`1(1000000, 1,   1,False) => Time: 00:00:00.5467060, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1(1000000, 1,   1,False) => Time: 00:00:00.2665368, GC0/1/2:     0 /    0 /   0

           AllocArrayPool( 100000, 1, 256,False) => Time: 00:00:00.4462131, GC0/1/2:  1544 /    0 /   0
             TlsArrayPool( 100000, 1, 256,False) => Time: 00:00:00.1195556, GC0/1/2:     0 /    0 /   0
  ConfigurableArrayPool`1( 100000, 1, 256,False) => Time: 00:00:00.5256749, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 1, 256,False) => Time: 00:00:00.1289681, GC0/1/2:     0 /    0 /   0

           AllocArrayPool( 100000, 1,4096,False) => Time: 00:00:01.9906992, GC0/1/2:  1559 /    0 /   0
             TlsArrayPool( 100000, 1,4096,False) => Time: 00:00:01.7403081, GC0/1/2:     0 /    0 /   0
  ConfigurableArrayPool`1( 100000, 1,4096,False) => Time: 00:00:03.0270425, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 1,4096,False) => Time: 00:00:01.7403150, GC0/1/2:     0 /    0 /   0

           AllocArrayPool( 100000, 1,4096, True) => Time: 00:00:02.2665990, GC0/1/2:  1530 /   26 /   0
             TlsArrayPool( 100000, 1,4096, True) => Time: 00:00:01.7851760, GC0/1/2:     7 /    3 /   0
  ConfigurableArrayPool`1( 100000, 1,4096, True) => Time: 00:00:03.2325261, GC0/1/2:     6 /    3 /   0
       DefaultArrayPool`1( 100000, 1,4096, True) => Time: 00:00:01.7959821, GC0/1/2:     7 /    3 /   0

           AllocArrayPool( 100000, 5,   1,False) => Time: 00:00:01.7836983, GC0/1/2:  7792 /    1 /   0
             TlsArrayPool( 100000, 5,   1,False) => Time: 00:00:01.4772221, GC0/1/2:  6203 /    1 /   0
  ConfigurableArrayPool`1( 100000, 5,   1,False) => Time: 00:00:00.2690019, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 5,   1,False) => Time: 00:00:00.2640937, GC0/1/2:     0 /    0 /   0

           AllocArrayPool( 100000, 5, 409,False) => Time: 00:00:02.4586660, GC0/1/2:  7170 /    0 /   0
             TlsArrayPool( 100000, 5, 409,False) => Time: 00:00:01.9753044, GC0/1/2:  5645 /    0 /   0
  ConfigurableArrayPool`1( 100000, 5, 409,False) => Time: 00:00:02.9278374, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 5, 409,False) => Time: 00:00:00.9515516, GC0/1/2:     0 /    0 /   0

           AllocArrayPool( 100000, 5,4096,False) => Time: 00:00:09.7180277, GC0/1/2:  5566 /    0 /   0
             TlsArrayPool( 100000, 5,4096,False) => Time: 00:00:09.4990777, GC0/1/2:  4445 /    0 /   0
  ConfigurableArrayPool`1( 100000, 5,4096,False) => Time: 00:00:11.5443067, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 5,4096,False) => Time: 00:00:08.6526323, GC0/1/2:     0 /    0 /   0

           AllocArrayPool(1000000, 5,   1, True) => Time: 00:00:27.7776567, GC0/1/2: 76280 / 25258 /   1
             TlsArrayPool(1000000, 5,   1, True) => Time: 00:00:21.9095737, GC0/1/2: 58902 / 2390 /   0
  ConfigurableArrayPool`1(1000000, 5,   1, True) => Time: 00:00:13.1512960, GC0/1/2:   336 /  112 /   0
       DefaultArrayPool`1(1000000, 5,   1, True) => Time: 00:00:08.4448090, GC0/1/2:   518 /  168 /   2

@omariom commented Nov 20, 2016:

What if it were a per-CPU pool?
The number of CPUs is finite and usually smaller than the number of threads.

@benaadams (Member):

What if it were a per cpu pool?

A per-CPU pool is good, but hard to enforce without thread affinity (which would cause more problems than it would solve in this instance). It's also a sys call to determine the current core.

That said, while migrations would still happen, it would reduce the chance of lock contention; but locks would still be required (without affinities).

@benaadams (Member) commented Nov 20, 2016:

@omariom I think my first interaction with @stephentoub was around per core stuff https://github.com/dotnet/corefx/issues/696 😄

@omariom commented Nov 20, 2016:

@benaadams

However; saying that while migrations would still happen;

As @stephentoub mentioned, even inter-thread migration is normal.

it would reduce the chance of lock contention; but locks would still be required (without affinities).

Maybe there is a way with just interlocked operations?

@benaadams (Member):

May be there is a way with just Interlockeds?

Synchronisation might have been better phrasing. With per-thread data, no synchronisation is required - though I think extending the thread with data comes at a higher starting cost?

@stephentoub as a halfway house between AllocArrayPool and ConfigurableArrayPool what if the SpinLock in Buckets used a timeout of 0 (with change #6952)

I'm having trouble building apps against coreclr directly, as I run into lots of namespace and dependency clashes (which I assume is transitory and a result of bringing APIs back), so I can't test this myself :-/

@jkotas (Member) commented Nov 21, 2016:

Its also a sys call to determine the current core.

@benaadams It is just a call (not a sys call) on decent systems. E.g. GetCurrentProcessorNumber is implemented via the rdtscp instruction. Also, the server GC has it sort of cached because it tries to snap heaps to cores.

@jkotas (Member) commented Nov 21, 2016:

@vancem @Maoni0 Thoughts about this pooling algorithm?

@jkotas (Member) commented Nov 21, 2016:

This is fundamentally really similar to https://github.com/dotnet/coreclr/blob/master/src/mscorlib/Common/PinnableBufferCache.cs . I wonder whether we should merge the two. The PinnableBufferCache has quite a bit of self-tuning logic.

// This isn't our first time pushing on this thread. Iterate through
// the weak references looking for one that doesn't have an array,
// and store the pushed array into it.
foreach (WeakReference<T[]> wr in _arrays)
Member:

WeakReference means that we will lose all per-thread buffers (that are not in use) on every Gen2 GC. I do not know how bad that is going to be, but it is certainly not going to be good.

@stephentoub (Member, Author):

Yeah. As I noted, what I really wanted was something that would move the buffers from here to the global pool, instead of dropping them. To my knowledge we just don't have anything like that yet. I'd tried using finalizable objects and resurrection to do it: essentially, instead of having a weak ref to an array, having a weak ref to a poolable, resurrectable, finalizable object wrapping an array, whose finalizer would TryEnqueue the array onto the global queue (rough shape sketched below). Perf tanked, but it's possible I messed it up somehow... I'll try again now that I've made other changes.
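
For illustration, that resurrection experiment might have looked roughly like this (an assumed sketch, not the experiment's actual code; as noted, this approach tanked perf):

    // (requires System and System.Collections.Concurrent; T is the pool's element type)
    private sealed class ResurrectableArrayWrapper
    {
        internal T[] Array;
        internal ConcurrentQueue<T[]> GlobalQueue; // stand-in for the pool's global queue

        ~ResurrectableArrayWrapper()
        {
            T[] arr = Array;
            if (arr != null)
            {
                Array = null;
                GlobalQueue?.Enqueue(arr); // move the buffer to the global pool
            }
            GC.ReRegisterForFinalize(this); // resurrect the wrapper for reuse
        }
    }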

@stephentoub (Member, Author):

I wonder whether we should merge the two. The PinnableBufferCache has quite a bit of self-tuning logic.

Certainly worth exploring.

I added into my test two additional test pools (again, not 100% apples to apples, but close enough), one based on PinnableBufferCache, and one based on @benaadams's question about SpinLock.TryEnter. Here are the results (again on my 4-core/8-logical machine):

c:\Users\stoub\Desktop\CoreClrTest>corerun aptest.exe
         AllocArrayPool`1(1000000, 1,   1,False) => Time: 00:00:03.7226429, GC0/1/2: 15458 /    0 /   0
           TlsArrayPool`1(1000000, 1,   1,False) => Time: 00:00:00.0760515, GC0/1/2:     0 /    0 /   0
   TrySpinLockArrayPool`1(1000000, 1,   1,False) => Time: 00:00:00.5656253, GC0/1/2:    54 /    0 /   0
  PinnableBufferCachePool(1000000, 1,   1,False) => Time: 00:00:01.5391600, GC0/1/2:     0 /    0 /   0
  ConfigurableArrayPool`1(1000000, 1,   1,False) => Time: 00:00:00.6526092, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1(1000000, 1,   1,False) => Time: 00:00:00.2677964, GC0/1/2:     0 /    0 /   0

         AllocArrayPool`1( 100000, 1, 256,False) => Time: 00:00:00.3796690, GC0/1/2:  1559 /    0 /   0
           TlsArrayPool`1( 100000, 1, 256,False) => Time: 00:00:00.1176229, GC0/1/2:     0 /    0 /   0
   TrySpinLockArrayPool`1( 100000, 1, 256,False) => Time: 00:00:00.4555447, GC0/1/2:   226 /    0 /   0
  PinnableBufferCachePool( 100000, 1, 256,False) => Time: 00:00:00.1767529, GC0/1/2:    11 /    0 /   0
  ConfigurableArrayPool`1( 100000, 1, 256,False) => Time: 00:00:00.4661720, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 1, 256,False) => Time: 00:00:00.1299157, GC0/1/2:     0 /    0 /   0

         AllocArrayPool`1( 100000, 1,4096,False) => Time: 00:00:02.0147209, GC0/1/2:  1559 /    0 /   0
           TlsArrayPool`1( 100000, 1,4096,False) => Time: 00:00:01.7413910, GC0/1/2:     0 /    0 /   0
   TrySpinLockArrayPool`1( 100000, 1,4096,False) => Time: 00:00:01.7565455, GC0/1/2:    41 /    0 /   0
  PinnableBufferCachePool( 100000, 1,4096,False) => Time: 00:00:01.7452105, GC0/1/2:     6 /    0 /   0
  ConfigurableArrayPool`1( 100000, 1,4096,False) => Time: 00:00:02.0043132, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 1,4096,False) => Time: 00:00:01.7657864, GC0/1/2:     0 /    0 /   0

         AllocArrayPool`1( 100000, 1,4096, True) => Time: 00:00:02.2791764, GC0/1/2:  1531 /   26 /   0
           TlsArrayPool`1( 100000, 1,4096, True) => Time: 00:00:01.7821046, GC0/1/2:     7 /    3 /   0
   TrySpinLockArrayPool`1( 100000, 1,4096, True) => Time: 00:00:01.8229591, GC0/1/2:    80 /   16 /   0
  PinnableBufferCachePool( 100000, 1,4096, True) => Time: 00:00:01.8104295, GC0/1/2:    11 /    5 /   0
  ConfigurableArrayPool`1( 100000, 1,4096, True) => Time: 00:00:02.0592156, GC0/1/2:     6 /    3 /   0
       DefaultArrayPool`1( 100000, 1,4096, True) => Time: 00:00:01.8016369, GC0/1/2:     7 /    3 /   0

         AllocArrayPool`1( 100000, 5,   1,False) => Time: 00:00:01.8147635, GC0/1/2:  7794 /    1 /   0
           TlsArrayPool`1( 100000, 5,   1,False) => Time: 00:00:01.5040712, GC0/1/2:  6224 /    0 /   0
   TrySpinLockArrayPool`1( 100000, 5,   1,False) => Time: 00:00:00.2855701, GC0/1/2:    43 /    2 /   0
  PinnableBufferCachePool( 100000, 5,   1,False) => Time: 00:00:00.7191607, GC0/1/2:    26 /    1 /   0
  ConfigurableArrayPool`1( 100000, 5,   1,False) => Time: 00:00:00.2800174, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 5,   1,False) => Time: 00:00:00.2711562, GC0/1/2:     0 /    0 /   0

         AllocArrayPool`1( 100000, 5, 409,False) => Time: 00:00:02.3515123, GC0/1/2:  7167 /    0 /   0
           TlsArrayPool`1( 100000, 5, 409,False) => Time: 00:00:01.9828191, GC0/1/2:  5653 /    0 /   0
   TrySpinLockArrayPool`1( 100000, 5, 409,False) => Time: 00:00:01.6222076, GC0/1/2:   718 /    0 /   0
  PinnableBufferCachePool( 100000, 5, 409,False) => Time: 00:00:00.9512356, GC0/1/2:    35 /    0 /   0
  ConfigurableArrayPool`1( 100000, 5, 409,False) => Time: 00:00:01.9247528, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 5, 409,False) => Time: 00:00:00.9521102, GC0/1/2:     0 /    0 /   0

         AllocArrayPool`1( 100000, 5,4096,False) => Time: 00:00:09.7974853, GC0/1/2:  5420 /    0 /   0
           TlsArrayPool`1( 100000, 5,4096,False) => Time: 00:00:09.5533178, GC0/1/2:  4432 /    0 /   0
   TrySpinLockArrayPool`1( 100000, 5,4096,False) => Time: 00:00:08.7936312, GC0/1/2:   118 /    0 /   0
  PinnableBufferCachePool( 100000, 5,4096,False) => Time: 00:00:08.6697581, GC0/1/2:    31 /    0 /   0
  ConfigurableArrayPool`1( 100000, 5,4096,False) => Time: 00:00:08.9701564, GC0/1/2:     0 /    0 /   0
       DefaultArrayPool`1( 100000, 5,4096,False) => Time: 00:00:08.6865613, GC0/1/2:     0 /    0 /   0

         AllocArrayPool`1(1000000, 5,   1, True) => Time: 00:00:32.4310303, GC0/1/2: 72750 / 23683 /   1
           TlsArrayPool`1(1000000, 5,   1, True) => Time: 00:00:23.2720540, GC0/1/2: 58887 / 2402 /   0
   TrySpinLockArrayPool`1(1000000, 5,   1, True) => Time: 00:00:14.1677749, GC0/1/2: 20680 / 5816 /   0
  PinnableBufferCachePool(1000000, 5,   1, True) => Time: 00:00:11.1633328, GC0/1/2:   690 /  173 /   0
  ConfigurableArrayPool`1(1000000, 5,   1, True) => Time: 00:00:10.6817787, GC0/1/2:   344 /  115 /   0
       DefaultArrayPool`1(1000000, 5,   1, True) => Time: 00:00:08.3515682, GC0/1/2:   530 /  142 /   3

@jkotas, my goal in adding this second commit was mainly experimentation. To make forward progress with this, I see a few options:

  1. Close this PR.
  2. Effectively just use the first commit, splitting into two pool implementations and using the newer queue-based one, which beats out the existing SpinLock-based one in all of the tests I've done.
  3. Change the second commit to employ a compromise for now: have one TLS slot per thread, no weak reference.
  4. Merge the PR.

I'd prefer not to do (1) or (4). There's value in the first commit, and the second commit with its WeakReference-based implementation was more experimentation than anything else; as you say, it'll have not-good behaviors when all of the buffers get cleaned up in a gen2 GC. Given the discussion thus far and a desire for some kind of TLS cache, my preference is (3): it'll give a boost for the common access pattern you call out, without leaking an obscene amount of memory. We would then use #7747 to look at ways of tuning this, dumping those TLS caches in a better manner, etc. (2) would also be ok.

Thoughts?

@stephentoub (Member, Author) commented Nov 21, 2016:

One other thought:
I was thinking of having Shared return this pool for byte[] and char[], maybe a few others, and falling back to something without TLS for other types.

@jkotas (Member) commented Nov 21, 2016:

my preference is (3)

Sounds good to me.

@rynowak (Member) commented Nov 21, 2016:

@rynowak, would it be easier if I gave you a 1.1-based build of System.Buffers.dll?

Yes this would make things a lot easier.

@kouvel (Member) left a comment:

Rest looks good to me

_buckets[bucket].Return(array);
// Check to see if the buffer is the correct size for this bucket
GlobalBucket b = _buckets[bucket];
if (array.Length < b.BufferLength)
Member:

I guess it would be ok to pool arrays that were not rented from here and are a bit larger, but if the array is significantly larger, especially for the last bucket, it would be a pity to hold on to that array. Seems like it would be fair to reject or ignore any arrays that are not exactly the same size.

@stephentoub (Member, Author):

I've already changed it back, just haven't pushed an updated commit yet. This was a holdover from when I was playing around with allowing buffers to be rented from higher buckets if the target one was empty and returned to lower buckets if the target one was full.

GlobalBucket b = _buckets[bucket];
if (array.Length < b.BufferLength)
{
throw new ArgumentException(SR.ArgumentException_BufferNotFromPool, nameof(array));
Member:

If an empty array is rented and returned, wouldn't this throw? It seems like it should ignore it as was done before in the else if (array.Length == 0) block.

@stephentoub (Member, Author):

We never get here for an empty array, as the higher-level if block won't be entered.

Member:

For an empty array bucket would be 0 and _buckets[0].BufferLength would be 16 (Utilities.GetMaxSizeForBucket(0) would return 16). In that case, it would enter this block, no?

@stephentoub (Member, Author) Nov 21, 2016:

For an empty array bucket would be 0

Actually, no, because of the selection algorithm and the -1 used in https://github.com/dotnet/coreclr/pull/8209/files/f4c1cc902e4dbdb3e25a101818c61c7e33a84c53#diff-32265784f459bed5d7538aded41a5958R15 , 0 ends up looking like a gigantic buffer and skips past this whole block.

Member:

I see, makes sense

}
else if (delta < 0)
{
// Queue is full.
@kouvel (Member) Nov 21, 2016:

Suppose _slots.Length is 1 and only TryEnqueues are happening. On the first TryEnqueue, currentTail changes from 0 to 1, and _slots[0].SequenceNumber changes from 0 to 1. On the second TryEnqueue, slotsIndex is still 0, so delta is also 0. currentTail changes from 1 to 2, and _slots[0].SequenceNumber changes from 1 to 2. I expected the second TryEnqueue to trigger the "queue is full" case here. Did I miss something?

Member:

Never mind, I now see the assert that says the length must be >= 2.

Member:

But as in TryEnqueue, if old or torn values are read, this may also incorrectly determine that the queue is full and is not guaranteed to enqueue an item when there is space.

@stephentoub (Member, Author):

As with the previous comment, torn reads should not happen, and it's ok to say it's full when it's not, in which case the array just gets dropped even if there could have been a slot for it.

}
else if (delta < 0)
{
// Queue is empty
Member:

This determination is only based on volatile reads of long values. Those could potentially be reading old values or the reads may be torn on some 32-bit platforms, so this determination that the queue is empty could be incorrect. I guess it is ok in the case it is used from ArrayPool since it would just end up allocating an array, but if this is correct, it would be good to add a comment indicating that this queue is not guaranteed to dequeue a queued item.

@stephentoub (Member, Author):

There shouldn't be any tearing, as all Int64 reads are done with Volatile.Read. Regardless, though, it's ok for it to be wrong in this direction.

Member:

Oh ok, missed the interlocked read on 32-bit platforms. It would be good to add a comment or to reflect in the name that enqueues/dequeues are not guaranteed, as the class could be used later for another purpose where it matters.

@stephentoub stephentoub added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 21, 2016
@Maoni0 (Member) commented Nov 22, 2016:

I've taken a look at this thread. I am not clear on how the PinnableBufferCache experiment was done - @stephentoub are you using the stock implementation of PinnableBufferCache, or did you also modify it to be say with a per thread cache too?

Also since you were not actually doing any gen2 GCs, much of the logic in PinnableBufferCache is not activated as it uses gen2 GCs as a “clock” to trim the free list.

It also already has a way to move items between different lists. You can do the same thing with your per-thread pool and global pool (which can correspond to its gen2 free list and non-gen2 free list). The idea is you grow the list when you need to, but you remember which ones on a list are preferred (in the PinnableBufferCache case these are the gen2 buffers) so you give those out first; and when you trim, you trim the non-preferred ones first (see the sketch below).
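
A simplified sketch of that two-list policy (assumed names; the real PinnableBufferCache additionally ties trimming to the gen2 "clock" mentioned above):

    using System.Collections.Generic;

    internal sealed class TwoListBufferCache
    {
        private readonly Stack<byte[]> _preferred = new Stack<byte[]>();    // e.g. gen2 buffers
        private readonly Stack<byte[]> _nonPreferred = new Stack<byte[]>();

        public byte[] TakeBuffer()
        {
            if (_preferred.Count > 0) return _preferred.Pop();    // preferred ones go out first
            if (_nonPreferred.Count > 0) return _nonPreferred.Pop();
            return null; // caller allocates; new buffers start out non-preferred
        }

        public void ReturnBuffer(byte[] buffer, bool isPreferred)
        {
            (isPreferred ? _preferred : _nonPreferred).Push(buffer);
        }

        public void Trim(int targetCount)
        {
            // Shrink toward targetCount, discarding non-preferred buffers first.
            while (_nonPreferred.Count > 0 && _preferred.Count + _nonPreferred.Count > targetCount)
                _nonPreferred.Pop();
            while (_preferred.Count > targetCount)
                _preferred.Pop();
        }
    }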

I would try the way PinnableBufferCache manages the buffers in its pool but add the per-thread caching support. This wasn't that essential for PinnableBufferCache because we don't expect very intensive usage of this pool – it is there mostly for fragmentation control, not for speed. In order to make it a general-purpose pool where speed would be a key factor you'd want to make some modifications. I would start by taking a profile to see where we are spending the time. Also, are you expecting that most of the allocations would be rented from the pool? If so, you would simply not have many gen2 GCs at all, and using gen2 wouldn't be a good "clock". In that case you might consider using other factors like # of items or memory pressure.

Another thing I should mention is whether you want per thread or per core caching. GC uses per core but GC only does this every alloc quantum which is a few KB so it can afford to check at that granularity. And if there's no async, and you have a bunch of worker threads that basically just run on the same cores, that's basically like having a per core cache. With async I am not sure if it just means a threadpool thread can be used to execute parts of any random requests.

@stephentoub (Member, Author):

I am not clear on how the PinnableBufferCache experiment was done - @stephentoub are you using the stock implementation of PinnableBufferCache, or did you also modify it to be say with a per thread cache too?

I was keeping it simple: the previously shown numbers were just an ArrayPool-based wrapper around a PinnableBufferCache instance of one array-length size (it would need to be more complicated for a real implementation). I've added in another that does similar TLS-based caching before hitting the PinnableBufferCache.

Also since you were not actually doing any gen2 GCs, much of the logic in PinnableBufferCache is not activated as it uses gen2 GCs as a “clock” to trim the free list.

Yes. There are gen2 GCs being forced between tests, but they're rare in the tests themselves. Note that the same PBC is used across all of the tests.

It also already has a way to move items between different lists. You can do the same thing with your per thread pool and global pool (which can correspond to its gen2 free list and non gen2 free list)

It's not the when I'm concerned about, but rather the how. Specifically, in the situation where a thread ends up caching something in TLS and then that thread never again executes code that deals with that particular array size... that array will sit stuck in that thread's TLS, and we won't have a mechanism to get it out without executing code on that thread. So either we say this is rare enough that we don't care, or we need to come up with some mechanism for doing this.

Also are you expecting that most of the allocations would be rented from the pool?

For the specific array types, the expectation is that (eventually) the majority will come from the pool, at least those created by the stack up through ASP.NET.

I've basically scrapped my previous implementation and have a new one which I'll post shortly...

@jkotas (Member) left a comment:

Looks great! Thank you!

[MethodImpl(MethodImplOptions.AggressiveInlining)]
get
{
// On Unix, GetCurrentProcessorNumber is implemented in terms of sched_getcpu, which
Member:

I wonder what the performance characteristics of GetCurrentProcessorNumber are across different Unix OSes where it is actually implemented. We may want to cache it, similar to how the GC does it; or maybe piggyback on the GC cache where possible.

Member:

E.g. refresh the cached value every N rents; or whenever we hit lock contention.

Member:

Similar problem may exist on Windows - some of combinations of old OS versions; old CPU versions; or some types of virtualized environments are likely going to have slow implementation of GetCurrentProcessorNumber.

Member:

Not sure you could; the GC uses affinitised threads, so the core is stable? Unless there is some way to get the OS scheduler to inject the core number, it will keep changing, so it couldn't be cached very well?

Though it may work as "mostly" correct, enough to avoid contention; unlikely to be L1/L2-cache correct though.

Member:

or whenever we hit lock contention.

That might work nicely

Member:

GC uses affinitised threads so the core is stable

The server GC worker threads are affinitised. The regular user threads are not affinitised - the GC regularly queries the OS about which processor a thread is actually running on, and also gives a hint to the OS scheduler about what it thinks the ideal processor for the thread is (using SetThreadIdealProcessor - only on Windows today).

Member:

performance characteristics of GetCurrentProcessorNumber

Maybe we should time a few calls, and then decide whether it is cheap enough to call it for each rent/return, or cache it instead to amortize the costs.

@stephentoub (Member, Author):

E.g. refresh the cached value every N rents; or whenever we hit lock contention.

I added a commit that caches the processor number in TLS and flushes the cache whenever lock contention is encountered on the per-core stack accesses (sketch below). Is this along the lines of what you had in mind?
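
A sketch of that scheme (assumed shape, not the PR's exact code; the PR uses an internal call for the processor number, while Thread.GetCurrentProcessorId is the public API available in later runtimes):

    [ThreadStatic]
    private static int t_cachedProcessorIdPlusOne; // 0 means "not cached yet"

    private static int ExecutionId
    {
        get
        {
            int cached = t_cachedProcessorIdPlusOne;
            if (cached == 0)
            {
                cached = Thread.GetCurrentProcessorId() + 1; // refresh from the OS
                t_cachedProcessorIdPlusOne = cached;
            }
            return cached - 1;
        }
    }

    private static void MonitorEnterWithProcNumberFlush(object obj)
    {
        if (!Monitor.TryEnter(obj))
        {
            // Contention suggests we may have migrated cores: drop the cache.
            t_cachedProcessorIdPlusOne = 0;
            Monitor.Enter(obj);
        }
    }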

Member:

Yes, it is along the lines of what I had in mind... but I am not sure how well it is actually going to work in practice. It is something we should work on more (e.g. together with making GetCurrentProcessorNumber an FCall). It does not need to block this change from getting merged.

@stephentoub stephentoub removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 22, 2016
// round-robin through the other stacks.
T[] arr;
LockedStack[] stacks = _perCoreStacks;
int index = ExecutionId % stacks.Length;

Might prefixing this with if (executionId >= stacks.Length) noticeably improve it?

@stephentoub (Member, Author):

I didn't notice the mod operation as being particularly measurable.
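
For context, an illustrative completion of the rent path the hunk above is excerpted from (assumed shape; LockedStack.TryPop is a hypothetical per-core locked-stack operation):

    private T[] TryRentFromPerCoreStacks()
    {
        LockedStack[] stacks = _perCoreStacks;
        int index = ExecutionId % stacks.Length; // start at the current core's stack
        for (int i = 0; i < stacks.Length; i++)
        {
            T[] arr = stacks[index].TryPop();
            if (arr != null) return arr;
            index = (index + 1) % stacks.Length; // round-robin through the other stacks
        }
        return null; // all per-core stacks empty; caller falls back to allocation
    }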

@Maoni0 (Member) commented Nov 23, 2016:

So how well would async operations keep the threads that are part of an async operation on the same core? If method foo is awaiting bar and bar is dispatched to another core, it doesn't help if foo got the allocation on the core it's running on and bar uses that allocation on a different core...

@stephentoub (Member, Author):

if method foo is await-ing on bar and bar is dispatched to another core, it doesn't help if foo got the allocation on the core it's running on and bar uses this allocation on a different core

Right, that's effectively the GlobalYield case in the benchmarks I posted, and why I included it.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void MonitorEnterWithProcNumberFlush(object obj)
{
if (!Monitor.TryEnter(obj))
Member:

Fail fast? #6951

if (!Monitor.TryEnter(obj, 0))

@stephentoub (Member, Author):

I don't understand the comment... Monitor.TryEnter(obj) is Monitor.TryEnter(obj, 0). Can you elaborate?

Member:

😳 Good point...

@benaadams (Member) commented Nov 23, 2016:

[ThreadStatic] comes with a small cost; would there be an advantage to putting t_tlsBuckets and t_cachedProcessorNumber in a ThreadStatic class so it only has a single access and single ThreadStatic item rather than multiple accesses and two items?

e.g.

private class ThreadStaticData
{
    public T[][] Buckets;
    public int? ProcessorNumber;
}

[ThreadStatic]
private static ThreadStaticData t_tlsData;

Then assign t_tlsData to a local variable and pass to the various functions that use its data?
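
For illustration, usage along those lines might look like this (names assumed from the snippet above):

    // Read the ThreadStatic field once, then work against the local reference.
    ThreadStaticData tls = t_tlsData ?? (t_tlsData = new ThreadStaticData());
    T[][] buckets = tls.Buckets;
    int? processorNumber = tls.ProcessorNumber;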

@stephentoub (Member, Author):

would there be an advantage

I can experiment with it.

[commit messages from the updated branch]
- Renames DefaultArrayPool to ConfigurableArrayPool, which remains unchanged.
- Adds a new TlsOverPerCoreLockedStacksArrayPool, which is used as the shared pool for byte[] and char[].  The pool is tiered, with a small per-thread TLS cache, followed by a global cache.  The global cache is split into effectively per-core buckets, although threads are able to check other buckets if their assigned bucket is empty/full for rents/returns, respectively.
@rynowak (Member) commented Nov 30, 2016:

Here's our data from trying this with TechEmpower's JSON benchmark implemented using ASP.NET MVC on our 48-core environment. This is the thinnest benchmark we have today that uses the buffer pool, and it's currently limited by contention (note the 62% CPU usage).

For reference https://www.techempower.com/benchmarks/#section=data-r13&hw=ph&test=json (CTRL+F aspnetcore-mvc-linux). The official results use different hardware so don't directly compare the numbers.

                        Run 1       Run 2       Run 3       STDEV      AVG         CPU Usage
Vanilla - Depth = 1     333988.96   343132.20   345864.02   6219.31    340995.06   62%
Vanilla - Depth = 16    358848.09   373391.77   359052.15   8338.51    363764.00   59%
Private - Depth = 1     423815.20   404264.73   408737.05   10243.48   412272.33   100%
Private - Depth = 16    738235.34   748093.14   740584.74   5148.98    742304.41   100%

This is roughly a 20% improvement for us in this scenario.

@benaadams (Member) commented Nov 30, 2016:

+20.9% for single requests
+104.0% for pipelined 16 (higher contention)

And cpu maxed, quite effective 😁

@stephentoub (Member, Author):

@jkotas, I was just waiting for data from @rynowak. Now that it's in and looks good, any concerns with merging this? In particular, if you could review what I did for the FCall to make sure I did it right, I'd appreciate it.

@vancem commented Nov 30, 2016:

This is good stuff!

@jkotas (Member) commented Dec 1, 2016:

I think we should have the cache of the current processor number on the unmanaged side as part of the C++ FCall implementation, but that can be done as a follow-up fix. I will look into doing it. It does not need to block getting this in.

Agree with @vancem - good stuff!

@jkotas jkotas merged commit 32e6bf8 into dotnet:master Dec 1, 2016
@stephentoub stephentoub deleted the arraypool_perf branch December 1, 2016 01:22
@omariom commented Dec 1, 2016:

Why do OSes use rdtscp instead of storing CPU numbers in TEBs?
Just curious.

@jkotas (Member) commented Dec 1, 2016:

The processors tend to have quite a few special registers that are managed by the OS kernel. These registers are often read-only to user mode, or not available in user mode at all. When you show up in kernel mode out of the blue, it is kind of useful to know where you came from and which processor you are running on. Once you burn one of these special registers on processor id, it makes sense to use it for processor id in user mode as well. No need to duplicate it in usermode TLS.

On ARM processors, these special registers are more regular: they have numbers, and it is easier to wrap your head around them. For example, Windows chose to store the processor number in c13, c0, 3 (also known as TPIDRURO).

On Intel processors, these special registers are more colorful because they have more history behind them. For example, if you step into the disassembly of GetCurrentProcessorNumber on Windows x64, you can see that it uses rdtscp if it is available, and falls back to decoding the processor number from lsl (load segment limit) if rdtscp is not available. The segment limit is basically another one of these special registers. And if the trick with the segment limit is not available, it falls back to yet another scheme.

@mattwarren:

@benaadams @stephentoub based on the comment above, I wrote a simple benchmark to test the effect of single vs. multiple TLS field accesses.

However, there doesn't seem to be much difference; in fact the combined/single access is slightly slower? Maybe storing it in a local field explains the difference? (benchmark code)

BenchmarkDotNet=v0.10.0
OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630771 Hz, Resolution=380.1167 ns, Timer=TSC
Host Runtime=Clr 4.0.30319.42000, Arch=32-bit RELEASE
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0
Job Runtime(s):
	Clr 4.0.30319.42000, Arch=32-bit RELEASE
Method                 Mean       StdDev     Median
IndividualTLSFields    7.6758 ns  0.1391 ns  7.6476 ns
CombinedTLSFields      8.2771 ns  0.0653 ns  8.2900 ns

@benaadams (Member):

@mattwarren interesting...

I was thinking of something like benaadams@f2d6e0a

@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Improve ArrayPool implementation and performance

Commit migrated from dotnet/coreclr@32e6bf8