Improve ArrayPool implementation and performance #8209
Conversation
/// <summary>Represents a buffer in the bucket.</summary>
private struct Slot
Worth making this [StructLayout(LayoutKind.Explicit, Size = 64)]
to avoid contention with nearby slots? There will currently be 4 neighbouring slots in a cache line.
Experiments with LMAX Disruptor showed that in the case of ring buffers, false sharing is the biggest cause of contention in highly multithreaded environments.
Source: http://mechanical-sympathy.blogspot.com.ar/2011/07/false-sharing.html
Experiments with LMAX Disruptor showed that in the case of ring buffers, false sharing is the biggest cause of contention in highly multithreaded environments
I understand that and agree, but the typical use case for disruptor is very different than the typical use case for a buffer pool. Disruptor is very much about producer/consumer, typically when producer/consumer are different threads (that are often affinitized). That's not the common case here. That's why I left the comment I did; there's opportunity to experiment further to see if it makes a difference, but I didn't see one in the testing I did on the hardware available to me for the patterns I tested, and adding padding here increases the size of each of the arrays non-trivially, so we'd want to know it made a meaningful difference for the target scenarios before doing it.
Btw, @redknightlois or @benaadams, if you want to experiment with this, I'd welcome it. I'd add padding by changing the Slot to this:
/// <summary>Represents a buffer in the bucket.</summary>
[StructLayout(LayoutKind.Sequential)]
private struct Slot
{
private readonly Padding56 _pad1;
/// <summary>The sequence number for this slot, used to synchronize between producers and consumers.</summary>
public long SequenceNumber; // Assuming a 64-byte cache line, sits in the last 8 bytes of [0,64)
/// <summary>The buffer.</summary>
public T[] Array; // Assuming a 64-byte cache line, sits in the first 8 bytes of [64,128)
private readonly Padding56 _pad2;
}

[StructLayout(LayoutKind.Explicit, Size = 56)]
internal struct Padding56 { }
in case you want to use that as a starting point.
@stephentoub - I can probably do this if it's important; it won't be straightforward because we haven't updated everything to 1.2 yet....
@rynowak, would it be easier if I gave you a 1.1-based build of System.Buffers.dll?
The typical ArrayPool pattern is: rent a buffer, write data to it, read data from it, return it; on multiple threads at the same time. I have written a test that does exactly that: https://gist.github.com/jkotas/319f360c72ba12c3dccc00f557567997. It allocates a 4K buffer, fills it in, computes its sum, and returns it; on all cores; in a loop. The numbers from my dev box (Intel E5-2630, 2 sockets, 16 cores, 32 logical processors) are:
These numbers show that the ArrayPool is 1.5x worse (both before and after this change) than just allocating a fresh array every time, and 6.5x slower than an ad-hoc thread-static buffer cache. Micro-benchmarks like this one are what we need to tune for because they are a better proxy for real-world program behavior - in particular around process cache utilization; the raw performance of rent/return is just a small component of it. I believe it should be achievable for ArrayPool to be within 20% of the thread-static cache number for this microbenchmark. The ArrayPool is a variant of the classic C malloc/free. There are a few things to learn from it:
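For reference, a rough sketch of the rent/fill/read/return loop being measured (the linked gist is the authoritative benchmark; the class name and iteration count here are assumptions):

using System;
using System.Buffers;
using System.Threading;
using System.Threading.Tasks;

internal static class RentReturnBenchmarkSketch
{
    internal static long RunOnAllCores(int iterationsPerCore = 1000000)
    {
        long total = 0;
        Parallel.For(0, Environment.ProcessorCount, _ =>
        {
            ArrayPool<byte> pool = ArrayPool<byte>.Shared;
            long localSum = 0;
            for (int i = 0; i < iterationsPerCore; i++)
            {
                byte[] buffer = pool.Rent(4096);                      // rent a 4K buffer
                for (int j = 0; j < 4096; j++) buffer[j] = (byte)j;   // fill it in
                for (int j = 0; j < 4096; j++) localSum += buffer[j]; // compute its sum
                pool.Return(buffer);                                  // return it to the pool
            }
            Interlocked.Add(ref total, localSum);
        });
        return total;
    }
}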
I agree. That's why in this change I split the implementation so that we can tune one separately from the other. You agree that's a good direction?
Once the shared is split from the configurable, that's relatively straightforward to add. What's harder is thinking through and tuning it. How big should the thread-local parts of the cache be? Should that vary based on the size of the buffer? Should it vary dynamically based on access patterns? Should there be any mechanism whereby buffers cached on one thread can be made available to another thread's cache, or a mechanism whereby they can be reclaimed from TLS (weak refs or something more integrated into the GC)? I see #7747 as addressing that... or are you suggesting we scrap this PR?
While I agree this is very common, we've also made it very common for the renting and returning to be done by different threads, due to async/await: rent a buffer, potentially write data to it, potentially hop threads as part of an awaited operation, potentially read data from it, return the buffer.
(force-pushed from 379bfda to fe01d08)
Yes, this split is good.
I do not know what the underlying algorithms and data structures required to support #7747 tuning are going to look like exactly. It is very likely that they are going to be different from what they look like today. I would expect the local states and the global state to be tightly coupled. We can keep this PR if it is an improvement (ideally measured on something real like ASP.NET). Chances are there won't be much left of it once we are done with the overall tuning.
Yes, this is another case to tune for. It may be potentially interesting to deal well with the case where the different threads are running on the same core/socket. Schedulers tend to have heuristics to affinitize work that belongs to the same "request" to the same core/socket.
(force-pushed from fe01d08 to f4c1cc9)
I've taken a stab at adding a TLS layer in front of the global cache. Along with each global bucket, each thread has a TLS bucket that can hold a few arrays, but using weak references to avoid prolonged leaks. I also made the sizing scheme a bit more dynamic, so fewer big arrays are cachable. Rents first try TLS, then the global cache, and returns first target the TLS cache, and then go to the global if it's full. (Maybe with some mscorlib magic we could make stale TLS arrays move to the global... I experimented with using finalization and resurrection to do so, but it tanked perf.) I took Jan's benchmark and created one with a few more cases and scenarios:
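A minimal sketch of what such a per-thread weak-reference bucket might look like (names invented here; the real code keeps one per buffer-size bucket and is more involved). Misses and overflows fall back to the global cache:

using System;

internal sealed class TlsWeakBucketSketch<T>
{
    private readonly WeakReference<T[]>[] _arrays = new WeakReference<T[]>[4];

    public bool TryPush(T[] array)
    {
        // Reuse a weak reference whose target was collected, or fill an empty slot.
        for (int i = 0; i < _arrays.Length; i++)
        {
            WeakReference<T[]> wr = _arrays[i];
            if (wr == null)
            {
                _arrays[i] = new WeakReference<T[]>(array);
                return true;
            }
            if (!wr.TryGetTarget(out _))
            {
                wr.SetTarget(array);
                return true;
            }
        }
        return false; // Bucket full; the caller falls back to the global cache.
    }

    public T[] TryPop()
    {
        for (int i = 0; i < _arrays.Length; i++)
        {
            WeakReference<T[]> wr = _arrays[i];
            if (wr != null && wr.TryGetTarget(out T[] array))
            {
                wr.SetTarget(null);
                return array;
            }
        }
        return null; // Nothing cached on this thread; the caller tries the global cache.
    }
}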
What if it were a per-CPU pool?
A per-CPU pool is good, but hard to enforce without thread affinity (which would cause more problems than it would solve in this instance). It's also a sys call to determine the current core. Having said that, while migrations would still happen, it would reduce the chance of lock contention; but locks would still be required (without affinities).
@omariom I think my first interaction with @stephentoub was around per-core stuff https://github.com/dotnet/corefx/issues/696 😄
As @stephentoub mentioned, even inter-thread migration is normal.
Maybe there is a way with just Interlockeds?
Synchronisation might have been better phrasing. With thread data no synchronisation is required - though I think extending the thread with data comes at a higher starting cost? @stephentoub, perhaps as a halfway house between the two? I'm having trouble building apps with coreclr directly as I suffer from lots of namespace and dependency clashes (which I assume is transitory and a result of bringing APIs back), so I can't test myself :-/
@benaadams It is just a call (not a sys call) on decent systems. E.g. GetCurrentProcessorNumber is implemented as
This is fundamentally really similar to https://github.com/dotnet/coreclr/blob/master/src/mscorlib/Common/PinnableBufferCache.cs. I wonder whether we should merge the two. The PinnableBufferCache has quite a bit of self-tuning logic.
// This isn't our first time pushing on this thread. Iterate through
// the weak references looking for one that doesn't have an array,
// and store the pushed array into it.
foreach (WeakReference<T[]> wr in _arrays)
WeakReference means that we will lose all per-thread buffers (that are not in use) every gen2 GC. I do not know how bad it is going to be, but it is certainly not going to be good.
Yeah. As I noted, what I really wanted was something that would move the buffers from here to the global pool, instead of dropping them. To my knowledge we just don't have anything like that yet. I'd tried using finalizable objects and resurrection to do it: essentially, instead of having a weak ref to an array, having a weak ref to a poolable, resurrectable, finalizable object wrapping an array, whose finalizer would TryEnqueue the array onto the global queue. Perf tanked, but it's possible I messed it up somehow... I'll try again now that I've made other changes.
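For illustration, a hypothetical sketch of that resurrection idea (invented names; a ConcurrentQueue stands in for the global queue): the TLS cache would hold a weak reference to the wrapper, and if the wrapper gets collected while the array is still cached, its finalizer publishes the array to the global cache instead of losing it.

using System.Collections.Concurrent;

internal sealed class ResurrectingPooledArray<T>
{
    private readonly ConcurrentQueue<T[]> _globalQueue;
    internal T[] Array;

    internal ResurrectingPooledArray(T[] array, ConcurrentQueue<T[]> globalQueue)
    {
        Array = array;
        _globalQueue = globalQueue;
    }

    ~ResurrectingPooledArray()
    {
        // The wrapper is being collected; if the array is still cached,
        // hand it back to the global pool rather than dropping it.
        T[] arr = Array;
        if (arr != null)
        {
            _globalQueue.Enqueue(arr);
        }
    }
}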
Certainly worth exploring. I added into my test two additional test pools (again, not 100% apples to apples, but close enough), one based on PinnableBufferCache, and one based on @benaadams's question about SpinLock.TryEnter. Here are the results (again on my 4core/8logical machine):
@jkotas, my goal in adding this second commit was mainly experimentation. To make forward progress with this, I see a few options:
I'd prefer not to do (1) or (4). There's value in the first commit, and the second commit with its WeakReference-based implementation was more experimentation than anything else; as you say, it'll have not-good behaviors when all of the buffers get cleaned up in a gen2 GC. Given the discussion thus far and a desire for some kind of TLS cache, my preference is (3): it'll give a boost for the common access pattern you call out, without leaking an obscene amount of memory. We would then use #7747 to look at ways of tuning this, dumping those TLS caches in a better manner, etc. (2) would also be ok. Thoughts?
One other thought:
Sounds good to me.
Yes, this would make things a lot easier.
Rest looks good to me
_buckets[bucket].Return(array);
// Check to see if the buffer is the correct size for this bucket
GlobalBucket b = _buckets[bucket];
if (array.Length < b.BufferLength)
I guess it would be ok to pool arrays that were not rented from here and are a bit larger, but if the array is significantly larger, especially for the last bucket, it would be a pity to hold on to that array. Seems like it would be fair to reject or ignore any arrays that are not exactly the same size.
I've already changed it back, just haven't pushed an updated commit yet. This was a holdover from when I was playing around with allowing buffers to be rented from higher buckets if the target one was empty and returned to lower buckets if the target one was full.
GlobalBucket b = _buckets[bucket];
if (array.Length < b.BufferLength)
{
throw new ArgumentException(SR.ArgumentException_BufferNotFromPool, nameof(array));
If an empty array is rented and returned, wouldn't this throw? It seems like it should ignore it, as was done before in the else if (array.Length == 0) block.
We never get here for an empty array, as the higher-level if block won't be entered.
For an empty array, bucket would be 0 and _buckets[0].BufferLength would be 16 (Utilities.GetMaxSizeForBucket(0) would return 16). In that case, it would enter this block, no?
For an empty array, bucket would be 0
Actually, no: because of the selection algorithm and the -1 used in https://github.com/dotnet/coreclr/pull/8209/files/f4c1cc902e4dbdb3e25a101818c61c7e33a84c53#diff-32265784f459bed5d7538aded41a5958R15, 0 ends up looking like a gigantic buffer and skips past this whole block.
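For illustration, a simplified sketch of that selection behavior (the real Utilities.SelectBucketIndex uses bit tricks rather than a loop): the -1 turns a zero length into uint.MaxValue, so the computed bucket index is far past the last real bucket and the array is simply ignored.

private static int SelectBucketIndexSketch(int bufferLength)
{
    // For bufferLength == 0, (uint)(bufferLength - 1) == uint.MaxValue,
    // which maps to index 28 here, well beyond any real bucket.
    uint bits = (uint)(bufferLength - 1) >> 4; // the smallest bucket holds length-16 arrays
    int index = 0;
    while (bits != 0)
    {
        bits >>= 1;
        index++;
    }
    return index;
}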
I see, makes sense
}
else if (delta < 0)
{
// Queue is full.
Suppose _slots.Length is 1 and only TryEnqueues are happening. On the first TryEnqueue, currentTail changes from 0 to 1, and _slots[0].SequenceNumber changes from 0 to 1. On the second TryEnqueue, slotsIndex is still 0, so delta is also 0. currentTail changes from 1 to 2, and _slots[0].SequenceNumber changes from 1 to 2. I expected the second TryEnqueue to trigger the "queue is full" case here. Did I miss something?
Never mind, I see now the assert that says the length must be >= 2.
But as in TryEnqueue, if old or torn values are read, this may also incorrectly determine that the queue is full and is not guaranteed to enqueue an item when there is space.
As with the previous comment, torn reads should not happen, and it's ok to say it's full when it's not, in which case the array just gets dropped even if there could have been a slot for it.
}
else if (delta < 0)
{
// Queue is empty
This determination is only based on volatile reads of long values. Those could potentially be reading old values, or the reads may be torn on some 32-bit platforms, so this determination that the queue is empty could be incorrect. I guess it is ok in the case it is used from ArrayPool, since it would just end up allocating an array, but if this is correct, it would be good to add a comment indicating that this queue is not guaranteed to dequeue a queued item.
There shouldn't be any tearing, as all Int64 reads are done with Volatile.Read. Regardless, though, it's ok for it to be wrong in this direction.
Oh ok, missed the interlocked read on 32-bit platforms. It would be good to add a comment or to reflect in the name that enqueues/dequeues are not guaranteed, as the class could be used later for another purpose where it matters.
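To make the scheme being discussed concrete, here is a condensed sketch of a sequence-numbered bounded queue (field and method names assumed; the real type additionally pads slots, uses Volatile.Read for the 64-bit values, and asserts a minimum length of 2). delta == 0 means the slot is available for this head/tail value, delta < 0 means the queue looks full (for enqueue) or empty (for dequeue), and a stale read only errs toward dropping or allocating an array.

using System.Threading;

internal sealed class BoundedQueueSketch<T> where T : class
{
    private struct Slot { public long SequenceNumber; public T Item; }

    private readonly Slot[] _slots;
    private long _head, _tail;

    public BoundedQueueSketch(int size)
    {
        _slots = new Slot[size];
        for (int i = 0; i < size; i++) _slots[i].SequenceNumber = i;
    }

    public bool TryEnqueue(T item)
    {
        while (true)
        {
            long tail = Volatile.Read(ref _tail);
            int index = (int)(tail % _slots.Length);
            long delta = Volatile.Read(ref _slots[index].SequenceNumber) - tail;
            if (delta == 0)
            {
                // Slot is free for this tail value; try to claim it.
                if (Interlocked.CompareExchange(ref _tail, tail + 1, tail) == tail)
                {
                    _slots[index].Item = item;
                    Volatile.Write(ref _slots[index].SequenceNumber, tail + 1);
                    return true;
                }
            }
            else if (delta < 0)
            {
                return false; // Queue looks full (possibly a stale read); the caller just drops the item.
            }
            // delta > 0: another producer claimed this slot; loop and retry.
        }
    }

    public bool TryDequeue(out T item)
    {
        while (true)
        {
            long head = Volatile.Read(ref _head);
            int index = (int)(head % _slots.Length);
            long delta = Volatile.Read(ref _slots[index].SequenceNumber) - (head + 1);
            if (delta == 0)
            {
                if (Interlocked.CompareExchange(ref _head, head + 1, head) == head)
                {
                    item = _slots[index].Item;
                    _slots[index].Item = null;
                    Volatile.Write(ref _slots[index].SequenceNumber, head + _slots.Length);
                    return true;
                }
            }
            else if (delta < 0)
            {
                item = null;
                return false; // Queue looks empty; ArrayPool just allocates a new array.
            }
        }
    }
}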
I've taken a look at this thread. I am not clear on how the PinnableBufferCache experiment was done - @stephentoub are you using the stock implementation of PinnableBufferCache, or did you also modify it to, say, have a per-thread cache too? Also, since you were not actually doing any gen2 GCs, much of the logic in PinnableBufferCache is not activated, as it uses gen2 GCs as a "clock" to trim the free list.

It also already has a way to move items between different lists. You can do the same thing with your per-thread pool and global pool (which can correspond to its gen2 free list and non-gen2 free list). The idea is you grow the list when you need to, but you remember which ones are preferred (in the PinnableBufferCache case this is the gen2 buffers) on a list so you give those out first; and when you trim, you trim the non-preferred ones first. I would try the way PinnableBufferCache manages the buffers in its pool but add the per-thread caching support. This wasn't that essential for PinnableBufferCache because we don't expect very intensive usage of this pool - it is there mostly for fragmentation control, not for speed. In order to make it a general-purpose pool where speed would be a key factor you'd want to make some modifications. I would start by taking a profile to see where we are spending the time.

Also, are you expecting that most of the allocations would be rented from the pool? If so you would simply not have many gen2 GCs at all, and using gen2 wouldn't be a good "clock". In that case you might consider using other factors like number of items or memory pressure.

Another thing I should mention is whether you want per-thread or per-core caching. GC uses per-core, but GC only does this every alloc quantum, which is a few KB, so it can afford to check at that granularity. And if there's no async, and you have a bunch of worker threads that basically just run on the same cores, that's basically like having a per-core cache. With async I am not sure if it just means a threadpool thread can be used to execute parts of any random requests.
I was keeping it simple: the previously shown numbers were just an ArrayPool-based wrapper around a PinnableBufferCache instance of one array-length size (it would need to be more complicated for a real implementation). I've added in another that does similar TLS-based caching before hitting the PinnableBufferCache.
Yes. There are gen2 GCs being forced between tests, but they're rare in the tests themselves. Note that the same PBC is used across all of the tests.
It's not the when I'm concerned about, but rather the how. Specifically, in the situation where a thread ends up caching something in TLS and then that thread never again executes code that deals with that particular array size... that array will sit stuck in that thread's TLS, and we won't have a mechanism to get it out without executing code on that thread. So either we say this is rare enough that we don't care, or we need to come up with some mechanism for doing this.
For the specific array types, the expectation is that (eventually) the majority will come from the pool, at least those created by the stack up through ASP.NET. I've basically scrapped my previous implementation and have a new one which I'll post shortly...
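To illustrate the preferred/non-preferred free-list management described above (invented names; the real PinnableBufferCache drives trimming and promotion from a gen2-GC "clock"), a small sketch: preferred items are handed out first, and trimming drops the non-preferred ones first.

using System.Collections.Concurrent;

internal sealed class TwoTierFreeListSketch<T>
{
    private readonly ConcurrentQueue<T> _preferred = new ConcurrentQueue<T>();
    private readonly ConcurrentQueue<T> _nonPreferred = new ConcurrentQueue<T>();

    public bool TryGet(out T item) =>
        _preferred.TryDequeue(out item) || _nonPreferred.TryDequeue(out item);

    public void Add(T item, bool preferred)
    {
        if (preferred) _preferred.Enqueue(item);
        else _nonPreferred.Enqueue(item);
    }

    // Promote items that have survived long enough to count as "preferred"
    // (PinnableBufferCache promotes buffers once they have reached gen2).
    public void Promote()
    {
        while (_nonPreferred.TryDequeue(out T item)) _preferred.Enqueue(item);
    }

    // Trim on a gen2 "clock" tick or under memory pressure: drop the
    // non-preferred items first, then preferred ones only if still needed.
    public void Trim(int targetCount)
    {
        while (Count > targetCount && _nonPreferred.TryDequeue(out _)) { }
        while (Count > targetCount && _preferred.TryDequeue(out _)) { }
    }

    public int Count => _preferred.Count + _nonPreferred.Count;
}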
(force-pushed from f4c1cc9 to b403a93)
Looks great! Thank you!
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get
{
// On Unix, GetCurrentProcessorNumber is implemented in terms of sched_getcpu, which
I wonder what the performance characteristics of GetCurrentProcessorNumber are across different Unix OSes where it is actually implemented. We may want to cache it - similar to how the GC does it; or maybe piggyback on the GC cache where possible.
E.g. refresh the cached value every N rents; or whenever we hit lock contention.
A similar problem may exist on Windows - some combinations of old OS versions, old CPU versions, or some types of virtualized environments are likely to have a slow implementation of GetCurrentProcessorNumber.
Not sure you could; the GC uses affinitised threads, so the core is stable? Unless there is some way to get the OS scheduler to inject the core number, it will keep changing, so it couldn't be cached very well?
Though it may work as "mostly" correct so as to avoid contention; unlikely to be L1/L2-cache correct though.
or whenever we hit lock contention.
That might work nicely
GC uses affinitised threads so the core is stable
The server GC worker threads are affinitised. The regular user threads are not affinitised - the GC regularly queries the OS about what processor the thread is actually running on, and also gives a hint to the OS scheduler about what it thinks the ideal processor for the thread to run on is (using SetThreadIdealProcessor - only on Windows today).
performance characteristics of GetCurrentProcessorNumber
Maybe we should time a few calls, and then decide whether it is cheap enough to call it for each rent/return, or cache it instead to amortize the costs.
E.g. refresh the cached value every N rents; or whenever we hit lock contention.
I added a commit that caches the process number in TLS and flushes the cache whenever lock contention is encountered on the per-core stack accesses. Is this along the lines of what you had in mind?
Yes, it is along the lines of what I had in mind ... but I am not sure how well it is actually going to work in practice. It is something we should work on more (e.g. together with making GetCurrentProcessorNumber an FCall). It does not need to block this change from getting merged.
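A minimal sketch of that idea (member names assumed; the Unix build would call sched_getcpu instead of the kernel32 export used here): the processor number is cached in TLS, offset by one so zero can mean "not cached", and the cache is flushed whenever taking a lock is contended.

using System;
using System.Runtime.InteropServices;
using System.Threading;

internal static class ProcessorIdCacheSketch
{
    [ThreadStatic]
    private static int t_processorIdPlusOne;

    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentProcessorNumber(); // Windows-only in this sketch

    internal static int ExecutionId
    {
        get
        {
            int id = t_processorIdPlusOne;
            if (id == 0)
            {
                // Query the OS only when the cache is empty.
                id = (int)GetCurrentProcessorNumber() + 1;
                t_processorIdPlusOne = id;
            }
            return id - 1;
        }
    }

    internal static void MonitorEnterWithProcNumberFlush(object obj)
    {
        if (!Monitor.TryEnter(obj))
        {
            // Contention hints the thread may have migrated; flush the cached
            // processor number and take the blocking path.
            t_processorIdPlusOne = 0;
            Monitor.Enter(obj);
        }
    }
}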
// round-robin through the other stacks.
T[] arr;
LockedStack[] stacks = _perCoreStacks;
int index = ExecutionId % stacks.Length;
Might prefixing it with if (executionId >= stacks.Length) noticeably improve it?
I didn't notice the mod operation as being particularly measurable.
So how well would the async operations keep the threads that are part of an async operation on the same core? If method foo is await-ing on bar and bar is dispatched to another core, it doesn't help if foo got the allocation on the core it's running on and bar uses this allocation on a different core...
Right, that's effectively the GlobalYield case in the benchmarks I posted, and why I included it.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void MonitorEnterWithProcNumberFlush(object obj)
{
if (!Monitor.TryEnter(obj))
Fail fast? #6951
if (!Monitor.TryEnter(obj, 0))
I don't understand the comment... Monitor.TryEnter(obj) is Monitor.TryEnter(obj, 0). Can you elaborate?
😳 Good point...
e.g.
private class ThreadStaticData
{
    public T[][] Buckets;
    public int? ProcessorNumber;
}

[ThreadStatic]
private static ThreadStaticData t_tlsData;
Then assign and read both via a single TLS access.
I can experiment with it.
- Renames DefaultArrayPool to ConfigurableArrayPool, which remains unchanged.
- Adds a new TlsOverPerCoreLockedStacksArrayPool, which is used as the shared pool for byte[] and char[]. The pool is tiered, with a small per-thread TLS cache, followed by a global cache. The global cache is split into effectively per-core buckets, although threads are able to check other buckets if their assigned bucket is empty/full for rents/returns, respectively.
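A condensed, single-bucket-size sketch of that tiering (invented names and sizes; the real TlsOverPerCoreLockedStacksArrayPool keeps one such structure per buffer-size bucket and uses the cached processor number rather than the managed thread id): rent checks the thread-local slot first and then the per-core locked stacks; return fills the thread-local slot first and overflows into the per-core stacks.

using System;
using System.Threading;

internal sealed class TieredPoolSketch<T>
{
    private const int BufferLength = 4096;

    [ThreadStatic]
    private static T[] t_local; // tiny per-thread cache: a single buffer in this sketch

    private readonly LockedStack[] _perCoreStacks;

    public TieredPoolSketch()
    {
        _perCoreStacks = new LockedStack[Environment.ProcessorCount];
        for (int i = 0; i < _perCoreStacks.Length; i++) _perCoreStacks[i] = new LockedStack();
    }

    // Stand-in for the cached processor number used by the real pool.
    private int HomeIndex => Thread.CurrentThread.ManagedThreadId % _perCoreStacks.Length;

    public T[] Rent()
    {
        T[] buffer = t_local;
        if (buffer != null) { t_local = null; return buffer; }

        // Start at the "home" stack and round-robin through the others.
        int start = HomeIndex;
        for (int i = 0; i < _perCoreStacks.Length; i++)
        {
            buffer = _perCoreStacks[(start + i) % _perCoreStacks.Length].TryPop();
            if (buffer != null) return buffer;
        }
        return new T[BufferLength];
    }

    public void Return(T[] buffer)
    {
        if (t_local == null) { t_local = buffer; return; }

        int start = HomeIndex;
        for (int i = 0; i < _perCoreStacks.Length; i++)
        {
            if (_perCoreStacks[(start + i) % _perCoreStacks.Length].TryPush(buffer)) return;
        }
        // Every stack was full: drop the buffer and let the GC reclaim it.
    }

    private sealed class LockedStack
    {
        private readonly T[][] _arrays = new T[8][];
        private int _count;

        public bool TryPush(T[] array)
        {
            lock (this) { if (_count == _arrays.Length) return false; _arrays[_count++] = array; return true; }
        }

        public T[] TryPop()
        {
            lock (this) { return _count > 0 ? _arrays[--_count] : null; }
        }
    }
}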
(force-pushed from 7ea4cfa to ab67f37)
(force-pushed from ab67f37 to 9a7bddb)
Here's our data from trying this with TechEmpower's JSON benchmark implemented using aspnet MVC on our 48 core environment. This is the thinnest benchmark we have today that uses the buffer pool, and it's currently limited by contention. (note 62% CPU usage) For reference https://www.techempower.com/benchmarks/#section=data-r13&hw=ph&test=json (CTRL+F aspnetcore-mvc-linux). The official results use different hardware so don't directly compare the numbers.
This is roughly a 20% improvement for us in this scenario.
+20.9% for single requests. And CPU maxed, quite effective 😁
This is good stuff!
I think we should have the cache of the current processor number on the unmanaged side as part of the C++ FCall implementation, but that can be done as a follow-up fix. I will look into doing it. It does not need to block getting this in. Agree with @vancem - good stuff!
Why do OSes use rdtscp instead of storing CPU numbers in TEBs?
Processors tend to have quite a few special registers that are managed by the OS kernel. These registers are often read-only to user mode, or not available in user mode at all. When you show up in kernel mode out of the blue, it is kind of useful to know where you came from and which processor you are running on. Once you burn one of these special registers on the processor id, it makes sense to use it for the processor id in user mode as well; no need to duplicate it in user-mode TLS. On ARM processors, these special registers are more regular, have numbers, and it is easier to wrap your head around them. For example, Windows chose to store the processor number in one of them. On Intel processors, these special registers are more colorful because they have more history behind them. For example, if you step into the disassembly of GetCurrentProcessorNumber on Windows x64, you can see which one it uses.
@benaadams @stephentoub based on the comment above, I wrote a simple benchmark to test out the effect of single/multiple TLS field access. However, there doesn't seem to be much difference; in fact the combined/single access is slightly faster? Maybe storing it in a local field explains the difference? (benchmark code)

BenchmarkDotNet=v0.10.0
OS=Microsoft Windows NT 6.1.7601 Service Pack 1
Processor=Intel(R) Core(TM) i7-4800MQ CPU 2.70GHz, ProcessorCount=8
Frequency=2630771 Hz, Resolution=380.1167 ns, Timer=TSC
Host Runtime=Clr 4.0.30319.42000, Arch=32-bit RELEASE
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0
Job Runtime(s):
Clr 4.0.30319.42000, Arch=32-bit RELEASE
@mattwarren interesting... I was thinking of something like benaadams@f2d6e0a
Improve ArrayPool implementation and performance (commit migrated from dotnet/coreclr@32e6bf8)
This commit improves the performance of ArrayPool, with several changes:
- A new implementation is provided in DefaultArrayPool. ArrayPool.Shared returns an instance of (the new) DefaultArrayPool, and the Create methods return an instance of ConfigurableArrayPool.
In my local testing on 64-bit, uncontended access is ~40% faster than before. Multiple threads each renting and returning furiously ends up being anywhere from the same throughput to ~2x faster, depending on access patterns. I've not tested on 32-bit, but I expect this won't do as well there, due to the reads/writes on Int64s: we may choose to return an instance of ConfigurableArrayPool instead of DefaultArrayPool from Shared/Create() on 32-bit.
@rynowak, would it be possible to test this out with your ASP.NET benchmarks you previously used to evaluate ArrayPool's impact?
cc: @jkotas, @kouvel, @JeremyKuhne, @redknightlois, @benaadams, @rynowak
Fixes https://github.com/dotnet/coreclr/issues/8203