Much higher performance with these changes #9

Status: Open. Wants to merge 2 commits into base: master.
84 changes: 55 additions & 29 deletions src/MPMCQueue.NET/MPMCQueue.cs
@@ -1,91 +1,117 @@
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Runtime.CompilerServices;

namespace MPMCQueue.NET
{
-[StructLayout(LayoutKind.Explicit, Size = 192, CharSet = CharSet.Ansi)]
+[StructLayout(LayoutKind.Explicit, Size = 336)]
Author commented:
typo, should be 384

public class MPMCQueue
{
-[FieldOffset(0)]
-private readonly Cell[] _buffer;
-[FieldOffset(8)]
-private readonly int _bufferMask;
-[FieldOffset(64)]
-private int _enqueuePos;
-[FieldOffset(128)]
-private int _dequeuePos;
+/// <summary>
+/// 128 bytes cache line already exists in some CPUs.
+/// </summary>
+/// <remarks>
+/// Also "the spatial prefetcher strives to keep pairs of cache lines in the L2 cache."
+/// https://stackoverflow.com/questions/29199779/false-sharing-and-128-byte-alignment-padding
+/// </remarks>
+internal const int SAFE_CACHE_LINE = 128;
+
+[FieldOffset(SAFE_CACHE_LINE)]
+private readonly Cell[] _enqueueBuffer;
+
+[FieldOffset(SAFE_CACHE_LINE + 8)]
+private volatile int _enqueuePos;
+
+// Separate access to buffers from enqueue and dequeue.
+// This removes false sharing, and accessing a buffer
+// reference also prefetches the following Pos with [(64 - (8 + 4 + 4)) = 52]/64 probability.
+
+[FieldOffset(SAFE_CACHE_LINE * 2)]
+private readonly Cell[] _dequeueBuffer;
+
+[FieldOffset(SAFE_CACHE_LINE * 2 + 8)]
+private volatile int _dequeuePos;
Reviewer commented:
Hello :)
I wonder if the observed performance improvement really comes from having two copies of the same read-only variable, or rather from the fact that the frequently read fields are no longer in the same cache line as the frequently modified fields.
I mean: I can see how constantly writing to a cache line can thrash it and cause false-sharing issues, but I fail to see how having multiple readers is a problem: the cache line should be in the (S)hared state of the MESI protocol, right?
I see that in the old code the read-only _bufferMask had offset 8 and the write-intensive _enqueuePos had offset 64, but does that really guarantee they are in different cache lines?
What if the MPMCQueue instance itself is not aligned on a cache line boundary?
Could it then be that _bufferMask is at the start of the cache line and _enqueuePos at its end?
And if Facebook's observation that we should care about 128-byte rather than 64-byte units is correct, then the same applies to _buffer and _enqueuePos (and to other pairs of "read-only" and "write-heavy" fields, as many of them are closer than 128 bytes to each other).
In the description of the PR you give one more probable reason why duplication helps: the data needed to enqueue is in one line, and the data needed to dequeue is in one (other) line, which creates nice "packages" of self-sufficient data. I can see how that can help: it requires one fetch instead of two when the data is missing from the cache. But how frequent is a cache miss in the tight loop of a benchmark?
I would appreciate someone educating me on the most probable explanation (ideally backed by an experiment) of why this really helps :)
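One way to probe the false-sharing hypothesis empirically (a hypothetical experiment sketched here, not part of this PR; all type and method names are illustrative) is to time two threads hammering counters that sit on the same cache line against counters padded 128 bytes apart:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Hypothetical layout: one pair of counters shares a 64-byte line,
// the other pair is kept 128 bytes apart.
[StructLayout(LayoutKind.Explicit, Size = 256)]
public struct Counters
{
    [FieldOffset(0)] public long Adjacent0;
    [FieldOffset(8)] public long Adjacent1;   // same 64-byte line as Adjacent0
    [FieldOffset(64)] public long Padded0;
    [FieldOffset(192)] public long Padded1;   // 128 bytes away from Padded0
}

public static class FalseSharingProbe
{
    const long Iterations = 10_000_000;
    static Counters _c;

    static long Time(Action a, Action b)
    {
        var sw = Stopwatch.StartNew();
        Task.WaitAll(Task.Run(a), Task.Run(b));
        return sw.ElapsedMilliseconds;
    }

    public static (long sharedMs, long paddedMs) Probe()
    {
        long shared = Time(
            () => { for (long i = 0; i < Iterations; i++) _c.Adjacent0++; },
            () => { for (long i = 0; i < Iterations; i++) _c.Adjacent1++; });
        long padded = Time(
            () => { for (long i = 0; i < Iterations; i++) _c.Padded0++; },
            () => { for (long i = 0; i < Iterations; i++) _c.Padded1++; });
        return (shared, padded);
    }

    public static void Main()
    {
        var (shared, padded) = Probe();
        Console.WriteLine($"shared line: {shared} ms, padded: {padded} ms");
    }
}
```

If false sharing is the dominant effect, the shared-line variant should be noticeably slower on typical multi-core hardware; if the times are close, the prefetch/"self-sufficient package" explanation becomes more plausible.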

Author replied:
> Could it then be that _bufferMask is at the start of the cache line and _enqueuePos at its end?

With probability 8/64, yes. .NET aligns objects to 8 bytes, not to the cache line. And yes, if anything on a cache line is modified, all other threads need to reload that line, so modifying _enqueuePos could interfere with Dequeue reading _bufferMask with probability 8/64.
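The 8/64 figure can be sanity-checked by enumerating the possible 8-byte-aligned object starts within a 64-byte line (a small sketch assuming 64-byte cache lines and the original field offsets of 8 and 64; the class name is illustrative):

```csharp
using System;

public static class AlignmentCheck
{
    // Counts, over all 8-byte-aligned start positions within a 64-byte line,
    // how often fields at offsets 8 and 64 land on the same cache line.
    public static (int collisions, int starts) Count()
    {
        const int lineSize = 64;
        int collisions = 0, starts = 0;
        // .NET only guarantees 8-byte alignment, so the object can begin at
        // any of the 8 positions within a 64-byte line.
        for (int start = 0; start < lineSize; start += 8)
        {
            starts++;
            // Original layout: _bufferMask at offset 8, _enqueuePos at offset 64.
            if ((start + 8) / lineSize == (start + 64) / lineSize)
                collisions++;
        }
        return (collisions, starts);
    }

    public static void Main()
    {
        var (collisions, starts) = Count();
        // Only the start position 56 puts both fields on one line: 1/8 = 8/64.
        Console.WriteLine($"{collisions}/{starts} starts put both fields on one line");
    }
}
```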


public MPMCQueue(int bufferSize)
{
if (bufferSize < 2) throw new ArgumentException($"{nameof(bufferSize)} should be greater than or equal to 2");
if ((bufferSize & (bufferSize - 1)) != 0) throw new ArgumentException($"{nameof(bufferSize)} should be a power of 2");

-_bufferMask = bufferSize - 1;
-_buffer = new Cell[bufferSize];
+_enqueueBuffer = new Cell[bufferSize];

for (var i = 0; i < bufferSize; i++)
{
-_buffer[i] = new Cell(i, null);
+_enqueueBuffer[i] = new Cell(i, null);
}

+_dequeueBuffer = _enqueueBuffer;
_enqueuePos = 0;
_dequeuePos = 0;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
@omariom commented (Mar 6, 2020):
This is a significant change by itself.

Is the perf the same without AI?

Author replied:
Without AI it is slightly slower, of course. But the main point of the changes is multi-threaded scalability: spinning and two references to the same buffer. Not sure about the buffer bounds check with & buffer.Length - 1: is it eliminated now? I am still too lazy to learn assembly.

Reviewer replied:
The check is still there, but the JIT reuses the load of Length for both the check and buffer.Length - 1.

@buybackoff (author) replied (Mar 6, 2020):
Is there any trick to avoid the bounds check here, to leverage the power of 2 somehow? I searched through the coreclr issues but couldn't find any. Also, do you know if there is a single place listing all the current optimizations for bounds-check elimination?

@omariom replied (Mar 6, 2020):
I don't think it can avoid the bounds check.
With int index = pos & (buffer.Length - 1), pos can be anything and buffer.Length can be zero.
The bounds check is always predicted correctly and is not the only branch here, so it's unlikely to change much.
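For reference, the masking idiom under discussion looks like this in isolation (a minimal sketch; RingIndex and Get are illustrative names, not from the PR):

```csharp
using System;

public static class RingIndex
{
    // With a power-of-two buffer, 'pos & (Length - 1)' wraps the position
    // without a modulo. The JIT still emits a bounds check: as far as the
    // type system knows, Length could be 0, which makes the mask -1 and
    // leaves the index equal to pos.
    public static object Get(object[] buffer, int pos)
    {
        int index = pos & (buffer.Length - 1);
        return buffer[index];
    }

    public static void Main()
    {
        var buf = new object[8];
        buf[5] = "hello";
        // 13 & 7 == 5, so position 13 wraps around the 8-slot ring.
        Console.WriteLine(Get(buf, 13));  // → hello
    }
}
```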

Owner commented:
I wouldn't put AggressiveInlining on public methods, as we don't control the callers' code here. That could have side effects: the caller's code gets bigger and won't be inlined itself, other methods in the caller's code won't be inlined, some optimizations in the caller can be missed, etc.
Yes, it's obviously faster in benchmarks, but I wouldn't add it anyway.

Author replied:
This is a bad habit of mine in general. But for low-level building blocks that are intended to be reused, it's often better to have it: you can always wrap the method in another one if you do not want inlining or prefer to let the JIT decide. Usually the caller decides this, not the callee (relevant discussion: dotnet/corefxlab#2592 (comment)). Here I never compared performance to the original implementation in absolute numbers, only the multithreaded scalability of the same method, which doesn't depend on the attribute.
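The wrapping the author alludes to might look like this (an illustrative sketch; the class and method names are made up, not from the PR):

```csharp
using System.Runtime.CompilerServices;

public class InliningWrapperSketch
{
    // Low-level building block: the callee asks the JIT for inlining.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool TryEnqueueInlined(object item)
    {
        // Placeholder body standing in for the real enqueue logic.
        return item != null;
    }

    // A caller that does not want the attribute's effect at its call sites
    // can hide it behind a plain wrapper (or force NoInlining) and let the
    // JIT make its usual size-based decision one level up.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public bool TryEnqueueOutlined(object item) => TryEnqueueInlined(item);
}
```

This keeps the decision with the caller, which is the author's point: the callee exposes the fast path, and each call site chooses whether to pay the code-size cost.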

public bool TryEnqueue(object item)
{
var spinner = new SpinWait();
do
{
-var buffer = _buffer;
+var buffer = _enqueueBuffer;
var pos = _enqueuePos;
-var index = pos & _bufferMask;
-var cell = buffer[index];
+var index = pos & (buffer.Length - 1);
+ref var cell = ref buffer[index];
if (cell.Sequence == pos && Interlocked.CompareExchange(ref _enqueuePos, pos + 1, pos) == pos)
{
-buffer[index].Element = item;
-Volatile.Write(ref buffer[index].Sequence, pos + 1);
+cell.Element = item;
+cell.Sequence = pos + 1;
Reviewer commented:
Volatile.Write is now almost identical to direct reads/writes of volatile fields.

On line 62, cell.Sequence is now a volatile read, which is fairly slow on ARM.
I don't know if RyuJIT has changed its implementation to use ARMv8 half-barriers, but those are not available on ARMv7 anyway.

Author replied:
I am still only planning to buy a Raspberry Pi for testing :) Yes, good point. It's next to an Interlocked call anyway. I changed that mostly from an aesthetics perspective.
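The two styles under discussion both publish the sequence with release semantics; a side-by-side sketch (the class, field, and method names are illustrative, not the PR's code):

```csharp
using System.Threading;

public class SequenceCell
{
    public int SequencePlain;
    public volatile int SequenceVolatile;

    public void StoreOldStyle(int pos)
    {
        // Old style: an ordinary field plus an explicit release store.
        // Reads of SequencePlain elsewhere carry no ordering by themselves.
        Volatile.Write(ref SequencePlain, pos + 1);
    }

    public void StoreNewStyle(int pos)
    {
        // New style: writing a volatile field is itself a release store,
        // but every read of the field now also has acquire semantics,
        // which is essentially free on x86 and costlier on ARM.
        SequenceVolatile = pos + 1;
    }
}
```

This is the trade-off the reviewer points at: marking the field volatile makes the reads (not just the writes) ordered, which matters on weakly ordered architectures like ARMv7.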

return true;
}

if (cell.Sequence < pos)
{
return false;
}

spinner.SpinOnce();
} while (true);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool TryDequeue(out object result)
{
result = null;
Reviewer commented:
result is a ref; each access dereferences it.
That means you write to it twice on success.

Author replied:
Leftover from my API changes, where I returned object? with a local object result = null;.
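The double write through the out reference can be avoided by buffering in a local and assigning result once (a sketch of the general pattern, not the PR's final code; OutParamDemo and TryTake are illustrative names):

```csharp
using System;

public static class OutParamDemo
{
    // Writing to an 'out' parameter goes through a reference, so each
    // assignment is a store through that reference. Buffering in a local
    // and assigning once at the end keeps the hot path register-friendly.
    public static bool TryTake(object[] slots, int index, out object result)
    {
        object local = null;
        if ((uint)index < (uint)slots.Length && slots[index] != null)
        {
            local = slots[index];   // work with the local inside the loop/branch
            slots[index] = null;
        }
        result = local;             // single write through the reference
        return local != null;
    }

    public static void Main()
    {
        var slots = new object[] { null, "x" };
        Console.WriteLine(TryTake(slots, 1, out var item) + " " + item);  // → True x
    }
}
```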

var spinner = new SpinWait();
do
{
-var buffer = _buffer;
-var bufferMask = _bufferMask;
+var buffer = _dequeueBuffer;
var pos = _dequeuePos;
-var index = pos & bufferMask;
-var cell = buffer[index];
+var index = pos & (buffer.Length - 1);
+ref var cell = ref buffer[index];
if (cell.Sequence == pos + 1 && Interlocked.CompareExchange(ref _dequeuePos, pos + 1, pos) == pos)
{
result = cell.Element;
-buffer[index].Element = null;
-Volatile.Write(ref buffer[index].Sequence, pos + bufferMask + 1);
-return true;
+cell.Element = null;
+cell.Sequence = pos + buffer.Length;
+break;
}

if (cell.Sequence < pos + 1)
{
-result = default(object);
-return false;
+break;
}

spinner.SpinOnce();
} while (true);

+return result != null;
}

-[StructLayout(LayoutKind.Explicit, Size = 16, CharSet = CharSet.Ansi)]
+[StructLayout(LayoutKind.Explicit, Size = 16)]
private struct Cell
{
[FieldOffset(0)]
-public int Sequence;
+public volatile int Sequence;

[FieldOffset(8)]
public object Element;

@@ -96,4 +122,4 @@ public Cell(int sequence, object element)
}
}
}
}
}