(See draft Sse2.Pause() here: [https://github.com/zpodlovics/pauseintrinsics/blob/main/ApiDraft.md]) (Inspired by https://github.com/giltene/GilExamples/tree/master/SpinWaitTest)
A set of simple thread-to-thread communication latency and throughput tests that measure and report on thread-to-thread ping-pong latencies when spinning on a shared volatile field, along with the impact of using a Sse2.Pause() call on that latency behavior.
This test can be used to measure and document the impact of Sse2.Pause() behavior on thread-to-thread communication latencies. E.g. when the two threads are pinned to the two hardware threads of a shared x86 core (with a shared L1), this test will demonstrate an estimate of the best-case thread-to-thread latencies possible on the platform, if the latency of measuring time with Stopwatch.GetTimestamp() is discounted (the Stopwatch.GetTimestamp() latency can be separately estimated across the percentile spectrum using the PauseIntrinsics.GetTimestamp.Benchmark.Cli test in this project).
Example .NET results plot (two threads on a shared core on a Xeon E5-2660v1 with SMT disabled and with all Spectre / Meltdown / related mitigations enabled by default):
Example .NET results plot (two threads on a shared core on a Kaveri 7850K with all Spectre / Meltdown / related mitigations enabled by default):
A non-official, non-validated, non-compatible proof of concept .NET SDK will be available in source form (spec and patch file for the CentOS8 dotnet5 package) for benchmarking the pause intrinsics. WARNING: due to the fixed public API surface, an existing API call (Sse2.MemoryFence()) will emit a PAUSE instruction instead of the MFENCE instruction.
This test is obviously intended to be run on machines with 2 or more vcores (tests on single vcore machines will produce understandably outrageously long runtimes).
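As a pre-flight check for the two-vcore requirement above, a small guard can be placed in front of a run. This is a minimal sketch and not part of the project: the `require_cpus` helper name is hypothetical, and it assumes `nproc` from GNU coreutils, which reports the CPUs available to the current process and therefore also reflects a restrictive `taskset` mask.

```shell
# Hypothetical helper (not part of PauseIntrinsics): succeed only if at
# least $1 CPUs are usable by the current process.
# $2 optionally overrides the detected count, which is handy for dry runs.
require_cpus() {
    min=$1
    have=${2:-$(nproc)}
    if [ "$have" -lt "$min" ]; then
        echo "need at least $min CPUs, found $have" >&2
        return 1
    fi
    echo "ok: $have CPUs available"
}
```

For example, `require_cpus 2 && ./artifacts/PauseIntrinsics.SpinWait.Benchmark.Cli/PauseIntrinsics.SpinWait.Benchmark.Cli` would only start the benchmark when two or more CPUs are usable.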
(If needed) Prepare the PauseIntrinsics.sln by running (.NET SDK 5.x required):
% ./publish.sh
The simplest way to run the SpinWait benchmark is:
% ./artifacts/PauseIntrinsics.SpinWait.Benchmark.Cli/PauseIntrinsics.SpinWait.Benchmark.Cli
The simplest way to run the MemoryFence / Pause benchmark (assuming HAVE_PAUSE_INTRINSICS is defined for build.sh) is:
% ./artifacts/PauseIntrinsics.Pause.Benchmark.Cli/PauseIntrinsics.Pause.Benchmark.Cli
The simplest way to run the BenchmarkDotNet benchmark for the various waiting methods is (.NET SDK 5.x prototype required for pause intrinsics):
% ./artifacts/PauseIntrinsics.BenchmarkDotnet.Cli/PauseIntrinsics.BenchmarkDotnet.Cli -f "*" -m -d
Output:
BenchmarkDotNet=v0.12.1, OS=centos 8
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G, 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.102
[Host] : .NET Core 5.0 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.62901, CoreFX 5.0.220.62901), X64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size |
|------------- |----------:|----------:|----------:|------:|------:|------:|----------:|----------:|
| BusySpin | 1.311 ns | 0.0542 ns | 0.0507 ns | - | - | - | - | 6 B |
| GetTimestamp | 50.397 ns | 1.0498 ns | 1.2893 ns | - | - | - | - | 33 B |
| SpinWait1 | 42.325 ns | 0.7664 ns | 0.7169 ns | - | - | - | - | 10 B |
| MemoryFence | 1.177 ns | 0.0518 ns | 0.0484 ns | - | - | - | - | 3 B |
Assembly:
## .NET Core 5.0.2 (CoreCLR 5.0.220.62901, CoreFX 5.0.220.62901), X64 RyuJIT
; PauseIntrinsics.BenchmarkDotNet.Cli.Benchmark.MemoryFence()
pause
ret
; Total bytes of code 3
Since the test is intended to highlight the benefits of an intrinsic Sse2.Pause, using a prototype .NET that intrinsifies Sse2.Pause() as a PAUSE instruction, you can compare the output of:
% ./artifacts/PauseIntrinsics.Pause.Benchmark.Cli/PauseIntrinsics.Pause.Benchmark.Cli > Pause.hgrm
and
% ./artifacts/PauseIntrinsics.SpinWait.Benchmark.Cli/PauseIntrinsics.SpinWait.Benchmark.Cli > SpinWait.hgrm
by plotting them both with [HdrHistogram's online percentile plotter](http://hdrhistogram.github.io/HdrHistogram/plotFiles.html).
On modern x86-64 sockets, comparisons seem to show an 18-20nsec difference in the round trip latency.
For consistent measurement, it is recommended that this test be executed while binding the process to specific cores. E.g. on a Linux system, the following command can be used:
% taskset -c 0,1 ./artifacts/PauseIntrinsics.Pause.Benchmark.Cli/PauseIntrinsics.Pause.Benchmark.Cli > Pause.hgrm
This places the spinning threads on the same core. (The choice of cores 0 and 1 is specific to a 32 vcore system where cores 0 and 1 represent two hyper-threads on a common core. You will want to identify a matching pair on your specific system; the lstopo tool from the hwloc package can be used to identify matching pairs.) You can also improve the measurement by running it at high(er) priority. Note that `nice -20` would lower the priority (it adds 20 to the niceness); raising it needs a negative niceness, which usually requires root privileges, e.g.:
% nice -n -20 taskset -c 0,1 ./artifacts/PauseIntrinsics.Pause.Benchmark.Cli/PauseIntrinsics.Pause.Benchmark.Cli > Pause.hgrm
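The matching pair of hardware threads mentioned above can also be read from sysfs on Linux. The following is a minimal sketch: the `smt_siblings` helper name is hypothetical, and it assumes the kernel exposes `topology/thread_siblings_list` under `/sys/devices/system/cpu`, which may not hold inside some containers or virtual machines.

```shell
# Hypothetical helper (not part of PauseIntrinsics): print the hardware
# threads that share a core with CPU $1, e.g. "0,16" or "0-1".
# $2 optionally overrides the sysfs root (assumed default below).
smt_siblings() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    file="$root/cpu$cpu/topology/thread_siblings_list"
    if [ ! -r "$file" ]; then
        echo "no topology information for cpu$cpu" >&2
        return 1
    fi
    cat "$file"
}
```

Since `taskset -c` accepts both comma lists and ranges, the result can be passed straight through, e.g. `taskset -c "$(smt_siblings 0)" ./artifacts/PauseIntrinsics.Pause.Benchmark.Cli/PauseIntrinsics.Pause.Benchmark.Cli > Pause.hgrm`. If the output is a single CPU number, SMT is off or unavailable for that core.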
PauseIntrinsics outputs a percentile histogram distribution in HdrHistogram's common .hgrm format. This output can/should be redirected to an .hgrm file (e.g. SpinWait.hgrm), which can then be directly plotted using tools like [HdrHistogram's online percentile plotter](http://hdrhistogram.github.io/HdrHistogram/plotFiles.html).
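When a quick number is more convenient than a plot, a single percentile can be pulled out of an .hgrm file on the command line. This is a rough sketch: the `hgrm_value_at` helper name is hypothetical, and it assumes the usual HdrHistogram percentile layout, where each data row starts with a value column followed by a percentile column expressed as a fraction between 0 and 1, and summary lines start with `#`.

```shell
# Hypothetical helper (not part of PauseIntrinsics): print the value of the
# first data row in .hgrm file $1 whose percentile is >= $2.
# $2 is a fraction in [0,1], e.g. 0.99 for the 99th percentile.
hgrm_value_at() {
    file=$1
    pct=$2
    awk -v p="$pct" '
        /^#/ { next }                      # skip summary lines
        $1 ~ /^[0-9.]+$/ && $2 ~ /^[0-9.]+$/ && ($2 + 0) >= (p + 0) {
            print $1; found = 1; exit      # first row at or above p
        }
        END { if (!found) exit 1 }         # no matching row: fail
    ' "$file"
}
```

For example, `hgrm_value_at Pause.hgrm 0.5` and `hgrm_value_at SpinWait.hgrm 0.5` would print the two medians for a side-by-side comparison. The values are in whatever unit the benchmark recorded.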
A prototype .NET implementation that implements Sse2.Pause as a PAUSE instruction on x86-64 is available.
The relevant repository can be found here:
Please note: These full implementations are included for x86. Implementations on other platforms may choose to use the same instructions as Linux cpu_relax and / or plasma_spin.
This package includes some additional tests that can be used to explore the impact of Sse2.Pause() behavior:
To run the ping-pong latency test with busy wait:
% ./artifacts/PauseIntrinsics.BusyWait.Benchmark.Cli/PauseIntrinsics.BusyWait.Benchmark.Cli
To run the pure busy-wait ping-pong throughput test, with no latency measurement overhead:
% ./artifacts/PauseIntrinsics.BusyWait.Throughput.Benchmark.Cli/PauseIntrinsics.BusyWait.Throughput.Benchmark.Cli
To run the pure spin-wait ping-pong throughput test, with no latency measurement overhead:
% ./artifacts/PauseIntrinsics.SpinWait.Throughput.Benchmark.Cli/PauseIntrinsics.SpinWait.Throughput.Benchmark.Cli
To run the pure pause-wait ping-pong throughput test, with no latency measurement overhead:
% ./artifacts/PauseIntrinsics.Pause.Throughput.Benchmark.Cli/PauseIntrinsics.Pause.Throughput.Benchmark.Cli
To document the latency of measuring time with Stopwatch.GetTimestamp() (so that it can be discounted when observing ping-pong latencies in the latency measuring tests):
% ./artifacts/PauseIntrinsics.GetTimestamp.Benchmark.Cli/PauseIntrinsics.GetTimestamp.Benchmark.Cli
[1] [https://github.com/giltene/GilExamples/tree/master/SpinWaitTest]