Introduce pause intrinsics in order to support spin wait loop indication #53532

Closed
zpodlovics opened this issue Jun 1, 2021 · 10 comments · Fixed by #61065
Labels: api-approved (API was approved in API review, it can be implemented), area-System.Runtime.Intrinsics

@zpodlovics

zpodlovics commented Jun 1, 2021

Background and Motivation

Some hardware platforms may benefit greatly from a software indication that a spin-wait loop is in progress.

Some common execution benefits may be observed:

  1. The reaction time of a spin-wait loop construct may be improved when a spin-wait indication is used, due to various factors, reducing thread-to-thread latencies in spinning wait situations.

  2. The power consumed by the core or hardware thread involved in the spin wait loop construct may be reduced, benefitting overall power consumption of a program, and possibly allowing other cores or hardware threads to execute at faster speeds within the same power consumption envelope.

As a practical example and use case, current x86 processors support a PAUSE instruction that can be used to indicate spinning behavior. Using a PAUSE instruction demonstrably reduces thread-to-thread round-trip latency. Due to its benefits and commonly recommended use, the x86 PAUSE instruction is widely used in kernel spin locks, in POSIX libraries that perform heuristic spins prior to blocking, and even by the .NET runtime itself (_mm_pause). However, because there is no way to indicate that a .NET loop is spinning, its benefits are not available to regular .NET code.

In the prototype, round-trip latencies were demonstrably reduced by ~29-69 ns across a wide percentile spectrum (from the 10th to the 99.9th percentile). This reduction can represent an improvement as high as ~30%-50% in best-case thread-to-thread communication latency.

Please note that, like that of any other instruction, the latency of the PAUSE instruction varies with the processor architecture [8][9]:

  • Intel® (Core i5-5200U) processor on Broadwell architecture: 9 cycles;
  • Intel® (Core i7-6500U) processor on Skylake architecture: 140 cycles;
  • Intel® processor on Cascade Lake architecture: 40 cycles;
  • AMD® (Ryzen 7 3700X) processor on Zen2 architecture: 65 cycles.

Following @jkotas's suggestion, a Yield intrinsic will also be provided on the ARM architecture at the same time, for parity.

Proposed API changes

(using the alphanumerical order of file names):

     namespace System.Runtime.Intrinsics.Arm
     {
         public abstract class ArmBase
         {
             public static uint ReverseElementBits(uint value);
    +        public static void Yield();
         }
     }
     namespace System.Runtime.Intrinsics.X86
     {
         public abstract partial class X86Base
         {
             public static unsafe (int Eax, int Ebx, int Ecx, int Edx) CpuId(int functionId, int subFunctionId);
    +        public static void Pause();
         }
     }
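
To illustrate how the proposed methods might be called from a spin loop, here is a minimal sketch. It assumes a runtime that ships X86Base.Pause() and ArmBase.Yield() as proposed above. The SpinHint class and its Relax and WaitUntilSet methods are hypothetical names used only for this illustration, and the Thread.SpinWait(1) fallback is an assumption, not part of the proposal.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;
using System.Threading;

// Hypothetical helper, not part of the proposed API surface.
public static class SpinHint
{
    // Emits the platform's spin-wait hint when one is available.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Relax()
    {
        if (X86Base.IsSupported)
        {
            X86Base.Pause();      // PAUSE on x86/x64
        }
        else if (ArmBase.IsSupported)
        {
            ArmBase.Yield();      // YIELD on ARM/ARM64
        }
        else
        {
            Thread.SpinWait(1);   // assumed architecture-neutral fallback
        }
    }

    // Spins until another thread stores a non-zero value into flag.
    public static void WaitUntilSet(ref int flag)
    {
        while (Volatile.Read(ref flag) == 0)
        {
            Relax();
        }
    }
}
```

A caller would spin with SpinHint.WaitUntilSet(ref someFlag) while another thread publishes the value with Volatile.Write. The IsSupported checks are what keep the same code usable on both architectures.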

Usage Examples

Efficient thread-to-thread communication is needed to implement highly performant (and often latency-sensitive) concurrent data structures and communication patterns. A simple thread-to-thread communication latency and throughput test measures and reports thread-to-thread ping-pong latencies when spinning on a shared volatile field, along with the impact of the Stopwatch/BusySpin/SpinWait/Pause call on that latency behavior.

The test can be used to measure and document the impact of Sse2.Pause() on thread-to-thread communication latencies. For example, when the two threads are pinned to the two hardware threads of a shared x86 core (with a shared L1), the test provides an estimate of the best-case thread-to-thread latencies possible on the platform, if the latency of measuring time with Stopwatch.GetTimestamp() is discounted (GetTimestamp latency can be estimated separately across the percentile spectrum using the PauseIntrinsics.GetTimestamp.Benchmark.Cli test in the PauseIntrinsics project).

The thread-to-thread communication benchmark project (which measures the latency of the timestamp, busyspin, spinwait, and pause wait methods) is available at: https://github.com/zpodlovics/PauseIntrinsics
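
The ping-pong measurement described above can be sketched roughly as follows. This is not the code from the PauseIntrinsics project: the PingPongSketch class and its member names are hypothetical, thread pinning is omitted, and only a mean (not a percentile distribution) is reported. Thread.SpinWait(1) stands in for whichever wait method is under test.

```csharp
using System.Diagnostics;
using System.Threading;

// Hypothetical illustration of a two-thread ping-pong latency test.
public static class PingPongSketch
{
    private static int s_ping;   // written by the measuring thread
    private static int s_pong;   // written by the responder thread

    // Returns the mean round-trip time in nanoseconds over the given number of rounds.
    public static double MeasureMeanRoundTripNs(int rounds)
    {
        Volatile.Write(ref s_ping, 0);
        Volatile.Write(ref s_pong, 0);

        var responder = new Thread(() =>
        {
            for (int i = 1; i <= rounds; i++)
            {
                // Wait for ping number i, then answer with pong number i.
                while (Volatile.Read(ref s_ping) != i) Thread.SpinWait(1);
                Volatile.Write(ref s_pong, i);
            }
        });
        responder.Start();

        long start = Stopwatch.GetTimestamp();
        for (int i = 1; i <= rounds; i++)
        {
            // Send ping number i, then spin until the matching pong arrives.
            Volatile.Write(ref s_ping, i);
            while (Volatile.Read(ref s_pong) != i) Thread.SpinWait(1);
        }
        long ticks = Stopwatch.GetTimestamp() - start;
        responder.Join();

        // Convert Stopwatch ticks to nanoseconds and average over all rounds.
        return ticks * (1_000_000_000.0 / Stopwatch.Frequency) / rounds;
    }
}
```

Replacing the Thread.SpinWait(1) calls with an empty loop body or with a spin-hint intrinsic gives the other variants to compare. Because the timestamp is taken once around the whole run, the cost of Stopwatch.GetTimestamp() is amortized rather than measured per round trip.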

The results below come from a non-official, non-validated, non-compatible proof-of-concept .NET SDK (built from a modified CentOS 8 dotnet5 package). WARNING: because the public API surface is fixed, in this build an existing API call (Sse2.MemoryFence()) emits a PAUSE instruction instead of the MFENCE instruction.

Example .NET results plot (two threads on a shared core on a Xeon E5-2660v1 with SMT disabled and using all spectre / meltdown / related mitigations enabled by default):

[Image: https://raw.github.com/zpodlovics/pauseintrinsics/main/measurements/Kaveri_Latency.png]

Example .NET results plot (two threads on a shared core on a Kaveri 7850K using all spectre / meltdown / related mitigations enabled by default):

[Image: https://raw.github.com/zpodlovics/pauseintrinsics/main/measurements/SandyBridge_Latency.png]

BenchmarkDotNet=v0.12.1, OS=centos 8
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G, 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 5.0 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
  DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.62901, CoreFX 5.0.220.62901), X64 RyuJIT


|       Method |      Mean |     Error |    StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | Code Size |
|------------- |----------:|----------:|----------:|------:|------:|------:|----------:|----------:|
|     BusySpin |  1.311 ns | 0.0542 ns | 0.0507 ns |     - |     - |     - |         - |       6 B |
| GetTimestamp | 50.397 ns | 1.0498 ns | 1.2893 ns |     - |     - |     - |         - |      33 B |
|    SpinWait1 | 42.325 ns | 0.7664 ns | 0.7169 ns |     - |     - |     - |         - |      10 B |
|  MemoryFence |  1.177 ns | 0.0518 ns | 0.0484 ns |     - |     - |     - |         - |       3 B |

Alternative Designs

DllImport or DllImport alternatives (e.g. function pointers) can be used to spin with a spin-loop-indicating CPU instruction, but the overhead of crossing the DllImport (or alternative) boundary tends to be larger than the benefit provided by the instruction.

.NET pattern matching could attempt to have the JIT compilers deduce spin-wait-loop situations and automatically emit a spin-loop-indicating CPU instruction, with no indication required in .NET code. I would expect the complexity of automatically and reliably detecting spinning situations, coupled with questions about the potential tradeoffs of using the indication on some platforms, to delay the availability of viable implementations significantly.

Risks

An intrinsic x86 implementation will involve modifications to multiple .NET components and will expose a new Pause intrinsic API (Sse2.Pause in the prototype); as such it carries some risk, but no more than other simple intrinsics added to .NET.

Some processor architectures may have a significantly different latency profile for the PAUSE instruction (e.g. Intel® Xeon® Scalable processor on Skylake architecture: 140 cycles). However, this is also true of every other available intrinsic and should not prevent its use, and the latency seems to have improved greatly since then (e.g. 2nd generation Intel® Xeon® Scalable processor based on Cascade Lake architecture: 40 cycles).
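
Because the cost of a spin hint differs between microarchitectures, code that depends on it may prefer to measure that cost at run time rather than assume a cycle count. The following is a minimal sketch of that idea. SpinCostSketch, EstimateNsPerHint and SpinsForBudget are hypothetical names, and the sketch assumes X86Base.Pause() is available as proposed, falling back to Thread.SpinWait(1) elsewhere.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.Intrinsics.X86;
using System.Threading;

// Hypothetical illustration: estimate the cost of one spin hint on this machine.
public static class SpinCostSketch
{
    // Returns the estimated mean cost of one hint in nanoseconds.
    public static double EstimateNsPerHint(int iterations)
    {
        // Warm-up pass so the first-call JIT cost is not measured.
        Hint(1_000);

        long start = Stopwatch.GetTimestamp();
        Hint(iterations);
        long ticks = Stopwatch.GetTimestamp() - start;

        return ticks * (1_000_000_000.0 / Stopwatch.Frequency) / iterations;
    }

    // Converts a time budget into a hint count, clamped to the range [1, int.MaxValue].
    public static int SpinsForBudget(double budgetNs, double nsPerHint)
    {
        if (nsPerHint <= 0) return 1;
        return Math.Max(1, (int)Math.Min(budgetNs / nsPerHint, int.MaxValue));
    }

    private static void Hint(int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (X86Base.IsSupported) X86Base.Pause();
            else Thread.SpinWait(1);   // assumed fallback off x86
        }
    }
}
```

With a measured cost of 50 ns per hint, for example, SpinsForBudget(1000, 50) yields 20 hints for a 1 microsecond spin budget. The loop overhead is included in the estimate, so the figure is an upper bound on the hint itself.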

References

[1] LMAX Disruptor .NET implementation
[2] Pause intrinsics latency and throughput benchmarks (C#)
[3] Pause intrinsics latency and throughput benchmarks (Java): https://github.com/giltene/GilExamples/tree/master/SpinWaitTest
[4] Chart depicting Java onSpinWait() intrinsification impact
[5] .NET prototype Sse2.Pause intrinsics implementation branch: https://github.com/zpodlovics/runtime/tree/sse2pause
[6] Implementations on other platforms (other than x86) may choose to use the same instructions as linux cpu_relax and/or plasma_spin
[7] https://software.intel.com/content/www/us/en/develop/articles/benefitting-power-and-performance-sleep-loops.html
[8] Andreas Abel: Automatic Generation of Models of Microarchitectures
[9] https://uops.info/table.html?search=PAUSE&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SNB=on&cb_IVB=on&cb_HSW=on&cb_BDW=on&cb_SKL=on&cb_SKX=on&cb_KBL=on&cb_CFL=on&cb_CNL=on&cb_CLX=on&cb_ICL=on&cb_ZENp=on&cb_ZEN2=on&cb_measurements=on&cb_iaca30=on&cb_doc=on&cb_base=on&cb_sse=on&cb_others=on

@zpodlovics zpodlovics added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jun 1, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels Jun 1, 2021
@ghost

ghost commented Jun 1, 2021

Tagging subscribers to this area: @tannergooding
See info in area-owners.md if you want to be subscribed.

Issue Details

Proposed API

(using the alphanumerical order of method names):

public abstract partial class Sse2 : System.Runtime.Intrinsics.X86.Sse
         public static System.Runtime.Intrinsics.Vector128<sbyte> PackSignedSaturate(System.Runtime.Intrinsics.Vector128<short> left, System.Runtime.Intrinsics.Vector128<short> right) { throw null; }
         public static System.Runtime.Intrinsics.Vector128<short> PackSignedSaturate(System.Runtime.Intrinsics.Vector128<int> left, System.Runtime.Intrinsics.Vector128<int> right) { throw null; }
         public static System.Runtime.Intrinsics.Vector128<byte> PackUnsignedSaturate(System.Runtime.Intrinsics.Vector128<short> left, System.Runtime.Intrinsics.Vector128<short> right) { throw null; }
+        public static void Pause() { }
         public static System.Runtime.Intrinsics.Vector128<short> ShiftLeftLogical(System.Runtime.Intrinsics.Vector128<short> value, byte count) { throw null; }
         public static System.Runtime.Intrinsics.Vector128<short> ShiftLeftLogical(System.Runtime.Intrinsics.Vector128<short> value, System.Runtime.Intrinsics.Vector128<short> count) { throw null; }
         public static System.Runtime.Intrinsics.Vector128<int> ShiftLeftLogical(System.Runtime.Intrinsics.Vector128<int> value, byte count) { throw null; }

Author: zpodlovics
Assignees: -
Labels: api-suggestion, area-System.Runtime.Intrinsics, untriaged

Milestone: -

@tannergooding
Member

This was briefly discussed back when intrinsics were first implemented/exposed, and the recommendation at the time was that we have more data/discussion on exposing PAUSE, because it is very difficult to use correctly: #10260 (comment)

Intrinsics are low-level and unsafe, but pause itself is much more difficult to use than the counterparts we've exposed and may require micro-architecture-specific tuning and profiling. That being said, I don't have any particular reason to block this scenario if we feel the data above is sufficient.

It would be good to have @jkotas and @stephentoub weigh in on whether we feel this is useful enough to expose for power users building customized threading or other synchronization primitives.

@jkotas
Member

jkotas commented Jun 1, 2021

We have been introducing hw intrinsics for many corner cases to support all sorts of micro-optimizations (that I wish would be just handled by the JIT transparently instead). I do not see a fundamental problem with adding an intrinsic for the pause instruction.

> class Sse2

Should it be X86Base instead? I believe that pause works everywhere, even before SSE2.

Also, we should add yield for ARM64 at the same time for parity.

@zpodlovics
Author

According to Wikipedia [1], the PAUSE instruction itself was added as part of SSE2. However, it seems to be constructed in a really clever way, so that it is a no-op (rep; nop) on older architectures. This is why I added it to Sse2 instead of X86Base.

[1] https://en.wikipedia.org/wiki/MOV_(x86_instruction)#Added_with_SSE2

@tannergooding
Member

> Should it be X86Base instead? I believe that pause works everywhere, even before SSE2.

This one is strictly documented as being SSE2 in the architecture manuals. In practice it's everywhere, much like CPUID, but since we already have the Sse2 class, I think it should just go there.

> Also, we should add yield for ARM64 at the same time for parity.

Agreed.

However, that raises the question: should these be architecture-dependent intrinsics, or something like a new Thread.Pause method that is intrinsic on x86/x64 and ARM/ARM64 and falls back to something reasonable (like Wait(1)) on platforms without intrinsic support?

@jkotas
Member

jkotas commented Jun 1, 2021

> This one is strictly documented as being SSE2 in the architecture manuals.

My copy of the manual says "This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors.". I do not see any mention of SSE2 in the instruction manual page that describes this instruction.

> Should this be architecture dependent intrinsics

We have the architecture neutral version of this already: Thread.SpinWait. And also struct SpinWait.
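
For reference, the two existing architecture-neutral options mentioned here can be used roughly as in the sketch below. Thread.SpinWait and the SpinWait struct are existing System.Threading APIs; the NeutralSpinSketch class and its method names are hypothetical and only illustrate the call pattern.

```csharp
using System.Threading;

// Hypothetical illustration of the existing architecture-neutral spin APIs.
public static class NeutralSpinSketch
{
    // Busy-waits with Thread.SpinWait, never giving up the time slice voluntarily.
    public static void WaitWithThreadSpinWait(ref int flag)
    {
        while (Volatile.Read(ref flag) == 0)
        {
            Thread.SpinWait(1);
        }
    }

    // Uses the SpinWait struct, which spins at first and later starts yielding the thread.
    public static void WaitWithSpinWaitStruct(ref int flag)
    {
        var spinner = new SpinWait();
        while (Volatile.Read(ref flag) == 0)
        {
            spinner.SpinOnce();
        }
    }
}
```

Neither form lets the caller choose which instruction is emitted per iteration, which is the gap the architecture-specific intrinsics are meant to fill.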

@tannergooding
Member

> My copy of the manual says "This instruction was introduced in the Pentium 4 processors, but is backward compatible with all IA-32 processors.". I do not see any mention of SSE2 in the instruction manual page that describes this instruction.

It's listed here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined&text=Pause&expand=4141, likely because the P4 is the CPU that SSE2 was introduced in.

It probably doesn't matter too much given both Intel and AMD list this as being a "nop" on hardware that doesn't recognize it and so I think it might be reasonable to put it in X86Base.

> We have the architecture neutral version of this already: Thread.SpinWait. And also struct SpinWait.

I meant something that users could rely on being this intrinsic (pause) on x86 or yield on ARM. I don't believe SpinWait(1) provides that guarantee today.

I don't have a preference either way, just trying to determine if that's something we want or are interested in. If not, then we can update the top post with the System.Runtime.Intrinsics proposal for X86Base.Pause and ArmBase.Yield.

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jun 17, 2021
@tannergooding tannergooding added this to the Future milestone Jun 17, 2021
@tannergooding
Member

@zpodlovics, could you please update the top post to place Pause in X86Base and to also propose ArmBase.Yield?

After that we can mark this api-ready-for-review

@zpodlovics
Author

@tannergooding @jkotas Thanks a lot for your comments. As suggested, I updated the proposal to use X86Base.Pause and ArmBase.Yield.

@jkotas jkotas changed the title from "Introduce sse2 pause intrinsics in order to support spin wait loop indication" to "Introduce pause intrinsics in order to support spin wait loop indication" Jun 18, 2021
@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Jun 18, 2021
@bartonjs
Member

bartonjs commented Aug 17, 2021

Video

Looks good as proposed.

     namespace System.Runtime.Intrinsics.Arm
     {
         public abstract class ArmBase
         {
    +        public static void Yield();
         }
     }

     namespace System.Runtime.Intrinsics.X86
     {
         public abstract partial class X86Base
         {
    +        public static void Pause();
         }
     }

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Aug 17, 2021
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 1, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2021