x64 vs ARM64 Microbenchmarks Performance Study Report #67339
Tagging subscribers to this area: @dotnet/area-meta
Nice! I did a similar report last week and shared it at our perf meeting last Monday.
Base64 (for utf8) is only vectorized for x64; there is an issue for arm64, #35033 (I think we wanted to assign it to someone to ramp up).
It is properly accelerated (I compared it with __builtin_popcount in LLVM); the problem is that popcount is vector-only on arm64, so we have some overhead for packing/extracting -- 5 instructions vs 1 on x64.
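For illustration, here is a rough C# sketch (not the exact BCL code) of the kind of sequence a scalar popcount turns into on arm64 via the AdvSimd intrinsics -- the value has to round-trip through a SIMD register. It only runs on arm64; AdvSimd throws PlatformNotSupportedException elsewhere.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class PopCountSketch
{
    // Rough equivalent of the arm64 lowering: move to a SIMD register, count
    // bits per byte, sum the lanes, then move the result back to a general
    // register. x64 does the same work with a single POPCNT instruction.
    public static int PopCount(ulong value)
    {
        Vector64<byte> bytes  = Vector64.Create(value).AsByte(); // fmov: GP -> SIMD
        Vector64<byte> counts = AdvSimd.PopCount(bytes);         // cnt: per-byte popcount
        Vector64<byte> total  = AdvSimd.Arm64.AddAcross(counts); // addv: horizontal sum
        return total.ToScalar();                                 // umov: SIMD -> GP
    }
}
```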
My guess is that Rent-Return is most likely bottlenecked on TLS access speed; it can be improved with #63619 if arm64 has special registers for that.
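For context, a minimal sketch of the hot path those RentReturn benchmarks exercise -- the shared pool's fast path goes through thread-local caches, so every Rent/Return pays for a TLS access (my description of the pool internals is an assumption based on the current implementation):

```csharp
using System.Buffers;

// The shared ArrayPool keeps per-thread (ThreadStatic) caches in front of the
// per-core locked stacks, so this pair of calls hits TLS reads on the fast
// path -- the access that #63619 aims to make cheaper.
byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
try
{
    // ... use the buffer ...
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}
```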
That is expected due to the lack of Vector256, I believe; I proposed adding dual-Vector128 for arm64 in #66993.
Same here, it uses
Correct, the codegen for interlocked ops is completely fine on both arm64 v8.0 and v8.1 (atomics).
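For reference, the kind of operation the Perf_Interlocked benchmarks boil down to; the instruction mapping in the comments is my understanding of the current codegen, not something stated in this thread:

```csharp
using System.Threading;

static class InterlockedSketch
{
    private static long _counter;

    // On ARMv8.1+ this becomes a single LSE atomic (e.g. ldaddal); on ARMv8.0
    // it is an ldaxr/stlxr retry loop. Either way the instruction carries
    // acquire/release semantics, while x64 uses a lock-prefixed add -- so the
    // cost difference reflects the memory models, not a codegen problem.
    public static long Bump() => Interlocked.Increment(ref _counter);
}
```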
If the arm64 machine was an M1, then it's the jump-stubs issue; see #62302 (comment).
My guess is that it's because we don't use relocs on arm64 and have to compose the full 64-bit address using several instructions to access a static field. E.g.:

```csharp
static int field;
void IncrementField() => field++;
```

x64:

```asm
FF05C6CC4200  inc  dword ptr [(reloc 0x7ffeb73eac3c)]
```

arm64:

```asm
D2958780  movz x0, #0xac3c            ; compose the full 64-bit address
F2B6E760  movk x0, #0xb73b LSL #16    ;   of the static field in three steps
F2CFFFC0  movk x0, #0x7ffe LSL #32
B9400001  ldr  w1, [x0]               ; load, increment, store
11000421  add  w1, w1, #1
B9000001  str  w1, [x0]
```

Overall, I have a feeling that we might get a very nice boost for many benchmarks/GC if we integrate PGO for native code (VM/GC).
@EgorBo that data seems like something you could share on a gist for everyone? (Or perhaps just the scenarios with unusual ratios)
The System.Drawing ones may just be a difference in Windows GDI+ performance since it's largely a wrapper.
Access for generic statics (for shared generics at least, maybe for all?) can be more complicated -- the address must be looked up in runtime data structures. Worth investigating.
Most likely it is because of ICU. We already have issue #31273 tracking that. I don't know, though, why the ARM64 runs end up even slower.
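A minimal sketch of what that benchmark effectively measures (the exact inputs in dotnet/performance differ; the date string here is just an example):

```csharp
using System;
using System.Globalization;

// Parsing with the "ja" culture: the first use of the culture has to build its
// formatting data through ICU, which is what #31273 tracks.
var ja = new CultureInfo("ja");
DateTime parsed = DateTime.Parse("2022/03/30 15:45:30", ja);
Console.WriteLine(parsed);
```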
@EgorBo perhaps you could open an issue and update the top post?
right, but it doesn't look to be the case here since it's not shared
Sure, let me see how to export an Excel sheet to a gist 😄
There is a lot of interop in this scenario. Could be differences in interop, or the performance of this callback: runtime/src/libraries/System.Drawing.Common/src/System/Drawing/Internal/GPStream.COMWrappers.cs (line 29, commit 3ae8739).
Could compare to the performance of a load that doesn't use a stream, and thus would be more of a GDI+ baseline. cc @eerhardt
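Something like this would separate the stream/interop path from a plain GDI+ load; the file name and the validateImageData flag are only an illustration of the comparison suggested above:

```csharp
using System.Drawing;
using System.IO;

// Compare the managed-Stream path (which bounces through the GPStream COM
// callbacks) against a direct file load, which is closer to a pure GDI+ baseline.
using (Image viaFile = Image.FromFile("sample.png"))
{
    // direct GDI+ load, no managed Stream involved
}

using (FileStream fs = File.OpenRead("sample.png"))
using (Image viaStream = Image.FromStream(fs, useEmbeddedColorManagement: false,
                                          validateImageData: false)) // the NoValidation variant
{
    // GDI+ pulls the bytes through IStream callbacks implemented on the managed Stream
}
```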
@jkoritzinsky for that interop possibility. Jeremy, anything notable in the interop here -- any potentially relevant known issue on Arm64?
This one serializes an array of bytes, so it spends most of the time encoding data into base64. So it's the same as #35033.
We don't have any notable differences (or even any differences I can think of) in the portion of interop used there for ARM64 vs x64. I definitely wouldn't be amazed at all if some portion of GDI+ is better optimized for x64 and we're just seeing that here. @dotnet/interop-contrib if anyone else on the interop team has any issues that come to mind.
For the regex ones -- do we know whether we have vectorization gaps specific to Arm64 in areas like StartsWith, IndexOf, IndexOfAny -- @EgorBo? (For char, not byte)
The cited pattern will use …
It is, but I'm starting to think that we won't be able to properly lower Vector256 to two Vector128s in the JIT, so I wonder if we should do that at the C#/IL level instead (e.g. source generators) if we really want to. Some say that these APIs generally work with small data, and cases where we need to open a 0.5 MB book and find a word in it are rare.
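To make the "do it at the C#/IL level" idea concrete, a rough sketch of a dual-Vector128 IndexOf loop -- illustrative only, not the SpanHelpers implementation, and the tail/match handling is simplified:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class DualVector128Sketch
{
    // Processes 32 bytes per iteration with two 128-bit vectors, i.e. the same
    // width per iteration as a Vector256 loop on x64, without needing the JIT
    // to decompose Vector256 on arm64.
    public static int IndexOf(ReadOnlySpan<byte> span, byte value)
    {
        Vector128<byte> target = Vector128.Create(value);
        ref byte start = ref MemoryMarshal.GetReference(span);
        int i = 0;

        for (; i + 32 <= span.Length; i += 32)
        {
            Vector128<byte> lo = Vector128.LoadUnsafe(ref start, (nuint)i);
            Vector128<byte> hi = Vector128.LoadUnsafe(ref start, (nuint)(i + 16));
            Vector128<byte> eq = Vector128.Equals(lo, target) | Vector128.Equals(hi, target);
            if (eq != Vector128<byte>.Zero)
                break; // match somewhere in this 32-byte block; pinpoint it below
        }

        for (; i < span.Length; i++) // scalar tail (and match pinpointing)
            if (span[i] == value)
                return i;

        return -1;
    }
}
```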
I really don't think it's worth focusing on or investing in that. Like you mentioned, doing it in the JIT is somewhat problematic because you have to take … Decomposition here isn't necessarily trivial and has questionable perf/throughput for various operations, leading users to a potential pit of failure, particularly when running on low-power devices (may negatively impact Mobile). We could do some clever things here and various other optimizations to make it work nicely (including treating it as an HVA), but it's not a small amount of work. On top of that, it won't really "close" the gap. The places where doing … We simply shouldn't be trying to compare … We should instead, when doing x64 vs Arm64 comparisons, compare …
I don't think you can assume this given they're critical to regex matching. @stephentoub @joperezr may have a better sense of typical regex text lengths (of course it also depends on how common hits are)
Comparing across hardware is inevitably bogus -- I thought the purpose of this exercise was to look for unusual ratios that might suggest room for targeted improvement by whatever means. It just sounds like there may not be a means in this case.
I support your point; however, I think we can add JIT support here, e.g. the JIT would be responsible for replacing SpanHelpers.IndexOf with a call to a heavily optimized, pipelined version if inputs are usually big (PGO).
https://godbolt.org/z/MxhGPPvaj here I wrote a simple loop to add
I didn't even use -O3 here 😐
Right. My point is that we shouldn't drive the work solely based on closing some non-representative Arm64 vs x64 perf gap, because that will be impossible given the two sets of hardware we have (particularly if we actually try and do our best for each platform). If it is perf critical, we should be hand tuning this to fit our needs for all the relevant platforms. If that includes manually unrolling and pipelining, then that's fine (assuming numbers across the hardware we care about show the respective gains).
These APIs are perf-critical (certainly for 'char', if it matters) -- if we think it's feasible at reasonable cost to make them significantly faster on this architecture by whatever means, can we get an issue opened for that?
Sure, but I'd love to mine some data first from some apps, 1st parties, and benchmarks to understand typical inputs better.
Recently @kunalspathak asked me if I could produce a report similar to #66848 for x64 vs arm64 comparison.
I took .NET 7 Preview2 results provided by @AndyAyersMS, @kunalspathak and myself for #66848, hacked the tool a little bit (it was not designed to compare different architecture results) and compared x64 vs arm64 using the following configs:
Of course it was not an apples-to-apples comparison, just the best thing we could do right now.
Full public results (without absolute values, as I don't have the permission to share them) can be found here.
Internal MS results (with absolute values) can be found here. If you don't have access, please ping me on Teams.
As usual, I've focused on the benchmarks that take longer to execute on arm64 compared to x64. If you are interested in benchmarks that take less to execute, you need to read the report linked above in the reverse order.
Benchmarks:
- @kunalspathak `System.Numerics.Tests.Perf_BitOperations.PopCount_ulong` is 5-8 times slower (most likely due to lack of vectorization). `PopCount_uint` is slower only on Windows.
- @tannergooding @GrabYourPitchforks `Base64Encode` benchmarks like `System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000)` are 6 to 16 times slower. Optimize System.Buffers for arm64 using cross-platform intrinsics #35033 (a simplified sketch of this benchmark follows the list).
- @stephentoub @kouvel `RentReturnArrayPoolTests` benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. Faster thread local statics #63619. `System.Threading.Tests.Perf_Timer.AsynchronousContention` is 2-3 times slower.
- @wfurt @MihaZupan `SocketSendReceivePerfTest` benchmarks like `System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend` are 2 times slower.
- @dotnet/area-system-drawing `System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation` benchmarks are a few times slower on Windows. Only the `NoValidation` benchmarks seem to run slower.
- @stephentoub `RegularExpressions` benchmarks like `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled)` are 40-50% slower. This pattern uses `IndexOfAny("HOho")` to find the next possible match location. It has a 256-bit vectorization path on x64 but only 128-bit on ARM64.
- @jkotas @AndyAyersMS `PerfLabTests.LowLevelPerf.GenericClassGenericStaticField` benchmark can be from 16% to x3 times slower. Same goes for `PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod`.
- @dotnet/jit-contrib
  - `System.Security.Cryptography.Tests.Perf_Hashing.Sha1` is 17-55% slower. (Potentially differences in the GDI+ code)
  - `System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100)` is 21-46% slower.
  - `System.Text.Json.Serialization.Tests.WriteJson<BinaryData>.SerializeToStream` benchmark can be from 16% to x4 times slower. #35033
  - `SIMD.ConsoleMandel` benchmarks are 40% slower. Double Vector128 for SpanHelpers.IndexOf(byte,byte,int) on ARM64 #66993
  - `Burgers.Test3` is 12-59% slower. #66993
  - `System.Collections.Contains` benchmarks are 2-3 times slower (most likely due to lack of vectorization). Same goes for `System.Memory.Span<Char>.IndexOfValue`, `System.Memory.Span<Char>.Fill`, `System.Memory.Span<Int32>.StartsWith`, `System.Memory.Span<Byte>.IndexOfAnyTwoValues` and `System.Memory.ReadOnlySpan.IndexOfString(Ordinal)`. #66993
  - `SequenceCompareTo` benchmarks are 30% up to 4 times slower. #66993
- @tannergooding `System.MathBenchmarks.Double.Exp` and `System.MathBenchmarks.Single.Exp` are 35% slower. Optimize jump stubs on arm64 #62302
- @dotnet/area-system-globalization `System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja)` benchmark can be from 20% to x7 times slower (it's most likely an ICU problem). Initializing the "ja" culture takes 200ms when using ICU #31273
- Various `Perf_Interlocked` benchmarks are slower, but this is expected due to memory model differences.
- Various `Perf_Process.Start` benchmarks are slower, but only on macOS, so it's most likely a macOS issue.
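For readers who want to reproduce one of these locally, here is a simplified stand-in for the Base64Encode benchmark called out above (the real benchmark lives in dotnet/performance and is configured differently):

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Base64EncodeBench
{
    [Params(1000)]
    public int NumberOfBytes;

    private byte[] _source = Array.Empty<byte>();
    private byte[] _encoded = Array.Empty<byte>();

    [GlobalSetup]
    public void Setup()
    {
        _source = new byte[NumberOfBytes];
        new Random(42).NextBytes(_source);
        _encoded = new byte[Base64.GetMaxEncodedToUtf8Length(NumberOfBytes)];
    }

    // Exercises the UTF-8 Base64 encoder that is vectorized on x64 but still
    // scalar on arm64 (#35033).
    [Benchmark]
    public OperationStatus Encode() =>
        Base64.EncodeToUtf8(_source, _encoded, out _, out _);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<Base64EncodeBench>();
}
```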