Improve Guid equality checks on 64-bit platforms #35654
Conversation
Current code does four 32-bit comparisons. Instead, do two 64-bit comparisons. On x86, the JIT-generated code is slightly different, but equally fast. This will be even better on arm64 which passes everything in 64-bit registers, after dotnet#35622 is addressed in the JIT.

Perf results:

x64:

| Method | Tool | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD |
|------------------ |----- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|
| EqualsSame | base | 2.322 ns | 0.0210 ns | 0.0187 ns | 2.324 ns | 2.294 ns | 2.351 ns | 1.00 | 0.00 |
| EqualsSame | diff | 1.547 ns | 0.0092 ns | 0.0071 ns | 1.547 ns | 1.535 ns | 1.559 ns | 0.67 | 0.01 |
| EqualsOperator | base | 2.890 ns | 0.3896 ns | 0.4169 ns | 3.030 ns | 1.722 ns | 3.074 ns | 1.00 | 0.00 |
| EqualsOperator | diff | 1.346 ns | 0.0160 ns | 0.0150 ns | 1.346 ns | 1.331 ns | 1.380 ns | 0.49 | 0.12 |
| NotEqualsOperator | base | 1.738 ns | 0.0306 ns | 0.0255 ns | 1.730 ns | 1.712 ns | 1.805 ns | 1.00 | 0.00 |
| NotEqualsOperator | diff | 1.401 ns | 0.0425 ns | 0.0355 ns | 1.389 ns | 1.360 ns | 1.476 ns | 0.81 | 0.02 |

x86:

| Method | Tool | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD |
|------------------ |----- |---------:|----------:|----------:|---------:|---------:|---------:|------:|--------:|
| EqualsSame | base | 3.164 ns | 0.0234 ns | 0.0208 ns | 3.159 ns | 3.136 ns | 3.203 ns | 1.00 | 0.00 |
| EqualsSame | diff | 3.079 ns | 0.0327 ns | 0.0306 ns | 3.074 ns | 3.041 ns | 3.146 ns | 0.97 | 0.01 |
| EqualsOperator | base | 2.736 ns | 0.0252 ns | 0.0236 ns | 2.726 ns | 2.710 ns | 2.783 ns | 1.00 | 0.00 |
| EqualsOperator | diff | 2.613 ns | 0.0262 ns | 0.0245 ns | 2.600 ns | 2.589 ns | 2.662 ns | 0.95 | 0.01 |
| NotEqualsOperator | base | 2.708 ns | 0.0096 ns | 0.0080 ns | 2.705 ns | 2.699 ns | 2.723 ns | 1.00 | 0.00 |
| NotEqualsOperator | diff | 2.573 ns | 0.0666 ns | 0.0591 ns | 2.552 ns | 2.526 ns | 2.709 ns | 0.95 | 0.02 |
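For orientation, here is a hedged sketch of the two shapes being compared, modeled on the `Unsafe.Add`/`Unsafe.As` lines quoted in the review comments below; the class and method names are made up and this is not the exact committed code:

```csharp
using System.Runtime.CompilerServices;

static class GuidCompareSketch
{
    // Old shape: four 32-bit comparisons over the 16 bytes
    // (`a`/`b` are refs to the first int field of each value).
    public static bool Equals32(ref int a, ref int b) =>
        a == b &&
        Unsafe.Add(ref a, 1) == Unsafe.Add(ref b, 1) &&
        Unsafe.Add(ref a, 2) == Unsafe.Add(ref b, 2) &&
        Unsafe.Add(ref a, 3) == Unsafe.Add(ref b, 3);

    // New shape: two 64-bit comparisons over the same 16 bytes.
    public static bool Equals64(ref int a, ref int b) =>
        Unsafe.As<int, long>(ref a) == Unsafe.As<int, long>(ref b) &&
        Unsafe.Add(ref Unsafe.As<int, long>(ref a), 1) ==
            Unsafe.Add(ref Unsafe.As<int, long>(ref b), 1);
}
```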
Additional benchmarks: dotnet/performance#1302
Is it safe for arm32 as well (accessing misaligned long)?
Did you consider a vectorized approach?
Last time I tried it was slower, |
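For reference, a vectorized version of the comparison could look roughly like the following sketch; this is a hypothetical helper, not code from this PR, and it assumes SSE2 via `Sse2.CompareEqual`/`Sse2.MoveMask` with a scalar fallback:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class VectorizedGuidSketch
{
    // Hypothetical: compare all 16 bytes of two Guids with a single SIMD compare.
    public static bool AreEqual(in Guid left, in Guid right)
    {
        if (Sse2.IsSupported)
        {
            // Unaligned 16-byte loads of each Guid's bytes.
            Vector128<byte> a = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left)));
            Vector128<byte> b = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right)));

            // MoveMask is 0xFFFF only when every byte compared equal.
            return Sse2.MoveMask(Sse2.CompareEqual(a, b)) == 0xFFFF;
        }

        return left.Equals(right); // scalar fallback
    }
}
```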
```diff
-    Unsafe.Add(ref g._a, 1) == Unsafe.Add(ref _a, 1) &&
-    Unsafe.Add(ref g._a, 2) == Unsafe.Add(ref _a, 2) &&
-    Unsafe.Add(ref g._a, 3) == Unsafe.Add(ref _a, 3);
+    return Unsafe.As<int, long>(ref g._a) == Unsafe.As<int, long>(ref _a) &&
```
Does this logic need to be repeated in multiple places rather than have everything forward to `==`, for example?
I'd hope the JIT would properly inline the check in all the cases and the code would be equivalent.
Agreed, although that's a separate issue.
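For illustration, the "forward everything to one implementation" shape being suggested would look something like the sketch below; the type and field names are made up and this is not the actual Guid source:

```csharp
using System;

// Sketch: one operator== owns the field comparison and the Equals overloads
// just call it, relying on the JIT to inline the check on every path.
public readonly struct GuidLike : IEquatable<GuidLike>
{
    private readonly int _x0, _x1, _x2, _x3; // 16 bytes, like Guid

    public GuidLike(int x0, int x1, int x2, int x3) =>
        (_x0, _x1, _x2, _x3) = (x0, x1, x2, x3);

    public static bool operator ==(GuidLike left, GuidLike right) =>
        left._x0 == right._x0 && left._x1 == right._x1 &&
        left._x2 == right._x2 && left._x3 == right._x3;

    public static bool operator !=(GuidLike left, GuidLike right) => !(left == right);

    public bool Equals(GuidLike other) => this == other;

    public override bool Equals(object obj) => obj is GuidLike other && this == other;

    public override int GetHashCode() => HashCode.Combine(_x0, _x1, _x2, _x3);
}
```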
```diff
-    Unsafe.Add(ref g._a, 2) == Unsafe.Add(ref _a, 2) &&
-    Unsafe.Add(ref g._a, 3) == Unsafe.Add(ref _a, 3);
+    return Unsafe.As<int, long>(ref g._a) == Unsafe.As<int, long>(ref _a) &&
+        Unsafe.As<byte, long>(ref g._d) == Unsafe.As<byte, long>(ref _d);
```
I was under the impression that Int64-based operations would be slower than Int32-based operations on 32-bit, especially if the values weren't aligned (and that that's why @jkotas used Int32 here initially when switching these operations to be based on Unsafe). Is that not the case?
+1. It would be better to use …
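The comment above is cut off in this capture, so the exact suggestion isn't shown. Purely as an illustration of one alignment-safe way to express the same two 64-bit loads (the helper and its name are made up), `Unsafe.ReadUnaligned` avoids asserting 8-byte alignment:

```csharp
using System.Runtime.CompilerServices;

static class UnalignedCompareSketch
{
    // Compare 16 bytes as two 64-bit halves; ReadUnaligned tells the JIT the
    // 8-byte loads may not be 8-byte aligned (relevant on 32-bit and arm32).
    public static bool Equals16Bytes(ref byte left, ref byte right) =>
        Unsafe.ReadUnaligned<long>(ref left) == Unsafe.ReadUnaligned<long>(ref right) &&
        Unsafe.ReadUnaligned<long>(ref Unsafe.Add(ref left, 8)) ==
            Unsafe.ReadUnaligned<long>(ref Unsafe.Add(ref right, 8));
}
```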
I didn't. I was specifically looking to improve arm64, but x64 as well (and to keep it simple and cross-platform).
fyi, here's the x86 code difference. It appears RyuJIT doesn't handle a 64-bit compare as a single operation on x86 and instead converts it into a pair of 32-bit operations (the xor/or sequences in the second listing).

current x86 assembly:

```asm
G_M51749_IG01:
55 push ebp
8BEC mov ebp, esp
;; bbWeight=1 PerfScore 1.25
G_M51749_IG02:
8B4518 mov eax, dword ptr [ebp+18H]
3B4508 cmp eax, dword ptr [ebp+08H]
7532 jne SHORT G_M51749_IG05
;; bbWeight=1 PerfScore 3.00
G_M51749_IG03:
8D4518 lea eax, bword ptr [ebp+18H]
8B4004 mov eax, dword ptr [eax+4]
8D5508 lea edx, bword ptr [ebp+08H]
3B4204 cmp eax, dword ptr [edx+4]
7524 jne SHORT G_M51749_IG05
8D4518 lea eax, bword ptr [ebp+18H]
8B4008 mov eax, dword ptr [eax+8]
8D5508 lea edx, bword ptr [ebp+08H]
3B4208 cmp eax, dword ptr [edx+8]
7516 jne SHORT G_M51749_IG05
8D4518 lea eax, bword ptr [ebp+18H]
8B400C mov eax, dword ptr [eax+12]
8D5508 lea edx, bword ptr [ebp+08H]
3B420C cmp eax, dword ptr [edx+12]
0F94C0 sete al
0FB6C0 movzx eax, al
;; bbWeight=0.50 PerfScore 9.13
G_M51749_IG04:
5D pop ebp
C22000 ret 32
;; bbWeight=0.50 PerfScore 1.25
G_M51749_IG05:
33C0 xor eax, eax
;; bbWeight=0.50 PerfScore 0.13
G_M51749_IG06:
5D pop ebp
C22000 ret 32
```

new x86 assembly:

```asm
G_M51749_IG02:
8B442414 mov eax, dword ptr [esp+14H]
8B542418 mov edx, dword ptr [esp+18H]
33442404 xor eax, dword ptr [esp+04H]
33542408 xor edx, dword ptr [esp+08H]
0BC2 or eax, edx
751B jne SHORT G_M51749_IG05
;; bbWeight=1 PerfScore 5.25
G_M51749_IG03:
8B44241C mov eax, dword ptr [esp+1CH]
8B542420 mov edx, dword ptr [esp+20H]
3344240C xor eax, dword ptr [esp+0CH]
33542410 xor edx, dword ptr [esp+10H]
0BC2 or eax, edx
0F94C0 sete al
0FB6C0 movzx eax, al
;; bbWeight=0.50 PerfScore 2.75
G_M51749_IG04:
C22000 ret 32
;; bbWeight=0.50 PerfScore 1.00
G_M51749_IG05:
33C0 xor eax, eax
;; bbWeight=0.50 PerfScore 0.13
G_M51749_IG06:
C22000 ret 32
```
@BruceForstall, are you still working on this? Thanks.
I haven't, and likely won't get back to it. I'll go ahead and close the PR. I'd be happy if someone picked it up. The "only thing" is to try Jan's suggestion about using …