
Improve Guid equality checks on 64-bit platforms #35654

Closed

Conversation

BruceForstall (Member)

Current code does four 32-bit comparisons. Instead,
do two 64-bit comparisons. On x86, the JIT-generated
code is slightly different, but equally fast.

This will be even better on arm64, which passes everything
in 64-bit registers, once #35622 is addressed in the JIT.

Perf results:

x64:

|            Method | Tool |       Mean |     Error |    StdDev |     Median |        Min |        Max | Ratio | RatioSD |
|------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:|
|        EqualsSame | base |   2.322 ns | 0.0210 ns | 0.0187 ns |   2.324 ns |   2.294 ns |   2.351 ns |  1.00 |    0.00 |
|        EqualsSame | diff |   1.547 ns | 0.0092 ns | 0.0071 ns |   1.547 ns |   1.535 ns |   1.559 ns |  0.67 |    0.01 |
|    EqualsOperator | base |   2.890 ns | 0.3896 ns | 0.4169 ns |   3.030 ns |   1.722 ns |   3.074 ns |  1.00 |    0.00 |
|    EqualsOperator | diff |   1.346 ns | 0.0160 ns | 0.0150 ns |   1.346 ns |   1.331 ns |   1.380 ns |  0.49 |    0.12 |
| NotEqualsOperator | base |   1.738 ns | 0.0306 ns | 0.0255 ns |   1.730 ns |   1.712 ns |   1.805 ns |  1.00 |    0.00 |
| NotEqualsOperator | diff |   1.401 ns | 0.0425 ns | 0.0355 ns |   1.389 ns |   1.360 ns |   1.476 ns |  0.81 |    0.02 |

x86:

|            Method | Tool |       Mean |     Error |    StdDev |     Median |        Min |        Max | Ratio | RatioSD |
|------------------ |------|-----------:|----------:|----------:|-----------:|-----------:|-----------:|------:|--------:|
|        EqualsSame | base |   3.164 ns | 0.0234 ns | 0.0208 ns |   3.159 ns |   3.136 ns |   3.203 ns |  1.00 |    0.00 |
|        EqualsSame | diff |   3.079 ns | 0.0327 ns | 0.0306 ns |   3.074 ns |   3.041 ns |   3.146 ns |  0.97 |    0.01 |
|    EqualsOperator | base |   2.736 ns | 0.0252 ns | 0.0236 ns |   2.726 ns |   2.710 ns |   2.783 ns |  1.00 |    0.00 |
|    EqualsOperator | diff |   2.613 ns | 0.0262 ns | 0.0245 ns |   2.600 ns |   2.589 ns |   2.662 ns |  0.95 |    0.01 |
| NotEqualsOperator | base |   2.708 ns | 0.0096 ns | 0.0080 ns |   2.705 ns |   2.699 ns |   2.723 ns |  1.00 |    0.00 |
| NotEqualsOperator | diff |   2.573 ns | 0.0666 ns | 0.0591 ns |   2.552 ns |   2.526 ns |   2.709 ns |  0.95 |    0.02 |

@BruceForstall (Member, Author)

Additional benchmarks: dotnet/performance#1302

@EgorBo (Member)

EgorBo commented Apr 30, 2020

Is it safe for arm32 as well (accessing misaligned long)?

@gfoidl (Member)

gfoidl commented Apr 30, 2020

Did you consider a vectorized approach?
Only shown for SSE2, needs some specialization for Arm, and I don't know if it's worth it...

@EgorBo (Member)

EgorBo commented Apr 30, 2020

> Did you consider a vectorized approach?
> Only shown for SSE2, needs some specialization for Arm, and I don't know if it's worth it...

Last time I tried, it was slower.
I think in most cases Guids are different, and an early-out check is enough.

```diff
-       Unsafe.Add(ref g._a, 1) == Unsafe.Add(ref _a, 1) &&
-       Unsafe.Add(ref g._a, 2) == Unsafe.Add(ref _a, 2) &&
-       Unsafe.Add(ref g._a, 3) == Unsafe.Add(ref _a, 3);
+       return Unsafe.As<int, long>(ref g._a) == Unsafe.As<int, long>(ref _a) &&
```
Member

Does this logic need to be repeated in multiple places rather than have everything forward to == for example?

I'd hope the JIT would properly inline the check in all the cases and the code would be equivalent.

Member Author

Agreed, although that's a separate issue.

```diff
-       Unsafe.Add(ref g._a, 2) == Unsafe.Add(ref _a, 2) &&
-       Unsafe.Add(ref g._a, 3) == Unsafe.Add(ref _a, 3);
+       return Unsafe.As<int, long>(ref g._a) == Unsafe.As<int, long>(ref _a) &&
+       Unsafe.As<byte, long>(ref g._d) == Unsafe.As<byte, long>(ref _d);
```
Member

I was under the impression that Int64-based operations would be slower than Int32-based operations on 32-bit, especially if the values weren't aligned (and that this is why @jkotas initially used Int32 here when switching these operations to be based on Unsafe). Is that not the case?

@jkotas (Member)

jkotas commented Apr 30, 2020

> Is it safe for arm32 as well (accessing misaligned long)?

+1. It would be better to use Unsafe.ReadUnaligned here to make the code portable.

@BruceForstall (Member, Author)

> Did you consider a vectorized approach?

I didn't. I was specifically looking to improve arm64, but x64 as well (and to keep the change simple and cross-platform).

> Is it safe for arm32 as well (accessing misaligned long)?

RyuJIT converts a 64-bit long access into two 32-bit int accesses on 32-bit platforms, so there is no difference in alignment behavior there. There could be a difference on 64-bit platforms, since presumably a Guid could be only 4-byte aligned.

FYI, here's the x86 code difference. It appears RyuJIT doesn't handle the Unsafe.Add calls well.

current x86 assembly:
```
G_M51749_IG01:
       55           push     ebp
       8BEC         mov      ebp, esp
                        ;; bbWeight=1    PerfScore 1.25
G_M51749_IG02:
       8B4518       mov      eax, dword ptr [ebp+18H]
       3B4508       cmp      eax, dword ptr [ebp+08H]
       7532         jne      SHORT G_M51749_IG05
                        ;; bbWeight=1    PerfScore 3.00
G_M51749_IG03:
       8D4518       lea      eax, bword ptr [ebp+18H]
       8B4004       mov      eax, dword ptr [eax+4]
       8D5508       lea      edx, bword ptr [ebp+08H]
       3B4204       cmp      eax, dword ptr [edx+4]
       7524         jne      SHORT G_M51749_IG05
       8D4518       lea      eax, bword ptr [ebp+18H]
       8B4008       mov      eax, dword ptr [eax+8]
       8D5508       lea      edx, bword ptr [ebp+08H]
       3B4208       cmp      eax, dword ptr [edx+8]
       7516         jne      SHORT G_M51749_IG05
       8D4518       lea      eax, bword ptr [ebp+18H]
       8B400C       mov      eax, dword ptr [eax+12]
       8D5508       lea      edx, bword ptr [ebp+08H]
       3B420C       cmp      eax, dword ptr [edx+12]
       0F94C0       sete     al
       0FB6C0       movzx    eax, al
                        ;; bbWeight=0.50 PerfScore 9.13
G_M51749_IG04:
       5D           pop      ebp
       C22000       ret      32
                        ;; bbWeight=0.50 PerfScore 1.25
G_M51749_IG05:
       33C0         xor      eax, eax
                        ;; bbWeight=0.50 PerfScore 0.13
G_M51749_IG06:
       5D           pop      ebp
       C22000       ret      32
```
new x86 assembly:
```
G_M51749_IG02:
       8B442414     mov      eax, dword ptr [esp+14H]
       8B542418     mov      edx, dword ptr [esp+18H]
       33442404     xor      eax, dword ptr [esp+04H]
       33542408     xor      edx, dword ptr [esp+08H]
       0BC2         or       eax, edx
       751B         jne      SHORT G_M51749_IG05
                        ;; bbWeight=1    PerfScore 5.25
G_M51749_IG03:
       8B44241C     mov      eax, dword ptr [esp+1CH]
       8B542420     mov      edx, dword ptr [esp+20H]
       3344240C     xor      eax, dword ptr [esp+0CH]
       33542410     xor      edx, dword ptr [esp+10H]
       0BC2         or       eax, edx
       0F94C0       sete     al
       0FB6C0       movzx    eax, al
                        ;; bbWeight=0.50 PerfScore 2.75
G_M51749_IG04:
       C22000       ret      32
                        ;; bbWeight=0.50 PerfScore 1.00
G_M51749_IG05:
       33C0         xor      eax, eax
                        ;; bbWeight=0.50 PerfScore 0.13
G_M51749_IG06:
       C22000       ret      32
```

@stephentoub (Member)

@BruceForstall, are you still working on this? Thanks.

@BruceForstall (Member, Author)

> @BruceForstall, are you still working on this? Thanks.

I haven't, and likely won't get back to it, so I'll go ahead and close the PR. I'd be happy if someone picked it up. The "only thing" left is to try Jan's suggestion of using Unsafe.ReadUnaligned and to verify that performance still improves on all platforms.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
@BruceForstall BruceForstall deleted the ImproveGuidEqualityChecks branch February 28, 2021 03:08
@BruceForstall BruceForstall restored the ImproveGuidEqualityChecks branch February 28, 2021 03:08
@BruceForstall BruceForstall deleted the ImproveGuidEqualityChecks branch December 28, 2022 01:04
7 participants