Performance improvement for Guid.Equals using SSE #53012
var result = Sse2.CompareEqual(g1, g2);
return Sse2.MoveMask(result) == 0b1111_1111_1111_1111;
Suggested change:
- var result = Sse2.CompareEqual(g1, g2);
- return Sse2.MoveMask(result) == 0b1111_1111_1111_1111;
+ return g1.Equals(g2);
This results in the same codegen and is simpler to read.
If this hinders inlining, etc., keep it as is (except `var` -> concrete type).
Ideally the JIT should emit code that takes SSE4.1's `_mm_testz_si128` into account?
(If this variant is always faster on the supported CPUs.)
`Vector128<T>.Equals(Vector128<T>)` currently does not have a specific/optimized code path for SSE 4.1's `_mm_testz_si128`, which is why I called out an explicit code path for SSE 4.1 in `Guid.EqualsCore`.
But yes, under the SSE2 code path, it should be equivalent to invoke `g1.Equals(g2)`, on the assumption that the JIT will inline the call, which it should because `Vector128<T>.Equals(Vector128<T>)` is decorated with "aggressive inlining".
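For readers following the thread, the two code paths discussed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `Guid.EqualsCore`: the class and method names (`GuidSseSketch`, `EqualsSse`) are hypothetical, and the XOR-then-`TestZ` form of the SSE 4.1 path is an assumption about how `_mm_testz_si128` would be used here.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch; names are illustrative and not taken from the PR.
public static class GuidSseSketch
{
    public static bool EqualsSse(in Guid left, in Guid right)
    {
        if (!Sse2.IsSupported)
        {
            // No SSE2 (e.g. ARM64): defer to the existing implementation.
            return left.Equals(right);
        }

        // Load the 16 bytes of each Guid without assuming 16-byte alignment.
        Vector128<byte> g1 = Unsafe.ReadUnaligned<Vector128<byte>>(
            ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left)));
        Vector128<byte> g2 = Unsafe.ReadUnaligned<Vector128<byte>>(
            ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right)));

        if (Sse41.IsSupported)
        {
            // Assumed SSE 4.1 path: the XOR is all zero iff the inputs are
            // equal, and TestZ (_mm_testz_si128) reports whether it is zero.
            Vector128<byte> diff = Sse2.Xor(g1, g2);
            return Sse41.TestZ(diff, diff);
        }

        // SSE2 path from the diff: byte-wise compare, then require all
        // 16 mask bits to be set.
        Vector128<byte> result = Sse2.CompareEqual(g1, g2);
        return Sse2.MoveMask(result) == 0b1111_1111_1111_1111;
    }
}
```

Whether the SSE 4.1 variant is actually faster than the SSE2 variant on a given CPU is exactly the open question raised in the review, so the sketch should not be read as a recommendation.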
> currently does not have a specific/optimized code path for SSE 4.1's
Yeah, my question was intended for @tannergooding (?) to have a look at this on the JIT side 😃
> Ideally the JIT should emit code that takes SSE4.1's _mm_testz_si128 into account?
There are a few improvements that could happen for `Vector128.Equals`. I've been waiting on #49397 before tackling any of them, however.
ref int rB = ref Unsafe.AsRef(in right._a);
if (Sse2.IsSupported)
{
    var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
Note: use explicit types, not `var`, to follow the coding guidelines in this repo.
I have updated accordingly.
// Compare each element
static bool SoftwareFallback(in Guid left, in Guid right)
Re `in`: just to double-check the discussion in the issue.
Happy to take direction on this either way. I have found in my performance tests thus far that passing a `Guid` by reference outperforms passing by value - I suspect because the JIT has difficulty keeping the `Guid` parameters in registers because they have 11 fields.
&& Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
// Compare each element
return rA == rB
Should we special-case 64-bit processes by comparing with `long`s?
E.g. on ARM64 this code will be hit.
Cf. #35654
if (Environment.Is64BitProcess)
{
// code with long
}
else
{
// code with int
}
The "fast out behavior" should be given with longs too, but I'm not sure about alignment on ARM64.
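One way the 64-bit special case suggested above might look is sketched below. The names `GuidLongSketch` and `EqualsSoftware` are hypothetical, and using `Unsafe.ReadUnaligned` to sidestep the alignment question is an assumption made for this sketch, not something settled in the thread.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical sketch of a software fallback with a 64-bit special case.
public static class GuidLongSketch
{
    public static bool EqualsSoftware(in Guid left, in Guid right)
    {
        // View both Guids as raw bytes; Unsafe.Add on a ref byte offsets in bytes.
        ref byte l = ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left));
        ref byte r = ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right));

        if (Environment.Is64BitProcess)
        {
            // Two 8-byte comparisons; && short-circuits, so a mismatch in the
            // first half returns early.
            return Unsafe.ReadUnaligned<long>(ref l) == Unsafe.ReadUnaligned<long>(ref r)
                && Unsafe.ReadUnaligned<long>(ref Unsafe.Add(ref l, 8))
                   == Unsafe.ReadUnaligned<long>(ref Unsafe.Add(ref r, 8));
        }

        // Four 4-byte comparisons for 32-bit processes.
        return Unsafe.ReadUnaligned<int>(ref l) == Unsafe.ReadUnaligned<int>(ref r)
            && Unsafe.ReadUnaligned<int>(ref Unsafe.Add(ref l, 4))
               == Unsafe.ReadUnaligned<int>(ref Unsafe.Add(ref r, 4))
            && Unsafe.ReadUnaligned<int>(ref Unsafe.Add(ref l, 8))
               == Unsafe.ReadUnaligned<int>(ref Unsafe.Add(ref r, 8))
            && Unsafe.ReadUnaligned<int>(ref Unsafe.Add(ref l, 12))
               == Unsafe.ReadUnaligned<int>(ref Unsafe.Add(ref r, 12));
    }
}
```

Whether the unaligned 8-byte reads are cheaper than four aligned 4-byte reads on ARM64 would need to be measured.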
I'm happy to update with a special case for 64-bit software fallback if desired. The current `Guid.EqualsCore` method doesn't cater for this, but that's no reason to not cater for it now.
private static bool EqualsCore(in Guid left, in Guid right)
{
    ref int rA = ref Unsafe.AsRef(in left._a);
    ref int rB = ref Unsafe.AsRef(in right._a);
    if (Sse2.IsSupported)
What about ARM64? We care about ARM64 as much as we care about x64.
Well, if `SSE2` is not supported, we can just use a method without `SSE2`? Much like what we have been doing so far. But still, in most cases `SSE2` is supported.
FWIW, I think that this is way too complex a change for the benefit that it provides. I think it would be better to wait for the general Vector128 improvements.
Tagging subscribers to this area: @dotnet/area-system-runtime
Issue Details
Fixes #52296.
Baseline performance:
Updated performance:
Based on this comment and the lack of activity on this PR since, I'm going to go ahead and close it. Thanks for submitting the PR @billpoole-mi; it led to an interesting and valuable discussion here.