-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intrinsicify SpanHelpers.IndexOfAny(char,...) #40729
Intrinsicify SpanHelpers.IndexOfAny(char,...) #40729
Conversation
Thanks. Shall we close the others? I think it would be better. You might consider running these microbenchmarks too At a glance that only uses a search space of 512 chars? Maybe it would be useful to add a bit more diversity there to help establish this PR's characteristics for small strings eg. |
+19%
/cc @stephentoub |
+30% .. +65% for the span, 512 chars
|
Smaller ones
|
Those numbers are great! I guess that is with AVX2. How do we test perf for CPU without that -- repeat with COMPlus_EnableAVX2=0 and COMPlus_EnableHWIntrinsic=0 (?) |
@tannergooding or @GrabYourPitchforks what configurations do you believe we should get perf numbers for? without AVX2 and without intrinsics? |
COMPlus_EnableAVX2=0
|
COMPlus_EnableAVX2=0 and COMPlus_EnableHWIntrinsic=0
|
Interesting that you made it faster without any HW intrinsics applied (magic..) @pgovind this change doesn't use ARM intrinsics, but we don't want to regress ARM. Is the test with |
ARM uses the |
Right, thanks so I guess the question is how to force |
I'm actually not certain how |
I believe it should be |
|
Well, the numbers seem solid, now we just need a reviewer. It might be a few days with vacations and such - but if this pans out and folks feel good about it next week I hope we could port it into 5.0 (and RC1) then. |
There's a C# .NET Core versus SDK 5 preview https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharppreview.html Some regressions? |
Yes, thanks for the reminder. I started digging into our own copies of those (some are out of date) and any regression we see in our own data, but got side tracked. I'll open a new issue. |
Will close 5 issues 😉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Some minor comments.
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
// Bitwise Or to combine the flagged matches for the second, third and fourth values to our match flags | ||
matches |= Avx2.MoveMask(Avx2.CompareEqual(values1, search).AsByte()); | ||
matches |= Avx2.MoveMask(Avx2.CompareEqual(values2, search).AsByte()); | ||
matches |= Avx2.MoveMask(Avx2.CompareEqual(values3, search).AsByte()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really wish if we had a way to generalize these methods. The logic is identical except the value0
...value3
part being different.
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Outdated
Show resolved
Hide resolved
Hmm, might be able to improve short lengths (non-vector) more also #40883 |
K, wouldn't hold these up for that think it might hit the vector path with some extra push/pops that I'll have to work through; so will follow up in another PR |
Thank you @benaadams for the wonderful improvements! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome job, @benaadams!
{ | ||
search = LoadVector256(ref searchStart, offset); | ||
// We preform the Or at non-Vector level as we are using the maximum number of non-preserved registers, | ||
// and more causes them first to be pushed to stack and then popped on exit to preseve their values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// and more causes them first to be pushed to stack and then popped on exit to preseve their values. | |
on exit to preserve their values. |
if (Vector.IsHardwareAccelerated && length >= Vector<ushort>.Count * 2) | ||
if (Sse2.IsSupported) | ||
{ | ||
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice. | |
// Calculate lengthToExamine here for test, rather than just testing as it is used later, rather than doing it twice. |
int unaligned = ((int)pCh & (Unsafe.SizeOf<Vector<ushort>>() - 1)) / elementsPerByte; | ||
length = (Vector<ushort>.Count - unaligned) & (Vector<ushort>.Count - 1); | ||
// >= Sse2 intrinsics are supported and length is enough to use them, so use that path. | ||
// We jump forward to the intrinsics at the end of them method so a naive branch predict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// We jump forward to the intrinsics at the end of them method so a naive branch predict | |
// We jump forward to the intrinsics at the end of the method so a naive branch predict |
} | ||
else if (Vector.IsHardwareAccelerated) | ||
{ | ||
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice. | |
// Calculate lengthToExamine here for test, rather than just testing as it is used later, rather than doing it twice. |
{ | ||
char* pCh = pChars; | ||
char* pEndCh = pCh + length; | ||
nuint offset = 0; // Use nuint for arithmetic to avoid unnecessary 64->32->64 truncations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why nuint
here, when IndexOf
uses nint
? Should that be changed to nuint
as well?
And if we change it, can we remove the nint
overloads of LoadVector128
, LoadVector256
, etc.?
10% to 125% performance increase #40729 (comment)
Put them all in same PR as they are mostly identical
Intrinsicify IndexOfAny(char,char) #40589
Intrinsicify IndexOfAny(char,char,char) #40590
Intrinsicify IndexOfAny(char,char,char,char) #40591
Intrinsicify IndexOfAny(char,char,char,char,char) #40592
Resolves: #12094
Resolves: #12095
Resolves: #12096
Resolves: #12097
Resolves #25023