Intrinsicify SpanHelpers.IndexOfAny(char,...) #40729

@benaadams benaadams commented Aug 12, 2020

10% to 125% performance increase #40729 (comment)

Put them all in the same PR, as they are mostly identical.

Intrinsicify IndexOfAny(char,char) #40589
Intrinsicify IndexOfAny(char,char,char) #40590
Intrinsicify IndexOfAny(char,char,char,char) #40591
Intrinsicify IndexOfAny(char,char,char,char,char) #40592

Resolves: #12094
Resolves: #12095
Resolves: #12096
Resolves: #12097

Resolves #25023

@danmoseley
Member

danmoseley commented Aug 12, 2020

Thanks. Shall we close the others? I think it would be better.

You might consider running these microbenchmarks too:
https://github.com/dotnet/performance/blob/8f00082e5f1ab8b86a98bf3bfc9c307a171912a3/src/benchmarks/micro/libraries/System.Memory/Span.cs#L72-L78

At a glance, that only uses a search space of 512 chars? Maybe it would be useful to add a bit more diversity there, e.g. to help establish this PR's characteristics for small strings.

@benaadams
Member Author

+19%

| Faster                                                  | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------- | ---------:| ----------------:| ----------------:| ------- |
| BenchmarksGame.RegexRedux_1.RunBench                    |      1.11 |      45419250.00 |      40848366.67 |         |
| BenchmarksGame.RegexRedux_5.RunBench(options: Compiled) |      1.19 |       8419890.74 |       7057605.17 | bimodal |
| BenchmarksGame.RegexRedux_5.RunBench(options: None)     |      1.02 |      25805255.56 |      25379127.78 |         |

/cc @stephentoub

@benaadams
Member Author

+30% .. +65% for the span, 512 chars
+75% for the string (pos 23, two items, 178 char string)

| Faster                                                    | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------- | ---------:| ----------------:| ----------------:| ------- |
| System.Tests.Perf_String.IndexOfAny                       |      1.75 |            14.00 |             7.98 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 512) |      1.65 |            24.25 |            14.71 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 512)   |      1.35 |            16.50 |            12.22 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 512)  |      1.30 |            29.68 |            22.83 |         |
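The "+N%" figures quoted above the table appear to be the base/diff column restated as a percentage. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and are not part of the benchmark tooling.

```csharp
using System;

public static class BenchmarkRatio
{
    // The base/diff column: how many times longer the baseline took than the PR build.
    public static double Ratio(double baseMedianNs, double diffMedianNs) =>
        baseMedianNs / diffMedianNs;

    // The same ratio written as a "+N%" figure (assumed reading of the shorthand).
    public static double PercentFaster(double baseMedianNs, double diffMedianNs) =>
        (Ratio(baseMedianNs, diffMedianNs) - 1.0) * 100.0;
}
```

For the `Perf_String.IndexOfAny` row, `BenchmarkRatio.PercentFaster(14.00, 7.98)` rounds to 75, which lines up with the "+75% for the string" remark.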

@benaadams
Member Author

benaadams commented Aug 12, 2020

Smaller ones

| Slower                                                 | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------ | ---------:| ----------------:| ----------------:| -------:|
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 4) |      1.07 |             5.28 |             5.66 |         |

| Faster                                                   | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------:|
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 8)    |      1.20 |             4.41 |             3.67 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 16)   |      1.44 |             5.66 |             3.94 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 24)   |      1.91 |             6.98 |             3.64 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 32)   |      2.25 |             8.83 |             3.92 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 48)   |      2.17 |             8.93 |             4.12 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 64)   |      1.92 |             9.33 |             4.85 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 4)  |      1.18 |             4.07 |             3.44 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 16) |      1.63 |             7.23 |             4.44 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 24) |      2.12 |             8.66 |             4.08 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 32) |      2.11 |            10.21 |             4.84 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 48) |      2.31 |            10.76 |             4.65 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 64) |      2.01 |            11.33 |             5.64 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 8)   |      1.11 |             6.70 |             6.06 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 16)  |      1.43 |             9.25 |             6.46 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 24)  |      1.84 |            11.55 |             6.28 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 48)  |      1.81 |            12.99 |             7.18 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 32)  |      1.77 |            12.44 |             7.04 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 64)  |      1.70 |            14.07 |             8.29 |         |

@danmoseley
Member

Those numbers are great! I guess that is with AVX2. How do we test perf for CPUs without it: repeat with COMPlus_EnableAVX2=0 and COMPlus_EnableHWIntrinsic=0 (?)

@danmoseley
Member

@tannergooding or @GrabYourPitchforks what configurations do you believe we should get perf numbers for? without AVX2 and without intrinsics?

@benaadams
Member Author

COMPlus_EnableAVX2=0

| Faster                                                   | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------:|
| BenchmarksGame.RegexRedux_1.RunBench                     |      1.13 |      45408050.00 |      40193533.33 |         |
| BenchmarksGame.RegexRedux_5.RunBench(options: Compiled)  |      1.16 |       7868740.74 |       6801046.97 |         |
| System.Tests.Perf_String.IndexOfAny                      |      1.43 |            14.66 |            10.22 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 8)    |      1.40 |             4.38 |             3.12 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 16)   |      1.99 |             7.13 |             3.59 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 24)   |      1.48 |             7.69 |             5.18 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 32)   |      1.42 |             6.93 |             4.87 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 48)   |      1.50 |             8.52 |             5.69 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 64)   |      1.42 |             9.56 |             6.74 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 4)  |      1.14 |             3.80 |             3.33 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 8)  |      1.16 |             4.78 |             4.13 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 16) |      1.87 |             8.54 |             4.57 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 24) |      1.89 |             9.14 |             4.85 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 32) |      1.63 |             9.26 |             5.67 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 48) |      1.53 |             9.93 |             6.48 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 64) |      1.42 |            10.68 |             7.53 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 8)   |      1.10 |             6.37 |             5.80 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 16)  |      1.36 |             8.94 |             6.58 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 24)  |      1.67 |            10.86 |             6.51 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 32)  |      1.40 |            11.01 |             7.85 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 48)  |      1.16 |            10.26 |             8.82 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 64)  |      1.29 |            13.04 |            10.08 |         |

@benaadams
Member Author

COMPlus_EnableAVX2=0 and COMPlus_EnableHWIntrinsic=0

| Faster                                                  | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------:|
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 16)  |      1.11 |             4.25 |             3.83 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 24)  |      1.09 |             5.49 |             5.01 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 32)  |      1.07 |             6.76 |             6.30 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 48)  |      1.05 |             9.26 |             8.85 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 64)  |      1.04 |            11.77 |            11.32 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 4) |      1.14 |             3.04 |             2.66 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 8)  |      1.07 |             5.58 |             5.21 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 16) |      1.05 |             7.72 |             7.38 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 24) |      1.04 |            10.15 |             9.77 |         |
| BenchmarksGame.RegexRedux_1.RunBench                    |      1.01 |      41839725.00 |      41626433.33 |         |

@danmoseley
Member

Interesting that you made it faster without any HW intrinsics applied (magic..)

@pgovind this change doesn't use ARM intrinsics, but we don't want to regress ARM. Is the test with COMPlus_EnableHWIntrinsic=0 a reasonable proxy?

@benaadams
Member Author

> @pgovind this change doesn't use ARM intrinsics, but we don't want to regress ARM. Is the test with COMPlus_EnableHWIntrinsic=0 a reasonable proxy?

ARM uses the Vector<T> branch?

@danmoseley
Member

> ARM uses the Vector<T> branch?

Right, thanks. So I guess the question is how to force Sse2.IsSupported to false and Vector.IsHardwareAccelerated to true. @AndyAyersMS? Or maybe someone can just get the numbers on ARM next week.

@pgovind

pgovind commented Aug 12, 2020

> but we don't want to regress ARM. Is the test with COMPlus_EnableHWIntrinsic=0 a reasonable proxy

I'm actually not certain how EnableHWIntrinsic plays with EnableAdvSimd=0 (and the IsHardwareAccelerated check). @kunalspathak or @echesakovMSFT are probably the best suited to answer that.

@kunalspathak
Member

> > but we don't want to regress ARM. Is the test with COMPlus_EnableHWIntrinsic=0 a reasonable proxy
>
> I'm actually not certain how EnableHWIntrinsic plays with EnableAdvSimd=0 (and the IsHardwareAccelerated check). @kunalspathak or @echesakovMSFT are probably the best suited to answer that.

I believe it should be COMPlus_EnableSse2=0.
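When experimenting with the settings discussed above, it can help to confirm which paths the runtime actually reports as available before trusting a benchmark run. A minimal sketch, assuming a hypothetical `IsaReport` helper (not part of the PR); it only reads the `IsSupported` and `IsHardwareAccelerated` flags named in this thread.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

public static class IsaReport
{
    // Hypothetical helper: summarizes which code paths this process would be able to take.
    public static string Describe() =>
        $"Vector.IsHardwareAccelerated={Vector.IsHardwareAccelerated}, " +
        $"Sse2.IsSupported={Sse2.IsSupported}, " +
        $"Avx2.IsSupported={Avx2.IsSupported}";
}
```

Printing `IsaReport.Describe()` once with no variables set and once with, say, `COMPlus_EnableAVX2=0` in the environment should show whether the setting took effect. The exact output depends on the machine and runtime build.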

@benaadams
Member Author

COMPlus_EnableSse2=0 + COMPlus_EnableAVX2=0

| Faster                                                   | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------:|
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 4)    |      1.18 |             2.65 |             2.25 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 8)    |      1.35 |             3.35 |             2.48 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 16)   |      1.26 |             4.75 |             3.77 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 24)   |      1.15 |             5.90 |             5.14 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 32)   |      1.13 |             7.15 |             6.34 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 48)   |      1.08 |             9.58 |             8.85 |         |
| System.Memory.Span<Char>.IndexOfAnyTwoValues(Size: 64)   |      1.06 |            12.01 |            11.35 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 4)  |      1.20 |             3.03 |             2.52 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 8)  |      1.12 |             3.69 |             3.30 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 16) |      1.11 |             5.48 |             4.94 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 24) |      1.06 |             7.25 |             6.84 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 32) |      1.06 |             9.01 |             8.50 |         |
| System.Memory.Span<Char>.IndexOfAnyThreeValues(Size: 48) |      1.04 |            12.51 |            12.03 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 16)  |      1.04 |             7.70 |             7.38 |         |
| System.Memory.Span<Char>.IndexOfAnyFourValues(Size: 32)  |      1.06 |            12.69 |            12.00 |         |

@danmoseley
Member

Well, the numbers seem solid, now we just need a reviewer. It might be a few days with vacations and such - but if this pans out and folks feel good about it next week I hope we could port it into 5.0 (and RC1) then.

@benaadams
Member Author

There's a C# .NET Core versus SDK 5 preview comparison:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharppreview.html

Some regressions?

@danmoseley
Member

> Some regressions?

Yes, thanks for the reminder. I started digging into our own copies of those (some are out of date) and any regressions we see in our own data, but got sidetracked. I'll open a new issue.

@benaadams
Member Author

> if this pans out and folks feel good about it next week I hope we could port it into 5.0 (and RC1) then

Will close 5 issues 😉

@adamsitnik adamsitnik added the tenet-performance Performance related issue label Aug 14, 2020
@adamsitnik adamsitnik added this to the 5.0.0 milestone Aug 14, 2020
Member

@kunalspathak kunalspathak left a comment

Overall looks good. Some minor comments.

// Bitwise Or to combine the flagged matches for the second, third and fourth values to our match flags
matches |= Avx2.MoveMask(Avx2.CompareEqual(values1, search).AsByte());
matches |= Avx2.MoveMask(Avx2.CompareEqual(values2, search).AsByte());
matches |= Avx2.MoveMask(Avx2.CompareEqual(values3, search).AsByte());

I really wish we had a way to generalize these methods. The logic is identical apart from the value0...value3 handling.
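To illustrate what such a generalization could look like, here is a minimal sketch, not the PR's implementation. `IndexOfAnySketch` and its parameter shape are hypothetical. It reuses the pattern from the snippet above: compare a 256-bit block against each broadcast value, OR the `MoveMask` results together as plain integers, and locate the first set bit. Unlike the PR, it broadcasts each value again inside the loop and ignores alignment, so it would likely be slower than the hand-unrolled versions.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class IndexOfAnySketch
{
    // Hypothetical helper: index of the first char in `span` equal to any char in `values`, or -1.
    public static int IndexOfAny(ReadOnlySpan<char> span, ReadOnlySpan<char> values)
    {
        int i = 0;
        if (Avx2.IsSupported)
        {
            ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(span);
            // Process full 16-char blocks; the scalar loop below handles the remainder.
            for (; i + Vector256<ushort>.Count <= data.Length; i += Vector256<ushort>.Count)
            {
                Vector256<ushort> search =
                    MemoryMarshal.Read<Vector256<ushort>>(MemoryMarshal.AsBytes(data.Slice(i)));
                int matches = 0;
                foreach (char value in values)
                {
                    // OR the per-value byte masks together at the integer level.
                    matches |= Avx2.MoveMask(
                        Avx2.CompareEqual(Vector256.Create((ushort)value), search).AsByte());
                }
                if (matches != 0)
                {
                    // Each matching char sets two adjacent mask bits, so halve the bit position.
                    return i + BitOperations.TrailingZeroCount(matches) / sizeof(char);
                }
            }
        }
        // Scalar path: short inputs, the tail of the span, or hardware without AVX2.
        for (; i < span.Length; i++)
        {
            if (values.IndexOf(span[i]) >= 0)
            {
                return i;
            }
        }
        return -1;
    }
}
```

The per-value loop is the part a generalized method would have to make cheap; keeping the broadcast vectors in registers across iterations is presumably why the PR unrolls each arity by hand.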

@benaadams
Member Author

Hmm, might also be able to improve short lengths (non-vector) further: #40883

@benaadams
Member Author

OK, I wouldn't hold these up for that. I think it might hit the vector path with some extra push/pops that I'll have to work through, so I will follow up in another PR.

@kunalspathak kunalspathak merged commit 0bfb0fd into dotnet:master Aug 17, 2020
@kunalspathak
Member

Thank you @benaadams for the wonderful improvements!

Member

@ahsonkhan ahsonkhan left a comment

Awesome job, @benaadams!

{
search = LoadVector256(ref searchStart, offset);
// We preform the Or at non-Vector level as we are using the maximum number of non-preserved registers,
// and more causes them first to be pushed to stack and then popped on exit to preseve their values.

Suggested change
// and more causes them first to be pushed to stack and then popped on exit to preseve their values.
// and more causes them first to be pushed to stack and then popped on exit to preserve their values.

if (Vector.IsHardwareAccelerated && length >= Vector<ushort>.Count * 2)
if (Sse2.IsSupported)
{
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice.

Suggested change
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice.
// Calculate lengthToExamine here for test, rather than just testing as it is used later, rather than doing it twice.

int unaligned = ((int)pCh & (Unsafe.SizeOf<Vector<ushort>>() - 1)) / elementsPerByte;
length = (Vector<ushort>.Count - unaligned) & (Vector<ushort>.Count - 1);
// >= Sse2 intrinsics are supported and length is enough to use them, so use that path.
// We jump forward to the intrinsics at the end of them method so a naive branch predict

Suggested change
// We jump forward to the intrinsics at the end of them method so a naive branch predict
// We jump forward to the intrinsics at the end of the method so a naive branch predict
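The `unaligned` and `length` lines in the snippet above compute how many chars must be handled one at a time before the pointer reaches a vector-size boundary. A worked sketch of that arithmetic follows, using a plain integer address instead of a pointer. `AlignmentSketch` is a hypothetical name, and it assumes the snippet's `elementsPerByte` is the size of a char in bytes (2).

```csharp
using System;

public static class AlignmentSketch
{
    // Chars to process before `address` becomes a multiple of `vectorSizeInBytes`.
    // Assumes `vectorSizeInBytes` is a power of two and `address` is char-aligned (even).
    public static int ElementsUntilAligned(ulong address, int vectorSizeInBytes)
    {
        const int bytesPerElement = sizeof(char); // 2
        int elementsPerVector = vectorSizeInBytes / bytesPerElement;
        // How far past the previous boundary the address sits, measured in chars.
        int unaligned = (int)(address & (ulong)(vectorSizeInBytes - 1)) / bytesPerElement;
        // Distance to the next boundary; the final mask maps "already aligned" to 0.
        return (elementsPerVector - unaligned) & (elementsPerVector - 1);
    }
}
```

For example, with 32-byte vectors and an address of 0x1008, the result is 12 chars (24 bytes), which lands on 0x1020. An address that is already aligned yields 0.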

}
else if (Vector.IsHardwareAccelerated)
{
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice.

Suggested change
// Calculate lengthToExamine here for test, rather than just testing as it used later, rather than doing it twice.
// Calculate lengthToExamine here for test, rather than just testing as it is used later, rather than doing it twice.

{
char* pCh = pChars;
char* pEndCh = pCh + length;
nuint offset = 0; // Use nuint for arithmetic to avoid unnecessary 64->32->64 truncations

Why nuint here, when IndexOf uses nint? Should that be changed to nuint as well?

https://github.com/benaadams/runtime/blob/73e8efb9f68bcb0e1c9c1ecf0f937447129a5373/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L218-L219

And if we change it, can we remove the nint overloads of LoadVector128, LoadVector256, etc.?

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020