
Use intrinsics for SequenceEqual<byte> vectorization to emit at R2R #32371

Merged

merged 4 commits into dotnet:master from benaadams:SequenceEqual on Apr 27, 2020

Conversation

benaadams
Member

@benaadams benaadams commented Feb 15, 2020

Bit more stable across sizes, but mostly similar. The biggest win I'd highlight is that it's now emitted at R2R rather than always requiring JIT.

[image]

[image]

gist: Benchmark + Results

Resolves #32363

/cc @ahsonkhan

@jkotas jkotas added the tenet-performance Performance related issue label Feb 15, 2020
@benaadams benaadams changed the title Use intrinsics for SequenceEqual<byte> Use intrinsics for SequenceEqual<byte> and improve short lengths Feb 15, 2020
@benaadams
Member Author

Redoing this on top of @ahsonkhan's change #32364 as that outperformed this in various areas

@ahsonkhan
Member

ahsonkhan commented Feb 15, 2020

@benaadams, can you run the following benchmark with what's in master (with my recent change) vs. what's in this PR to measure/validate the small buffer perf?

I am asking because I noticed that not using the actual built SequenceEqual method gave different results (compared to having your own local implementation in the benchmark). The RuntimeHelpers.IsBitwiseEquatable<T> call with multiple return points should be part of the benchmark (it ends up changing the results noticeably).

using System;
using System.Text;
using BenchmarkDotNet.Attributes;
using MicroBenchmarks; // Categories, from the dotnet/performance repo

[BenchmarkCategory(Categories.CoreFX, Categories.JSON)]
[DisassemblyDiagnoser(printPrologAndEpilog: true, recursiveDepth: 5)]
public class SequenceEqualThreshold
{
    private byte[] _input;
    private byte[] _expected;

    [Params(0, 1, 2, 3, 4)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var builder = new StringBuilder();
        for (int i = 0; i < Length; i++)
        {
            builder.Append("a");
        }
        string input = builder.ToString();
        _input = Encoding.UTF8.GetBytes(input);
        // Use a freshly allocated array rather than `_expected = _input` so the
        // comparison cannot short-circuit on reference equality.
        _expected = Encoding.UTF8.GetBytes(input);
        Console.WriteLine(typeof(Span<byte>).AssemblyQualifiedName);
        Console.WriteLine(typeof(Span<byte>).Assembly.Location);
    }

    [Benchmark]
    public bool SequenceEqual()
    {
        return _input.AsSpan().SequenceEqual(_expected);
    }
}

This is what I did here: #32363

Using the dotnet/performance repo (see https://github.com/dotnet/performance/blob/ca80d8e2886b583d0a69635740b188248d3d6fdd/src/benchmarks/micro/README.md#private-runtime-builds):

  1. Build dotnet/runtime master: build.cmd -subsetCategory coreclr -c Release && build.cmd -subsetCategory libraries /p:CoreCLRConfiguration=Release
  2. Create a copy of "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0" into a new directory called 5.0.0_Before.
  3. Copy the recently built relevant files from E:\GitHub\Fork\runtime\artifacts\bin\coreclr\Windows_NT.x64.Release (S.P.Corelib.dll, CoreRun.exe, etc.) into it (i.e. into 5.0.0_Before).
  4. Create another copy of "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0" into a new directory called 5.0.0_After.
  5. Make changes to the implementation in System.Private.Corelib (the optimization you are testing from this PR) and just rebuild the coreclr dlls:
    build.cmd -subsetCategory coreclr -c Release
  6. Copy the newly built dlls (including S.P.Corelib) from E:\GitHub\Fork\runtime\artifacts\bin\coreclr\Windows_NT.x64.Release into 5.0.0_After.
  7. In dotnet/performance repo: cd src\benchmarks\micro
  8. dotnet.exe run -c Release -f netcoreapp5.0 --filter SequenceEqualThreshold --corerun "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0_Before\CoreRun.exe" --artifacts "E:\results\before" && dotnet.exe run -c Release -f netcoreapp5.0 --filter SequenceEqualThreshold --corerun "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0_After\CoreRun.exe" --artifacts "E:\results\after"
  9. cd ..\..\tools\ResultsComparer
  10. dotnet run --base "E:\results\before" --diff "E:\results\after" --threshold 2%

Here are the dlls I copy/override:
[image]

If you have another, easier way to do it, please do that (and share) :) I probably made things more complicated than needed, so there's got to be a better way to do the perf measurements and speed up the inner dev loop.

Btw, @adamsitnik - the workflow instructions need to be updated. The testhost\corerun folder doesn't contain the latest built System.Private.Corelib.dll, which is why I ended up having to manually copy the new dlls to that folder.

Also, we may want to see whether removing the multiple return statements in the main public method helps (also, apparently the if-branch is the special case, so putting the common code in the else branch or outside the if might be better for perf too, so I inverted the condition). Maybe you can find other ways to optimize that as well :)

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool SequenceEqual<T>(this Span<T> span, ReadOnlySpan<T> other) where T : IEquatable<T>
{
    int length = span.Length;

    bool result = length == other.Length;

    if (!RuntimeHelpers.IsBitwiseEquatable<T>())
    {
        result = result &&
            SpanHelpers.SequenceEqual(ref MemoryMarshal.GetReference(span), ref MemoryMarshal.GetReference(other), length);
    }
    else
    {
        nuint size = (nuint)Unsafe.SizeOf<T>();
        result = result &&
            SpanHelpers.SequenceEqual(
                ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)),
                ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(other)),
                // If this multiplication overflows, the Span we got overflows the entire address range.
                // There's no happy outcome for this api in such a case so we choose not to take the overhead of checking.
                ((nuint)length) * size);
    }

    return result;
}

@ahsonkhan ahsonkhan added this to the 5.0 milestone Feb 15, 2020
{
    while (offset < lengthToExamine)
    {
        if (Unsafe.Add(ref first, (IntPtr)offset) != (Unsafe.Add(ref second, (IntPtr)offset)))
Member

@jkotas jkotas Feb 16, 2020

nint -> IntPtr casts are a performance trap. I believe the value goes to 64-bit long first on 32-bit platforms, and the 64-bit long then gets down-cast to 32 bits again using a checked cast.

Member Author

Uses the explicit operator IntPtr(int value) so should be ok? (rather than nuint, which would go via long)
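
For illustration, a minimal sketch of the two conversion paths being discussed (the method and locals are hypothetical, not from the PR):

using System;

static class IntPtrCastSketch
{
    static void Demo(uint offset)
    {
        // uint has no direct conversion operator to IntPtr, so this widens
        // implicitly to long and binds to explicit operator IntPtr(long);
        // on a 32-bit runtime that long is then narrowed back to 32 bits
        // with a checked cast.
        IntPtr viaLong = (IntPtr)offset;

        // Casting through int binds to explicit operator IntPtr(int),
        // avoiding the widen/narrow round-trip (only safe while the value
        // fits in a non-negative int).
        IntPtr viaInt = (IntPtr)(int)offset;

        Console.WriteLine($"{viaLong} {viaInt}");
    }

    static void Main() => Demo(42);
}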

Member

FYI what Jan said is the reason the UTF-8 transcoding logic uses void* as an intermediary when converting between IntPtr and (whatever integral type).

uint remainingInputBytes = (uint)(void*)Unsafe.ByteOffset(ref *pInputBuffer, ref *pFinalPosWhereCanReadDWordFromInputBuffer) + 4;

Member

You are right. This should be fine. I thought there was an unsigned/signed conversion too.

Member

But as @GrabYourPitchforks noted, it is very easy to miss the cases where it is not fine. We had a number of 32-bit-specific perf bugs because of that.

Member Author

Yeah, I previously had an issue because pointers are unsigned, so my less-than-zero tests always went the wrong way, which I hadn't expected :(

@benaadams
Member Author

benaadams commented Feb 16, 2020

@ahsonkhan I was using local copies of SequenceEqual for the comparisons (previous vs master vs PR vs byte-wise loop); master outperforms the PR on short lengths, though I am combining the two, which looks like it improves on both.

Doing it this way for a couple of reasons:

  1. Creating and passing the spans in the code below takes significantly longer than the SequenceEqual(ref, ref, nuint) method in its entirety (perhaps we need a .SequenceEqual extension on array so it can bypass span creation if you just have arrays, as string.Equals does? See the sketch after this list); whereas I wanted a fair comparison against a byte-wise loop (calling overheads being something to look at separately?)

public bool SequenceEqual()
{
   return _input.AsSpan().SequenceEqual(_expected);
}

  2. It's quite a pain to have 4 coreclrs for the 4 comparisons I'm making (as you detail above); and I'm not currently very efficient with the new runtime repo, so I have to keep looking up build steps and hunting for files, which slows down iterations.

  3. Disasming the tiered/vectorized coreclr is more tricky... whereas it's quite easy to right-click on a method in an exe using @EgorBo's Disasmo to see the asm, make some changes, hit refresh, get new asm, etc.
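
A rough sketch of what such an array extension might look like (hypothetical, not part of this PR; a real version would dispatch straight to the internal SpanHelpers.SequenceEqual(ref, ref, nuint) to actually bypass span construction, which this sketch can't reach from outside corelib):

using System;

public static class ByteArrayExtensions
{
    // Hypothetical overload taking arrays directly, analogous to how
    // string.Equals avoids span creation for its fast paths.
    public static bool SequenceEqual(this byte[] first, byte[] second)
    {
        if (ReferenceEquals(first, second))
            return true;  // same array (or both null)
        if (first is null || second is null || first.Length != second.Length)
            return false;

        // Outside corelib we can only fall back to the span overload.
        return first.AsSpan().SequenceEqual(second);
    }
}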

Also, we may want to see whether removing multiple return statements in the main public method helps:

Should branch eliminate to only 1 return?

@benaadams
Member Author

Probably, workflow-wise, 3. (iterating on the asm) is the biggest factor, as you set the bar quite high with the last PR 😄

[image]

@ahsonkhan
Member

ahsonkhan commented Feb 16, 2020

Should branch eliminate to only 1 return?

I assume so since the check is an intrinsic. How can we test/verify that is indeed the case? Is there a way to observe that in the disassembly?

I will re-run the benchmark tomorrow to verify that it has no perf impact.
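
For reference, a minimal sketch of the per-instantiation branch folding in question, using the public typeof pattern (which the JIT folds to a constant for value-type instantiations, the same way the IsBitwiseEquatable<T> intrinsic is folded); the method is illustrative only, and the [DisassemblyDiagnoser] already on the benchmark above is one way to observe the surviving single path:

using System;

static class BranchFoldSketch
{
    // For T == byte the JIT evaluates typeof(T) == typeof(byte) at compile
    // time, so only the first return survives in the generated code; the
    // other branch (and its ret) is eliminated entirely.
    static int Describe<T>() =>
        typeof(T) == typeof(byte) ? 1 : 2;

    static void Main()
    {
        Console.WriteLine(Describe<byte>());  // 1, branch folded away
        Console.WriteLine(Describe<int>());   // 2
    }
}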

@benaadams benaadams force-pushed the SequenceEqual branch 2 times, most recently from 06a3478 to 99ca277 on February 16, 2020 03:43
@benaadams
Member Author

Span passing costs seem very high?

e.g. passing two Spans from one method to another is more expensive than comparing the whole 4096-byte spans for equality? (Windows)

|                   Method | Length |      Mean |     Error |    StdDev |
|------------------------- |------- |----------:|----------:|----------:|
|       UseSequenceEqualPR |   4096 | 14.618 ns | 0.0257 ns | 0.0215 ns |
| UseSequenceEqualPRDirect |   4096 |  6.812 ns | 0.0125 ns | 0.0105 ns |
// Passing as params

[Benchmark]
[MethodImpl(MethodImplOptions.NoInlining)]
public bool UseSequenceEqualPR()
{
    return SequenceEqualPR(_input.Span, _expected.Span);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static bool SequenceEqualPR(Span<byte> input, Span<byte> expected)
{
    return input.Length == expected.Length && SequenceEqualPR(
        ref MemoryMarshal.GetReference(input),
        ref MemoryMarshal.GetReference(expected),
        (nuint)input.Length);
}

// Using direct

[Benchmark]
[MethodImpl(MethodImplOptions.NoInlining)]
public bool UseSequenceEqualPRDirect()
{
    return SequenceEqualPRDirect();
}

[MethodImpl(MethodImplOptions.NoInlining)]
private bool SequenceEqualPRDirect()
{
    Span<byte> input = _input.Span;
    Span<byte> expected = _expected.Span;

    return input.Length == expected.Length && SequenceEqualPR(
        ref MemoryMarshal.GetReference(input),
        ref MemoryMarshal.GetReference(expected),
        (nuint)input.Length);
}

@benaadams
Member Author

Updated the benchmark to show the Span passing cost: https://gist.github.com/benaadams/bf85405a5eae4c750cf6470a5506fd8d. I can make the SequenceEqual(ref, ref, nuint) method faster, but it's already less than 50% of the invocation cost of SequenceEqual<T>(this Span<T> span, Span<T> other), even up to 4096 bytes 🤔

@benaadams
Member Author

Raised an issue for the Span<byte> costs when used as parameters: #32396

@@ -1309,85 +1311,191 @@ public static unsafe int LastIndexOfAny(ref byte searchSpace, byte value0, byte

// Optimized byte-based SequenceEquals. The "length" parameter for this one is declared a nuint rather than int as we also use it for types other than byte
// where the length can exceed 2Gb once scaled by sizeof(T).
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
Member Author

As it now has Sse2 intrinsics, I removed the AggressiveOptimization attribute, which prevents them from being emitted at R2R.

Pure Vector<T> methods are blocked from R2R, so they need AggressiveOptimization to bypass the inlining restrictions at Tier0.

Note this is a regression on Arm as it will run Tier0 code; however, I couldn't find a #if to put it behind, and it should get picked up by #33308
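
As a rough sketch of the pattern (a simplified shape under stated assumptions, not the PR's actual code): an Sse2.IsSupported-guarded path can be precompiled into the R2R image because SSE2 is part of the x64 baseline, while a pure Vector<T> path cannot, since Vector<T>'s width isn't known until runtime:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class R2RSketch
{
    // Compare the first 16 bytes of two buffers with SSE2. Because SSE2 is
    // in the x64 R2R baseline, this can be compiled ahead of time without
    // needing [MethodImpl(MethodImplOptions.AggressiveOptimization)].
    public static bool First16BytesEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        if (a.Length < 16 || b.Length < 16)
            return false;  // sketch: require at least 16 bytes in each

        if (Sse2.IsSupported)
        {
            Vector128<byte> va = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref MemoryMarshal.GetReference(a));
            Vector128<byte> vb = Unsafe.ReadUnaligned<Vector128<byte>>(
                ref MemoryMarshal.GetReference(b));
            // pcmpeqb + pmovmskb: all 16 lanes match iff the mask is 0xFFFF.
            return Sse2.MoveMask(Sse2.CompareEqual(va, vb)) == 0xFFFF;
        }

        // Scalar fallback (e.g. what Arm runs at Tier0 until #33308 lands).
        return a.Slice(0, 16).SequenceEqual(b.Slice(0, 16));
    }
}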

@benaadams
Member Author

benaadams commented Mar 14, 2020

R2R version

; Assembly listing for method SpanHelpers:SequenceEqual(byref,byref,long):bool
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ReadyToRun compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] ( 11, 10   )   byref  ->  rcx        
;  ...
;* V47 tmp27        [V47    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;
; Lcl frame size = 0

G_M37173_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M37173_IG02:
       cmp      r8, 8
       jae      SHORT G_M37173_IG07
						;; bbWeight=1    PerfScore 1.25
G_M37173_IG03:
       cmp      r8, 4
       jae      SHORT G_M37173_IG06
       xor      eax, eax
       mov      r9, r8
       and      r9, 2
       test     r9, r9
       je       SHORT G_M37173_IG04
       movzx    rax, word  ptr [rcx]
       movzx    r10, word  ptr [rdx]
       sub      eax, r10d
						;; bbWeight=0.50 PerfScore 3.75
G_M37173_IG04:
       test     r8b, 1
       je       SHORT G_M37173_IG05
       movzx    rcx, byte  ptr [rcx+r9]
       movzx    rdx, byte  ptr [rdx+r9]
       sub      ecx, edx
       or       ecx, eax
       mov      eax, ecx
						;; bbWeight=0.50 PerfScore 3.00
G_M37173_IG05:
       test     eax, eax
       sete     al
       movzx    rax, al
       jmp      SHORT G_M37173_IG08
						;; bbWeight=0.50 PerfScore 1.75
G_M37173_IG06:
       add      r8, -4
       mov      eax, dword ptr [rcx]
       sub      eax, dword ptr [rdx]
       mov      ecx, dword ptr [rcx+r8]
       sub      ecx, dword ptr [rdx+r8]
       or       eax, ecx
       test     eax, eax
       sete     al
       movzx    rax, al
       jmp      SHORT G_M37173_IG08
						;; bbWeight=0.50 PerfScore 6.00
G_M37173_IG07:
       cmp      rcx, rdx
       je       SHORT G_M37173_IG09
       jmp      SHORT G_M37173_IG11
						;; bbWeight=0.50 PerfScore 1.63
G_M37173_IG08:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
G_M37173_IG09:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.13
G_M37173_IG10:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
G_M37173_IG11:
       cmp      r8, 16
       jb       SHORT G_M37173_IG14
       xor      rax, rax
       add      r8, -16
       test     r8, r8
       je       SHORT G_M37173_IG13
						;; bbWeight=0.50 PerfScore 1.50
G_M37173_IG12:
       movups   xmm0, xmmword ptr [rcx+rax]
       movups   xmm1, xmmword ptr [rdx+rax]
       pcmpeqb  xmm0, xmm1
       pmovmskb  r9d, xmm0
       cmp      r9d, 0xFFFF
       jne      SHORT G_M37173_IG15
       add      rax, 16
       cmp      r8, rax
       ja       SHORT G_M37173_IG12
						;; bbWeight=4    PerfScore 49.00
G_M37173_IG13:
       movups   xmm0, xmmword ptr [rcx+r8]
       movups   xmm1, xmmword ptr [rdx+r8]
       pcmpeqb  xmm0, xmm1
       pmovmskb  ecx, xmm0
       cmp      ecx, 0xFFFF
       jne      SHORT G_M37173_IG15
       jmp      SHORT G_M37173_IG09
						;; bbWeight=0.50 PerfScore 6.38
G_M37173_IG14:
       lea      rax, [r8-8]
       mov      r8, qword ptr [rcx]
       sub      r8, qword ptr [rdx]
       mov      rcx, qword ptr [rcx+rax]
       sub      rcx, qword ptr [rdx+rax]
       or       r8, rcx
       test     r8, r8
       sete     al
       movzx    rax, al
       jmp      SHORT G_M37173_IG08
						;; bbWeight=0.50 PerfScore 6.13
G_M37173_IG15:
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 0.13
G_M37173_IG16:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

; Total bytes of code 225, prolog size 0, PerfScore 104.63, (MethodHash=63986eca) for method SpanHelpers:SequenceEqual(byref,byref,long):bool
; ============================================================

@benaadams benaadams force-pushed the SequenceEqual branch 2 times, most recently from 858ef64 to 8111936 on March 16, 2020 00:38
@adamsitnik
Member

the workflow instructions need to be updated.

Please excuse the late response. Both the benchmarking and profiling docs were updated some time ago and are now up to date:

https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md
https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md

Please let me know if something does not work as expected.

@benaadams
Member Author

Bit more stable across sizes, but mostly similar. The biggest win I'd highlight is that it's now emitted at R2R rather than always requiring JIT.

[image]

[image]

@benaadams benaadams changed the title Use intrinsics for SequenceEqual<byte> and improve short lengths Use intrinsics for SequenceEqual<byte> vectorization to emit at R2R Mar 16, 2020
@benaadams
Member Author

/cc @GrabYourPitchforks any more to do here?

@GrabYourPitchforks GrabYourPitchforks merged commit 535b998 into dotnet:master Apr 27, 2020
@GrabYourPitchforks GrabYourPitchforks added the enhancement Product code improvement that does NOT require public API changes/additions label Apr 27, 2020
@danmoseley
Member

@tannergooding do you have time to help get this reviewed? I know @GrabYourPitchforks is fully occupied with something critical. At least one other PR is blocked on this one.

@GrabYourPitchforks
Member

@danmosemsft did you mean to comment on a different PR? This one is merged.

@benaadams benaadams deleted the SequenceEqual branch May 2, 2020 16:33
@danmoseley
Member

Doh. My goal was to unblock @benadams
#25023 (comment)

@benaadams
Member Author

Need to minimise code churn/merge clashes between the PRs; have done a cleanup PR to make it easier: #35765

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Labels
area-System.Memory enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Development

Successfully merging this pull request may close these issues.

Can Span<T>.SequenceEqual be optimized further to be faster for small buffers (buffer.Length < 5)?
7 participants