Fix unrolling for LocateLastFoundChar and LocateLastFoundByte #46977

SingleAccretion · 2021-01-14T11:09:30Z

Despite what the comments say, the pattern employed by these methods is not recognized by the Jit, because the loop will not have a flag LPFLG_SIMD_LIMIT set on it, which only happens for loops with limits that have GTF_ICON_SIMD_COUNT set on them, which in turn is only set for Vector_.Count nodes during importation.

This can be confirmed by looking at the assembly:

; Method LocateLastFoundChar(System.Numerics.Vector`1[UInt16]):int
G_M37433_IG01:
       sub      rsp, 56
       vzeroupper 
G_M37433_IG02:
       vmovupd  xmm0, xmmword ptr [rcx]
       mov      eax, 1
G_M37433_IG03:
       cmp      eax, 2
       jae      SHORT G_M37433_IG06
       vmovupd  xmmword ptr [rsp+28H], xmm0
       mov      rdx, qword ptr [rsp+8*rax+28H]
       test     rdx, rdx
       jne      SHORT G_M37433_IG04
       dec      eax
       test     eax, eax
       jge      SHORT G_M37433_IG03
G_M37433_IG04:
       or       rdx, 1
       bsr      rdx, rdx
       sar      edx, 4
       lea      eax, [rdx+4*rax]
G_M37433_IG05:
       add      rsp, 56
       ret      
G_M37433_IG06:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code: 68

This PR fixes the issue by introducing a new loop counter that satisfies the Jit's requirements. I have also collected a small benchmark (source can be found here: https://gist.github.com/SingleAccretion/d17ef1b0e885b50fffed47c82afdc29e). This is only for illustrative/sanity checking purposes, as:

It seems to be sensitive to alignment.
It was collected on a machine without AVX2.
It is a microbenchmark of a method that will be inlined into another method and never be used on its own.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4820K CPU 3.70GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
[Host] : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT

Method	match	Mean	Error	StdDev	Median	Ratio	RatioSD	Code Size
FindLastCharOld	<0, 0(...)5535> [28]	0.7789 ns	0.0521 ns	0.0512 ns	0.7866 ns	1.24	0.15	68 B
FindLastCharNew	<0, 0(...)5535> [28]	0.6026 ns	0.0501 ns	0.0702 ns	0.6168 ns	1.00	0.00	63 B

FindLastCharOld	<6553(...)0, 0> [28]	1.6021 ns	0.0624 ns	0.0584 ns	1.6234 ns	2.05	0.17	68 B
FindLastCharNew	<6553(...)0, 0> [28]	0.7893 ns	0.0534 ns	0.0572 ns	0.7439 ns	1.00	0.00	63 B

MultimodalDistribution
LocateLastFoundCharBenchmarks.FindLastCharOld: Default -> It seems that the distribution is bimodal (mValue = 3.25)
LocateLastFoundCharBenchmarks.FindLastCharNew: Default -> It seems that the distribution is bimodal (mValue = 3.29)
LocateLastFoundCharBenchmarks.FindLastCharNew: Default -> It seems that the distribution can have several modes (mValue = 3.09)

Notes:

Placing i-- at the end of the loop body is deliberate: for (int j = 0; j < Vector<ulong>.Count; j++, i--) breaks the recognition of the pattern (for (int j = 0; j < Vector<ulong>.Count; i--, j++) works).
This change should result in machine code size regression on AVX2-enabled machines. Sharplab suggests it will not be significant.
There is some duplication of code between the Byte and Char versions which only differ in their last line. The methods could be folded into one without performance loss. However, this would disrupt the nice "Char/Byte" pattern that the source files have. My personal opinion here would be that it's not worth it.
It could be argued that fixing the source here is wrong, and we should really fix the Jit. I think that's a fair opinion, at the same time, fixing the Jit here seems to be non-trivial, and a cursory look on source.dot.net for places where Vector<T>.Count is used suggests that this problematic pattern is not widespread (only the fixed methods show up). The unrolling today is laser-focused on the for (int i = 0; i < Vector<T>.Count; i++) case, and I think it makes sense to leave it that way until more significant investment in the area is determined to be needed.

…Helpers

stephentoub · 2021-01-15T15:29:13Z

cc: @tannergooding, @GrabYourPitchforks, @benaadams

ghost · 2021-01-20T19:29:07Z

Hello @tannergooding!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (`@msftbot`) and give me an instruction to get started! Learn more here.

stephentoub · 2021-01-25T03:34:17Z

It's great that we could workaround this, but is this something that should have worked via the JIT? i.e. is there an issue somewhere tracking making the original pattern "just work"?

tannergooding · 2021-01-25T05:32:44Z

@stephentoub, we have a few issues tracking loop unrolling support (see under Loop Unrolling): #43549
Today, it's basically limited to a single pattern involving 0 to *.Count where * is Vector<T>, Vector64<T>, Vector128<T>, or Vector256<T>.

Fix unrolling for LocateLastFoundChar and LocateLastFoundByte in Span…

e77e575

…Helpers

dotnet-issue-labeler bot added the area-System.Memory label Jan 14, 2021

tannergooding approved these changes Jan 15, 2021

View reviewed changes

GrabYourPitchforks approved these changes Jan 20, 2021

View reviewed changes

tannergooding added the auto-merge label Jan 20, 2021

ghost merged commit ac19cae into dotnet:master Jan 20, 2021

SingleAccretion deleted the Fix-unrolling-in-SpanHelpers branch February 10, 2021 20:12

ghost locked as resolved and limited conversation to collaborators Mar 12, 2021

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix unrolling for LocateLastFoundChar and LocateLastFoundByte #46977

Fix unrolling for LocateLastFoundChar and LocateLastFoundByte #46977

SingleAccretion commented Jan 14, 2021 •

edited

Loading

stephentoub commented Jan 15, 2021

ghost commented Jan 20, 2021

stephentoub commented Jan 25, 2021

tannergooding commented Jan 25, 2021

Fix unrolling for LocateLastFoundChar and LocateLastFoundByte #46977

Fix unrolling for LocateLastFoundChar and LocateLastFoundByte #46977

Conversation

SingleAccretion commented Jan 14, 2021 • edited Loading

stephentoub commented Jan 15, 2021

ghost commented Jan 20, 2021

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

stephentoub commented Jan 25, 2021

tannergooding commented Jan 25, 2021

SingleAccretion commented Jan 14, 2021 •

edited

Loading

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (`@msftbot`) and give me an instruction to get started! Learn more here.