[RyuJIT] Improve heuristic for zero-initialization of locals #8890

Open
erozenfeld opened this issue Sep 6, 2017 · 17 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Milestone

Comments

@erozenfeld
Member

The heuristic the jit uses for deciding how to zero-initialize locals is very simplistic. In many cases faster sequences could be used.

Here is one example. An attempt was made to switch String.Split to use Spans (stephentoub/coreclr@500978f) in order to avoid int[] allocations. This resulted in several more temp structs being allocated and zero-initialized, which made the performance of this benchmark ~12% worse than the non-Span version:

    public static void Main()
    {
        DateTime start = DateTime.Now;

        for (int i = 0; i < 30000000; ++i)
        {
            "abc,def,ghi".Split(',');
        }

        Console.WriteLine((DateTime.Now - start).TotalMilliseconds);
    }

The current heuristic will use rep stosd in the prolog if the jit needs to initialize 16 bytes of locals (the heuristic is slightly different if there are any structs larger than 24 bytes that need to be initialized, but that's not relevant for this benchmark). As an experiment I changed the heuristic so that this benchmark uses mov instructions instead of rep stosd. With that change we get all of the perf back compared to the array version.
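For illustration only, here is a minimal C model of the two strategies being compared (the function names are hypothetical; the jit emits these as inline prolog instruction sequences, not calls):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Block-style init: one fill over the whole range of locals, which is
// roughly what the "rep stosd" prolog sequence amounts to. Small code,
// but it carries fixed startup overhead that dominates for small frames.
static void zero_block(void *locals, size_t bytes) {
    memset(locals, 0, bytes);
}

// Unrolled init: one 8-byte store per slot, matching a run of
// "mov qword ptr [rbp+...], rax" instructions. Larger code, but no
// rep-stos startup cost for the handful of slots in this benchmark.
static void zero_unrolled(uint64_t *slots, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        slots[i] = 0;
    }
}
```

Both produce identical results; the heuristic's job is to pick the cheaper one for the frame at hand.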

Here are the two initialization sequences:

       488BF1               mov      rsi, rcx
       488D7D08             lea      rdi, [rbp+08H]
       B912000000           mov      ecx, 18
       33C0                 xor      rax, rax
       F3AB                 rep stosd
       488BCE               mov      rcx, rsi

       33C0                 xor      rax, rax
       48894538             mov      qword ptr [rbp+38H], rax
       48894540             mov      qword ptr [rbp+40H], rax
       48894530             mov      qword ptr [rbp+30H], rax
       48894528             mov      qword ptr [rbp+28H], rax
       48894518             mov      qword ptr [rbp+18H], rax
       48894520             mov      qword ptr [rbp+20H], rax
       48894508             mov      qword ptr [rbp+08H], rax
       48894548             mov      qword ptr [rbp+48H], rax

While the second sequence is faster than the first one, we can probably do even better with xmm registers.
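For a sense of what the xmm variant could look like, here is a hedged sketch in C using SSE2 intrinsics (illustrative only, not what the jit emits; `zero_with_xmm` is a hypothetical name):

```c
#include <emmintrin.h> // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

// Materialize zero once (pxor xmm, xmm), then store 16 bytes at a time
// (movdqu). `bytes` is assumed to be a multiple of 16; unaligned stores
// are used so the frame need not be 16-byte aligned.
static void zero_with_xmm(uint8_t *p, size_t bytes) {
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16) {
        _mm_storeu_si128((__m128i *)(p + i), zero);
    }
}
```

Compared to the eight 8-byte mov stores above, this would cover the same 64 bytes in four stores; ymm (AVX) stores would halve that again.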

The jit normally favors size over speed, so the block init sequence may be preferred in many cases, but we should at least use IBC data, when available, to drive this heuristic.

category:cq
theme:zero-init
skill-level:expert
cost:medium

@erozenfeld erozenfeld self-assigned this Sep 6, 2017
@redknightlois

There is another case, which I have hit. When you stackalloc a sizeable byte array, the rep stosd performance hit can be tremendous when you don't actually need the byte array to be zeroed at all. Being able to hint the JIT about this would open up a whole array of potential optimizations.

@erozenfeld
Member Author

The stackalloc issue is tracked by https://github.com/dotnet/coreclr/issues/1279. For stackalloc, the jit has to do zero-initialization if initlocals is set. Roslyn is considering making changes in this area. We also plan to add an option to ILLink to clear initlocals.

@AndyAyersMS
Member

Also note that stos kills rcx, which is almost always live, so we are forced to shuffle it out and back. Since we then generally copy rcx into a caller-saved register, we often see 2 redundant moves after the stos.

@AndyAyersMS
Member

On SysV both rdi and rcx are incoming arg regs, so things are even worse there.

@EgorBo
Member

EgorBo commented Sep 16, 2019

A small benchmark:

public class StackAlloc_VarSize
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static unsafe void Process(byte* data) {} // fake data processor

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static unsafe void DoWork(int size)
    {
        int local = size + 1; // triggers "InitLocals"
        byte* data = stackalloc byte[local];
        Process(data);
    }

    [Benchmark]
    [Arguments(2)]
    [Arguments(10)]
    [Arguments(63)]
    [Arguments(100)]
    [Arguments(256)]
    [Arguments(1000)]
    [Arguments(10000)]
    [Arguments(1_000_000)]
    public void StackallocBenchmark(int arraySize)
    {
        DoWork(arraySize);
    }
}

Mono-LLVM (emits @llvm.memset intrinsic to zero data)

|              Method | arraySize |          Mean |       Error |      StdDev |
|-------------------- |---------- |--------------:|------------:|------------:|
| StackallocBenchmark |         2 |      4.695 ns |   0.0063 ns |   0.0049 ns |
| StackallocBenchmark |        10 |      4.695 ns |   0.0068 ns |   0.0063 ns |
| StackallocBenchmark |        63 |      6.124 ns |   0.0113 ns |   0.0106 ns |
| StackallocBenchmark |       100 |      7.441 ns |   0.0118 ns |   0.0092 ns |
| StackallocBenchmark |       256 |      8.631 ns |   0.0118 ns |   0.0111 ns |
| StackallocBenchmark |      1000 |     18.181 ns |   0.0255 ns |   0.0226 ns |
| StackallocBenchmark |     10000 |    168.087 ns |   0.1464 ns |   0.1297 ns |
| StackallocBenchmark |   1000000 | 27,269.711 ns | 137.7818 ns | 128.8811 ns |

RyuJIT:

|              Method | arraySize |          Mean |      Error |     StdDev |
|-------------------- |---------- |--------------:|-----------:|-----------:|
| StackallocBenchmark |         2 |      3.599 ns |  0.0027 ns |  0.0024 ns |
| StackallocBenchmark |        10 |      3.599 ns |  0.0021 ns |  0.0019 ns |
| StackallocBenchmark |        63 |      5.151 ns |  0.0019 ns |  0.0018 ns |
| StackallocBenchmark |       100 |      6.696 ns |  0.0039 ns |  0.0030 ns |
| StackallocBenchmark |       256 |     11.828 ns |  0.0075 ns |  0.0071 ns |
| StackallocBenchmark |      1000 |     35.583 ns |  0.1077 ns |  0.0955 ns |
| StackallocBenchmark |     10000 |    329.660 ns |  0.4879 ns |  0.4564 ns |
| StackallocBenchmark |   1000000 | 40,438.375 ns | 15.8948 ns | 14.0903 ns |

So it seems @llvm.memset does a better job for arrays > 100 items (about twice as fast).

Ubuntu 18, Core i7 4930K (Ivy Bridge)

cc @jkotas @erozenfeld

@danmoseley
Member

Given the recent AMD-specific memset regression we fixed (dotnet/coreclr#25763), is it worth a test on AMD as well?
cc @tannergooding .

@EgorBo
Member

EgorBo commented Sep 16, 2019

heh, I wish I had an AMD cpu to test 🙂

@GrabYourPitchforks
Member

Is there a plan to address this for .NET Core 5? We're still working around this in a few places in the code - some of which involve duplicating entire methods with slightly different signatures - and it would be great to be able to remove our workarounds.

@tannergooding
Member

We're still working around this in a few places in the code

I thought we were stripping the .locals init bit for the runtime/framework assemblies?

@GrabYourPitchforks
Member

We do, but .locals init doesn't do much when your locals are Span<T> and other reference-containing types.

@erozenfeld
Member Author

We're still working around this in a few places in the code

Can you list those places in this issue? That will help validate heuristic changes when we get to this.

@GrabYourPitchforks
Member

GrabYourPitchforks commented Jan 24, 2020

@erozenfeld sure - check the search at https://github.com/dotnet/runtime/search?q=8890

@benaadams
Member

benaadams commented Feb 16, 2020

Not sure if I'm reading this right, but assuming the stack is not 32-byte aligned (when it is, that's a rThroughput of 1 clock per 32 bytes), rep stosd can have a rThroughput of 8 clock cycles per 16 bytes, whereas MOVDQ has a rThroughput of 0.5 clocks per 16 bytes (or 0.5 clocks per 32 bytes for ymm)?

@benaadams
Member

Looks like it always uses rep stosd and isn't based on a min threshold?

GetEmitter()->emitIns_R_AR(INS_lea, EA_PTRSIZE, REG_EDI, genFramePointerReg(), untrLclLo);
regSet.verifyRegUsed(REG_EDI);
inst_RV_IV(INS_mov, REG_ECX, (untrLclHi - untrLclLo) / sizeof(int), EA_4BYTE);
instGen_Set_Reg_To_Zero(EA_PTRSIZE, REG_EAX);
instGen(INS_r_stosd);

@benaadams
Member

Ah, my mistake, it's

genUseBlockInit = (genInitStkLclCnt > (largeGcStructs + 8));

Which given

unsigned largeGcStructs = 0; // The number of "large" structs with GC pointers. Used as part of the heuristic to
// determine whether to use block init.

Is that when it's > 8 bytes?
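As a pure-function sketch of that decision (names follow the snippet above; note that the counts appear to be tallies of slots/locals needing initialization rather than raw byte sizes, so the threshold is not literally 8 bytes):

```c
// Hedged model of the heuristic quoted above: block init (rep stos) is
// chosen only when the count of things needing zero-init exceeds a small
// threshold that grows with the number of "large" GC structs.
static int use_block_init(unsigned init_stk_lcl_cnt, unsigned large_gc_structs) {
    return init_stk_lcl_cnt > (large_gc_structs + 8);
}
```

So with no large GC structs the crossover sits at a count of 8, and each large GC struct (capped at 4 in the counting code) raises it by one.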

@benaadams
Member

benaadams commented Feb 17, 2020

Had a go #32442 for x64

@erozenfeld
Member Author

Part of the code that decides whether to use block initialization is unreachable. I put it under #if 0 in case we decide to revive that logic. I hope to get that cleaned up soon.

    // TODO-Review: The code below is currently unreachable. We are guaranteed to execute one of the
    // 'continue' statements above.
#if 0
    /* If we don't know lifetimes of variables, must be conservative */
    if (!compiler->backendRequiresLocalVarLifetimes())
    {
        varDsc->lvMustInit = true;
        noway_assert(!varDsc->lvRegister);
    }
    else
    {
        if (!varDsc->lvTracked)
        {
            varDsc->lvMustInit = true;
        }
    }

    /* Is this a 'must-init' stack pointer local? */
    if (varDsc->lvMustInit && varDsc->lvOnFrame && !counted)
    {
        if (varDsc->TypeGet() == TYP_STRUCT)
        {
            initStkLclCnt += varDsc->GetLayout()->GetGCPtrCount();
        }
        else
        {
            assert(varTypeIsGC(varDsc->TypeGet()));
            initStkLclCnt += 1;
        }
        counted = true;
    }

    if ((compiler->lvaLclSize(varNum) > (3 * TARGET_POINTER_SIZE)) && (largeGcStructs <= 4))
    {
        largeGcStructs++;
    }
#endif
