[RyuJIT] Improve heuristic for zero-initialization of locals #8890

Open
erozenfeld opened this issue Sep 6, 2017 · 17 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Milestone

Comments

@erozenfeld
Member

The heuristic the jit uses for deciding how to zero-initialize locals is very simplistic. In many cases faster sequences could be used.

Here is one example. An attempt was made to switch String.Split to use Spans (stephentoub/coreclr@500978f) in order to avoid int[] allocations. This resulted in several more temp structs being allocated and zero-initialized, which made the performance of this benchmark ~12% worse than the non-Span version:

    public static void Main()
    {
        DateTime start = DateTime.Now;

        for (int i = 0; i < 30000000; ++i)
        {
            "abc,def,ghi".Split(',');
        }

        Console.WriteLine((DateTime.Now - start).TotalMilliseconds);
    }

The current heuristic will use rep stosd in the prolog if the jit needs to initialize 16 bytes of locals (the heuristic is slightly different if there are any structs larger than 24 bytes that need to be initialized, but that's not relevant for this benchmark). As an experiment I changed the heuristic so that this benchmark uses mov instructions instead of rep stosd. With that change we get all of the perf back compared to the array version.
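For illustration only, here is a minimal C model of the two strategies being compared (the function names are hypothetical; the jit emits these as inline prolog instruction sequences, not calls):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Block-style init: one fill over the whole range of locals, which is
// roughly what the "rep stosd" prolog sequence amounts to. Small code,
// but it carries fixed startup overhead that dominates for small frames.
static void zero_block(void *locals, size_t bytes) {
    memset(locals, 0, bytes);
}

// Unrolled init: one 8-byte store per slot, matching a run of
// "mov qword ptr [rbp+...], rax" instructions. Larger code, but no
// rep-stos startup cost for the handful of slots in this benchmark.
static void zero_unrolled(uint64_t *slots, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        slots[i] = 0;
    }
}
```

Both produce identical results; the heuristic's job is to pick the cheaper one for the frame at hand.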

Here are the two initialization sequences:

       488BF1               mov      rsi, rcx
       488D7D08             lea      rdi, [rbp+08H]
       B912000000           mov      ecx, 18
       33C0                 xor      rax, rax
       F3AB                 rep stosd
       488BCE               mov      rcx, rsi

       33C0                 xor      rax, rax
       48894538             mov      qword ptr [rbp+38H], rax
       48894540             mov      qword ptr [rbp+40H], rax
       48894530             mov      qword ptr [rbp+30H], rax
       48894528             mov      qword ptr [rbp+28H], rax
       48894518             mov      qword ptr [rbp+18H], rax
       48894520             mov      qword ptr [rbp+20H], rax
       48894508             mov      qword ptr [rbp+08H], rax
       48894548             mov      qword ptr [rbp+48H], rax

While the second sequence is faster than the first one, we can probably do even better with xmm registers.
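For a sense of what the xmm variant could look like, here is a hedged sketch in C using SSE2 intrinsics (illustrative only, not what the jit emits; `zero_with_xmm` is a hypothetical name):

```c
#include <emmintrin.h> // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

// Materialize zero once (pxor xmm, xmm), then store 16 bytes at a time
// (movdqu). `bytes` is assumed to be a multiple of 16; unaligned stores
// are used so the frame need not be 16-byte aligned.
static void zero_with_xmm(uint8_t *p, size_t bytes) {
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16) {
        _mm_storeu_si128((__m128i *)(p + i), zero);
    }
}
```

Compared to the eight 8-byte mov stores above, this would cover the same 64 bytes in four stores; ymm (AVX) stores would halve that again.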

The jit normally favors size over speed, so the block init sequence may be preferred in many cases, but we should at least use IBC data, when available, to drive this heuristic.

category:cq
theme:zero-init
skill-level:expert
cost:medium

@erozenfeld erozenfeld self-assigned this Sep 6, 2017
@redknightlois

There is another case, which I have hit. When you stackalloc a sizeable byte array, the rep stosd performance hit can be tremendous when you don't actually need the byte array to be zeroed at all. Being able to hint the JIT about this would open up a whole array of potential optimizations.

@erozenfeld
Member Author

The stackalloc issue is tracked by https://github.com/dotnet/coreclr/issues/1279. For stackalloc, the jit has to do zero-initialization if initlocals is set. Roslyn is considering making changes in this area. We also plan to add an option to ILLink to clear initlocals.

@AndyAyersMS
Member

Also note that stos kills rcx, which is almost always live, so we are forced to shuffle it out and back. Since we then generally copy rcx into a caller-saved register, we often see 2 redundant moves after the stos.

@AndyAyersMS
Member

On SysV both rdi and rcx are incoming arg regs, so things are even worse there.

@EgorBo
Member

EgorBo commented Sep 16, 2019

A small benchmark:

public class StackAlloc_VarSize
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static unsafe void Process(byte* data) {} // fake data processor

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static unsafe void DoWork(int size)
    {
        int local = size + 1; // triggers "InitLocals"
        byte* data = stackalloc byte[local];
        Process(data);
    }

    [Benchmark]
    [Arguments(2)]
    [Arguments(10)]
    [Arguments(63)]
    [Arguments(100)]
    [Arguments(256)]
    [Arguments(1000)]
    [Arguments(10000)]
    [Arguments(1_000_000)]
    public void StackallocBenchmark(int arraySize)
    {
        DoWork(arraySize);
    }
}

Mono-LLVM (emits @llvm.memset intrinsic to zero data)

|              Method | arraySize |          Mean |       Error |      StdDev |
|-------------------- |---------- |--------------:|------------:|------------:|
| StackallocBenchmark |         2 |      4.695 ns |   0.0063 ns |   0.0049 ns |
| StackallocBenchmark |        10 |      4.695 ns |   0.0068 ns |   0.0063 ns |
| StackallocBenchmark |        63 |      6.124 ns |   0.0113 ns |   0.0106 ns |
| StackallocBenchmark |       100 |      7.441 ns |   0.0118 ns |   0.0092 ns |
| StackallocBenchmark |       256 |      8.631 ns |   0.0118 ns |   0.0111 ns |
| StackallocBenchmark |      1000 |     18.181 ns |   0.0255 ns |   0.0226 ns |
| StackallocBenchmark |     10000 |    168.087 ns |   0.1464 ns |   0.1297 ns |
| StackallocBenchmark |   1000000 | 27,269.711 ns | 137.7818 ns | 128.8811 ns |

RyuJIT:

|              Method | arraySize |          Mean |      Error |     StdDev |
|-------------------- |---------- |--------------:|-----------:|-----------:|
| StackallocBenchmark |         2 |      3.599 ns |  0.0027 ns |  0.0024 ns |
| StackallocBenchmark |        10 |      3.599 ns |  0.0021 ns |  0.0019 ns |
| StackallocBenchmark |        63 |      5.151 ns |  0.0019 ns |  0.0018 ns |
| StackallocBenchmark |       100 |      6.696 ns |  0.0039 ns |  0.0030 ns |
| StackallocBenchmark |       256 |     11.828 ns |  0.0075 ns |  0.0071 ns |
| StackallocBenchmark |      1000 |     35.583 ns |  0.1077 ns |  0.0955 ns |
| StackallocBenchmark |     10000 |    329.660 ns |  0.4879 ns |  0.4564 ns |
| StackallocBenchmark |   1000000 | 40,438.375 ns | 15.8948 ns | 14.0903 ns |

So it seems @llvm.memset does a better job for arrays > 100 items (about twice as fast).

Ubuntu 18, Core i7 4930K (Ivy Bridge)

cc @jkotas @erozenfeld

@danmoseley
Member

Given the recent AMD-specific memset regression we fixed (dotnet/coreclr#25763), is it worth a test on AMD as well?
cc @tannergooding .

@EgorBo
Member

EgorBo commented Sep 16, 2019

heh, I wish I had an AMD cpu to test 🙂

@GrabYourPitchforks
Member

Is there a plan to address this for .NET Core 5? We're still working around this in a few places in the code - some of which involve duplicating entire methods with slightly different signatures - and it would be great to be able to remove our workarounds.

@tannergooding
Member

We're still working around this in a few places in the code

I thought we were stripping the .locals init bit for the runtime/framework assemblies?

@GrabYourPitchforks
Member

We do, but .locals init doesn't do much when your locals are Span<T> and other reference-containing types.

@erozenfeld
Member Author

We're still working around this in a few places in the code

Can you list those places in this issue? That will help validate heuristic changes when we get to this.

@GrabYourPitchforks
Member

GrabYourPitchforks commented Jan 24, 2020

@erozenfeld sure - check the search at https://github.com/dotnet/runtime/search?q=8890

@benaadams
Member

benaadams commented Feb 16, 2020

Not sure if I'm reading this right, but assuming the stack is not 32-byte aligned (when it is, that's a rThroughput of 1 clock per 32 bytes), rep stosd can have a rThroughput of 8 clock cycles per 16 bytes, whereas MOVDQ has a rThroughput of 0.5 clocks per 16 bytes (or 0.5 clocks per 32 bytes for ymm)?

@benaadams
Member

Looks like it always uses rep stosd and isn't based on a min threshold?

GetEmitter()->emitIns_R_AR(INS_lea, EA_PTRSIZE, REG_EDI, genFramePointerReg(), untrLclLo);
regSet.verifyRegUsed(REG_EDI);
inst_RV_IV(INS_mov, REG_ECX, (untrLclHi - untrLclLo) / sizeof(int), EA_4BYTE);
instGen_Set_Reg_To_Zero(EA_PTRSIZE, REG_EAX);
instGen(INS_r_stosd);

@benaadams
Member

Ah, my mistake, it's

genUseBlockInit = (genInitStkLclCnt > (largeGcStructs + 8));

Which given

unsigned largeGcStructs = 0; // The number of "large" structs with GC pointers. Used as part of the heuristic to
// determine whether to use block init.

Is that when it's > 8 bytes?
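As a pure-function sketch of that decision (names follow the snippet above; note that the counts appear to be tallies of slots/locals needing initialization rather than raw byte sizes, so the threshold is not literally 8 bytes):

```c
// Hedged model of the heuristic quoted above: block init (rep stos) is
// chosen only when the count of things needing zero-init exceeds a small
// threshold that grows with the number of "large" GC structs.
static int use_block_init(unsigned init_stk_lcl_cnt, unsigned large_gc_structs) {
    return init_stk_lcl_cnt > (large_gc_structs + 8);
}
```

So with no large GC structs the crossover sits at a count of 8, and each large GC struct (capped at 4 in the counting code) raises it by one.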

@benaadams
Member

benaadams commented Feb 17, 2020

Had a go #32442 for x64

@erozenfeld
Member Author

Part of the code that decides whether to use block initialization is unreachable. I put it under #if 0 in case we decide to revive that logic. I hope to get that cleaned up soon.

    // TODO-Review: The code below is currently unreachable. We are guaranteed to execute one of the
    // 'continue' statements above.
#if 0
    /* If we don't know lifetimes of variables, must be conservative */
    if (!compiler->backendRequiresLocalVarLifetimes())
    {
        varDsc->lvMustInit = true;
        noway_assert(!varDsc->lvRegister);
    }
    else
    {
        if (!varDsc->lvTracked)
        {
            varDsc->lvMustInit = true;
        }
    }

    /* Is this a 'must-init' stack pointer local? */
    if (varDsc->lvMustInit && varDsc->lvOnFrame && !counted)
    {
        if (varDsc->TypeGet() == TYP_STRUCT)
        {
            initStkLclCnt += varDsc->GetLayout()->GetGCPtrCount();
        }
        else
        {
            assert(varTypeIsGC(varDsc->TypeGet()));
            initStkLclCnt += 1;
        }
        counted = true;
    }

    if ((compiler->lvaLclSize(varNum) > (3 * TARGET_POINTER_SIZE)) && (largeGcStructs <= 4))
    {
        largeGcStructs++;
    }
#endif
