[RyuJIT] Improve heuristic for zero-initialization of locals #8890
There is another case which I have hit: when you stackalloc a sizeable byte array, the …
The stackalloc issue is tracked by https://github.com/dotnet/coreclr/issues/1279. For stackalloc, the jit has to do zero-initialization if initlocals is set. Roslyn is considering making changes in this area. We also plan to add an option to ILLink to clear initlocals.
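For context, here is a minimal sketch (class and method names are hypothetical) of the pattern this comment describes: with initlocals set, the jit zeroes the whole stackalloc buffer in the prolog, even though every byte is overwritten before it is read, so the zeroing is pure overhead.

```csharp
using System;

class ZeroInitSketch
{
    // With initlocals set (the C# compiler default), the jit must zero
    // the entire stackalloc buffer in the prolog before this code runs.
    static int Fill(int size)
    {
        Span<byte> data = stackalloc byte[size];
        for (int i = 0; i < size; i++)
            data[i] = (byte)i; // overwrites the prolog zeroing anyway
        return data[size - 1];
    }

    static void Main() => Console.WriteLine(Fill(10)); // prints 9
}
```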
Also note …
On SysV both …
A small benchmark:

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class StackAlloc_VarSize
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static unsafe void Process(byte* data) { } // fake data processor

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static unsafe void DoWork(int size)
    {
        int local = size + 1; // triggers "InitLocals"
        byte* data = stackalloc byte[local];
        Process(data);
    }

    [Benchmark]
    [Arguments(2)]
    [Arguments(10)]
    [Arguments(63)]
    [Arguments(100)]
    [Arguments(256)]
    [Arguments(1000)]
    [Arguments(10000)]
    [Arguments(1_000_000)]
    public void StackallocBenchmark(int arraySize)
    {
        DoWork(arraySize);
    }
}
```

Mono-LLVM (emits @llvm.memset intrinsic to zero data):
RyuJIT:
So it seems @llvm.memset does a better job for arrays > 100 items (twice as fast). Ubuntu 18, Core i7 4930K (Ivy Bridge)
Given the recent AMD-specific memset regression we fixed (dotnet/coreclr#25763), is it worth a test on AMD as well?
heh, I wish I had an AMD cpu to test 🙂
Is there a plan to address this for .NET Core 5? We're still working around this in a few places in the code (some of which involve duplicating entire methods with slightly different signatures) and it would be great for us to be able to remove our workarounds.
I thought we were stripping the …
We do, but …
Can you list those places in this issue? That will help validate heuristic changes when we get to this.
@erozenfeld sure - check the search at https://github.com/dotnet/runtime/search?q=8890
Not sure if I'm reading this right, but assuming the stack is not 32-byte aligned (when it has an rThroughput of 1 clock per 32 bytes); then …
Looks to always use runtime/src/coreclr/src/jit/codegencommon.cpp lines 6349-6354 (at 17c6c26).
Ah, my mistake, it's runtime/src/coreclr/src/jit/codegencommon.cpp line 4744 (at 17c6c26).
Which, given runtime/src/coreclr/src/jit/codegencommon.cpp lines 4544-4545 (at 17c6c26), is when it's > 8 bytes?
Had a go at this in #32442 for x64.
Part of the code that decides whether to use block initialization is unreachable. I put it under #if 0 in case we decide to revive that logic. I hope to get that cleaned up soon. (runtime/src/coreclr/src/jit/codegencommon.cpp lines 4726-4763 at c6f540b)
The heuristic the jit uses to decide how to zero-initialize locals is very simplistic. In many cases faster sequences can be used.
Here is one example. An attempt was made to switch String.Split to use Spans (stephentoub/coreclr@500978f) in order to avoid int[] allocations. This resulted in several more temp structs being allocated and zero-initialized, which made the performance of this benchmark ~12% worse than the non-Span version:
The current heuristic will use rep stosd in the prolog if the jit needs to initialize 16 bytes of locals (the heuristic is slightly different if there are any structs larger than 24 bytes that need to be initialized, but that's not relevant for this benchmark). As an experiment I changed the heuristic so that rep stosd isn't used for this benchmark and mov instructions are used instead. With that change we get all of the perf back compared to the array version.
Here are the two initialization sequences:
While the second sequence is faster than the first one, we can probably do even better with xmm registers.
The jit normally favors size over speed, so the block init sequence may be preferred in many cases, but we should at least use IBC data when available to drive this heuristic.
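As a rough illustration of the trade-off above (this is hand-written C#, not the jit's actual codegen), clearing 16 bytes of locals can be expressed either as one block clear or as two independent 8-byte stores, which is essentially what the rep stosd vs. mov choice amounts to; the struct and names here are hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;

class InitStrategySketch
{
    [StructLayout(LayoutKind.Sequential, Size = 16)]
    struct Locals16
    {
        public long A;
        public long B;
    }

    static void Main()
    {
        // "Block init" style: clear all 16 bytes in one operation
        // (the rep stosd prolog does the moral equivalent of this).
        Locals16 block = default;

        // "mov" style: two independent 8-byte stores, which the
        // experiment in this issue found faster for small sizes.
        Locals16 movs;
        movs.A = 0;
        movs.B = 0;

        Console.WriteLine(block.A + block.B + movs.A + movs.B); // prints 0
    }
}
```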
category:cq
theme:zero-init
skill-level:expert
cost:medium