-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Improve struct promotion for 256-bit SIMD fields #19663
Conversation
Do you have some assembly diffs you can share? |
src/jit/lclvars.cpp
Outdated
const int MaxOffset = MAX_NumOfFieldsInPromotableStruct * XMM_REGSIZE_BYTES; | ||
// This will allow promotion of 4 Vector<T> fields on AVX2 or Vector256<T> on AVX, | ||
// or 8 Vector<T>/Vector128<T> fields on SSE2. | ||
const int MaxOffset = MAX_NumOfFieldsInPromotableStruct * YMM_REGSIZE_BYTES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this impact machines without AVX support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not detect any impact from the Vector3 benchmark.
I have run jit-diff on this change, and it shows no any difference in corelib/tests/frameworks. Although jit-diff uses crossgen that does not work with SIMD code, we can say this change has no impact on the current scalar code base. I also measured RayTracer (Vector3 benchmark), which has no execution time regression. |
Have you also tried to get the pmi diffs? CC. @AndyAyersMS |
Will try later, but there seems no managed code with more than 4 SIMD16 or 2 SIMD32 struct fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@dotnet/jit-contrib - I'd like to have another JIT dev weigh in on this. |
It might be useful/interesting to create a simple 5x4 matrix struct and see what the codegen diff looks like. Just because CoreFX doesn't have any code that leverages it, doesn't mean other libraries don't (and we don't want to accidentally regress them). |
@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test |
Because we generally only promote structs with primitive typed fields it's hard to get suitably large field offsets for structs with small numbers of fields. Aside from SIMD it would require a fixed field or an explicit layout. And I would guess we don't have very many of these cases floating around in the framework code (otherwise we might have spotted #19149 sooner). So you should try PMI across the test suite, but even there jit-diffs won't look as broadly as one might hope. We could also try an SPMI run on desktop I suppose. |
@CarolEidt @AndyAyersMS @tannergooding I have run pmi diff, no difference
|
@fiigii, was this just for CoreCLR or did you also try the PMI diffs for the tests, CoreFX, and various benchmarks we have? |
Yes, I ran pmi diff on corelib/tests/frameworks/benchmarks (no diff from all of them). How to run jit-diff on CoreFX? |
@fiigii can you post the last line of the analysis, showing how many methods were examined, eg something like
because from the above it looks like things ran too fast and maybe didn't look at any methods at all. |
Or maybe you already did? And the number is zero? It should be ~380K. To run PMI in its most general mode, make sure you've built the tests, and then do something like this (note the
The summary should start with something like:
|
@AndyAyersMS thanks for the guides, will re-run to make sure. |
@AndyAyersMS @CarolEidt @tannergooding I re-ran the PMI diff, it showed some difference (improvement). Corssgen diff still has no any diff. The new PMI diff result should be correct. Corelib (no diff):
Tests (improvement):
Frameworks (improvement only)
|
The small regressions in the above test pmi diff are mainly from expanding -mov rbp, rax
-lea rcx, bword ptr [rbp+8]
-lea rdx, bword ptr [rsp+C0H]
-lea rdx, bword ptr [rsp+C0H]
-mov r8d, 128
-call CORINFO_HELP_MEMCPY
+lea r8, bword ptr [rax+8]
+vmovupd ymm0, ymmword ptr[rsp+C0H]
+vmovupd ymmword ptr[r8], ymm0
+vmovupd ymm0, ymmword ptr[rsp+E0H]
+vmovupd ymmword ptr[r8+32], ymm0
+vmovupd ymm0, ymmword ptr[rsp+100H]
+vmovupd ymmword ptr[r8+64], ymm0
+vmovupd ymm0, ymmword ptr[rsp+120H]
+vmovupd ymmword ptr[r8+96], ymm0 |
BTW, PacketTracer benchmark #19662 gets 16.26% code size shrink.
|
I got the following for CoreCLR:
|
I got the following for Framework:
|
I got the following for Tests:
|
I got the following for Benchmarks:
|
The files that had diffs: |
@AndyAyersMS @CarolEidt Does the data look good to you? |
The results look good, and as expected. x86 diffs might be nice, but I don't think they're necessary. |
No, no concerns. |
No, just wanted to make sure we had any regressions covered. |
I'm working on getting x86 diffs as well, and so far they look much the same as the x64 diffs. |
@CarolEidt, @AndyAyers, @fiigii. Should we get diffs again with TieredJitting disabled? (I just spent way too long debugging another issue, only to find out it wasnt working because TieredJitting disabled that optimization). |
PMI (when run via jit-dasm-pmi, which in turn is run via jit-diff) disables tiered jitting already. If you run PMI directly via corerun then you might need to set some env vars first. |
Good to know. (I was using |
@CarolEidt, @AndyAyersMS. I'm merging this, since we've all signed off already. |
This PR improves struct promotion to unwrap more 256-bit SIMD fields, which makes PacketTracer benchmark 31% faster with #19662
Performance data (rendering a 2k image)
The data collected on
VTune characterization (module level)
Windows
Linux
The most obvious module-level change is the runtime (GC) overhead gets reduced (~33% -> ~11%) and managed code also gets better path-length (code size).
VTune characterization (managed code)
Windows
Linux
Overall, managed code gets improvement by the better code size, but there still are some inefficient codgen that I will continue to investigate and open other issues to discuss (mainly related to https://github.com/dotnet/coreclr/issues/16619)
VTune characterization (CoreCLR runtime)
Windows
Linux