Fixing the costing of GT_CNS_DBL and GT_CNS_VEC instructions #70215

tannergooding · 2022-06-03T18:17:30Z

This fixes the costing of GT_CNS_DBL and GT_CNS_VEC to be in line with the costing of GT_IND, and ensures that Zero and AllBitsSet have a more accurate size estimate.

For GT_CNS_VEC, Zero is now considered cheap, so we are able to see "we have a zero" in more places. That in turn lets the many optimizations that depend on recognizing a zero operand apply:

-       vxorps   ymm2, ymm2, ymm2
-       align    [9 bytes for IG09]
-                        ;; size=55 bbWeight=0.50 PerfScore 5.04
+       align    [2 bytes for IG09]
+                        ;; size=43 bbWeight=0.50 PerfScore 4.88
 G_M3377_IG09:
        vpcmpeqw ymm1, ymm0, ymmword ptr[rcx+2*r9]
-       vpcmpeqw ymm1, ymm1, ymm2
-       vpmovmskb eax, ymm1
-       cmp      eax, -1
+       vptest   ymm1, ymm1

However, this also appears to "regress" some places, because the value is no longer CSE'd due to its low cost. Ideally this would be handled by the register allocator seeing "we need zero, and we already have zero in xmm1", which is related to #70182:

+       vxorps   xmm1, xmm1, xmm1
        vpunpcklbw xmm0, xmm0, xmm1
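
To make the "we already have zero in xmm1" idea concrete, here is a minimal sketch of tracking which registers are known to hold zero. This is not how LSRA is implemented; `ZeroRegTracker`, its methods, and the 16-register file are all hypothetical names and assumptions used only for illustration.

```cpp
#include <array>

// Hypothetical sketch: remember which SIMD registers are known to hold zero
// so a later use can reuse one instead of emitting another vxorps.
class ZeroRegTracker
{
public:
    static constexpr int kRegCount = 16; // xmm0..xmm15 (assumed register file)

    // Record that `reg` was just zeroed (for example by vxorps reg, reg, reg).
    void MarkZero(int reg) { m_isZero[reg] = true; }

    // Record that `reg` was overwritten with some other value.
    void Clobber(int reg) { m_isZero[reg] = false; }

    // Return the lowest-numbered register known to hold zero, or -1 if none.
    int FindZero() const
    {
        for (int reg = 0; reg < kRegCount; reg++)
        {
            if (m_isZero[reg])
            {
                return reg;
            }
        }
        return -1;
    }

private:
    std::array<bool, kRegCount> m_isZero{}; // value-initialized to all false
};
```

With something along these lines, a consumer that needs zero would first call `FindZero()` and only materialize a new `vxorps` when it returns -1.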

There is a case in MemoryExtensions:SequenceEqual and MemoryExtensions:CommonPrefixLength where a general-purpose register selection changed. However, there are no new CSEs, no changes to the locals or costs, and no frame size changes. So this seems suspect and potentially like something that should be investigated.

There are cases in System.Linq.Parallel, such as FirstQueryOperatorEnumerator_1:MoveNext, where the register allocator now uses callee-saved registers (namely ymm6/ymm7) rather than the ymm0/ymm1 it previously chose. Part of this seems related to #70182, but it also seems odd that it chose to spill: Zero is cheap to compute, it looks unused in at least one path, and loading constants from their emitted local slot should likely be preferred over spilling.

ghost · 2022-06-03T18:17:37Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-CodeGen-coreclr`
Milestone:	-

tannergooding · 2022-06-03T18:18:50Z

CC. @kunalspathak for the register allocator issues I'm seeing.

tannergooding · 2022-06-03T23:25:44Z

Overall it's still an improvement both to throughput (-0.01% across the board, except for 32-bit x86) and to codegen size: https://dev.azure.com/dnceng/public/_build/results?buildId=1805530&view=ms.vss-build-web.run-extensions-tab, so it may still be worth taking as is and following up afterwards on the LSRA issues I flagged above.

kunalspathak · 2022-06-06T13:45:52Z

Overall looks good. A few things:

> There are cases in System.Linq.Parallel, such as FirstQueryOperatorEnumerator_1:MoveNext

Can you share the dumps (before/after of those)? I can take a look when I investigate #70182.

> except for 32-bit x86

Do you know why?

tannergooding · 2022-06-06T15:19:41Z

> Can you share the dumps (before/after of those), I can take a look when I investigate:

Here you go:
JitDump.zip

The biggest difference is Zero going from N001 ( 3, 4) to N001 ( 1, 2). This causes Zero to not be CSE'd (which is also desirable, because it's a special constant and is often contained/containable in lowering).
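
As a rough illustration of why the lower cost stops the CSE, here is a minimal sketch of the decision. It is not the JIT's actual code: `Cost`, `VecConst`, `CostOfVectorConstant`, `IsCseCandidate`, and the `kMinCseCost` threshold are hypothetical names and assumed values, and the ( 3, 4) used for non-special constants is only a stand-in for "costed like a load".

```cpp
// Hypothetical model of a node cost: (execution cost, code-size cost),
// mirroring the "( ex, sz)" pair printed next to each node in the dump.
struct Cost
{
    int ex;
    int sz;
};

enum class VecConst
{
    Zero,
    AllBitsSet,
    Other
};

// Assumed values: the special constants get the ( 1, 2) cost quoted above;
// any other constant is costed like a memory load (illustrative ( 3, 4)).
Cost CostOfVectorConstant(VecConst kind)
{
    switch (kind)
    {
        case VecConst::Zero:
        case VecConst::AllBitsSet:
            return {1, 2};
        default:
            return {3, 4};
    }
}

// Assumed threshold: a node is only considered for CSE when its execution
// cost reaches this minimum.
constexpr int kMinCseCost = 2;

bool IsCseCandidate(Cost cost)
{
    return cost.ex >= kMinCseCost;
}
```

Under these assumptions, Zero at ( 3, 4) would be a CSE candidate, while Zero at ( 1, 2) falls below the threshold and is rematerialized at each use instead.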

> Do you know why

No. There are some cases we CSE more and some we CSE less (namely Zero and AllBitsSet). This looks to negatively impact x86 for an unknown reason.

tannergooding · 2022-06-06T16:43:48Z

@kunalspathak any concerns with this one being merged?

The overall diffs are an improvement, and we're seeing a positive impact on known hot methods like System.SpanHelpers:Contains and System.SpanHelpers:IndexOfValueType (on both x64 and Arm64). The regressions, on the other hand, are in cases like Perf_Matrix3x2:CreateScaleFromScalarWithCenterBenchmark and System.Collections.BitArray:Not, and are pretty minor overall (mostly cases where we introduce a new xorps instruction, which is related to #70182).

kunalspathak

LGTM

Fixing the costing of GT_CNS_DBL and GT_CNS_VEC instructions

59c6432

ghost assigned tannergooding Jun 3, 2022

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 3, 2022

Applying formatting patch

479c660

tannergooding marked this pull request as ready for review June 3, 2022 23:24

kunalspathak self-requested a review June 6, 2022 13:09

kunalspathak approved these changes Jun 6, 2022

TIHan merged commit 2d5c8dc into dotnet:main Jun 6, 2022

DrewScoggins mentioned this pull request Jun 9, 2022

Regressions in System.Numerics.Tests.Perf_Plane #70497

Closed

This was referenced Jun 10, 2022

[Perf] Changes at 6/7/2022 2:18:03 AM dotnet/perf-autofiling-issues#5947

Closed

[Perf] Changes at 6/7/2022 2:18:03 AM dotnet/perf-autofiling-issues#5954

Closed

ghost locked as resolved and limited conversation to collaborators Jul 7, 2022

tannergooding deleted the vector-cns branch November 11, 2022 15:51
