Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing the costing of GT_CNS_DBL and GT_CNS_VEC instructions #70215

Merged
merged 2 commits into from
Jun 6, 2022

Conversation

tannergooding
Copy link
Member

This fixes the costing of GT_CNS_DBL and GT_CNS_VEC to be inline with the costing of GT_IND and to ensure that the cost of Zero and AllBitsSet have a more correct size estimate.

For GT_CNS_VEC, Zero is considered cheap and so we end up being able to see "we have a zero" in more places, which means we can rely on values being zero for the many optimizations around it.

-       vxorps   ymm2, ymm2, ymm2
-       align    [9 bytes for IG09]
-                        ;; size=55 bbWeight=0.50 PerfScore 5.04
+       align    [2 bytes for IG09]
+                        ;; size=43 bbWeight=0.50 PerfScore 4.88
 G_M3377_IG09:
        vpcmpeqw ymm1, ymm0, ymmword ptr[rcx+2*r9]
-       vpcmpeqw ymm1, ymm1, ymm2
-       vpmovmskb eax, ymm1
-       cmp      eax, -1
+       vptest   ymm1, ymm1

However, this also looks to "regress" some places as we aren't CSE'ing the value anymore due to the low cost. This ideally would be handled by the register allocator seeing "we need zero, and we already have zero in xmm1", which is related to #70182:

+       vxorps   xmm1, xmm1, xmm1
        vpunpcklbw xmm0, xmm0, xmm1

There is a case in MemoryExtensions:SequenceEqual and MemoryExtensions:CommonPrefixLength where a general-purpose register selection changed. However there are no new CSEs, changes to the locals or costs, or even frame size changes. So this seems suspect and potentially like something that should be investigated.

There are cases in System.Linq.Parallel, such as FirstQueryOperatorEnumerator_1:MoveNext, where the register allocator is deciding to use caller saved registers (namely ymm6/ymm7 rather than the previous ymm0/ymm1 it had chosen. Part of this seems related to #70182, but it also seems odd that it chose to spill because Zero is cheap to compute, it looks unused in at least one path, and it should likely be preferred to load constants from their emitted local slot rather than spilling.

@ghost ghost assigned tannergooding Jun 3, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 3, 2022
@ghost
Copy link

ghost commented Jun 3, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This fixes the costing of GT_CNS_DBL and GT_CNS_VEC to be inline with the costing of GT_IND and to ensure that the cost of Zero and AllBitsSet have a more correct size estimate.

For GT_CNS_VEC, Zero is considered cheap and so we end up being able to see "we have a zero" in more places, which means we can rely on values being zero for the many optimizations around it.

-       vxorps   ymm2, ymm2, ymm2
-       align    [9 bytes for IG09]
-                        ;; size=55 bbWeight=0.50 PerfScore 5.04
+       align    [2 bytes for IG09]
+                        ;; size=43 bbWeight=0.50 PerfScore 4.88
 G_M3377_IG09:
        vpcmpeqw ymm1, ymm0, ymmword ptr[rcx+2*r9]
-       vpcmpeqw ymm1, ymm1, ymm2
-       vpmovmskb eax, ymm1
-       cmp      eax, -1
+       vptest   ymm1, ymm1

However, this also looks to "regress" some places as we aren't CSE'ing the value anymore due to the low cost. This ideally would be handled by the register allocator seeing "we need zero, and we already have zero in xmm1", which is related to #70182:

+       vxorps   xmm1, xmm1, xmm1
        vpunpcklbw xmm0, xmm0, xmm1

There is a case in MemoryExtensions:SequenceEqual and MemoryExtensions:CommonPrefixLength where a general-purpose register selection changed. However there are no new CSEs, changes to the locals or costs, or even frame size changes. So this seems suspect and potentially like something that should be investigated.

There are cases in System.Linq.Parallel, such as FirstQueryOperatorEnumerator_1:MoveNext, where the register allocator is deciding to use caller saved registers (namely ymm6/ymm7 rather than the previous ymm0/ymm1 it had chosen. Part of this seems related to #70182, but it also seems odd that it chose to spill because Zero is cheap to compute, it looks unused in at least one path, and it should likely be preferred to load constants from their emitted local slot rather than spilling.

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

@tannergooding
Copy link
Member Author

CC. @kunalspathak for the register allocator issues I'm seeing.

@tannergooding tannergooding marked this pull request as ready for review June 3, 2022 23:24
@tannergooding
Copy link
Member Author

tannergooding commented Jun 3, 2022

Overall its still an improvement both to throughput (-0.01% across the board, except for 32-bit x86) and codegen size: https://dev.azure.com/dnceng/public/_build/results?buildId=1805530&view=ms.vss-build-web.run-extensions-tab, so it may still be worth taking as is and following up after with the LSRA issues I flagged above.

@kunalspathak kunalspathak self-requested a review June 6, 2022 13:09
@kunalspathak
Copy link
Member

Overall looks good. Few things:

There are cases in System.Linq.Parallel, such as FirstQueryOperatorEnumerator_1:MoveNext

Can you share the dumps (before/after of those), I can take a look when I investigate #70182.

except for 32-bit x86

Do you know why?

@tannergooding
Copy link
Member Author

Can you share the dumps (before/after of those), I can take a look when I investigate:

Here you go:
JitDump.zip

The biggest difference is Zero going from N001 ( 3, 4) to N001 ( 1, 2). This causes Zero to not be CSE'd (which is also desirable, because its a special constant and is often contained/containable in lowering).

Do you know why

No. There are some cases we CSE more and some we CSE less (namely Zero and AllBitsSet). This looks to negatively impact x86 for an unknown reason.

@tannergooding
Copy link
Member Author

@kunalspathak any concerns with this one being merged?

The overall diffs are an improvement and we're seeing positive impact on known hot methods like System.SpanHelpers:Contains and System.SpanHelpers:IndexOfValueType (on both x64 and Arm64). The regressions on the other hand are cases like Perf_Matrix3x2:CreateScaleFromScalarWithCenterBenchmark and System.Collections.BitArray:Not, being pretty minor overall (mostly cases where we introduce a new xorps instruction, which is related to #70182).

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@TIHan TIHan merged commit 2d5c8dc into dotnet:main Jun 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 7, 2022
@tannergooding tannergooding deleted the vector-cns branch November 11, 2022 15:51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants