Fixing the costing of GT_CNS_DBL and GT_CNS_VEC instructions #70215
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
This fixes the costing of GT_CNS_DBL and GT_CNS_VEC to be in line with the costing of GT_IND and to ensure that Zero and AllBitsSet have a more correct size estimate.
For GT_CNS_VEC, Zero is considered cheap and so we end up being able to see "we have a zero" in more places, which means we can rely on values being zero for the many optimizations around it:
- vxorps ymm2, ymm2, ymm2
- align [9 bytes for IG09]
- ;; size=55 bbWeight=0.50 PerfScore 5.04
+ align [2 bytes for IG09]
+ ;; size=43 bbWeight=0.50 PerfScore 4.88
  G_M3377_IG09:
  vpcmpeqw ymm1, ymm0, ymmword ptr[rcx+2*r9]
- vpcmpeqw ymm1, ymm1, ymm2
- vpmovmskb eax, ymm1
- cmp eax, -1
+ vptest ymm1, ymm1
However, this also looks to "regress" some places as we aren't CSE'ing the value anymore due to the low cost. This ideally would be handled by the register allocator seeing "we need zero, and we already have zero in xmm1", which is related to #70182:
+ vxorps xmm1, xmm1, xmm1
  vpunpcklbw xmm0, xmm0, xmm1
CC. @kunalspathak for the register allocator issues I'm seeing.
Overall it's still an improvement to both throughput (-0.01% across the board, except for 32-bit x86) and codegen size (https://dev.azure.com/dnceng/public/_build/results?buildId=1805530&view=ms.vss-build-web.run-extensions-tab), so it may still be worth taking as is and following up afterwards on the LSRA issues I flagged above.
Overall looks good. Few things:
Can you share the dumps (before/after of those), I can take a look when I investigate #70182.
Do you know why?
Here you go: The biggest difference is
No. There are some cases we CSE more and some we CSE less (namely |
@kunalspathak any concerns with this one being merged? The overall diffs are an improvement and we're seeing positive impact on known hot methods like |
LGTM
This fixes the costing of GT_CNS_DBL and GT_CNS_VEC to be in line with the costing of GT_IND and to ensure that Zero and AllBitsSet have a more correct size estimate.
For GT_CNS_VEC, Zero is considered cheap, so we end up being able to see "we have a zero" in more places, which means we can rely on values being zero for the many optimizations around it.
However, this also looks to "regress" some places as we aren't CSE'ing the value anymore due to the low cost. This ideally would be handled by the register allocator seeing "we need zero, and we already have zero in xmm1", which is related to #70182:
+ vxorps xmm1, xmm1, xmm1
  vpunpcklbw xmm0, xmm0, xmm1
There is a case in MemoryExtensions:SequenceEqual and MemoryExtensions:CommonPrefixLength where a general-purpose register selection changed. However, there are no new CSEs, no changes to the locals or costs, and not even a frame size change. So this seems suspect and potentially like something that should be investigated.
There are cases in System.Linq.Parallel, such as FirstQueryOperatorEnumerator_1:MoveNext, where the register allocator is deciding to use callee-saved registers (namely ymm6/ymm7) rather than the ymm0/ymm1 it had previously chosen. Part of this seems related to #70182, but it also seems odd that it chose to spill: Zero is cheap to compute, it looks unused in at least one path, and loading constants from their emitted local slot should likely be preferred over spilling.
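To make the costing/CSE trade-off described above concrete, here is a minimal sketch. It is not the actual JIT code: the types, the function names, the cost numbers, and the threshold are all hypothetical and chosen only for illustration. It models a vector constant whose Zero and AllBitsSet forms get a small register-only cost, while every other constant is costed like a memory indirection, and a CSE check that skips anything below a minimum cost.

```cpp
// Hypothetical, simplified model of constant costing; none of these names
// or numbers are taken from the JIT sources.
enum class VecConstKind { Zero, AllBitsSet, Other };

struct NodeCost {
    int execution; // rough execution cost estimate
    int size;      // rough code size estimate
};

// Assumed cost of a memory indirection (illustrative values only).
constexpr NodeCost kIndirCost{3, 4};

// Assumed minimum execution cost for a node to be worth CSE'ing.
constexpr int kMinCseCost = 2;

constexpr NodeCost costOfVectorConstant(VecConstKind kind) {
    switch (kind) {
        case VecConstKind::Zero:       // e.g. a single vxorps reg, reg, reg
        case VecConstKind::AllBitsSet: // e.g. a single compare-equal of a register with itself
            return NodeCost{1, 2};
        case VecConstKind::Other:      // loaded from memory, so costed like an indirection
            return kIndirCost;
    }
    return kIndirCost;
}

// A node below the threshold is simply rematerialized at each use.
constexpr bool isCseCandidate(NodeCost cost) {
    return cost.execution >= kMinCseCost;
}
```

Under these assumed numbers, Zero falls below the CSE threshold and is recomputed at each use, which matches the extra vxorps in the diff above, while a non-trivial constant remains a CSE candidate.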