Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicate cmp instruction, avoid boolean zero extension #70003

Closed
thinkbeforecoding opened this issue May 31, 2022 · 12 comments · Fixed by #82750
Closed

Avoid duplicate cmp instruction, avoid boolean zero extension #70003

thinkbeforecoding opened this issue May 31, 2022 · 12 comments · Fixed by #82750
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@thinkbeforecoding
Copy link

thinkbeforecoding commented May 31, 2022

I'm working on comparison optimization for the F# compiler to implement branchless compare dotnet/fsharp#13098

The following implementation improves average performance, using only a cgt - clt:

image

image

However, I noticed that the same cmp ecx edx instruction is issued twice, and this is unnecessary as setg and movzx don't change flags.

I also thought that the following version would be even shorter:

image

image

but it's actually longer. The result from cgt/clt is zero extended with movzx even if we use only the lower byte part for the subtraction, then is sign extended to 32bits.

The result code could be:

image

If someone is ok to guide me through this kind of JIT optimization, I'd happily implement it.

category:cq
theme:codegen
skill-level:intermediate
cost:medium
impact:medium

@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 31, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@am11 am11 added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 31, 2022
@ghost
Copy link

ghost commented May 31, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

I'm working on comparison optimization for the F# compiler to implement branchless compare dotnet/fsharp#13098

The following implementation improves average performance, using only a cgt - clt:

image

image

However, I noticed that the same cmp ecx edx instruction is issued twice, and this is unnecessary as setg and movzx don't change flags.

I also thought that the following version would be even shorter:

image

image

but it's actually longer. The result from cgt/clt is zero extended with movzx even if we use only the lower byte part for the subtraction, then is sign extended to 32bits.

The result code could be:

image

If someone is ok to guide me through this kind of JIT optimization, I'd happily implement it.

Author: thinkbeforecoding
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@EgorBo
Copy link
Member

EgorBo commented May 31, 2022

I assume it's not possible to express this in pure C# and F# right? (relying on cmp's return value as int, only maybe with Unsafe.As)

@EgorBo
Copy link
Member

EgorBo commented May 31, 2022

.NET 7.0 codegen is different:

G_M60686_IG01:              ;; offset=0000H
                                                ;; size=0 bbWeight=1    PerfScore 0.00
G_M60686_IG02:              ;; offset=0000H
       33C0                 xor      eax, eax
       3BCA                 cmp      ecx, edx
       0F9FC0               setg     al
       3BCA                 cmp      ecx, edx
       0F9CC2               setl     dl
       0FB6D2               movzx    rdx, dl
       2BC2                 sub      eax, edx
                                                ;; size=17 bbWeight=1    PerfScore 3.25
G_M60686_IG03:              ;; offset=0011H
       C3                   ret
                                                ;; size=1 bbWeight=1    PerfScore 1.00

; Total bytes of code 18

@tannergooding
Copy link
Member

tannergooding commented May 31, 2022

I assume it's not possible to express this in pure C# and F# right?

Right. There is now a proposal for C# to handle cmp ? 0 : 1 as emitting just cmp however: dotnet/roslyn#61483

.NET 7.0 codegen is different:

So almost there, its just missing the chance to remove the second compare and avoid the extension until after the sub. Noting that avoiding the extension can sometimes have "negative" impact since its a "partial register read/write"

@thinkbeforecoding
Copy link
Author

Yes it was submitted by @dsyme , in F# it is possible to emit inline IL so you can (as in the sample above) emit directly clt, cgt, ces IL instructions. But if roslyn can emit cgt for x > y? 1 : 0, it's basically the same.
However, x > y? (sbyte) 1 : (sbyte) 0 should remove the xor or movsx or movzx.
And get rid of the double cmp.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label May 31, 2022
@JulieLeeMSFT
Copy link
Member

cc @dotnet/jit-contrib.
Moving to Future as we are approaching .NET 7 code complete date.

@thinkbeforecoding
Copy link
Author

Looking at the code in https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/codegenxarch.cpp,
I've seen that there already seems to have logic to avoid the movzx instruction when return type is known to be a byte:

https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/codegenxarch.cpp#L1433

The logic to see if emitting a cmp based on existing flags also exists...

https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/codegenxarch.cpp#L6464

But how is the genCompareInt method determining the targetType (it's using tree->TypeGet()) ? The clt/cgt IL instruction specify no return type.. the the type is maybe conditioned by the next IL instruction ?

@EgorBo
Copy link
Member

EgorBo commented Jun 7, 2022

cc @TIHan

@TIHan TIHan self-assigned this Oct 31, 2022
@TIHan TIHan modified the milestones: Future, 8.0.0 Jan 17, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 25, 2023
@TIHan
Copy link
Contributor

TIHan commented Jan 25, 2023

@thinkbeforecoding I made a PR to resolve this in case you are interested: #81143

@thinkbeforecoding
Copy link
Author

Definitely! Thank you.
The PR will also give me hints where to look at next time I see such possible optimization.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 25, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 28, 2023
@TIHan
Copy link
Contributor

TIHan commented Feb 28, 2023

@thinkbeforecoding - The other PR is closed, but I made a new one: #82750

It's still work-in-progress, but the implementation of it is a lot more straightforward than my previous PR. This is because we recently added the ability to look back at more than one previous instruction, which allows us to see if the cmp is redundant or not.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 5, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
6 participants