
Suboptimal eq compilation on structs compared to equivalent C++ code #106269

Closed
Kmeakin opened this issue Dec 29, 2022 · 9 comments · Fixed by #124299
Labels
A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug: Category: This is a bug.
E-needs-test: Call for participation: An issue has been fixed and does not reproduce, but no test has been added.
I-slow: Issue: Problems and improvements with respect to performance of generated code.
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Kmeakin
Contributor

Kmeakin commented Dec 29, 2022

The eq implementation produces worse code than the equivalent C++ code (Rust, C++)

I tried this code:

pub struct S {
    a: u8,
    b: u8,
    c: u8,
    d: u8,
}

pub fn eq(s1: &S, s2: &S) -> bool {
    s1.a == s2.a && s1.b == s2.b && s1.c == s2.c && s1.d == s2.d
}

I expected to see this happen:

The resulting assembly should load each 4-byte struct as a single 32-bit word and compare once:

example::eq:
        ldr     w8, [x0]
        ldr     w9, [x1]
        cmp     w8, w9
        cset    w0, eq
        ret

Instead, this happened:

The assembly loads and compares each u8 individually:

example::eq:
        ldrb    w9, [x0]
        mov     w8, wzr
        ldrb    w10, [x1]
        cmp     w9, w10
        b.ne    .LBB0_4
        ldrb    w9, [x0, #1]
        ldrb    w10, [x1, #1]
        cmp     w9, w10
        b.ne    .LBB0_4
        ldrb    w8, [x0, #2]
        ldrb    w9, [x1, #2]
        cmp     w8, w9
        b.ne    .LBB0_5
        ldrb    w8, [x0, #3]
        ldrb    w9, [x1, #3]
        cmp     w8, w9
        cset    w8, eq
.LBB0_4:
        mov     w0, w8
        ret
.LBB0_5:
        mov     w0, wzr
        ret

Meta

rustc --version --verbose:

rustc 1.68.0-nightly (88c58e3c2 2022-12-26)
binary: rustc
commit-hash: 88c58e3c2c097ebffac425d9e080dcb1aadf790e
commit-date: 2022-12-26
host: x86_64-unknown-linux-gnu
release: 1.68.0-nightly
LLVM version: 15.0.6
Compiler returned: 0
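
[Editor's note] As an aside, the single-word comparison can be forced by hand by packing the four bytes into a u32 before comparing. This is a hypothetical workaround sketch, not code from the issue; the name eq_word is illustrative:

```rust
pub struct S {
    a: u8,
    b: u8,
    c: u8,
    d: u8,
}

// Hypothetical workaround: pack the four bytes and compare once.
// Behaviorally equivalent to the field-by-field eq, since the fields
// are plain u8s with no side effects in their comparisons.
pub fn eq_word(s1: &S, s2: &S) -> bool {
    u32::from_ne_bytes([s1.a, s1.b, s1.c, s1.d])
        == u32::from_ne_bytes([s2.a, s2.b, s2.c, s2.d])
}

fn main() {
    let x = S { a: 1, b: 2, c: 3, d: 4 };
    let y = S { a: 1, b: 2, c: 3, d: 5 };
    assert!(eq_word(&x, &x));
    assert!(!eq_word(&x, &y));
}
```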
@Kmeakin Kmeakin added the C-bug Category: This is a bug. label Dec 29, 2022
@Noratrieb Noratrieb added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 29, 2022
@leonardo-m

With more aggressive compilation flags I am seeing (latest Nightly):

example::eq:
        mov     eax, dword ptr [rdi]
        cmp     eax, dword ptr [rsi]
        sete    al
        ret

@Kmeakin
Contributor Author

Kmeakin commented Dec 29, 2022

> With more aggressive compilation flags I am seeing (latest Nightly):
>
> example::eq:
>         mov     eax, dword ptr [rdi]
>         cmp     eax, dword ptr [rsi]
>         sete    al
>         ret

What flags did you use?

@leonardo-m

-O -Z mir-opt-level=4

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Dec 29, 2022
@nikic
Contributor

nikic commented Dec 29, 2022

Looks like the first two comparisons get flattened into one block, and that prevents detection of the whole pattern.

@clubby789
Contributor

SeparateConstSwitch (enabled at mir-opt-level >= 4) is the MIR optimization that merges the short-circuiting failure paths into one, allowing SimplifyCFG to reduce this to a pattern that can be turned into a memcmp.

@scottmcm
Member

scottmcm commented Jan 6, 2023

#106294 might improve this.

I noticed in https://rust.godbolt.org/z/q768EYdqq that it's doing <2 x i8> dances:

  %0 = load <2 x i8>, ptr %s1, align 1, !dbg !11
  %1 = load <2 x i8>, ptr %s2, align 1, !dbg !12
  %2 = icmp eq <2 x i8> %0, %1, !dbg !11
  %3 = extractelement <2 x i1> %2, i64 0, !dbg !11
  %4 = extractelement <2 x i1> %2, i64 1, !dbg !11

which might no longer be needed once it's !noundef and it can thus just compare them both directly.


There's also something going on here related to the short-circuiting that I don't understand.

If I change it to

pub fn eq(s1: &S, s2: &S) -> bool {
    (s1.a == s2.a) & (s1.b == s2.b) & (s1.c == s2.c) & (s1.d == s2.d)
}

Then it optimizes as expected

define noundef zeroext i1 @_ZN7example2eq17h1d431ecac5604099E(ptr noalias nocapture noundef readonly align 1 dereferenceable(4) %s1, ptr noalias nocapture noundef readonly align 1 dereferenceable(4) %s2) unnamed_addr #0 !dbg !6 {
  %0 = load i32, ptr %s1, align 1, !dbg !11
  %1 = load i32, ptr %s2, align 1, !dbg !12
  %2 = icmp eq i32 %0, %1, !dbg !13
  ret i1 %2, !dbg !14
}

https://rust.godbolt.org/z/ThfqaW3cr

It's definitely marked dereferenceable(4) either way, so LLVM ought to be able to move the loads early either way.
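
[Editor's note] The short-circuiting (&&) and non-short-circuiting (&) forms discussed above are observationally equivalent for this struct, since the field comparisons are side-effect-free; that is what makes the rewrite purely an optimization question. A quick illustrative check (not from the thread):

```rust
pub struct S {
    a: u8,
    b: u8,
    c: u8,
    d: u8,
}

// Short-circuiting form from the issue: branches between comparisons.
pub fn eq_sc(s1: &S, s2: &S) -> bool {
    s1.a == s2.a && s1.b == s2.b && s1.c == s2.c && s1.d == s2.d
}

// Non-short-circuiting form from scottmcm's comment: bitwise AND of bools.
pub fn eq_bitand(s1: &S, s2: &S) -> bool {
    (s1.a == s2.a) & (s1.b == s2.b) & (s1.c == s2.c) & (s1.d == s2.d)
}

fn main() {
    let x = S { a: 1, b: 2, c: 3, d: 4 };
    let y = S { a: 1, b: 2, c: 3, d: 5 };
    // Both forms agree on equal and unequal inputs.
    assert!(eq_sc(&x, &x) && eq_bitand(&x, &x));
    assert!(!eq_sc(&x, &y) && !eq_bitand(&x, &y));
}
```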

@nikic
Contributor

nikic commented Jan 6, 2023

@scottmcm Short-circuiting is relevant in that moving everything into one block is generally non-profitable. There is a special pass that detects this pattern (MergeICmps) and converts it into memcmp/bcmp, which is then later expanded into a target-specific efficient comparison sequence. This is a backend IR pass, so you don't see it in --emit=llvm-ir output.

However, it requires a pretty specific pattern, and does not handle the case where the control flow has been partially flattened and one pair of comparisons uses a select rather than a branch.

Your X86 example adds insult to injury by actually vectorizing those two i8 comparisons: https://llvm.godbolt.org/z/qPe6Kfq6e That looks pretty clearly non-profitable: https://llvm.godbolt.org/z/dT85WE8xn

Edit: Filed llvm/llvm-project#59867 for the SLPVectorizer issue.
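
[Editor's note] The "partially flattened" shape nikic describes, where one pair of comparisons becomes a branchless select while the rest keep short-circuit branches, can be rendered in Rust source as a sketch. This is illustrative only; the function name and structure are hypothetical, not taken from the compiler's actual output:

```rust
pub struct S {
    a: u8,
    b: u8,
    c: u8,
    d: u8,
}

// Hypothetical source-level rendering of partially flattened control flow:
// the first two byte comparisons are combined with a branchless `&`
// (lowered to a select), while the remaining pairs keep short-circuit
// branches. MergeICmps does not recognize this mixed shape.
pub fn eq_partially_flattened(s1: &S, s2: &S) -> bool {
    let first_pair = (s1.a == s2.a) & (s1.b == s2.b); // select, no branch
    first_pair && s1.c == s2.c && s1.d == s2.d // branches remain
}

fn main() {
    let x = S { a: 1, b: 2, c: 3, d: 4 };
    let y = S { a: 1, b: 9, c: 3, d: 4 };
    assert!(eq_partially_flattened(&x, &x));
    assert!(!eq_partially_flattened(&x, &y));
}
```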

@workingjubilee
Member

workingjubilee commented Feb 20, 2023

I suppose this optimization would specifically be enabled by our optimization semantics allowing spurious reads. There has been no change since noundef was added, though. The SeparateConstSwitch pass may be affected by #107009 being accepted.

@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
@Kmeakin Kmeakin closed this as completed Apr 13, 2024
@clubby789
Contributor

Looks like this is fixed now; we should add a test for this if we don't have one already.

@clubby789 clubby789 reopened this Apr 13, 2024
@clubby789 clubby789 added the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label Apr 13, 2024
bors added a commit to rust-lang-ci/rust that referenced this issue Apr 29, 2024
Add test for issue 106269

Closes rust-lang#106269

Made this an assembly test as the LLVM codegen is still quite verbose and doesn't really indicate the behaviour we want
@bors bors closed this as completed in 4b6c191 Apr 30, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Apr 30, 2024
Rollup merge of rust-lang#124299 - clubby789:106269-test, r=nikic

Add test for issue 106269

Closes rust-lang#106269

Made this an assembly test as the LLVM codegen is still quite verbose and doesn't really indicate the behaviour we want
7 participants