Use more optimal Ord implementation for integers #63767

tesuji · 2019-08-21T00:41:41Z

Compare results

Old assembly:

example::cmp1:
  mov eax, dword ptr [rdi]
  mov ecx, dword ptr [rsi]
  cmp eax, ecx
  setae dl
  add dl, dl
  add dl, -1
  xor esi, esi
  cmp eax, ecx
  movzx eax, dl
  cmove eax, esi
  ret

New assembly:

example::cmp2:
  mov eax, dword ptr [rdi]
  xor ecx, ecx
  cmp eax, dword ptr [rsi]
  seta cl
  mov eax, 255
  cmovae eax, ecx
  ret
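
A note on the `mov eax, 255` in the new assembly: 255 is the byte 0xFF, which is -1 when read as an `i8`, and `Ordering::Less` has the discriminant -1. The snippet below is only a small sketch to illustrate that encoding; the helper name `ordering_byte` is made up for this example.

```rust
use std::cmp::Ordering;

/// Returns the single byte used to represent an `Ordering` value.
/// `Ordering` has discriminants Less = -1, Equal = 0, Greater = 1,
/// so `Less` becomes 0xFF (255) when viewed as an unsigned byte.
pub fn ordering_byte(o: Ordering) -> u8 {
    o as i8 as u8
}
```

So the `cmovae` picks between 255 (`Less`) and the 0/1 produced by `seta` (`Equal`/`Greater`).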

Old llvm-mca statistics:

Iterations:        100
Instructions:      1100
Total Cycles:      243
Total uOps:        1300

Dispatch Width:    6
uOps Per Cycle:    5.35
IPC:               4.53
Block RThroughput: 2.2

New llvm-mca statistics:

Iterations:        100
Instructions:      700
Total Cycles:      217
Total uOps:        1100

Dispatch Width:    6
uOps Per Cycle:    5.07
IPC:               3.23
Block RThroughput: 1.8
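
For readers without the godbolt link at hand, here is a sketch of what the two compared functions presumably look like in Rust. The names `cmp1` and `cmp2` come from the assembly labels above. The new order (`<` first) matches the `if a < b { Less } else if a == b { Equal } else { Greater }` form; the old order (`==` first) is an assumption inferred from the old assembly, not copied from the diff.

```rust
use std::cmp::Ordering;

/// Assumed shape of the old implementation: equality is tested first.
pub fn cmp1(a: &u32, b: &u32) -> Ordering {
    if *a == *b {
        Ordering::Equal
    } else if *a < *b {
        Ordering::Less
    } else {
        Ordering::Greater
    }
}

/// New shape: `<` is tested first, then equality.
pub fn cmp2(a: &u32, b: &u32) -> Ordering {
    if *a < *b {
        Ordering::Less
    } else if *a == *b {
        Ordering::Equal
    } else {
        Ordering::Greater
    }
}
```

Both functions return the same result for every input; only the order of the checks differs, which is what changes the generated code.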

nagisa · 2019-08-21T00:46:00Z

@bors r+

bors · 2019-08-21T00:46:02Z

📌 Commit 0337cc1 has been approved by nagisa

matthiaskrgr · 2019-08-21T06:05:53Z

Could you add a comment explaining that the ordering is performance-critical here (perhaps with a link to the original ticket)?
This should make sure that it is not changed back to something slower by accident.

hellow554 · 2019-08-21T06:11:26Z

What about adding an assembly test case as well to prevent regressions (e.g. due to other optimizations)?

tesuji · 2019-08-21T06:22:32Z

> What about adding an assembly test case as well to prevent regressions (e.g. due to other optimizations)?

I don't know how. Could you give me some pointers?

hellow554 · 2019-08-21T07:41:11Z

@lzutao You can take a look at https://rust-lang.github.io/rustc-guide/tests/intro.html and the codegen test cases in https://github.com/rust-lang/rust/tree/master/src/test/codegen especially at

rust/src/test/codegen/float_math.rs

Lines 1 to 50 in bea0372

    
    // compile-flags: -C no-prepopulate-passes

    #![crate_type = "lib"]
    #![feature(core_intrinsics)]

    use std::intrinsics::{fadd_fast, fsub_fast, fmul_fast, fdiv_fast, frem_fast};

    // CHECK-LABEL: @add
    #[no_mangle]
    pub fn add(x: f32, y: f32) -> f32 {
    // CHECK: fadd float
    // CHECK-NOT: fast
        x + y
    }

    // CHECK-LABEL: @addition
    #[no_mangle]
    pub fn addition(x: f32, y: f32) -> f32 {
    // CHECK: fadd fast float
        unsafe {
            fadd_fast(x, y)
        }
    }

    // CHECK-LABEL: @subtraction
    #[no_mangle]
    pub fn subtraction(x: f32, y: f32) -> f32 {
    // CHECK: fsub fast float
        unsafe {
            fsub_fast(x, y)
        }
    }

    // CHECK-LABEL: @multiplication
    #[no_mangle]
    pub fn multiplication(x: f32, y: f32) -> f32 {
    // CHECK: fmul fast float
        unsafe {
            fmul_fast(x, y)
        }
    }

    // CHECK-LABEL: @division
    #[no_mangle]
    pub fn division(x: f32, y: f32) -> f32 {
    // CHECK: fdiv fast float
        unsafe {
            fdiv_fast(x, y)
        }
    }

I guess

src/test/codegen/integer-cmp.rs
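
Following the pattern of `float_math.rs`, a codegen test for integer comparison might look roughly like the sketch below. Everything here is an assumption for illustration: the function names `cmp_signed` and `cmp_unsigned` and the `CHECK` patterns are hypothetical, and the exact LLVM IR to match would need to be confirmed against real compiler output. A real test file would also start with a `// compile-flags:` line and `#![crate_type = "lib"]`, as `float_math.rs` does; they are left out here so the snippet compiles on its own.

```rust
use std::cmp::Ordering;

// Hypothetical FileCheck directives: the patterns below only illustrate
// the idea of pinning the comparison that the optimizer should emit.

// CHECK-LABEL: @cmp_signed
#[no_mangle]
pub fn cmp_signed(a: i64, b: i64) -> Ordering {
// CHECK: icmp slt
    a.cmp(&b)
}

// CHECK-LABEL: @cmp_unsigned
#[no_mangle]
pub fn cmp_unsigned(a: u32, b: u32) -> Ordering {
// CHECK: icmp ult
    a.cmp(&b)
}
```

The `CHECK` comments are read by FileCheck against the emitted LLVM IR, so the test fails if a later change makes the compiler produce a different comparison sequence.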

tesuji · 2019-08-22T02:03:26Z

The CI is green.

nagisa · 2019-08-22T02:33:52Z

@bors r+

bors · 2019-08-22T02:33:53Z

📌 Commit f5b16f6 has been approved by nagisa

@nagisa

…gisa Use more optimal Ord implementation for integers

Closes rust-lang#63758

r? @nagisa

### Compare results ([godbolt link](https://godbolt.org/z/dsbczy))

The old and new assembly and llvm-mca statistics are the same as in the PR description above.

@ghost

Rollup of 7 pull requests

Successful merges:

- #63624 (When declaring a declarative macro in an item it's only accessible inside it)
- #63737 (Fix naming misspelling)
- #63767 (Use more optimal Ord implementation for integers)
- #63782 (Fix confusion in theme picker functions)
- #63788 (Add amanjeev to rustc-guide toolstate)
- #63796 (Tweak E0308 on opaque types)
- #63805 (Apply few Clippy suggestions)

Failed merges:

r? @ghost


Add `Ord::cmp` for primitives as a `BinOp` in MIR

Update: most of this OP was written months ago. See rust-lang#118310 (comment) below for where we got to recently that made it ready for review.

---

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches. Those differences are irrelevant at the Rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on Broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic. Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical. Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)? But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?

And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers. Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR. The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR changes it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.

Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues. (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`

But first I should check that perf is ok with this ~~...and my true nemesis, tidy.~~
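
To make the "dozens of reasonable ways" point concrete, here is a small sketch of two of them side by side: the branchy form quoted above, and a branchless form based on the `(a > b) - (a < b)` subtraction trick. The function names `cmp_branchy` and `cmp_subtract` are made up for this example, and the branchless version is only one possible spelling of the trick, not necessarily the one proposed in rust-lang#105840.

```rust
use std::cmp::Ordering;

/// The branchy form: `<` first, then `==`.
pub fn cmp_branchy(a: i32, b: i32) -> Ordering {
    if a < b {
        Ordering::Less
    } else if a == b {
        Ordering::Equal
    } else {
        Ordering::Greater
    }
}

/// A branchless form: `(a > b) - (a < b)` evaluates to -1, 0 or 1.
pub fn cmp_subtract(a: i32, b: i32) -> Ordering {
    let raw = (a > b) as i8 - (a < b) as i8;
    match raw {
        -1 => Ordering::Less,
        0 => Ordering::Equal,
        _ => Ordering::Greater,
    }
}
```

Both compute the same result, which is why the choice between them is purely a question of which pattern a given backend optimizes best.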

Use more optimal Ord implementation for integers

0337cc1

rust-highfive assigned nagisa Aug 21, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 21, 2019

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 21, 2019

Add comment to avoid accidentally remove the changes.

96983fc


nagisa reviewed Aug 21, 2019


Add codegen test for integers compare

f5b16f6

tesuji force-pushed the integer-ord-suboptimal branch from c184fa0 to f5b16f6 Compare August 21, 2019 15:51

Centril mentioned this pull request Aug 22, 2019

Rollup of 7 pull requests #63807

Merged

bors merged commit f5b16f6 into rust-lang:master Aug 22, 2019

tesuji deleted the integer-ord-suboptimal branch August 23, 2019 01:11

scottmcm mentioned this pull request Sep 2, 2019

Even more optimal Ord implementation for integers #64082

Closed

scottmcm mentioned this pull request Nov 26, 2023

Add Ord::cmp for primitives as a BinOp in MIR #118310

Merged
