Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve BinaryHeap performance #78857

Merged
merged 4 commits into from
Nov 13, 2020
Merged

Improve BinaryHeap performance #78857

merged 4 commits into from
Nov 13, 2020

Conversation

SkiFire13
Copy link
Contributor

By changing the condition in the loops from child < end to child < end - 1 we're guaranteed that right = child + 1 < end and since finding the index of the biggest sibling can be done with an arithmetic operation we can remove a branch from the loop body. The case where there's no right child, i.e. child == end - 1 is instead handled outside the loop, after it ends; note that if the loops ends early we can use return instead of break since the check child == end - 1 will surely fail.

I've also removed a call to <[T]>::swap that was hiding a bound check that wasn't being optimized by LLVM.

A quick benchmarks on my pc shows that the gains are pretty significant:

name before ns/iter after ns/iter diff ns/iter diff % speedup
find_smallest_1000 352,565 260,098 -92,467 -26.23% x 1.36
from_vec 676,795 473,934 -202,861 -29.97% x 1.43
into_sorted_vec 469,511 304,275 -165,236 -35.19% x 1.54
pop 483,198 373,778 -109,420 -22.64% x 1.29

The other 2 benchmarks for BinaryHeap (peek_mut_deref_mut and push) weren't impacted and as such didn't show any significant change.

@rust-highfive
Copy link
Collaborator

r? @KodrAus

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 7, 2020
@tavianator
Copy link
Contributor

I don't know if this makes a difference performance-wise (or even readability-wise), but you could avoid the comparisons with end - 1 by computing the right child by default instead of the left (child = 2 * pos + 2).

@SkiFire13
Copy link
Contributor Author

@tavianator I didn't think of that, but I wouldn't expect it to make much of a difference (the compiler stores end-1 in a register before the loop so no other calculation is required inside). I benched it anyway, and surprising it even performs worse.

name left child ns/iter right child ns/iter diff ns/iter diff % speedup
find_smallest_1000 260,098 268,056 +7,958 +3.06% x 0.97
from_vec 473,934 533,500 +59,566 +12.57% x 0.89
into_sorted_vec 304,275 365,226 +60,951 +20.03% x 0.83
pop 373,778 392,372 +18,594 +4.97% x 0.95

I also tested a reduced version on godbolt, and looks like LLVM is doing a worse job at optimizing sift_down_range's loop (13 vs 16 instructions for a u64, that's like 23% more instructions!) which might explain the slowdown, in particular hole.get(child - 1) is not optimized as good as hole.get(child + 1) previously was.

@tavianator
Copy link
Contributor

Huh, thanks for checking. That codegen is kinda odd.

Copy link
Contributor

@KodrAus KodrAus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I just left one comment to include your rationale for the new unsafe block inline. r=me after that.

library/alloc/src/collections/binary_heap.rs Show resolved Hide resolved
@SkiFire13
Copy link
Contributor Author

I've added the safety comment, I hope it's clear enough. I'm tempted to document other unsafe usages as well, but that should probably have its own PR.

r=KodrAus

@Xanewok
Copy link
Member

Xanewok commented Nov 11, 2020

@bors r=KodrAus

(You need to mention the bot and be either in the reviewer set or be delegated to, see cheatsheet at https://bors.rust-lang.org/)

@bors
Copy link
Contributor

bors commented Nov 11, 2020

📌 Commit 387568c has been approved by KodrAus

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 11, 2020
@bors
Copy link
Contributor

bors commented Nov 12, 2020

⌛ Testing commit 387568c with merge d75378aab28aecf8b1521a40409823dd9c0f928b...

@bors
Copy link
Contributor

bors commented Nov 12, 2020

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Nov 12, 2020
@kennytm
Copy link
Member

kennytm commented Nov 12, 2020

@bors retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 12, 2020
m-ou-se added a commit to m-ou-se/rust that referenced this pull request Nov 12, 2020
Improve BinaryHeap performance

By changing the condition in the loops from `child < end` to `child < end - 1` we're guaranteed that `right = child + 1 < end` and since finding the index of the biggest sibling can be done with an arithmetic operation we can remove a branch from the loop body. The case where there's no right child, i.e. `child == end - 1` is instead handled outside the loop, after it ends; note that if the loops ends early we can use `return` instead of `break` since the check `child == end - 1` will surely fail.

I've also removed a call to `<[T]>::swap` that was hiding a bound check that [wasn't being optimized by LLVM](https://godbolt.org/z/zrhdGM).

A quick benchmarks on my pc shows that the gains are pretty significant:

|name                 |before ns/iter  |after ns/iter  |diff ns/iter  |diff %    |speedup |
|---------------------|----------------|---------------|--------------|----------|--------|
|find_smallest_1000   | 352,565        | 260,098       |     -92,467  | -26.23%  | x 1.36 |
|from_vec             | 676,795        | 473,934       |    -202,861  | -29.97%  | x 1.43 |
|into_sorted_vec      | 469,511        | 304,275       |    -165,236  | -35.19%  | x 1.54 |
|pop                  | 483,198        | 373,778       |    -109,420  | -22.64%  | x 1.29 |

The other 2 benchmarks for `BinaryHeap` (`peek_mut_deref_mut` and `push`) weren't impacted and as such didn't show any significant change.
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 12, 2020
Rollup of 7 pull requests

Successful merges:

 - rust-lang#76730 (Fix rustdoc rendering of by-value mutable arguments in async fn)
 - rust-lang#78836 (Implement destructuring assignment for structs and slices)
 - rust-lang#78857 (Improve BinaryHeap performance)
 - rust-lang#78950 (Add asm register information for SPIR-V)
 - rust-lang#78970 (update rustfmt to v1.4.25)
 - rust-lang#78972 (Update cargo)
 - rust-lang#78987 (extend min_const_generics param ty tests)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 4088981 into rust-lang:master Nov 13, 2020
@rustbot rustbot added this to the 1.49.0 milestone Nov 13, 2020
@SkiFire13 SkiFire13 deleted the bheap-opt branch November 13, 2020 09:32
sekineh pushed a commit to sekineh/binary-heap-plus-rs that referenced this pull request Jan 6, 2021
* Remove useless branches from sift_down_range loop

* Remove branches from sift_down_to_bottom loop

* Remove useless bound checks from into_sorted_vec
bors bot added a commit to rust-itertools/itertools that referenced this pull request Jan 20, 2021
518: Remove unpredicted branch from kmerge::sift_down r=jswrenn a=SkiFire13

This is pretty much a port from rust-lang/rust#78857

Compared with the previous implementation, this adds more branches for bound checks which aren't present on the stdlib version, however they should be predicted almost always. The speedup should come from the removal of an unpredictable branch from the loop body, in favor of boolean arithmetic.

The benchmarks seem to agree:

```
before:

test kmerge default ... bench:        6812 ns/iter (+/- 18)
test kmerge tenway ... bench:      223673 ns/iter (+/- 769)

after:

test kmerge default ... bench:        6212 ns/iter (+/- 43)
test kmerge tenway ... bench:      190700 ns/iter (+/- 419)
```

Co-authored-by: Giacomo Stevanato <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants