implements weighted shuffle using N-ary tree #259

behzadnouri · 2024-03-14T23:49:40Z

This is port of firedancer's implementation of weighted shuffle:
https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c

Problem

#185 implemented weighted shuffle using binary tree. Though asymptotically a
binary tree has better performance, compared to a Fenwick tree, it is
less cache local resulting in smaller improvements and in particular
slower WeightedShuffle::new.

Summary of Changes

In order to improve cache locality and reduce the overheads of
traversing the tree, this commit instead uses a generalized N-ary tree
with fanout of 16, showing significant improvements in both
WeightedShuffle::new and WeightedShuffle::shuffle.

With 4000 weights:

N-ary tree (fanout 16):

test bench_weighted_shuffle_new     ... bench:      36,244 ns/iter (+/- 243)
test bench_weighted_shuffle_shuffle ... bench:     149,082 ns/iter (+/- 1,474)

Binary tree:

test bench_weighted_shuffle_new     ... bench:      58,514 ns/iter (+/- 229)
test bench_weighted_shuffle_shuffle ... bench:     269,961 ns/iter (+/- 16,446)

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:      39,413 ns/iter (+/- 179)
test bench_weighted_shuffle_shuffle ... bench:     364,771 ns/iter (+/- 2,078)

The improvements become even more significant as there are more items to
shuffle. With 20_000 weights:

N-ary tree (fanout 16):

test bench_weighted_shuffle_new     ... bench:     200,659 ns/iter (+/- 4,395)
test bench_weighted_shuffle_shuffle ... bench:     941,928 ns/iter (+/- 26,492)

Binary tree:

test bench_weighted_shuffle_new     ... bench:     881,114 ns/iter (+/- 12,343)
test bench_weighted_shuffle_shuffle ... bench:   1,822,257 ns/iter (+/- 12,772)

Fenwick tree:

test bench_weighted_shuffle_new     ... bench:     276,936 ns/iter (+/- 14,692)
test bench_weighted_shuffle_shuffle ... bench:   2,644,713 ns/iter (+/- 49,252)

codecov-commenter · 2024-03-25T08:06:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.9%. Comparing base (a916edb) to head (55d94d1).

Additional details and impacted files

@@            Coverage Diff            @@
##           master     #259     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         840      840             
  Lines      228105   228102      -3     
=========================================
- Hits       186837   186832      -5     
- Misses      41268    41270      +2

This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c anza-xyz#185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it is less cache local resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252)

gregcusack

those performance numbers look great! why did you settle on FANOUT of 16? just performed the best when you tested or?? Looks like firedancer is using 9. either way lgtm!

behzadnouri · 2024-03-26T05:20:27Z

why did you settle on FANOUT of 16? just performed the best when you tested or??

yeah, ran the benchmarks with different values of BIT_SHIFT and 4 seemed fastest.

mergify · 2024-03-26T07:28:11Z

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c #185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it has less cache locality resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252) (cherry picked from commit 30eecd6)

#429) implements weighted shuffle using N-ary tree (#259) This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c #185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it has less cache locality resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252) (cherry picked from commit 30eecd6) Co-authored-by: behzad nouri <[email protected]>

…-xyz#259) (anza-xyz#429) implements weighted shuffle using N-ary tree (anza-xyz#259) This is port of firedancer's implementation of weighted shuffle: https://github.com/firedancer-io/firedancer/blob/3401bfc26/src/ballet/wsample/fd_wsample.c anza-xyz#185 implemented weighted shuffle using binary tree. Though asymptotically a binary tree has better performance, compared to a Fenwick tree, it has less cache locality resulting in smaller improvements and in particular slower WeightedShuffle::new. In order to improve cache locality and reduce the overheads of traversing the tree, this commit instead uses a generalized N-ary tree with fanout of 16, showing significant improvements in both WeightedShuffle::new and WeightedShuffle::shuffle. With 4000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 36,244 ns/iter (+/- 243) test bench_weighted_shuffle_shuffle ... bench: 149,082 ns/iter (+/- 1,474) Binary tree: test bench_weighted_shuffle_new ... bench: 58,514 ns/iter (+/- 229) test bench_weighted_shuffle_shuffle ... bench: 269,961 ns/iter (+/- 16,446) Fenwick tree: test bench_weighted_shuffle_new ... bench: 39,413 ns/iter (+/- 179) test bench_weighted_shuffle_shuffle ... bench: 364,771 ns/iter (+/- 2,078) The improvements become even more significant as there are more items to shuffle. With 20_000 weights: N-ary tree (fanout 16): test bench_weighted_shuffle_new ... bench: 200,659 ns/iter (+/- 4,395) test bench_weighted_shuffle_shuffle ... bench: 941,928 ns/iter (+/- 26,492) Binary tree: test bench_weighted_shuffle_new ... bench: 881,114 ns/iter (+/- 12,343) test bench_weighted_shuffle_shuffle ... bench: 1,822,257 ns/iter (+/- 12,772) Fenwick tree: test bench_weighted_shuffle_new ... bench: 276,936 ns/iter (+/- 14,692) test bench_weighted_shuffle_shuffle ... bench: 2,644,713 ns/iter (+/- 49,252) (cherry picked from commit 30eecd6) Co-authored-by: behzad nouri <[email protected]>

behzadnouri mentioned this pull request Mar 14, 2024

implements weighted shuffle using binary tree #185

Merged

behzadnouri force-pushed the weighted-shuffle-radix branch 2 times, most recently from f0b4ac2 to 55d94d1 Compare March 25, 2024 06:32

behzadnouri force-pushed the weighted-shuffle-radix branch from 55d94d1 to e8e5583 Compare March 25, 2024 10:19

behzadnouri requested a review from gregcusack March 25, 2024 10:20

gregcusack approved these changes Mar 26, 2024

View reviewed changes

behzadnouri merged commit 30eecd6 into anza-xyz:master Mar 26, 2024
37 checks passed

behzadnouri deleted the weighted-shuffle-radix branch March 26, 2024 05:21

behzadnouri added the v1.18 label Mar 26, 2024

mergify bot mentioned this pull request Mar 26, 2024

v1.18: implements weighted shuffle using N-ary tree (backport of #259) #429

Merged

willhickey mentioned this pull request Mar 28, 2024

v1.18 commits - please ignore #475

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implements weighted shuffle using N-ary tree #259

implements weighted shuffle using N-ary tree #259

behzadnouri commented Mar 14, 2024 •

edited

Loading

codecov-commenter commented Mar 25, 2024

gregcusack left a comment

behzadnouri commented Mar 26, 2024

mergify bot commented Mar 26, 2024

implements weighted shuffle using N-ary tree #259

implements weighted shuffle using N-ary tree #259

Conversation

behzadnouri commented Mar 14, 2024 • edited Loading

Problem

Summary of Changes

codecov-commenter commented Mar 25, 2024

Codecov Report

gregcusack left a comment

Choose a reason for hiding this comment

behzadnouri commented Mar 26, 2024

mergify bot commented Mar 26, 2024

behzadnouri commented Mar 14, 2024 •

edited

Loading