Reduce sort bucket exchange overhead #1635
Conversation
Previously, each task would write its local bucket counts directly to the global counts in transposed order. Each task has 64K buckets, with 64K/numLocales going to each locale. With 8K aggregation buffers, this meant that at over 8 locales we'd have single-use aggregators whose payloads got very small at higher scales, so we were paying the non-trivial aggregator creation cost without amortizing it over multiple uses. And worse, each task was flushing in lockstep to locale 0, then 1, and so on, so many single-use aggregators were being used in lockstep, which created bottlenecks at scale.

Improve that here by combining all the buckets from the local tasks on a node and chunking up the write. This lets us send fewer, fuller aggregation buffers and write to all locales simultaneously instead of in lockstep, which reduces sort startup costs, particularly at scale.

At 40 nodes on a Cray CS with HDR InfiniBand, this takes a trivially sized sort on 32-bit uints from 0.60s down to 0.15s. That is the largest InfiniBand system I have access to, but I also simulated 240 nodes by running 6x oversubscribed on those 40 nodes. For the same trivial problem size this takes us from 90.0s down to 1.7s. This isn't really a fair timing since overheads will be exaggerated, but if we're at 1.7s oversubscribed we'll be even lower on real hardware.

Back when I had access to a large InfiniBand system I saw a trivial sort take 20-25s at 512 nodes. Simulating 512 nodes on this 40-node system, the trivial sort takes 7s. That's with over 12x oversubscription, so I would guess on real hardware it'd be down around a second or two.

I believe we can further improve the bucket exchange by doing RDMA writes of the buffers in non-transposed order and then transposing before/after scanning, but that's a larger algorithmic change, and this one is a substantial improvement, so I think it's still a worthwhile stepping stone with a simpler implementation.

Perf results in table form:

| Config                     | Before | Now   |
| -------------------------- | -----: | ----: |
| 40 node IB                 |   0.6s | 0.15s |
| 240 node IB (6x oversub)   |  90.0s | 1.70s |
| 512 node IB (~12x oversub) | 560.0s | 7.10s |
| 512 node XC                |   3.0s | 0.50s |

Part of #1404
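To make the new write pattern concrete, here is a minimal Chapel sketch. This is not the code from this PR: `tasksPerLocale`, `countsPerDest`, `nodeBuf`, and the toy `globalCounts` layout are all made up for illustration. The idea it shows is the one described above: gather every local task's counts into one node-level buffer grouped by destination locale, then issue one bulk write per destination with all destinations in flight at once.

```chapel
// Illustrative sketch only (hypothetical names/layout, not the real code):
// combine each task's counts into one node-local buffer grouped by
// destination locale, then issue one bulk write per destination,
// to all destinations concurrently.
use BlockDist;

config const tasksPerLocale = 4;
config const countsPerDest  = 16;  // counts each task sends to each locale

proc main() {
  const chunk = tasksPerLocale * countsPerDest;  // per-node payload per dest
  // Toy layout: destination locale d owns one contiguous block per source.
  const D = blockDist.createDomain(0..#(numLocales*numLocales*chunk));
  var globalCounts: [D] int;

  coforall loc in Locales do on loc {
    const myId = here.id;

    // 1) Gather every local task's counts into one node-local buffer,
    //    grouped by destination locale (dummy values stand in for counts).
    var nodeBuf: [0..#(numLocales*chunk)] int;
    coforall tid in 0..#tasksPerLocale do
      for dest in 0..#numLocales do
        for i in 0..#countsPerDest do
          nodeBuf[dest*chunk + tid*countsPerDest + i] = myId + tid + i;

    // 2) One bulk slice assignment per destination locale, issued to all
    //    destinations at once instead of locale 0, then 1, ... in lockstep.
    coforall dest in 0..#numLocales {
      const dstLo = (dest*numLocales + myId) * chunk;
      globalCounts[dstLo..#chunk] = nodeBuf[(dest*chunk)..#chunk];
    }
  }
}
```

The real implementation still has to keep per-task counts in the layout the scan expects; the sketch only shows why batching at the node level produces fewer, fuller transfers than per-task aggregators flushing destination by destination.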
Drafting only because I'm not around if there's any fallout from this. I feel pretty confident about the change if you want to try it out, but if you want to wait I'll be back on the 8th.
I will also be out until Aug 8th.
This is great!
Based on Elliot's description here this looks really good. The performance gains are exciting as well.
This is awesome! What a great speedup