Reduce sort bucket exchange overhead #1635
Conversation
Previously, each task would write its local bucket counts directly to the global counts in transposed order. Each task has 64K buckets, with 64K/numLocales going to each locale. With 8K aggregation buffers, this meant that at over 8 locales we'd have single-use aggregators whose payloads got very small at higher scales, so we were paying the non-trivial aggregator creation cost without amortizing it over multiple uses. And worse, each task was flushing in lockstep to locale 0, then 1, and so on, so many single-use aggregators were being used in lockstep, which created bottlenecks at scale.

Improve that here by combining all the buckets from the local tasks on a node and chunking up the write. This lets us send fewer, fuller aggregation buffers and write to all locales simultaneously instead of in lockstep, which reduces sort startup costs, particularly at scale.

At 40 nodes on a Cray CS with HDR InfiniBand, this takes a trivially sized sort on 32-bit uints from 0.60s down to 0.15s. That is the largest InfiniBand system I have access to, but I also simulated 240 nodes by running 6x oversubscribed on those 40 nodes. For the same trivial problem size this takes us from 90.0s down to 1.7s. This isn't really a fair timing since overheads will be exaggerated, but if we're at 1.7s oversubscribed we'll be even lower on real hardware.

Back when I had access to a large InfiniBand system I saw a trivial sort take 20-25s at 512 nodes. Simulating 512 nodes on this 40-node system, the trivial sort takes 7s. That's with over 12x oversubscription, so I would guess on real hardware it'd be down around a second or two.

I believe we can further improve the bucket exchange by doing RDMA writes of the buffers in non-transposed order and then transposing before/after scanning, but that's a larger algorithmic change, and this one is a substantial improvement, so I think it's still a worthwhile stepping stone with a simpler implementation.

Perf results in table form:

| Config                     | Before | Now   |
| -------------------------- | -----: | ----: |
| 40 node IB                 |   0.6s | 0.15s |
| 240 node IB (6x oversub)   |  90.0s | 1.70s |
| 512 node IB (~12x oversub) | 560.0s | 7.10s |
| 512 node XC                |   3.0s | 0.50s |

Part of #1404
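To make the new write pattern concrete, here is a minimal Chapel sketch. This is not the code from this PR: `tasksPerLocale`, `countsPerDest`, `nodeBuf`, and the toy `globalCounts` layout are all made up for illustration. The idea it shows is the one described above: gather every local task's counts into one node-level buffer grouped by destination locale, then issue one bulk write per destination with all destinations in flight at once.

```chapel
// Illustrative sketch only (hypothetical names/layout, not the real code):
// combine each task's counts into one node-local buffer grouped by
// destination locale, then issue one bulk write per destination,
// to all destinations concurrently.
use BlockDist;

config const tasksPerLocale = 4;
config const countsPerDest  = 16;  // counts each task sends to each locale

proc main() {
  const chunk = tasksPerLocale * countsPerDest;  // per-node payload per dest
  // Toy layout: destination locale d owns one contiguous block per source.
  const D = blockDist.createDomain(0..#(numLocales*numLocales*chunk));
  var globalCounts: [D] int;

  coforall loc in Locales do on loc {
    const myId = here.id;

    // 1) Gather every local task's counts into one node-local buffer,
    //    grouped by destination locale (dummy values stand in for counts).
    var nodeBuf: [0..#(numLocales*chunk)] int;
    coforall tid in 0..#tasksPerLocale do
      for dest in 0..#numLocales do
        for i in 0..#countsPerDest do
          nodeBuf[dest*chunk + tid*countsPerDest + i] = myId + tid + i;

    // 2) One bulk slice assignment per destination locale, issued to all
    //    destinations at once instead of locale 0, then 1, ... in lockstep.
    coforall dest in 0..#numLocales {
      const dstLo = (dest*numLocales + myId) * chunk;
      globalCounts[dstLo..#chunk] = nodeBuf[(dest*chunk)..#chunk];
    }
  }
}
```

The real implementation still has to keep per-task counts in the layout the scan expects; the sketch only shows why batching at the node level produces fewer, fuller transfers than per-task aggregators flushing destination by destination.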
Drafting only because I'm not around if there's any fallout from this. I feel pretty confident about the change if you want to try it out, but if you want to wait I'll be back on the 8th.
I will also be out until Aug 8th.
This is great!
Based on Elliot's description here this looks really good. The performance gains are exciting as well.
This is awesome! What a great speedup