
Reduce sort bucket exchange overhead #1635

Merged (1 commit) Aug 10, 2022

Conversation

ronawho (Contributor) commented Jul 26, 2022

Previously, each task would write its local bucket counts directly to
the global counts in transposed order. Each task has 64K buckets with
64K/numLocales going to each locale. With 8K aggregation buffers, this
meant that beyond 8 locales we'd have single-use aggregators whose
payloads shrank as the locale count grew. As a result, we were paying the
non-trivial aggregator creation cost without amortizing it over
multiple uses. And worse, each task flushed in lockstep to locale
0, then 1, and so on, so many single-use aggregators fired in lockstep,
which created bottlenecks at scale.
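The arithmetic behind the problem can be sketched as follows (an illustrative Python sketch, not the actual Chapel code; the 64K bucket count and 8K buffer size are taken from the description above):

```python
# Why per-task transposed writes under-fill aggregation buffers as the
# locale count grows. Figures match the PR description.
NUM_BUCKETS = 64 * 1024      # buckets per task
AGG_BUF_SIZE = 8 * 1024      # aggregation buffer capacity (elements)

def per_locale_fill(num_locales: int) -> float:
    """Fraction of an aggregation buffer filled by one task's counts
    destined for a single locale."""
    counts_per_locale = NUM_BUCKETS // num_locales
    return min(counts_per_locale / AGG_BUF_SIZE, 1.0)

for n in (4, 8, 16, 64, 512):
    print(f"{n:4d} locales -> buffer {per_locale_fill(n):.2%} full")
# At 8 locales each destination gets exactly one full buffer (64K/8 = 8K);
# beyond that, every aggregator is created, used once with a partial
# payload, and flushed -- paying setup cost without amortizing it.
```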

Improve that here by combining all the buckets from local tasks on a
node and chunking up the write. This allows us to send fewer and
fuller aggregation buffers and write to all locales simultaneously
instead of doing so in lockstep. This reduces sort startup costs,
particularly at scale.
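The combining step can be sketched like this (again an illustrative Python sketch under the assumptions above, not the Chapel implementation; `combine_and_chunk` is a hypothetical name):

```python
# Combine all local tasks' bucket counts on a node, then slice the result
# into one contiguous chunk per destination locale, so sends are fewer and
# fuller and destinations can be written independently, not in lockstep.
from typing import List

def combine_and_chunk(task_counts: List[List[int]],
                      num_locales: int) -> List[List[int]]:
    num_buckets = len(task_counts[0])
    # Step 1: sum counts across all tasks on this node.
    combined = [sum(t[b] for t in task_counts) for b in range(num_buckets)]
    # Step 2: one chunk of num_buckets/num_locales combined counts per
    # destination locale, ready for a single bulk write instead of many
    # tiny per-task aggregated writes.
    per_locale = num_buckets // num_locales
    return [combined[i * per_locale:(i + 1) * per_locale]
            for i in range(num_locales)]

# Two tasks, 8 buckets, 4 locales -> 4 chunks of 2 combined counts each.
chunks = combine_and_chunk([[1] * 8, [2] * 8], 4)
```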

At 40 nodes on a Cray CS with HDR InfiniBand, this takes a trivial-sized
sort on 32-bit uints from 0.60s down to 0.15s. This is the largest
InfiniBand system I have access to, but I also simulated 240 nodes by
running 6x oversubscribed on these 40 nodes. For the same trivial
problem size, this takes us from 90.0s down to 1.7s. This isn't really a
fair timing since oversubscription exaggerates overheads, but if we're
at 1.7s oversubscribed we'll be even lower on real hardware.

Back when I had access to a large InfiniBand system, I saw a trivial sort
take 20-25s at 512 nodes. Simulating 512 nodes on this 40-node system,
the trivial sort takes 7s. That's with over 12x oversubscription, so I
would guess on real hardware it'd be down around a second or two.

I believe we can further improve the bucket exchange by doing RDMA
writes of the buffers in non-transposed order and then transposing
before/after scanning, but that's a larger algorithmic change, and this
is a substantial improvement on its own, so I think it's still a
worthwhile stepping stone with a simpler implementation.

Perf results in table form:

| Config                     | Before | Now   |
| -------------------------- | -----: | ----: |
|  40 node IB                |   0.6s | 0.15s |
| 240 node IB (  6x oversub) |  90.0s | 1.70s |
| 512 node IB (~12x oversub) | 560.0s | 7.10s |
| 512 node XC                |   3.0s | 0.50s |

Part of #1404

ronawho (Contributor, Author) commented Jul 26, 2022

Drafting only because I'm not around if there's any fallout from this. I feel pretty confident about the change if you want to try it out, but if you want to wait I'll be back on the 8th.

mhmerrill (Contributor) commented

> Drafting only because I'm not around if there's any fallout from this. I feel pretty confident about the change if you want to try it out, but if you want to wait I'll be back on the 8th.

I will also be out until Aug 8th.

ronawho marked this pull request as ready for review August 10, 2022 14:17
reuster986 (Collaborator) left a comment

This is great!

Ethan-DeBandi99 (Contributor) left a comment

Based on Elliot's description here, this looks really good. The performance gains are exciting as well.

stress-tess (Member) left a comment

This is awesome! What a great speedup

stress-tess merged commit 8ff5d48 into Bears-R-Us:master Aug 10, 2022
ronawho deleted the better-bucket-exchange branch August 10, 2022 21:44