Part of #1404: Avoid flushing aggregators in lockstep #1826
Previously, when flushing, all aggregators would flush to locale 0, then locale 1, and so on in lockstep. This created a many-to-one bottleneck, especially at scale. This PR updates aggregators to start flushing at the locale they last aggregated a value for. Keeping track of the last locale should be trivially cheap, and getting rid of the lockstep communication pattern should improve the performance of all aggregated operations, including sort, at higher scales.
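To illustrate why the starting locale matters, here is a small Python model of the flush order. It is only a sketch, not the actual aggregator code. The names `flush_order` and `max_fan_in` and the `last_locale` values are hypothetical. The model also assumes that a flush wraps around to visit every locale and that aggregators advance in synchronized steps, which is a simplification.

```python
from collections import Counter


def flush_order(num_locales, start=0):
    """Destination locales in the order one flush visits them.

    Assumption for this sketch: the flush begins at `start` and wraps
    around until every locale has been visited once.
    """
    return [(start + i) % num_locales for i in range(num_locales)]


def max_fan_in(orders):
    """Largest number of senders hitting the same destination in one step."""
    worst = 0
    for step_targets in zip(*orders):
        worst = max(worst, max(Counter(step_targets).values()))
    return worst


num_locales = 4

# Old behavior: every aggregator starts its flush at locale 0.
lockstep = [flush_order(num_locales) for _ in range(num_locales)]

# New behavior (illustrative): each aggregator starts at the locale it last
# aggregated a value for. These values are made up and happen to be distinct.
last_locale = [2, 3, 0, 1]
staggered = [flush_order(num_locales, s) for s in last_locale]

print("lockstep fan-in: ", max_fan_in(lockstep))
print("staggered fan-in:", max_fan_in(staggered))
```

In this model the lockstep order sends every aggregator to the same destination at each step, while distinct starting points spread the destinations out. Real last-locale values will not always be distinct, so the actual benefit depends on the access pattern.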
I'll have more comprehensive performance results later, but on an older 240-node SGI InfiniBand machine I see a small sort (512 KiB per node) go from ~1.6s to ~0.7s and a medium sort (512 MiB per node) go from ~2.5s to ~1.8s.
Part of #1404