Don't put all your eggs in one basket!
Sorting: the `mpisort!` function provides distributed MPI-based sorting algorithms following the standard Julia `Base.sort!` signature; at the moment, one optimised algorithm is provided:

`SIHSort` - a sampling with interpolated histograms sorting algorithm (pronounced *sigh sort*, like anything MPI-related), optimised for minimum inter-rank communication and memory footprint. Features:
- Does not require that the distributed data fit into the memory of a single node. No IO either.
- Works for any comparison-based data, with additional optimisations for numeric elements.
- Optimised for minimum MPI communication; can use Julia threads on each shared-memory node.
- The node-local arrays may have different sizes; sorting will try to balance the number of elements held by each MPI rank.
- Works with any `AbstractVector`, including accelerator arrays such as GPUs (see the GPU note below).
- Implements the standard Julia `sort!` API, and naturally works for custom data, comparisons, orderings, etc.
```julia
# File: mpisort.jl
# Run as: mpiexec -n 4 julia --threads=2 mpisort.jl
using MPI
using MPISort
using Random
# Initialise MPI, get communicator for all ranks, rank index, number of ranks
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)
# Generate local array on each MPI rank - even with different number of elements
rng = Xoshiro(rank)
num_elements = 50 + rank * 2
local_array = rand(rng, 1:500, num_elements)
# Sort arrays across all MPI ranks
alg = SIHSort(comm)
sorted_local_array = mpisort!(local_array; alg=alg)
# Print each local array sequentially
for i in 0:nranks - 1
rank == i && @show rank sorted_local_array alg.stats
MPI.Barrier(comm)
end
```
Note: because the data is redistributed between nodes, the vector size must change, hence it differs from the in-place `Base.sort!`. The input vector is mutated, but another vector - with potentially different size and elements - is returned. This is the reason for a different function signature (`mpisort!` with a return value); however, it has the exact same inputs as `Base.sort!`.
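A minimal usage pattern reflecting this (a sketch assuming the setup from the example above; the variable names are illustrative):

```julia
# The input is mutated, but the redistributed result may have a different length,
# so always keep working with the returned vector
new_local = mpisort!(local_array; alg=SIHSort(comm))
# length(new_local) may differ from the original length(local_array)
local_array = new_local    # rebind to the redistributed, globally sorted data
```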
Different sorting settings:
```julia
# Automatically uses MPI.COMM_WORLD as communicator; doesn't save sorting stats
sorted_local_array = mpisort!(local_array; alg=SIHSort())

# Reverse sorting; specify communicator explicitly
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm), rev=true)

# Specify key to sort by; see https://docs.julialang.org/en/v1/base/sort/
sorted_local_array = mpisort!(local_array; alg=SIHSort(), by=x->x["key"])

# Different ordering; see https://docs.julialang.org/en/v1/base/sort/#Alternate-orderings
sorted_local_array = mpisort!(local_array; alg=SIHSort(), order=Base.Order.Reverse)

# Save sorting stats
alg = SIHSort(comm)
sorted_local_array = mpisort!(local_array; alg=alg)
@show alg.stats.splitters      # `nranks - 1` elements splitting arrays between nodes
@show alg.stats.num_elements   # `nranks` integers specifying number of elements on each node

# Use a different in-place local sorter
alg = SIHSort(comm, nothing)            # Default: standard Base.sort!
alg = SIHSort(comm, QuickSort)          # Specify algorithm, passed to Base.sort!(...; alg=<Algorithm>)
alg = SIHSort(comm, v -> mysorter!(v))  # Pass any function that sorts a local vector in-place
```
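Since the standard `sort!` keywords are supported, custom data can be sorted too. Below is a small sketch; the `Particle` type is an illustrative assumption, and it reuses `comm`, `rank`, `rng` and `num_elements` from the first example (isbits element types are assumed to be exchangeable over MPI):

```julia
# Illustrative sketch: sort custom isbits structs across ranks by a numeric key
struct Particle
    id::Int32
    energy::Float64
end

particles = [Particle(Int32(rank * 1000 + i), rand(rng)) for i in 1:num_elements]
sorted_particles = mpisort!(particles; alg=SIHSort(comm), by=p -> p.energy)
```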
Only optimised collective MPI communication is used, in order: Gather, Bcast, Reduce, Bcast, Alltoall, Allreduce, Alltoallv. I am not aware of a non-IO-based algorithm with less communication (if you know of one, please open an issue!).
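As a rough illustration (not the package's actual code; the function name `exchange_by_splitters` is made up), the final `Alltoallv` step of such splitter-based algorithms can be sketched like this:

```julia
using MPI

# Illustrative sketch only: once global splitters are known, each rank partitions
# its locally sorted data and the pieces are exchanged with a single MPI.Alltoallv!
function exchange_by_splitters(local_sorted::Vector{Int}, splitters::Vector{Int}, comm)
    # Destination boundaries: elements <= splitters[i] go to rank i - 1
    bounds = [searchsortedlast(local_sorted, s) for s in splitters]
    send_counts = Int32.(diff([0; bounds; length(local_sorted)]))

    # Tell every rank how many elements to expect, then exchange the data itself
    recv_counts = MPI.Alltoall(MPI.UBuffer(send_counts, 1), comm)
    received = Vector{Int}(undef, sum(recv_counts))
    MPI.Alltoallv!(MPI.VBuffer(local_sorted, send_counts),
                   MPI.VBuffer(received, recv_counts), comm)

    # A real implementation would merge the already-sorted chunks; re-sorting is
    # the simple (slower) option used here for brevity
    return sort!(received)
end
```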
SIHSort's communication and memory complexity follow the theoretical results of histogram sort with sampling [1] (see the references below); except for the final redistribution into a single new array per rank, it needs little auxiliary storage.
`SIHSort` is generic over the input array type, so it can work with GPU arrays - e.g. Julia `CUDA.CuArray` - and benefit from MPI-configured, optimised inter-GPU interconnects.
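A minimal GPU sketch (assuming CUDA.jl is installed and the MPI build is CUDA-aware, so device buffers can be passed to MPI directly), mirroring the first example:

```julia
using CUDA
using MPI
using MPISort
using Random

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Upload the rank-local data to the GPU, then sort across all ranks as before
local_array = CuArray(rand(Xoshiro(rank), Int32(1):Int32(500), 50 + rank * 2))
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm))
```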
However, to be fully performant, it needs:
- A single-node `sort` implementation - at the moment, only `CUDA.CuArray` has one; there is great potential in a `KernelAbstractions.jl` sorter, we really need one!
- A fully GPU-based `searchsortedlast` implementation; we do not have one yet, so we rely on a binary search where each tested element is copied to the CPU (!), which of course is not great. More great potential in some optimised `KernelAbstractions.jl` kernels!
While it works currently, it is not ideal: sorting 1,000,000 `Int32` values split across 2 MPI ranks takes ~0.015 s on my Intel i9 CPU and ~0.034 s on my NVidia Quadro RTX4000 with Max-Q.
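For reference, a timing harness along these lines can reproduce such measurements (a sketch; run with e.g. `mpiexec -n 2 julia timing.jl`, where the file name is illustrative):

```julia
using MPI
using MPISort
using Random

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# 1,000,000 Int32 values split evenly across the ranks
local_array = rand(Xoshiro(rank), Int32, 1_000_000 ÷ nranks)

MPI.Barrier(comm)
t0 = MPI.Wtime()
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm))
MPI.Barrier(comm)
rank == 0 && println("mpisort! took $(MPI.Wtime() - t0) s")
```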
This algorithm builds on prior art:
- [1] Harsh V, Kale L, Solomonik E. Histogram sort with sampling. Followed the main ideas and theoretical results, but with deterministic sampling and original communication and interpolation optimisations.
- [2] Sundar H, Malhotra D, Biros G. Hyksort: a new variant of hypercube quicksort on distributed memory architectures.
- [3] Shi H, Schaeffer J. Parallel sorting by regular sampling.
- [4] Solomonik E, Kale LV. Highly scalable parallel sorting.
- [5] John Lapeyre, integer base-2 logarithm - https://github.com/jlapeyre/ILog2.jl.
- [6] Byrne S, Wilcox LC, Churavy V. MPI.jl: Julia bindings for the Message Passing Interface. Absolute heroes who made MPI a joy to use in Julia.
`MPISort.jl` is MIT-licensed. Enjoy.