Optimizing Parallel Performance
NTPoly is specifically designed to be run on large matrices using large parallel machines. In this section, we will describe some tips for maximizing parallel performance.
When writing a program using NTPoly, the first step is to initialize the process grid using the `ConstructProcessGrid` subroutine. This subroutine takes four arguments: an MPI communicator (usually `MPI_COMM_WORLD`), the number of process rows, the number of process columns, and the number of process slices. As a result, matrices are stored on a three dimensional process grid, according to the algorithm in reference [1].
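As an illustration, here is a minimal Fortran sketch of this initialization. The module and subroutine names follow NTPoly's Fortran interface, but treat the exact signatures (including `DestructProcessGrid`) as assumptions and check the documentation for your version:

```fortran
PROGRAM GridExample
  USE ProcessGridModule, ONLY : ConstructProcessGrid, DestructProcessGrid
  USE MPI
  IMPLICIT NONE
  INTEGER :: ierr
  !! Example dimensions: a 4 x 4 x 1 grid, for 16 MPI tasks in total.
  INTEGER, PARAMETER :: rows = 4, columns = 4, slices = 1

  CALL MPI_Init(ierr)
  !! Build the three dimensional process grid described above.
  CALL ConstructProcessGrid(MPI_COMM_WORLD, rows, columns, slices)
  !! ... construct distributed matrices and call solvers here ...
  CALL DestructProcessGrid()
  CALL MPI_Finalize(ierr)
END PROGRAM GridExample
```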
To maximize parallel performance, we first recommend setting the number of process slices to 1, and choosing the number of process rows and columns to be either equal or with one exactly twice the other. Then, when you find that parallel performance is degrading at scale, begin increasing the number of process slices. For example, on the K computer we found that keeping the number of process slices equal to 1 usually works well up through 512 nodes, but beyond that it is better to increase the number of slices.
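The sketch below shows one hypothetical way to derive such a grid shape. This helper is not part of NTPoly; it simply picks the most square rows-by-columns factorization of a task count, which for powers of two gives either a square grid or a 2:1 shape as recommended above:

```fortran
!! Hypothetical helper (not part of NTPoly): choose the most square
!! rows x columns factorization of a given number of MPI tasks.
PROGRAM PickGridShape
  IMPLICIT NONE
  INTEGER, PARAMETER :: num_tasks = 32
  INTEGER :: ii, rows, columns

  rows = 1
  !! The largest factor not exceeding sqrt(num_tasks) gives the most
  !! balanced split; its cofactor becomes the column count.
  DO ii = 1, INT(SQRT(REAL(num_tasks)))
     IF (MOD(num_tasks, ii) .EQ. 0) rows = ii
  END DO
  columns = num_tasks / rows

  !! For powers of two this yields a square or 2:1 grid,
  !! e.g. 16 -> 4 x 4, 32 -> 4 x 8.
  WRITE(*,*) "rows =", rows, "columns =", columns
END PROGRAM PickGridShape
```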
NTPoly uses an MPI+OpenMP hybrid parallelization scheme. Our numerical experiments on the K computer show that using one MPI task per node and one OpenMP thread per core provides the best overall performance.
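In practice this means launching one MPI task per node through your job scheduler and setting `OMP_NUM_THREADS` to the number of cores per node. The sketch below shows a common startup pattern for hybrid MPI+OpenMP codes; it is not NTPoly-specific, and it assumes funneled thread support is sufficient:

```fortran
PROGRAM HybridStartup
  USE MPI
  IMPLICIT NONE
  INTEGER :: provided, ierr

  !! Request funneled thread support so OpenMP regions are safe
  !! alongside MPI communication performed by the main thread.
  CALL MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  !! ... ConstructProcessGrid, matrix setup, and solver calls ...
  CALL MPI_Finalize(ierr)
END PROGRAM HybridStartup
```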
In massively parallel simulations, load balancing between processors plays an important role in determining overall performance. In NTPoly, load imbalance arises when certain processes hold more matrix elements than others (depending on the sparsity pattern).
One simple way to alleviate this is to shuffle the rows and columns of a matrix while calculations are being performed. NTPoly provides this capability through the `LoadBalancerModule`. First, construct a random permutation using the `ConstructRandomPermutation` subroutine. Then the constructed random permutation can be passed to the solver parameter constructor, where it will be used to load balance the subsequent calculations. See the examples for more details.
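A minimal sketch of this workflow follows. The module and type names mirror NTPoly's Fortran API, but the exact signatures (in particular `SetParametersLoadBalance`) should be verified against your version's documentation, and the matrix dimension here is a placeholder:

```fortran
PROGRAM BalanceExample
  USE PermutationModule, ONLY : Permutation_t, ConstructRandomPermutation
  USE SolverParametersModule, ONLY : SolverParameters_t, &
       & SetParametersLoadBalance
  IMPLICIT NONE
  TYPE(Permutation_t) :: permutation
  TYPE(SolverParameters_t) :: parameters
  !! Placeholder matrix dimension for this sketch.
  INTEGER, PARAMETER :: matrix_dimension = 1024

  !! Build a random row/column permutation sized to the matrix.
  CALL ConstructRandomPermutation(permutation, matrix_dimension)
  !! Register it with the solver parameters; solvers then permute
  !! the matrix before computing and undo the permutation afterwards.
  CALL SetParametersLoadBalance(parameters, permutation)
  !! ... pass parameters to the subsequent solver calls here ...
END PROGRAM BalanceExample
```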
[1] Solomonik, Edgar, and James Demmel. "Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms." In European Conference on Parallel Processing, pp. 90-109. Springer, Berlin, Heidelberg, 2011.