
Optimizing Parallel Performance


Overview

NTPoly is designed for computing with large matrices on large parallel machines. In this section, we describe some tips for maximizing parallel performance.

Process Grid Shape

When writing a program using NTPoly, the first step is to initialize the process grid using the ConstructProcessGrid subroutine. This subroutine takes four arguments: an MPI communicator (usually MPI_COMM_WORLD), the number of process rows, the number of process columns, and the number of process slices. Matrices are then distributed across a three-dimensional process grid, following the algorithm of reference [1].
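
As a concrete illustration, here is a minimal sketch of this setup using NTPoly's C++ wrapper. The header name and the ConstructGlobalProcessGrid/DestructGlobalProcessGrid spellings are assumptions modeled on the Fortran interface described above, so consult the installed headers for the exact names.

```cpp
#include <mpi.h>
#include "ProcessGrid.h"  // assumed NTPoly C++ header

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  // For example, 16 MPI processes -> 4 rows x 4 columns x 1 slice.
  int rows = 4, columns = 4, slices = 1;
  NTPoly::ConstructGlobalProcessGrid(MPI_COMM_WORLD, rows, columns, slices);

  // ... construct matrices and call solvers here ...

  NTPoly::DestructGlobalProcessGrid();
  MPI_Finalize();
  return 0;
}
```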

To maximize parallel performance, we first recommend setting the number of process slices to 1, and choosing the number of process rows and columns to be either equal, or with one a factor of two larger than the other. Then, when you detect that parallel performance is degrading, begin increasing the number of process slices. For example, on the K computer we found that keeping the number of process slices equal to 1 usually works up to 512 nodes, but beyond that it is better to increase the number of slices.
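
The helper below sketches this shape heuristic: it searches downward from the square root of the process count for a divisor, yielding either a square grid or one where the longer dimension is a small multiple of the shorter. The function is purely illustrative and not part of NTPoly.

```cpp
#include <cmath>
#include <utility>

// Returns {rows, columns} with rows <= columns and rows * columns == num_processes.
std::pair<int, int> ChooseGridShape(int num_processes) {
  int rows = static_cast<int>(std::sqrt(static_cast<double>(num_processes)));
  while (rows > 1 && num_processes % rows != 0)
    --rows;  // step down to a divisor (degenerates to 1 x N for primes)
  return {rows, num_processes / rows};
}
```

For instance, 64 processes yield an 8 x 8 grid, while 32 processes yield 4 x 8, where one dimension is a factor of two larger than the other.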

MPI+OpenMP Hybrid Parallelization

NTPoly uses an MPI+OpenMP hybrid parallelization scheme. Our numerical experiments on the K computer show that using one MPI task per node and one OpenMP thread per core provides the best overall performance.
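
How to place one task per node is specific to your job launcher, but your program can at least verify that the MPI library provides the thread support a hybrid scheme needs. The sketch below uses only standard MPI and OpenMP calls; MPI_THREAD_FUNNELED (only the main thread makes MPI calls) is a common choice for hybrid codes, though the level NTPoly itself requires is an assumption here.

```cpp
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  if (provided < MPI_THREAD_FUNNELED)
    std::printf("Warning: MPI library lacks the requested thread support.\n");

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)  // e.g. set OMP_NUM_THREADS to the number of cores per node
    std::printf("Each rank runs with %d OpenMP threads.\n", omp_get_max_threads());

  MPI_Finalize();
  return 0;
}
```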

Load Balancing

In massively parallel simulations, load balancing between processors plays an important role in determining overall performance. In NTPoly, load imbalance arises when certain processors hold more matrix elements than others (depending on the sparsity pattern).

One simple way to alleviate this is to shuffle the rows and columns of a matrix while calculations are being performed. NTPoly provides this capability through the LoadBalancerModule. First, construct a random permutation using the ConstructRandomPermutation subroutine. Then pass the constructed permutation to the solver parameter constructor, and it will be used to load balance the subsequent calculations. See the examples for more details; a sketch of the workflow follows.
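
The following sketch shows this workflow through the C++ wrapper. The Permutation and SolverParameters class names and their methods are assumptions modeled on the Fortran module names quoted above; the shipped examples show the authoritative interface.

```cpp
#include "Permutation.h"       // assumed NTPoly C++ headers
#include "SolverParameters.h"

void SetupLoadBalancing(int matrix_dimension) {
  // Build a random row/column permutation sized to the matrix.
  NTPoly::Permutation permutation(matrix_dimension);
  permutation.SetRandomPermutation();

  // Attach it to the solver parameters so that subsequent solver
  // calls shuffle the matrix for better load balance.
  NTPoly::SolverParameters parameters;
  parameters.SetLoadBalance(permutation);
}
```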

[1] Solomonik, Edgar, and James Demmel. "Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms." In European Conference on Parallel Processing, pp. 90-109. Springer, Berlin, Heidelberg, 2011.