*** Copyright Notice ***
Copyright (c) 2014-2015, NVIDIA CORPORATION. All rights reserved.
The U.S. Department of Energy funded the development of this software
under subcontract B609478 with Lawrence Livermore National Security (LLNS).
HPGMG, Copyright (c) 2014, The Regents of the University of
California, through Lawrence Berkeley National Laboratory (subject to
receipt of any required approvals from the U.S. Dept. of Energy). All
rights reserved.
If you have questions about your rights to use or distribute this
software, please contact Berkeley Lab's Technology Transfer Department
NOTICE. This software is owned by the U.S. Department of Energy. As
such, the U.S. Government has been granted for itself and others
acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide
license in the Software to reproduce, prepare derivative works, and
perform publicly and display publicly. Beginning five (5) years after
the date permission to assert copyright is obtained from the U.S.
Department of Energy, and subject to any subsequent five (5) year
renewals, the U.S. Government is granted for itself and others acting
on its behalf a paid-up, nonexclusive, irrevocable, worldwide license
in the Software to reproduce, prepare derivative works, distribute
copies to the public, perform publicly and display publicly, and to
permit others to do so.
****************************
About
=====
HPGMG is a compact benchmark designed to proxy the geometric MG solves found in
applications built from AMR MG frameworks like CHOMBO or BoxLib. At a high
level, the benchmark solves Au=f where u and f are cell-centered (finite volume)
3D structured grids. The operator A is a fourth order finite volume discretization
of the Helmholtz operator (a*alpha[]*u[] - b*div beta[] grad u[]), where a and
b are scalar constants and alpha[] and beta[] are spatially varying coefficients.
HPGMG supports both periodic and homogeneous Dirichlet boundary conditions.
The benchmark generates a u_exact[] for a large cubical 3D grid partitioned into
subdomains (boxes) which are distributed across the supercomputer. It then
differentiates u_exact[] to form f[] and uses a multigrid solver to
calculate u[]. It may then use u_exact[] to test correctness and order.
By default, HPGMG solves a Poisson equation (a==0) with homogeneous Dirichlet
boundary conditions.
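For intuition, the following is a minimal sketch of applying the operator in the
simplest configuration HPGMG supports: a constant-coefficient, 2nd order, 7-point
discretization of the Poisson operator (a==0, b==1, beta[]==1) on the interior of
a single box. It is illustrative only; the function name and layout variables
(jStride, kStride, h2inv) merely mirror the style of the HPGMG source, and the
real 4th order variable-coefficient operator lives in operators.fv4.c.

  // Apply A*u for a 2nd order, constant-coefficient 7-point Poisson stencil.
  // u and Au are (dim+2*ghosts)^3 linear arrays (ghosts >= 1); h2inv = 1/h^2.
  void apply_op_7pt(double *Au, const double *u, int dim, int ghosts, double h2inv){
    int jStride = dim + 2*ghosts;        // distance between adjacent j indices
    int kStride = jStride*jStride;       // distance between adjacent k indices
    for(int k=0;k<dim;k++)
    for(int j=0;j<dim;j++)
    for(int i=0;i<dim;i++){
      int ijk = (i+ghosts) + (j+ghosts)*jStride + (k+ghosts)*kStride;
      Au[ijk] = -h2inv*( u[ijk-1      ] + u[ijk+1      ]
                       + u[ijk-jStride] + u[ijk+jStride]
                       + u[ijk-kStride] + u[ijk+kStride]
                       - 6.0*u[ijk] );
    }
  }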
The basic relaxation operator is Gauss-Seidel Red-Black (GSRB) which is applied
twice (red/black/red/black) at every level up and down the v-cycle. HPGMG
also includes Jacobi, L1Jacobi, Symmetric Gauss-Seidel, and Chebyshev Polynomial
smoothers. HPGMG implements both a truncated v-cycle (u-cycle) in which boxes
are restricted locally and a true distributed v-cycle in which restriction and
interpolation (prolongation) become true distributed operations. The latter
allows the global problem to be restricted to as little as one cell. At the
bottom of the v-cycle, HPGMG switches to one of a few bottom solvers (BiCGStab
is often used in codes like BoxLib). Once a sufficiently accurate solution
is obtained on this coarse grid, it is interpolated back up the v-cycle.
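For readers new to multigrid, the following is a schematic of the V-cycle just
described. The names and signatures are illustrative (they are not the actual
HPGMG API); the structure is the standard smooth / restrict / recurse / correct /
smooth pattern, with coarser levels having larger indices (level 0 is finest).

  // Schematic V-cycle (illustrative names, not the HPGMG API).
  // The per-level operations are assumed to be defined elsewhere.
  void smooth(int level);                    // e.g. GSRB red/black/red/black
  void residual(int level);                  // r = f - A u on this level
  void restriction(int fine, int coarse);    // restrict r to the coarser level's f
  void interpolation(int coarse, int fine);  // prolong the correction and add to u
  void zero(int level);                      // zero the coarse level's u
  void bottom_solve(int level);              // e.g. BiCGStab
  extern int coarsest_level;                 // coarser levels have larger indices

  void vcycle(int level){
    if(level == coarsest_level){ bottom_solve(level); return; }
    smooth(level);                           // pre-smooth
    residual(level);
    restriction(level, level+1);
    zero(level+1);                           // coarse correction starts from zero
    vcycle(level+1);                         // recurse toward the coarsest level
    interpolation(level+1, level);
    smooth(level);                           // post-smooth
  }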
HPGMG also includes the option to use Full Multigrid (FMG) in which one executes
an f-cycle that should hit the discretization error in one pass (instead of 10
v-cycles). This can provide a substantial performance boost, but presents a
number of performance optimization challenges.
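Schematically, using the same illustrative names as the V-cycle sketch above, an
F-cycle solves on the coarsest level first and then works upward, performing one
V-cycle from each successively finer level:

  // Schematic F-cycle (FMG), built on the vcycle() sketch above.
  void fcycle(void){
    bottom_solve(coarsest_level);            // solve on the coarsest grid
    for(int level = coarsest_level-1; level >= 0; level--){
      interpolation(level+1, level);         // high-order prolongation of the solution
      vcycle(level);                         // one V-cycle starting at this level
    }
  }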
Notes on the GPU implementation
===================
New options:
-DHOST_LEVEL_SIZE_THRESHOLD // maximum number of grid elements for an MG level to run on the CPU; larger levels run on the GPU
-DMAX_SOLVES // controls the maximum number of solves after warm-up
-DCUDA_UM_ALLOC // use a custom Unified Memory allocator instead of malloc (currently required for the GPU implementation to work properly)
-DCUDA_UM_ZERO_COPY // use zero-copy mapped memory for boundary levels to avoid unnecessary migration
-DUSE_ERROR // enforce device synchronization and error checking for CUDA calls (recommended for debugging)
Optimizations:
-DUSE_REG // use register optimizations
-DUSE_TEX // use read-only/texture cache for memory access
-DUSE_SHM // enables shared memory (disabled by default)
Currently some macros are required for the GPU implementation to work properly,
such as CUDA_UM_ALLOC and BLOCKCOPY_TILE_I. Others are optional but strongly
recommended for best performance: HOST_LEVEL_SIZE_THRESHOLD, CUDA_UM_ZERO_COPY,
USE_REG, and USE_TEX. All CUDA kernels are located in the cuda/ directory and
the filenames generally match the corresponding CPU source files. The 4th order
scheme is enabled by default; to switch to the older 2nd order scheme, replace
operators.fv4.* with operators.fv2.* in local.mk and recompile. The GPU code
has been tested on a single node with one, two, and four GPUs using CUDA-aware
MVAPICH 2.1 and OpenMPI 1.8.7 with CUDA 7.
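As a concrete (but hedged) illustration, a GPU build might add the following
defines on the compile line or in local.mk; the numeric values are the
recommendations from the next section and are starting points to tune, not
required settings:

  -DCUDA_UM_ALLOC -DCUDA_UM_ZERO_COPY -DHOST_LEVEL_SIZE_THRESHOLD=10000 \
  -DBLOCKCOPY_TILE_I=64 -DBLOCKCOPY_TILE_J=2 -DBLOCKCOPY_TILE_K=8 \
  -DUSE_REG -DUSE_TEX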
GPU performance tuning
===================
Recommended HOST_LEVEL_SIZE_THRESHOLD value is around 10K elements for
Kepler/Maxwell GPUs. Recommended block copy dimensions (BLOCKCOPY_TILE_*) are
64x2x8, but one can experiment to tune performance on particular hardware. For
better performance the boundary condition kernels require different tile sizes,
which can be controlled by setting BC_TILE_*. Supported GPU smoothers are GSRB
and Chebyshev, although GSRB is not optimized and provides only a baseline
implementation. For multi-GPU configurations it is recommended to run as many
MPI ranks as there are GPUs in the system. Please note that if peer mappings
are not available between GPUs then CUDA will fall back to using zero-copy
memory, which can be very slow. This issue can be resolved by setting the
CUDA_VISIBLE_DEVICES environment variable to constrain which GPUs are visible
to each process, or by setting CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero
value.
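For example, a minimal sketch of such a workaround (the device index here is
hypothetical and depends on your node topology; typically each MPI rank would
be given a different device):

  export CUDA_VISIBLE_DEVICES=0
  export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1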
Compilation
===========
Although no makefile currently exists, compilation is straightforward.
There are a few basic arguments that should be used. Most are self-explanatory.
-DUSE_MPI // compiles the distributed (MPI) version
-DUSE_CG // use CG as a bottom (coarse grid) solver
-DUSE_BICGSTAB // use BiCGStab as a bottom (coarse grid) solver
-DUSE_CABICGSTAB // use CABiCGStab as a bottom (coarse grid) solver (makes more sense with U-Cycles)
-DUSE_SUBCOMM // build a subcommunicator for each level in the MG v-cycle to minimize the scope of MPI_AllReduce()
-DUSE_FCYCLES // use the Full Multigrid (FMG) solver... the HPGMG benchmark should include this option
// note, the choice of FMG is orthogonal to U-Cycles and V-Cycles
-DUSE_VCYCLES // use true distributed V-Cycles in the multigrid solver... this is the suggested option
-DUSE_UCYCLES // use truncated V-Cycles (U-Cycles) in the multigrid solver... a legacy option for understanding the performance implications
-DUSE_CHEBY // use a Chebyshev Polynomial smoother (degree is specified with CHEBYSHEV_DEGREE)
-DUSE_GSRB // use the GSRB smoother (the number of pre/post smooths is specified by NUM_SMOOTHS)
-DUSE_JACOBI // use a weighted Jacobi smoother with a weight of 2/3
-DUSE_L1JACOBI // use an L1 Jacobi smoother (each row's weight is the L1 norm of that row)
-DBLOCKCOPY_TILE_I=### // parallelism for all ghost zone, restriction, interpolation, and (now) operator routines (and eventually BC's) is organized around a (cache/thread) block concept.
-DBLOCKCOPY_TILE_J=### // That is, boxes are decomposed into tiles of size BLOCKCOPY_TILE_I x BLOCKCOPY_TILE_J x BLOCKCOPY_TILE_K. Users may tune these to find the optimal block size.
-DBLOCKCOPY_TILE_K=### // Smaller blocks fit in cache and express more TLP (good for MIC/BGQ/GPUs/...). However, the unit stride for small blocks is reduced (bad for CPUs which rely on prefetchers).
// If these are omitted, the code relies on its defaults.
-DBOX_ALIGN_JSTRIDE=### // Data allocation is now performed on a level-by-level basis (rather than block-by-block).
-DBOX_ALIGN_KSTRIDE=### // In order to guarantee SIMD alignment, you can pad the unit-stride to a nice round number (e.g. 2, 4, or 8) so that j+/-1 is SIMD-aligned.
-DBOX_ALIGN_VOLUME=### // Similarly, you can pad the kStride (or volume) so that k+/-1 (or vector+/-1) is SIMD-aligned
// If these are omitted, the code relies on its defaults.
-DMAX_COARSE_DIM=### // provides a means of constraining the maximum coarse dimension. By default, the maximum is 11 (i.e. maximum coarse grid is 11^3)
Let us consider an example for Edison, the Cray XC30 at NERSC where the MPI compiler uses icc and is invoked as 'cc'.
cc -Ofast -xAVX -fopenmp level.c operators.fv4.c mg.c solvers.c hpgmg-fv.c timers.c -DUSE_MPI -DUSE_SUBCOMM -DUSE_FCYCLES -DUSE_GSRB -DUSE_BICGSTAB -o run.edison
Conversely, a pure flat MPI build should simply omit the -fopenmp flag.
cc -Ofast -xAVX level.c operators.fv4.c mg.c solvers.c hpgmg-fv.c timers.c -DUSE_MPI -DUSE_SUBCOMM -DUSE_FCYCLES -DUSE_GSRB -DUSE_BICGSTAB -o run.edison.flat
On Mira (an IBM BlueGene/Q), one can use the following...
soft add +mpiwrapper-xl
mpixlc_r -O5 -qsmp=omp:noauto level.c operators.fv4.c mg.c solvers.c hpgmg-fv.c timers.c -DUSE_MPI -DUSE_FCYCLES -DUSE_GSRB -DUSE_BICGSTAB -o run.bgq
In order to compile with IBM's HPM counters...
mpixlc_r -O5 -qsmp=omp:noauto level.c operators.fv4.c mg.c solvers.c hpgmg-fv.c timers.c -DUSE_MPI -DUSE_FCYCLES -DUSE_GSRB -DUSE_BICGSTAB -o run.bgq \
-DUSE_HPM -I/bgsys/drivers/ppcfloor/bgpm/include -L/soft/perftools/hpctw/lib -L/bgsys/drivers/ppcfloor/bgpm/lib /bgsys/drivers/ppcfloor/bgpm/lib/libbgpm.a -lbgpm -lmpihpm_smp -lmpitrace
On Babbage (Xeon Phi cluster at NERSC), one can use the following for native mode compilation...
mpiicc -mmic -Ofast -fopenmp level.c operators.fv4.c mg.c solvers.c hpgmg-fv.c timers.c -DUSE_MPI -DUSE_SUBCOMM -DUSE_FCYCLES -DUSE_GSRB -DUSE_BICGSTAB -o run.babbage
4th order HPGMG-FV
==================
Included in this release is the 4th order finite volume Full Multigrid implementation.
Unlike the 2nd order implementation, the 4th order version requires a different
operator file (operators.fv4.c instead of operators.7pt.c). Currently, it is highly
recommended that one use GSRB rather than the Chebyshev or Jacobi smoothers. By
default, most smoothers make 6 passes through the data (instead of 4) in order to
sufficiently reduce the error and residual. This has the effect of increasing the
time spent in smoothing by 50%.
Nominally, compared to the 2nd order method, the 4th order smoother performs ...
- 4x the flops
- 4x the MPI messages
- 2x the MPI data movement
- 1x the DRAM data movement
- provides 4 more bits of accuracy for every 8x increase in the problem size (instead of 2 bits)
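To see where the accuracy figures come from: an 8x increase in the problem size
at fixed domain size corresponds to halving the cell spacing h. A 4th order
method's error scales as h^4, so halving h reduces the error by 2^4 = 16x, i.e.
4 bits; a 2nd order method's error scales as h^2, a 2^2 = 4x reduction, i.e. 2 bits.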
In order to compile the older 2nd order version on Edison, one may use the following command line...
cc -Ofast -xAVX -fopenmp level.c operators.fv2.c mg.c solvers.c hpgmg-fv.c timers.c -DUSE_MPI -DUSE_SUBCOMM -DUSE_FCYCLES -DUSE_GSRB -DUSE_BICGSTAB -o run.edison
Running the benchmark
=====================
The benchmark exploits OpenMP and/or MPI for parallelism.
You must thus set OMP_NUM_THREADS correctly. For a machine like Edison at NERSC, this is simply
% export OMP_NUM_THREADS=12
Moreover, on multisocket architectures or when using MPI, you must set affinity correctly.
The benchmark takes 2 arguments.
./run.hpgmg [log2BoxSize] [Target # of boxes per process]
- log2BoxSize is the log base 2 of the dimension of each box on the finest grid (e.g. 6 is a good proxy for real applications)
- the target number of boxes per process is a loose bound on memory per process
Given these constraints, the benchmark will then calculate the largest cubical domain it can run.
The code supports nested OpenMP parallelism, which can be enabled by setting
OMP_NESTED=true. At each multigrid level, the code will try to determine
the best balance between coarse- and fine-grained parallelism.
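For example (a sketch; 12 threads matches the Edison example below, but the
right thread count is machine dependent):

  export OMP_NESTED=true
  export OMP_NUM_THREADS=12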
On Edison (the Cray XC30 at NERSC), one uses aprun to launch MPI jobs. A job script may include the following...
#PBS -l mppwidth=96
export OMP_NUM_THREADS=12
aprun -n 8 -d 12 -N 2 -S 1 -ss -cc numa_node ./run.hpgmg 6 8
This will launch 8 MPI processes (-n 8) of 12 threads each (-d 12 == OMP_NUM_THREADS)
with 2 processes per node (-N 2) and 1 process per NUMA node (-S 1), with the
appropriate NUMA controls (-ss -cc numa_node). Moreover, the dimension of each
box is 2^6 on a side (64^3) and there is a target of 8 boxes per process on the
finest grid. The resultant problem is thus 256^3 on the finest grid.
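To see where 256^3 comes from: 8 processes x 8 boxes per process = 64 boxes of
64^3 cells each, and the largest cube that can be built from 64 boxes is
4 x 4 x 4 boxes, i.e. (4*64)^3 = 256^3 cells on the finest grid.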
On Mira (the IBM Blue Gene/Q at Argonne), one may use qsub to directly queue
the benchmark. For example...
qsub -t 00:10:00 -n 64 --proccount 64 --mode c1 -A [ALLOCATION] --env BG_SHAREDMEMSIZE=32MB:PAMID_VERBOSE=1:BG_COREDUMPDISABLED=1:BG_SMP_FAST_WAKEUP=YES:BG_THREADLAYOUT=1:OMP_PROC_BIND=TRUE:OMP_NUM_THREADS=64:OMP_WAIT_POLICY=active:OMP_NESTED=true ./run.bgq 6 8
This will run the benchmark on 64 processes spread over 64 nodes with 1 process per
node (c1) and 64 threads per process. Each process is allocated 8 64^3 boxes.
At one point, the additional environment variables listed above were found to
improve performance.
On Babbage (the Xeon Phi cluster at NERSC), one may run the benchmark as follows...
mpirun.mic -n 8 -hostfile micfile.$PBS_JOBID -ppn 1 -env OMP_NUM_THREADS 120 -env KMP_AFFINITY balanced ./run.babbage 7 1
Understanding the Results
=========================
During execution, the benchmark will output some debug information for understanding convergence and performance.
The following example examines a key subset of this information.
+ aprun -n 512 -d 12 -N 2 -S 1 -ss -cc numa_node ./run.hpgmg.edison 7 8
Requested MPI_THREAD_FUNNELED, got MPI_THREAD_FUNNELED
512 MPI Tasks of 12 threads
truncating the v-cycle at 2^3 subdomains
creating domain... done
128 x 128 x 128 (per subdomain)
256 x 256 x 256 (per process)
2048 x 2048 x 2048 (overall)
1-deep ghost zones
allocated 1865 MB
This initial output details how MPI and OpenMP were initialized.
Moreover, it notes how deep the v-cycle is truncated (down to 2^3 boxes).
It then shows the progress as it creates the structured grids noting their respective sizes and the total memory explicitly allocated with malloc().
Thus, the 2K^3 overall problem represents 8 billion degrees of freedom.
MGSolve...
v-cycle= 1, norm=0.00002091903646017090 (2.091904e-05)
v-cycle= 2, norm=0.00000079708396334668 (7.970840e-07)
v-cycle= 3, norm=0.00000007951502395414 (7.951502e-08)
v-cycle= 4, norm=0.00000000581619537788 (5.816195e-09)
v-cycle= 5, norm=0.00000000048970464287 (4.897046e-10)
v-cycle= 6, norm=0.00000000003900568126 (3.900568e-11)
v-cycle= 7, norm=0.00000000000318039461 (3.180395e-12)
v-cycle= 8, norm=0.00000000000025703104 (2.570310e-13)
v-cycle= 9, norm=0.00000000000002088201 (2.088201e-14)
v-cycle=10, norm=0.00000000000000170463 (1.704634e-15)
v-cycle=11, norm=0.00000000000000014284 (1.428395e-16)
done
As the multigrid solver progresses, the max (inf) norm of the residual is reported after each v-cycle.
One expects to reduce the norm by one digit on each v-cycle.
Thus to attain a norm less than 1e-15, we required 11 v-cycles.
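For reference, the ratios between successive norms above range from roughly 0.04
to 0.1, i.e. at least one digit per cycle, so the number of v-cycles required
scales like log10(initial norm / target norm).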
0 1 2 3 4 5 6
128^3 64^3 32^3 16^3 8^3 4^3 2^3 total
smooth 2.244879 0.288221 0.020186 0.003279 0.000672 0.000267 0.000000 2.557504
residual 0.569046 0.035340 0.001833 0.000328 0.000077 0.000036 0.000030 0.606691
restriction 0.041538 0.003994 0.000310 0.000072 0.000032 0.000028 0.000000 0.045975
interpolation 0.076533 0.006586 0.000567 0.000105 0.000038 0.000032 0.000000 0.083860
applyOp 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.001715 0.001715
BLAS1 0.157396 0.004949 0.000776 0.000184 0.000055 0.000027 0.014614 0.178002
BLAS3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
communication 0.314615 0.069810 0.024858 0.017584 0.009740 0.005763 0.318338 0.760707
local exchange 0.047781 0.008262 0.001819 0.000730 0.000368 0.000233 0.001743 0.060936
pack MPI buffers 0.047688 0.007722 0.001089 0.000569 0.000294 0.000215 0.001630 0.059207
unpack MPI buffers 0.022835 0.004058 0.001226 0.000530 0.000349 0.000231 0.001712 0.030940
MPI_Isend 0.002422 0.002161 0.000856 0.000659 0.000779 0.000374 0.002755 0.010005
MPI_Irecv 0.000456 0.000402 0.000152 0.000205 0.000119 0.000079 0.000677 0.002089
MPI_Waitall 0.169658 0.047091 0.019666 0.014850 0.007801 0.004603 0.022721 0.286390
MPI_collectives 0.023637 0.000000 0.000000 0.000000 0.000000 0.000000 0.286850 0.310487
-------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Total by level 3.386964 0.404774 0.047960 0.021436 0.010595 0.006159 0.334945 4.212834
Total time in MGBuild 0.081082
Total time in MGSolve 4.235200
" v-cycles 4.213100
number of v-cycles 11
Bottom solver iterations 397
Finally, we see a timing report. Listed vertically are the key operations within the
v-cycle (communication is further broken down into its constituent operations).
Listed horizontally is a breakdown of time (in seconds) by level in the v-cycle.
Thus, one can see the difference in time spent in each operation at each level.
These times are totaled by level and by function. Finally, the total time required
to build the solver (note that geometric multigrid solvers can be built extremely
quickly), the time spent in the solver, the number of v-cycles, and the total number
of bottom solver (e.g. BiCGStab) iterations summed across all v-cycles are reported.
We thus observe that this 8 billion DOF problem was solved in 4.23 seconds on 512
processes (6144 cores). It required 397 BiCGStab iterations on the coarse grid
spread over 11 v-cycles (approximately 36 iterations per v-cycle).
As the time spent smoothing the fine grid was non-trivial (2.244 seconds), one might be motivated to analyze it.
Each box has (128+2)^3 cells including ghost zones.
There are 8 boxes (2 x 2 x 2) per process.
Each call to smooth moves 64 bytes of data per cell per stencil sweep.
There are 4 calls to smooth and 2 stencil sweeps per smooth in the v-cycle.
There are 11 v-cycles.
Thus, smooth requires one to move *at least* 8 * 130^3 * 64 * 8 * 11 bytes of data = 98.99e9 bytes.
Moving this data in 2.244 seconds suggests each process attained an average DRAM bandwidth of 44 GB/s. This is quite good given this was vanilla OpenMP code without optimization, and one could never hope for better than about 54 GB/s on this machine.
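The same back-of-the-envelope estimate as a small self-contained sketch (the
constants are the figures quoted above for this particular run, not values the
benchmark reports itself):

  // Rough DRAM traffic and bandwidth estimate for the fine-grid smoother above.
  #include <stdio.h>
  int main(void){
    double boxes   = 8;                      // boxes per process (2 x 2 x 2)
    double cells   = 130.0*130.0*130.0;      // (128+2)^3 cells per box, with ghosts
    double bytes   = 64;                     // bytes moved per cell per stencil sweep
    double sweeps  = 4*2;                    // 4 smooths x 2 sweeps per v-cycle
    double vcycles = 11;
    double total   = boxes*cells*bytes*sweeps*vcycles;   // ~98.99e9 bytes
    printf("%.2f GB moved, %.1f GB/s over 2.244 s\n", total/1e9, total/1e9/2.244);
    return 0;
  }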