forked from ANL-CESAR/RSBench
CHANGES.txt
=====================================================================
NEW IN VERSION 13
=====================================================================
- (Feature) Added a HIP port of RSBench. This port is closely based
on the CUDA version, and was generated using an automated code
conversion utility with only a few manual changes required.
- Fixed threads per block for CUDA/HIP/OpenCL to all use 256 threads.
Other models will select this value themselves, but it may be
worth testing configurations manually with those models as well.
- Added a warning about GPU timers to output.
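As an illustration of the fixed 256 threads per block noted above, the
number of blocks for a 1D launch can be derived by ceiling division.
This is only a sketch in plain C; the helper names are hypothetical and
are not taken from the RSBench source.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: matches the fixed threads-per-block value noted above. */
#define THREADS_PER_BLOCK 256

/* Hypothetical helper: blocks needed so that each of n_lookups work
 * items gets its own thread (ceiling division). */
size_t blocks_for(size_t n_lookups)
{
    return (n_lookups + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

/* Hypothetical helper: global work-item index of a (block, thread)
 * pair in a 1D launch. A kernel would still compare the result
 * against n_lookups, because the last block may be partly empty. */
size_t global_index(size_t block, size_t thread)
{
    assert(thread < THREADS_PER_BLOCK);
    return block * THREADS_PER_BLOCK + thread;
}
```

With 10.2 million lookups, blocks_for() gives 39844 blocks, the last of
which is only partly filled.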
=====================================================================
NEW IN VERSION 12
=====================================================================
- (Feature) Ports of RSBench in a variety of accelerator oriented
programming models have been added to the repo. The traditional
OpenMP threading model of RSBench is now located in the
"RSBench/src/openmp-threading" directory. The other programming
models are:
- OpenMP target offloading, for use with OpenMP 4.5+ compilers
that can target a variety of accelerator architectures.
- CUDA for use on NVIDIA GPUs.
- OpenCL for use on CPU and accelerators. Note that this port
may need to be heavily re-optimized if running on a non-GPU
architecture.
- SYCL for use on CPU and accelerators. Note that this port
may need to be heavily re-optimized if running on a non-GPU
architecture.
Note that the accelerator-oriented models only implement the event
based model of parallelism, as this model is expected to be
superior to the history-based model for accelerator architectures.
As such, running with the accelerator models will require the
"-m event" flag to be used.
- (Optimization) For the CUDA and openmp-threading code bases,
several different optimized kernel variants have been developed.
These optimized kernels can be selected with the "-k <kernel ID #>"
argument. The optimized kernels work by sorting values by energy
and material to reduce thread divergence and greatly improve
cache locality. The baseline (default) kernel (-k 0) is generally
the same across all programming models. In a future release, we
plan on implementing these optimizations for all programming models
if possible.
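The sorting idea behind the optimized kernels can be sketched roughly
as follows. This is not the RSBench kernel code: the Lookup record, its
field names, and the choice of ordering by material first and energy
second are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one sampled lookup (names are illustrative). */
typedef struct {
    int    mat;     /* material index */
    double energy;  /* sampled particle energy */
} Lookup;

/* qsort comparator: order by material first, then by energy. */
static int cmp_lookup(const void *pa, const void *pb)
{
    const Lookup *a = pa, *b = pb;
    if (a->mat != b->mat)
        return (a->mat > b->mat) - (a->mat < b->mat);
    return (a->energy > b->energy) - (a->energy < b->energy);
}

/* Sort the sampled lookups so that neighbouring entries share a
 * material and have nearby energies before the lookup kernel runs. */
void sort_lookups(Lookup *lookups, size_t n)
{
    if (n == 0)
        return;
    assert(lookups != NULL);
    qsort(lookups, n, sizeof *lookups, cmp_lookup);
}
```

After sorting, adjacent threads follow the same material branch and
touch nearby data, which is where the reduced divergence and improved
locality described above would come from.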
- (Feature) Verification mode is now default and required, so is no
longer an option in the Makefile. A new and faster method of
verifying results was developed and added, making it have much less
of an impact on performance than the previous methods did. As the
new method generates a different hash than the old one, the expected
hash values have changed in v12.
- (Removal) Removed PAPI performance counters from code and Makefile.
- (Removal) A number of optional/outdated Makefile options were
removed in an effort to clean up the code base and simplify things.
- (Feature) To service the new verification mode, a new PRNG scheme
has been implemented. We now use a specific LCG for all random
number generation rather than relying on C standard rand() for
some parts of initialization. Additionally, instead of selecting
seeds by thread ID, we now base the PRNG stream off of a single
seed that gets forwarded using a log(N) algorithm. This ensures
that each sample is uncorrelated, which better ensures the randomness
of our energy and material samples as compared to the old scheme.
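A minimal sketch of forwarding an LCG stream in O(log N) steps is shown
below. The multiplier, increment, and modulus (2^63) are illustrative
assumptions, and the function names are hypothetical; only the general
technique (repeated squaring of the affine map x -> a*x + c) is what
the entry above describes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative LCG parameters (assumptions, not taken from the text).
 * The modulus is 2^63, so reduction is a bit mask. */
#define LCG_MASK 0x7FFFFFFFFFFFFFFFULL
#define LCG_A    2806196910506780709ULL
#define LCG_C    1ULL

/* Advance the stream by one step: seed <- (a*seed + c) mod 2^63. */
uint64_t lcg_next(uint64_t seed)
{
    return (LCG_A * seed + LCG_C) & LCG_MASK;
}

/* Jump n steps ahead in O(log n) by composing the one-step map with
 * itself. Unsigned overflow wraps mod 2^64, which is harmless here
 * because the final mask reduces mod 2^63. */
uint64_t lcg_skip(uint64_t seed, uint64_t n)
{
    uint64_t a = LCG_A, c = LCG_C;   /* map for 2^k steps */
    uint64_t a_acc = 1, c_acc = 0;   /* accumulated map (identity) */
    while (n > 0) {
        if (n & 1) {                 /* fold the 2^k-step map in */
            a_acc *= a;
            c_acc = c_acc * a + c;
        }
        c *= a + 1;                  /* square the map: c first, */
        a *= a;                      /* then a */
        n >>= 1;
    }
    return (a_acc * seed + c_acc) & LCG_MASK;
}

/* Seed for sample i when every sample consumes a fixed number of
 * draws, so that samples never share part of the stream. */
uint64_t seed_for_sample(uint64_t base, uint64_t i, uint64_t draws)
{
    assert(draws > 0);
    return lcg_skip(base, i * draws);
}
```

Each sample can then compute its own starting point from the single
base seed and its index, without any state shared between threads.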
- (Feature) The basic data structures in RSBench have all been changed
to use a single 1D memory allocation each so as to make it as
easy as possible to offload heap memory to devices without having
to flatten things manually for each device implementation. Several
structures were consolidated into a single structure to make it
easier to see what data arrays need to be moved to a device.
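The single 1D allocation idea can be sketched as below. The FlatPoles
type and its helpers are hypothetical names for illustration and do not
reproduce the actual RSBench structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flattened container: one allocation, with rows
 * addressed by a fixed stride. */
typedef struct {
    double *poles;      /* n_nuclides * max_poles values, row-major */
    int     n_nuclides;
    int     max_poles;  /* row stride */
} FlatPoles;

/* Allocate the whole table as one zero-initialized buffer.
 * The caller checks fp.poles for NULL and frees it when done. */
FlatPoles flat_poles_alloc(int n_nuclides, int max_poles)
{
    assert(n_nuclides > 0 && max_poles > 0);
    FlatPoles fp = { NULL, n_nuclides, max_poles };
    fp.poles = calloc((size_t)n_nuclides * (size_t)max_poles,
                      sizeof *fp.poles);
    return fp;
}

/* Pointer to pole p of nuclide nuc; stands in for poles[nuc][p]
 * in a pointer-to-pointer layout. */
double *flat_poles_at(const FlatPoles *fp, int nuc, int p)
{
    assert(nuc >= 0 && nuc < fp->n_nuclides);
    assert(p >= 0 && p < fp->max_poles);
    return &fp->poles[(size_t)nuc * (size_t)fp->max_poles + (size_t)p];
}

/* Size in bytes of the one buffer a device copy would transfer. */
size_t flat_poles_bytes(const FlatPoles *fp)
{
    return (size_t)fp->n_nuclides * (size_t)fp->max_poles
           * sizeof *fp->poles;
}
```

Because the data lives in one contiguous buffer, moving it to a device
is a single copy of flat_poles_bytes() bytes rather than one copy per
row.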
=====================================================================
NEW IN VERSION 11
=====================================================================
- (Feature) Added an option to use an event based simulation algorithm
as opposed to the default history based algorithm. This same change
was featured in the XSBench v18 update. This mode can be enabled with
the "-m event" command line option. The central difference between the
default history based algorithm and the event based algorithm is
the dependence or independence of macroscopic cross section
lookups. In the default history based method, the simulation
parallelism is expressed over a large number of independent
particle "histories", with each particle history requiring (by
default) 34 sequential and dependent macroscopic cross section
lookups to be executed. In the event based method, all macroscopic
cross sections are pooled together in a way that allows them to
be executed independently in any order. On CPU architectures,
both algorithms run at approximately the same speed, though in
full Monte Carlo applications there are some added costs associated
with the event based method (namely, the need to frequently store
and load particle states) that are not represented in RSBench.
However, the model event based algorithm in RSBench is useful for
examining how the dependence or independence of the loop over
macroscopic cross sections may affect performance on different
computer architectures.
The event based mode in RSBench is very similar to the algorithm
expressed in v9 and before, so performance results collected
with v9 and previous should be comparable to performance results
collected with the event mode in v11. On most CPU architectures
and with most compilers, the history and event based methods run
in approximately the same time, with some CPUs showing a small
(less than 20%) speedup with the event based method.
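The independence of pooled lookups in the event based method can be
illustrated with a toy example. The lookup function here is a stand-in
with arbitrary constants, and all names are hypothetical; the point is
only that each lookup depends on its own index and nothing else.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a macroscopic cross section lookup: the result
 * depends only on the lookup's own index, never on the result of
 * another lookup. The mixing constant is arbitrary. */
uint64_t toy_lookup(uint64_t lookup_id)
{
    uint64_t x = lookup_id * 0x9E3779B97F4A7C15ULL;
    x ^= x >> 32;
    return x & 0xFFFF;
}

/* Event based sketch: all n_particles * lookups_per_particle lookups
 * are pooled into one flat index space. The reverse flag walks that
 * space backwards to show that execution order does not matter. */
uint64_t run_event_based(uint64_t n_particles,
                         uint64_t lookups_per_particle, int reverse)
{
    assert(n_particles > 0 && lookups_per_particle > 0);
    uint64_t n = n_particles * lookups_per_particle;
    uint64_t total = 0;
    for (uint64_t k = 0; k < n; k++) {
        uint64_t i = reverse ? n - 1 - k : k;
        total += toy_lookup(i);
    }
    return total;
}
```

Running the pool forwards or backwards yields the same total, which is
the property that lets an accelerator assign one lookup per thread.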
- With the addition of the different event and history based options,
the main function of the code has been broken down into some
smaller functions. Notably, the main parallel simulation phase
has been moved to a history and event based function in the
"simulation.c" file.
- Added different verification checksums for the new event based mode.
While in a real application, the history and event based methods
would produce the same answers, in the context of this mini-app
for programming convenience purposes they do not produce the same
answer/checksum.
=====================================================================
NEW IN VERSION 10
=====================================================================
- In previous versions of RSBench, it was assumed that the method
of parallelism in the app would not be altered, as the app was
written to respect some loop dependencies of fully featured MC
applications. However, to foster more experimentation with
different methods of parallelizing the XS lookup kernel, we have
decided to add an extra loop (over particles) into RSBench so that
loop dependencies can be explicit in RSBench. This should make it
easier for people to optimize or port RSBench without worry that
they are breaking any implicit loop dependencies. All loop
dependencies should now be apparent without requiring any knowledge
of full MC applications.
From a performance standpoint, the change does not really affect
anything regarding default performance, so performance results for
CPUs (run without modification of the source code) should be
virtually identical between v9 and v10.
RSBench now has its default OpenMP thread level parallelism
expressed over independent particle histories. Each particle
then executes 34 macro_xs lookups, which are dependent, meaning
that this loop cannot be parallelized as each iteration is
dependent on results from the previous iteration. For each
macro_xs, there is an independent inner loop over micro_xs lookups,
which is not parallelized by default in RSBench, but could be
if desired provided that atomic or controlled writes are used
for the reduction of the macroscopic XS vector.
The introduction of particles into RSBench follows a similar
change made in v17 of XSBench.
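The loop nest described above can be sketched as follows. The physics
is replaced by toy integer arithmetic and all names are hypothetical;
the OpenMP pragma is guarded so the code also builds without OpenMP.

```c
#include <assert.h>
#include <stdint.h>

#define N_XS 4   /* channels in the toy macro XS vector (assumption) */

/* Toy micro_xs contribution of one nuclide at a given energy state. */
static void toy_micro_xs(uint64_t energy, int nuc, double out[N_XS])
{
    for (int c = 0; c < N_XS; c++)
        out[c] = (double)((energy + (uint64_t)nuc * 7u + (uint64_t)c)
                          % 11u);
}

/* One particle history: n_lookups macro_xs lookups executed in order. */
double run_history(uint64_t particle_seed, int n_lookups, int n_nuclides)
{
    assert(n_lookups > 0 && n_nuclides > 0);
    uint64_t energy = particle_seed;
    double last = 0.0;
    for (int l = 0; l < n_lookups; l++) {        /* dependent loop */
        double macro_xs[N_XS] = {0};
        for (int nuc = 0; nuc < n_nuclides; nuc++) { /* independent */
            double micro[N_XS];
            toy_micro_xs(energy, nuc, micro);
            for (int c = 0; c < N_XS; c++)
                macro_xs[c] += micro[c];         /* reduction */
        }
        last = macro_xs[0];
        /* The next energy state needs this lookup's result, so the
         * loop over l cannot be parallelized. */
        energy = energy * 31u + (uint64_t)last;
    }
    return last;
}

/* Outer loop over independent particles: the level at which thread
 * parallelism is applied. */
double run_all_histories(int n_particles, int n_lookups, int n_nuclides)
{
    double total = 0.0;
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic) reduction(+:total)
#endif
    for (int p = 0; p < n_particles; p++)
        total += run_history((uint64_t)p + 1u, n_lookups, n_nuclides);
    return total;
}
```

If the loop over nuclides were parallelized instead, the updates to
macro_xs would be the writes that need atomic or otherwise controlled
access.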
- (Performance) To re-iterate, there is no expected performance
change for CPU architectures running the default code. The addition
of the particle loop into the code was done only to allow those
altering the code to more transparently see loop dependencies, so
as to avoid parallelizing or pre-processing loops in a manner which
would not be possible in a real MC app.
There is a slight change in performance in the smaller problem
size due to the OpenMP dynamic work chunk size. In v9, this was
set as 1 macro_xs lookup per chunk. In v10, as we are now
parallelizing over particle histories, the minimum chunk size
is now effectively 34 macro_xs lookups. For the large problem,
the thread work scheduling overhead is small compared to the
cost of a macro_xs lookup, so there isn't any difference. It's
just in the small problem size where the scheduling overhead
begins to be significant so the performance difference between
v9 and v10 shows up. v9 can be made equivalent in performance
to v10 by simply increasing the dynamic chunk size on the
macro_xs OpenMP loop.
- (Option) In support of this change, the user now has the ability
to alter the number of particle histories they want to simulate.
This can be controlled with the "-p" command line argument.
By default, 300,000 particle histories will be simulated.
This is now the recommended argument for users to adjust if
they want to increase/decrease the runtime of the problem. Real
MC simulations may use anywhere from O(10^5) to O(10^10)
particles per power iteration, so a wide range of values here
is acceptable.
- (Option) The "-l" option sets the number of lookups to perform
per particle history, and defaults to 34.
This option previously referred to the total number of lookups,
but now refers to the number of lookups per particle.
The default value reflects the average number of XS lookups that
a particle will need over its lifetime in a light water reactor.
One may want to adjust this value if targeting a different
reactor type.
- (Defaults) Due to the addition of the particle abstraction, the
default parameters were changed slightly. With a new default of
300,000 particles and 34 lookups/particle, a total of 10.2 million
XS lookups will be performed under the default configuration. This
is slightly larger than what was seen in v9, which only performed
10 million lookups.
- (Feature) Verification mode has been added. The new mode can
be used by enabling the "VERIFY" flag in the makefile. In this
mode, the inputs and results from each macro_xs lookup
are hashed and added to a running integer total, and a final
single hash value is returned at the end of the simulation. This
is a great tool for those wishing to alter or optimize the code,
as most mistakes will result in the hash value being changed.
Note that the verification mode does come with a small performance
penalty, as a (non-trivial) hashing operation is performed at
each macro_xs lookup result. The verification mode is therefore
intended to check for bugs or unintended result alterations, but
should be disabled when collecting performance metrics or
other runtime information.
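The running-total scheme of the verification mode described above can
be sketched as below. The hash used here is FNV-1a, chosen only as a
stand-in for this illustration; the hash RSBench itself uses is not
stated in this file, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over raw bytes, continuing from state h (stand-in hash). */
static uint64_t fnv1a(const void *data, size_t len, uint64_t h)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;      /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hash one lookup's inputs (energy, material) with its results. */
uint64_t hash_lookup(double energy, int mat,
                     const double *xs, size_t n_xs)
{
    assert(xs != NULL && n_xs > 0);
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a offset basis */
    h = fnv1a(&energy, sizeof energy, h);
    h = fnv1a(&mat, sizeof mat, h);
    return fnv1a(xs, n_xs * sizeof *xs, h);
}

/* Add a lookup's hash to the running integer total. Unsigned
 * addition wraps and is commutative, so the final value does not
 * depend on the order in which lookups are accumulated. */
void verify_accumulate(uint64_t *total, uint64_t lookup_hash)
{
    *total += lookup_hash;
}
```

Because the total is a sum, a threaded build can combine per-thread
totals in any order and still arrive at the same final value.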
=====================================================================
NEW IN VERSION 9
=====================================================================
- Changed the faddeeva function result variable in the micro_xs
lookup kernel to a complex double type, rather than simply a
double for proper arithmetic. Has a very small (several percent)
change in performance.
=====================================================================
NEW IN VERSION 8
=====================================================================
- Fixed a bug in initialization of one of the randomized arrays.
It was originally being allocated correctly but not properly
set to fully randomized values, instead just using whatever values
were in memory at the time (often zero). This has a very small,
but non-negligible, impact on performance.
=====================================================================
NEW IN VERSION 7
=====================================================================
- Temperature dependence default status has been reverted. Doppler
broadening / temperature dependence is now ON by default. This
means the Faddeeva function will be called by default.
- The MIT Open Source Faddeeva function library used in Version
6 is no longer used, and that code has been removed. While accurate,
it was very slow, so we are using a much faster approximation.
This approximation loosens some of the restrictions due to the
nature of actual reactor physics calculations not needing full
machine precision. The new algorithm now uses a fast three term
asymptotic approximation when the absolute value of Z is
greater than 6. For |z| < 6, a much higher precision computation
is done using the Abrarov approximation. The slower Abrarov region
of complex space is only expected to be hit 0.0027% to 0.5% of the
time. The randomized resonance data in RSBench has been tuned to
approximate this frequency (assuming worst case 0.5%) of usage.
- Added counters and a printout to calculate the percentage of
faddeeva calls that have |z| < 6, thus requiring the higher
fidelity and slower calculation. Changes made to the code during
porting (particularly if changing RNGs) should ensure that this
percentage remains approximately constant.
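A rough sketch of a fast large-|z| approximation and of the region
counter is given below. The rational form and its constants are derived
from four-point Gauss-Hermite quadrature of the Faddeeva integral and
are an assumption about what a fast approximation of this kind looks
like; they are not copied from the RSBench source. The Abrarov branch
is not shown, and the function names are hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Approximate W(z) ~ i*z*( a/(z^2 - b) + c/(z^2 - d) ) for large |z|.
 * b and d are (3 -/+ sqrt(6))/2; a and c are the matching weights
 * 1/(2*sqrt(pi)*(3 -/+ sqrt(6))), truncated to ten digits. */
double complex faddeeva_asymptotic(double complex z)
{
    const double a = 0.5124242248, b = 0.2752551286;
    const double c = 0.0517653588, d = 2.7247448714;
    double complex z2 = z * z;
    return I * z * (a / (z2 - b) + c / (z2 - d));
}

/* Nonzero when z falls in the |z| < 6 region that would need the
 * slower high-precision path (compares |z|^2 against 36). */
int needs_slow_path(double complex z)
{
    double re = creal(z), im = cimag(z);
    return re * re + im * im < 36.0;
}

/* Percentage of the given arguments that land in the slow region. */
double slow_region_percent(const double complex *z, size_t n)
{
    assert(z != NULL && n > 0);
    size_t slow = 0;
    for (size_t i = 0; i < n; i++)
        if (needs_slow_path(z[i]))
            slow++;
    return 100.0 * (double)slow / (double)n;
}
```

A port that changes the RNG could run a counter like this over its
sampled arguments and compare the percentage against the original code.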
=====================================================================
NEW IN VERSION 6
=====================================================================
- Temperature dependence has now been changed to off by default.
- The Faddeeva function evaluation has now been moved to an open
source library (included in this repo). This allows for true
complex evaluation of the pole data rather than the real conjugate
error function approximation previously used. It was previously
assumed that real erfc() could be used to approximate the full
computation of the complex version, but in reality it has been
found to be significantly more computationally expensive to compute
this function when complex space is added. As a result, this slows
down computation significantly.
- Note that the new source files and Temperature dependence changes
DO NOT AFFECT the single temp (no Doppler) mode.
=====================================================================
NEW IN VERSION 5
=====================================================================
- Added a new temperature dependence feature that
uses Doppler broadening to translate pole data from 0K to any
arbitrary material temperature. This is accomplished by evaluating
the Faddeeva function (via an exponential multiplied by the standard
C error function).
This new Doppler mode is enabled by default in Version 5, but can
be disabled via the "-d" flag on the command line.
Impact varies by compiler, but is relatively small (only a
~10-15% slowdown).
- Changed the default runtime parameters from 250 windows per nuclide
down to 100 windows per nuclide (increasing the number of poles
per lookup from an average of 4 to an average of 10). This
directly impacts performance, so expect lookups/sec to change in
Version 5 as compared to Version 4. This change was made as more
nuclides have had multipole data generated for them in the time
since RSBench was originally written, so estimates for these variables
have been updated to more accurately reflect the full scale code.
=====================================================================
NEW IN VERSION 4
=====================================================================
- Fixed a minor bug that caused incorrect operation when compiled
with GCC (Intel worked fine). This was due to a bug in OpenMP causing
private variables to not be correctly copied into the parallel
region. The result was that the RNG was improperly initialized,
generating the same value every time, speeding up the calculation
significantly. Intel's OMP implementation was correct, so all
results from Intel runs should have the same performance.
=====================================================================
NEW IN VERSION 3
=====================================================================
- [Bug Fix] Fixed an issue with the window pole assignment
function (generate_window_params). There was a counter that was
not being properly incremented, causing some windows to be
assigned the same poles, and causing an issue with a border case
receiving many more poles than others. This issue has been fixed
so that all windows should now receive about the same number of
poles, and all windows should be unique. This change affects
performance slightly, showing a 10-15% speedup in version 3
vs. version 2.
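An even assignment of poles to windows can be sketched as below. This
is a hypothetical illustration of the behavior described above, not the
actual generate_window_params code; the Window type and function name
are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range of pole indices owned by one window. */
typedef struct { int start, end; } Window;

/* Split n_poles across n_windows contiguous, non-overlapping ranges
 * whose sizes differ by at most one. */
void assign_window_poles(Window *w, int n_windows, int n_poles)
{
    assert(w != NULL && n_windows > 0 && n_poles >= n_windows);
    int base  = n_poles / n_windows;
    int extra = n_poles % n_windows;  /* first `extra` windows get +1 */
    int ctr = 0;                      /* next unassigned pole index */
    for (int i = 0; i < n_windows; i++) {
        int count = base + (i < extra ? 1 : 0);
        w[i].start = ctr;
        w[i].end   = ctr + count - 1;
        ctr += count;                 /* advance past assigned poles */
    }
    assert(ctr == n_poles);           /* every pole assigned once */
}
```

If the counter were not advanced after each window, several windows
would receive the same poles, which is the kind of failure the fix
above addresses.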
=====================================================================
NEW IN VERSION 2
=====================================================================
- [Optimization] Moved the sigTfactor dynamic array allocation out
of the inner program loop and back up to the top so millions of
allocs/free's are saved. This appears to increase performance
significantly (~33%) when compared to v1.
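The optimization above amounts to allocating a scratch buffer once and
reusing it. A minimal sketch follows; apart from the sigTfactor name
taken from the entry above, the functions and the arithmetic are
illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy inner kernel that fills and consumes a scratch buffer of n_l
 * entries on every lookup. */
static double lookup_with_scratch(double *sigTfactors, int n_l,
                                  double energy)
{
    double sum = 0.0;
    for (int l = 0; l < n_l; l++) {
        sigTfactors[l] = energy * (double)(l + 1);
        sum += sigTfactors[l];
    }
    return sum;
}

/* Allocate the scratch buffer once, reuse it for every lookup, and
 * free it at the end, instead of one malloc/free pair per lookup.
 * Returns -1.0 if the allocation fails. */
double run_lookups(int n_lookups, int n_l)
{
    assert(n_lookups > 0 && n_l > 0);
    double *sigTfactors = malloc((size_t)n_l * sizeof *sigTfactors);
    if (sigTfactors == NULL)
        return -1.0;
    double total = 0.0;
    for (int i = 0; i < n_lookups; i++)
        total += lookup_with_scratch(sigTfactors, n_l, (double)i);
    free(sigTfactors);
    return total;
}
```

In a threaded build each thread would need its own buffer, allocated
once at the top of the parallel region.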
=====================================================================