CHANGES.txt

=====================================================================
NEW IN VERSION 13
=====================================================================
- (Feature) Added HIP port of RSBench. This port is based closely
  off the CUDA version, and was generated using an automated code
  conversion utility with only a few manual changes required.

- Fixed threads per block for CUDA/HIP/OpenCL to all use 256 threads.
  Other models will select this value themselves, but it may be
  worth testing configurations maually with those models as well.

- Added a warning about GPU timers to output.

=====================================================================
NEW IN VERSION 12
=====================================================================
- (Feature) Ports of RSBench in a variety of accelerator oriented
  programming models have been added to the repo. The traditional
  OpenMP threading model of RSBench is now located in the 
  "RSBench/src/openmp-threading" directory. The other programming
  models are:

    - OpenMP target offloading, for use with OpenMP 4.5+ compilers
      that can target a variety of acecelerator architectures.

    - CUDA for use on NVIDIA GPUs.

    - OpenCL for use on CPU and accelerators. Note that this port
      may need to be heavily re-optimized if running on a non-GPU
      architecture.

    - SYCL for use on CPU and accelerators. Note that this port
      may need to be heavily re-optimized if running on a non-GPU
      architecture.

  Note that the accelerator-oriented models only implement the event
  based model of parallelism, as this model is expected to be 
  superior to the history-based model for accelerator architectures.
  As such, running with the accelerator models will require the
  "-m event" flag to be used.

- (Optimization) For the CUDA and openmp-threading code bases,
  several different optimized kernel variants have been developed.
  These optimized kernels can be selected with the "-k <kernel ID #>"
  argument. The optimized kernels work by sorting values by energy
  and material to reduce thread divergence and greatly improve
  cache locality. All baseline (default) kernel (-k 0) is generally
  the same across all programming models. In a future release, we
  plan on implementing these optimizations for all programming models
  if possible. 

- (Feature) Verification mode is now default and required, so is no
  longer an option in the Makefile. A new and faster method of
  verifying results was developed and added, making it have much less
  of an impact on performance than the previous methods did. As the
  new method generates a different hash than the old one, the expected
  has values have changed in v12.

- (Removal) Removed PAPI performance counters from code and Makefile.

- (Removal) A number of optional/outdated Makefile options were
  removed in an effort to clean up the code base and simplify things.

- (Feature) To service the new verification mode, a new PRNG scheme
  has been implemented. We now use a specific LCG for all random
  number generation rather than relying on C standard rand() for
  some parts of initialization. Additionally, instead of selecting
  seeds by thread ID, we now base the PRNG stream off of a single
  seed that gets forwarded using a log(N) algorithm. This ensures
  that each sample is uncorrelated so will better ensure the randomness
  of our energy and material samples as compared to the old scheme.

- (Feature) The basic data structures in RSBench have all been changed
  to use a single 1D memory allocation each so as to make it as
  easy as possible to offload heap memory to devices without having
  to flatten things manually for each device implementation. Several
  structures were consolidated into a single structure to make it
  easier to see what data arrays need to be moved to a device.

=====================================================================
NEW IN VERSION 11
=====================================================================
- (Feature) Added in an option to
  use an event based simulation algorithm as opposed to the default
  history based algorithm. This same change was featured in the
  XSBench v18 update. This mode can be enabled with the "-m
  event" command line option. The central difference between the
  default history based algorithm and the event based algorithm is
  the dependence or independence of macroscopic cross section
  lookups. In the default history based method, the simulation
  parallelism is expressed over a large number of independent
  particle "histories", with each particle history requiring (by
  default) 34 sequential and dependent macroscopic cross section
  lookups to be executed. In the event based method, all macroscopic
  cross sections are pooled together in a way that allows them to
  be executed independently in any order. On CPU architectures,
  both algorithms run in approximately the same speed, though in
  full Monte Carlo applications there are some added costs associated
  with the event based method (namely, the need to frequently store
  and load particle states) that are not represented in XSBench.
  However, the model event based algorithm in XSBench is useful for
  examining how the dependence or independence of the loop over
  macroscopic cross sections may affect performance on different
  computer architectures.

  The event based mode in RSBench is very similar to the algorithm
  expressed in v9 and before, so performance results collected
  with v9 and previous should be comparable to performance results
  collected with the event mode in v11. On most CPU architectures
  and with most compilers, the history and event based methods run
  in approximately the same time, with some CPUs showing a small
  (less than 20%) speedup with the event based method.

- With the addition of the different event and history based options,
  the main function of the code has been broken down into some
  smaller functions. Notably, the main parallel simulation phase
  has been moved to a history and event based function in the
  "simulation.c" file.

- Added different verification checksums for the new event based mode.
  While in a real application, the history and event based methods
  would produce the same answers, in the context of this mini-app
  for programming convenience purposes they do not produce the same
  answer/checksum.

=====================================================================
NEW IN VERSION 10
=====================================================================
- In previous versions of RSBench, it was assumed that the method
  of parallelism in the app would not be altered, as the app was
  written to respect some loop dependencies of fully featured MC
  applications. However, to foster more experimentation with
  different methods of parallelizing the XS lookup kernel, we have
  decided to add an extra loop (over particles) into RSBench so that
  loop dependencies can be explicit in RSBench. This should make it
  easier for people to optimize or port RSBench without worry that
  they are breaking any implicit loop dependencies. All loop 
  dependencies should now be apparent without requiring any knowledge
  of full MC applications.

  From a performance standpoint, the change does not really affect
  anything regarding default performance, so performance results for
  CPUs (run without modification of the source code) should be
  virtually identical between v9 and v10.

  RSBench now has its default OpenMP thread level parallelism
  expressed over independent particle histories. Each particle
  then executes 34 macro_xs lookups, which are dependent, meaning
  that this loop cannot be parallelized as each iteration is
  dependent on results from the previous iteration. For each
  macro_xs, there is an independent inner loop over micro_xs lookups,
  which is not parallelized by default in RSBench, but could be
  if desired provided that atomic or controlled writes are used
  for the reduction of the macroscopic XS vector.

  The introduction of particles into RSBench follows a similar
  change made in v17 of XSBench.

- (Performance) To re-iterate, there is no expected performance
  change for CPU architectures running the default code. The addition
  of the particle loop into the code was done only to allow those
  altering the code to more transparently see loop dependencies as
  to avoid parallelizing or pre-processing loops in a manner which
  would not be possible in a real MC app.

  There is a slight change in performance in the smaller problem
  size due to the OpenMP dynamic work chunk size. In v9, this was
  set as 1 macro_xs lookup per chunk. In v10, as we are now
  parallelizing over particle histories, the minimum chunk size
  is now effectively 34 macro_xs lookups. For the large problem,
  the thread work scheduling overhead is small compared to the
  cost of a macro_xs lookup, so there isn't any difference. It's
  just in the small problem size where the scheduling overhead
  begins to be significant so the performance difference between
  v9 and v10 shows up. v9 can be made equivalent in performance
  to v10 by simply increasing the dynamic chunk size on the
  macro_xs OpenMP loop.

- (Option) In support of this change, the user now has the ability
  to alter the number of particle histories they want to simulate.
  This can be controlled with the "-p" command line argument.
  By default, 300,000 particle histories will be simulated.
  This is now the recommended argument for users to adjust if
  the want to increase/decrease the runtime of the problem. Real
  MC simulations may use anywhere from O(10^5) to O(10^10)
  particles per power iteration, so a wide range of values here
  is acceptable. 

- (Option) The number of lookups to perform per particle history.
  This is the "-l" option, which defaults to 34.
  This option previously referred to the total number of lookups,
  but now refers to the number of lookups per particle.
  The default value reflects the average number of XS lookups that
  a particle will need over its lifetime in a light water reactor.
  One may want to adjust this value if targeting a different
  reactor type.

- (Defaults) Due to the addition of the particle abstraction, the
  default parameters were changed slightly. With a new default of
  300,000 particles and 34 lookups/particle, a total of 10.2 million
  XS lookups will be performed under the default configuration. This
  is slightly larger to what was seen in v9, which only performed
  10 million lookups.

- (Feature) Verification mode has been added. The new mode can
  be used by enabling the "VERIFY" flag in the makefile. In this
  mode, the inputs and results results from each macro_xs lookup
  are hashed and added to a running integer total, and a final
  single hash value is returned at the end of the simulation. This
  is a great tool for those wishing to alter or optimize the code,
  as most mistakes will result in the hash value being changed.

  Note that the verification mode does come with a small performance
  penalty, as a (non-trivial) hashing operation is performed at
  each macro_xs lookup result. The verification mode is therefore
  intended to check for bugs or unintended result alterations, but
  should be disabled when collecting performance metrics or
  other runtime information.


=====================================================================
NEW IN VERSION 9
=====================================================================
- Changed the faddeeva function result variable in the micro_xs
  lookup kernel to a complex double type, rather than simply a 
  double for proper arithmetic. Has a very small (several percent)
  change in performance.

=====================================================================
NEW IN VERSION 8
=====================================================================
- Fixed a bug in initialization of one of the randomized arrays.
  It was originally being allocated correctly but not properly
  set to fully randomized values, instead just using whatever values
  were in memory at the time (often zero). This has a very small,
  but non-negligible, imact on performance.

=====================================================================
NEW IN VERSION 7
=====================================================================
- Temperature dependence default status has been reverted. Doppler
  broadening / temperature dependence is now ON by default. This
  means the Faddeeva function will be called by default.

- The MIT Open Source Faddeeva function library used in Version
  6 is no longer used, and that code has been removed. While accurate,
  it was very slow, so we are using a much faster approximation.
  This approximation loosens some of the restrictions due to the
  nature of actual reactor physics calculations not needing full
  machine precision.  The new algorithm now uses a fast three term
  asymptotic approximation when the absolute value of Z is
  less than 6.  For |z| > 6, a much higher precision computation
  is done using the Abrarov approximation. The slower Abrarov region
  of complex space is only expected to be hit 0.0027% to 0.5% of the
  time. The randomized resonance data in RSBench has been tuned to
  approximate this frequency (assuming worst case 0.5%) of usage.

- Added counters and a printout to calculate the percentage of
  faddeeva calls that have |z| < 6, thus requiring the higher
  fidelity and slower calculation. Changes made to the code during
  porting (particularly if changing RNGs) should ensure that this
  percentage remains approximately constant.

=====================================================================
NEW IN VERSION 6
=====================================================================
- Temperature dependence has now been changed to off by default

- The Faddeeva function evaluation has now been moved to an open
  source library (included in this repo). This allows for true
  complex evaluation of the pole data rather than the real conjugate
  error function approximation previously used. It was previously
  assumed that real erfc() could be used to approximate the full
  computation of the complex version, but in reality it has been
  found to be siginificantly more computationally difficult to compute
  this function when complex space is added. As a result, this slows
  down computation significantly.

- Note that the new source files and Temperature dependence changes
  DO NOT AFFECT the single temp (no Doppler) mode.

=====================================================================
NEW IN VERSION 5
=====================================================================
- Added a new temperature dependence feature that
uses Doppler broadening to translate pole data from 0K to any
arbitrary material temperature. This is accomplished by evaluating
the Faddeeva function (via an exponential multiplied by the standard
C error function).

This new Doppler mode is by default enabled in Version 5.  It is
set to default, but can be disabled via the "-d" flag on the command
line.

Impact varies by compiler, but is relatively small (only a
~10-15% slowdown).

- Changed the default runtime parameters from 250 windows per nuclide
down to 100 windows per nuclide (increasing the number of poles
per lookup from an average of 4 to an average of 10). This
directly impacts performance, so expect lookups/sec to change in
Version 5 as compared to Version 4. This change was made as more
nuclides have had multipole data generated for them in the time
since RSBench was originally written, so estimates for these variables
have been updated to more accurately reflect the full scale code.

=====================================================================
NEW IN VERSION 4
=====================================================================
- Minor bug fix that caused incorrect operation when compiled with
GCC (intel worked fine). This was due to a bug in OpenMP causing
private variables to not be correctly copied into the parallel
region. The result was that the RNG was improperly initialized,
generating the same value every time, speeding up the calculation
significantly. Intel's OMP implementation was correct so all
results from intel runs should have the same performance.

=====================================================================
NEW IN VERSION 3
=====================================================================
- [Bug Fix] Fixed an issue with the window pole assignment
function (generate_window_params). There was a counter that was
not being properly incremented, causing some windows to be
assigned the same poles, and causing an issue with a border case
receiving many more poles than others. This issue has been fixed
so that all windows should now recieve about the same number of
poles, and all windows should be unique. This change effects
performance slightly, showing a 10-15% speedup in version 3
vs. version 2.


=====================================================================
NEW IN VERSION 2
=====================================================================
- [Optimization] Moved the sigTfactor dynamic array allocation out
  of the inner program loop and back up to the top so millions of
  allocs/free's are saved. This appears to increase performance
  significantly (~33%) when compared to v1.
=====================================================================