Spatter Refactor - C++, MPI, Parsing, CMake updates #165

Merged 120 commits into hpcgarage:main on Jul 24, 2024

Conversation

@JDTruj2018 commented Dec 15, 2023

Refactoring to C++ to simplify parsing and memory management. The goal is to remain as backwards-compatible as possible, from command-line flags and behavior to performance and output.

Currently, the only differences in command-line flags are as follows: -u <--pattern-scatter> is used rather than -h <--pattern-scatter>, freeing the -h flag for --help; the -q <--no-print-header> flag's functionality has been moved to -v <--verbosity>, so we will have to find a solution for the previous -v <--vector-len> flag; and the -f <--kernel-file> flag is now used to point to the JSON file, rather than overloading the -p flag.

Changes:
  • --help --> -h <--help>
  • -h <--pattern-scatter> --> -u <--pattern-scatter>
  • -q <--no-print-header> --> -v <--verbose>
  • -v <--vector-len> --> ?
  • -f <--kernel-file> --> -f <--json-file>
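
For a concrete before/after of the renamed verbosity flag, compare the two equivalent invocations from the performance comparison later in this thread:

../../spatter/build_cuda_workflow/spatter -b cuda -pUNIFORM:1024:1 -l 1048576 -k gather -q2
./src/spatter-driver -pUNIFORM:1024:1 -l 1048576 -k gather -b cuda -v 0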

Are we interested in completing/pursuing this?

Complete:

  • Serial, OpenMP, and CUDA Support
  • MPI Support
  • Building Tests
  • Pattern, Pattern Gather, Pattern Scatter, Delta, Delta Gather, Delta Scatter Support
  • UNIFORM, MS1, and LAPLACIAN Pattern Support
  • Gather, Scatter, Multiscatter, Multigather, Scattergather kernels with single delta Support
  • Wrap and Count and single delta support on CPU
  • Boundary support (remap)
  • Pattern Size support (truncating patterns)
  • JSON Parsing Support (nlohmann-json)
  • Command Line Parsing Support
  • Help and Usage Messages
  • CMake Support
  • Config Name Support
  • Verbosity Levels
  • Set number of threads support
  • Set number of runs support
  • Basic result reporting
  • Binary Traces Support
  • Update CUDA Kernels for Delta, Wrap, and Count values (particularly for long patterns). Needs a workaround for multi-level kernels, but should be fine for gather, scatter, and gs.
  • Verify Performance
  • Strong Scaling by Splitting Pattern amongst Ranks
  • Atomics for CUDA
  • Atomics Options
  • Aligned Allocation
  • Clean reporting (with MPI) - allreduce for stats (see the MPI sketch after this list)
  • Catch CUDA Errors (WIP: Shubham, Patrick, Jeff)
  • Performance alignment with current Spatter for Gather, Multigather (WIP: Jered/Patrick/Jeff)
  • Multiple Target Vectors for OpenMP Backend (WIP: Patrick)
  • Passing Tests
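
A minimal sketch of the two MPI items above (splitting the pattern across ranks for strong scaling, and an allreduce for the reported stats). The split policy, variable names, and placeholder numbers are assumptions for illustration, not the PR's actual implementation:

#include <mpi.h>
#include <cstddef>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Strong scaling: split one global pattern evenly across ranks; the
  // last rank absorbs the remainder.
  const size_t global_pattern_len = 1024; // assumed example size
  size_t chunk = global_pattern_len / nranks;
  size_t begin = rank * chunk;
  size_t end = (rank == nranks - 1) ? global_pattern_len : begin + chunk;

  // Each rank would run its gather/scatter over pattern[begin, end) and
  // record its local timing; stubbed out here.
  (void)begin;
  (void)end;
  double local_time_s = 1.0;  // placeholder for the measured kernel time
  double local_bytes = 8.0e9; // placeholder for the bytes this rank moved

  // Clean reporting: combine per-rank stats with allreduce so only rank 0
  // prints, instead of every rank emitting its own table.
  double max_time_s = 0.0, total_bytes = 0.0;
  MPI_Allreduce(&local_time_s, &max_time_s, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(&local_bytes, &total_bytes, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("bytes %.0f time(s) %f bw(MB/s) %f\n",
        total_bytes, max_time_s, total_bytes / max_time_s / 1e6);

  MPI_Finalize();
  return 0;
}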

To-do:

  • Remove unneeded flags/variables (WIP: Jered) - May need to readdress in bugfixes.
  • Aggregate, compress, validate (WIP: Connor)

Future releases:

  - [ ] Multiple Deltas Capability
  - [ ] OpenCL Support
  - [ ] Op Support
  - [ ] PAPI Support
  - [ ] RO_Hilbert, RO_Morton, Strided Support
  - [ ] vector-len Support
  - [ ] Validation (moved to GPU Validation and Readd Validate Flag #194)

Not Needed:
- Old utility files (trace_util.c for example)

@JDTruj2018 marked this pull request as draft December 15, 2023 07:28
@JDTruj2018 commented Dec 15, 2023

@plavin Could you take a look at the CUDA kernels when you get the chance? I'll post updated performance numbers below, and if things look okay to you we can try to port this over to the Spatter main branch.

My workaround to get everything working with delta, count, and wrap without having to do the templating currently assumes a local_work_size of min(pattern.size(), 1024), sets this as the threads_per_block, and then calculates the blocks_per_grid from global_work_size / local_work_size. Then in the kernel itself, I do the following (this is the gather example):

float cuda_gather_wrapper(const size_t *pattern, const double *sparse,
    double *dense, const size_t pattern_length, const size_t delta,
    const size_t wrap, const size_t count) {
  cudaEvent_t start, stop;

  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // local_work_size = min(pattern.size(), 1024); e.g. for pattern_length =
  // 1024 and count = 1048576 this launches 1048576 blocks of 1024 threads.
  int threads_per_block = min(pattern_length, (size_t)1024);
  int blocks_per_grid =
      ((pattern_length * count) + threads_per_block - 1) / threads_per_block;

  // Time only the kernel itself with CUDA events.
  cudaDeviceSynchronize();
  cudaEventRecord(start);

  cuda_gather<<<blocks_per_grid, threads_per_block>>>(
      pattern, sparse, dense, pattern_length, delta, wrap, count);

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float time_ms = 0;
  cudaEventElapsedTime(&time_ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  return time_ms;
}
__global__ void cuda_gather(const size_t *pattern, const double *sparse,
    double *dense, const size_t pattern_length, const size_t delta,
    const size_t wrap, const size_t count) {
  // Decompose the flat global thread id into a pattern index (j) and a
  // repetition index (i) instead of templating on delta/wrap/count.
  size_t total_id =
      (size_t)((size_t)blockDim.x * (size_t)blockIdx.x + (size_t)threadIdx.x);
  size_t j = total_id % pattern_length; // pat_idx
  size_t i = total_id / pattern_length; // count_idx

  double x;

  if (i < count) {
    // dense[j + pattern_length * (i % wrap)] = sparse[pattern[j] + delta * i]; // configuration 2
    x = sparse[pattern[j] + delta * i]; // configuration 1
    if (x == 0.5)
      dense[0] = x; // conditional store so the compiler cannot drop the load
  }
}
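
For example, with pattern_length = 1024, global thread 1500 gets j = 1500 % 1024 = 476 and i = 1500 / 1024 = 1, so it reads sparse[pattern[476] + delta] from the second repetition.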

I've run this in 2 configurations for the gather:

  1. Only setting x
  2. Assigning the gathered value to the correct place in the dense array

Note that the original Spatter uses configuration 2.

@JDTruj2018

Some early performance numbers on Haswell

[image: early Haswell performance chart]

@JDTruj2018 commented Dec 15, 2023

GPU Performance still needs some work (this is a V100)

Original:

../../spatter/build_cuda_workflow/spatter -b cuda -pUNIFORM:1024:1 -l 1048576 -k gather -q2
config  bytes        time(s)      bw(MB/s)    
0       8589934592   0.007443     1154079.646614

Min         25%          Med          75%          Max         
1.15408e+06  1.15408e+06  1.15408e+06  1.15408e+06  1.15408e+06 
H.Mean       H.StdErr    

New:

./src/spatter-driver -pUNIFORM:1024:1 -l 1048576 -k gather -b cuda -v 0
config         bytes          time(s)        bw(MB/s)       
0              8589934592     0.0127528      673570 

Update (#154):

@plavin @jyoung3131
Looks like I have the kernels working on CUDA. Here is the performance comparison for the 2 different gather configurations mentioned in a comment above (Note that the original kernels use Configuration 2):

  1. Only setting x
  2. Assigning the gathered value to the correct place in the dense array

Configuration 1:

[image: Configuration 1 performance chart]

Configuration 2:

[image: Configuration 2 performance chart]

@jyoung3131 marked this pull request as ready for review July 22, 2024
@jyoung3131 changed the title from "WIP: Refactor" to "Spatter Refactor - C++, MPI, Parsing, CMake updates" Jul 24, 2024

@jyoung3131 left a comment


We've reviewed the codebase, tested multiple variations, and gone through all the outstanding issues. While some small bugfixes and additional testing are likely still needed, we've pulled these out into smaller issues that can be addressed as needed.

@jyoung3131 merged commit ad239dd into hpcgarage:main Jul 24, 2024
3 checks passed