Spatter Refactor - C++, MPI, Parsing, CMake updates #165

Merged 120 commits into hpcgarage:main on Jul 24, 2024

Conversation

@JDTruj2018 commented Dec 15, 2023

Refactoring to C++ to simplify parsing and memory management. The goal is to remain as backwards-compatible as possible, from command-line flags and behavior to performance and output.

Currently, the only differences in command-line flags are as follows: -u <--pattern-scatter> is used rather than -h <--pattern-scatter>, freeing the -h flag for --help; the -q <--no-print-header> flag's functionality has been moved to -v <--verbosity>, so we will have to find a solution for the previous -v <--vector-len> flag; and the -f <--kernel-file> flag is now used to point to the JSON file, rather than overloading the -p flag.

Changes:
  • --help --> -h <--help>
  • -h <--pattern-scatter> --> -u <--pattern-scatter>
  • -q <--no-print-header> --> -v <--verbose>
  • -v <--vector-len> --> ?
  • -f <--kernel-file> --> -f <--json-file>
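
For a concrete before/after of the renamed verbosity flag, compare the two equivalent invocations from the performance comparison later in this thread:

../../spatter/build_cuda_workflow/spatter -b cuda -pUNIFORM:1024:1 -l 1048576 -k gather -q2
./src/spatter-driver -pUNIFORM:1024:1 -l 1048576 -k gather -b cuda -v 0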

Are we interested in completing/pursuing this?

Complete:

  • Serial, OpenMP, and CUDA Support
  • MPI Support
  • Building Tests
  • Pattern, Pattern Gather, Pattern Scatter, Delta, Delta Gather, Delta Scatter Support
  • UNIFORM, MS1, and LAPLACIAN Pattern Support
  • Gather, Scatter, Multiscatter, Multigather, Scattergather kernels with single delta Support
  • Wrap and Count and single delta support on CPU
  • Boundary support (remap)
  • Pattern Size support (truncating patterns)
  • JSON Parsing Support (nlohmann-json)
  • Command Line Parsing Support
  • Help and Usage Messages
  • CMake Support
  • Config Name Support
  • Verbosity Levels
  • Set number of threads support
  • Set number of runs support
  • Basic result reporting
  • Binary Traces Support
  • Update CUDA Kernels for Delta, Wrap, and Count values (particularly for long patterns). Needs a workaround for multi-level kernels, but should be fine for gather, scatter, and gs.
  • Verify Performance
  • Strong Scaling by Splitting Pattern amongst Ranks
  • Atomics for CUDA
  • Atomics Options
  • Aligned Allocation
  • Clean reporting (with MPI) - allreduce for stats (see the MPI sketch after this list)
  • Catch CUDA Errors (WIP: Shubham, Patrick, Jeff)
  • Performance alignment with current Spatter for Gather, Multigather (WIP: Jered/Patrick/Jeff)
  • Multiple Target Vectors for OpenMP Backend (WIP: Patrick)
  • Passing Tests
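
A minimal sketch of the two MPI items above (splitting the pattern across ranks for strong scaling, and an allreduce for the reported stats). The split policy, variable names, and placeholder numbers are assumptions for illustration, not the PR's actual implementation:

#include <mpi.h>
#include <cstddef>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Strong scaling: split one global pattern evenly across ranks; the
  // last rank absorbs the remainder.
  const size_t global_pattern_len = 1024; // assumed example size
  size_t chunk = global_pattern_len / nranks;
  size_t begin = rank * chunk;
  size_t end = (rank == nranks - 1) ? global_pattern_len : begin + chunk;

  // Each rank would run its gather/scatter over pattern[begin, end) and
  // record its local timing; stubbed out here.
  (void)begin;
  (void)end;
  double local_time_s = 1.0;  // placeholder for the measured kernel time
  double local_bytes = 8.0e9; // placeholder for the bytes this rank moved

  // Clean reporting: combine per-rank stats with allreduce so only rank 0
  // prints, instead of every rank emitting its own table.
  double max_time_s = 0.0, total_bytes = 0.0;
  MPI_Allreduce(&local_time_s, &max_time_s, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  MPI_Allreduce(&local_bytes, &total_bytes, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("bytes %.0f time(s) %f bw(MB/s) %f\n",
        total_bytes, max_time_s, total_bytes / max_time_s / 1e6);

  MPI_Finalize();
  return 0;
}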

To-do:

  • Remove unneeded flags/variables (WIP: Jered) - May need to readdress in bugfixes.
  • Aggregate, compress, validate (WIP: Connor)

Future releases:

  - [ ] Multiple Deltas Capability
  - [ ] OpenCL Support
  - [ ] Op Support
  - [ ] PAPI Support
  - [ ] RO_Hilbert, RO_Morton, Strided Support
  - [ ] vector-len Support
  - [ ] Validation (moved to GPU Validation and Readd Validate Flag #194)

Not Needed:
- Old utility files (trace_util.c for example)

@JDTruj2018 marked this pull request as draft December 15, 2023 07:28
@JDTruj2018 commented Dec 15, 2023

@plavin Could you take a look at the CUDA kernels when you get the chance? I'll post updated performance numbers below, and if things look okay to you we can try to port this over to the Spatter main branch.

My workaround to get everything working with delta, count, and wrap without having to do the templating currently assumes a local_work_size of min(pattern.size(), 1024), sets this as the threads_per_block, and then calculates the blocks_per_grid from global_work_size / local_work_size. Then in the kernel itself, I do the following (this is the gather example):

float cuda_gather_wrapper(const size_t *pattern, const double *sparse,
    double *dense, const size_t pattern_length, const size_t delta,
    const size_t wrap, const size_t count) {
  cudaEvent_t start, stop;

  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // local_work_size = min(pattern.size(), 1024); e.g. for pattern_length =
  // 1024 and count = 1048576 this launches 1048576 blocks of 1024 threads.
  int threads_per_block = min(pattern_length, (size_t)1024);
  int blocks_per_grid =
      ((pattern_length * count) + threads_per_block - 1) / threads_per_block;

  // Time only the kernel itself with CUDA events.
  cudaDeviceSynchronize();
  cudaEventRecord(start);

  cuda_gather<<<blocks_per_grid, threads_per_block>>>(
      pattern, sparse, dense, pattern_length, delta, wrap, count);

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float time_ms = 0;
  cudaEventElapsedTime(&time_ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  return time_ms;
}
__global__ void cuda_gather(const size_t *pattern, const double *sparse,
    double *dense, const size_t pattern_length, const size_t delta,
    const size_t wrap, const size_t count) {
  // Decompose the flat global thread id into a pattern index (j) and a
  // repetition index (i) instead of templating on delta/wrap/count.
  size_t total_id =
      (size_t)((size_t)blockDim.x * (size_t)blockIdx.x + (size_t)threadIdx.x);
  size_t j = total_id % pattern_length; // pat_idx
  size_t i = total_id / pattern_length; // count_idx

  double x;

  if (i < count) {
    // dense[j + pattern_length * (i % wrap)] = sparse[pattern[j] + delta * i]; // configuration 2
    x = sparse[pattern[j] + delta * i]; // configuration 1
    if (x == 0.5)
      dense[0] = x; // conditional store so the compiler cannot drop the load
  }
}
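
For example, with pattern_length = 1024, global thread 1500 gets j = 1500 % 1024 = 476 and i = 1500 / 1024 = 1, so it reads sparse[pattern[476] + delta] from the second repetition.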

I've run this in 2 configurations for the gather:

  1. Only setting x
  2. Assigning the gathered value to the correct place in the dense array

Note that the original Spatter uses configuration 2.

@JDTruj2018

Some early performance numbers on Haswell

[image: early Haswell performance chart]

@JDTruj2018 commented Dec 15, 2023

GPU Performance still needs some work (this is a V100)

Original:

../../spatter/build_cuda_workflow/spatter -b cuda -pUNIFORM:1024:1 -l 1048576 -k gather -q2
config  bytes        time(s)      bw(MB/s)    
0       8589934592   0.007443     1154079.646614

Min         25%          Med          75%          Max         
1.15408e+06  1.15408e+06  1.15408e+06  1.15408e+06  1.15408e+06 
H.Mean       H.StdErr    

New:

./src/spatter-driver -pUNIFORM:1024:1 -l 1048576 -k gather -b cuda -v 0
config         bytes          time(s)        bw(MB/s)       
0              8589934592     0.0127528      673570 

Update (#154):

@plavin @jyoung3131
Looks like I have the kernels working on CUDA. Here is the performance comparison for the 2 different gather configurations mentioned in a comment above (Note that the original kernels use Configuration 2):

  1. Only setting x
  2. Assigning the gathered value to the correct place in the dense array

Configuration 1:

[image: Configuration 1 performance chart]

Configuration 2:

[image: Configuration 2 performance chart]

@jyoung3131 marked this pull request as ready for review July 22, 2024
@jyoung3131 changed the title from "WIP: Refactor" to "Spatter Refactor - C++, MPI, Parsing, CMake updates" Jul 24, 2024

@jyoung3131 left a comment


We've reviewed the codebase, tested multiple variations, and gone through all the outstanding issues. While some small bugfixes and additional testing are likely still needed, we've pulled these out into smaller issues that can be addressed as needed.

@jyoung3131 merged commit ad239dd into hpcgarage:main Jul 24, 2024
3 checks passed