Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of kokkos serial backend vs. plain serial code #297

Open
markdewing opened this issue Jan 21, 2022 · 2 comments
Open

Performance of kokkos serial backend vs. plain serial code #297

markdewing opened this issue Jan 21, 2022 · 2 comments
Labels

Comments

@markdewing
Copy link
Contributor

The time from the kokkos serial backend (kokkos --serial) is slower for one thread than the standalone serial code (serial).

Looking at profiles, the function kernel_connect ( in plugin-PixelTriplets) takes significantly more time in the kokkos version. From the instructions retired, it is clearly performing more operations in the kokkos version.

The loops do not perform their iterations in the same order.

<outer loop from kokkos>
  <kernel_connect function body>
     for (int idx = firstCellIndex, nt = (nCells()); idx < nt; idx += leagueSize * blockDim) {
     ...
       for (int j = first; j < numberOfPossibleNeighbors; j += stride) {

The serial version has no outer loop. However, printing the values for idx and j shows the same values get accessed, just in a different order between the versions.

It looks like (based on instructions retired), the kokkos version is doing more work (by a factor of 2x or more), in routines like areAlignedRZ. But based on printing out how many times it's called, they should be the same.

@markdewing
Copy link
Contributor Author

In plugin-SiPixelClusterizer, there's a difference between the serial code and Kokkos code.
In gpuClusterChargeCut.h, there is a loop that sets some variables to zero.

Kokkos version (around line 97)

         for (uint32_t i = teamMember.team_rank(); i < ::gpuClustering::MaxNumClustersPerModules;
               i += teamMember.team_size()) {
            charge(i) = 0;
            ok(i) = 0;
            newclusId(i) = 0;
          }

The serial version (around line 58)

      for (auto i = threadIdx.x; i < nclus; i += blockDim.x) {
        charge[i] = 0;
      }

Two reasons the Kokkos version is doing more work:

  • the serial version of the loop bound is nclus (about 10) but the Kokkos version loop bound is MaxNumClustersPerModules (1024).
  • the Kokkos version also sets ok and newclusid to zero. These get set in a later loop, I'm not sure this is necessary.

If the Kokkos loop is changed to match the serial loop, the performance becomes similar.

Notes

  • The starting value (0) and increment (1) of the loops is the same for both.
  • The performance is similar at least for a sufficiently large number of events - there must be some startup overhead to Kokkos. 1000 events is sufficient - both get 27-28 events/s on my machine. Using 100 events shows no drop in performance of serial code (27 events/s) where the kokkos serial version clearly gets lower performance (24.5 events/s)
  • I didn't check whether there is a correctness issue with the change to the Kokkos loop.

Link to Kokkos version:
https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/kokkos/plugin-SiPixelClusterizer/kokkos/gpuClusterChargeCut.h

Link to serial version:
https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/serial/plugin-SiPixelClusterizer/gpuClusterChargeCut.h

@makortel
Copy link
Collaborator

@markdewing Thanks, good catch. Digging from history both of these (the loop bound and initialization of additional variables) was done in #80, and probably this comment #80 (comment) explains the reasoning. Since both serial and cuda versions do otherwise, I think we should do the same for kokkos version too since they appear to affect performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants