
Implement independent_groups and independent_group_elements for Alpaka kernels #43624

Merged
5 commits merged into cms-sw:master on Dec 21, 2023

Conversation

fwyzard
Contributor

@fwyzard fwyzard commented Dec 21, 2023

PR description:

Implement independent_groups(acc, groups) and independent_group_elements(acc, elements) for Alpaka kernels:

  • independent_groups(acc, groups) returns a range that spans the group indices from 0 to groups, with one group per block;
  • independent_group_elements(acc, elements) returns a range that spans all the elements within the given group, from 0 to elements.

Add uniform_groups and uniform_group_elements type aliases.

Extend elements_with_stride with an optional starting value.

Add a test for independent_groups, independent_group_elements and elements_with_stride with a starting value.

PR validation:

The new integration test passes on CPU and on GPU.

@fwyzard
Contributor Author

fwyzard commented Dec 21, 2023

please test

@cmsbuild
Contributor

cmsbuild commented Dec 21, 2023

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38271

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

  • HeterogeneousCore/AlpakaInterface (heterogeneous)

@makortel, @fwyzard can you please review it and eventually sign? Thanks.
@missirol, @makortel, @rovere this is something you requested to watch as well.
@antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81e25c/36618/summary.html
COMMIT: 3dfb1a9
CMSSW: CMSSW_14_0_X_2023-12-20-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43624/36618/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38273

@cmsbuild
Contributor

Pull request #43624 was updated. @fwyzard, @makortel, @cmsbuild can you please check and sign again.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38276

@cmsbuild
Contributor

Pull request #43624 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@fwyzard
Contributor Author

fwyzard commented Dec 21, 2023

please test

@fwyzard
Contributor Author

fwyzard commented Dec 21, 2023

+heterogeneous

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Contributor Author

fwyzard commented Dec 21, 2023

enable gpu

@fwyzard
Contributor Author

fwyzard commented Dec 21, 2023

please test

@fwyzard
Contributor Author

fwyzard commented Dec 21, 2023

@antoniovilela @rappoccio, if @makortel does not have any comments, can you pick this for pre2 ?

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81e25c/36634/summary.html
COMMIT: d5e466b
CMSSW: CMSSW_14_0_X_2023-12-21-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43624/36634/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 35 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 1411
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 38329
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@antoniovilela
Contributor

+1

@cmsbuild cmsbuild merged commit 9fca027 into cms-sw:master Dec 21, 2023
14 checks passed
Contributor

@makortel makortel left a comment


A few remarks that can be addressed later.

I'd also suggest providing a concise, recipe-like documentation, e.g. in HeterogeneousCore/AlpakaInterface/README.md, that would list the (most common) use cases and the loop structure to use in each case (with link(s) to more details, e.g. the comments in this header).

* If the work division has more than 63 blocks, the first 63 will perform one iteration of the loop, while the other
* blocks will exit immediately.
* If the work division has less than 63 blocks, some of the blocks will perform more than one iteration, in order to
 * cover the whole problem space.
Contributor


Further above had

All threads in a block see the same loop iterations

I think it would be good to clarify what this means for the behavior of the elements of the last group (given how important this is for syncBlockThreads()).

In this example I'd guess the threads 0-7 of the block executing the group 62 would run the loop body, and the threads 8-15 would not (i.e. in a way the threads 8-15 do not see the same loop iterations)

Contributor Author


No, it's as the comment says: all threads (0-15) will execute the loop body.

It's up to the inner loop (or the user, if they don't use an inner loop) to run only on a subset of the threads.

In fact,

for (auto group : uniform_groups(acc, 1000)) { ... }

should be exactly the same as

for (auto group : independent_groups(acc, round_up_by(1000, alpaka::getWorkDiv<alpaka::Grid, alpaka::Elems>(acc)[0u]))) { ... }

Contributor


Ah right, this loop is about the groups/blocks, and not about elements, thanks.

So if one would do

for (auto group : uniform_groups(acc, 1000)) {
  for (auto element : uniform_group_elements(acc, group, 1000)) {
     ...
  }
}

it's the inner loop (uniform_group_elements) where one needs to be careful with alpaka::syncBlockThreads(), right?

And there it would happen that the threads 0-7 of the block executing group 62 would run the inner loop, and the threads 8-15 would not, right?

If this is the case, how about adding a warning in the comment of uniform_group_elements (perhaps also for elements_in_block) that the alpaka::syncBlockThreads() is not generally safe to be called within that loop?

Contributor Author


it's the inner loop (uniform_group_elements) where one needs to be careful with alpaka::syncBlockThreads(), right?

Yes.

In fact, it does not make sense to call alpaka::syncBlockThreads() within the inner loop.
One should always split the inner loop in multiple loops, and synchronise between them:

for (auto group : uniform_groups(acc, 1000)) {
  for (auto element : uniform_group_elements(acc, group, 1000)) {
     // no synchronisations here
     ...
  }
  alpaka::syncBlockThreads();
  for (auto element : uniform_group_elements(acc, group, 1000)) {
     // no synchronisations here
     ...
  }
  alpaka::syncBlockThreads();
}

Contributor Author


And there it would happen that the threads 0-7 of the block executing group 62 would run the inner loop, and the threads 8-15 would not, right?

All threads would execute the for (auto element : uniform_group_elements(acc, group, 1000)) loop.
However, the number of iterations of the inner loop will be different for the different threads: 0-7 will execute 1 iteration, 8-15 will execute 0 iterations.

Contributor Author


If this is the case, how about adding a warning in the comment of uniform_group_elements (perhaps also for elements_in_block) that the alpaka::syncBlockThreads() is not generally safe to be called within that loop?

Sure, see #43632 .

Contributor


However, the number of iterations of the inner loop will be different for the different threads: 0-7 will execute 1 iteration, 8-15 will execute 0 iterations.

Thanks, this is what I was after, but written more accurately.

stride_{alpaka::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)[0u]},
extent_{groups} {}

class iterator {
Contributor


In principle this class could be named const_iterator, since in practice it provides only const access. Such a name would avoid wondering why begin() const returns a non-const iterator, before realizing that the iterator itself gives only const access.

Contributor Author


I don't mind the type name, since I don't expect anybody will ever use the nested iterator types explicitly.
For the same reason I don't mind renaming them - and I assume the same comment applies to the other range-loop types.

Contributor Author


Done in #43632 .

*
* Iterating over the range yields the local element index, between `0` and `elements - 1`. The threads in the block
* will perform one or more iterations, depending on the number of elements per thread, and on the number of threads
* per block, ocmpared with the total number of elements.
Contributor


Suggested change
* per block, ocmpared with the total number of elements.
* per block, compared with the total number of elements.

Contributor Author


Fixed in #43632 .

stride_{alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)[0u] * elements_},
extent_{extent} {}

class iterator {
Contributor


Also here I'd find const_iterator clearer name.

Contributor Author


Renamed in #43632 .
