Implement independent_groups and independent_group_elements for Alpaka kernels #43624
Conversation
- `independent_groups(acc, groups)` returns a range that spans the group indices from 0 to `groups`, with one group per block;
- `independent_group_elements(acc, elements)` returns a range that spans all the elements within the given group, from 0 to `elements`.
please test
cms-bot internal usage
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38271
A new Pull Request was created by @fwyzard (Andrea Bocci) for master. It involves the following packages:
@makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81e25c/36618/summary.html Comparison Summary:
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38273
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38276
please test
+heterogeneous
This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)
enable gpu
please test
@antoniovilela @rappoccio, if @makortel does not have any comments, can you pick this for pre2?
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81e25c/36634/summary.html Comparison Summary:
GPU Comparison Summary:
+1
A few remarks that can be addressed later.
I'd also suggest providing a concise recipe-like documentation, e.g. in HeterogeneousCore/AlpakaInterface/README.md, that would list the (most common) use cases and the loop structure to use in each case (with links to more details, e.g. the comments in this header).
* If the work division has more than 63 blocks, the first 63 will perform one iteration of the loop, while the other
* blocks will exit immediately.
* If the work division has less than 63 blocks, some of the blocks will perform more than one iteration, in order to
* cover the whole problem space.
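For illustration, a minimal hand-written sketch of the strided loop this comment describes (assuming a 1-dimensional work division; the kernel and its names are illustrative, not the actual implementation in this PR):

```cpp
#include <cstddef>

#include <alpaka/alpaka.hpp>

// Sketch of a block-strided loop over group indices: each block starts at its
// own block index and advances by the total number of blocks in the grid.
// With more blocks than groups the extra blocks do zero iterations; with
// fewer blocks some blocks do more than one iteration.
struct IndependentGroupsSketch {
  template <typename TAcc>
  ALPAKA_FN_ACC void operator()(TAcc const& acc, std::size_t groups) const {
    const auto first = alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0u];
    const auto stride = alpaka::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)[0u];
    for (auto group = first; group < groups; group += stride) {
      // process one group per iteration
    }
  }
};
```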
Further above the documentation has

> All threads in a block see the same loop iterations

I think it would be good to clarify what this means for the behavior of the elements of the last group (given how important this is for syncBlockThreads()).
In this example I'd guess the threads 0-7 of the block executing group 62 would run the loop body, and the threads 8-15 would not (i.e. in a way the threads 8-15 do not see the same loop iterations).
No, it's as the comment says: all threads (0-15) will execute the loop body.
It's up to the inner loop (or the user, if they don't use an inner loop) to run only on a subset of the threads.
In fact,
for (auto group : uniform_groups(acc, 1000)) { ... }
should be exactly the same as
for (auto group : independent_groups(acc, round_up_by(1000, alpaka::getWorkDiv<alpaka::Grid, alpaka::Elems>))) { ... }
Ah right, this loop is about the groups/blocks, and not about elements, thanks.
So if one would do
for (auto group : uniform_groups(acc, 1000)) {
  for (auto element : uniform_group_elements(acc, group, 1000)) {
    ...
  }
}
it's the inner loop (uniform_group_elements) where one needs to be careful with alpaka::syncBlockThreads(), right?
And there it would happen that the threads 0-7 of the block executing group 62 would run the inner loop, and the threads 8-15 would not, right?
If this is the case, how about adding a warning in the comment of uniform_group_elements (perhaps also for elements_in_block) that alpaka::syncBlockThreads() is not generally safe to be called within that loop?
> it's the inner loop (uniform_group_elements) where one needs to be careful with alpaka::syncBlockThreads(), right?
Yes.
In fact, it does not make sense to call alpaka::syncBlockThreads() within the inner loop.
One should always split the inner loop into multiple loops, and synchronise between them:
for (auto group : uniform_groups(acc, 1000)) {
  for (auto element : uniform_group_elements(acc, group, 1000)) {
    // no synchronisations here
    ...
  }
  alpaka::syncBlockThreads();
  for (auto element : uniform_group_elements(acc, group, 1000)) {
    // no synchronisations here
    ...
  }
  alpaka::syncBlockThreads();
}
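Calling alpaka::syncBlockThreads() inside the inner loop would be unsafe because, as discussed below, different threads may perform a different number of inner-loop iterations, so not all threads of the block would reach the barrier.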
> And there it would happen that the threads 0-7 of the block executing group 62 would run the inner loop, and the threads 8-15 would not, right?

All threads would execute the for (auto element : uniform_group_elements(acc, group, 1000)) loop.
However, the number of iterations of the inner loop will be different for the different threads: 0-7 will execute 1 iteration, 8-15 will execute 0 iterations.
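To make these numbers concrete, a small worked example (assuming 16 elements per block, as implied by the threads 0-15 discussed in this thread):

```cpp
// 1000 elements with 16 elements per block: 63 groups in total, and the last
// group (62) covers only the 8 elements 992-999, so threads 8-15 of that
// block have nothing left to do.
constexpr int elements = 1000;
constexpr int elements_per_block = 16;
constexpr int groups = (elements + elements_per_block - 1) / elements_per_block;
constexpr int last_group_size = elements - (groups - 1) * elements_per_block;
static_assert(groups == 63);
static_assert(last_group_size == 8);
```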
> If this is the case, how about adding a warning in the comment of uniform_group_elements (perhaps also for elements_in_block) that alpaka::syncBlockThreads() is not generally safe to be called within that loop?

Sure, see #43632.
> However, the number of iterations of the inner loop will be different for the different threads: 0-7 will execute 1 iteration, 8-15 will execute 0 iterations.
Thanks, this is what I was after, but written more accurately.
stride_{alpaka::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)[0u]},
extent_{groups} {}

class iterator {
In principle this class could be named const_iterator, as in practice it provides only const access. Such a name would avoid wondering why begin() const returns a non-const iterator, before realizing that the iterator itself gives only const access.
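For illustration, a minimal hypothetical range class (not the actual CMSSW code) showing what the suggested naming would look like: the nested type is spelled const_iterator, so it is clear that begin() const only hands out read access.

```cpp
// Hypothetical sketch, not the actual CMSSW range classes: naming the nested
// type const_iterator makes it explicit that begin() const returns a
// read-only iterator.
class groups_range {
public:
  explicit groups_range(unsigned int extent) : extent_{extent} {}

  class const_iterator {
  public:
    unsigned int operator*() const { return value_; }
    const_iterator& operator++() {
      ++value_;
      return *this;
    }
    bool operator!=(const_iterator const& other) const { return value_ != other.value_; }

  private:
    friend class groups_range;
    explicit const_iterator(unsigned int value) : value_{value} {}
    unsigned int value_;
  };

  const_iterator begin() const { return const_iterator{0}; }
  const_iterator end() const { return const_iterator{extent_}; }

private:
  unsigned int extent_;
};
```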
I don't mind the type name, since I don't expect anybody will ever use the nested iterator types explicitly.
For the same reason I don't mind renaming them, and I assume the same comment applies to the other range-loop types.
Done in #43632.
*
* Iterating over the range yields the local element index, between `0` and `elements - 1`. The threads in the block
* will perform one or more iterations, depending on the number of elements per thread, and on the number of threads
* per block, ocmpared with the total number of elements.
- * per block, ocmpared with the total number of elements.
+ * per block, compared with the total number of elements.
Fixed in #43632.
stride_{alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)[0u] * elements_},
extent_{extent} {}

class iterator {
Also here I'd find const_iterator a clearer name.
Renamed in #43632.
PR description:
Implement `independent_groups(acc, groups)` and `independent_group_elements(acc, elements)` for Alpaka kernels:
- `independent_groups(acc, groups)` returns a range that spans the group indices from 0 to `groups`, with one group per block;
- `independent_group_elements(acc, elements)` returns a range that spans all the elements within the given group, from 0 to `elements`.

Add `uniform_groups` and `uniform_group_elements` type aliases.
Extend `elements_with_stride` with an optional starting value.
Add a test for `independent_groups`, `independent_group_elements` and `elements_with_stride` with a starting value.

PR validation:
The new integration test passes on CPU and on GPU.
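For illustration, a sketch of how the two new ranges could be used together in a kernel; the kernel itself is an illustrative assumption, not part of this PR, and it presumes the independent_groups and independent_group_elements helpers are in scope (e.g. via the HeterogeneousCore/AlpakaInterface headers):

```cpp
#include <cstdint>

#include <alpaka/alpaka.hpp>

// Illustrative kernel (not part of this PR): each block handles one or more
// independent groups, and within each group the block's threads loop over the
// group's elements. Assumes independent_groups and independent_group_elements
// are in scope.
struct ScaleByGroup {
  template <typename TAcc>
  ALPAKA_FN_ACC void operator()(TAcc const& acc,
                                float const* in,
                                float* out,
                                uint32_t groups,
                                uint32_t elements) const {
    for (auto group : independent_groups(acc, groups)) {
      for (auto element : independent_group_elements(acc, elements)) {
        auto index = group * elements + element;
        out[index] = 2.f * in[index];
      }
    }
  }
};
```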